How to use court extractor

Overview

A Python library for extracting structured data from Russian court decisions, including modules to extract punishments, territorial jurisdiction, gender information, and criminal code articles.

Getting Gemini API credentials

To extract structured data about punishments from a court decision, use the PunishmentExtractor class. It requires a Gemini API key, which you can get from Gemini (Google Cloud AI Studio):

  1. Open the Enabled APIs and services tab.
  2. Create a project.
  3. Open the Credentials tab.
  4. Choose Create credentials, then API key.
  5. Copy the key to use when initializing the extractor.
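Rather than pasting the key into source code, it can be read from an environment variable. The sketch below assumes a variable named GEMINI_API_KEY; that name and the load_api_key helper are chosen here for illustration and are not part of the library.

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned value can then be passed as api_key when initializing PunishmentExtractor.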

Setup Steps

1. Clone the repository:

git clone https://github.com/Cedar-Russia/court_extractor

2. Install dependencies:

pip install -r requirements.txt

Using the Parser

import os
import sys
import json
import pandas as pd

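# Add the project root (the parent of the current working directory) to the Python path.
# __file__ is not defined in a notebook, so the working directory is used instead.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root)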

Most of the modules in the library can take both a data table and a string. To show the functionality of the library, we will use the standard output of the parser that has both a CSV table and a JSON object with court decisions.

# Load a sample court decision in JSON format
file_path = "data/raw/sample_court_decision.json"
with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Load the CSV table with court decisions into a dataframe
csv_path = 'data/raw/sample_decisions/sample_decisions.csv'
df = pd.read_csv(csv_path, sep=';')

Articles Extractor

Module used to extract articles, parts and subparts of the Criminal Code (Ugolovnyi Kodeks) and the Code of Administrative Offenses (Kodeks ob Administrativnykh Pravonarusheniyakh) from the string representing all articles under which charges are filed.

Initialization

# Importing the class that extracts articles
from src.articles import ArticlesExtractor

# Initializing an instance of the extractor
articles_extractor = ArticlesExtractor()

Before extracting articles from the table with court decisions, the class should be imported and an instance of the class should be initialized.

ArticlesExtractor has only one parameter: remove_duplicates, which is True by default, meaning that by default duplicate articles for the same defendant are removed.

Single String Extraction

The process_string() method processes individual text strings to extract legal articles. It accepts a string containing legal references and returns a structured dictionary with detailed information about each article found.

# Example 1: Process single string
test_string = "Губаев Борис Магомедович - ст.159 ч.2 УК РФ"

result = articles_extractor.process_string(test_string)
print("Single string processing result:")
print(result)

Output:

[{
   'person': 'Person 1',
   'articles': [{'article': '159', 'part': '2', 'subpart': None}],
   'code_type': 'CRIMINAL'
}]

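The method returns a list of dictionaries, one per person. As the output above shows, each dictionary has the keys person (a label such as 'Person 1'), articles (a list of dictionaries with the keys article, part, and subpart), and code_type (for example 'CRIMINAL').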
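For tabular analysis it can help to flatten this nested result into one row per article. The following is a minimal sketch that works on a literal copy of the output shown above; flatten_articles is a hypothetical helper, not a function of the library.

```python
def flatten_articles(result):
    """Turn process_string()-style output into one flat dict per article."""
    rows = []
    for entry in result:
        for art in entry["articles"]:
            rows.append({
                "person": entry["person"],
                "code_type": entry["code_type"],
                "article": art["article"],
                "part": art["part"],
                "subpart": art["subpart"],
            })
    return rows


# Sample value copied from the output shown above
sample = [{
    "person": "Person 1",
    "articles": [{"article": "159", "part": "2", "subpart": None}],
    "code_type": "CRIMINAL",
}]

rows = flatten_articles(sample)
print(rows)
```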

Court Decision JSON Object Processing

In the JSON output of the court decision parser, the name of the defendant along with the articles they are charged with are stored in the names field.

# Example 2: Extract data from JSON object

result = articles_extractor.process_string(data['names'])
print("JSON field processing result:")
print(result)

Output:

[{
    'person': 'Person 1',
    'articles': [{'article': '116.1', 'part': '2', 'subpart': None}],
    'code_type': 'CRIMINAL'
}]

Processing Court Decisions DataFrame

To process a dataframe with a column where the articles are listed, the process_dataframe() method is used. This method can process multiple rows efficiently using parallel processing.

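Parameters, as used in the example in this section: the dataframe to process and the name of the column that lists the articles.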

The extraction method cannot process None or null values, so rows with missing data in the target column must be excluded before processing.

The example uses df[df['names'].notna()] to filter out rows where the 'names' column is null, preventing processing errors.

# Example 3: Processing dataframe

results = articles_extractor.process_dataframe(df[df['names'].notna()], 'names')
print("DataFrame processing results (first row):")
print(results[0])

Output:

DataFrame processing results (first row):
[{
    'person': 'Person 1',
    'articles': [{'article': '116.1', 'part': '2', 'subpart': None}],
    'code_type': 'CRIMINAL'
}]

The method returns a list where each element corresponds to one row from the dataframe. Each element contains a list of dictionaries representing the extracted legal information.
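Because the result is a list of lists, a common next step is to aggregate across rows. The sketch below counts how often each article occurs; count_articles is a hypothetical helper, and the second row of the sample data is invented for illustration.

```python
from collections import Counter


def count_articles(results):
    """Count (code_type, article) pairs across all rows."""
    counts = Counter()
    for row in results:        # one list per dataframe row
        for entry in row:      # one dictionary per person
            for art in entry["articles"]:
                counts[(entry["code_type"], art["article"])] += 1
    return counts


# First row copied from the output above; second row is made up
results = [
    [{"person": "Person 1",
      "articles": [{"article": "116.1", "part": "2", "subpart": None}],
      "code_type": "CRIMINAL"}],
    [{"person": "Person 1",
      "articles": [{"article": "116.1", "part": "1", "subpart": None},
                   {"article": "159", "part": "2", "subpart": None}],
      "code_type": "CRIMINAL"}],
]

article_counts = count_articles(results)
print(article_counts.most_common(1))
```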

Gender Extractor

A module used to extract the gender of judges and defendants from their full names within court decision texts. The extractor analyzes Russian names using multiple detection methods and returns gender classifications with confidence indicators.

Initialization

from src.gender import GenderExtractor
gender_extractor = GenderExtractor(russian_names_db=False)

Before extracting gender information from court decisions, the class should be imported and an instance of the class should be initialized.

Parameters: russian_names_db, set to False in the example above.

Single String Processing

The extract_genders() method processes individual text strings to identify and classify the gender of people mentioned. It uses Named Entity Recognition to extract names and then applies gender detection algorithms.

# Example 1: Extract gender from single string

text = "Волостных Владислав Витальевич - ст.291 ч.3; ст.222 ч.1; ст.290 ч.5 п.в; ст.290 ч.5 п.в; ст.290 ч.5 п.в УК РФ"

result = gender_extractor.extract_genders(text)
print("Single string gender extraction result:")
print(result)

Output:

[('Волостных Владислав Витальевич', 'M')]

The method returns a list of tuples, one per name found, each of the form (full name, gender). In the outputs shown in this section the gender is 'M' or 'F'.
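When only one person is expected in the string, the gender label can be pulled out of the first tuple. This is a small sketch; first_gender is a hypothetical helper, and treating an empty list or None as "no name found" is an assumption about how missing results look.

```python
def first_gender(pairs):
    """Return the gender label of the first (name, gender) tuple, or None."""
    if not pairs:
        return None
    _name, gender = pairs[0]
    return gender


# Sample value copied from the output shown above
sample = [("Волостных Владислав Витальевич", "M")]

print(first_gender(sample))
print(first_gender([]))
```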

Court Decision JSON Object Processing

In the JSON output of the court decision parser, defendant names are stored in the names field and judge names in the judge field.

# Example 2: Extract gender from JSON object
# Defendant gender

defendant = gender_extractor.extract_genders(data['names'])
print("Defendant gender extraction result:")
print(defendant)

# Judge gender
judge = gender_extractor.extract_genders(data['judge'])
print("Judge gender extraction result:")
print(judge)

Output:

Defendant gender extraction result:
[('Петров Николай Сергеевич', 'M')]

Judge gender extraction result:
[('Бабич Светлана Николаевна', 'F')]

Processing Court Decisions DataFrame

To process a dataframe with columns containing names, the extract_genders() method can be applied to each row using pandas' apply() function.

The extraction method cannot process None or null values, so conditional logic must be used when applying it to dataframe columns. The examples use if pd.notna(x) else None to skip missing data. A plain truthiness check such as if x would not work here, because the NaN values that pandas uses for missing cells are truthy in Python.

# Example 3: Processing dataframe - adding gender columns

df['judge_gender'] = df['judge'].apply(
    lambda x: gender_extractor.extract_genders(x) if pd.notna(x) else None)
df['defendant_gender'] = df['names'].apply(
    lambda x: gender_extractor.extract_genders(x) if pd.notna(x) else None)
print("DataFrame processing results (first row):")
print(f"Judge gender: {df['judge_gender'].iloc[0]}")
print(f"Defendant gender: {df['defendant_gender'].iloc[0]}")

Output:

DataFrame processing results (first row):
Judge gender: [('Рысков А Н', 'M')]
Defendant gender: [('Юсупов Гаяз Ризванович', 'M')]

The method returns a list of tuples. When applied to dataframes, new columns will contain these tuples for each processed row, or None for rows with missing data.

Municipality Extractor

A module used to determine the region and municipality where district courts operate based on court codes or court names. The extractor uses a curated dictionary of territorial jurisdiction for each district court, accounting for courts that serve multiple municipalities and large municipalities with several district courts.

Initialization

from src.districts import MunicipalityExtractor
municipality_extractor = MunicipalityExtractor()

Before extracting municipality information from court decisions, the class should be imported and an instance of the class should be initialized.

Parameters: the example above initializes the extractor without arguments.

Single Court Code Processing

The get_municipality() method processes individual court codes to retrieve geographical and administrative information.

# Example 1: Get municipality for single court code

court_code = "61RS0006"
region, municipality, oktmo = municipality_extractor.get_municipality(court_code)
print("Single court code processing result:")
print(f"Court code: {court_code}")
print(f"Region: {region}")
print(f"Municipality: {municipality}")
print(f"OKTMO: {oktmo}")

Output:

Single court code processing result:
Court code: 61RS0006
Region: Ростовская область
Municipality: Ростов-на-Дону
OKTMO: 60701000001

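The method returns three values: the region, the municipality, and the OKTMO code (All-Russian Classifier of Municipal Territories).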

If a court code is not found in the dictionary, the method prints "Check if court code is correct" and returns (None, None, None).
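Calling code should therefore check for the all-None case before using the values. The sketch below works on plain tuples of the shape that get_municipality() returns; format_location is a hypothetical helper, not part of the library.

```python
def format_location(result):
    """Format a (region, municipality, oktmo) tuple for display."""
    region, municipality, oktmo = result
    if region is None and municipality is None and oktmo is None:
        return "unknown court code"
    return f"{municipality}, {region} (OKTMO {oktmo})"


# Values copied from the output above, plus the not-found case
found = ("Ростовская область", "Ростов-на-Дону", "60701000001")
missing = (None, None, None)

print(format_location(found))
print(format_location(missing))
```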

JSON Object Processing

Note: Working with raw data requires an additional function to extract court codes from Case Unique Identifiers (CUI).

def get_court_code(cui):
    """Extract court code from CUI (Case Unique Identifier)"""
    if '-' in cui:
        return cui.split('-')[0]
    else:
        return cui
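The function above assumes that its argument is a string. A more defensive variant, sketched below, returns None for missing or non-string values such as the NaN that pandas uses for empty cells. The name get_court_code_safe and the sample CUI are made up for illustration.

```python
def get_court_code_safe(cui):
    """Return the part of a CUI before the first hyphen.

    Returns None for empty or non-string input (for example None or NaN).
    """
    if not isinstance(cui, str) or not cui.strip():
        return None
    return cui.strip().split("-", 1)[0]


print(get_court_code_safe("61RS0006-01-2023-000123-45"))  # hypothetical CUI
print(get_court_code_safe(float("nan")))
```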

In the JSON output of the court decision parser, the Case Unique Identifier is stored in the cui field.

# Example 2: Extract court code from JSON object and get municipality

court_code = get_court_code(data['cui'])
region, municipality, oktmo = municipality_extractor.get_municipality(court_code)
print("JSON CUI processing result:")
print(f"Court code: {court_code}")
print(f"Region: {region}")
print(f"Municipality: {municipality}")
print(f"OKTMO: {oktmo}")

Output:

JSON CUI processing result:
Court code: 21RS0006
Region: Чувашская
Municipality: Канашский городской округ
OKTMO: 97707000

Processing Court Decisions DataFrame

The extractor includes a dedicated process_dataframe() method for efficient batch processing of dataframes. To get the court code from the cui column, apply the get_court_code() function defined above.

The extraction method cannot process None or null values, so use conditional logic when applying to dataframe columns.

# Example 3: Processing dataframe using built-in method

# pd.notna() skips missing values; NaN would pass a plain truthiness check
df['court_code'] = df['cui'].apply(lambda x: get_court_code(x) if pd.notna(x) else None)
df = municipality_extractor.process_dataframe(df, 'court_code')
print("DataFrame processing results (first row):")
print(f"Court code: {df['court_code'].iloc[0]}")
print(f"Region: {df['region'].iloc[0]}")
print(f"Municipality: {df['municipality'].iloc[0]}")
print(f"OKTMO: {df['oktmo'].iloc[0]}")

Output:

DataFrame processing results (first row):
Court code: 21RS0006
Region: Чувашская
Municipality: Канашский городской округ
OKTMO: 97707000

The extractor returns an updated dataframe with additional columns containing information on the region, municipality and OKTMO (All-Russian Classifier of Municipal Territories code).

Punishment Extractor

A module used to extract structured data from court decision texts regarding punishments and their corresponding severity. The extractor analyzes both individual charge-specific punishments and aggregate sentences, handling complex legal texts that contain multiple sentencing decisions.

Initialization

from src.punishments import PunishmentExtractor
punishment_extractor = PunishmentExtractor(api_key="")  # Insert your API key here

Before extracting punishment information from court decisions, the class should be imported and an instance of the class should be initialized.

Parameters: api_key, the Gemini API key obtained as described in the Getting Gemini API credentials section.

Single String Processing

The find_punishments() method processes individual court decision texts to extract punishment information. It returns both the processed text segment and the structured punishment data.

# Example 1: Extract punishments from single string
input_string = """Шестакова Александра Владимировича признать виновным в совершении
преступлений, предусмотренных п. «з» ч.2 ст.111, п.«а» ч.3 ст.158
Уголовного кодекса Российской Федерации и назначить ему наказание:
- по п. «з» ч.2 ст.111 УК РФ – в виде лишения свободы на срок три года;
- по п. «а» ч.3 ст.158 УК РФ – в виде лишения свободы на срок два года.
На основании ч.3 ст. 69 Уголовного кодекса Российской Федерации
по совокупности преступлений путем частичного сложения наказаний
окончательно назначить Шестакову Александру Владимировичу наказание
в виде лишения свободы на срок четыре года."""

initial_string, res = punishment_extractor.find_punishments(input_string)
print("Single string punishment extraction result:")
print(res)

Output:

Single string punishment extraction result:
{'individual_charges': [
    {'article': '111', 'part': '2', 'punishment_type': 'лишение свободы', 'duration': '3 года'},
    {'article': '158', 'part': '3', 'punishment_type': 'лишение свободы', 'duration': '2 года'}
],
'final_sentence': {'punishment_type': 'лишение свободы', 'duration': '4 года'}}

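The method returns a tuple containing the processed text segment and a dictionary of structured punishment data. In the output above, the dictionary has the keys individual_charges and final_sentence.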
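The second element of the tuple can be reduced to a short summary for reporting. The sketch below uses a literal copy of the output shown above; summarize is a hypothetical helper, and it uses .get() because not every key is guaranteed to be present in every result.

```python
def summarize(res):
    """Return (number of individual charges, final type, final duration)."""
    charges = res.get("individual_charges") or []
    final = res.get("final_sentence") or {}
    return len(charges), final.get("punishment_type"), final.get("duration")


# Sample value copied from the output shown above
sample = {
    "individual_charges": [
        {"article": "111", "part": "2",
         "punishment_type": "лишение свободы", "duration": "3 года"},
        {"article": "158", "part": "3",
         "punishment_type": "лишение свободы", "duration": "2 года"},
    ],
    "final_sentence": {"punishment_type": "лишение свободы",
                       "duration": "4 года"},
}

print(summarize(sample))
```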

Important Notes:

JSON Object Processing

In the JSON output of the court decision parser, the full court decision text is stored in the text field.

# Example 2: Extract punishments from JSON object

initial_string, res = punishment_extractor.find_punishments(data['text'])
print("JSON text processing result:")
print("Extracted punishments:", res)

Output:

JSON text processing result:
Extracted punishments: {'individual_charges': [...], 'final_sentence': {...}}

Processing Court Decisions DataFrame

The extractor can be applied to dataframes containing court decision texts. Due to API requirements, processing may take longer for large datasets.
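One way to limit the number of API calls is to cache results so that each distinct text is processed only once. The sketch below shows the idea with functools.lru_cache and a stand-in function; fake_find_punishments only imitates the shape of the find_punishments() return value and makes no network request.

```python
from functools import lru_cache

calls = []  # records every simulated API call


def fake_find_punishments(text):
    """Stand-in for punishment_extractor.find_punishments (no API access)."""
    calls.append(text)
    return text, {"individual_charges": [], "final_sentence": {}}


@lru_cache(maxsize=None)
def cached_punishments(text):
    """Run the extractor once per distinct text and reuse the result."""
    return fake_find_punishments(text)[1]


texts = ["decision A", "decision B", "decision A"]
results = [cached_punishments(t) for t in texts]
print(len(calls))  # the repeated text does not trigger a second call
```

Cached calls return the same dictionary object for repeated texts, so copy it before modifying it in place.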

# Example 3: Processing dataframe
# Note: This extractor requires an API key for processing
# pd.notna() skips missing values; NaN would pass a plain truthiness check

df['punishments'] = df['text'].apply(
    lambda x: punishment_extractor.find_punishments(x)[1] if pd.notna(x) else None)
print("DataFrame processing results (first row):")
print(f"Punishments: {df['punishments'].iloc[0]}")

Output:

DataFrame processing results (first row):
Punishments: {'individual_charges': [{'article': '159', 'punishment_type': 'штраф', 'amount': '50000'}], 'final_sentence': {'punishment_type': 'штраф', 'amount': '50000'}}