How to use court extractor

Overview

A Python library for extracting structured data from Russian court decisions, including modules to extract punishments, territorial jurisdiction, gender information, and criminal code articles.

Getting Gemini API credentials

To extract structured punishment data from a court decision, use the PunishmentExtractor class. It requires a Gemini API key, which you can get from Gemini (Google Cloud AI Studio):

open the Enabled APIs and services tab
create a project
open the Credentials tab
choose Create credentials - API key
copy the key to use when initializing the extractor
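
Once the key is copied, it is usually safer to keep it out of the source code. The sketch below reads it from an environment variable; the variable name GEMINI_API_KEY and the helper load_api_key are assumptions made for this example, not something the library defines.

```python
import os


def load_api_key(env=None, var_name="GEMINI_API_KEY"):
    """Return the API key stored in an environment variable.

    GEMINI_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Demonstration with an explicit mapping instead of the real environment
demo_key = load_api_key({"GEMINI_API_KEY": "example-key"})
print(demo_key)  # example-key
```

In real use, call load_api_key() without arguments and pass the result as the api_key argument shown in the Punishment Extractor section.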

Setup Steps

Clone the repository:

git clone https://github.com/Cedar-Russia/court_extractor

Install dependencies:

pip install -r requirements.txt

Using the Parser

import os
import sys
import json
import pandas as pd

# Add the project root directory to the Python path
project_root = os.path.abspath(os.path.join(os.path.dirname('__file__'), '..'))
sys.path.append(project_root)
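
A note on what the path expression resolves to: because '__file__' is written in quotes, it is a plain string rather than the module path, so the result is the parent of the current working directory. This presumably suits a notebook or script run from a subfolder of the repository (an assumption about the intended layout). A short check:

```python
import os

# '__file__' in quotes is an ordinary string, so dirname() returns ''
print(repr(os.path.dirname('__file__')))  # ''

project_root = os.path.abspath(os.path.join(os.path.dirname('__file__'), '..'))
parent_of_cwd = os.path.dirname(os.getcwd())

# The computed root is simply the directory above the working directory
print(project_root == parent_of_cwd)  # True
```

If the code is run from the repository root instead, the '..' step would point one level too high, and appending os.getcwd() would be the equivalent choice.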

Most of the modules in the library can take both a data table and a string. To show the functionality of the library, we will use the standard output of the parser that has both a CSV table and a JSON object with court decisions.

# Load a sample court decision in JSON format
file_path = "data/raw/sample_court_decision.json"
with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)

# Load a dataframe with parser results
csv_path = 'data/raw/sample_decisions/sample_decisions.csv'
df = pd.read_csv(csv_path, sep=';')

Articles Extractor

Module used to extract articles, parts and subparts of the Criminal Code (Ugolovnyi Kodeks) and the Code of Administrative Offenses (Kodeks ob Administrativnykh Pravonarusheniyakh) from the string representing all articles under which charges are filed.

Initialization

# Importing the class that extracts articles
from src.articles import ArticlesExtractor

# Initializing an instance of the extractor
articles_extractor = ArticlesExtractor()

Before extracting articles from the table with court decisions, the class should be imported and an instance of the class should be initialized.

ArticlesExtractor has only one parameter: remove_duplicates, which is True by default, meaning that by default duplicate articles for the same defendant are removed.
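
To make the effect of remove_duplicates concrete, here is a minimal sketch of order-preserving de-duplication over article dictionaries. It is not the library's implementation: the helper name dedupe_articles, the choice to keep the first occurrence, and the way the subpart value is written are all assumptions for illustration.

```python
def dedupe_articles(articles):
    """Drop repeated (article, part, subpart) entries, keeping first occurrences."""
    seen = set()
    unique = []
    for entry in articles:
        key = (entry["article"], entry["part"], entry["subpart"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hypothetical parsed charges: ст.291 ч.3; ст.290 ч.5 п.в; ст.290 ч.5 п.в
charges = [
    {"article": "291", "part": "3", "subpart": None},
    {"article": "290", "part": "5", "subpart": "в"},  # subpart format is assumed
    {"article": "290", "part": "5", "subpart": "в"},
]
print(len(dedupe_articles(charges)))  # 2
```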

Single String Extraction

The process_string() method processes individual text strings to extract legal articles. It accepts a string containing legal references and returns a structured dictionary with detailed information about each article found.

# Example 1: Process single string
test_string = "Губаев Борис Магомедович - ст.159 ч.2 УК РФ"

result = articles_extractor.process_string(test_string)
print("Single string processing result:")
print(result)

Output:

[{
   'person': 'Person 1',
   'articles': [{'article': '159', 'part': '2', 'subpart': None}],
   'code_type': 'CRIMINAL'
}]

The method returns a list containing dictionaries with the following structure:

person: Identifier for the individual associated with the charges
articles: List of extracted articles with their components (article number, part, subpart)
code_type: Classification of the legal code (CRIMINAL or ADMINISTRATIVE)
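
Because the result is nested (persons, then articles), it is often handy to flatten it before analysis. The sketch below does that for a literal copy of the output shown above; flatten_articles is a hypothetical helper, not part of the library.

```python
def flatten_articles(result):
    """Turn process_string()-style output into one flat tuple per article."""
    rows = []
    for person_entry in result:
        for art in person_entry["articles"]:
            rows.append((
                person_entry["person"],
                person_entry["code_type"],
                art["article"],
                art["part"],
                art["subpart"],
            ))
    return rows


# Literal copy of the example output
sample = [{
    "person": "Person 1",
    "articles": [{"article": "159", "part": "2", "subpart": None}],
    "code_type": "CRIMINAL",
}]
print(flatten_articles(sample))
# [('Person 1', 'CRIMINAL', '159', '2', None)]
```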

Court Decision JSON Object Processing

In the JSON output of the court decision parser, the defendant's name and the articles they are charged with are stored together in the names field.

# Example 2: Extract data from JSON object

result = articles_extractor.process_string(data['names'])
print("JSON field processing result:")
print(result)

Output:

[{
    'person': 'Person 1',
    'articles': [{'article': '116.1', 'part': '2', 'subpart': None}],
    'code_type': 'CRIMINAL'
}]

Processing Court Decisions DataFrame

To process a dataframe with a column where the articles are listed, the process_dataframe() method is used. This method can process multiple rows efficiently using parallel processing.

Parameters:

df - the dataframe with court decisions
column_name - name of the column containing the text with articles to extract
parallel - enables parallel processing (default: True)
n_workers - number of worker threads for parallel processing (default: 4). Optimal values typically range from 2 to 8, depending on your system’s CPU cores and workload size.
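
The parallel and n_workers parameters describe a common pattern: apply one function to many rows on a pool of worker threads while keeping the results in row order. The sketch below shows that pattern with the standard library only. It is not the library's code; run_rows and the stand-in count_articles function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_rows(values, func, parallel=True, n_workers=4):
    """Apply func to every value, optionally on a pool of worker threads."""
    if not parallel:
        return [func(value) for value in values]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in input order, so output stays row-aligned
        return list(pool.map(func, values))


def count_articles(text):
    """Stand-in for a real extractor: counts 'ст.' markers in the text."""
    return text.count("ст.")


rows = ["Иванов - ст.159 ч.2 УК РФ", "Петров - ст.291 ч.3; ст.222 ч.1 УК РФ"]
print(run_rows(rows, count_articles, n_workers=2))  # [1, 2]
```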

The extraction method cannot process None or null values, so rows with missing data in the target column must be excluded before processing.

The example uses df[df['names'].notna()] to filter out rows where the ‘names’ column is null, preventing processing errors.

# Example 3: Processing dataframe

results = articles_extractor.process_dataframe(df[df['names'].notna()], 'names')
print("DataFrame processing results (first row):")
print(results[0])

Output:

DataFrame processing results (first row):
[{
    'person': 'Person 1',
    'articles': [{'article': '116.1', 'part': '2', 'subpart': None}],
    'code_type': 'CRIMINAL'
}]

The method returns a list where each element corresponds to one row from the dataframe. Each element is a list of dictionaries with the extracted legal information, in the same format that process_string() returns.
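
Since the return value is a list of per-row lists, a simple aggregate needs three nested loops. As an illustration, this sketch counts how often each article appears; count_articles_by_code is a hypothetical helper and the second row is made-up data in the same shape as the output above.

```python
from collections import Counter


def count_articles_by_code(results):
    """Count (code_type, article) pairs across all rows and all persons."""
    counts = Counter()
    for row in results:          # one entry per dataframe row
        for person in row:       # one entry per person in that row
            for art in person["articles"]:
                counts[(person["code_type"], art["article"])] += 1
    return counts


results = [
    [{"person": "Person 1",
      "articles": [{"article": "116.1", "part": "2", "subpart": None}],
      "code_type": "CRIMINAL"}],
    [{"person": "Person 1",
      "articles": [{"article": "159", "part": "2", "subpart": None},
                   {"article": "116.1", "part": "1", "subpart": None}],
      "code_type": "CRIMINAL"}],
]
print(count_articles_by_code(results).most_common(1))
# [(('CRIMINAL', '116.1'), 2)]
```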

Gender Extractor

A module used to extract the gender of judges and defendants from their full names within court decision texts. The extractor analyzes Russian names using multiple detection methods and returns gender classifications with confidence indicators.

Initialization

from src.gender import GenderExtractor
gender_extractor = GenderExtractor(russian_names_db=False)

Before extracting gender information from court decisions, the class should be imported and an instance of the class should be initialized.

Parameters:

russian_names_db - enables an additional Russian names database for gender detection (default: False)
- When russian_names_db=False: uses only the pytrovich library
- When russian_names_db=True: cross-validates results between the pytrovich and russiannames libraries
  - If both methods agree, returns the agreed gender
  - If the methods disagree, returns ‘C’ for contradiction
  - If one method is undefined and the other has a result, returns the definitive result
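
The cross-validation rules can be written as a small decision function. This is a sketch of the rules as described here, not the library's own code; reconcile is a hypothetical name, and 'U' is used for an undefined result, following the codes listed in the next subsection.

```python
def reconcile(first, second):
    """Combine two gender guesses ('M', 'F' or 'U') into one code."""
    if first == second:
        return first      # both agree (including both undefined)
    if first == "U":
        return second     # only the second method has a result
    if second == "U":
        return first      # only the first method has a result
    return "C"            # both defined but different: contradiction


print(reconcile("M", "M"))  # M
print(reconcile("M", "F"))  # C
print(reconcile("U", "F"))  # F
```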

Single String Processing

The extract_genders() method processes individual text strings to identify and classify the gender of people mentioned. It uses Named Entity Recognition to extract names and then applies gender detection algorithms.

# Example 1: Extract gender from single string

text = "Волостных Владислав Витальевич - ст.291 ч.3; ст.222 ч.1; ст.290 ч.5 п.в; ст.290 ч.5 п.в; ст.290 ч.5 п.в УК РФ"

result = gender_extractor.extract_genders(text)
print("Single string gender extraction result:")
print(result)

Output:

[('Волостных Владислав Витальевич', 'M')]

The method returns a list of tuples with the following structure:

Name: The full name as extracted and normalized from the text
Gender: A single-character classification based on name analysis
- M: Male
- F: Female
- U: Undefined (cannot determine gender)
- C: Contradiction (conflicting gender indicators between detection methods)
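
In practice the U and C codes mark names that may need a manual check. A minimal sketch of separating them from resolved results follows; split_by_confidence is a hypothetical helper and the second name is made-up data.

```python
NEEDS_REVIEW = {"U", "C"}


def split_by_confidence(pairs):
    """Separate (name, gender) tuples into resolved and needs-review lists."""
    resolved = [(name, g) for name, g in pairs if g not in NEEDS_REVIEW]
    review = [(name, g) for name, g in pairs if g in NEEDS_REVIEW]
    return resolved, review


pairs = [("Волостных Владислав Витальевич", "M"), ("Иванов И И", "U")]
resolved, review = split_by_confidence(pairs)
print(resolved)  # [('Волостных Владислав Витальевич', 'M')]
print(review)    # [('Иванов И И', 'U')]
```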

Court Decision JSON Object Processing

In the JSON output of the court decision parser, defendant names are stored in the names field and judge names in the judge field.

# Example 2: Extract gender from JSON object

# Defendant gender
defendant = gender_extractor.extract_genders(data['names'])
print("Defendant gender extraction result:")
print(defendant)

# Judge gender
judge = gender_extractor.extract_genders(data['judge'])
print("Judge gender extraction result:")
print(judge)

Output:

Defendant gender extraction result:
[('Петров Николай Сергеевич', 'M')]

Judge gender extraction result:
[('Бабич Светлана Николаевна', 'F')]

Processing Court Decisions DataFrame

To process a dataframe with columns containing names, the extract_genders() method can be applied to each row using pandas’ apply() function.

The extraction method cannot process None or null values, so conditional logic must be used when applying it to dataframe columns. Missing CSV cells are loaded by pandas as NaN, which Python treats as truthy, so a bare if x check would still pass NaN to the extractor. The examples therefore use if isinstance(x, str) and x else None, which skips NaN, None, and empty strings.

# Example 3: Processing dataframe - adding gender columns

df['judge_gender'] = df['judge'].apply(
    lambda x: gender_extractor.extract_genders(x) if isinstance(x, str) and x else None)
df['defendant_gender'] = df['names'].apply(
    lambda x: gender_extractor.extract_genders(x) if isinstance(x, str) and x else None)
print("DataFrame processing results (first row):")
print(f"Judge gender: {df['judge_gender'].iloc[0]}")
print(f"Defendant gender: {df['defendant_gender'].iloc[0]}")

Output:

DataFrame processing results (first row):
Judge gender: [('Рысков А Н', 'M')]
Defendant gender: [('Юсупов Гаяз Ризванович', 'M')]

The method returns a list of tuples. When applied to dataframes, new columns will contain these tuples for each processed row, or None for rows with missing data.
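
For analysis it is often easier to work with a single gender code per row than with a list of tuples. The sketch below collapses one cell to the code of the first name found. first_gender is a hypothetical helper, and treating an empty list as "no name recognised" is an assumption about how the extractor behaves.

```python
def first_gender(cell):
    """Return the gender code of the first (name, gender) tuple, or None."""
    if not cell:            # None (missing row) or an empty list
        return None
    _name, gender = cell[0]
    return gender


print(first_gender([("Юсупов Гаяз Ризванович", "M")]))  # M
print(first_gender(None))                                # None
print(first_gender([]))                                  # None
```

Applied to the dataframe, this could look like df['judge_gender'].apply(first_gender), stored in a new column of your choice.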

Municipality Extractor

A module used to determine the region and municipality where district courts operate based on court codes or court names. The extractor uses a curated dictionary of territorial jurisdiction for each district court, accounting for courts that serve multiple municipalities and large municipalities with several district courts.

Initialization

from src.districts import MunicipalityExtractor
municipality_extractor = MunicipalityExtractor()

Before extracting municipality information from court decisions, the class should be imported and an instance of the class should be initialized.

Parameters:

use_name - determines whether to use court names instead of court codes for lookup (default: False)
dict_path - path to the court dictionary CSV file (default: uses built-in dictionary)

Single Court Code Processing

The get_municipality() method processes individual court codes to retrieve geographical and administrative information.

# Example 1: Get municipality for single court code

court_code = "61RS0006"
region, municipality, oktmo = municipality_extractor.get_municipality(court_code)
print("Single court code processing result:")
print(f"Court code: {court_code}")
print(f"Region: {region}")
print(f"Municipality: {municipality}")
print(f"OKTMO: {oktmo}")

Output:

Single court code processing result:
Court code: 61RS0006
Region: Ростовская область
Municipality: Ростов-на-Дону
OKTMO: 60701000001

The method returns three values:

- Region: The federal subject (oblast, republic, etc.) where the court is located
- Municipality: The specific city, district, or municipal entity
- OKTMO: All-Russian Classifier of Municipal Territories code

If a court code is not found in the dictionary, the method prints “Check if court code is correct” and returns (None, None, None).
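
Callers should therefore check for None before using the values. The sketch below shows one way to do that with a stand-in lookup that follows the same three-value return convention; lookup, describe, and SAMPLE_COURTS are hypothetical, and the single dictionary entry is the one from the example above.

```python
# Tiny stand-in dictionary with the one entry shown in the example
SAMPLE_COURTS = {
    "61RS0006": ("Ростовская область", "Ростов-на-Дону", "60701000001"),
}


def lookup(court_code):
    """Stand-in for get_municipality(): same three-value return convention."""
    return SAMPLE_COURTS.get(court_code, (None, None, None))


def describe(court_code):
    """Format a lookup result, handling the not-found case explicitly."""
    region, municipality, oktmo = lookup(court_code)
    if region is None:
        return f"{court_code}: not in dictionary"
    return f"{court_code}: {municipality}, {region} (OKTMO {oktmo})"


print(describe("61RS0006"))
print(describe("00XX0000"))  # 00XX0000: not in dictionary
```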

JSON Object Processing

Note: Working with raw data requires an additional function to extract court codes from Case Unique Identifiers (CUI).

def get_court_code(cui):
    """Extract court code from CUI (Case Unique Identifier)"""
    if '-' in cui:
        return cui.split('-')[0]
    else:
        return cui
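
Raw data can contain missing identifiers, and a missing CSV cell arrives as NaN (a float), for which the '-' in cui test raises a TypeError. A more defensive variant might look like the sketch below; safe_court_code is a hypothetical name and the CUI in the example is made up.

```python
def safe_court_code(cui):
    """Return the part of a CUI before the first hyphen, or None if unusable."""
    if not isinstance(cui, str):   # None, NaN (a float) and other non-strings
        return None
    cui = cui.strip()
    if not cui:
        return None
    return cui.split("-")[0]


# The CUI below is a made-up example, not taken from the sample data
print(safe_court_code("21RS0006-01-2023-000123-45"))  # 21RS0006
print(safe_court_code(float("nan")))                   # None
print(safe_court_code("21RS0006"))                     # 21RS0006
```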

In the JSON output of the court decision parser, the Case Unique Identifier is stored in the cui field.

# Example 2: Extract court code from JSON object and get municipality

court_code = get_court_code(data['cui'])
region, municipality, oktmo = municipality_extractor.get_municipality(court_code)
print("JSON CUI processing result:")
print(f"Court code: {court_code}")
print(f"Region: {region}")
print(f"Municipality: {municipality}")
print(f"OKTMO: {oktmo}")

Output:

JSON CUI processing result:
Court code: 21RS0006
Region: Чувашская
Municipality: Канашский городской округ
OKTMO: 97707000

Processing Court Decisions DataFrame

The extractor includes a dedicated process_dataframe() method for efficient batch processing of dataframes. To get the court code from the cui column, first apply the get_court_code() function defined above.

The extraction method cannot process None or null values, so use conditional logic when applying to dataframe columns.

# Example 3: Processing dataframe using built-in method

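# isinstance() also skips NaN (missing cells), which a bare `if x` would let through
df['court_code'] = df['cui'].apply(lambda x: get_court_code(x) if isinstance(x, str) and x else None)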
df = municipality_extractor.process_dataframe(df, 'court_code')
print("DataFrame processing results (first row):")
print(f"Court code: {df['court_code'].iloc[0]}")
print(f"Region: {df['region'].iloc[0]}")
print(f"Municipality: {df['municipality'].iloc[0]}")
print(f"OKTMO: {df['oktmo'].iloc[0]}")

Output:

DataFrame processing results (first row):
Court code: 21RS0006
Region: Чувашская
Municipality: Канашский городской округ
OKTMO: 97707000

The extractor returns an updated dataframe with additional columns containing information on the region, municipality and OKTMO (All-Russian Classifier of Municipal Territories code).

Punishment Extractor

A module used to extract structured data from court decision texts regarding punishments and their corresponding severity. The extractor analyzes both individual charge-specific punishments and aggregate sentences, handling complex legal texts that contain multiple sentencing decisions.

Initialization

from src.punishments import PunishmentExtractor
punishment_extractor = PunishmentExtractor(api_key="")  # Insert your API key here

Before extracting punishment information from court decisions, the class should be imported and an instance of the class should be initialized.

Parameters:

api_key - required API key for processing legal texts (must be provided for the extractor to function)

Single String Processing

The find_punishments() method processes individual court decision texts to extract punishment information. It returns both the processed text segment and the structured punishment data.

# Example 1: Extract punishments from single string
input_string = """Шестакова Александра Владимировича признать виновным в совершении преступлений, предусмотренных п. «з» ч.2 ст.111, п.«а» ч.3 ст.158 Уголовного кодекса Российской Федерации и назначить ему наказание: - по п. «з» ч.2 ст.111 УК РФ – в виде лишения свободы на срок три года; - по п. «а» ч.3 ст.158 УК РФ – в виде лишения свободы на срок два года. На основании ч.3 ст. 69 Уголовного кодекса Российской Федерации по совокупности преступлений путем частичного сложения наказаний окончательно назначить Шестакову Александру Владимировичу наказание в виде лишения свободы на срок четыре года."""

initial_string, res = punishment_extractor.find_punishments(input_string)
print("Single string punishment extraction result:")
print(res)

Output:

Single string punishment extraction result:
{'individual_charges': [
    {'article': '111', 'part': '2', 'punishment_type': 'лишение свободы', 'duration': '3 года'},
    {'article': '158', 'part': '3', 'punishment_type': 'лишение свободы', 'duration': '2 года'}
],
'final_sentence': {'punishment_type': 'лишение свободы', 'duration': '4 года'}}

The method returns a tuple containing:

Initial string: The relevant text segment that was processed
Structured data: Dictionary with detailed punishment information that includes:
- Individual charges: List of punishments for each specific criminal article
- Final sentence: The aggregate or final punishment imposed by the court
- Punishment types: Various forms including imprisonment (лишение свободы), fines (штраф), community service, etc.
- Duration/Amount: Specific terms, periods, or monetary amounts
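
To show how the structured data can be consumed, the sketch below renders it as readable lines. It relies only on the keys visible in this document's example outputs and reads them defensively, since not every key appears in every result; summarize is a hypothetical helper.

```python
def summarize(res):
    """Render a find_punishments()-style dictionary as readable lines."""
    lines = []
    for charge in res.get("individual_charges", []):
        # Prison terms carry 'duration'; fines may carry 'amount' instead
        size = charge.get("duration") or charge.get("amount") or "n/a"
        part = f" ч.{charge['part']}" if charge.get("part") else ""
        lines.append(f"ст.{charge['article']}{part}: {charge['punishment_type']}, {size}")
    final = res.get("final_sentence") or {}
    if final:
        size = final.get("duration") or final.get("amount") or "n/a"
        lines.append(f"final: {final['punishment_type']}, {size}")
    return lines


sample = {  # copied from the example output
    "individual_charges": [
        {"article": "111", "part": "2", "punishment_type": "лишение свободы", "duration": "3 года"},
        {"article": "158", "part": "3", "punishment_type": "лишение свободы", "duration": "2 года"},
    ],
    "final_sentence": {"punishment_type": "лишение свободы", "duration": "4 года"},
}
for line in summarize(sample):
    print(line)
# ст.111 ч.2: лишение свободы, 3 года
# ст.158 ч.3: лишение свободы, 2 года
# final: лишение свободы, 4 года
```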

Important Notes:

This extractor requires a valid API key to function
The method cannot process None or null values, so use conditional logic when applying to dataframe columns
Processing large datasets may be time-intensive due to API calls
The extractor handles complex sentencing scenarios including concurrent and consecutive sentences

JSON Object Processing

In the JSON output of the court decision parser, the full court decision text is stored in the text field.

# Example 2: Extract punishments from JSON object

initial_string, res = punishment_extractor.find_punishments(data['text'])
print("JSON text processing result:")
print("Extracted punishments:", res)

Output:

JSON text processing result:
Extracted punishments: {'individual_charges': [...], 'final_sentence': {...}}

Processing Court Decisions DataFrame

The extractor can be applied to dataframes containing court decision texts. Due to API requirements, processing may take longer for large datasets.

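# Example 3: Processing dataframe
# Note: This extractor requires an API key for processing

# isinstance() skips NaN (missing cells), which a bare `if x` would let through
df['punishments'] = df['text'].apply(
    lambda x: punishment_extractor.find_punishments(x)[1] if isinstance(x, str) and x else None)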
print("DataFrame processing results (first row):")
print(f"Punishments: {df['punishments'].iloc[0]}")

Output:

DataFrame processing results (first row):
Punishments: {'individual_charges': [{'article': '159', 'punishment_type': 'штраф', 'amount': '50000'}], 'final_sentence': {'punishment_type': 'штраф', 'amount': '50000'}}
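
Because every call goes through the API, avoiding repeated calls for identical texts can save time on large datasets. The sketch below wraps any extractor function in a simple in-memory cache and skips missing values. It is an illustration only: make_cached is a hypothetical helper, and fake_extract stands in for the real API-backed call.

```python
def make_cached(extract):
    """Wrap an extractor so each distinct text is processed only once."""
    cache = {}

    def cached(text):
        if not isinstance(text, str) or not text:
            return None               # skip None, NaN and empty strings
        if text not in cache:
            cache[text] = extract(text)
        return cache[text]

    return cached


calls = []


def fake_extract(text):
    """Stand-in for an API-backed call; records how often it is invoked."""
    calls.append(text)
    return {"length": len(text)}


cached_extract = make_cached(fake_extract)
for text in ["приговор А", "приговор А", None, "приговор Б"]:
    cached_extract(text)
print(len(calls))  # 2
```

With the real extractor, the wrapped function could be lambda t: punishment_extractor.find_punishments(t)[1], and the cached version passed to df['text'].apply().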
