
Feature: Intelligent Chunking v3

pipeline/rechunk_v3.py — The most sophisticated and critical component of the pipeline. Splits legal markdown (output of process_books.py) into semantically coherent chunks using domain-specific heuristics tuned for Brazilian and international legal textbooks.

| Property | Value |
| --- | --- |
| Script | pipeline/rechunk_v3.py (890 lines) |
| Input | Markdown files in _staging/processed/{book}/ |
| Output | Re-chunked markdown files (same directory, overwritten) |
| Min chunk | 1,500 characters (MIN_CHUNK_CHARS) |
| Max chunk | 15,000 characters (MAX_CHUNK_CHARS) |
| Test coverage | 0% — flagged as F26 (P1, v0.3) |

The rechunker processes each book through five sequential passes. Each pass has a specific responsibility, and the output of one feeds into the next.

```mermaid
flowchart TD
    INPUT["Markdown from process_books.py"]
    P1["Pass 1: Section Split"]
    P2["Pass 2: Classify"]
    P3["Pass 3: Merge Small"]
    P4["Pass 4: Split Oversized"]
    P5["Pass 5: Cleanup"]
    OUTPUT["Semantic chunks"]
    INPUT --> P1
    P1 -->|"Split by 14 section patterns"| P2
    P2 -->|"Tag: noise, bibliography, summary, content"| P3
    P3 -->|"Merge chunks < 1500 chars with neighbors"| P4
    P4 -->|"Split chunks > 15000 chars at sentence boundaries"| P5
    P5 -->|"Remove empty, normalize whitespace"| OUTPUT
```
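The five-pass flow can be sketched as a small pipeline. This is an illustrative skeleton only: the function names, the chunk dicts, and the toy pass bodies are assumptions for the sketch, and the demo thresholds are far smaller than the real MIN_CHUNK_CHARS (1,500) and MAX_CHUNK_CHARS (15,000).

```python
def pass1_split(text: str) -> list[dict]:
    # Toy stand-in for the 14-pattern section splitter: split on blank lines.
    return [{"text": p.strip(), "type": ""} for p in text.split("\n\n") if p.strip()]

def pass2_classify(chunks: list[dict]) -> list[dict]:
    for c in chunks:
        c["type"] = "regular"  # the real pass tags noise/bibliography/summary/content
    return chunks

def pass3_merge_small(chunks: list[dict], min_chars: int = 40) -> list[dict]:
    merged: list[dict] = []
    for c in chunks:
        if merged and len(merged[-1]["text"]) < min_chars:
            merged[-1]["text"] += "\n\n" + c["text"]  # absorb into small neighbor
        else:
            merged.append(c)
    return merged

def pass4_split_oversized(chunks: list[dict], max_chars: int = 200) -> list[dict]:
    out: list[dict] = []
    for c in chunks:
        t = c["text"]
        while len(t) > max_chars:
            cut = t.rfind(". ", 0, max_chars) + 1 or max_chars  # sentence boundary
            out.append({**c, "text": t[:cut].strip()})
            t = t[cut:].strip()
        out.append({**c, "text": t})
    return out

def pass5_cleanup(chunks: list[dict]) -> list[dict]:
    return [c for c in chunks if c["text"].strip()]

def rechunk(text: str) -> list[dict]:
    chunks = pass1_split(text)
    for p in (pass2_classify, pass3_merge_small, pass4_split_oversized, pass5_cleanup):
        chunks = p(chunks)  # each pass consumes the previous pass's output
    return chunks
```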

Detects section boundaries using 14 regex patterns (see below) and splits the document at each boundary. Each detected section header becomes the title of a new chunk.
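A minimal sketch of the boundary split, assuming only the md_header pattern (#, ##, ###) from the table below; the real splitter applies 14 patterns in priority order, and split_sections is a hypothetical name:

```python
import re

HEADER = re.compile(r"^#{1,3}\s+(.+)$", re.MULTILINE)

def split_sections(doc: str) -> list[tuple[str, str]]:
    """Return (title, body) pairs, one per detected section."""
    sections, last_title, last_end = [], "(preamble)", 0
    for m in HEADER.finditer(doc):
        body = doc[last_end:m.start()].strip()
        if body:
            sections.append((last_title, body))
        # The matched header becomes the title of the next chunk.
        last_title, last_end = m.group(1).strip(), m.end()
    body = doc[last_end:].strip()
    if body:
        sections.append((last_title, body))
    return sections
```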

Classifies each chunk by content type using classify_block_content():

| Classification | Detection Logic | Handling |
| --- | --- | --- |
| example | Patterns: "por exemplo", "imagine que", "e.g.", "for instance" | Kept with preceding principle |
| table | More than 5 pipe characters and 2+ newlines | Can be standalone chunk |
| characteristics | 3+ contract characteristic terms (bilateral, oneroso, consensual…) | Kept as indivisible block |
| law_article | Starts with Art. {number} | Kept with subsequent commentary |
| bibliography | >50% of lines start with ALL CAPS author names, 5+ lines | Extracted as separate chunk |
| regular | Default | Normal processing |
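The classification rules can be approximated in a few lines. This is a simplified sketch, not the actual classify_block_content(): the check order, the term list, and the cue strings are assumptions mirroring the table.

```python
import re

EXAMPLE_CUES = ("por exemplo", "imagine que", "e.g.", "for instance")
CHARACTERISTIC_TERMS = ("bilateral", "oneroso", "consensual", "comutativo", "aleatorio")

def classify_block(text: str) -> str:
    lower = text.lower()
    lines = [l for l in text.splitlines() if l.strip()]
    if text.count("|") > 5 and text.count("\n") >= 2:
        return "table"
    if re.match(r"Art\.\s*\d+", text.strip()):
        return "law_article"
    if sum(t in lower for t in CHARACTERISTIC_TERMS) >= 3:
        return "characteristics"
    # Bibliography: mostly lines opening with ALL-CAPS author surnames.
    caps = sum(bool(re.match(r"[A-ZÁÉÍÓÚÀÂÊÔÃÕÇ]{2,}", l.strip())) for l in lines)
    if len(lines) >= 5 and caps / len(lines) > 0.5:
        return "bibliography"
    if any(c in lower for c in EXAMPLE_CUES):
        return "example"
    return "regular"
```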

Chunks below MIN_CHUNK_CHARS (1,500) are merged with their neighbors. The merge strategy preserves semantic coherence:

  • Examples merge with the preceding principle they illustrate
  • Footnotes merge with their referencing paragraph
  • Law articles merge with their commentary
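The direction-aware merge can be sketched as follows. The behavior is assumed from the bullets above (examples and footnotes attach backward, law articles pull the following commentary forward); merge_small is a hypothetical name, not the real implementation.

```python
def merge_small(chunks: list[dict], min_chars: int = 1_500) -> list[dict]:
    out: list[dict] = []
    i = 0
    while i < len(chunks):
        c = dict(chunks[i])
        # Law article: absorb the following commentary until big enough.
        while (c["type"] == "law_article" and len(c["text"]) < min_chars
               and i + 1 < len(chunks)):
            i += 1
            c["text"] += "\n\n" + chunks[i]["text"]
        # Small example/footnote: attach to the preceding chunk.
        if out and c["type"] in ("example", "footnote") and len(c["text"]) < min_chars:
            out[-1]["text"] += "\n\n" + c["text"]
        else:
            out.append(c)
        i += 1
    return out
```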

Chunks exceeding MAX_CHUNK_CHARS (15,000) are split at sentence boundaries, ensuring no chunk exceeds the maximum while keeping sentences intact.
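One way to implement this, sketched under the assumption that sentences are packed greedily until the limit (split_at_sentences is an illustrative name):

```python
import re

def split_at_sentences(text: str, max_chars: int = 15_000) -> list[str]:
    # Naive sentence split: break after ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    parts, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            parts.append(current)   # flush: adding s would exceed the limit
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        parts.append(current)
    return parts
```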

Final pass removes empty chunks, normalizes whitespace, and validates that all chunks meet the minimum content threshold (200 characters of real text).
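A minimal sketch of the final validation, assuming "real text" is measured after stripping markdown markup and collapsing whitespace (the exact measure in rechunk_v3.py may differ):

```python
import re

def real_text_len(chunk: str) -> int:
    text = re.sub(r"[#*_`>|-]", "", chunk)     # drop markdown markup characters
    text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
    return len(text)

def cleanup(chunks: list[str], min_real: int = 200) -> list[str]:
    # Keep only chunks that still carry enough real text.
    return [c for c in chunks if real_text_len(c) >= min_real]
```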

The section detector uses these regex patterns in priority order. The first match wins.

| # | Pattern | Type | Example Match |
| --- | --- | --- | --- |
| 1 | `^#{1,3}\s+(.+)$` | md_header | `# Chapter 1` |
| 2 | `^\*\*Chapter\s+\d+[\.:]?\*?\*?\s*(.*?)` | chapter_en | `**Chapter 5:** Title` |
| 3 | `^\*?\*?Cap[ií]tulo\s+\*?\*?\w+\*?\*?\.?\s*(.*?)` | capitulo_pt | `Capitulo V - Dos Contratos` |
| 4 | `^CHAPTER\s+\d+\.?\s*(.*)$` | chapter_caps | `CHAPTER 5 BILATERAL CONTRACTS` |
| 5 | `^CAP[ÍI]TULO\s+\w+\.?\s*(.*)$` | capitulo_caps | `CAPITULO V` |
| 6 | `^\*?\*?T[ÍI]TULO\s+\w+\*?\*?\.?\s*(.*)$` | titulo | `**TITULO VI** Dos Contratos` |
| 7 | `^\*?\*?PARTE\s+\w+\*?\*?\.?\s*(.*)$` | parte | `PARTE ESPECIAL` |
| 8 | `^\*?\*?Part\s+\w+\*?\*?\.?\s*(.*)$` | part_en | `**Part One** General Theory` |
| 9 | `^(?:#{1,3}\s+)?\*?\*?Art\.?\s+\d+[\.\)]?\*?\*?\s*(.*)$` | artigo | `Art. 481.` or `### Art. 481` |
| 10 | `^(?:#{1,3}\s+)?_?\*?\*?Se[çc][ãa]o\s+\w+_?\*?\*?\.?\s*(.*)$` | secao | `Secao I - Disposicoes Gerais` |
| 11 | `^\*?\*?Section\s+\w+\*?\*?\.?\s*(.*)$` | section_en | `**Section 3** Formation` |
| 12 | `^(\d{1,3})\.\s+([A-Z][A-Z\s,]{8,80})$` | numbered_caps | `1. CONTRATOS BILATERAIS` |
| 13 | `^\*\*(\d{1,3})\.?\*?\*?\s+(.{5,80})$` | numbered_bold | `**1.** Conceito e Natureza` |
| 14 | `^([A-Z\s,]{15,80})$` | allcaps_title | `CONTRATOS BILATERAIS E UNILATERAIS` |

Content with these title keywords is filtered as non-substantive:

```python
NOISE_TITLES = {
    # Portuguese
    'prefácio', 'prefacio', 'agradecimentos', 'agradecimento',
    'dedicatória', 'dedicatoria', 'palavras do coordenador',
    'nota do editor', 'notas do editor', 'nota à edição',
    'sobre o autor', 'sobre os autores', 'dados catalográficos',
    'ficha catalográfica', 'expediente',
    'editora forense', 'editora saraiva', 'editora atlas',
    'editora renovar', 'editora revista dos tribunais',
    'no_content_here',
    # English
    'preface', 'foreword', 'acknowledgements', 'acknowledgments',
    'dedication', 'about the author', 'about the authors',
    "editor's note", "publisher's note",
}
```

Sections matching these are extracted as separate bibliography chunks:

```python
BIBLIOGRAPHY_TITLES = {
    'bibliografia', 'referências bibliográficas', 'referências',
    'bibliography', 'references', 'works cited', 'further reading',
    'leituras complementares', 'obras consultadas',
}
```

Tables of contents are tagged as metadata (not discarded, but marked):

```python
SUMMARY_TITLES = {
    'sumário', 'sumario', 'índice', 'indice',
    'table of contents', 'contents', 'summary',
    'índice remissivo', 'indice remissivo',
    'table of cases', 'table of legislation', 'table of statutes',
}
```

PDF extraction often produces repeated lines (book title, author name, chapter heading) at the top of every page. The rechunker detects these by frequency analysis across the document and filters them out before chunking.
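The frequency filter can be sketched with a simple counter; the repeat threshold and length cutoff here are assumptions, and drop_running_headers is an illustrative name:

```python
from collections import Counter

def drop_running_headers(lines: list[str], min_repeats: int = 5) -> list[str]:
    # Count short non-empty lines; a line repeated across many pages is
    # likely a running header (book title, author, chapter heading).
    counts = Counter(l.strip() for l in lines if 0 < len(l.strip()) <= 80)
    headers = {line for line, n in counts.items() if n >= min_repeats}
    return [l for l in lines if l.strip() not in headers]
```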

Legal textbooks make heavy use of footnotes. The rechunker groups footnotes with their referencing paragraph using detection helpers such as is_footnote_line():

```python
import re

def is_footnote_line(line: str) -> bool:
    """Detect footnote lines at bottom of text."""
    stripped = line.strip()
    # "1 Author, Book, p. 123" or "¹ Author..." or "[1] Author..."
    if re.match(r'^\d{1,3}\s+[A-ZÁÉÍÓÚÀÂÊÔÃÕÇ]', stripped):
        return True
    if re.match(r'^[¹²³⁴⁵⁶⁷⁸⁹⁰]+\s+', stripped):
        return True
    if re.match(r'^\[\d{1,3}\]\s+', stripped):
        return True
    return False
```

When a chunk contains a law article transcription (e.g., Art. 476 do Código Civil...), the rechunker ensures the author's commentary that follows is never separated from the article itself. This is critical because the commentary is only meaningful in the context of the article being discussed.

Contract characteristic lists (bilateral, oneroso, consensual, comutativo…) are detected when 3+ characteristic terms appear together and are kept as a single indivisible block, because splitting them would destroy the comparative analysis.

```shell
# Rechunk all books in staging
python3 pipeline/rechunk_v3.py

# Rechunk a specific book
python3 pipeline/rechunk_v3.py contratos-orlando-gomes

# Set minimum chunk size (characters)
python3 pipeline/rechunk_v3.py --min-chars 1500

# Preview changes without writing
python3 pipeline/rechunk_v3.py --dry-run

# Force rechunk even if already processed
python3 pipeline/rechunk_v3.py --force
```
| Constant | Value | Description |
| --- | --- | --- |
| MIN_CHUNK_CHARS | 1,500 | Minimum characters for a valid chunk |
| MAX_CHUNK_CHARS | 15,000 | Maximum characters before forced split |
  • 0% test coverage for 890 lines of regex-heavy logic. A single false positive in section detection can cascade through the entire document. Tracked as F26.
  • Custom YAML parser uses regex instead of PyYAML. Special characters in titles (colons, quotes) can corrupt frontmatter. The same regex parser is duplicated across enrich_chunks.py and embed_doutrina.py. Tracked as F23.
  • Assumes hierarchical structure — books without clear H1/H2/section patterns (e.g., dictionaries, legal compilations, multi-author essays) produce poor chunking results.
  • Running header detection is heuristic — repeated lines that are NOT headers (e.g., a legal maxim appearing multiple times) can be falsely filtered.
  • allcaps_title pattern is broad — any line of 15-80 uppercase characters is treated as a section boundary, which can produce false positives on bibliographic entries, publisher names, or emphasized text.
  • No language-specific tokenization — the sentence splitter for oversized chunks does not use a Portuguese-specific tokenizer, which may split at abbreviations (e.g., “Art.”, “Dr.”).
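One mitigation for the abbreviation problem is to protect known abbreviations before splitting. The sketch below is illustrative, not part of rechunk_v3.py; the abbreviation list is an assumption, and the <DOT> placeholder trick assumes the text does not already contain that marker:

```python
import re

# Common Portuguese/legal abbreviations that end with a period but do
# not end a sentence (assumed list for the sketch).
ABBREV = re.compile(r"\b(Arts|Art|Sra|Sr|Dr|Prof)\.")

def split_sentences_pt(text: str) -> list[str]:
    protected = ABBREV.sub(r"\1<DOT>", text)           # shield abbreviations
    parts = re.split(r"(?<=[.!?])\s+", protected)      # split at sentence ends
    return [p.replace("<DOT>", ".") for p in parts]    # restore the periods
```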