Skip to content

Feature: Chunk Enrichment

pipeline/enrich_chunks.py — Classifies each chunk using the MiniMax M2.5 LLM to add structured legal metadata. This is the step that transforms raw text into searchable, filterable knowledge by tagging each chunk with its legal concept (instituto), content type, branch of law, procedural phase, and normative sources.

PropertyValue
Scriptpipeline/enrich_chunks.py (403 lines)
InputChunked markdown files with status_enriquecimento: "pendente"
OutputSame files with enriched frontmatter (13+ metadata fields)
LLMMiniMax M2.5 via Anthropic SDK with custom base_url
Concurrency5 threads, 0.5s delay between requests
IdempotentYes — skips chunks with status_enriquecimento: "completo" or "lixo"
flowchart TD
LOAD["Load enrich_prompt.md template"]
SCAN["Scan for unenriched chunks"]
NOISE["Check: is_noise_chunk?"]
SKIP["Mark as lixo, skip"]
SEND["Send to MiniMax M2.5"]
PARSE["extract_json() from response"]
MERGE["merge_classification() into frontmatter"]
WRITE["Write enriched file"]
LOAD --> SCAN
SCAN --> NOISE
NOISE -->|"Yes (preface, dedication, <200 chars)"| SKIP
NOISE -->|"No"| SEND
SEND --> PARSE
PARSE -->|"Valid JSON"| MERGE
PARSE -->|"Invalid/empty"| ERR["Log error, counter++"]
MERGE --> WRITE

The enrichment prompt is loaded from pipeline/enrich_prompt.md. Placeholder variables are substituted per chunk:

  • {livro_titulo} — book title
  • {autor} — author name
  • {capitulo} — chapter title
  • {chunk_numero} / {chunk_total} — chunk position
  • {chunk_content} — first 8,000 characters of the chunk body

Before sending to the LLM, each chunk is checked against noise patterns. Chunks matching any of these are marked as "lixo" (trash) without API calls:

  • Title contains: “prefacio”, “agradecimento”, “dedicatoria”, “nota do editor”, “sobre o autor”, etc.
  • Body text is shorter than 200 characters

The chunk text + prompt is sent to MiniMax M2.5 via the Anthropic SDK with a custom base URL:

client = anthropic.Anthropic(
api_key=api_key,
base_url="https://api.minimax.io/anthropic"
)
message = client.messages.create(
model="MiniMax-M2.5",
max_tokens=2000,
system="Voce e um classificador juridico especializado. "
"Responda APENAS com JSON valido, sem markdown, sem backticks, sem explicacoes.",
messages=[{"role": "user", "content": prompt}]
)

The LLM response is parsed using extract_json(), which has a brace-matching fallback for cases where the LLM wraps the JSON in markdown code fences or adds surrounding text:

def extract_json(text: str) -> dict:
text = text.strip()
text = re.sub(r'^```json\s*', '', text)
text = re.sub(r'\s*```$', '', text)
text = text.strip()
# Try direct parse first
try:
return json.loads(text)
except json.JSONDecodeError:
pass
# Extract first complete JSON object by brace matching
start = text.find('{')
if start == -1:
raise json.JSONDecodeError("No JSON object found", text, 0)
depth = 0
in_string = False
escape = False
for i in range(start, len(text)):
c = text[i]
if escape:
escape = False
continue
if c == '\\' and in_string:
escape = True
continue
if c == '"' and not escape:
in_string = not in_string
continue
if in_string:
continue
if c == '{':
depth += 1
elif c == '}':
depth -= 1
if depth == 0:
return json.loads(text[start:i+1])
raise json.JSONDecodeError("Incomplete JSON object", text, len(text))

The parsed JSON fields are merged into the existing frontmatter via merge_classification(), and status metadata is appended:

enriched["status_enriquecimento"] = "completo"
enriched["data_enriquecimento"] = datetime.now().isoformat()
enriched["modelo_enriquecimento"] = "MiniMax-M2.5"

The merge_classification() function maps 13 fields from the LLM response into the chunk frontmatter. These fields are the foundation for all downstream search and filtering.

#FieldTypeDescriptionExample Values
1categoriastrHigh-level category"doutrina", "legislacao_comentada"
2tipo_contratualstrContract type (if applicable)"compra_e_venda", "locacao"
3objeto_especificostrSpecific subject matter"clausula_penal", "exceptio"
4institutolist[str]Legal concepts/institutes["exceptio_non_adimpleti_contractus", "contrato_bilateral"]
5sub_institutolist[str]Sub-concepts["inadimplemento_relativo"]
6faselist[str]Contract/procedural phase["formacao", "execucao", "extincao"]
7ramostrBranch of law"direito_civil", "processo_civil"
8fontes_normativaslist[str]Statutory references["CC art. 476", "CC art. 477"]
9tipo_conteudolist[str]Content type classification["definicao", "requisitos", "jurisprudencia_comentada"]
10utilidadestrPractical utility rating"alta", "media", "baixa"
11confiabilidadestrSource reliability"alta", "media"
12jurisdicao_estrangeirabool or strForeign jurisdiction referencefalse, "common_law"
13justificativastrLLM reasoning for classificationFree text explaining the tagging logic

In addition to these 13 LLM-provided fields, merge_classification() adds 3 system fields:

FieldTypeValue
status_enriquecimentostr"completo"
data_enriquecimentostrISO 8601 timestamp
modelo_enriquecimentostr"MiniMax-M2.5"
VariableRequiredDescription
MINIMAX_API_KEYYes (unless --dry-run)MiniMax API authentication
VAULT_PATHNoBase directory (default: /mnt/c/Users/sensd/vault)
Terminal window
# Enrich all unenriched chunks across all books
python3 pipeline/enrich_chunks.py all
# Enrich a specific book
python3 pipeline/enrich_chunks.py contratos-orlando-gomes
# Re-enrich already completed chunks
python3 pipeline/enrich_chunks.py all --force
# Limit to first N chunks (for testing)
python3 pipeline/enrich_chunks.py all --limit 10
# Preview without API calls
python3 pipeline/enrich_chunks.py all --dry-run
# Use a specific API key
python3 pipeline/enrich_chunks.py all --api-key "sk-..."
# Adjust thread count (default: 5)
python3 pipeline/enrich_chunks.py all --workers 3
SettingValueDescription
WORKERS5Number of concurrent threads
DELAY_BETWEEN_REQUESTS0.5sDelay after each API call (per thread)

The estimated processing time is displayed at startup:

Estimativa: ~{(total * 0.5) / workers / 60:.0f} min

For 1,000 chunks with 5 workers: approximately 1.7 minutes.

Results are appended to Logs/enrichment_log.jsonl:

{
"timestamp": "2026-02-28T14:30:00",
"file": "/path/to/chunk.md",
"success": true,
"model": "MiniMax-M2.5",
"tags_count": 3,
"tipo_conteudo": ["definicao", "requisitos"]
}
  • No schema validation on LLM output. If the LLM returns unexpected field values (e.g., ramo: "unknown_branch"), the data is accepted without validation. Invalid metadata propagates to embeddings and search. Tracked as mitigation M10 in the ROADMAP.
  • Broad except Exception catches in classify_chunk() and process_one() swallow all errors, including API failures, rate limits, and network issues. The error counter increments but the specific cause is not logged in classify_chunk().
  • No accuracy measurement. The quality of LLM classification has never been validated against human judgment. Mitigation M06 proposes sampling 200 chunks for manual review. If accuracy is below 85%, all enrichment should be redone.
  • MiniMax via Anthropic SDK uses an undocumented compatibility layer. If MiniMax changes their API, the integration could break silently. Decision D06 in the ROADMAP considers migrating to Claude or local models.
  • Thread safety relies on threading.Lock() for counters and log writes. The anthropic.Anthropic client is shared across threads without explicit documentation that it is thread-safe.
  • Chunk content is truncated to 8,000 characters before sending to the LLM. Longer chunks may have important content in their tail that the classifier never sees.