Feature: Chunk Enrichment
Chunk Enrichment (F03)
`pipeline/enrich_chunks.py` classifies each chunk using the MiniMax M2.5 LLM to add structured legal metadata. This step turns raw text into searchable, filterable knowledge by tagging each chunk with its legal concept (instituto), content type, branch of law, procedural phase, and normative sources.
Overview
| Property | Value |
|---|---|
| Script | pipeline/enrich_chunks.py (403 lines) |
| Input | Chunked markdown files with status_enriquecimento: "pendente" |
| Output | Same files with enriched frontmatter (13+ metadata fields) |
| LLM | MiniMax M2.5 via Anthropic SDK with custom base_url |
| Concurrency | 5 threads, 0.5s delay between requests |
| Idempotent | Yes — skips chunks with status_enriquecimento: "completo" or "lixo" |
How It Works
```mermaid
flowchart TD
    LOAD["Load enrich_prompt.md template"]
    SCAN["Scan for unenriched chunks"]
    NOISE["Check: is_noise_chunk?"]
    SKIP["Mark as lixo, skip"]
    SEND["Send to MiniMax M2.5"]
    PARSE["extract_json() from response"]
    MERGE["merge_classification() into frontmatter"]
    WRITE["Write enriched file"]

    LOAD --> SCAN
    SCAN --> NOISE
    NOISE -->|"Yes (preface, dedication, <200 chars)"| SKIP
    NOISE -->|"No"| SEND
    SEND --> PARSE
    PARSE -->|"Valid JSON"| MERGE
    PARSE -->|"Invalid/empty"| ERR["Log error, counter++"]
    MERGE --> WRITE
```
Step 1: Load prompt template
The enrichment prompt is loaded from `pipeline/enrich_prompt.md`. Placeholder variables are substituted per chunk:
- `{livro_titulo}`: book title
- `{autor}`: author name
- `{capitulo}`: chapter title
- `{chunk_numero}` / `{chunk_total}`: chunk position
- `{chunk_content}`: first 8,000 characters of the chunk body
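The substitution above can be sketched as follows. This is a minimal illustration, not the script's actual code: `render_prompt`, the `meta` dictionary, and its key names are assumptions made for the example. A single regex pass is used so that other braces in the template (for instance a JSON example) are left untouched.

```python
import re

MAX_CONTENT_CHARS = 8_000  # truncation limit stated above


def render_prompt(template: str, meta: dict, body: str) -> str:
    """Fill the documented placeholders (hypothetical helper, not the real code)."""
    values = {
        "livro_titulo": meta.get("livro_titulo", ""),
        "autor": meta.get("autor", ""),
        "capitulo": meta.get("capitulo", ""),
        "chunk_numero": meta.get("chunk_numero", ""),
        "chunk_total": meta.get("chunk_total", ""),
        # Only the first 8,000 characters of the body reach the LLM.
        "chunk_content": body[:MAX_CONTENT_CHARS],
    }

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Unknown placeholders are left exactly as written in the template.
        return str(values[name]) if name in values else match.group(0)

    # One pass over the template: substituted values are never re-scanned.
    return re.sub(r"\{(\w+)\}", substitute, template)
```

A single pass also means a chunk body that happens to contain text such as `{autor}` is inserted verbatim rather than being expanded a second time.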
Step 2: Noise detection
Before a chunk is sent to the LLM, it is checked against noise patterns. Chunks matching any of these are marked as "lixo" (trash) and skipped without an API call:
- Title contains: “prefacio”, “agradecimento”, “dedicatoria”, “nota do editor”, “sobre o autor”, etc.
- Body text is shorter than 200 characters
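A noise check along these lines would implement the two rules above. It is a sketch under assumptions: the function name and signature are hypothetical, the marker tuple holds only the titles listed here (the real list is longer), and accent stripping and whitespace trimming are guesses about how matching is done.

```python
import unicodedata

# Only the markers named in this document; the script's own list is longer.
NOISE_TITLE_MARKERS = (
    "prefacio",
    "agradecimento",
    "dedicatoria",
    "nota do editor",
    "sobre o autor",
)
MIN_BODY_CHARS = 200


def _normalize(text: str) -> str:
    """Lowercase and strip accents so that 'Prefácio' matches 'prefacio'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def is_noise_chunk_sketch(title: str, body: str) -> bool:
    """Return True when the chunk should be marked 'lixo' without an API call."""
    title_norm = _normalize(title)
    if any(marker in title_norm for marker in NOISE_TITLE_MARKERS):
        return True
    # Assumption: surrounding whitespace does not count toward the length.
    return len(body.strip()) < MIN_BODY_CHARS
```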
Step 3: LLM classification
The chunk text and prompt are sent to MiniMax M2.5 via the Anthropic SDK with a custom base URL:
```python
import anthropic

# api_key and prompt are prepared earlier in the script.
client = anthropic.Anthropic(
    api_key=api_key,
    base_url="https://api.minimax.io/anthropic",
)

# System prompt (Portuguese): "You are a specialized legal classifier. Respond
# ONLY with valid JSON, no markdown, no backticks, no explanations."
message = client.messages.create(
    model="MiniMax-M2.5",
    max_tokens=2000,
    system="Voce e um classificador juridico especializado. "
           "Responda APENAS com JSON valido, sem markdown, sem backticks, sem explicacoes.",
    messages=[{"role": "user", "content": prompt}],
)
```
Step 4: JSON extraction
The LLM response is parsed using `extract_json()`, which has a brace-matching fallback for cases where the LLM wraps the JSON in markdown code fences or adds surrounding text:
```python
import json
import re


def extract_json(text: str) -> dict:
    # Strip a leading ```json fence and a trailing ``` fence, if present
    text = text.strip()
    text = re.sub(r'^```json\s*', '', text)
    text = re.sub(r'\s*```$', '', text)
    text = text.strip()

    # Try direct parse first
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Extract first complete JSON object by brace matching
    start = text.find('{')
    if start == -1:
        raise json.JSONDecodeError("No JSON object found", text, 0)

    depth = 0
    in_string = False
    escape = False
    for i in range(start, len(text)):
        c = text[i]
        if escape:
            escape = False
            continue
        if c == '\\' and in_string:
            escape = True
            continue
        if c == '"' and not escape:
            in_string = not in_string
            continue
        if in_string:
            # Braces inside string literals do not affect nesting depth
            continue
        if c == '{':
            depth += 1
        elif c == '}':
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i+1])

    raise json.JSONDecodeError("Incomplete JSON object", text, len(text))
```
Step 5: Merge classification
The parsed JSON fields are merged into the existing frontmatter via `merge_classification()`, and status metadata is appended:
```python
enriched["status_enriquecimento"] = "completo"
enriched["data_enriquecimento"] = datetime.now().isoformat()
enriched["modelo_enriquecimento"] = "MiniMax-M2.5"
```
Metadata Schema
The `merge_classification()` function maps 13 fields from the LLM response into the chunk frontmatter. These fields are the foundation for all downstream search and filtering.
| # | Field | Type | Description | Example Values |
|---|---|---|---|---|
| 1 | categoria | str | High-level category | "doutrina", "legislacao_comentada" |
| 2 | tipo_contratual | str | Contract type (if applicable) | "compra_e_venda", "locacao" |
| 3 | objeto_especifico | str | Specific subject matter | "clausula_penal", "exceptio" |
| 4 | instituto | list[str] | Legal concepts/institutes | ["exceptio_non_adimpleti_contractus", "contrato_bilateral"] |
| 5 | sub_instituto | list[str] | Sub-concepts | ["inadimplemento_relativo"] |
| 6 | fase | list[str] | Contract/procedural phase | ["formacao", "execucao", "extincao"] |
| 7 | ramo | str | Branch of law | "direito_civil", "processo_civil" |
| 8 | fontes_normativas | list[str] | Statutory references | ["CC art. 476", "CC art. 477"] |
| 9 | tipo_conteudo | list[str] | Content type classification | ["definicao", "requisitos", "jurisprudencia_comentada"] |
| 10 | utilidade | str | Practical utility rating | "alta", "media", "baixa" |
| 11 | confiabilidade | str | Source reliability | "alta", "media" |
| 12 | jurisdicao_estrangeira | bool or str | Foreign jurisdiction reference | false, "common_law" |
| 13 | justificativa | str | LLM reasoning for classification | Free text explaining the tagging logic |
In addition to these 13 LLM-provided fields, merge_classification() adds 3 system fields:
| Field | Type | Value |
|---|---|---|
| status_enriquecimento | str | "completo" |
| data_enriquecimento | str | ISO 8601 timestamp |
| modelo_enriquecimento | str | "MiniMax-M2.5" |
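Taken together, the two tables suggest a merge of roughly the following shape. This is only a sketch: the signature, the name `merge_classification_sketch`, and the choices to skip fields the LLM omitted and to drop keys outside the schema are assumptions, not the behavior of the real `merge_classification()`.

```python
from datetime import datetime

# The 13 LLM-provided fields from the schema table above.
LLM_FIELDS = (
    "categoria", "tipo_contratual", "objeto_especifico", "instituto",
    "sub_instituto", "fase", "ramo", "fontes_normativas", "tipo_conteudo",
    "utilidade", "confiabilidade", "jurisdicao_estrangeira", "justificativa",
)
MODEL_NAME = "MiniMax-M2.5"


def merge_classification_sketch(frontmatter: dict, classification: dict) -> dict:
    """Return a copy of the frontmatter with LLM and system fields applied."""
    enriched = dict(frontmatter)  # leave the caller's dict untouched
    for field in LLM_FIELDS:
        # Assumption: fields missing from the LLM response are simply skipped.
        if field in classification:
            enriched[field] = classification[field]
    # The three system fields listed in the second table.
    enriched["status_enriquecimento"] = "completo"
    enriched["data_enriquecimento"] = datetime.now().isoformat()
    enriched["modelo_enriquecimento"] = MODEL_NAME
    return enriched
```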
Configuration
Environment Variables
| Variable | Required | Description |
|---|---|---|
| MINIMAX_API_KEY | Yes (unless --dry-run) | MiniMax API authentication |
| VAULT_PATH | No | Base directory (default: /mnt/c/Users/sensd/vault) |
CLI Arguments
```bash
# Enrich all unenriched chunks across all books
python3 pipeline/enrich_chunks.py all

# Enrich a specific book
python3 pipeline/enrich_chunks.py contratos-orlando-gomes

# Re-enrich already completed chunks
python3 pipeline/enrich_chunks.py all --force

# Limit to first N chunks (for testing)
python3 pipeline/enrich_chunks.py all --limit 10

# Preview without API calls
python3 pipeline/enrich_chunks.py all --dry-run

# Use a specific API key
python3 pipeline/enrich_chunks.py all --api-key "sk-..."

# Adjust thread count (default: 5)
python3 pipeline/enrich_chunks.py all --workers 3
```
Concurrency Settings
| Setting | Value | Description |
|---|---|---|
| WORKERS | 5 | Number of concurrent threads |
| DELAY_BETWEEN_REQUESTS | 0.5s | Delay after each API call (per thread) |
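The two settings above describe a fixed-size worker pool with a per-thread pause. One way to arrange that is sketched below; the use of `ThreadPoolExecutor`, the `run_pool` name, and the error list are assumptions made for illustration, since this document only states that threads, a lock, and a delay are involved.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 5
DELAY_BETWEEN_REQUESTS = 0.5  # seconds, applied per thread after each call


def run_pool(paths, process_one, workers=WORKERS, delay=DELAY_BETWEEN_REQUESTS):
    """Run process_one(path) over all paths; return (counters, errors)."""
    counters = {"ok": 0, "error": 0}
    errors = []  # (path, cause) pairs, so failures stay diagnosable
    lock = threading.Lock()

    def task(path):
        try:
            process_one(path)
            outcome, cause = "ok", None
        except Exception as exc:  # illustrative; narrower types are preferable
            outcome, cause = "error", repr(exc)
        with lock:  # counters and the error list are shared across threads
            counters[outcome] += 1
            if cause is not None:
                errors.append((path, cause))
        time.sleep(delay)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(task, paths))  # drain the iterator so every task runs
    return counters, errors
```

Recording the cause next to the counter keeps the failure count while still showing why each chunk failed.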
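The estimated processing time is displayed at startup:
```text
Estimativa: ~{(total * 0.5) / workers / 60:.0f} min
```
For 1,000 chunks with 5 workers, this gives approximately 1.7 minutes. The formula counts only the 0.5s delay between requests, not API latency, so it is a lower bound on the actual run time.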
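The arithmetic behind the startup estimate can be checked with a few lines. The function name `estimate_minutes` is hypothetical; the expression itself is the one shown in the startup message.

```python
def estimate_minutes(total: int, workers: int = 5, delay: float = 0.5) -> float:
    """Delay-only estimate: total pause time spread evenly across the workers."""
    return (total * delay) / workers / 60


minutes = estimate_minutes(1000)       # 1000 * 0.5 / 5 / 60
print(f"{minutes:.2f}")                # 1.67
print(f"Estimativa: ~{minutes:.0f} min")  # Estimativa: ~2 min
```

Because the message uses the `.0f` format, a 1.67-minute estimate is shown rounded to `~2 min`.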
Logging
Results are appended to `Logs/enrichment_log.jsonl`:
```json
{
  "timestamp": "2026-02-28T14:30:00",
  "file": "/path/to/chunk.md",
  "success": true,
  "model": "MiniMax-M2.5",
  "tags_count": 3,
  "tipo_conteudo": ["definicao", "requisitos"]
}
```
Known Issues
- No schema validation on LLM output. If the LLM returns unexpected field values (e.g., `ramo: "unknown_branch"`), the data is accepted without validation. Invalid metadata propagates to embeddings and search. Tracked as mitigation M10 in the ROADMAP.
- Broad `except Exception` catches in `classify_chunk()` and `process_one()` swallow all errors, including API failures, rate limits, and network issues. The error counter increments but the specific cause is not logged in `classify_chunk()`.
- No accuracy measurement. The quality of LLM classification has never been validated against human judgment. Mitigation M06 proposes sampling 200 chunks for manual review. If accuracy is below 85%, all enrichment should be redone.
- MiniMax via Anthropic SDK uses an undocumented compatibility layer. If MiniMax changes their API, the integration could break silently. Decision D06 in the ROADMAP considers migrating to Claude or local models.
- Thread safety relies on `threading.Lock()` for counters and log writes. The `anthropic.Anthropic` client is shared across threads without explicit documentation that it is thread-safe.
- Chunk content is truncated to 8,000 characters before sending to the LLM. Longer chunks may have important content in their tail that the classifier never sees.
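For the first issue in the list, a lightweight check before merging could look like the sketch below. It is not the planned M10 mitigation. The function name is hypothetical, the type rules come from the Type column of the metadata schema, and the closed vocabulary for `utilidade` is an assumption drawn from the example values. A vocabulary for `ramo` would have to be supplied by the project.

```python
STR_FIELDS = (
    "categoria", "tipo_contratual", "objeto_especifico", "ramo",
    "utilidade", "confiabilidade", "justificativa",
)
LIST_FIELDS = ("instituto", "sub_instituto", "fase", "fontes_normativas", "tipo_conteudo")

# Assumed closed vocabulary, taken from the example values in the schema table.
ALLOWED_VALUES = {"utilidade": {"alta", "media", "baixa"}}


def validate_classification(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks sane."""
    problems = []
    for field in STR_FIELDS:
        if field in data and not isinstance(data[field], str):
            problems.append(f"{field}: expected str")
    for field in LIST_FIELDS:
        value = data.get(field)
        is_str_list = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if field in data and not is_str_list:
            problems.append(f"{field}: expected list[str]")
    foreign = data.get("jurisdicao_estrangeira")
    if "jurisdicao_estrangeira" in data and not isinstance(foreign, (bool, str)):
        problems.append("jurisdicao_estrangeira: expected bool or str")
    for field, allowed in ALLOWED_VALUES.items():
        value = data.get(field)
        if isinstance(value, str) and value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems
```

Returning a list of problems rather than raising lets the caller decide whether to reject the chunk, log it, or retry the classification.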