
Settings & Configuration

This page lists Douto’s configurable parameters. Most are currently hardcoded in the source files and are slated to be externalized in future milestones.

| Setting | Value | Location | Configurable |
|---|---|---|---|
| `DEFAULT_TIER` | `"cost_effective"` | Line 37 | Yes, via `--tier` CLI arg |
| Chapter split pattern | H1/H2 markdown headers | Hardcoded in `split_into_chapters()` | No |
| Input directory | `$VAULT_PATH/Knowledge/_staging/input/` | Hardcoded | No |
| Output directory | `$VAULT_PATH/Knowledge/_staging/processed/` | Hardcoded | No |
| Failed directory | `$VAULT_PATH/Knowledge/_staging/failed/` | Hardcoded | No |
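
The three staging directories share one root under `$VAULT_PATH`. A minimal sketch of how they could be derived in one place is shown here; the function name `staging_dirs` and the `"."` fallback are assumptions for illustration, not the pipeline's actual code.

```python
import os
from pathlib import Path


def staging_dirs(vault_path=None):
    """Return the input/processed/failed staging directories under the vault."""
    # Hypothetical fallback: the real scripts may use a different default.
    root = Path(vault_path or os.environ.get("VAULT_PATH", "."))
    staging = root / "Knowledge" / "_staging"
    return {
        "input": staging / "input",
        "processed": staging / "processed",
        "failed": staging / "failed",
    }


if __name__ == "__main__":
    for name, path in staging_dirs().items():
        print(f"{name}: {path}")
```

Deriving all three paths from a single root keeps them consistent if the vault location changes.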

LlamaParse Tiers:

| Tier | Quality | Speed | Cost | When to use |
|---|---|---|---|---|
| `agentic` | Best | Slowest | Highest | Scanned PDFs, complex layouts, tables |
| `cost_effective` | Good (default) | Medium | Medium | Clean text PDFs, most legal textbooks |
| `fast` | Basic | Fastest | Lowest | Simple text-only documents |
python3 pipeline/process_books.py --tier agentic livro.pdf
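
For orientation, a `--tier` flag restricted to the three tier names could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the positional argument name `pdf` and the help strings are hypothetical, and the actual parser in `process_books.py` may differ.

```python
import argparse

TIERS = ("agentic", "cost_effective", "fast")
DEFAULT_TIER = "cost_effective"


def build_parser():
    """Build a parser that accepts one PDF path and an optional --tier."""
    parser = argparse.ArgumentParser(description="Process a book PDF")
    # Hypothetical positional name; the real script may call it differently.
    parser.add_argument("pdf", help="path to the PDF to process")
    parser.add_argument(
        "--tier",
        choices=TIERS,
        default=DEFAULT_TIER,
        help="LlamaParse tier to use",
    )
    return parser


args = build_parser().parse_args(["--tier", "agentic", "livro.pdf"])
print(args.tier, args.pdf)
```

Using `choices` makes an unknown tier fail at argument parsing instead of partway through a run.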

| Setting | Value | Location | Configurable |
|---|---|---|---|
| `MIN_CHUNK_CHARS` | 1500 | Line 32 | Yes, via `--min-chars` CLI arg |
| `MAX_CHUNK_CHARS` | 15000 | Line 33 | No (hardcoded) |
| `SECTION_PATTERNS` | 16 regex patterns | Lines 41-72 | No (hardcoded) |
| Running header threshold | Frequency-based detection | Hardcoded heuristic | No |
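
One plausible reading of the two size limits is that chunks shorter than `MIN_CHUNK_CHARS` are merged with a neighbor as long as the result stays within `MAX_CHUNK_CHARS`. The sketch below illustrates that idea only; the function name `merge_small_chunks` and the merge-forward strategy are assumptions, not the rechunker's actual algorithm.

```python
MIN_CHUNK_CHARS = 1500
MAX_CHUNK_CHARS = 15000


def merge_small_chunks(chunks, min_chars=MIN_CHUNK_CHARS, max_chars=MAX_CHUNK_CHARS):
    """Append each chunk to the previous one while the previous one is too short."""
    merged = []
    for chunk in chunks:
        previous_is_short = bool(merged) and len(merged[-1]) < min_chars
        # +2 accounts for the blank-line separator added when joining.
        if previous_is_short and len(merged[-1]) + len(chunk) + 2 <= max_chars:
            merged[-1] = merged[-1] + "\n\n" + chunk
        else:
            merged.append(chunk)
    return merged


sizes = [len(c) for c in merge_small_chunks(["a" * 100, "b" * 2000, "c" * 100])]
print(sizes)
```

Note that in this sketch a short chunk at the very end is left as is, because there is no later chunk to absorb it.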

Section Detection Patterns (16 total):

The rechunker recognizes the following structural patterns in legal markdown:

| Pattern Type | Example | Regex ID |
|---|---|---|
| Markdown headers | `## Section Title` | `md_header` |
| English chapters | `**Chapter 5:** Title` | `chapter_en` |
| Portuguese chapters | `**Capitulo X** Title` | `capitulo_pt` |
| All-caps CHAPTER | `CHAPTER 5 Title` | `chapter_caps` |
| All-caps CAPITULO | `CAPITULO X` | `capitulo_caps` |
| Titulo (title/book) | `TITULO VI` | `titulo` |
| Parte (part) | `PARTE GERAL` | `parte` |
| English Part | `Part One` | `part_en` |
| Legal article | `Art. 481.` or `### Art. 481` | `artigo` |
| Portuguese section | `Secao I` | `secao` |
| English section | `Section X` | `section_en` |
| Numbered caps | `1. TITULO EM MAIUSCULAS` | `numbered_caps` |
| Numbered bold | `**1.** Title` | `numbered_bold` |
| All-caps title line | `RESPONSABILIDADE CIVIL OBJETIVA` | `allcaps_title` |
| Bold caps title | `**SOME TITLE HERE**` | `bold_caps_title` |
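
To show how an ordered list of named patterns can classify a line, here is a small sketch covering four of the IDs above. The regular expressions are illustrative approximations written for this example; the real `SECTION_PATTERNS` are not reproduced on this page and may differ.

```python
import re

# Illustrative approximations only, NOT the actual SECTION_PATTERNS.
# Order matters: "artigo" is checked first so "### Art. 481" is not
# classified as a generic markdown header.
PATTERNS = [
    ("artigo", re.compile(r"^(?:#{1,6}\s+)?Art\.\s*\d+")),
    ("md_header", re.compile(r"^#{1,6}\s+\S")),
    ("capitulo_caps", re.compile(r"^CAP[IÍ]TULO\s+[IVXLCDM\d]+\b")),
    ("allcaps_title", re.compile(r"^[A-ZÀ-Ý][A-ZÀ-Ý\s]{9,}$")),
]


def detect_section(line):
    """Return the ID of the first pattern matching the line, or None."""
    text = line.strip()
    for name, pattern in PATTERNS:
        if pattern.search(text):
            return name
    return None


for example in ["## Section Title", "### Art. 481", "CAPITULO X", "plain sentence."]:
    print(repr(example), "->", detect_section(example))
```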

Chunking Rules (hardcoded, not configurable):

  • Footnotes are grouped with their parent paragraph
  • Law articles + commentary are never separated
  • Practical examples stay with the principle they illustrate
  • Running headers (repeated title/author) are filtered by frequency
  • Bibliographies are extracted as separate chunks with type "bibliografia"
  • Prefaces, acknowledgments, cataloging cards are filtered as noise
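
As an illustration of the frequency-based rule for running headers, the sketch below drops short lines that repeat many times in a document. The threshold `min_repeats=5`, the length cap `max_len=80`, and the function name are hypothetical values chosen for the example, not the rechunker's actual heuristic.

```python
from collections import Counter


def drop_running_headers(lines, min_repeats=5, max_len=80):
    """Remove short lines that occur at least `min_repeats` times."""
    counts = Counter(line.strip() for line in lines if line.strip())
    repeated = {
        text
        for text, n in counts.items()
        if n >= min_repeats and len(text) <= max_len
    }
    return [line for line in lines if line.strip() not in repeated]


page_lines = []
for i in range(5):
    page_lines += ["Contratos", f"paragraph {i}"]
print(drop_running_headers(page_lines))
```

The length cap keeps a long paragraph from being removed just because it happens to be quoted several times.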

| Setting | Value | Location | Configurable |
|---|---|---|---|
| `MINIMAX_BASE_URL` | `"https://api.minimax.io/anthropic"` | Line 30 | No (hardcoded) |
| `MINIMAX_MODEL` | `"MiniMax-M2.5"` | Line 31 | No (hardcoded) |
| `WORKERS` | 5 | Line 34 | No (hardcoded) |
| `DELAY_BETWEEN_REQUESTS` | 0.5 seconds | Line 35 | No (hardcoded) |
| `PROMPT_PATH` | `pipeline/enrich_prompt.md` | Line 27 | No (hardcoded) |
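
`WORKERS` and `DELAY_BETWEEN_REQUESTS` together bound how fast requests are sent. A minimal sketch of one way to combine them with a thread pool follows; `enrich_all` and the `enrich_one` callback are hypothetical names, and no real API call is made here.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 5
DELAY_BETWEEN_REQUESTS = 0.5  # seconds


def enrich_all(chunks, enrich_one, workers=WORKERS, delay=DELAY_BETWEEN_REQUESTS):
    """Run `enrich_one` over all chunks with a bounded pool and a per-task pause."""

    def task(chunk):
        result = enrich_one(chunk)
        time.sleep(delay)  # pause before this worker picks up its next chunk
        return result

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        return list(pool.map(task, chunks))


# Stand-in for the LLM call: upper-case the text.
print(enrich_all(["chunk a", "chunk b"], str.upper, delay=0))
```

With these defaults, the pool starts at most about `WORKERS / DELAY_BETWEEN_REQUESTS` = 10 requests per second, and fewer when each request itself takes time.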

Enrichment Metadata Fields:

The LLM classifies each chunk into these structured fields:

| Field | Type | Description | Example values |
|---|---|---|---|
| `categoria` | string | High-level category | `"doutrina"`, `"legislacao_comentada"` |
| `instituto` | list[string] | Legal institutes | `["boa-fe objetiva", "exceptio non adimpleti"]` |
| `tipo_conteudo` | string | Content type | `"definicao"`, `"requisitos"`, `"exemplo"`, `"critica"` |
| `fase` | string | Procedural phase | `"formacao"`, `"execucao"`, `"extincao"` |
| `ramo` | string | Branch of law | `"civil"`, `"processual_civil"`, `"empresarial"` |
| `fontes_normativas` | list[string] | Statutory references | `["CC art. 476", "CPC art. 300"]` |
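
Because the LLM output is free-form, it can help to check each response against the field types listed above before writing it to a chunk. The following is a sketch of such a check; `validate_metadata` is a hypothetical helper, not part of the pipeline.

```python
EXPECTED_TYPES = {
    "categoria": str,
    "instituto": list,
    "tipo_conteudo": str,
    "fase": str,
    "ramo": str,
    "fontes_normativas": list,
}


def validate_metadata(meta):
    """Return a list of problems; an empty list means the metadata looks valid."""
    errors = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in meta:
            errors.append(f"missing field: {field}")
        elif not isinstance(meta[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
        elif expected is list and not all(isinstance(x, str) for x in meta[field]):
            errors.append(f"{field}: all items must be strings")
    return errors


example = {
    "categoria": "doutrina",
    "instituto": ["boa-fe objetiva"],
    "tipo_conteudo": "definicao",
    "fase": "formacao",
    "ramo": "civil",
    "fontes_normativas": ["CC art. 476"],
}
print(validate_metadata(example))
```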

embed_doutrina.py — Embedding Generation

| Setting | Value | Location | Configurable |
|---|---|---|---|
| `MODEL_NAME` | `"rufimelo/Legal-BERTimbau-sts-base"` | Line 24 | No (hardcoded) |
| Embedding dimensions | 768 | Model-determined | No |
| `BATCH_SIZE` | 32 | Line 25 | No (hardcoded) |
| `MAX_TOKENS` | 512 | Line 26 | No (hardcoded, BERTimbau limit) |
| Normalization | True | Hardcoded | No |

Text Composition Template:

Embeddings are generated from a composed text, not the raw chunk body:

[categoria | instituto_1, instituto_2 | tipo_conteudo | titulo | corpo_truncado_at_512_tokens]
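
The template can be read as a pipe-joined string built from the chunk's metadata and a truncated body. The sketch below follows that reading with several assumptions: the outer brackets are treated as notation and omitted, the body field is called `corpo`, and truncation counts whitespace-separated words as a stand-in for the model tokenizer's 512-token limit.

```python
def compose_embedding_text(chunk, max_words=512):
    """Join metadata and a truncated body into one string for embedding."""
    # Assumption: whitespace words approximate tokens; the real script
    # would truncate with the model's own tokenizer.
    body = " ".join(chunk["corpo"].split()[:max_words])
    parts = [
        chunk["categoria"],
        ", ".join(chunk["instituto"]),
        chunk["tipo_conteudo"],
        chunk["titulo"],
        body,
    ]
    return " | ".join(parts)


chunk = {
    "categoria": "doutrina",
    "instituto": ["boa-fe objetiva", "exceptio non adimpleti"],
    "tipo_conteudo": "definicao",
    "titulo": "Exceptio non adimpleti contractus",
    "corpo": "Texto do chunk ...",
}
print(compose_embedding_text(chunk))
```

Putting the metadata first means it survives truncation even when the body is long.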

Output Files:

| File | Contents |
|---|---|
| `embeddings_{area}.json` | 768-dim vectors per chunk, normalized |
| `search_corpus_{area}.json` | Chunk text + metadata for display |
| `bm25_index_{area}.json` | Pre-tokenized terms for BM25 ranking |

| Setting | Value | Location | Configurable |
|---|---|---|---|
| `semantic_weight` | 0.7 | Line 163 (function default) | Yes, via `/weight` in interactive mode |
| BM25 `k1` | 1.5 | Line 126 | No (hardcoded) |
| BM25 `b` | 0.75 | Line 126 | No (hardcoded) |
| Default `top_k` | 5 | Line 263 | Yes, via `--top` CLI arg or `/top` command |

Search Modes (interactive):

| Command | Mode | Description |
|---|---|---|
| `/hybrid` | Hybrid (default) | `semantic_weight * cosine + (1 - semantic_weight) * BM25` |
| `/sem` | Semantic only | Pure cosine similarity on embeddings |
| `/bm25` | BM25 only | Pure keyword ranking |
| `/area contratos` | Area filter | Restrict search to a specific legal area |
| `/filtro instituto=X` | Metadata filter | Filter by enrichment metadata field |
| `/verbose` | Verbose output | Show full chunk text in results |
| `/top N` | Top-K | Change number of results returned |
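
The hybrid formula can be sketched in a few lines. One detail is an assumption here: raw BM25 scores are unbounded, so this example scales them by their maximum before mixing them with cosine similarity. Whether and how the search script normalizes BM25 is not stated on this page, and the function names are hypothetical.

```python
def hybrid_scores(cosine, bm25, semantic_weight=0.7):
    """Blend cosine and BM25 scores per document."""
    top = max(bm25, default=0.0)
    # Assumption: divide by the max so BM25 lands in [0, 1] like cosine.
    scaled = [s / top if top > 0 else 0.0 for s in bm25]
    return [
        semantic_weight * c + (1 - semantic_weight) * b
        for c, b in zip(cosine, scaled)
    ]


def rank(cosine, bm25, semantic_weight=0.7, top_k=5):
    """Return indices of the top_k documents by hybrid score, best first."""
    scores = hybrid_scores(cosine, bm25, semantic_weight)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


print(rank([0.9, 0.2, 0.5], [1.0, 3.0, 0.0], top_k=2))
```

Setting `semantic_weight` to 1.0 or 0.0 reduces this to the semantic-only and BM25-only modes respectively.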

BM25 Parameters Explained:

  • k1 = 1.5 — Controls term frequency saturation. Higher values give more weight to repeated terms. Standard range: 1.2-2.0.
  • b = 0.75 — Controls document length normalization. b = 1.0 means full normalization; b = 0.0 means no normalization. Standard value for general text.
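
To make the roles of `k1` and `b` concrete, here is a compact Okapi BM25 scorer over pre-tokenized documents. It is a sketch for illustration, not the code at line 126 of the search script: it uses the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`, and the function name is hypothetical.

```python
import math


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for doc in docs:
        score = 0.0
        for term in query:
            df = sum(1 for d in docs if term in d)  # documents containing term
            if df == 0:
                continue
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            tf = doc.count(term)
            # k1 caps the gain from repeated terms; b scales the length penalty.
            norm = tf + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["contrato", "boa", "fe"], ["contrato", "contrato", "mora"], ["posse"]]
print(bm25_scores(["contrato"], docs))
```

In this example the second document scores higher than the first because the term appears twice in a document of the same length, and the third scores zero.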

These conventions are defined in CLAUDE.md and enforced manually:

MOC files require:

---
type: moc
domain: civil # legal domain
description: "..." # brief description
---

Chunk files require:

---
knowledge_id: "contratos-orlando-gomes-cap05-001"
tipo: chunk
titulo: "Exceptio non adimpleti contractus"
livro_titulo: "Contratos"
autor: "Orlando Gomes"
area_direito: civil
status_enriquecimento: completo # or "pendente" or "lixo"
instituto: ["exceptio non adimpleti contractus"]
tipo_conteudo: definicao
ramo: civil
---
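
Since these conventions are enforced manually, a small checker can help catch missing keys. The sketch below parses only flat `key: value` frontmatter lines and treats every key in the example above as required; both the helper names and that interpretation of "required" are assumptions.

```python
REQUIRED_CHUNK_KEYS = [
    "knowledge_id", "tipo", "titulo", "livro_titulo", "autor",
    "area_direito", "status_enriquecimento", "instituto",
    "tipo_conteudo", "ramo",
]
VALID_STATUS = {"completo", "pendente", "lixo"}


def read_frontmatter(text):
    """Parse flat `key: value` lines between the leading '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            # Naive inline-comment stripping: cut at the first " #".
            fields[key.strip()] = value.split(" #")[0].strip().strip('"')
    return fields


def check_chunk(text):
    """Return a list of problems found in a chunk file's frontmatter."""
    fields = read_frontmatter(text)
    problems = [f"missing key: {k}" for k in REQUIRED_CHUNK_KEYS if k not in fields]
    status = fields.get("status_enriquecimento")
    if status is not None and status not in VALID_STATUS:
        problems.append(f"invalid status_enriquecimento: {status}")
    return problems


sample = "\n".join([
    "---",
    'knowledge_id: "contratos-orlando-gomes-cap05-001"',
    "tipo: chunk",
    'status_enriquecimento: completo # or "pendente" or "lixo"',
    "---",
    "Chunk body",
])
print(check_chunk(sample))
```

A full YAML parser would be more robust for nested values; this version only checks that keys are present and that the status is one of the three documented values.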
| Type | Pattern | Example |
|---|---|---|
| MOC | `MOC_{DOMAIN}.md` | `MOC_CIVIL.md` |
| Book directory | `{title}-{author}` (slugified) | `contratos-orlando-gomes/` |
| Chunk file | `chunk_{NNN}.md` | `chunk_001.md` |
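
The naming patterns can be generated mechanically. The helpers below are a sketch with hypothetical names; the slug rules (ASCII folding, lowercase, hyphens) and the title-then-author ordering are inferred from the example `contratos-orlando-gomes/`, not taken from the pipeline code.

```python
import re
import unicodedata


def slugify(text):
    """Lowercase, strip accents, and replace non-alphanumerics with hyphens."""
    folded = unicodedata.normalize("NFKD", text)
    ascii_text = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def book_dir(title, author):
    """Directory name for a book, ordered as in the example above."""
    return f"{slugify(title)}-{slugify(author)}"


def chunk_filename(index):
    """Zero-padded chunk file name, e.g. index 1 gives chunk_001.md."""
    return f"chunk_{index:03d}.md"


def moc_filename(domain):
    """MOC file name with the domain upper-cased."""
    return f"MOC_{domain.upper()}.md"


print(book_dir("Contratos", "Orlando Gomes"), chunk_filename(1), moc_filename("civil"))
```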

Always use Obsidian-style wikilinks for internal references:

[[MOC_CIVIL]] # correct
[MOC Civil](mocs/MOC_CIVIL.md) # incorrect — never use relative markdown links
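
A quick way to audit the link convention is to scan notes for relative markdown links to `.md` files. The regular expression and function name in this sketch are assumptions written for the example; external `http(s)` links are deliberately ignored.

```python
import re

# Matches [label](target.md) where target is not an http(s) URL.
RELATIVE_MD_LINK = re.compile(r"\[([^\]]+)\]\((?!https?://)([^)]+\.md)\)")


def find_relative_links(text):
    """Return (label, target) pairs for relative markdown links in the text."""
    return RELATIVE_MD_LINK.findall(text)


note = "See [[MOC_CIVIL]] and [MOC Civil](mocs/MOC_CIVIL.md)."
for label, target in find_relative_links(note):
    print(f"replace [{label}]({target}) with a wikilink")
```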

Current settings are scattered across 5 scripts as hardcoded constants. The roadmap includes several steps toward centralized configuration:

| Milestone | Feature | What changes |
|---|---|---|
| v0.2 | F22 | All paths use `os.environ.get()` with consistent fallbacks |
| v0.2 | F23 | Shared settings extracted to `pipeline/utils.py` |
| v0.3 | F31 | Makefile with configurable targets (`make pipeline`, `make test`) |
| v0.3 | F32 | `ruff` linter configuration |

Planned Feature — A centralized config.yaml or pyproject.toml for all pipeline settings is under consideration but not yet on the roadmap. Currently, editing source files is the only way to change most parameters.