Architecture Overview

Douto operates in two complementary modes: a batch ETL pipeline that transforms legal PDFs into searchable data, and a markdown-based knowledge graph navigable by humans and AI agents. It is not a web application or a running service — it is a set of processing tools and a structured knowledge base.

Batch Processing Pipeline — Five independent Python scripts executed sequentially. Each reads from disk, processes data, and writes back to disk. No database, no message queue, no orchestrator.

Markdown Knowledge Graph — An Obsidian-compatible hierarchy using YAML frontmatter, wikilinks, and Maps of Content (MOCs). Designed for dual consumption: human navigation in Obsidian and programmatic querying by AI agents.

```mermaid
flowchart TD
    PDF["📄 Legal PDFs<br/>(staging/input/)"]
    PB["process_books.py<br/>LlamaParse API"]
    RC["rechunk_v3.py<br/>5-pass algorithm"]
    EN["enrich_chunks.py<br/>MiniMax M2.5 LLM"]
    EM["embed_doutrina.py<br/>Legal-BERTimbau"]
    SE["search_doutrina_v2.py<br/>Hybrid: Semantic + BM25"]
    JSON["📦 JSON Artifacts<br/>embeddings + corpus + BM25"]
    PDF -->|"PDF files"| PB
    PB -->|"markdown<br/>(chapters)"| RC
    RC -->|"intelligent<br/>chunks"| EN
    EN -->|"enriched chunks<br/>(13 metadata fields)"| EM
    EM -->|"768-dim vectors"| JSON
    JSON --> SE
```

Each arrow represents a file-system handoff; there is no in-memory pipeline or streaming. Scripts can be re-run independently with `--force` or `--dry-run`.
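Because every handoff is a file on disk, a driver can be as simple as invoking each stage in order. The sketch below is illustrative, not the project's actual entry point: the stage names and the `--dry-run` flag come from this document; everything else is assumed.

```python
"""Minimal sketch of a sequential pipeline driver.

The four batch stages write their output to disk; search_doutrina_v2.py
then consumes the resulting JSON artifacts at query time.
"""
import subprocess
import sys

# Batch stages, in execution order (from the flowchart above).
PIPELINE = [
    "process_books.py",   # PDFs -> markdown chapters (LlamaParse)
    "rechunk_v3.py",      # markdown -> intelligent chunks
    "enrich_chunks.py",   # chunks -> enriched chunks (LLM metadata)
    "embed_doutrina.py",  # enriched chunks -> 768-dim vectors
]

def run_pipeline(dry_run: bool = False) -> None:
    """Run each stage in order; stop on the first failure."""
    for script in PIPELINE:
        cmd = [sys.executable, script]
        if dry_run:
            cmd.append("--dry-run")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            sys.exit(f"{script} failed with code {result.returncode}")

# Usage: run_pipeline(dry_run=True)
```

Because the stages communicate only through files, a failed run can resume at any stage without replaying the earlier ones.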

The knowledge base has three layers:

| Layer | File | Purpose | Status |
|---|---|---|---|
| Root | `knowledge/INDEX_DOUTO.md` | Skill-graph entry point; maps 8 legal domains | Active |
| Domain Maps | `knowledge/mocs/MOC_*.md` | Books per domain, with metadata and status | 3 active, 1 placeholder, 4 missing |
| Atomic Notes | `knowledge/nodes/*.md` | One note per legal concept (instituto) | Planned (directory exists, no content) |

The hierarchy uses Obsidian conventions: [[wikilinks]] for navigation, YAML frontmatter for structured metadata.
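Those same conventions are what make the graph queryable by agents. Below is a minimal stdlib-only sketch of extracting frontmatter and wikilinks from one note; the naive frontmatter parser (used instead of a YAML library) and the note content are illustrative, not taken from the repository.

```python
"""Sketch: parse YAML frontmatter and [[wikilinks]] from a markdown note."""
import re

# Matches [[Target]] and [[Target|Display text]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]+)?\]\]")

def parse_note(text: str) -> tuple[dict, list[str]]:
    """Return (frontmatter dict, outgoing wikilink targets) for one note."""
    meta: dict[str, str] = {}
    body = text
    if text.startswith("---\n"):
        header, _, body = text[4:].partition("\n---\n")
        for line in header.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    return meta, WIKILINK.findall(body)

note = """---
tipo: instituto
area: contratos
---
Ver também [[Boa-fé objetiva]] e [[MOC_CIVIL|Direito Civil]].
"""
meta, links = parse_note(note)
# meta["area"] == "contratos"; links == ["Boa-fé objetiva", "MOC_CIVIL"]
```

Running this over `knowledge/**/*.md` and collecting the link pairs would reconstruct the graph without Obsidian in the loop.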

The pipeline produces three JSON files per legal area (e.g., contratos, processo_civil):

| File | Contents | Estimated Size |
|---|---|---|
| `embeddings_{area}.json` | `doc_ids[]` + `embeddings[][]` (768-dim float32 vectors) | ~500 MB for 31,500 chunks |
| `search_corpus_{area}.json` | Full metadata per chunk (title, author, instituto, tipo, etc.) | ~200 MB |
| `bm25_index_{area}.json` | `doc_ids[]` + `documents[]` (tokenized text for BM25) | ~300 MB |

These files are loaded entirely into memory by search_doutrina_v2.py at startup.
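As a sketch of how those in-memory artifacts could back a hybrid query, the snippet below blends cosine similarity with a simple token-overlap stand-in for BM25. The blend weight and the overlap measure are assumptions for illustration, not the actual scoring in `search_doutrina_v2.py`.

```python
"""Sketch: hybrid scoring over in-memory embedding and token arrays."""
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_scores(query_vec, query_tokens, embeddings, documents,
                  alpha: float = 0.7) -> list[float]:
    """Blend semantic similarity with lexical overlap (BM25 stand-in)."""
    scores = []
    for vec, doc in zip(embeddings, documents):
        semantic = cosine(query_vec, vec)
        lexical = len(set(query_tokens) & set(doc)) / max(len(query_tokens), 1)
        scores.append(alpha * semantic + (1 - alpha) * lexical)
    return scores

# Tiny stand-ins for embeddings_{area}.json and bm25_index_{area}.json.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
documents = [["contrato", "boa-fé"], ["recurso", "apelação"]]
scores = hybrid_scores([1.0, 0.0], ["contrato"], embeddings, documents)
# scores[0] ranks the matching chunk first.
```

The full-load design in the table above is what makes this viable: with everything already in RAM, each query is a linear scan over the arrays.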

```mermaid
graph TB
    subgraph "Douto — Pipeline"
        PB["process_books.py"]
        RC["rechunk_v3.py"]
        EN["enrich_chunks.py"]
        EM["embed_doutrina.py"]
        SE["search_doutrina_v2.py"]
    end
    subgraph "Douto — Knowledge Base"
        IX["INDEX_DOUTO.md"]
        MC["MOC_CIVIL<br/>35 books"]
        MP["MOC_PROCESSUAL<br/>8 books"]
        ME["MOC_EMPRESARIAL<br/>7 books"]
    end
    subgraph "sens.legal Ecosystem"
        VA["Valter<br/>FastAPI + Neo4j + Qdrant"]
        JU["Juca<br/>Next.js Frontend"]
        LE["Leci<br/>Legislation"]
    end
    subgraph "External Services"
        LP["LlamaParse API"]
        MM["MiniMax M2.5 API"]
        HF["HuggingFace<br/>Legal-BERTimbau"]
    end
    PB --> RC --> EN --> EM --> SE
    SE -.->|"JSON files"| VA
    VA --> JU
    IX --> MC & MP & ME
    PB -.-> LP
    EN -.-> MM
    EM -.-> HF
```

Currently, Douto integrates with the ecosystem via JSON files deposited in a shared directory. There is no API, MCP server, or real-time query capability. MCP integration is planned for v0.4.

Design priorities, from CLAUDE.md, in order:

  1. Correctness — especially doctrinal data, citations, legal metadata
  2. Simplicity — code another agent understands without context
  3. Maintainability — easy to change without breaking
  4. Reversibility — decisions that can be undone
  5. Performance — optimize only with evidence of a problem

Operational principles:

  • Idempotent — every script is safe to re-run (skip markers; `--force` to override)
  • Dry-run first — every script supports `--dry-run`
  • Structured logging — events go to `processing_log.jsonl`
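A minimal sketch of how the idempotency and logging conventions fit together. Only `processing_log.jsonl` comes from this document; the helper names and the output-exists skip check are assumptions.

```python
"""Sketch: JSONL event logging plus an output-exists skip marker."""
import json
import time
from pathlib import Path

LOG = Path("processing_log.jsonl")

def log_event(event: str, **fields) -> None:
    """Append one JSON object per line to the processing log."""
    record = {"ts": time.time(), "event": event, **fields}
    with LOG.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")

def should_process(output: Path, force: bool = False) -> bool:
    """Skip work whose output already exists, unless forced (--force)."""
    if output.exists() and not force:
        log_event("skipped", output=str(output))
        return False
    return True
```

One JSON object per line keeps the log appendable and greppable, and a crashed run leaves at most one truncated line rather than a corrupt file.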

The following limitations are architectural constraints, not bugs. Each has a tracking reference:

| Limitation | Impact | Tracking |
|---|---|---|
| No database (JSON flat files) | Doesn't scale past ~100 books; full load into memory | ADR-003 |
| No API or MCP server | No real-time queries from other agents | F30, v0.4 |
| No CI/CD | No automated testing or linting | F39, v0.5 |
| Hardcoded paths in 2 scripts | Pipeline runs only on the creator's machine | F22, v0.2 |
| 0% test coverage | Regressions detectable only by manual inspection | F26-F27, v0.3 |
| Missing enrichment prompt | `enrich_prompt.md` not in repo; enrichment is unreproducible | M01 |