Roadmap

Where Douto is going: from pipeline stabilization to full sens.legal integration.

Douto will be the doctrine knowledge backend of the sens.legal ecosystem. When mature, it will:

  • Process legal textbooks autonomously (PDF to chunks to embeddings)
  • Maintain a navigable skill graph organized by legal domain
  • Expose doctrine knowledge via MCP/API for Valter, Juca, and Leci to query in real time
  • Support briefings, risk analysis, and legal document drafting with authoritative doctrinal references

The current corpus contains ~50 books and ~31,500 chunks across civil law, civil procedure, and business law.

| Category | Count | Details |
|---|---|---|
| Implemented | 18 | F01-F18: full pipeline (PDF to search), skill graph, 4 MOCs, idempotency, logging, dry-run |
| In progress | 3 | F19 (MOC Consumidor), F20 (env var standardization), F21 (atomic notes) |
| Planned | 11 | F22-F32: path fixes, utils.py, tests, MOCs, README, Makefile, linting, MCP |
| Ideas | 10 | F33-F42: Neo4j integration, cross-references, Docker, CI/CD, eval set |
| Pending decisions | 7 | D01-D07: integration protocol, repo vs. module, tracking, model choice |
| Test coverage | 0% | No test framework, no tests |


What works today:

  • Complete pipeline: process_books.py -> rechunk_v3.py -> enrich_chunks.py -> embed_doutrina.py -> search_doutrina_v2.py
  • Hybrid search (semantic + BM25) with metadata filters
  • Interactive CLI search with /area, /filtro, /verbose commands
  • 4 MOC files: Civil (35 books), Processual (8 books), Empresarial (7 books), and Consumidor (placeholder, not yet populated)
  • Structured enrichment metadata: instituto, tipo_conteudo, ramo, fase, fontes_normativas

Known limitations:

  • Pipeline only runs on the creator’s machine (hardcoded paths)
  • No automated tests
  • No real-time integration with the sens.legal ecosystem (JSON files only)
  • 4 of 8 MOCs do not exist as files (Tributario, Constitucional, Compliance, Sucessoes)
  • enrich_prompt.md is missing from the repository, so new chunks cannot be enriched
  • Unpinned dependency versions
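
The hybrid search listed above combines a semantic score with a BM25 score after metadata filters are applied. The sketch below shows one common way to do that fusion; it is not the code in search_doutrina_v2.py. The function names, the `alpha` weight, the min-max normalization, and the chunk layout (`id`, `meta`) are assumptions made for illustration. Only the metadata field name `ramo` comes from the enrichment schema.

```python
def minmax(scores):
    """Rescale a {chunk_id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}


def matches(meta, filters):
    """True when every requested metadata field has the requested value."""
    return all(meta.get(key) == value for key, value in filters.items())


def hybrid_rank(chunks, semantic, keyword, filters=None, alpha=0.6, top_k=5):
    """Rank chunk ids by a weighted sum of normalized scores.

    chunks   : list of {"id": str, "meta": dict}   (hypothetical layout)
    semantic : {chunk_id: cosine similarity}
    keyword  : {chunk_id: BM25 score}
    alpha    : weight of the semantic score (assumed default, not from Douto)
    """
    filters = filters or {}
    ids = [c["id"] for c in chunks if matches(c["meta"], filters)]
    # Normalize after filtering so both score types share the same 0..1 scale.
    sem = minmax({i: semantic.get(i, 0.0) for i in ids})
    kw = minmax({i: keyword.get(i, 0.0) for i in ids})
    fused = {i: alpha * sem[i] + (1 - alpha) * kw[i] for i in ids}
    return sorted(ids, key=lambda i: fused[i], reverse=True)[:top_k]
```

Normalizing after the filter step keeps an out-of-scope chunk with an extreme score from compressing the scores of the chunks that remain.
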

| Milestone | Name | Key Deliverable | Status |
|---|---|---|---|
| v0.2 | Pipeline Estavel | Runs on any machine | Planned |
| v0.2.5 | Data Validation | Metadata quality gate (>= 85%) | Proposed (post-PREMORTEM) |
| v0.3 | Quality & Coverage | Tests, MOCs, docs, linting | Planned |
| v0.3.5 | Doctrine Synthesis | Synthesis Engine | Proposed |
| v0.4 | sens.legal Integration | MCP server | Planned |
| v0.5 | Knowledge Graph & Automation | Atomic notes, eval set, CI/CD | Planned |
| v0.6 | Legal Ontology | Concept graph | Proposed |
| v1.0 | Integrated Platform | Full ecosystem integration | Planned |

See Milestones for a detailed breakdown of each.

v0.2 Pipeline Estavel
|
v
v0.2.5 Data Validation <-- Proposed checkpoint
| - Validate 200 chunks
| - Create eval set
| - Schema validation
| - Gate: accuracy >= 85%?
v
v0.3 Quality & Coverage (tests, MOCs, docs)
|
v
v0.3.5 Doctrine Synthesis (if quality gate passed)
|
v
v0.4 sens.legal Integration (MCP server)
|
v
v0.5 Knowledge Graph & Automation
|
v
v0.6 Legal Ontology (proposed)
|
v
v1.0 Integrated Platform
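
The v0.2.5 checkpoint in the diagram validates a sample of 200 chunks and asks whether accuracy reaches 85%. A minimal sketch of such a gate follows. The sample layout (`predicted` and `expected` dictionaries), the function names, and the rule that every checked field must clear the threshold are assumptions; the roadmap states only the sample size, the threshold, and the metadata field names.

```python
GATE_THRESHOLD = 0.85  # from the roadmap: accuracy >= 85%
FIELDS = ("instituto", "tipo_conteudo")  # enrichment fields; the choice of these two is an assumption


def field_accuracy(samples, field):
    """Share of samples where the enrichment value equals the reviewer's value."""
    if not samples:
        raise ValueError("no samples to evaluate")
    hits = sum(1 for s in samples if s["predicted"].get(field) == s["expected"].get(field))
    return hits / len(samples)


def quality_gate(samples, fields=FIELDS, threshold=GATE_THRESHOLD):
    """Return (passed, per-field accuracy).

    Assumption: the gate passes only if every field reaches the threshold.
    """
    report = {f: field_accuracy(samples, f) for f in fields}
    return all(acc >= threshold for acc in report.values()), report
```

Reporting accuracy per field, rather than one pooled number, shows which classification is dragging the gate down when it fails.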

Seven architectural decisions remain unresolved. Two of them (D01 and D02) block the v0.4 milestone entirely.

| # | Question | Impact | Blocks |
|---|---|---|---|
| D01 | Integration protocol: MCP stdio, MCP HTTP/SSE, REST, or keep JSON files? | Defines long-term architecture | v0.4 |
| D02 | Independent service or Valter module? | Douto’s identity as a service | v0.4 |
| D03 | Atomic notes: auto-generated or manually curated? | Volume vs. quality trade-off | v0.5 |
| D04 | Issue tracking: Linear (SEN-XXX) or GitHub Issues? | Contribution workflow | |
| D05 | Doctrine schema in Neo4j? | Knowledge graph integration | v1.0 |
| D06 | Keep MiniMax M2.5 or migrate enrichment model? | Cost, quality, dependency | |
| D07 | Are the inferred priorities correct? | Entire roadmap may reorder | All |

See Architecture Decisions for detailed analysis of each option.

The top 5 risks from the PREMORTEM analysis, ordered by likelihood:

Probability: High | Solo developer maintains 5 repos (Valter, Juca, Leci, Joseph, Douto). Valter and Juca are customer-facing and likely take priority. Douto may go 6+ months without commits, losing context and momentum.

Probability: High | Enrichment metadata has never been validated against human judgment. If 30-40% of instituto and tipo_conteudo classifications are wrong, filtered search returns garbage and any synthesis features amplify errors. No eval set exists to measure this.

Probability: High | enrich_prompt.md is missing and dependency versions are unpinned. If the corpus needs reprocessing (new model, bug fix, new domain), the result will differ from the original, leaving two inconsistent datasets and no way to return to the previous state.
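
One way to make a reprocessing run comparable with an earlier one is to record a small manifest next to the output. The sketch below hashes the prompt text and captures installed package versions. The manifest layout and the function names are hypothetical and nothing like this exists in the repository today; `importlib.metadata` is part of the Python standard library.

```python
import hashlib
from importlib import metadata


def installed_version(package):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def build_manifest(prompt_text, model_name, packages):
    """Describe the conditions of an enrichment run (hypothetical layout)."""
    return {
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "model": model_name,
        "packages": {name: installed_version(name) for name in sorted(packages)},
    }


def same_run_conditions(old, new):
    """True when prompt, model, and package versions all match."""
    return old == new
```

If two manifests differ, the two datasets should not be mixed without re-running the older one.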

Probability: Medium | If Valter needs doctrine before v0.4, the team may build valter/stores/doutrina/ with Qdrant (already available). Once Valter has a “good enough” doctrine module, integrating Douto becomes harder to justify than rewriting.

Probability: Certain | Embeddings are stored as flat JSON (~2 GB for 31,500 chunks), and BM25 recalculates document frequencies on every query. Loading already takes seconds, and adding 50 more books would roughly double both size and latency. At that latency the search is unusable as an MCP tool.
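
The per-query recalculation of document frequencies can be avoided by computing corpus statistics once, when the index is built. The sketch below is a generic Okapi BM25 index with an inverted index and precomputed IDF values; it is not Douto's implementation. The class name and the `k1` and `b` defaults are assumptions.

```python
import math
from collections import Counter, defaultdict


class BM25Index:
    """BM25 with document frequencies and IDF computed once at build time."""

    def __init__(self, docs_tokens, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.n = len(docs_tokens)
        self.doc_len = [len(d) for d in docs_tokens]
        self.avgdl = sum(self.doc_len) / self.n if self.n else 0.0
        # Inverted index: term -> {doc_id: term frequency}
        self.postings = defaultdict(dict)
        for doc_id, tokens in enumerate(docs_tokens):
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        # Document frequency is the length of each posting list.
        self.idf = {
            term: math.log(1 + (self.n - len(p) + 0.5) / (len(p) + 0.5))
            for term, p in self.postings.items()
        }

    def search(self, query_terms, top_k=5):
        """Score only the documents that contain at least one query term."""
        scores = defaultdict(float)
        for term in query_terms:
            for doc_id, tf in self.postings.get(term, {}).items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                scores[doc_id] += self.idf[term] * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

With this layout a query touches only the posting lists of its own terms, so cost grows with the number of matching chunks, not with the size of the corpus.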

For the full risk analysis including 14 technical risks, 5 product risks, 4 execution risks, and 7 edge cases, see PREMORTEM.md in the repository root.

| # | Feature | Script/File |
|---|---|---|
| F01 | PDF extraction via LlamaParse | process_books.py |
| F02 | Intelligent legal chunking v3 | rechunk_v3.py |
| F03 | Chunk enrichment via MiniMax M2.5 | enrich_chunks.py |
| F04 | Embedding generation (Legal-BERTimbau 768-dim) | embed_doutrina.py |
| F05 | Hybrid search (semantic + BM25 + filters) | search_doutrina_v2.py |
| F06 | Multi-area search support | search_doutrina_v2.py |
| F07 | Interactive CLI search | search_doutrina_v2.py |
| F08 | Skill graph INDEX | INDEX_DOUTO.md |
| F09 | MOC Direito Civil (35 books) | MOC_CIVIL.md |
| F10 | MOC Processual Civil (8 books) | MOC_PROCESSUAL.md |
| F11 | MOC Empresarial (7 books) | MOC_EMPRESARIAL.md |
| F12 | Pipeline idempotency | All scripts |
| F13 | Structured logging (JSONL) | All scripts |
| F14 | Dry-run in all scripts | All scripts |
| F15 | Standardized YAML frontmatter | All scripts |
| F16 | AGENTS.md | AGENTS.md |
| F17 | CLAUDE.md | CLAUDE.md |
| F18 | PROJECT_MAP.md | PROJECT_MAP.md |

See Milestones for which features belong to which milestone.
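
Features F12 to F14 (idempotency, JSONL logging, dry-run) tend to appear together in a pipeline step. The sketch below shows one way the three can interact. It is illustrative only: the function names, the event names, and the "skip if the output file exists" rule are assumptions, not a description of how the Douto scripts implement them.

```python
import json
import time
from pathlib import Path


def log_event(log_path, **fields):
    """Append one JSON object per line (JSONL)."""
    record = {"ts": time.time(), **fields}
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def run_step(inputs, out_dir, transform, log_path, dry_run=False):
    """Process each input once; outputs that already exist are skipped."""
    out_dir = Path(out_dir)
    written = []
    for src in map(Path, inputs):
        dst = out_dir / (src.stem + ".json")
        if dst.exists():  # idempotency: a re-run does no duplicate work
            log_event(log_path, event="skip", source=src.name)
            continue
        if dry_run:  # report what would happen, write nothing
            log_event(log_path, event="would_process", source=src.name)
            continue
        out_dir.mkdir(parents=True, exist_ok=True)
        result = transform(src.read_text(encoding="utf-8"))
        dst.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
        log_event(log_path, event="processed", source=src.name)
        written.append(dst)
    return written
```

Running the step twice with the same inputs leaves the output directory unchanged and records a `skip` event the second time.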

| # | Feature | Priority | Milestone |
|---|---|---|---|
| F22 | Standardize paths (eliminate hardcodes) | P0 | v0.2 |
| F23 | Extract pipeline/utils.py | P1 | v0.2 |
| F24 | Pin dependency versions | P1 | v0.2 |
| F25 | Create missing MOCs (4 MOCs) | P1 | v0.3 |
| F26 | Tests for rechunk_v3.py | P1 | v0.3 |
| F27 | Tests for utility functions | P2 | v0.3 |
| F28 | Complete README | P2 | v0.3 |
| F29 | Douto -> Valter integration protocol | P1 | v0.4 |
| F30 | MCP server for doctrine | P1 | v0.4 |
| F31 | Makefile for pipeline orchestration | P2 | v0.3 |
| F32 | Linting with ruff | P2 | v0.3 |
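
For F22, one possible shape for path standardization is a single resolver that every script calls, so no script carries a machine-specific path. The sketch below is an assumption, not the planned design: the environment variable name `DOUTO_DATA_DIR`, the `--data-dir` flag mentioned in the error message, and the function name are all hypothetical.

```python
import os
from pathlib import Path

ENV_VAR = "DOUTO_DATA_DIR"  # hypothetical name, not defined by the project


def resolve_data_dir(cli_value=None, env=None):
    """Resolve the data directory: CLI value first, then the environment.

    Fails loudly when neither is set, so no hardcoded fallback can creep in.
    """
    env = os.environ if env is None else env
    raw = cli_value or env.get(ENV_VAR)
    if not raw:
        raise RuntimeError(f"Set {ENV_VAR} or pass --data-dir; no default path is assumed.")
    return Path(raw).expanduser().resolve()
```

Raising an error when nothing is configured is a deliberate choice here: a silent default would reintroduce the "runs only on one machine" problem in a less visible form.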