Roadmap
Roadmap
Section titled “Roadmap”Where Douto is going — from pipeline stabilization to full sens.legal integration.
Product Vision
Section titled “Product Vision”Douto will be the doctrine knowledge backend of the sens.legal ecosystem. When mature, it will:
- Process legal textbooks autonomously (PDF to chunks to embeddings)
- Maintain a navigable skill graph organized by legal domain
- Expose doctrine knowledge via MCP/API for Valter, Juca, and Leci to query in real time
- Support briefings, risk analysis, and legal document drafting with authoritative doctrinal references
The current corpus contains ~50 books and ~31,500 chunks across civil law, civil procedure, and business law.
Current Status
Section titled “Current Status”| Category | Count | Details |
|---|---|---|
| Implemented | 18 | F01-F18: full pipeline (PDF to search), skill graph, 4 MOCs, idempotency, logging, dry-run |
| In progress | 3 | F19 (MOC Consumidor), F20 (env var standardization), F21 (atomic notes) |
| Planned | 11 | F22-F32: path fixes, utils.py, tests, MOCs, README, Makefile, linting, MCP |
| Ideas | 10 | F33-F42: Neo4j integration, cross-references, Docker, CI/CD, eval set |
| Pending decisions | 7 | D01-D07: integration protocol, repo vs. module, tracking, model choice |
| Test coverage | 0% | No test framework, no tests |
What Works Today
Section titled “What Works Today”- Complete pipeline:
process_books.py->rechunk_v3.py->enrich_chunks.py->embed_doutrina.py->search_doutrina_v2.py - Hybrid search (semantic + BM25) with metadata filters
- Interactive CLI search with
/area,/filtro,/verbosecommands - 4 populated MOCs: Civil (35 books), Processual (8 books), Empresarial (7 books), Consumidor (placeholder)
- Structured enrichment metadata: instituto, tipo_conteudo, ramo, fase, fontes_normativas
What Doesn’t Work
Section titled “What Doesn’t Work”- Pipeline only runs on the creator’s machine (hardcoded paths)
- No automated tests
- No real-time integration with sens.legal ecosystem (JSON files only)
- 4 of 8 MOCs do not exist as files (Tributario, Constitucional, Compliance, Sucessoes)
enrich_prompt.mdis missing from the repository (cannot enrich new chunks)- Unpinned dependency versions
Milestone Overview
Section titled “Milestone Overview”| Milestone | Name | Key Deliverable | Status |
|---|---|---|---|
| v0.2 | Pipeline Estavel | Runs on any machine | Planned |
| v0.2.5 | Data Validation | Metadata quality gate (>= 85%) | Proposed (post-PREMORTEM) |
| v0.3 | Quality & Coverage | Tests, MOCs, docs, linting | Planned |
| v0.3.5 | Doctrine Synthesis | Synthesis Engine | Proposed |
| v0.4 | sens.legal Integration | MCP server | Planned |
| v0.5 | Knowledge Graph & Automation | Atomic notes, eval set, CI/CD | Planned |
| v0.6 | Legal Ontology | Concept graph | Proposed |
| v1.0 | Integrated Platform | Full ecosystem integration | Planned |
See Milestones for detailed breakdown of each.
Milestone Sequence
Section titled “Milestone Sequence”v0.2 Pipeline Estavel | vv0.2.5 Data Validation <-- Proposed checkpoint | - Validate 200 chunks | - Create eval set | - Schema validation | - Gate: accuracy >= 85%? vv0.3 Quality & Coverage (tests, MOCs, docs) | vv0.3.5 Doctrine Synthesis (if quality gate passed) | vv0.4 sens.legal Integration (MCP server) | vv0.5 Knowledge Graph & Automation | vv0.6 Legal Ontology (proposed) | vv1.0 Integrated PlatformPending Decisions
Section titled “Pending Decisions”Seven architectural decisions remain unresolved. Two of them (D01 and D02) block the v0.4 milestone entirely.
| # | Question | Impact | Blocks |
|---|---|---|---|
| D01 | Integration protocol: MCP stdio, MCP HTTP/SSE, REST, or keep JSON files? | Defines long-term architecture | v0.4 |
| D02 | Independent service or Valter module? | Douto’s identity as a service | v0.4 |
| D03 | Atomic notes: auto-generated or manually curated? | Volume vs. quality trade-off | v0.5 |
| D04 | Issue tracking: Linear (SEN-XXX) or GitHub Issues? | Contribution workflow | — |
| D05 | Doctrine schema in Neo4j? | Knowledge graph integration | v1.0 |
| D06 | Keep MiniMax M2.5 or migrate enrichment model? | Cost, quality, dependency | — |
| D07 | Are the inferred priorities correct? | Entire roadmap may reorder | All |
See Architecture Decisions for detailed analysis of each option.
Risk Summary
Section titled “Risk Summary”The top 5 risks from the PREMORTEM analysis, ordered by likelihood:
1. Death by Abandonment
Section titled “1. Death by Abandonment”Probability: High | Solo developer maintains 5 repos (Valter, Juca, Leci, Joseph, Douto). Valter and Juca are customer-facing and likely take priority. Douto may go 6+ months without commits, losing context and momentum.
2. Illusion of Quality
Section titled “2. Illusion of Quality”Probability: High | Enrichment metadata has never been validated against human judgment. If 30-40% of instituto and tipo_conteudo classifications are wrong, filtered search returns garbage and any synthesis features amplify errors. No eval set exists to measure this.
3. Irreproducible Foundation
Section titled “3. Irreproducible Foundation”Probability: High | enrich_prompt.md is missing. Dependency versions are unpinned. If the corpus needs reprocessing (new model, bug fix, new domain), the result will differ from the original. Two inconsistent datasets with no way to return to the previous state.
4. Redundancy with Valter
Section titled “4. Redundancy with Valter”Probability: Medium | If Valter needs doctrine before v0.4, the team may build valter/stores/doutrina/ with Qdrant (already available). Once Valter has a “good enough” doctrine module, integrating Douto becomes harder to justify than rewriting.
5. Zero Scalability
Section titled “5. Zero Scalability”Probability: Certain | Embeddings stored as flat JSON (~2 GB for 31,500 chunks). BM25 recalculates document frequencies per query. Load time is seconds. Adding 50 more books doubles everything. Unusable as an MCP tool with this latency.
For the full risk analysis including 14 technical risks, 5 product risks, 4 execution risks, and 7 edge cases, see PREMORTEM.md in the repository root.
Feature Backlog
Section titled “Feature Backlog”Implemented (F01-F18)
Section titled “Implemented (F01-F18)”| # | Feature | Script/File |
|---|---|---|
| F01 | PDF extraction via LlamaParse | process_books.py |
| F02 | Intelligent legal chunking v3 | rechunk_v3.py |
| F03 | Chunk enrichment via MiniMax M2.5 | enrich_chunks.py |
| F04 | Embedding generation (Legal-BERTimbau 768-dim) | embed_doutrina.py |
| F05 | Hybrid search (semantic + BM25 + filters) | search_doutrina_v2.py |
| F06 | Multi-area search support | search_doutrina_v2.py |
| F07 | Interactive CLI search | search_doutrina_v2.py |
| F08 | Skill graph INDEX | INDEX_DOUTO.md |
| F09 | MOC Direito Civil (35 books) | MOC_CIVIL.md |
| F10 | MOC Processual Civil (8 books) | MOC_PROCESSUAL.md |
| F11 | MOC Empresarial (7 books) | MOC_EMPRESARIAL.md |
| F12 | Pipeline idempotency | All scripts |
| F13 | Structured logging (JSONL) | All scripts |
| F14 | Dry-run in all scripts | All scripts |
| F15 | Standardized YAML frontmatter | All scripts |
| F16 | AGENTS.md | AGENTS.md |
| F17 | CLAUDE.md | CLAUDE.md |
| F18 | PROJECT_MAP.md | PROJECT_MAP.md |
Planned (F22-F32)
Section titled “Planned (F22-F32)”See Milestones for which features belong to which milestone.
| # | Feature | Priority | Milestone |
|---|---|---|---|
| F22 | Standardize paths (eliminate hardcodes) | P0 | v0.2 |
| F23 | Extract pipeline/utils.py | P1 | v0.2 |
| F24 | Pin dependency versions | P1 | v0.2 |
| F25 | Create missing MOCs (4 MOCs) | P1 | v0.3 |
| F26 | Tests for rechunk_v3.py | P1 | v0.3 |
| F27 | Tests for utility functions | P2 | v0.3 |
| F28 | Complete README | P2 | v0.3 |
| F29 | Douto -> Valter integration protocol | P1 | v0.4 |
| F30 | MCP server for doctrine | P1 | v0.4 |
| F31 | Makefile for pipeline orchestration | P2 | v0.3 |
| F32 | Linting with ruff | P2 | v0.3 |