Features

Every feature in Douto, organized by implementation status. Each pipeline stage and knowledge-base component links to a dedicated page with architecture details, code examples, and known limitations.

| Badge | Meaning |
| --- | --- |
| Implemented | In production, functional, used in the current pipeline |
| In Progress | Partially implemented; structure exists but is incomplete |
| Planned | On the roadmap with an assigned milestone |
| Idea | Proposed but not yet scheduled |
| Innovation | Strategic proposals from INNOVATION_LAYER.md |

The core processing pipeline transforms legal textbook PDFs into searchable embeddings through five sequential stages.

| # | Feature | Status | Script | Page |
| --- | --- | --- | --- | --- |
| F01 | PDF Extraction | Implemented | process_books.py | pdf-extraction |
| F02 | Intelligent Chunking v3 | Implemented | rechunk_v3.py | intelligent-chunking |
| F03 | Chunk Enrichment | Implemented | enrich_chunks.py | enrichment |
| F04 | Embedding Generation | Implemented | embed_doutrina.py | embeddings |
| F05 | Hybrid Search | Implemented | search_doutrina_v2.py | hybrid-search |
| F06 | Multi-Area Search | Implemented | search_doutrina_v2.py | hybrid-search |
| F07 | Interactive Search CLI | Implemented | search_doutrina_v2.py | hybrid-search |

```mermaid
flowchart LR
    PDF["PDF files"]
    MD["Markdown + chapters"]
    CHK["Semantic chunks"]
    ENR["Enriched chunks"]
    EMB["Embeddings + corpus"]
    SRCH["Search results"]
    PDF -->|"F01 process_books.py"| MD
    MD -->|"F02 rechunk_v3.py"| CHK
    CHK -->|"F03 enrich_chunks.py"| ENR
    ENR -->|"F04 embed_doutrina.py"| EMB
    EMB -->|"F05 search_doutrina_v2.py"| SRCH
```

The knowledge base is an Obsidian-style skill graph that organizes the doctrinal corpus by legal domain.

| # | Feature | Status | Artifact | Page |
| --- | --- | --- | --- | --- |
| F08 | Skill Graph INDEX | Implemented | INDEX_DOUTO.md | skill-graph |
| F09 | MOC Direito Civil | Implemented | MOC_CIVIL.md | mocs |
| F10 | MOC Processual Civil | Implemented | MOC_PROCESSUAL.md | mocs |
| F11 | MOC Empresarial | Implemented | MOC_EMPRESARIAL.md | mocs |
| F19 | MOC Consumidor | In Progress | MOC_CONSUMIDOR.md | mocs |
| F21 | Atomic Notes (nodes/) | In Progress | knowledge/nodes/ | atomic-notes |

Cross-cutting capabilities that ensure reliability and debuggability across all pipeline stages.

| # | Feature | Status | Description |
| --- | --- | --- | --- |
| F12 | Idempotent Processing | Implemented | Markers prevent re-processing; --force to override |
| F13 | Structured Logging | Implemented | processing_log.jsonl with append-only events |
| F14 | Dry-Run Mode | Implemented | --dry-run flag on all scripts that mutate data |
| F15 | Standardized YAML Frontmatter | Implemented | Consistent schema: knowledge_id, tipo, titulo, livro_titulo, autor, area_direito, status_enriquecimento |

| # | Feature | Status | Description |
| --- | --- | --- | --- |
| F16 | AGENTS.md | Implemented | Agent identity, responsibilities, limits, git protocol |
| F17 | CLAUDE.md | Implemented | Coding guidelines for AI agents aligned with the sens.legal ecosystem |
| F18 | PROJECT_MAP.md | Implemented | Full project diagnostic and architecture map |
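
The way F12 to F14 interact can be illustrated with a minimal sketch. It is not taken from Douto's scripts: the `.done` marker suffix, the event fields, and the function names `log_event` and `process_once` are all assumptions made for illustration. Only the `--force` and `--dry-run` semantics and the `processing_log.jsonl` file name come from the table above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_event(log_path: Path, **event) -> None:
    """Append one JSON object per line; earlier lines are never rewritten."""
    event["ts"] = datetime.now(timezone.utc).isoformat()
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")


def process_once(src: Path, log_path: Path, *, force: bool = False,
                 dry_run: bool = False) -> str:
    """Process `src` unless a marker file says it was already handled."""
    # Hypothetical marker convention: "<file name>.done" next to the source.
    marker = src.with_name(src.name + ".done")
    if marker.exists() and not force:
        return "skipped"  # idempotency: nothing to do on a re-run
    if dry_run:
        return "would-process"  # report only: no marker, no log entry
    # ... the real extraction, chunking, or enrichment work would go here ...
    marker.touch()
    log_event(log_path, event="processed", file=src.name)
    return "processed"
```

In this sketch a second call on the same file returns `"skipped"`, passing `force=True` processes it again, and `dry_run=True` leaves both the marker and the log untouched. Whether the real scripts log dry runs is not stated in the source.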

Organized by target milestone. Source: ROADMAP.md.

| # | Feature | Priority | Description |
| --- | --- | --- | --- |
| F22 | Standardize paths | P0 | Eliminate hardcoded absolute paths in process_books.py and rechunk_v3.py |
| F23 | Extract pipeline/utils.py | P1 | Deduplicate parse_frontmatter(), slugify(), build_frontmatter() |
| F24 | Pin dependency versions | P1 | requirements.txt with exact versions |
| F20 | Complete env var standardization | P1 | 2/5 scripts use os.environ.get(), 3 have hardcoded paths |

| # | Feature | Priority | Description |
| --- | --- | --- | --- |
| F25 | Create missing MOCs | P1 | MOC_TRIBUTARIO, MOC_CONSTITUCIONAL, MOC_COMPLIANCE, MOC_SUCESSOES |
| F26 | Tests for rechunk_v3.py | P1 | pytest with real legal markdown fixtures |
| F27 | Tests for utility functions | P2 | parse_frontmatter, slugify, extract_json, compose_embedding_text |
| F28 | Complete README | P2 | Setup, prerequisites, env vars, usage, architecture |
| F31 | Pipeline Makefile | P2 | make pipeline, make search, make test, make lint |
| F32 | Linting with ruff | P2 | Configure ruff, integrate into Makefile |
| F42 | Version enrich prompt | P1 | enrich_prompt.md referenced in code but missing from repo |

| # | Feature | Priority | Description |
| --- | --- | --- | --- |
| F29 | Douto-to-Valter integration | P1 | Define protocol: file, API, or MCP |
| F30 | MCP Server for doctrine | P1 | Expose search via Model Context Protocol |
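
As a rough picture of what F20, F22, and F23 point toward, the sketch below shows what a shared `pipeline/utils.py` might contain. Only the names `slugify` and `parse_frontmatter` and the use of `os.environ.get()` come from the roadmap. The signatures, the `resolve_dir` helper, and the flat `key: value` parsing (a simplification that skips a real YAML parser) are assumptions, not Douto's actual code.

```python
import os
import re
import unicodedata
from pathlib import Path


def resolve_dir(env_var: str, default: str) -> Path:
    """Read a directory from the environment instead of hardcoding it."""
    return Path(os.environ.get(env_var, default)).expanduser()


def slugify(text: str) -> str:
    """Lowercase ASCII slug: drop accents, collapse other characters to '-'."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def parse_frontmatter(markdown: str) -> tuple[dict, str]:
    """Split a leading '---' block of flat 'key: value' lines from the body."""
    if not markdown.startswith("---\n"):
        return {}, markdown
    head, sep, body = markdown[4:].partition("\n---\n")
    if not sep:  # no closing delimiter: treat the whole text as body
        return {}, markdown
    meta = {}
    for line in head.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            meta[key.strip()] = value.strip().strip('"')
    return meta, body
```

For example, `slugify("Direito Civil: Obrigações")` returns `"direito-civil-obrigacoes"`, and `resolve_dir("DOUTO_OUTPUT_DIR", "data/out")` falls back to `data/out` when the variable is unset. The variable name `DOUTO_OUTPUT_DIR` is hypothetical.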

Features inferred from the sens.legal ecosystem architecture.

| # | Feature | Priority | Milestone | Description |
| --- | --- | --- | --- | --- |
| F33 | Doutrina in Neo4j | P2 | v1.0 | Ingest doctrine nodes into Valter’s knowledge graph |
| F34 | Doutrina-jurisprudencia cross-ref | P2 | v1.0 | Auto-link when STJ decisions cite a doctrinal author |
| F35 | Doutrina-legislacao cross-ref | P3 | v1.0 | Link doctrinal commentary to statutory provisions in Leci |
| F36 | Auto-generate atomic notes | P2 | v0.5 | One note per instituto from enriched chunks |
| F37 | Progressive Briefing support | P2 | v1.0 | Feed Juca’s 4-phase briefing with doctrinal sources |
| F38 | Docker pipeline | P3 | v1.0 | Containerize with PyTorch + pre-downloaded models |
| F39 | Basic CI/CD | P3 | v0.5 | GitHub Actions: ruff lint + pytest on PRs |
| F40 | Embedding quality eval set | P2 | v0.5 | Query-answer pairs to measure recall@k and nDCG |
| F41 | Unified ingestion CLI | P3 | v0.5 | douto ingest livro.pdf runs the full pipeline |
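
The two metrics named in F40 can be computed as in the sketch below. It assumes binary relevance (a chunk is either a correct answer for a query or not) and unique chunk IDs in each ranking. The function names and the ID format are illustrative; the eval set itself does not exist yet, so no data format is implied.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in ranked[:k] if chunk_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    # A hit at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, chunk_id in enumerate(ranked[:k])
        if chunk_id in relevant
    )
    # Ideal ranking: all relevant chunks (up to k) at the top positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

With `ranked = ["c1", "c7", "c3", "c9"]` and `relevant = {"c3", "c7"}`, `recall_at_k(ranked, relevant, 2)` is `0.5` because only one of the two relevant chunks is in the top two. nDCG additionally penalizes relevant chunks for appearing lower in the ranking.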

Strategic proposals from INNOVATION_LAYER.md. These would transform Douto from a “book search engine” into a “doctrinal reasoning engine.”

| # | Feature | Priority | Milestone | Description |
| --- | --- | --- | --- | --- |
| F43 | Doctrine Synthesis Engine | P1 | v0.3.5 | Synthesize all chunks for a given legal concept across all books |
| F44 | Synthesis prompt | P1 | v0.3.5 | Carefully designed prompt for generating Doctrine Briefs |
| F45 | Doctrine Brief template | P1 | v0.3.5 | Standardized output format (Markdown + JSON) |
| F46 | Ontological concept extraction | P2 | v0.6 | Collect all institutos and co-occurrences from the corpus |
| F47 | Relationship typing | P2 | v0.6 | LLM classifies relationships (IS_A, APPLIES_TO, REQUIRES, etc.) |
| F48 | Ontology export and visualization | P3 | v0.6 | GraphML, RDF/Turtle, JSON, interactive visualization |