Introduction
Douto is the doctrine knowledge agent of the sens.legal platform. It processes legal textbooks into structured, searchable knowledge that lawyers and AI agents can query in real time.
The Problem
Legal research in Brazil requires consulting multiple authors on the same legal concept. A lawyer researching exceptio non adimpleti contractus might need to compare positions from Orlando Gomes, Caio Mário, and Pontes de Miranda, each in a different book, chapter, and edition. This manual cross-referencing typically takes 2-4 hours per concept.
Current legal tech platforms (Jusbrasil, Turivius, Vlex, LexisNexis) perform search — they return documents that match a query. None of them perform synthesis — aggregating and comparing doctrinal positions across authors.
Douto bridges this gap by transforming raw legal textbooks into structured, classified, searchable knowledge with metadata that enables filtering by legal concept, content type, legal branch, and procedural phase.
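As a rough illustration of what metadata-based filtering could look like, the sketch below models a chunk's metadata as a small record and filters a list of chunks by exact field match. The `ChunkMeta` class, its field names, and the sample values are all hypothetical; they mirror the four filter dimensions named above but are not Douto's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ChunkMeta:
    """Hypothetical metadata record; field names are illustrative only."""
    instituto: str              # legal concept, e.g. "boa-fé objetiva"
    content_type: str           # e.g. "definition" (assumed label)
    legal_branch: str           # e.g. "civil" (assumed label)
    procedural_phase: str = ""  # left empty when not applicable
    author: str = ""


def filter_chunks(chunks, **criteria):
    """Return the chunks whose metadata equals every given field value."""
    return [
        chunk for chunk in chunks
        if all(getattr(chunk, key) == value for key, value in criteria.items())
    ]


# Placeholder sample data, not taken from the real corpus.
chunks = [
    ChunkMeta("exceptio non adimpleti contractus", "definition", "civil",
              author="Orlando Gomes"),
    ChunkMeta("exceptio non adimpleti contractus", "definition", "civil",
              author="Caio Mário"),
    ChunkMeta("boa-fé objetiva", "definition", "civil", author="Autor X"),
]

matches = filter_chunks(chunks, instituto="exceptio non adimpleti contractus")
authors = [chunk.author for chunk in matches]
```

Filtering on one concept returns every author's chunk for it side by side, which is the cross-author comparison that otherwise has to be done by hand.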
What Douto Does
Douto operates in two complementary modes:
Batch Processing Pipeline
Five Python scripts executed sequentially transform legal PDFs into searchable data:
    PDF → process_books.py → rechunk_v3.py → enrich_chunks.py → embed_doutrina.py → search_doutrina_v2.py

Each stage adds structure: raw PDF becomes markdown, markdown becomes intelligent chunks, chunks get classified with legal metadata, metadata-enriched text becomes embeddings, and embeddings enable semantic search.
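The sequential execution described above could be driven by a small wrapper like the one below. This is only a sketch: the script names and their order come from the pipeline line, but invoking each script with no arguments is an assumption, and the real scripts may require flags, paths, or environment variables. The last stage is a search tool, so in practice it may be run on demand rather than as part of a batch. The `build_commands` and `run_pipeline` helpers are hypothetical.

```python
import subprocess
import sys

# Order taken from the pipeline description; arguments are deliberately omitted
# because the scripts' command-line interfaces are not documented here.
STAGES = [
    "process_books.py",
    "rechunk_v3.py",
    "enrich_chunks.py",
    "embed_doutrina.py",
    "search_doutrina_v2.py",
]


def build_commands(python=sys.executable, stages=STAGES):
    """Build one command (as an argument list) per pipeline stage."""
    return [[python, script] for script in stages]


def run_pipeline(commands, dry_run=True):
    """Handle commands in order and return the ones processed.

    With dry_run=False each command is executed, and the first failing
    stage aborts the run by raising subprocess.CalledProcessError.
    """
    handled = []
    for command in commands:
        if not dry_run:
            subprocess.run(command, check=True)
        handled.append(command)
    return handled


# Dry run: nothing is executed, the planned commands are only collected.
planned = run_pipeline(build_commands(python="python"), dry_run=True)
```

Using `check=True` means a failure in an early stage stops the run, so later stages never operate on incomplete output.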
Navigable Knowledge Base
An Obsidian-style markdown hierarchy organized by legal domain:
    INDEX_DOUTO.md (root, 8 legal domains)
    └── MOC_CIVIL.md (35 books, ~9,365 chunks)
    └── MOC_PROCESSUAL.md (8 books, ~22,182 chunks)
    └── MOC_EMPRESARIAL.md (7 books)
    └── nodes/ (atomic notes, planned)

Who Uses Douto
Douto serves three audiences:
| Audience | How they use Douto | Available today? |
|---|---|---|
| Lawyers | Search doctrine via Juca frontend during case research | Not yet — requires v0.4 integration |
| AI agents | Query doctrine via MCP/API during briefings and analysis | Not yet — MCP planned for v0.4 |
| Developers | Extend the pipeline, add books, improve the knowledge base | Yes — via CLI |
Where Douto Fits
In the sens.legal ecosystem, each agent handles a different pillar of legal knowledge:
| Agent | Pillar | Current corpus |
|---|---|---|
| Valter | Case law (jurisprudência) | 23,400+ STJ decisions |
| Leci | Legislation (legislação) | Federal laws |
| Douto | Doctrine (doutrina) | ~50 books, ~31,500 chunks |
| Joseph | Orchestration | Coordinates all agents |
| Juca | Frontend | Presents results to lawyers |
When fully integrated, a lawyer asking Juca about a legal concept will receive a unified view combining case law from Valter, legislation from Leci, and doctrine from Douto.
Core Concepts
These terms appear throughout the documentation:
| Term | Definition |
|---|---|
| Chunk | A semantically coherent fragment of a legal book (200-1,000 tokens), produced by rechunk_v3.py, with YAML frontmatter metadata |
| Instituto jurídico | A legal concept or institute — e.g., exceptio non adimpleti contractus, boa-fé objetiva. The fundamental unit of classification. |
| Enrichment | The process of classifying chunks with structured metadata using an LLM (currently MiniMax M2.5) |
| Embedding | A 768-dimensional vector representing a chunk’s semantic content, generated by Legal-BERTimbau |
| MOC | Map of Content — an index file listing all books within a legal domain |
| Skill graph | The hierarchical knowledge structure: INDEX → MOCs → Chunks → Atomic Notes |
| Hybrid search | Combination of semantic search (cosine similarity) and BM25 (keyword matching) with weighted scoring |
For full definitions, see the Glossary.
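To make the hybrid search entry more concrete, here is a minimal sketch of weighted scoring that blends cosine similarity with BM25. It assumes the BM25 scores are already computed by some keyword index and rescales them to the 0-1 range with min-max normalization before mixing. The `alpha` weight, the normalization choice, and the function names are assumptions for illustration, not the behavior of search_doutrina_v2.py. Toy 2-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def min_max(scores):
    """Rescale scores to [0, 1]; all-equal inputs map to 0.0."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(score - low) / (high - low) for score in scores]


def hybrid_scores(semantic, bm25, alpha=0.7):
    """Blend per-chunk scores: alpha * semantic + (1 - alpha) * normalized BM25."""
    if len(semantic) != len(bm25):
        raise ValueError("semantic and bm25 must have one score per chunk")
    bm25_normalized = min_max(bm25)
    return [
        alpha * sem + (1 - alpha) * kw
        for sem, kw in zip(semantic, bm25_normalized)
    ]


# Toy data: one query vector, three chunk vectors, and assumed raw BM25 scores.
query = [1.0, 0.0]
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
raw_bm25 = [0.0, 4.0, 2.0]

semantic = [cosine_similarity(query, vector) for vector in chunk_vectors]
scores = hybrid_scores(semantic, raw_bm25, alpha=0.5)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the third chunk ranks first because it is moderately strong on both signals, while the other two are each strong on only one.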
What Douto Does NOT Do
Clear boundaries from AGENTS.md:
- Does not manage cases — that’s Joseph (orchestrator)
- Does not search case law — that’s Valter (23,400+ STJ decisions)
- Does not search legislation — that’s Leci (federal laws)
- Does not manage infrastructure — that’s Valter (FastAPI backend)
- Does not serve a web interface — that’s Juca (Next.js frontend)