Introduction

Douto is the doctrine knowledge agent of the sens.legal platform. It processes legal textbooks into structured, searchable knowledge that lawyers and AI agents can query in real time.

Legal research in Brazil requires consulting multiple authors on the same legal concept. A lawyer researching exceptio non adimpleti contractus might need to compare positions from Orlando Gomes, Caio Mário, and Pontes de Miranda — each in a different book, different chapter, different edition. This manual cross-referencing typically takes 2-4 hours per concept.

Current legal tech platforms (Jusbrasil, Turivius, Vlex, LexisNexis) perform search — they return documents that match a query. None of them perform synthesis — aggregating and comparing doctrinal positions across authors.

Douto bridges this gap by transforming raw legal textbooks into structured, classified, searchable knowledge with metadata that enables filtering by legal concept, content type, legal branch, and procedural phase.
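As a rough illustration of that filtering, the sketch below models a chunk's metadata and filters on it. The `Chunk` class, its field names, and the sample values are hypothetical stand-ins for illustration, not Douto's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """Hypothetical in-memory view of one enriched chunk (field names are assumed)."""
    text: str
    instituto: str                      # legal concept
    content_type: str                   # placeholder values such as "definition"
    legal_branch: str                   # placeholder values such as "civil"
    procedural_phase: Optional[str] = None


def filter_chunks(chunks, **criteria):
    """Keep only the chunks whose metadata equals every given criterion."""
    return [
        c for c in chunks
        if all(getattr(c, field) == value for field, value in criteria.items())
    ]


# Placeholder data; the text bodies are elided on purpose.
chunks = [
    Chunk("...", "exceptio non adimpleti contractus", "definition", "civil"),
    Chunk("...", "boa-fé objetiva", "definition", "civil"),
    Chunk("...", "tutela provisória", "procedure", "processual", "conhecimento"),
]

hits = filter_chunks(chunks, legal_branch="civil", content_type="definition")
print([c.instituto for c in hits])
```

Exact-match filtering is the simplest possible choice here; a real index would more likely push these filters down into the search backend.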

Douto operates in two complementary modes:

The first is a processing pipeline: five Python scripts, executed sequentially, transform legal PDFs into searchable data:

PDF → process_books.py → rechunk_v3.py → enrich_chunks.py → embed_doutrina.py → search_doutrina_v2.py

Each stage adds structure: raw PDF becomes markdown, markdown becomes intelligent chunks, chunks get classified with legal metadata, metadata-enriched text becomes embeddings, and embeddings enable semantic search.
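The stage order above can be captured in a small driver. This is a minimal sketch only: it assumes each script can be invoked with no arguments, which this page does not confirm, and `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path

# Stage order as listed in the pipeline diagram, with what each stage produces.
STAGES = [
    ("process_books.py", "raw PDF to markdown"),
    ("rechunk_v3.py", "markdown to chunks"),
    ("enrich_chunks.py", "chunks to chunks with legal metadata"),
    ("embed_doutrina.py", "enriched text to embeddings"),
    ("search_doutrina_v2.py", "semantic search over embeddings"),
]


def build_commands(script_dir="."):
    """Return one argv list per stage, in order (no CLI flags are assumed)."""
    base = Path(script_dir)
    return [[sys.executable, str(base / script)] for script, _ in STAGES]


def run_pipeline(script_dir=".", runner=subprocess.run):
    """Run the stages in order; check=True stops at the first failing stage."""
    for argv in build_commands(script_dir):
        runner(argv, check=True)
```

Passing a different `runner` makes it possible to dry-run the sequence, for example by printing each command instead of executing it.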

The second is a skill graph: an Obsidian-style markdown hierarchy organized by legal domain:

INDEX_DOUTO.md (root — 8 legal domains)
├── MOC_CIVIL.md (35 books, ~9,365 chunks)
├── MOC_PROCESSUAL.md (8 books, ~22,182 chunks)
├── MOC_EMPRESARIAL.md (7 books)
└── nodes/ (atomic notes — planned)

Douto serves three audiences:

| Audience | How they use Douto | Available today? |
| --- | --- | --- |
| Lawyers | Search doctrine via Juca frontend during case research | Not yet (requires v0.4 integration) |
| AI agents | Query doctrine via MCP/API during briefings and analysis | Not yet (MCP planned for v0.4) |
| Developers | Extend the pipeline, add books, improve the knowledge base | Yes, via CLI |

In the sens.legal ecosystem, each agent handles a different pillar of legal knowledge:

| Agent | Pillar | Current corpus |
| --- | --- | --- |
| Valter | Case law (jurisprudência) | 23,400+ STJ decisions |
| Leci | Legislation (legislação) | Federal laws |
| Douto | Doctrine (doutrina) | ~50 books, ~31,500 chunks |
| Joseph | Orchestration | Coordinates all agents |
| Juca | Frontend | Presents results to lawyers |

When fully integrated, a lawyer asking Juca about a legal concept will receive a unified view combining case law from Valter, legislation from Leci, and doctrine from Douto.

These terms appear throughout the documentation:

| Term | Definition |
| --- | --- |
| Chunk | A semantically coherent fragment of a legal book (200-1,000 tokens), produced by rechunk_v3.py, with YAML frontmatter metadata |
| Instituto jurídico | A legal concept or institute, e.g. exceptio non adimpleti contractus, boa-fé objetiva. The fundamental unit of classification. |
| Enrichment | The process of classifying chunks with structured metadata using an LLM (currently MiniMax M2.5) |
| Embedding | A 768-dimensional vector representing a chunk’s semantic content, generated by Legal-BERTimbau |
| MOC | Map of Content: an index file listing all books within a legal domain |
| Skill graph | The hierarchical knowledge structure: INDEX → MOCs → Chunks → Atomic Notes |
| Hybrid search | Combination of semantic search (cosine similarity) and BM25 (keyword matching) with weighted scoring |

For full definitions, see the Glossary.
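To make the hybrid search definition concrete, here is a minimal sketch of weighted scoring over cosine similarity and BM25. It is not the logic of search_doutrina_v2.py: the weight `alpha`, the BM25 parameters `k1` and `b`, the max-normalization of keyword scores, and the toy 3-dimensional vectors (standing in for 768-dimensional embeddings) are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document for the query."""
    n = len(docs_tokens)
    if n == 0:
        return []
    avg_len = sum(len(d) for d in docs_tokens) / n
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def hybrid_rank(query_vec, query_tokens, doc_vecs, docs_tokens, alpha=0.7):
    """Return document indices, best first, by alpha*semantic + (1-alpha)*keyword."""
    semantic = [cosine(query_vec, v) for v in doc_vecs]
    keyword = bm25_scores(query_tokens, docs_tokens)
    top = max(keyword, default=0.0) or 1.0  # scale BM25 into [0, 1]
    combined = [alpha * s + (1 - alpha) * (k / top) for s, k in zip(semantic, keyword)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
docs_tokens = [
    ["boa-fé", "objetiva"],
    ["contrato", "social"],
    ["exceptio", "non", "adimpleti", "contractus"],
]
ranking = hybrid_rank([1.0, 0.0, 0.0], ["boa-fé"], doc_vecs, docs_tokens)
print(ranking)
```

Scaling BM25 by its maximum keeps both signals on a comparable range before weighting; other normalizations (min-max, rank fusion) would be equally defensible.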

AGENTS.md sets clear boundaries on what Douto does not do:

  • Does not manage cases — that’s Joseph (orchestrator)
  • Does not search case law — that’s Valter (23,400+ STJ decisions)
  • Does not search legislation — that’s Leci (federal laws)
  • Does not manage infrastructure — that’s Valter (FastAPI backend)
  • Does not serve a web interface — that’s Juca (Next.js frontend)