# Frequently Asked Questions

Answers to common questions about Douto, organized by audience.
## For Developers

### How do I add a new book to the corpus?

Run the full pipeline sequentially. Place the PDF in the staging input directory, then execute each step:
```sh
# 1. Place PDF in input directory
cp livro.pdf $VAULT_PATH/Knowledge/_staging/input/

# 2. Extract PDF to markdown
python3 pipeline/process_books.py livro.pdf

# 3. Chunk the markdown
python3 pipeline/rechunk_v3.py slug-of-book

# 4. Enrich chunks with metadata
python3 pipeline/enrich_chunks.py slug-of-book

# 5. Generate embeddings
python3 pipeline/embed_doutrina.py

# 6. Verify with a search
python3 pipeline/search_doutrina_v2.py "query from the book" --area contratos
```

### Why doesn't the pipeline run on my machine?
The most common cause is hardcoded paths. Two of the five scripts (`process_books.py` and `rechunk_v3.py`) have absolute paths from the creator's machine baked into the source code:

- `process_books.py`, line 27: `/home/sensd/.openclaw/workspace/vault`
- `rechunk_v3.py`, line 29: `/mnt/c/Users/sensd/vault`
**Workaround:** Edit the `VAULT_PATH` line in each script to point to your local vault path.

**Permanent fix:** F22 (v0.2) will standardize all scripts to use `os.environ.get("VAULT_PATH")`.
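The planned F22 change can be sketched as follows. This is a minimal sketch, not the project's actual code: the fallback path and the `STAGING_INPUT` helper are illustrative assumptions.

```python
import os

# Read the vault location from the environment instead of hardcoding it.
# The fallback below is illustrative; F22 may instead fail fast when unset.
VAULT_PATH = os.environ.get("VAULT_PATH", os.path.expanduser("~/vault"))

# Derived paths then work on any machine (illustrative helper).
STAGING_INPUT = os.path.join(VAULT_PATH, "Knowledge", "_staging", "input")
```

Exporting `VAULT_PATH` once in your shell profile then makes every script portable.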
See Environment Variables for the full variable reference.
### Why is there no database?

Douto stores embeddings and search indices as flat JSON files. This was the simplest approach for a single-machine prototype. The trade-offs:
| Flat JSON (current) | Vector DB (planned) |
|---|---|
| No infrastructure needed | Requires Qdrant/FAISS setup |
| Simple to debug (human-readable) | Binary/opaque storage |
| Full load into memory per query | Indexed, sub-second queries |
| ~2 GB for 31,500 chunks | Compact, scalable storage |
| Does not scale past ~100 books | Scales to millions of vectors |
Migration to a vector DB (likely Qdrant, since Valter already uses it) is planned for v0.4 (mitigation M12).
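The "full load into memory per query" trade-off can be illustrated with a minimal sketch. The file layout and field names (`id`, `embedding`) are assumptions for illustration, not the project's actual schema:

```python
import json

import numpy as np


def search_flat_json(index_path: str, query_vec: np.ndarray, top_k: int = 5):
    """Naive flat-file search: every query loads and scans the whole index."""
    with open(index_path, encoding="utf-8") as f:
        chunks = json.load(f)  # entire index in memory (~2 GB for 31,500 chunks)

    # Cosine similarity of the query against every stored vector
    vecs = np.array([c["embedding"] for c in chunks], dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = vecs @ q

    best = np.argsort(scores)[::-1][:top_k]
    return [(chunks[i]["id"], float(scores[i])) for i in best]
```

A vector database replaces the linear scan above with an approximate index, which is what makes sub-second queries possible at scale.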
### Why MiniMax M2.5 instead of Claude for enrichment?

Cost. Enriching 31,500 chunks with a classification prompt requires significant token throughput. MiniMax M2.5 is substantially cheaper than Claude for this batch workload. The trade-off is quality: MiniMax is a generic model, not fine-tuned for Brazilian law.
This is an open decision (D06). Options under evaluation:
| Option | Cost | Quality | Dependency |
|---|---|---|---|
| MiniMax M2.5 (current) | Low | Unknown (unvalidated) | Fragile SDK hack |
| Claude | Higher | Likely better | Ecosystem-consistent |
| Local model | Zero | Unknown | Setup complexity |
### Can I use a different embedding model?

Technically yes, but it requires re-embedding the entire corpus (~31,500 chunks). The model is hardcoded in `embed_doutrina.py` (line 24) and `search_doutrina_v2.py` (line 24) as `rufimelo/Legal-BERTimbau-sts-base`.
Important considerations:
- All existing embeddings become incompatible (different vector space)
- Search quality may improve or degrade — there is no benchmark comparison yet (planned in F40)
- The current model was trained on Portuguese (PT-PT) legal text, which may not be optimal for Brazilian legal terminology
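Because vectors from different models occupy different spaces, a guard before searching can prevent silently mixing incompatible embeddings. This is a hypothetical helper; nothing like it exists in the project yet:

```python
def check_embedding_compat(stored_model: str, stored_dim: int,
                           current_model: str, current_dim: int) -> None:
    """Refuse to search when the index was built with a different model."""
    if stored_model != current_model or stored_dim != current_dim:
        raise ValueError(
            f"Index was built with {stored_model} ({stored_dim}d) but the "
            f"current model is {current_model} ({current_dim}d); "
            "re-embed the corpus before searching."
        )
```

Storing the model name and vector dimension alongside the embeddings is what makes such a check possible after a model swap.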
### Where is the enrichment prompt?

It is not in the repository. The prompt currently exists only on the developer's machine (see the bus-factor question below).

### How do I contribute tests?

This is the highest-impact contribution you can make. Douto has 0% test coverage.
- Create a `tests/` directory structure (see Testing)
- Add `pytest` to the development dependencies
- Start with `rechunk_v3.py` functions: `detect_section()`, `classify_title()`, `smart_split()`
- Use real markdown snippets from legal books as fixtures
- Mock all external API calls (MiniMax, LlamaParse, HuggingFace)
See the Testing page for the full planned strategy and example tests.
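A first test might look like the sketch below. The signature of `detect_section()` is assumed, not verified against the real code; the stub stands in for an import from `pipeline/rechunk_v3.py`:

```python
# Hypothetical first test for rechunk_v3.py (signature assumed, not verified).
# In a real test, replace the stub with:
#     from pipeline.rechunk_v3 import detect_section

def detect_section(line: str) -> str:  # stub standing in for the real function
    """Classify a markdown line as a heading or body text."""
    return "heading" if line.lstrip().startswith("#") else "body"


def test_detect_section_flags_headings():
    assert detect_section("## Capítulo 1") == "heading"


def test_detect_section_ignores_body_text():
    assert detect_section("O contrato é um negócio jurídico.") == "body"
```

Running `pytest tests/` would pick up both functions automatically.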
## For Users (Lawyers)

### What legal domains does Douto cover?

Currently, three domains have populated content:
| Domain | Books | Chunks | Coverage |
|---|---|---|---|
| Direito Civil | 35 | ~9,365 | Contracts, obligations, civil liability, property |
| Direito Processual Civil | 8 | ~22,182 | CPC commentary, general theory, procedures |
| Direito Empresarial | 7 | — | Venture capital, smart contracts, commercial litigation |
**Gaps:** Direito do Consumidor has a placeholder MOC. Tributário, Constitucional, Compliance, and Sucessões have no content at all. If you search for a topic in an uncovered domain, you will get empty results.
### How accurate are the search results?

Honest answer: we do not know. There is no evaluation set, no accuracy benchmark, and no human validation of search quality.
What we do know:
- The hybrid search combines semantic similarity (meaning) with keyword matching (exact terms)
- Results are ranked by a combined score (70% semantic, 30% keyword by default)
- Metadata filters (by instituto, ramo, tipo) depend on the enrichment quality, which has not been validated
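The combined score described above can be sketched as follows. The weights come from the default 70/30 split; the function name is illustrative, not the project's actual code:

```python
def hybrid_score(semantic: float, keyword: float,
                 w_semantic: float = 0.7, w_keyword: float = 0.3) -> float:
    """Weighted combination of semantic and keyword scores (default 70/30)."""
    return w_semantic * semantic + w_keyword * keyword


# A chunk with a strong semantic match but weak keyword overlap:
# hybrid_score(0.9, 0.2) is approximately 0.69
```

Note that the weighting means a chunk can rank highly on meaning alone even when the exact query terms never appear in it.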
Quality measurement is planned for v0.2.5 (validation of 200 chunks) and v0.5 (formal eval set with 30+ queries).
### Can I trust the citations?

With caution. Citations include the book title, author, and chapter, but not page numbers. There are known risks:
- Chunking errors — the chunk boundary may not align with the chapter boundary in the original book, leading to misattribution (e.g., citing Chapter 5 when the content is from Chapter 4)
- Quotation nesting — legal authors frequently quote other authors at length. A chunk may be attributed to the book’s author when the content is actually a quotation from another scholar
- No edition tracking — if a newer edition of a book is processed, old chunks remain in the index. You may receive citations from an outdated edition
**Recommendation:** Always verify doctrinal citations against the original source before using them in legal documents.
### Will this replace legal research?

No. Douto is a search and retrieval tool, not a replacement for legal analysis. It helps you find relevant doctrinal passages faster, but:
- The corpus is limited (~50 books, not comprehensive)
- Metadata may contain errors
- No system can replace a lawyer’s judgment about relevance and applicability
- The tool does not understand the nuances of your specific case
## For Stakeholders

### What is the competitive advantage?

Douto's differentiator is structured metadata on Brazilian legal doctrine. Each of the ~31,500 chunks is classified with its legal institute, content type, procedural phase, branch of law, and statutory references. This enables filtered semantic search that generic legal search engines cannot do.
No competitor currently offers this level of structured access to Brazilian legal textbooks.
### What is the IP situation?

### What is the timeline to production?

Based on the current roadmap:
| Milestone | Target | What it enables |
|---|---|---|
| v0.2 | ~March 2026 | Pipeline runs on any machine |
| v0.3 | ~May 2026 | Tests, docs, lint — project is contributable |
| v0.4 | ~August 2026 | MCP server — Valter can query doctrine |
Caveats:
- The roadmap is maintained by a solo developer managing 5 repositories
- 7 architectural decisions are unresolved, 2 of which block v0.4
- No external users have tested the system
- Timelines are estimates, not commitments
### How much does it cost to run?

| Component | Cost type | Estimate |
|---|---|---|
| LlamaParse | Per-PDF, one-time | ~$0.01–0.10 per book (`cost_effective` tier) |
| MiniMax M2.5 | Per-chunk enrichment | Low (exact pricing varies) |
| Legal-BERTimbau | Free (open source model) | $0 |
| Compute | CPU/GPU for embeddings | Local machine, no cloud cost |
| Storage | JSON files | ~2 GB for current corpus |
### What happens if the solo developer is unavailable?

This is identified as risk RE01 (highest probability in the PREMORTEM). Currently:
- The pipeline runs only on the developer’s machine
- The enrichment prompt is not in the repository
- Dependencies are unpinned
- There are no tests and no CI/CD
- Documentation is in progress (these docs)
The v0.2 and v0.3 milestones specifically address this bus-factor risk by making the project portable and contributable. Until those milestones are completed, another developer would face significant onboarding friction.