# External Integrations

How Douto connects to external services and the sens.legal ecosystem.
## LlamaParse (PDF Extraction)

- **Service:** LlamaParse by LlamaIndex
- **Purpose:** Convert legal PDF textbooks into structured markdown
- **Used by:** `process_books.py`
- **Auth:** `LLAMA_CLOUD_API_KEY` environment variable (loaded implicitly by the SDK)
- Create a free account at cloud.llamaindex.ai.
- Generate an API key from the dashboard.
- Set the environment variable:
```bash
export LLAMA_CLOUD_API_KEY="llx-your-key-here"
```

LlamaParse offers three processing tiers. The default in Douto is `cost_effective`:
| Tier | Best for | Speed | Cost |
|---|---|---|---|
| `agentic` | Scanned PDFs, complex tables, multi-column layouts | Slowest | Highest |
| `cost_effective` | Clean-text legal textbooks (default) | Medium | Medium |
| `fast` | Simple text-only documents | Fastest | Lowest |
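As an illustration of how a tier flag of this shape might be parsed, here is a minimal `argparse` sketch; this is hypothetical wiring, not the actual `process_books.py` code:

```python
import argparse

# Hypothetical sketch of a --tier flag like the one process_books.py exposes.
parser = argparse.ArgumentParser(description="Convert legal PDF textbooks to markdown")
parser.add_argument("pdfs", nargs="+", help="PDF files to process")
parser.add_argument(
    "--tier",
    choices=["agentic", "cost_effective", "fast"],
    default="cost_effective",  # Douto's default tier
    help="LlamaParse processing tier",
)

# Parse an example command line.
args = parser.parse_args(["--tier", "agentic", "livro.pdf"])
```

With no `--tier` argument, `args.tier` falls back to the `cost_effective` default.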
Override the tier per run:
```bash
python3 pipeline/process_books.py --tier agentic livro.pdf
```

## Usage Notes
- PDF extraction is a one-time operation per book; once converted to markdown, the original PDF is not needed again by the pipeline.
- Processed markdown is saved to `$VAULT_PATH/Knowledge/_staging/processed/{slug}/`.
- If extraction fails, the PDF is moved to `$VAULT_PATH/Knowledge/_staging/failed/`.
- LlamaParse uses `asyncio` internally; this is the only async component in the pipeline.
## MiniMax M2.5 (Chunk Enrichment)

- **Service:** MiniMax M2.5 LLM
- **Purpose:** Classify chunks with structured legal metadata (`instituto`, `tipo_conteudo`, `ramo`, etc.)
- **Used by:** `enrich_chunks.py`
- **Auth:** `MINIMAX_API_KEY` environment variable
- Obtain an API key from MiniMax.
- Set the environment variable:
```bash
export MINIMAX_API_KEY="your-minimax-api-key"
```

## The Anthropic SDK Hack
## Concurrency Settings

Enrichment runs with 5 concurrent threads and a 0.5-second delay between requests to avoid rate limiting:
| Parameter | Value |
|---|---|
| `WORKERS` | 5 threads |
| `DELAY_BETWEEN_REQUESTS` | 0.5 seconds |
| Model | `MiniMax-M2.5` |
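The worker/delay pattern in the table can be sketched with the standard library; the enrichment call is a stub standing in for the MiniMax API, and all names here are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 5
DELAY_BETWEEN_REQUESTS = 0.5  # seconds between request submissions

def enrich_chunk(chunk: str) -> dict:
    # Stub standing in for the MiniMax API call.
    return {"chunk": chunk, "instituto": "?"}

def enrich_all(chunks: list[str]) -> list[dict]:
    """Submit chunks to a 5-thread pool, spacing submissions 0.5 s apart."""
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = []
        for chunk in chunks:
            futures.append(pool.submit(enrich_chunk, chunk))
            time.sleep(DELAY_BETWEEN_REQUESTS)  # simple client-side rate limit
        return [f.result() for f in futures]
```

Note the delay throttles *submission*, so at most one new request starts per 0.5-second window even when all 5 workers are idle.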
## Missing Prompt File

## Pending Decision: D06

The choice of MiniMax M2.5 as the enrichment model is under review. Options being evaluated:
| Option | Pros | Cons |
|---|---|---|
| Keep MiniMax M2.5 | Works, cheap | Fragile SDK hack, generic model |
| Migrate to Claude | Ecosystem consistency | Higher cost |
| Local model | Zero cost, no dependency | Slower, setup complexity |
| Evaluate later | No effort now | Risk compounds |
## HuggingFace (Embedding Model)

- **Service:** HuggingFace Hub (public model)
- **Purpose:** Download and cache the Legal-BERTimbau embedding model
- **Used by:** `embed_doutrina.py`, `search_doutrina_v2.py`
- **Auth:** None required (public model)
## Model Details

| Property | Value |
|---|---|
| Model ID | rufimelo/Legal-BERTimbau-sts-base |
| Dimensions | 768 |
| Max tokens | 512 |
| Language | Portuguese (trained on PT-PT legal corpus) |
| Size on disk | ~500 MB |
| License | Open source |
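Downstream, search compares these 768-dimensional vectors; as a dependency-free illustration of the usual similarity measure (the exact scoring inside `search_doutrina_v2.py` is an assumption here):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (e.g. 768-dim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0, orthogonal vectors 0.0; ranking chunks by this score against a query embedding is the vector half of a hybrid search.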
The model is automatically downloaded on first run by the sentence-transformers library. No manual setup is needed.
To pre-download the model (useful for offline environments or Docker):
```bash
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base')"
```

To control where the model is cached:
```bash
export HF_HOME="/path/to/cache"
# or, specifically for sentence-transformers:
export SENTENCE_TRANSFORMERS_HOME="/path/to/cache"
```

## sens.legal Ecosystem Integration
## Current State (v0.1)

Integration with the broader sens.legal ecosystem is currently through static JSON files:
`embed_doutrina.py` deposits three JSON artifacts in the Juca/Valter data directory:

- `embeddings_doutrina.json`
- `search_corpus_doutrina.json` (`$OUTPUT_PATH`)
- `bm25_index_doutrina.json`

Limitations of this approach:

- No real-time query capability from other agents.
- No API or protocol.
- Valter and Juca read the JSON files from a shared filesystem path.
- Updates require re-running the embedding pipeline and restarting consumers.
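A hedged sketch of how a consumer such as Valter might load the artifacts from the shared path (the file names are those listed above; the loader function itself is hypothetical):

```python
import json
from pathlib import Path

def load_doutrina_artifacts(shared_dir: Path) -> dict:
    """Hypothetical loader for Douto's static JSON hand-off."""
    artifacts = {}
    for name in (
        "embeddings_doutrina.json",
        "search_corpus_doutrina.json",
        "bm25_index_doutrina.json",
    ):
        # Each artifact is read whole at startup; there is no incremental update.
        artifacts[name] = json.loads((shared_dir / name).read_text(encoding="utf-8"))
    return artifacts
```

This mirrors the limitation noted above: consumers see new embeddings only after the pipeline re-runs and they reload the files.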
## Ecosystem Components

| Component | Role | Stack | Douto’s relationship |
|---|---|---|---|
| Valter | Backend API — STJ case law, knowledge graph, vector search | FastAPI, PostgreSQL, Qdrant, Neo4j, Redis | Primary consumer of Douto’s embeddings |
| Juca | Frontend — user interface for lawyers | Next.js | Accesses doctrine through Valter |
| Leci | Legislation service | Next.js, PostgreSQL, Drizzle | Future cross-reference target (F35) |
| Joseph | Orchestrator — coordinates agents | — | Future coordination with Douto queries |
## Planned Integration (v0.4)

**Planned Feature** — an MCP server for doctrine search is on the roadmap (F30) but not yet implemented.
The v0.4 milestone will establish programmatic integration between Douto and the sens.legal ecosystem:
MCP Server with at least 3 tools:
| Tool | Description |
|---|---|
| `search_doutrina` | Hybrid search across the doctrine corpus |
| `get_chunk` | Retrieve a specific chunk by ID with full metadata |
| `list_areas` | List available legal domains with corpus statistics |
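Pending the transport decision, the tool surface can be sketched protocol-agnostically as a name-to-handler dispatch; this is not the real MCP SDK wiring, and all handler bodies below are stubs:

```python
from typing import Any, Callable

# Stub handlers standing in for real corpus-backed implementations.
def search_doutrina(query: str, top_k: int = 5) -> list[dict]:
    return [{"chunk_id": "stub-1", "score": 0.0, "query": query}][:top_k]

def get_chunk(chunk_id: str) -> dict:
    return {"chunk_id": chunk_id, "text": "...", "metadata": {}}

def list_areas() -> list[str]:
    return ["civil", "penal"]

# The tool registry any of the D01 transports (stdio, HTTP/SSE, REST) would expose.
TOOLS: dict[str, Callable[..., Any]] = {
    "search_doutrina": search_doutrina,
    "get_chunk": get_chunk,
    "list_areas": list_areas,
}

def call_tool(name: str, **kwargs: Any) -> Any:
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

Keeping the registry transport-neutral means the D01 choice only changes the outer server layer, not the tool implementations.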
Protocol Decision (D01) — not yet resolved:
| Option | Description | Pros | Cons |
|---|---|---|---|
| MCP stdio | Standard MCP transport | Aligned with Valter’s MCP | Process-per-query overhead |
| MCP HTTP/SSE | Persistent MCP connection | More flexible, lower latency | More infrastructure |
| REST API (FastAPI) | Conventional HTTP API | Simple, well-understood | Not aligned with MCP ecosystem |
| Keep JSON files | Current approach | Zero effort | No real-time queries, doesn’t scale |
Architecture Decision (D02) — not yet resolved:
Whether Douto remains an independent service or is absorbed as a module within Valter (valter/stores/doutrina/). This decision blocks v0.4.
## Integration Diagram

```mermaid
graph TB
  subgraph "Current (v0.1)"
    ED["embed_doutrina.py"] -->|JSON files| DIR["Shared directory"]
    DIR -->|reads| VA1["Valter"]
  end
  subgraph "Planned (v0.4)"
    MCP["Douto MCP Server"] -->|search_doutrina| VA2["Valter"]
    MCP -->|get_chunk| VA2
    MCP -->|list_areas| VA2
    VA2 -->|via Valter| JU["Juca"]
    CD["Claude Desktop"] -->|MCP| MCP
  end
```