# External Integrations

How Douto connects to external services and the sens.legal ecosystem.

## LlamaParse

- **Service:** LlamaParse by LlamaIndex
- **Purpose:** Convert legal PDF textbooks into structured markdown
- **Used by:** `process_books.py`
- **Auth:** `LLAMA_CLOUD_API_KEY` environment variable (loaded implicitly by the SDK)

  1. Create a free account at cloud.llamaindex.ai.
  2. Generate an API key from the dashboard.
  3. Set the environment variable:
```sh
export LLAMA_CLOUD_API_KEY="llx-your-key-here"
```

LlamaParse offers three processing tiers. The default in Douto is `cost_effective`:

| Tier | Best for | Speed | Cost |
|---|---|---|---|
| `agentic` | Scanned PDFs, complex tables, multi-column layouts | Slowest | Highest |
| `cost_effective` | Clean-text legal textbooks (default) | Medium | Medium |
| `fast` | Simple text-only documents | Fastest | Lowest |

Override the tier per run:

```sh
python3 pipeline/process_books.py --tier agentic livro.pdf
```
  • PDF extraction is a one-time operation per book. Once converted to markdown, the original PDF is not needed again by the pipeline.
  • Processed markdown is saved to $VAULT_PATH/Knowledge/_staging/processed/{slug}/.
  • If extraction fails, the PDF is moved to $VAULT_PATH/Knowledge/_staging/failed/.
  • LlamaParse uses asyncio internally. This is the only async component in the pipeline.
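The staging rules above can be sketched as a small helper. This is a hypothetical illustration of the described behavior, not the actual `process_books.py` implementation; the slug derivation in particular is an assumption.

```python
import os
import shutil
from pathlib import Path
from typing import Optional

def stage_result(pdf: Path, markdown: Optional[str]) -> Path:
    """Move a book into processed/ or failed/ based on extraction outcome.

    Hypothetical sketch of the staging rules described above; the real
    process_books.py may implement this differently.
    """
    staging = Path(os.environ["VAULT_PATH"]) / "Knowledge" / "_staging"
    if markdown is not None:
        # Success: write markdown under processed/{slug}/ — the original
        # PDF is no longer needed by the pipeline after this point.
        slug = pdf.stem.lower().replace(" ", "-")
        out_dir = staging / "processed" / slug
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{slug}.md").write_text(markdown, encoding="utf-8")
        return out_dir
    # Failure: quarantine the original PDF for manual inspection.
    failed = staging / "failed"
    failed.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(pdf), str(failed / pdf.name)))
```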

## MiniMax M2.5

- **Service:** MiniMax M2.5 LLM
- **Purpose:** Classify chunks with structured legal metadata (`instituto`, `tipo_conteudo`, `ramo`, etc.)
- **Used by:** `enrich_chunks.py`
- **Auth:** `MINIMAX_API_KEY` environment variable

  1. Obtain an API key from MiniMax.
  2. Set the environment variable:
```sh
export MINIMAX_API_KEY="your-minimax-api-key"
```

Enrichment runs with 5 concurrent threads and a 0.5-second delay between requests to avoid rate limiting:

| Parameter | Value |
|---|---|
| `WORKERS` | 5 threads |
| `DELAY_BETWEEN_REQUESTS` | 0.5 seconds |
| Model | MiniMax-M2.5 |
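The throttling scheme (5 workers, 0.5 s between submissions) can be sketched with a standard thread pool. The function names here are hypothetical; `classify` stands in for the MiniMax API call made by `enrich_chunks.py`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 5                   # concurrent enrichment threads
DELAY_BETWEEN_REQUESTS = 0.5  # seconds between request submissions

def enrich_all(chunks, classify):
    """Classify chunks concurrently while pacing submissions.

    Hypothetical sketch of the throttling described above; `classify`
    stands in for the per-chunk MiniMax API call.
    """
    results = []
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        futures = []
        for chunk in chunks:
            futures.append(pool.submit(classify, chunk))
            time.sleep(DELAY_BETWEEN_REQUESTS)  # simple client-side rate limiting
        for future in futures:
            results.append(future.result())  # preserves input order
    return results
```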

The choice of MiniMax M2.5 as the enrichment model is under review. Options being evaluated:

| Option | Pros | Cons |
|---|---|---|
| Keep MiniMax M2.5 | Works, cheap | Fragile SDK hack, generic model |
| Migrate to Claude | Ecosystem consistency | Higher cost |
| Local model | Zero cost, no dependency | Slower, setup complexity |
| Evaluate later | No effort now | Risk compounds |

## Legal-BERTimbau (HuggingFace Hub)

- **Service:** HuggingFace Hub (public model)
- **Purpose:** Download and cache the Legal-BERTimbau embedding model
- **Used by:** `embed_doutrina.py`, `search_doutrina_v2.py`
- **Auth:** None required (public model)

| Property | Value |
|---|---|
| Model ID | `rufimelo/Legal-BERTimbau-sts-base` |
| Dimensions | 768 |
| Max tokens | 512 |
| Language | Portuguese (trained on PT-PT legal corpus) |
| Size on disk | ~500 MB |
| License | Open source |
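Downstream, the model's 768-dimensional vectors are typically compared by cosine similarity. A minimal pure-Python sketch of that comparison (the actual search code likely uses numpy or sentence-transformers utilities):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. the model's
    768-dimensional outputs. Illustrative sketch only; the pipeline's
    search code may compute this with numpy instead.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```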

The model is automatically downloaded on first run by the sentence-transformers library. No manual setup is needed.

To pre-download the model (useful for offline environments or Docker):

```sh
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base')"
```

To control where the model is cached:

```sh
export HF_HOME="/path/to/cache"
# or specifically:
export SENTENCE_TRANSFORMERS_HOME="/path/to/cache"
```

## sens.legal ecosystem

Integration with the broader sens.legal ecosystem currently happens through static JSON files:

```
embed_doutrina.py
        |
        v
embeddings_doutrina.json       ─── deposited in ──→ Juca/Valter data directory
search_corpus_doutrina.json ($OUTPUT_PATH)
bm25_index_doutrina.json
```
  • No real-time query capability from other agents.
  • No API or protocol.
  • Valter and Juca read the JSON files from a shared filesystem path.
  • Updates require re-running the embedding pipeline and restarting consumers.
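On the consumer side, this file-based integration amounts to reading JSON from the shared path. A hypothetical sketch — the file name matches the list above, but the payload schema is an assumption, not documented here:

```python
import json
from pathlib import Path

def load_doctrine_embeddings(shared_dir):
    """Read Douto's embedding artifact from the shared directory.

    Hypothetical consumer-side sketch of the file-based integration:
    Valter/Juca simply read the JSON from a shared filesystem path.
    The internal schema of the payload is assumed, not documented here.
    """
    path = Path(shared_dir) / "embeddings_doutrina.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Because consumers only see files, any corpus update means re-running the embedding pipeline, re-depositing the JSON, and restarting the readers.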
| Component | Role | Stack | Douto's relationship |
|---|---|---|---|
| Valter | Backend API: STJ case law, knowledge graph, vector search | FastAPI, PostgreSQL, Qdrant, Neo4j, Redis | Primary consumer of Douto's embeddings |
| Juca | Frontend: user interface for lawyers | Next.js | Accesses doctrine through Valter |
| Leci | Legislation service | Next.js, PostgreSQL, Drizzle | Future cross-reference target (F35) |
| Joseph | Orchestrator: coordinates agents | | Future coordination with Douto queries |

**Planned feature:** an MCP server for doctrine search is on the roadmap (F30) but not yet implemented.

The v0.4 milestone will establish programmatic integration between Douto and the sens.legal ecosystem:

MCP Server with at least 3 tools:

| Tool | Description |
|---|---|
| `search_doutrina` | Hybrid search across doctrine corpus |
| `get_chunk` | Retrieve a specific chunk by ID with full metadata |
| `list_areas` | List available legal domains with corpus statistics |
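For a sense of shape, the planned server might advertise these tools roughly as follows. The tool names come from the table above, but every input schema here is an assumption — nothing about parameters has been decided yet:

```python
# Hypothetical MCP tool declarations for the planned v0.4 server.
# Tool names come from the roadmap; the input schemas are assumptions.
TOOLS = [
    {
        "name": "search_doutrina",
        "description": "Hybrid search across the doctrine corpus",
        "inputSchema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "top_k": {"type": "integer", "default": 10},  # assumed parameter
            },
            "required": ["query"],
        },
    },
    {
        "name": "get_chunk",
        "description": "Retrieve a specific chunk by ID with full metadata",
        "inputSchema": {
            "type": "object",
            "properties": {"chunk_id": {"type": "string"}},  # assumed parameter
            "required": ["chunk_id"],
        },
    },
    {
        "name": "list_areas",
        "description": "List available legal domains with corpus statistics",
        "inputSchema": {"type": "object", "properties": {}},
    },
]
```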

Protocol Decision (D01) — not yet resolved:

| Option | Description | Pros | Cons |
|---|---|---|---|
| MCP stdio | Standard MCP transport | Aligned with Valter's MCP | Process-per-query overhead |
| MCP HTTP/SSE | Persistent MCP connection | More flexible, lower latency | More infrastructure |
| REST API (FastAPI) | Conventional HTTP API | Simple, well-understood | Not aligned with MCP ecosystem |
| Keep JSON files | Current approach | Zero effort | No real-time queries, doesn't scale |

Architecture Decision (D02) — not yet resolved:

Whether Douto remains an independent service or is absorbed as a module within Valter (valter/stores/doutrina/). This decision blocks v0.4.

```mermaid
graph TB
  subgraph "Current (v0.1)"
    ED["embed_doutrina.py"] -->|JSON files| DIR["Shared directory"]
    DIR -->|reads| VA1["Valter"]
  end
  subgraph "Planned (v0.4)"
    MCP["Douto MCP Server"] -->|search_doutrina| VA2["Valter"]
    MCP -->|get_chunk| VA2
    MCP -->|list_areas| VA2
    VA2 -->|via Valter| JU["Juca"]
    CD["Claude Desktop"] -->|MCP| MCP
  end
```