Technology Stack

Every technology used in Douto, why it was chosen, and where it’s used.
Languages
| Language | Version | Usage |
|---|---|---|
| Python 3 | 3.10+ (required for tuple[dict, str] type hints) | All 5 pipeline scripts |
| Markdown | — | Knowledge base (Obsidian conventions, YAML frontmatter, wikilinks) |
Core Dependencies
From pipeline/requirements.txt:
| Package | Version | Purpose | Used In |
|---|---|---|---|
| sentence-transformers | unpinned | Embedding generation via Legal-BERTimbau | embed_doutrina.py |
| torch | unpinned | ML backend for sentence-transformers | embed_doutrina.py |
| numpy | unpinned | Vector math (cosine similarity, score normalization) | embed_doutrina.py, search_doutrina_v2.py |
| anthropic | unpinned | Python SDK used as an HTTP client for the MiniMax M2.5 API | enrich_chunks.py |
| llama-parse | unpinned | PDF → markdown extraction via LlamaIndex cloud | process_books.py |
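The numpy usage noted above (cosine similarity and score normalization) can be sketched as follows. This is an illustrative sketch, not code from the actual scripts; the function names are made up for the example.

```python
import numpy as np

def cosine_similarity(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of document vectors."""
    query_norm = query_vec / np.linalg.norm(query_vec)
    doc_norms = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_norms @ query_norm

def min_max_normalize(scores: np.ndarray) -> np.ndarray:
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

# Toy example with 768-dim vectors (the Legal-BERTimbau embedding size)
rng = np.random.default_rng(0)
docs = rng.standard_normal((4, 768))
query = docs[2] + 0.01 * rng.standard_normal(768)  # near-duplicate of doc 2
sims = cosine_similarity(query, docs)
assert sims.argmax() == 2
```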
ML Models
| Model | Provider | Dimensions | Max Tokens | Purpose |
|---|---|---|---|---|
| rufimelo/Legal-BERTimbau-sts-base | HuggingFace | 768 | 512 | Semantic embeddings for legal text |
| MiniMax-M2.5 | MiniMax (via Anthropic SDK) | — | ~8,000 | Chunk enrichment and classification |
| LlamaParse | LlamaIndex | — | — | PDF → markdown extraction |
Legal-BERTimbau was trained on Portuguese legal corpora. It’s the standard choice for Portuguese legal NLP, though it was trained on PT-PT (Portugal), not PT-BR (Brazil). No benchmark comparison with alternatives (multilingual-e5, nomic-embed, Cohere embed v3) has been performed for Douto’s specific domain.
External Services
| Service | Purpose | Authentication | Required For |
|---|---|---|---|
| LlamaParse API | PDF → markdown conversion | LLAMA_CLOUD_API_KEY | process_books.py only |
| MiniMax M2.5 API | Chunk classification with legal metadata | MINIMAX_API_KEY | enrich_chunks.py only |
| HuggingFace Hub | Model download (auto on first run) | None (public model) | embed_doutrina.py (first run) |
Standard Library Usage
The following stdlib modules are used across the pipeline scripts, with no external packages needed:
re, json, pathlib, argparse, asyncio, shutil, threading, math, collections, os, sys, time, datetime
Notably, rechunk_v3.py (the most complex script at 890 lines) uses only stdlib modules.
Infrastructure
| Category | Current State | Planned |
|---|---|---|
| Build system | None — scripts run manually | Makefile (F31) |
| Database | None — JSON flat files | Vector DB migration (M12) |
| Containerization | None | Docker (F38) |
| CI/CD | None | GitHub Actions (F39) |
| Linting | None | ruff (F32) |
| Testing | None (0% coverage) | pytest (F26-F27) |
Dependency Graph
```mermaid
graph LR
    LP["LlamaParse API"] --> PB["process_books.py"]
    PB -->|"stdlib only"| RC["rechunk_v3.py"]
    AN["anthropic SDK"] --> EN["enrich_chunks.py"]
    EN -->|"via base_url"| MM["MiniMax API"]
    ST["sentence-transformers"] --> EM["embed_doutrina.py"]
    TO["torch"] --> ST
    NP["numpy"] --> EM
    NP --> SE["search_doutrina_v2.py"]
    HF["HuggingFace Hub"] -.->|"model download"| ST
```