
Technology Stack

Every technology used in Douto, why it was chosen, and where it’s used.

| Language | Version | Usage |
| --- | --- | --- |
| Python 3 | 3.10+ (required for `tuple[dict, str]` type hints) | All 5 pipeline scripts |
| Markdown | — | Knowledge base (Obsidian conventions, YAML frontmatter, wikilinks) |
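The version floor matters because the scripts annotate return values with builtin generics. A minimal sketch of that style of annotation — the function and its fields here are hypothetical, not taken from the actual pipeline:

```python
# Python 3.10+ lets builtin generics like tuple[dict, str] appear
# directly in annotations, without typing.Tuple or typing.Dict.
def parse_chunk(raw: str) -> tuple[dict, str]:
    """Split a raw chunk into (metadata, body). Hypothetical helper."""
    meta, _, body = raw.partition("\n---\n")
    return {"header": meta}, body

meta, body = parse_chunk("title: x\n---\ncontent")
```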

From pipeline/requirements.txt:

| Package | Version | Purpose | Used In |
| --- | --- | --- | --- |
| sentence-transformers | unpinned | Embedding generation via Legal-BERTimbau | embed_doutrina.py |
| torch | unpinned | ML backend for sentence-transformers | embed_doutrina.py |
| numpy | unpinned | Vector math (cosine similarity, score normalization) | embed_doutrina.py, search_doutrina_v2.py |
| anthropic | unpinned | Python SDK used as HTTP client for the MiniMax M2.5 API | enrich_chunks.py |
| llama-parse | unpinned | PDF → markdown extraction via LlamaIndex cloud | process_books.py |
| Model | Provider | Dimensions | Max Tokens | Purpose |
| --- | --- | --- | --- | --- |
| rufimelo/Legal-BERTimbau-sts-base | HuggingFace | 768 | 512 | Semantic embeddings for legal text |
| MiniMax-M2.5 | MiniMax (via Anthropic SDK) | — | ~8,000 | Chunk enrichment and classification |
| LlamaParse | LlamaIndex | — | — | PDF → markdown extraction |

Legal-BERTimbau was trained on Portuguese legal corpora. It’s the standard choice for Portuguese legal NLP, though it was trained on PT-PT (Portugal), not PT-BR (Brazil). No benchmark comparison with alternatives (multilingual-e5, nomic-embed, Cohere embed v3) has been performed for Douto’s specific domain.
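For reference, the vector math that numpy handles downstream of these embeddings — cosine similarity over 768-dimensional vectors, plus min-max score normalization — can be sketched as below. Function names are illustrative, not the actual ones in search_doutrina_v2.py:

```python
import numpy as np

def cosine_scores(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of doc vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

def normalize_scores(scores: np.ndarray) -> np.ndarray:
    """Min-max normalize scores into [0, 1] (one common normalization scheme)."""
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

# Toy example with the same 768 dimensions as Legal-BERTimbau output.
rng = np.random.default_rng(0)
docs = rng.normal(size=(3, 768))
query = docs[0] + 0.1 * rng.normal(size=768)  # near-duplicate of docs[0]
scores = normalize_scores(cosine_scores(query, docs))
```

In this toy run the near-duplicate document ends up with the top normalized score, which is the behavior a semantic search stage relies on.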

| Service | Purpose | Authentication | Required For |
| --- | --- | --- | --- |
| LlamaParse API | PDF → markdown conversion | LLAMA_CLOUD_API_KEY | process_books.py only |
| MiniMax M2.5 API | Chunk classification with legal metadata | MINIMAX_API_KEY | enrich_chunks.py only |
| HuggingFace Hub | Model download (automatic on first run) | None (public model) | embed_doutrina.py (first run) |
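Because each script needs only its own credential, a fail-fast guard keeps a missing key local to the script being run. A stdlib-only sketch — the helper name is hypothetical, not a function in the pipeline:

```python
import os

def require_env(var: str) -> str:
    """Exit with a clear message if a required API key is not set."""
    value = os.environ.get(var)
    if not value:
        raise SystemExit(f"{var} is not set; export it before running this script")
    return value

# e.g. enrich_chunks.py would need only:
# api_key = require_env("MINIMAX_API_KEY")
```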

The following stdlib modules are used across pipeline scripts — no external packages needed:

`re`, `json`, `pathlib`, `argparse`, `asyncio`, `shutil`, `threading`, `math`, `collections`, `os`, `sys`, `time`, `datetime`

Notably, rechunk_v3.py (the most complex script at 890 lines) uses only stdlib modules.
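To give a flavor of stdlib-only chunking, here is an illustrative sketch that splits markdown at top-level headings using only `re`. This is not rechunk_v3.py's actual algorithm, which is considerably more involved:

```python
import re

def split_on_headings(markdown: str) -> list[str]:
    """Split a markdown document into chunks at top-level '# ' headings.

    Illustrative stdlib-only sketch, not the real rechunk_v3.py logic.
    """
    # Zero-width split: keep each heading attached to the chunk it opens.
    parts = re.split(r"(?m)^(?=# )", markdown)
    return [p.strip() for p in parts if p.strip()]

doc = "# One\ntext\n# Two\nmore text\n"
chunks = split_on_headings(doc)
```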

| Category | Current State | Planned |
| --- | --- | --- |
| Build system | None — scripts run manually | Makefile (F31) |
| Database | None — JSON flat files | Vector DB migration (M12) |
| Containerization | None | Docker (F38) |
| CI/CD | None | GitHub Actions (F39) |
| Linting | None | ruff (F32) |
| Testing | None (0% coverage) | pytest (F26–F27) |
```mermaid
graph LR
    LP["LlamaParse API"] --> PB["process_books.py"]
    PB -->|"stdlib only"| RC["rechunk_v3.py"]
    AN["anthropic SDK"] --> EN["enrich_chunks.py"]
    EN -->|"via base_url"| MM["MiniMax API"]
    ST["sentence-transformers"] --> EM["embed_doutrina.py"]
    TO["torch"] --> ST
    NP["numpy"] --> EM
    NP --> SE["search_doutrina_v2.py"]
    HF["HuggingFace Hub"] -.->|"model download"| ST
```