# Installation

This guide covers the complete setup for running every stage of the Douto pipeline, from PDF extraction through search. If you only need to search the existing corpus, see Quickstart.
## System Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.10+ | 3.11+ |
| RAM | 4 GB | 8 GB+ |
| Disk | 2 GB (models + corpus) | 10 GB (with all books) |
| GPU | Not required | CUDA-compatible (speeds up embedding generation) |
| OS | Linux, macOS, WSL2 | Linux or macOS |
## Step 1: Clone the Repository

```bash
git clone https://github.com/sensdiego/douto.git
cd douto
```

## Step 2: Create a Virtual Environment

```bash
python3 -m venv .venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows
pip install -r pipeline/requirements.txt
```

Current dependencies:
| Package | Purpose | Size |
|---|---|---|
| sentence-transformers | Embedding generation (Legal-BERTimbau) | ~200 MB |
| torch | ML backend for sentence-transformers | ~800 MB |
| numpy | Vector operations (cosine similarity, etc.) | ~30 MB |
| anthropic | SDK for MiniMax M2.5 API (via custom base_url) | ~5 MB |
| llama-parse | PDF extraction via LlamaIndex | ~10 MB |
## Step 3: Configure Environment Variables

```bash
# Required for all pipeline stages
export VAULT_PATH="/path/to/your/vault"

# Required for PDF extraction (process_books.py)
export LLAMA_CLOUD_API_KEY="your-llamaparse-api-key"

# Required for chunk enrichment (enrich_chunks.py)
export MINIMAX_API_KEY="your-minimax-api-key"

# Optional: customize output paths
export OUTPUT_PATH="/path/to/output"     # default: ~/.openclaw/workspace/juca/data
export DATA_PATH="/path/to/search/data"  # default: same as OUTPUT_PATH
```

The VAULT_PATH directory should be an Obsidian vault with the following structure:

```
$VAULT_PATH/
└── Knowledge/
    └── _staging/
        ├── input/      # Place PDFs here
        ├── processed/  # Output from process_books.py and rechunk_v3.py
        └── failed/     # PDFs that failed extraction
```

For a complete environment variable reference, see Environment Variables.
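To illustrate how a pipeline script could consume these variables, here is a minimal sketch of path resolution with the defaults described above. The function name `resolve_paths` is hypothetical, not something the repository defines:

```python
import os
from pathlib import Path

def resolve_paths() -> dict:
    """Fail fast on the required variable; fall back to defaults for the rest."""
    vault = os.environ.get("VAULT_PATH")
    if vault is None:
        raise SystemExit("VAULT_PATH is required for all pipeline stages")

    default_output = Path.home() / ".openclaw" / "workspace" / "juca" / "data"
    output = Path(os.environ.get("OUTPUT_PATH", default_output))
    data = Path(os.environ.get("DATA_PATH", output))  # defaults to OUTPUT_PATH

    return {
        "vault": Path(vault),
        "staging_input": Path(vault) / "Knowledge" / "_staging" / "input",
        "output": output,
        "data": data,
    }
```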
## Step 4: Download the Embedding Model

The sentence-transformers library auto-downloads rufimelo/Legal-BERTimbau-sts-base (~500 MB) on first run. To pre-download:

```bash
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base')"
```

## Step 5: Verify Installation
```bash
# Check Python version
python3 --version   # Should be 3.10+

# Check core dependencies
python3 -c "from sentence_transformers import SentenceTransformer; print('sentence-transformers OK')"
python3 -c "import torch; print(f'torch OK, CUDA: {torch.cuda.is_available()}')"
python3 -c "import numpy; print(f'numpy OK, version: {numpy.__version__}')"
python3 -c "import anthropic; print('anthropic OK')"

# Check search CLI
python3 pipeline/search_doutrina_v2.py --help
```

## Running the Full Pipeline
Each stage depends on the output of the previous one. Run them in order:
### Stage 1: PDF Extraction

```bash
# Place PDFs in $VAULT_PATH/Knowledge/_staging/input/
python3 pipeline/process_books.py --dry-run    # Preview what will be processed
python3 pipeline/process_books.py              # Run extraction
python3 pipeline/process_books.py --tier fast  # Use cheaper LlamaParse tier
```

Requires: LLAMA_CLOUD_API_KEY. See PDF Extraction.
### Stage 2: Intelligent Chunking

```bash
python3 pipeline/rechunk_v3.py --dry-run          # Preview
python3 pipeline/rechunk_v3.py                    # Process all books
python3 pipeline/rechunk_v3.py contratos-gomes    # Process one book
python3 pipeline/rechunk_v3.py --min-chars 1500   # Custom minimum chunk size
```

No API keys required. See Intelligent Chunking.
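To give a feel for what a minimum chunk size means in practice, here is a simplified sketch of greedy merging of undersized chunks. rechunk_v3.py's actual logic is more involved (it is structure-aware); this only illustrates the `--min-chars` constraint:

```python
def merge_small_chunks(chunks: list[str], min_chars: int = 1500) -> list[str]:
    """Greedily merge consecutive chunks until each reaches min_chars."""
    merged: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer}\n\n{chunk}" if buffer else chunk
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:  # keep the trailing remainder, even if short
        merged.append(buffer)
    return merged
```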
### Stage 3: Chunk Enrichment

```bash
python3 pipeline/enrich_chunks.py --dry-run    # Preview
python3 pipeline/enrich_chunks.py all          # Enrich all chunks
python3 pipeline/enrich_chunks.py contratos    # Enrich one area
```

Requires: MINIMAX_API_KEY. See Enrichment.
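As noted in the dependency table, enrichment reaches the MiniMax API through the anthropic SDK's `base_url` override. The sketch below shows that pattern; the endpoint URL and model name are placeholders, not values confirmed from the repository:

```python
MINIMAX_BASE_URL = "https://api.minimax.example/anthropic"  # placeholder, not the real endpoint
MINIMAX_MODEL = "MiniMax-M2.5"                              # placeholder model id

def build_enrichment_request(chunk_text: str, prompt: str) -> dict:
    """Assemble a messages payload enriching one chunk."""
    return {
        "model": MINIMAX_MODEL,
        "max_tokens": 1024,
        "messages": [
            {"role": "user", "content": f"{prompt}\n\n---\n\n{chunk_text}"},
        ],
    }

# Sending it would look roughly like:
#   import anthropic, os
#   client = anthropic.Anthropic(
#       api_key=os.environ["MINIMAX_API_KEY"],
#       base_url=MINIMAX_BASE_URL,
#   )
#   response = client.messages.create(**build_enrichment_request(text, prompt))
```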
### Stage 4: Embedding Generation

```bash
python3 pipeline/embed_doutrina.py --dry-run   # Preview
python3 pipeline/embed_doutrina.py             # Generate embeddings
```

No API keys required (the model is downloaded from Hugging Face). See Embeddings.
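The core of this stage can be sketched as follows: batch-encode chunk texts and L2-normalize the vectors so that cosine similarity later reduces to a dot product. The `embed_chunks` helper is hypothetical; the real script presumably passes `SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base')` as the model:

```python
import numpy as np

def embed_chunks(model, texts: list[str], batch_size: int = 32) -> np.ndarray:
    """Encode texts and return unit-length embedding vectors."""
    vectors = model.encode(texts, batch_size=batch_size, convert_to_numpy=True)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)
```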
### Stage 5: Search

```bash
python3 pipeline/search_doutrina_v2.py --interativo       # Interactive mode
python3 pipeline/search_doutrina_v2.py "query" --area all
```

See Hybrid Search.
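The dense half of a hybrid search can be sketched with plain numpy: rank normalized chunk embeddings by cosine similarity against a normalized query vector. This is only an illustration of the retrieval idea, not search_doutrina_v2.py's actual scoring:

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scores = chunk_matrix @ query_vec          # cosine similarity for unit vectors
    order = np.argsort(scores)[::-1][:k]       # highest scores first
    return [(int(i), float(scores[i])) for i in order]
```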
## Troubleshooting

### FileNotFoundError on hardcoded paths

Two scripts have hardcoded paths. Edit them directly or wait for F22:
```python
# process_books.py line 27 — change to:
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/your/path"))

# rechunk_v3.py line 29 — change to:
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/your/path"))
```

### PyTorch CUDA errors
If your GPU isn't compatible, force CPU mode:

```bash
export CUDA_VISIBLE_DEVICES=""
```

### enrich_prompt.md not found
Section titled “enrich_prompt.md not found”The enrichment prompt file is missing from the repository. This is a known critical issue (RT01 in PREMORTEM.md). Until it’s recovered, enrichment of new chunks will fail.
For more solutions, see Troubleshooting.