
# Installation

This guide covers the complete setup for running every stage of the Douto pipeline, from PDF extraction through search. If you only need to search the existing corpus, see Quickstart.

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| Python | 3.10+ | 3.11+ |
| RAM | 4 GB | 8 GB+ |
| Disk | 2 GB (models + corpus) | 10 GB (with all books) |
| GPU | Not required | CUDA-compatible (speeds up embedding generation) |
| OS | Linux, macOS, WSL2 | Linux or macOS |
Clone the repository:

```sh
git clone https://github.com/sensdiego/douto.git
cd douto
```

Create a virtual environment and install the dependencies:

```sh
python3 -m venv .venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows
pip install -r pipeline/requirements.txt
```

Current dependencies:

| Package | Purpose | Size |
| --- | --- | --- |
| sentence-transformers | Embedding generation (Legal-BERTimbau) | ~200 MB |
| torch | ML backend for sentence-transformers | ~800 MB |
| numpy | Vector operations (cosine similarity, etc.) | ~30 MB |
| anthropic | SDK for MiniMax M2.5 API (via custom base_url) | ~5 MB |
| llama-parse | PDF extraction via LlamaIndex | ~10 MB |
Set the required environment variables:

```sh
# Required for all pipeline stages
export VAULT_PATH="/path/to/your/vault"

# Required for PDF extraction (process_books.py)
export LLAMA_CLOUD_API_KEY="your-llamaparse-api-key"

# Required for chunk enrichment (enrich_chunks.py)
export MINIMAX_API_KEY="your-minimax-api-key"

# Optional: customize output paths
export OUTPUT_PATH="/path/to/output"      # default: ~/.openclaw/workspace/juca/data
export DATA_PATH="/path/to/search/data"   # default: same as OUTPUT_PATH
```
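The way the scripts resolve these path variables can be sketched in a few lines of Python. This is an illustration of the defaulting behavior described above, not the pipeline's actual code; `resolve_path` is a hypothetical helper:

```python
import os
from pathlib import Path

def resolve_path(var: str, default: str) -> Path:
    """Read a path-valued environment variable, falling back to a default."""
    return Path(os.environ.get(var, default)).expanduser()

# OUTPUT_PATH has a documented default; DATA_PATH falls back to OUTPUT_PATH.
OUTPUT_PATH = resolve_path("OUTPUT_PATH", "~/.openclaw/workspace/juca/data")
DATA_PATH = resolve_path("DATA_PATH", str(OUTPUT_PATH))
```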

`VAULT_PATH` must point to an Obsidian vault with the following structure:

```
$VAULT_PATH/
└── Knowledge/
    └── _staging/
        ├── input/       # Place PDFs here
        ├── processed/   # Output from process_books.py and rechunk_v3.py
        └── failed/      # PDFs that failed extraction
```
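If the staging tree does not exist yet, it can be bootstrapped with the standard library. A minimal sketch (the `bootstrap_staging` helper is not part of the pipeline; pass your actual `VAULT_PATH`):

```python
from pathlib import Path

def bootstrap_staging(vault: Path) -> None:
    """Create the _staging subdirectories the pipeline expects."""
    staging = vault / "Knowledge" / "_staging"
    for name in ("input", "processed", "failed"):
        (staging / name).mkdir(parents=True, exist_ok=True)

# Example:
# bootstrap_staging(Path("/path/to/your/vault"))
```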

For a complete environment variable reference, see Environment Variables.

The sentence-transformers library auto-downloads `rufimelo/Legal-BERTimbau-sts-base` (~500 MB) on first run. To pre-download it:

```sh
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base')"
```
Verify the installation:

```sh
# Check Python version
python3 --version   # Should be 3.10+

# Check core dependencies
python3 -c "from sentence_transformers import SentenceTransformer; print('sentence-transformers OK')"
python3 -c "import torch; print(f'torch OK, CUDA: {torch.cuda.is_available()}')"
python3 -c "import numpy; print(f'numpy OK, version: {numpy.__version__}')"
python3 -c "import anthropic; print('anthropic OK')"

# Check search CLI
python3 pipeline/search_doutrina_v2.py --help
```
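As a lighter-weight alternative, the same dependency check can be scripted with `importlib`, which confirms the packages are importable without loading torch or downloading any model. A sketch (the `missing_packages` helper is illustrative, not part of the repository):

```python
import importlib.util

# Packages the pipeline imports at runtime
REQUIRED = ["sentence_transformers", "torch", "numpy", "anthropic"]

def missing_packages(names=REQUIRED):
    """Return the subset of packages that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    gaps = missing_packages()
    print("all dependencies OK" if not gaps else f"missing: {', '.join(gaps)}")
```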

Each stage depends on the output of the previous one. Run them in order:

**Stage 1: PDF extraction**

```sh
# Place PDFs in $VAULT_PATH/Knowledge/_staging/input/
python3 pipeline/process_books.py --dry-run     # Preview what will be processed
python3 pipeline/process_books.py               # Run extraction
python3 pipeline/process_books.py --tier fast   # Use cheaper LlamaParse tier
```

Requires: LLAMA_CLOUD_API_KEY. See PDF Extraction.
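Conceptually, `--dry-run` amounts to scanning the input directory and reporting what would be processed. A sketch of that idea (an illustration only, not the actual logic in `process_books.py`):

```python
from pathlib import Path

def pending_pdfs(staging_input: Path) -> list[Path]:
    """List the PDFs waiting in _staging/input, sorted for stable output."""
    return sorted(staging_input.glob("*.pdf"))

# Example dry-run preview:
# for pdf in pending_pdfs(Path("/path/to/vault") / "Knowledge" / "_staging" / "input"):
#     print(f"would process: {pdf.name}")
```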

**Stage 2: Intelligent chunking**

```sh
python3 pipeline/rechunk_v3.py --dry-run          # Preview
python3 pipeline/rechunk_v3.py                    # Process all books
python3 pipeline/rechunk_v3.py contratos-gomes    # Process one book
python3 pipeline/rechunk_v3.py --min-chars 1500   # Custom minimum chunk size
```

No API keys required. See Intelligent Chunking.
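The `--min-chars` flag implies a merge rule along these lines: adjacent chunks are coalesced until each reaches the minimum size. A hypothetical illustration of the idea, not the real algorithm in `rechunk_v3.py`:

```python
def merge_small_chunks(chunks: list[str], min_chars: int = 1500) -> list[str]:
    """Greedily merge adjacent chunks until each meets min_chars."""
    merged: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer}\n\n{chunk}" if buffer else chunk
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:  # trailing remainder may stay under the minimum
        merged.append(buffer)
    return merged
```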

**Stage 3: Enrichment**

```sh
python3 pipeline/enrich_chunks.py --dry-run    # Preview
python3 pipeline/enrich_chunks.py all          # Enrich all chunks
python3 pipeline/enrich_chunks.py contratos    # Enrich one area
```

Requires: MINIMAX_API_KEY. See Enrichment.

**Stage 4: Embeddings**

```sh
python3 pipeline/embed_doutrina.py --dry-run   # Preview
python3 pipeline/embed_doutrina.py             # Generate embeddings
```

No API keys required (model is downloaded from HuggingFace). See Embeddings.
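Once embeddings exist, the semantic side of search reduces to cosine similarity over vectors, the kind of numpy operation the dependency table mentions. A sketch of the math, independent of the pipeline's file formats (the function names are illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k corpus rows most similar to the query vector."""
    scores = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    return np.argsort(scores)[::-1][:k]
```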

**Stage 5: Hybrid search**

```sh
python3 pipeline/search_doutrina_v2.py --interativo   # Interactive mode
python3 pipeline/search_doutrina_v2.py "query" --area all
```

See Hybrid Search.
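Hybrid rankers typically blend a lexical score with a semantic (embedding) score. The actual scoring in `search_doutrina_v2.py` is documented in Hybrid Search; as a generic illustration of the idea (the `alpha` weight and function name are hypothetical):

```python
def hybrid_score(lexical: float, semantic: float, alpha: float = 0.5) -> float:
    """Weighted blend of a lexical and a semantic score,
    both assumed normalized to [0, 1]."""
    return alpha * lexical + (1 - alpha) * semantic
```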

Two scripts have hardcoded paths. Edit them directly or wait for F22:

```python
# process_books.py line 27 — change to:
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/your/path"))

# rechunk_v3.py line 29 — change to:
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/your/path"))
```

If your GPU isn’t compatible, force CPU mode:

```sh
export CUDA_VISIBLE_DEVICES=""
```

The enrichment prompt file is missing from the repository. This is a known critical issue (RT01 in PREMORTEM.md). Until it’s recovered, enrichment of new chunks will fail.

For more solutions, see Troubleshooting.