# Installation

This guide covers the complete setup for running every stage of the Douto pipeline, from PDF extraction through search. If you only need to search the existing corpus, see Quickstart.
## System Requirements

| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.10+ | 3.11+ |
| RAM | 4 GB | 8 GB+ |
| Disk | 2 GB (models + corpus) | 10 GB (with all books) |
| GPU | Not required | CUDA-compatible (speeds up embedding generation) |
| OS | Linux, macOS, WSL2 | Linux or macOS |
## Step 1: Clone the Repository

```bash
git clone https://github.com/sensdiego/douto.git
cd douto
```

## Step 2: Create a Virtual Environment

```bash
python3 -m venv .venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows
pip install -r pipeline/requirements.txt
```

Current dependencies:
| Package | Purpose | Size |
|---|---|---|
| sentence-transformers | Embedding generation (Legal-BERTimbau) | ~200 MB |
| torch | ML backend for sentence-transformers | ~800 MB |
| numpy | Vector operations (cosine similarity, etc.) | ~30 MB |
| anthropic | SDK for MiniMax M2.5 API (via custom base_url) | ~5 MB |
| llama-parse | PDF extraction via LlamaIndex | ~10 MB |
## Step 3: Configure Environment Variables

```bash
# Required for all pipeline stages
export VAULT_PATH="/path/to/your/vault"

# Required for PDF extraction (process_books.py)
export LLAMA_CLOUD_API_KEY="your-llamaparse-api-key"

# Required for chunk enrichment (enrich_chunks.py)
export MINIMAX_API_KEY="your-minimax-api-key"

# Optional: customize output paths
export OUTPUT_PATH="/path/to/output"     # default: ~/.openclaw/workspace/juca/data
export DATA_PATH="/path/to/search/data"  # default: same as OUTPUT_PATH
```

The VAULT_PATH directory should be an Obsidian vault with the following structure:

```
$VAULT_PATH/
└── Knowledge/
    └── _staging/
        ├── input/      # Place PDFs here
        ├── processed/  # Output from process_books.py and rechunk_v3.py
        └── failed/     # PDFs that failed extraction
```

For a complete environment variable reference, see Environment Variables.
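To illustrate how a pipeline script could consume these variables, here is a minimal sketch of path resolution with the defaults described above. The function name `resolve_paths` is hypothetical, not something the repository defines:

```python
import os
from pathlib import Path

def resolve_paths() -> dict:
    """Fail fast on the required variable; fall back to defaults for the rest."""
    vault = os.environ.get("VAULT_PATH")
    if vault is None:
        raise SystemExit("VAULT_PATH is required for all pipeline stages")

    default_output = Path.home() / ".openclaw" / "workspace" / "juca" / "data"
    output = Path(os.environ.get("OUTPUT_PATH", default_output))
    data = Path(os.environ.get("DATA_PATH", output))  # defaults to OUTPUT_PATH

    return {
        "vault": Path(vault),
        "staging_input": Path(vault) / "Knowledge" / "_staging" / "input",
        "output": output,
        "data": data,
    }
```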
## Step 4: Download the Embedding Model

The sentence-transformers library auto-downloads rufimelo/Legal-BERTimbau-sts-base (~500 MB) on first run. To pre-download:

```bash
python3 -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base')"
```

## Step 5: Verify Installation
```bash
# Check Python version
python3 --version   # Should be 3.10+

# Check core dependencies
python3 -c "from sentence_transformers import SentenceTransformer; print('sentence-transformers OK')"
python3 -c "import torch; print(f'torch OK, CUDA: {torch.cuda.is_available()}')"
python3 -c "import numpy; print(f'numpy OK, version: {numpy.__version__}')"
python3 -c "import anthropic; print('anthropic OK')"

# Check search CLI
python3 pipeline/search_doutrina_v2.py --help
```

## Running the Full Pipeline
Each stage depends on the output of the previous one. Run them in order:
### Stage 1: PDF Extraction

```bash
# Place PDFs in $VAULT_PATH/Knowledge/_staging/input/
python3 pipeline/process_books.py --dry-run    # Preview what will be processed
python3 pipeline/process_books.py              # Run extraction
python3 pipeline/process_books.py --tier fast  # Use cheaper LlamaParse tier
```

Requires: LLAMA_CLOUD_API_KEY. See PDF Extraction.
### Stage 2: Intelligent Chunking

```bash
python3 pipeline/rechunk_v3.py --dry-run          # Preview
python3 pipeline/rechunk_v3.py                    # Process all books
python3 pipeline/rechunk_v3.py contratos-gomes    # Process one book
python3 pipeline/rechunk_v3.py --min-chars 1500   # Custom minimum chunk size
```

No API keys required. See Intelligent Chunking.
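To give a feel for what a minimum chunk size means in practice, here is a simplified sketch of greedy merging of undersized chunks. rechunk_v3.py's actual logic is more involved (it is structure-aware); this only illustrates the `--min-chars` constraint:

```python
def merge_small_chunks(chunks: list[str], min_chars: int = 1500) -> list[str]:
    """Greedily merge consecutive chunks until each reaches min_chars."""
    merged: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer}\n\n{chunk}" if buffer else chunk
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:  # keep the trailing remainder, even if short
        merged.append(buffer)
    return merged
```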
### Stage 3: Chunk Enrichment

```bash
python3 pipeline/enrich_chunks.py --dry-run    # Preview
python3 pipeline/enrich_chunks.py all          # Enrich all chunks
python3 pipeline/enrich_chunks.py contratos    # Enrich one area
```

Requires: MINIMAX_API_KEY. See Enrichment.
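As noted in the dependency table, enrichment reaches the MiniMax API through the anthropic SDK's `base_url` override. The sketch below shows that pattern; the endpoint URL and model name are placeholders, not values confirmed from the repository:

```python
MINIMAX_BASE_URL = "https://api.minimax.example/anthropic"  # placeholder, not the real endpoint
MINIMAX_MODEL = "MiniMax-M2.5"                              # placeholder model id

def build_enrichment_request(chunk_text: str, prompt: str) -> dict:
    """Assemble a messages payload enriching one chunk."""
    return {
        "model": MINIMAX_MODEL,
        "max_tokens": 1024,
        "messages": [
            {"role": "user", "content": f"{prompt}\n\n---\n\n{chunk_text}"},
        ],
    }

# Sending it would look roughly like:
#   import anthropic, os
#   client = anthropic.Anthropic(
#       api_key=os.environ["MINIMAX_API_KEY"],
#       base_url=MINIMAX_BASE_URL,
#   )
#   response = client.messages.create(**build_enrichment_request(text, prompt))
```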
### Stage 4: Embedding Generation

```bash
python3 pipeline/embed_doutrina.py --dry-run   # Preview
python3 pipeline/embed_doutrina.py             # Generate embeddings
```

No API keys required (the model is downloaded from Hugging Face). See Embeddings.
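The core of this stage can be sketched as follows: batch-encode chunk texts and L2-normalize the vectors so that cosine similarity later reduces to a dot product. The `embed_chunks` helper is hypothetical; the real script presumably passes `SentenceTransformer('rufimelo/Legal-BERTimbau-sts-base')` as the model:

```python
import numpy as np

def embed_chunks(model, texts: list[str], batch_size: int = 32) -> np.ndarray:
    """Encode texts and return unit-length embedding vectors."""
    vectors = model.encode(texts, batch_size=batch_size, convert_to_numpy=True)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)
```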
### Stage 5: Search

```bash
python3 pipeline/search_doutrina_v2.py --interativo       # Interactive mode
python3 pipeline/search_doutrina_v2.py "query" --area all
```

See Hybrid Search.
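The dense half of a hybrid search can be sketched with plain numpy: rank normalized chunk embeddings by cosine similarity against a normalized query vector. This is only an illustration of the retrieval idea, not search_doutrina_v2.py's actual scoring:

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks."""
    scores = chunk_matrix @ query_vec          # cosine similarity for unit vectors
    order = np.argsort(scores)[::-1][:k]       # highest scores first
    return [(int(i), float(scores[i])) for i in order]
```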
## Troubleshooting

### FileNotFoundError on hardcoded paths

Two scripts have hardcoded paths. Edit them directly or wait for F22:
```python
# process_books.py line 27 — change to:
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/your/path"))

# rechunk_v3.py line 29 — change to:
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/your/path"))
```

### PyTorch CUDA errors
If your GPU isn't compatible, force CPU mode:

```bash
export CUDA_VISIBLE_DEVICES=""
```

### enrich_prompt.md not found
Section titled “enrich_prompt.md not found”The enrichment prompt file is missing from the repository. This is a known critical issue (RT01 in PREMORTEM.md). Until it’s recovered, enrichment of new chunks will fail.
For more solutions, see Troubleshooting.