
# Environment Variables

This page lists every environment variable used by Douto: which scripts read it, its default value, and whether it is required.

| Variable | Required | Default | Used in | Description |
| --- | --- | --- | --- | --- |
| `VAULT_PATH` | Yes | varies (see Known Issues) | `enrich_chunks.py`, `embed_doutrina.py` | Path to the Obsidian vault containing processed markdown chunks in `Knowledge/_staging/processed/` |
| `OUTPUT_PATH` | No | `~/.openclaw/workspace/juca/data` | `embed_doutrina.py` | Directory where the embedding, corpus, and BM25 index JSON files are written |
| `DATA_PATH` | No | `~/.openclaw/workspace/juca/data` | `search_doutrina_v2.py` | Directory containing the pre-built search data (embeddings, corpus, BM25 index) |
| `MINIMAX_API_KEY` | Yes (for enrichment) | (none) | `enrich_chunks.py` | API key for MiniMax M2.5, used via the Anthropic SDK with a custom `base_url` |
| `LLAMA_CLOUD_API_KEY` | Yes (for extraction) | (none) | `process_books.py` | API key for LlamaParse (LlamaIndex), loaded implicitly by the SDK |
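The path precedence in the table above can be sketched as a small helper. This is a hypothetical illustration of how a script might resolve these variables, not the pipeline's actual code; the variable names and defaults come from the table.

```python
import os

# Default shared by OUTPUT_PATH and DATA_PATH, per the table above.
DEFAULT_DATA_DIR = os.path.expanduser("~/.openclaw/workspace/juca/data")

def resolve_env(require_vault: bool = True) -> dict:
    """Resolve pipeline paths from the environment (illustrative sketch)."""
    vault = os.environ.get("VAULT_PATH")
    if require_vault and not vault:
        # VAULT_PATH has no usable default, so scripts must fail fast.
        raise RuntimeError("VAULT_PATH is required")
    return {
        "vault": vault,
        "output": os.environ.get("OUTPUT_PATH", DEFAULT_DATA_DIR),
        "data": os.environ.get("DATA_PATH", DEFAULT_DATA_DIR),
    }
```

Note that only `VAULT_PATH` triggers an error when unset; the two data directories silently fall back to the shared default.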
The following optional variables control model caching and GPU selection:

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `HF_HOME` | No | `~/.cache/huggingface` | Override the HuggingFace cache directory where Legal-BERTimbau is downloaded |
| `SENTENCE_TRANSFORMERS_HOME` | No | `$HF_HOME` | Override the sentence-transformers model cache specifically |
| `CUDA_VISIBLE_DEVICES` | No | all GPUs | Restrict which GPU(s) PyTorch uses for embedding generation |
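The cache precedence in this table (`SENTENCE_TRANSFORMERS_HOME` wins, falling back to `HF_HOME`, falling back to the default HuggingFace cache) can be mirrored like this. This is an illustrative sketch of the precedence, not library internals:

```python
import os

def model_cache_dir() -> str:
    """Return the directory where the embedding model would be cached,
    following the SENTENCE_TRANSFORMERS_HOME > HF_HOME > default order."""
    hf_home = os.environ.get("HF_HOME",
                             os.path.expanduser("~/.cache/huggingface"))
    return os.environ.get("SENTENCE_TRANSFORMERS_HOME", hf_home)
```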

The scripts expect a specific directory structure under `VAULT_PATH`:

```
$VAULT_PATH/
└── Knowledge/
    └── _staging/
        ├── input/        # PDFs to process (process_books.py reads from here)
        ├── processed/    # Markdown chapters + enriched chunks (all scripts)
        ├── failed/       # Failed PDF extractions
        └── processing_log.jsonl
```
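A quick way to verify a vault has the expected layout before running the pipeline. This hypothetical check is derived from the tree above, not taken from the scripts themselves:

```python
from pathlib import Path

# Staging subdirectories the pipeline expects, per the tree above.
REQUIRED_SUBDIRS = [
    "Knowledge/_staging/input",
    "Knowledge/_staging/processed",
    "Knowledge/_staging/failed",
]

def missing_staging_dirs(vault_path: str) -> list:
    """Return the expected staging subdirectories that don't exist yet."""
    vault = Path(vault_path)
    return [d for d in REQUIRED_SUBDIRS if not (vault / d).is_dir()]
```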

`OUTPUT_PATH` and `DATA_PATH` both default to the same directory (`~/.openclaw/workspace/juca/data`), which is where the embedding and search index files live:

```
$OUTPUT_PATH/          # == $DATA_PATH/
├── embeddings_doutrina.json
├── search_corpus_doutrina.json
├── bm25_index_doutrina.json
├── embeddings_processo_civil.json
├── search_corpus_processo_civil.json
└── bm25_index_processo_civil.json
```
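Each corpus follows the same three-file naming pattern (`embeddings_*`, `search_corpus_*`, `bm25_index_*`), so checking which artifacts exist is straightforward. A hypothetical helper based only on the file names listed above:

```python
from pathlib import Path

# The three artifact kinds per corpus, per the listing above.
KINDS = ("embeddings", "search_corpus", "bm25_index")

def corpus_files(data_path: str, corpus: str) -> dict:
    """Map each expected JSON file for a corpus to whether it exists."""
    data = Path(data_path)
    return {f"{kind}_{corpus}.json": (data / f"{kind}_{corpus}.json").exists()
            for kind in KINDS}
```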

To search against pre-built indexes, you only need `DATA_PATH`:

```bash
export DATA_PATH="/path/to/juca/data"
python3 pipeline/search_doutrina_v2.py --interativo
```

To generate embeddings, you need `VAULT_PATH` (source chunks) and `OUTPUT_PATH` (where the embeddings are written):

```bash
export VAULT_PATH="/path/to/vault"
export OUTPUT_PATH="/path/to/juca/data"
python3 pipeline/embed_doutrina.py
```

To enrich chunks, you need `VAULT_PATH` (source chunks) and `MINIMAX_API_KEY`:

```bash
export VAULT_PATH="/path/to/vault"
export MINIMAX_API_KEY="your-minimax-api-key"
python3 pipeline/enrich_chunks.py all
```

To run the full pipeline end to end, all variables are required:

```bash
export VAULT_PATH="/path/to/vault"
export OUTPUT_PATH="/path/to/juca/data"
export MINIMAX_API_KEY="your-minimax-api-key"
export LLAMA_CLOUD_API_KEY="your-llamaparse-key"

python3 pipeline/process_books.py          # PDF -> markdown
python3 pipeline/rechunk_v3.py             # markdown -> chunks
python3 pipeline/enrich_chunks.py all      # chunks -> classified
python3 pipeline/embed_doutrina.py         # chunks -> embeddings
python3 pipeline/search_doutrina_v2.py -i  # interactive search
```
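Because each stage feeds the next, a driver should stop at the first failure. The sketch below is a hypothetical wrapper around the five commands listed above, not part of the project; the injectable `runner` parameter exists only to make the halting behavior easy to demonstrate:

```python
import subprocess
import sys

# Stage commands, taken verbatim from the listing above.
STAGES = [
    ["pipeline/process_books.py"],
    ["pipeline/rechunk_v3.py"],
    ["pipeline/enrich_chunks.py", "all"],
    ["pipeline/embed_doutrina.py"],
    ["pipeline/search_doutrina_v2.py", "-i"],
]

def run_pipeline(stages=STAGES, runner=None):
    """Run each stage in order with python3; return the stages completed.

    The default runner raises CalledProcessError on a non-zero exit,
    which halts the pipeline at the failing stage.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run([sys.executable, *cmd], check=True)
    done = []
    for cmd in stages:
        runner(cmd)        # raises on failure, stopping later stages
        done.append(cmd)
    return done
```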
A ready-to-copy `.env` template:

```bash
# Douto — Environment Variables
# Copy this to .env and run: source .env

# Path to the Obsidian vault with the Knowledge/_staging/ structure
export VAULT_PATH="/path/to/your/vault"

# Where embedding and search index JSONs are written/read
export OUTPUT_PATH="/path/to/juca/data"
export DATA_PATH="/path/to/juca/data"    # typically the same as OUTPUT_PATH

# API Keys — required for specific pipeline stages
export MINIMAX_API_KEY="your-minimax-api-key"      # enrichment (enrich_chunks.py)
export LLAMA_CLOUD_API_KEY="your-llamaparse-key"   # PDF extraction (process_books.py)

# Optional: HuggingFace model cache (Legal-BERTimbau is ~500MB)
# export HF_HOME="/path/to/hf-cache"

# Optional: GPU control
# export CUDA_VISIBLE_DEVICES="0"    # use only GPU 0
```