# Environment Variables

Every environment variable used by Douto: which scripts read it, its default value, and whether it is required.
## Variable Reference

| Variable | Required | Default | Used In | Description |
|---|---|---|---|---|
| `VAULT_PATH` | Yes | varies (see Known Issues) | `enrich_chunks.py`, `embed_doutrina.py` | Path to the Obsidian vault containing processed markdown chunks in `Knowledge/_staging/processed/` |
| `OUTPUT_PATH` | No | `~/.openclaw/workspace/juca/data` | `embed_doutrina.py` | Directory where embedding, corpus, and BM25 index JSON files are written |
| `DATA_PATH` | No | `~/.openclaw/workspace/juca/data` | `search_doutrina_v2.py` | Directory containing pre-built search data (embeddings, corpus, BM25 index) |
| `MINIMAX_API_KEY` | Yes (for enrichment) | — | `enrich_chunks.py` | API key for MiniMax M2.5, used via the Anthropic SDK with a custom `base_url` |
| `LLAMA_CLOUD_API_KEY` | Yes (for extraction) | — | `process_books.py` | API key for LlamaParse (LlamaIndex), loaded implicitly by the SDK |
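The rules in the table above can be encoded as a small pre-flight check. The function below is a hypothetical helper (not part of the repo) that applies the documented path defaults and rejects a missing `VAULT_PATH`:

```sh
#!/bin/sh
# Hypothetical pre-flight check for the Douto pipeline (illustrative only).
# Applies the documented defaults for the optional path variables and
# fails fast when the required VAULT_PATH is unset.
check_douto_env() {
  # OUTPUT_PATH and DATA_PATH both default to the same directory.
  : "${OUTPUT_PATH:=$HOME/.openclaw/workspace/juca/data}"
  : "${DATA_PATH:=$OUTPUT_PATH}"
  # VAULT_PATH has no safe default (see Known Issues): require it explicitly.
  if [ -z "${VAULT_PATH:-}" ]; then
    echo "error: VAULT_PATH must be set" >&2
    return 1
  fi
}
```

Source a guard like this at the top of a wrapper script so every stage sees the same resolved paths.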
## Optional / Implicit Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `HF_HOME` | No | `~/.cache/huggingface` | Override the HuggingFace cache directory where Legal-BERTimbau is downloaded |
| `SENTENCE_TRANSFORMERS_HOME` | No | `$HF_HOME` | Override the sentence-transformers model cache specifically |
| `CUDA_VISIBLE_DEVICES` | No | all GPUs | Restrict which GPU(s) PyTorch uses for embedding generation |
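Since `SENTENCE_TRANSFORMERS_HOME` falls back to `HF_HOME`, which itself falls back to `~/.cache/huggingface`, the effective model cache can be resolved as below. This is a hypothetical helper mirroring the fallback chain in the table, not something shipped with the pipeline:

```sh
#!/bin/sh
# Hypothetical helper: report where Legal-BERTimbau will be cached, following
# the fallback chain SENTENCE_TRANSFORMERS_HOME -> HF_HOME -> ~/.cache/huggingface.
resolve_model_cache() {
  if [ -n "${SENTENCE_TRANSFORMERS_HOME:-}" ]; then
    echo "$SENTENCE_TRANSFORMERS_HOME"
  elif [ -n "${HF_HOME:-}" ]; then
    echo "$HF_HOME"
  else
    echo "$HOME/.cache/huggingface"
  fi
}
```

Useful for checking, before a run, whether the ~500MB model download will land on a volume with enough space.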
## Known Issues with Paths

### Path Relationship

The scripts expect a specific directory structure under `VAULT_PATH`:

```
$VAULT_PATH/
  Knowledge/
    _staging/
      input/                 # PDFs to process (process_books.py reads from here)
      processed/             # Markdown chapters + enriched chunks (all scripts)
      failed/                # Failed PDF extractions
      processing_log.jsonl
```

`OUTPUT_PATH` and `DATA_PATH` both default to the same directory (`~/.openclaw/workspace/juca/data`), where the embedding and search index files live:
```
$OUTPUT_PATH/   (== $DATA_PATH/)
  embeddings_doutrina.json
  search_corpus_doutrina.json
  bm25_index_doutrina.json
  embeddings_processo_civil.json
  search_corpus_processo_civil.json
  bm25_index_processo_civil.json
```

## Setup by Use Case
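Before running a search, it can help to verify that the expected artifacts actually exist under `DATA_PATH`. The guard below is illustrative (not part of the repo), assuming the per-corpus filenames listed above:

```sh
#!/bin/sh
# Illustrative guard: verify that the pre-built search artifacts for one
# corpus (e.g. "doutrina" or "processo_civil") exist under DATA_PATH.
check_search_data() {
  corpus="$1"
  dir="${DATA_PATH:-$HOME/.openclaw/workspace/juca/data}"
  for f in "embeddings_${corpus}.json" \
           "search_corpus_${corpus}.json" \
           "bm25_index_${corpus}.json"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      return 1
    fi
  done
}
```

Running this before `search_doutrina_v2.py` turns a confusing runtime failure into an explicit "missing file" message.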
### Search Only (querying existing data)

You only need `DATA_PATH`:

```bash
export DATA_PATH="/path/to/juca/data"
python3 pipeline/search_doutrina_v2.py --interativo
```
### Embedding Generation

You need `VAULT_PATH` (source chunks) and `OUTPUT_PATH` (where to write embeddings):

```bash
export VAULT_PATH="/path/to/vault"
export OUTPUT_PATH="/path/to/juca/data"
python3 pipeline/embed_doutrina.py
```
### Chunk Enrichment

You need `VAULT_PATH` (source chunks) and `MINIMAX_API_KEY`:

```bash
export VAULT_PATH="/path/to/vault"
export MINIMAX_API_KEY="your-minimax-api-key"
python3 pipeline/enrich_chunks.py all
```
### Full Pipeline (PDF to search)

All variables are required:

```bash
export VAULT_PATH="/path/to/vault"
export OUTPUT_PATH="/path/to/juca/data"
export MINIMAX_API_KEY="your-minimax-api-key"
export LLAMA_CLOUD_API_KEY="your-llamaparse-key"

python3 pipeline/process_books.py         # PDF -> markdown
python3 pipeline/rechunk_v3.py            # markdown -> chunks
python3 pipeline/enrich_chunks.py all     # chunks -> classified
python3 pipeline/embed_doutrina.py        # chunks -> embeddings
python3 pipeline/search_doutrina_v2.py -i # interactive search
```
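The staged sequence above can be wrapped in a small runner that stops at the first failure, so a broken extraction never feeds garbage into enrichment. This is a sketch, not a script shipped with the repo:

```sh
#!/bin/sh
# Sketch: run pipeline stages in order, aborting at the first failure.
run_stages() {
  for cmd in "$@"; do
    echo ">> $cmd"
    if ! sh -c "$cmd"; then
      echo "stage failed: $cmd" >&2
      return 1
    fi
  done
}

# Usage (assumes the environment variables above are exported):
# run_stages \
#   "python3 pipeline/process_books.py" \
#   "python3 pipeline/rechunk_v3.py" \
#   "python3 pipeline/enrich_chunks.py all" \
#   "python3 pipeline/embed_doutrina.py"
```

Each stage runs in a subshell, so a stage's own `exit` cannot kill the runner; only its exit status is inspected.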
## Example .env File

```bash
# Douto — Environment Variables
# Copy this to .env and run: source .env

# Path to the Obsidian vault with Knowledge/_staging/ structure
export VAULT_PATH="/path/to/your/vault"

# Where embedding and search index JSONs are written/read
export OUTPUT_PATH="/path/to/juca/data"
export DATA_PATH="/path/to/juca/data"   # typically same as OUTPUT_PATH

# API Keys — required for specific pipeline stages
export MINIMAX_API_KEY="your-minimax-api-key"      # enrichment (enrich_chunks.py)
export LLAMA_CLOUD_API_KEY="your-llamaparse-key"   # PDF extraction (process_books.py)

# Optional: HuggingFace model cache (Legal-BERTimbau is ~500MB)
# export HF_HOME="/path/to/hf-cache"

# Optional: GPU control
# export CUDA_VISIBLE_DEVICES="0"   # use only GPU 0
```