# Development Setup

Everything you need to start developing on Douto — from cloning to running your first pipeline pass.
## Prerequisites

| Requirement | Version | Notes |
|---|---|---|
| Python | 3.10+ | Required for modern type hints (`tuple[dict, str]`) |
| pip | Latest | Package installer |
| Git | Any recent version | Version control |
| RAM | 8 GB+ | PyTorch + Legal-BERTimbau model loading |
| Disk | ~2 GB free | Model cache (~500 MB) + embeddings data |
Optional:
| Tool | Purpose |
|---|---|
| CUDA-compatible GPU | Faster embedding generation (CPU works, just slower) |
| Obsidian | Navigate the knowledge base visually with wikilinks |
| API keys | See below — only needed for specific pipeline stages |
## Clone and Install

```bash
# Clone the repository
git clone https://github.com/sensdiego/douto.git
cd douto

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate   # Linux/macOS
# .venv\Scripts\activate    # Windows

# Install dependencies
pip install -r pipeline/requirements.txt
```
## Configure Environment

Set environment variables based on what you plan to do. See the Environment Variables reference for full details.
### Minimal (search only)

```bash
export DATA_PATH="/path/to/juca/data"
```
### Full pipeline

```bash
export VAULT_PATH="/path/to/your/vault"
export OUTPUT_PATH="/path/to/juca/data"
export DATA_PATH="/path/to/juca/data"
export MINIMAX_API_KEY="your-minimax-key"
export LLAMA_CLOUD_API_KEY="your-llamaparse-key"
```
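The pipeline scripts read these variables at runtime (per the coding conventions, all paths come from `os.environ.get()`). As a minimal sketch of startup validation — `require_env` is a hypothetical helper, not an existing repo function:

```python
import os
import sys

def require_env(name: str) -> str:
    """Exit with a clear message if a required variable is unset."""
    value = os.environ.get(name)
    if not value:
        sys.exit(f"Missing required environment variable: {name}")
    return value

DATA_PATH = require_env("DATA_PATH")  # minimal (search only)
# The full pipeline additionally needs:
# VAULT_PATH, OUTPUT_PATH, MINIMAX_API_KEY, LLAMA_CLOUD_API_KEY
```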
## Verify Setup

Run these commands to confirm each component is working:

```bash
# Check Python version (need 3.10+)
python3 --version

# Check sentence-transformers
python3 -c "import sentence_transformers; print('sentence-transformers OK')"

# Check PyTorch and CUDA availability
python3 -c "import torch; print(f'torch OK, CUDA: {torch.cuda.is_available()}')"

# Check anthropic SDK
python3 -c "import anthropic; print('anthropic SDK OK')"

# Check pipeline scripts are accessible
python3 pipeline/search_doutrina_v2.py --help
python3 pipeline/rechunk_v3.py --help
python3 pipeline/enrich_chunks.py --help
```

If all checks pass, your environment is ready.
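If you prefer a single command, the import checks above can be collapsed into one illustrative Python script (a convenience sketch, not a repo tool):

```python
import importlib
import sys

# Fail fast on an old interpreter (modern type hints need 3.10+)
assert sys.version_info >= (3, 10), "Python 3.10+ required"

# Import each dependency the pipeline relies on
for name in ("sentence_transformers", "torch", "anthropic"):
    importlib.import_module(name)
    print(f"{name} OK")

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
```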
## Project Structure

```
douto/
├── AGENTS.md                   # Agent identity, responsibilities, boundaries
├── CLAUDE.md                   # Coding guidelines for AI agents
├── ROADMAP.md                  # Product roadmap with milestones
├── PROJECT_MAP.md              # Full project diagnostic
├── PREMORTEM.md                # Risk analysis and failure scenarios
├── knowledge/
│   ├── INDEX_DOUTO.md          # Skill graph index — entry point
│   ├── mocs/
│   │   ├── MOC_CIVIL.md        # 35 books, ~9,365 chunks
│   │   ├── MOC_PROCESSUAL.md   # 8 books, ~22,182 chunks
│   │   ├── MOC_EMPRESARIAL.md  # 7 books
│   │   └── MOC_CONSUMIDOR.md   # Placeholder (not yet populated)
│   └── nodes/
│       └── .gitkeep            # Atomic notes (future — F36)
├── pipeline/
│   ├── process_books.py        # Step 1: PDF → markdown (LlamaParse)
│   ├── rechunk_v3.py           # Step 2: markdown → intelligent chunks
│   ├── enrich_chunks.py        # Step 3: chunks → classified (MiniMax M2.5)
│   ├── embed_doutrina.py       # Step 4: chunks → 768-dim embeddings
│   ├── search_doutrina_v2.py   # Step 5: hybrid semantic + BM25 search
│   └── requirements.txt        # Python dependencies (unpinned)
├── tools/
│   └── .gitkeep                # Auxiliary tools (future)
└── docs/                       # Starlight documentation site
```
### Key Directories

| Directory | Purpose |
|---|---|
| `pipeline/` | ETL scripts that process legal textbooks into searchable data |
| `knowledge/` | Obsidian-compatible markdown skill graph (MOCs, index, future atomic notes) |
| `tools/` | Reserved for auxiliary scripts (currently empty) |
| `docs/` | Project documentation (Starlight/Astro) |
## Pipeline Data Flow

```
process_books.py → rechunk_v3.py → enrich_chunks.py → embed_doutrina.py → search_doutrina_v2.py
     PDF→MD          MD→chunks      chunks→metadata     chunks→vectors      vectors→results
```

Each script depends on the output of the previous one. They must be run sequentially.
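As a sketch of what running the stages sequentially looks like, a hypothetical driver (not included in the repo) could invoke the first four scripts in order and stop on the first failure, since each stage consumes the previous stage's output. Real invocations may also need the flags documented for each script.

```python
import subprocess
import sys

STAGES = [
    "pipeline/process_books.py",   # PDF → markdown
    "pipeline/rechunk_v3.py",      # markdown → chunks
    "pipeline/enrich_chunks.py",   # chunks → metadata
    "pipeline/embed_doutrina.py",  # chunks → vectors
]

for stage in STAGES:
    print(f"Running {stage} ...")
    result = subprocess.run([sys.executable, stage])
    if result.returncode != 0:
        # Later stages depend on this one's output, so stop here.
        sys.exit(f"{stage} exited with code {result.returncode}")
```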
## Development Workflow

### 1. Create a Branch

```bash
git fetch origin
git checkout -b feat/SEN-XXX-description   # feature
git checkout -b fix/SEN-XXX-description    # bug fix
git checkout -b docs/description           # documentation
```

Branch naming follows the pattern `{type}/SEN-{ticket}-{description}` when linked to a Linear issue, or `{type}/{description}` otherwise.
### 2. Make Changes

Follow the Coding Conventions. Key rules (see the sketch after this list):

- Type hints on all public functions
- `--dry-run` support for scripts that modify data
- `os.environ.get()` for all paths (no hardcoded absolutes)
- Idempotent operations (safe to re-run)
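A minimal sketch of a script that follows all four rules; every name here is illustrative, not actual repo code:

```python
import argparse
import os
from pathlib import Path

def process(data_path: Path, dry_run: bool) -> int:
    """Return the number of files written (or that would be written)."""
    written = 0
    for src in data_path.glob("*.md"):
        dest = src.with_suffix(".json")
        if dest.exists():  # idempotent: re-running skips finished work
            continue
        if not dry_run:
            dest.write_text("{}", encoding="utf-8")
        written += 1
    return written

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()
    data_path = Path(os.environ.get("DATA_PATH", "."))  # no hardcoded absolutes
    print(f"{process(data_path, args.dry_run)} file(s) affected")
```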
### 3. Test Your Changes

For pipeline changes, test with a small subset:

```bash
python3 pipeline/rechunk_v3.py --dry-run --limit 5
```
### 4. Commit

Use Conventional Commits:

```bash
git add pipeline/rechunk_v3.py
git commit -m "feat: add bibliography detection to rechunker"
```

Commit types: `feat:`, `fix:`, `docs:`, `refactor:`, `test:`, `chore:`
### 5. Push and Open a PR

```bash
git push -u origin feat/SEN-XXX-description
```

Create a pull request targeting `main`. Include a description of what changed and why.
## Multi-Agent Coordination

This project is developed by two AI code agents:
- Claude Code — local execution
- Codex (OpenAI) — cloud execution
Rule: Never work on the same branch that Codex is using.
```bash
git branch -a | grep codex   # check for active Codex branches
```

## IDE Recommendations

| Tool | Use |
|---|---|
| VS Code | Primary editor, with Python extension for type checking and linting |
| Obsidian | Navigate `knowledge/` visually — wikilinks, graph view, frontmatter |
| ruff extension | Linting (when configured — F32) |
Open the `knowledge/` directory as an Obsidian vault to see the skill graph in graph view and follow wikilinks between MOCs and future atomic notes.