# Coding Conventions

Standards and patterns for all code and knowledge base contributions to Douto. These are extracted from CLAUDE.md and enforced during code review.
## Priority Order

When principles conflict, prioritize in this order:
| Priority | Principle | In Practice |
|---|---|---|
| 1 | Correctness | Legal data, citations, and metadata must be accurate. A wrong instituto classification is worse than a slow query. |
| 2 | Simplicity & clarity | Code that another agent (or a human six months from now) understands without context. No clever tricks. |
| 3 | Maintainability | Easy to change without breaking. Small functions, clear interfaces, minimal coupling between scripts. |
| 4 | Reversibility | Prefer decisions that can be undone. Use --dry-run, keep original data, avoid destructive operations. |
| 5 | Performance | Only optimize when there is evidence of a problem. Never sacrifice correctness or clarity for speed. |
## Python Conventions

### Language & Stack
Section titled “Language & Stack”- Python 3.10+ — required for modern type hint syntax
- Type hints are mandatory on all public functions
- `async`/`await` only where necessary (LlamaParse). The pipeline is mostly synchronous.
- Tests: `pytest` (when implemented)
- Linting: `ruff` (when configured)
| Convention | Rule | Example |
|---|---|---|
| Functions/variables | snake_case | parse_frontmatter(), chunk_text |
| Constants | UPPER_SNAKE_CASE | MAX_CHUNK_CHARS, MODEL_NAME |
| Type hints | Modern syntax | tuple[dict, str], not Tuple[Dict, str] |
| String formatting | f-strings preferred | f"Processed {count} chunks" |
| Imports | stdlib, then third-party, then local, separated by blank lines | See below |
| Line length | Follow ruff defaults (when configured) | ~88 characters |
```python
# Import order example
import os
import json
import re
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

from pipeline.utils import parse_frontmatter, slugify
```

### Error Handling

- Specific exceptions — never use broad `except Exception as e`. Catch the specific error type.
- Log with context — include `doc_id`, `chunk_id`, and traceback when logging errors.
- Fail loudly — if a chunk is corrupted, log and skip it rather than silently producing garbage.
### Docstrings

Required for all public functions. Use Google-style format:

```python
def parse_frontmatter(content: str) -> tuple[dict, str]:
    """Parse YAML frontmatter and body from a markdown string.

    Args:
        content: Full markdown content with optional frontmatter.

    Returns:
        Tuple of (metadata dict, body text without frontmatter).
        If no frontmatter found, returns ({}, original content).
    """
```
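For reference, a minimal body matching that docstring might look like the sketch below. It handles only flat `key: value` pairs; the real `pipeline.utils` implementation may well delegate to a proper YAML parser instead.

```python
import re

def parse_frontmatter(content: str) -> tuple[dict, str]:
    """Parse YAML frontmatter and body from a markdown string.

    Minimal sketch: flat `key: value` pairs only, no nested YAML.
    """
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", content, re.DOTALL)
    if not match:
        # No frontmatter: empty metadata, original content unchanged
        return {}, content
    raw, body = match.groups()
    meta: dict[str, str] = {}
    for line in raw.splitlines():
        if ":" in line:
            # Split on the first colon only, so values may contain colons
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip('"')
    return meta, body
```

Splitting on the first colon only matters in practice: values like titles or descriptions can themselves contain colons.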
## Pipeline Script Conventions

Every pipeline script must follow these patterns:
### Required Features

| Feature | Implementation | Purpose |
|---|---|---|
| --help | argparse | Document CLI usage |
| --dry-run | Check before write | Show what would happen without modifying data |
| --force | Skip idempotency checks | Reprocess already-completed items |
| Idempotent | Processing markers, skip logic | Safe to re-run without side effects |
| Structured logging | Append to processing_log.jsonl | Track successes, errors, skips |
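The table above maps to a small argparse skeleton. The function names (`build_parser`, `should_process`) and the `--limit` flag are illustrative, not taken from the actual scripts.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton with the flags every pipeline script must expose."""
    parser = argparse.ArgumentParser(
        description="Process chunks (illustrative skeleton).")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would happen without writing")
    parser.add_argument("--force", action="store_true",
                        help="reprocess items already marked as done")
    parser.add_argument("--limit", type=int, default=None,
                        help="process at most N items, for validation runs")
    return parser

def should_process(item_done: bool, force: bool) -> bool:
    """Idempotency check: skip completed items unless --force is given."""
    return force or not item_done
```

With this shape, `--help` comes for free from argparse, and re-running the script without `--force` is a no-op for already-processed items.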
### Path Handling

```python
# CORRECT: use os.environ.get() with documented fallback
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/default/fallback/path"))

# INCORRECT: hardcoded absolute path
VAULT_PATH = Path("/home/sensd/.openclaw/workspace/vault")

# CORRECT: relative to script location
PROMPT_PATH = Path(__file__).parent / "enrich_prompt.md"
```

### Output Conventions
- JSON for data output (embeddings, corpus, BM25 index)
- YAML frontmatter for metadata in markdown chunks
- JSONL for structured logs (append-only)
- Progress output goes to `stderr`; results go to `stdout`
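The last two conventions can be sketched together: append-only JSONL logging plus progress on `stderr`. The helper names (`log_event`, `report_progress`) and the timestamp field are illustrative assumptions.

```python
import json
import sys
from datetime import datetime, timezone

def log_event(log_path: str, event: dict) -> None:
    """Append one structured record to the JSONL processing log."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), **event}
    # Open in append mode: the log is append-only, never rewritten
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def report_progress(done: int, total: int) -> None:
    """Progress goes to stderr so stdout stays clean for results."""
    print(f"[{done}/{total}] processed", file=sys.stderr)
```

Keeping results on `stdout` means a script's output can be piped or redirected without progress noise mixed in.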
### Resource Limits

Do not run processes that consume more than 50% CPU locally. For validation:

```shell
# test with small subset
python3 pipeline/rechunk_v3.py --limit 5 --dry-run
```

## Knowledge Base Conventions
Section titled “Knowledge Base Conventions”INDEX_DOUTO.md
Section titled “INDEX_DOUTO.md”- Source of truth for domains and navigation
- Lists all 8 legal domains with wikilinks to their MOCs
- Must be updated when adding new domains
### MOC Files

Required frontmatter fields:

```yaml
---
type: moc
domain: civil
description: "Obrigacoes, contratos, responsabilidade civil, propriedade"
---
```

### Chunk Files (Enriched)

Required frontmatter fields:

```yaml
---
knowledge_id: "contratos-orlando-gomes-cap05-001"
tipo: chunk
titulo: "Titulo do chunk"
livro_titulo: "Contratos"
autor: "Orlando Gomes"
area_direito: civil
status_enriquecimento: completo # "completo" | "pendente" | "lixo"
---
```

Rules:
- Chunks with `status_enriquecimento: "completo"` must have `instituto` and `tipo_conteudo` filled
- Never overwrite enriched chunks without explicit `--force`
- Never leave `status_enriquecimento: "pendente"` after enrichment runs
- Use `"lixo"` for noise chunks (prefaces, acknowledgments, catalog cards)
### Links

- Always use wikilinks: `[[MOC_CIVIL]]`
- Never use relative markdown links: `[text](../mocs/MOC_CIVIL.md)`
- Wikilinks enable Obsidian graph view and backlink tracking
### Encoding

- UTF-8 for all files
- LF line endings (Unix-style)
## Embedding Conventions

| Rule | Detail |
|---|---|
| Single model | rufimelo/Legal-BERTimbau-sts-base (768-dim) |
| Normalization | Always normalize_embeddings=True for cosine similarity |
| Text composition | Use compose_embedding_text() — never embed raw chunk text |
| Output naming | embeddings_{area}.json, search_corpus_{area}.json, bm25_index_{area}.json |
| Compatibility | Output must be compatible with Valter/Juca infrastructure |
The composed text format:

```text
[categoria | instituto_1, instituto_2 | tipo_conteudo | titulo | corpo]
```

This ensures the embedding captures both the metadata context and the actual content.
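A sketch of how `compose_embedding_text()` could produce that format. The function name comes from the table above, but its signature and the assumption that `instituto` is a list are illustrative guesses.

```python
def compose_embedding_text(meta: dict, body: str) -> str:
    """Build the composed text to embed (illustrative signature).

    Output format:
    [categoria | instituto_1, instituto_2 | tipo_conteudo | titulo | corpo]
    """
    # Assumption: instituto is stored as a list of strings in the metadata
    institutos = ", ".join(meta.get("instituto", []))
    parts = [
        meta.get("categoria", ""),
        institutos,
        meta.get("tipo_conteudo", ""),
        meta.get("titulo", ""),
        body,
    ]
    return "[" + " | ".join(parts) + "]"
```

Because the same composition runs at index time and query time, the embedding space stays consistent, which is why raw chunk text must never be embedded directly.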
## Git Conventions

### Branch Naming

```text
feat/SEN-XXX-description      # new feature linked to Linear issue
fix/SEN-XXX-description       # bug fix linked to Linear issue
docs/description              # documentation changes
refactor/description          # code restructuring
```

Append `-claude` for branches created by Claude Code, `-codex` for branches created by Codex.
### Commit Messages

Follow Conventional Commits:

```text
feat: add bibliography detection to rechunker
fix: handle colons in frontmatter values
docs: update environment variable reference
refactor: extract parse_frontmatter to utils.py
test: add fixtures for smart_split edge cases
chore: pin dependency versions
```

Include the Linear ticket reference when applicable:

```text
feat: standardize env vars across pipeline -- SEN-358
```

### Co-Authorship
When commits are produced with AI assistance:
```text
Co-Authored-By (execucao): Claude Opus 4.6 <noreply@anthropic.com>
```

The term (execucao) indicates the AI assisted with implementation. All conception, architecture, product decisions, and intellectual property belong to the project owner.
### Never Commit

- `.env` files or API keys
- Embedding JSON files (too large, generated artifacts)
- `__pycache__/` directories
- Model weights or HuggingFace cache
- Node modules (if any frontend tooling is added)
## What NOT To Do (Non-Goals)

Do not do any of the following without explicit authorization:
| Non-goal | Reason |
|---|---|
| Introduce abstractions without clear need | Simplicity over elegance |
| Add dependencies for problems already solved in the codebase | Minimize dependency surface |
| Refactor working code without a specific issue | If it works, leave it |
| Optimize without evidence of performance problems | Premature optimization is the root of all evil |
| Expand scope beyond the issue being worked on | Stay focused |
| Create API/MCP server before the pipeline is stable | Foundation before features |
| Manage cases (Joseph’s job) | Douto’s scope is doctrine only |
| Search case law (Juca/Valter’s job) | Douto’s scope is doctrine only |
| Search legislation (Leci’s job) | Douto’s scope is doctrine only |
| Manage infrastructure (Valter’s job) | Douto is a processing pipeline, not a service |