
Coding Conventions

Standards and patterns for all code and knowledge base contributions to Douto. These are extracted from CLAUDE.md and enforced during code review.

When principles conflict, prioritize in this order:

| Priority | Principle | In Practice |
| --- | --- | --- |
| 1 | Correctness | Legal data, citations, and metadata must be accurate. A wrong instituto classification is worse than a slow query. |
| 2 | Simplicity & clarity | Code that another agent (or a human six months from now) understands without context. No clever tricks. |
| 3 | Maintainability | Easy to change without breaking. Small functions, clear interfaces, minimal coupling between scripts. |
| 4 | Reversibility | Prefer decisions that can be undone. Use `--dry-run`, keep original data, avoid destructive operations. |
| 5 | Performance | Only optimize when there is evidence of a problem. Never sacrifice correctness or clarity for speed. |
  • Python 3.10+ — required for modern type hint syntax
  • Type hints are mandatory on all public functions
  • async/await only where necessary (LlamaParse). The pipeline is mostly synchronous.
  • Tests: pytest (when implemented)
  • Linting: ruff (when configured)
| Convention | Rule | Example |
| --- | --- | --- |
| Functions/variables | snake_case | `parse_frontmatter()`, `chunk_text` |
| Constants | UPPER_SNAKE_CASE | `MAX_CHUNK_CHARS`, `MODEL_NAME` |
| Type hints | Modern syntax | `tuple[dict, str]`, not `Tuple[Dict, str]` |
| String formatting | f-strings preferred | `f"Processed {count} chunks"` |
| Imports | stdlib, then third-party, then local, separated by blank lines | See below |
| Line length | Follow ruff defaults (when configured) | ~88 characters |
```python
# Import order example: stdlib, then third-party, then local,
# with each group separated by a blank line.
import json
import os
import re
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

from pipeline.utils import parse_frontmatter, slugify
```
  • Specific exceptions — never use a broad `except Exception as e`. Catch the specific error type.
  • Log with context — include `doc_id`, `chunk_id`, and traceback when logging errors.
  • Fail loudly — if a chunk is corrupted, log and skip it rather than silently producing garbage.

Required for all public functions. Use Google-style format:

```python
def parse_frontmatter(content: str) -> tuple[dict, str]:
    """Parse YAML frontmatter and body from a markdown string.

    Args:
        content: Full markdown content with optional frontmatter.

    Returns:
        Tuple of (metadata dict, body text without frontmatter).
        If no frontmatter found, returns ({}, original content).
    """
```

Every pipeline script must follow these patterns:

| Feature | Implementation | Purpose |
| --- | --- | --- |
| `--help` | argparse | Document CLI usage |
| `--dry-run` | Check before write | Show what would happen without modifying data |
| `--force` | Skip idempotency checks | Reprocess already-completed items |
| Idempotent | Processing markers, skip logic | Safe to re-run without side effects |
| Structured logging | Append to `processing_log.jsonl` | Track successes, errors, skips |
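A minimal `argparse` skeleton covering the required flags (the `build_parser` name and the `--limit` flag are illustrative assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton with the flags every pipeline script must support."""
    parser = argparse.ArgumentParser(description="Example pipeline script.")
    parser.add_argument("--dry-run", action="store_true",
                        help="Show what would happen without modifying data")
    parser.add_argument("--force", action="store_true",
                        help="Reprocess already-completed items")
    parser.add_argument("--limit", type=int, default=None,
                        help="Process at most N items (useful for testing)")
    return parser
```

`--help` comes for free from argparse; the script body then branches on `args.dry_run` before any write.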
```python
# CORRECT: use os.environ.get() with documented fallback
VAULT_PATH = Path(os.environ.get("VAULT_PATH", "/default/fallback/path"))

# INCORRECT: hardcoded absolute path
VAULT_PATH = Path("/home/sensd/.openclaw/workspace/vault")

# CORRECT: relative to script location
PROMPT_PATH = Path(__file__).parent / "enrich_prompt.md"
```
  • JSON for data output (embeddings, corpus, BM25 index)
  • YAML frontmatter for metadata in markdown chunks
  • JSONL for structured logs (append-only)
  • Progress output goes to stderr; results go to stdout
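A minimal sketch of the logging and stream conventions, assuming a hypothetical `log_event` helper (the event field names are illustrative):

```python
import json
import sys
from datetime import datetime, timezone

def log_event(log_path: str, event: dict) -> None:
    """Append one structured event to a JSONL log (append-only, one JSON per line)."""
    event = {"ts": datetime.now(timezone.utc).isoformat(), **event}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")

# Progress goes to stderr so stdout stays clean for results.
print("processing chunk 1/5...", file=sys.stderr)
```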

Do not run processes that consume more than 50% CPU locally. For validation:

```shell
python3 pipeline/rechunk_v3.py --limit 5 --dry-run  # test with small subset
```
  • Source of truth for domains and navigation
  • Lists all 8 legal domains with wikilinks to their MOCs
  • Must be updated when adding new domains

Required frontmatter fields:

```yaml
---
type: moc
domain: civil
description: "Obrigacoes, contratos, responsabilidade civil, propriedade"
---
```

Required frontmatter fields:

```yaml
---
knowledge_id: "contratos-orlando-gomes-cap05-001"
tipo: chunk
titulo: "Titulo do chunk"
livro_titulo: "Contratos"
autor: "Orlando Gomes"
area_direito: civil
status_enriquecimento: completo # "completo" | "pendente" | "lixo"
---
```

Rules:

  • Chunks with `status_enriquecimento: "completo"` must have `instituto` and `tipo_conteudo` filled
  • Never overwrite enriched chunks without explicit `--force`
  • Never leave `status_enriquecimento: "pendente"` after enrichment runs
  • Use `"lixo"` for noise chunks (prefaces, acknowledgments, catalog cards)
  • Always use wikilinks: `[[MOC_CIVIL]]`
  • Never use relative markdown links: `[text](../mocs/MOC_CIVIL.md)`
  • Wikilinks enable Obsidian graph view and backlink tracking
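One way to enforce the link rules is a small lint check. The regex and function name below are illustrative assumptions, not an existing pipeline tool:

```python
import re

# Matches relative markdown links like [text](../mocs/MOC_CIVIL.md),
# while ignoring absolute http(s) URLs and [[wikilinks]].
RELATIVE_LINK = re.compile(r"\[[^\]]*\]\((?!https?://)[^)]+\)")

def find_relative_links(text: str) -> list[str]:
    """Return relative markdown links that should be wikilinks instead."""
    return RELATIVE_LINK.findall(text)
```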
  • UTF-8 for all files
  • LF line endings (Unix-style)
| Rule | Detail |
| --- | --- |
| Single model | `rufimelo/Legal-BERTimbau-sts-base` (768-dim) |
| Normalization | Always `normalize_embeddings=True` for cosine similarity |
| Text composition | Use `compose_embedding_text()` — never embed raw chunk text |
| Output naming | `embeddings_{area}.json`, `search_corpus_{area}.json`, `bm25_index_{area}.json` |
| Compatibility | Output must be compatible with Valter/Juca infrastructure |

The composed text format:

```text
[categoria | instituto_1, instituto_2 | tipo_conteudo | titulo | corpo]
```

This ensures the embedding captures both the metadata context and the actual content.
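A hypothetical sketch of `compose_embedding_text()` producing that format (the input field names are assumptions; the real implementation may differ):

```python
def compose_embedding_text(chunk: dict) -> str:
    """Sketch: join metadata context and body into one embedding input string."""
    # 'instituto' is assumed to be a list of instituto names.
    institutos = ", ".join(chunk.get("instituto", []))
    return (
        f"[{chunk.get('categoria', '')} | {institutos} | "
        f"{chunk.get('tipo_conteudo', '')} | "
        f"{chunk.get('titulo', '')} | {chunk.get('corpo', '')}]"
    )
```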

```text
feat/SEN-XXX-description     # new feature linked to Linear issue
fix/SEN-XXX-description      # bug fix linked to Linear issue
docs/description             # documentation changes
refactor/description         # code restructuring
```

Append `-claude` for branches created by Claude Code, `-codex` for branches created by Codex.

Follow Conventional Commits:

```text
feat: add bibliography detection to rechunker
fix: handle colons in frontmatter values
docs: update environment variable reference
refactor: extract parse_frontmatter to utils.py
test: add fixtures for smart_split edge cases
chore: pin dependency versions
```

Include the Linear ticket reference when applicable:

```text
feat: standardize env vars across pipeline -- SEN-358
```

When commits are produced with AI assistance:

```text
Co-Authored-By (execucao): Claude Opus 4.6 <noreply@anthropic.com>
```

The term (execucao) indicates the AI assisted with implementation. All conception, architecture, product decisions, and intellectual property belong to the project owner.

Never commit any of the following:

  • `.env` files or API keys
  • Embedding JSON files (too large, generated artifacts)
  • `__pycache__/` directories
  • Model weights or HuggingFace cache
  • Node modules (if any frontend tooling is added)

Do not do any of the following without explicit authorization:

| Non-goal | Reason |
| --- | --- |
| Introduce abstractions without clear need | Simplicity over elegance |
| Add dependencies for problems already solved in the codebase | Minimize dependency surface |
| Refactor working code without a specific issue | If it works, leave it |
| Optimize without evidence of performance problems | Premature optimization is the root of all evil |
| Expand scope beyond the issue being worked on | Stay focused |
| Create API/MCP server before the pipeline is stable | Foundation before features |
| Manage cases (Joseph’s job) | Douto’s scope is doctrine only |
| Search case law (Juca/Valter’s job) | Douto’s scope is doctrine only |
| Search legislation (Leci’s job) | Douto’s scope is doctrine only |
| Manage infrastructure (Valter’s job) | Douto is a processing pipeline, not a service |