Testing
Testing
Section titled “Testing”Current testing status and the planned testing strategy for Douto.
Current State
Section titled “Current State”Planned Testing Strategy
Section titled “Planned Testing Strategy”Testing will be introduced in three phases across milestones v0.3 and v0.5.
Phase 1: Critical Path Tests (v0.3 — F26)
Section titled “Phase 1: Critical Path Tests (v0.3 — F26)”Framework: pytest
Target: rechunk_v3.py — the most complex and fragile component
Priority functions to test:
| Function | Lines | Why it’s critical |
|---|---|---|
smart_split() | ~200 | The 5-pass chunking algorithm. Core of the entire pipeline. |
detect_section() | ~20 | Uses 16 regex patterns to identify legal document structure. Wrong detection = wrong chunk boundaries. |
classify_title() | ~50 | Distinguishes content from noise, bibliography, summary, and running headers. False positives lose real content. |
aggregate_footnotes() | ~30 | Groups footnotes with parent paragraphs. Broken grouping = orphaned footnotes. |
detect_tail() | ~20 | Identifies trailing content (bibliography, index). Misdetection = bibliography mixed with content. |
classify_block_content() | ~40 | Classifies blocks as example, table, characteristics, law_article, or bibliography. |
Test fixtures: Real markdown samples from legal books (anonymized if needed to avoid IP concerns).
Minimum acceptance: At least one test per function above, covering both the happy path and at least one edge case.
Phase 2: Utility Tests (v0.3 — F27)
Section titled “Phase 2: Utility Tests (v0.3 — F27)”Target: Shared functions (currently duplicated, to be extracted to utils.py in F23)
| Function | Edge cases to cover |
|---|---|
parse_frontmatter() | Values with colons (titulo: "Codigo Civil: Comentado"), hash symbols, multiline values, missing frontmatter, malformed YAML |
slugify() | Unicode characters, accented letters, special characters, collision-prone inputs (e.g., “Contratos - Vol. 1” vs. “Contratos - Vol 1”) |
extract_json() | Valid JSON in markdown code blocks, JSON with extra text around it, malformed JSON, nested objects |
compose_embedding_text() | Missing metadata fields, empty body, very long text (truncation at 512 tokens) |
Phase 3: Search Quality Tests (v0.5 — F40)
Section titled “Phase 3: Search Quality Tests (v0.5 — F40)”Target: End-to-end search quality evaluation
| Component | What to measure |
|---|---|
| Evaluation set | 30+ queries with expected result chunks (manually curated) |
| recall@5 | What fraction of expected results appear in top 5? |
| recall@10 | What fraction appear in top 10? |
| nDCG | Are the most relevant results ranked highest? |
| Regression detection | Before/after comparison when pipeline changes (new model, new prompt, rechunking) |
This phase also includes metadata quality validation (M06):
- Sample 200 random chunks
- Verify
instituto,tipo_conteudo,ramoagainst actual content - Establish accuracy baseline
- Quality gate: if accuracy < 85%, halt and re-enrich before proceeding
How to Run Tests
Section titled “How to Run Tests”Planned Feature — Test infrastructure is on the roadmap for v0.3 but not yet implemented.
# Run all testsmake test# or directly:pytest
# Run only pipeline testspytest tests/pipeline/
# Run a specific testpytest -k "test_smart_split"
# Run with coverage reportpytest --cov=pipeline --cov-report=html
# Run with verbose outputpytest -vHow to Write Tests
Section titled “How to Write Tests”Directory Structure (Planned)
Section titled “Directory Structure (Planned)”tests/├── conftest.py # Shared fixtures├── fixtures/│ ├── markdown/ # Sample legal markdown files│ │ ├── contract_chapter.md│ │ ├── cpc_commentary.md│ │ └── bibliography.md│ ├── frontmatter/ # Frontmatter edge cases│ │ ├── valid.md│ │ ├── colon_in_value.md│ │ └── missing_frontmatter.md│ └── json/ # Sample JSON responses│ ├── enrichment_response.json│ └── malformed_response.json├── pipeline/│ ├── test_rechunk_v3.py│ ├── test_enrich_chunks.py│ ├── test_embed_doutrina.py│ ├── test_search_doutrina.py│ └── test_utils.py└── evaluation/ └── test_search_quality.py # Phase 3 eval setConventions
Section titled “Conventions”| Convention | Rule |
|---|---|
| File naming | test_{module}.py |
| Function naming | test_{function}_{scenario}() |
| Fixtures | Place in tests/fixtures/, use pytest fixture mechanism |
| External APIs | Always mock. Never call MiniMax, LlamaParse, or HuggingFace in tests. |
| Assertions | Test behavior and output, not implementation details |
| Independence | Each test must be independent — no shared mutable state between tests |
Example Test (Preview)
Section titled “Example Test (Preview)”import pytestfrom pipeline.rechunk_v3 import detect_section, classify_title
class TestDetectSection: def test_markdown_header(self): result = detect_section("## Responsabilidade Civil") assert result is not None title, ptype = result assert ptype == "md_header" assert "Responsabilidade Civil" in title
def test_legal_article(self): result = detect_section("**Art. 481.** O vendedor obriga-se...") assert result is not None title, ptype = result assert ptype == "artigo"
def test_plain_text_not_section(self): result = detect_section("Este principio fundamenta a boa-fe objetiva.") assert result is None
class TestClassifyTitle: def test_bibliography_detection(self): assert classify_title("REFERENCIAS BIBLIOGRAFICAS") == "bibliography" assert classify_title("Bibliografia") == "bibliography"
def test_noise_detection(self): assert classify_title("Agradecimentos") == "noise" assert classify_title("PREFACIO") == "noise"Test Data
Section titled “Test Data”Creating Fixtures
Section titled “Creating Fixtures”- Use real legal markdown for chunking tests (from the processed corpus)
- Anonymize if there are IP concerns — replace proper names, paraphrase content while preserving structure
- Use synthetic frontmatter for parser tests (craft specific edge cases)
- Use pre-computed small embeddings for search tests (a few dozen vectors, not the full corpus)
- Never commit full corpus data to test fixtures (too large, potential IP issues)
Evaluation Set Guidelines (Phase 3)
Section titled “Evaluation Set Guidelines (Phase 3)”Each entry in the evaluation set should include:
{ "query": "exceptio non adimpleti contractus requisitos", "expected_chunks": ["contratos-orlando-gomes-cap12-003", "contratos-civil-cap08-001"], "area": "contratos", "notes": "Both chunks define the requirements for the defense"}Target: at least 30 queries covering:
- Different legal domains (civil, processual, empresarial)
- Different query types (conceptual, specific article, comparative)
- Edge cases (cross-domain concepts like “boa-fe”, very specific terms)