Skip to content

Testing

Current testing status and the planned testing strategy for Douto.

Testing will be introduced in three phases across milestones v0.3 and v0.5.

Phase 1: Critical Path Tests (v0.3 — F26)

Section titled “Phase 1: Critical Path Tests (v0.3 — F26)”

Framework: pytest Target: rechunk_v3.py — the most complex and fragile component

Priority functions to test:

FunctionLinesWhy it’s critical
smart_split()~200The 5-pass chunking algorithm. Core of the entire pipeline.
detect_section()~20Uses 16 regex patterns to identify legal document structure. Wrong detection = wrong chunk boundaries.
classify_title()~50Distinguishes content from noise, bibliography, summary, and running headers. False positives lose real content.
aggregate_footnotes()~30Groups footnotes with parent paragraphs. Broken grouping = orphaned footnotes.
detect_tail()~20Identifies trailing content (bibliography, index). Misdetection = bibliography mixed with content.
classify_block_content()~40Classifies blocks as example, table, characteristics, law_article, or bibliography.

Test fixtures: Real markdown samples from legal books (anonymized if needed to avoid IP concerns).

Minimum acceptance: At least one test per function above, covering both the happy path and at least one edge case.

Target: Shared functions (currently duplicated, to be extracted to utils.py in F23)

FunctionEdge cases to cover
parse_frontmatter()Values with colons (titulo: "Codigo Civil: Comentado"), hash symbols, multiline values, missing frontmatter, malformed YAML
slugify()Unicode characters, accented letters, special characters, collision-prone inputs (e.g., “Contratos - Vol. 1” vs. “Contratos - Vol 1”)
extract_json()Valid JSON in markdown code blocks, JSON with extra text around it, malformed JSON, nested objects
compose_embedding_text()Missing metadata fields, empty body, very long text (truncation at 512 tokens)

Phase 3: Search Quality Tests (v0.5 — F40)

Section titled “Phase 3: Search Quality Tests (v0.5 — F40)”

Target: End-to-end search quality evaluation

ComponentWhat to measure
Evaluation set30+ queries with expected result chunks (manually curated)
recall@5What fraction of expected results appear in top 5?
recall@10What fraction appear in top 10?
nDCGAre the most relevant results ranked highest?
Regression detectionBefore/after comparison when pipeline changes (new model, new prompt, rechunking)

This phase also includes metadata quality validation (M06):

  • Sample 200 random chunks
  • Verify instituto, tipo_conteudo, ramo against actual content
  • Establish accuracy baseline
  • Quality gate: if accuracy < 85%, halt and re-enrich before proceeding

Planned Feature — Test infrastructure is on the roadmap for v0.3 but not yet implemented.

Terminal window
# Run all tests
make test
# or directly:
pytest
# Run only pipeline tests
pytest tests/pipeline/
# Run a specific test
pytest -k "test_smart_split"
# Run with coverage report
pytest --cov=pipeline --cov-report=html
# Run with verbose output
pytest -v
tests/
├── conftest.py # Shared fixtures
├── fixtures/
│ ├── markdown/ # Sample legal markdown files
│ │ ├── contract_chapter.md
│ │ ├── cpc_commentary.md
│ │ └── bibliography.md
│ ├── frontmatter/ # Frontmatter edge cases
│ │ ├── valid.md
│ │ ├── colon_in_value.md
│ │ └── missing_frontmatter.md
│ └── json/ # Sample JSON responses
│ ├── enrichment_response.json
│ └── malformed_response.json
├── pipeline/
│ ├── test_rechunk_v3.py
│ ├── test_enrich_chunks.py
│ ├── test_embed_doutrina.py
│ ├── test_search_doutrina.py
│ └── test_utils.py
└── evaluation/
└── test_search_quality.py # Phase 3 eval set
ConventionRule
File namingtest_{module}.py
Function namingtest_{function}_{scenario}()
FixturesPlace in tests/fixtures/, use pytest fixture mechanism
External APIsAlways mock. Never call MiniMax, LlamaParse, or HuggingFace in tests.
AssertionsTest behavior and output, not implementation details
IndependenceEach test must be independent — no shared mutable state between tests
tests/pipeline/test_rechunk_v3.py
import pytest
from pipeline.rechunk_v3 import detect_section, classify_title
class TestDetectSection:
def test_markdown_header(self):
result = detect_section("## Responsabilidade Civil")
assert result is not None
title, ptype = result
assert ptype == "md_header"
assert "Responsabilidade Civil" in title
def test_legal_article(self):
result = detect_section("**Art. 481.** O vendedor obriga-se...")
assert result is not None
title, ptype = result
assert ptype == "artigo"
def test_plain_text_not_section(self):
result = detect_section("Este principio fundamenta a boa-fe objetiva.")
assert result is None
class TestClassifyTitle:
def test_bibliography_detection(self):
assert classify_title("REFERENCIAS BIBLIOGRAFICAS") == "bibliography"
assert classify_title("Bibliografia") == "bibliography"
def test_noise_detection(self):
assert classify_title("Agradecimentos") == "noise"
assert classify_title("PREFACIO") == "noise"
  • Use real legal markdown for chunking tests (from the processed corpus)
  • Anonymize if there are IP concerns — replace proper names, paraphrase content while preserving structure
  • Use synthetic frontmatter for parser tests (craft specific edge cases)
  • Use pre-computed small embeddings for search tests (a few dozen vectors, not the full corpus)
  • Never commit full corpus data to test fixtures (too large, potential IP issues)

Each entry in the evaluation set should include:

{
"query": "exceptio non adimpleti contractus requisitos",
"expected_chunks": ["contratos-orlando-gomes-cap12-003", "contratos-civil-cap08-001"],
"area": "contratos",
"notes": "Both chunks define the requirements for the defense"
}

Target: at least 30 queries covering:

  • Different legal domains (civil, processual, empresarial)
  • Different query types (conceptual, specific article, comparative)
  • Edge cases (cross-domain concepts like “boa-fe”, very specific terms)