# PDF Extraction (F01)
`pipeline/process_books.py` — the entry point of the Douto pipeline. It converts legal textbook PDFs into structured markdown files with chapter splitting and YAML frontmatter, preparing them for downstream chunking and enrichment.
## Overview
| Property | Value |
|---|---|
| Script | `pipeline/process_books.py` (414 lines) |
| Input | PDF files in `Knowledge/_staging/input/` |
| Output | Markdown files in `Knowledge/_staging/processed/{slug}/` |
| Engine | LlamaParse (LlamaIndex) |
| Dependencies | `llama_parse`, `asyncio` |
| Idempotent | Yes — skips already-processed PDFs via marker files |
```mermaid
flowchart TD
    PDF["PDF in _staging/input/"]
    LP["LlamaParse API"]
    MD["Full markdown"]
    SPLIT["Split by H1/H2"]
    CHUNKS["Chapter .md files"]
    INDEX["_INDEX.md"]
    RAW["_RAW_FULL.md (backup)"]

    PDF -->|"Send to API (async)"| LP
    LP -->|"Receive markdown"| MD
    MD --> SPLIT
    SPLIT --> CHUNKS
    SPLIT --> INDEX
    MD --> RAW
```

## How It Works
### 1. Scan for unprocessed PDFs
Section titled “1. Scan for unprocessed PDFs”The script scans _staging/input/ for .pdf files that do not have a .processed_{filename} marker file. This marker is created after successful processing, making the operation idempotent.
### 2. Send to LlamaParse
Each PDF is sent to the LlamaParse cloud API with a configurable extraction tier:
| Tier | Use Case | Cost |
|---|---|---|
| `fast` | Quick extraction without formatting, cheapest | Low |
| `cost_effective` | Best cost-benefit for clean digital text (default) | Medium |
| `agentic` | Scanned PDFs or complex layouts | High |
The parser is configured for Portuguese (`language="pt"`) with 2 workers:
```python
parser = LlamaParse(
    result_type="markdown",
    num_workers=2,
    verbose=True,
    language="pt",
)
documents = await parser.aload_data(str(pdf_path))
```
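The snippet above does not show how the tier flag is applied. One plausible mapping, assuming the `fast_mode` and `premium_mode` options of the `llama_parse` client (these flags are an assumption here, not confirmed from the script; verify against your installed client version):

```python
from llama_parse import LlamaParse

def make_parser(tier: str) -> LlamaParse:
    """Build a LlamaParse client for the requested extraction tier (sketch)."""
    kwargs = dict(result_type="markdown", num_workers=2, verbose=True, language="pt")
    if tier == "fast":
        kwargs["fast_mode"] = True      # assumed flag: skip heavy layout analysis
    elif tier == "agentic":
        kwargs["premium_mode"] = True   # assumed flag: LLM-assisted parsing
    # "cost_effective" falls through to the defaults shown above
    return LlamaParse(**kwargs)
```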
### 3. Split into chapters

The `split_into_chapters()` function divides the markdown output by H1 (`#`) and H2 (`##`) headers. Each chapter becomes an independent chunk:
```python
import re

def split_into_chapters(markdown_text: str, filename: str) -> list[dict]:
    lines = markdown_text.split('\n')
    chunks = []
    current_chunk = {
        "title": "Introdução / Pré-textual",
        "level": 0,
        "content": [],
        "page_hint": ""
    }
    parent_title = filename

    for line in lines:
        h1_match = re.match(r'^# (.+)$', line)
        h2_match = re.match(r'^## (.+)$', line)

        if h1_match or h2_match:
            # Save previous chunk if it has substantial content
            if current_chunk["content"]:
                content_text = '\n'.join(current_chunk["content"]).strip()
                if len(content_text) > 100:  # ignore very small chunks
                    chunks.append({
                        "title": current_chunk["title"],
                        "parent": parent_title,
                        "content": content_text
                    })
            # Start new chunk...
```

Key design decisions:
- H1 headers set the parent title context. Subsequent H2 chunks inherit it as `parent_title > section_title` (illustrated below).
- Minimum content threshold: chunks with fewer than 100 characters are discarded.
- Fallback: if no headers are detected, the entire document becomes a single chunk.
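As a quick illustration of the intended behavior, assuming the full function (the excerpt above elides the chunk-starting and final-flush logic), with hypothetical input:

```python
sample = """# Contratos
Texto introdutório de abertura, longo o suficiente para ultrapassar o
limite mínimo de cem caracteres exigido pelo filtro de conteúdo.

## Contratos Bilaterais
Seção sobre contratos bilaterais, também com texto suficiente para
passar do limite mínimo de cem caracteres e ser mantida como chunk.
"""

for chunk in split_into_chapters(sample, "contratos-orlando-gomes.pdf"):
    print(f"{chunk['parent']} > {chunk['title']}")
# Expected: one chunk per H1/H2 section, each carrying its parent title,
# e.g. "Contratos > Contratos Bilaterais" for the H2 section.
```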
### 4. Generate slug
File and directory names use a Portuguese-aware slugification function:
```python
def slugify(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r'[àáâãäå]', 'a', text)
    text = re.sub(r'[èéêë]', 'e', text)
    text = re.sub(r'[ìíîï]', 'i', text)
    text = re.sub(r'[òóôõö]', 'o', text)
    text = re.sub(r'[ùúûü]', 'u', text)
    text = re.sub(r'[ç]', 'c', text)
    text = re.sub(r'[^a-z0-9\s-]', '', text)
    text = re.sub(r'[\s_]+', '-', text)
    text = re.sub(r'-+', '-', text)
    return text.strip('-')[:80]
```
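For example, tracing the function above:

```python
>>> slugify("Contratos Bilaterais e Unilaterais")
'contratos-bilaterais-e-unilaterais'
>>> slugify("Introdução / Pré-textual")
'introducao-pre-textual'
```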
### 5. Write output files

For each book, the script writes:
- `_RAW_FULL.md` — Complete LlamaParse output as backup
- `_INDEX.md` — Book index note with wikilinks to all chapters
- `001-{slug}.md` … `NNN-{slug}.md` — Individual chapter files with YAML frontmatter (naming sketched below)
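A sketch of the chapter-file naming, assuming the chunk list from step 3 and `slugify()` from step 4 (the writer helper itself is illustrative, and frontmatter generation is omitted here):

```python
from pathlib import Path

def write_chapter_files(chunks: list[dict], out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, chunk in enumerate(chunks, start=1):
        # Zero-padded prefixes (001-, 002-, ...) keep chapters sorted on disk
        name = f"{i:03d}-{slugify(chunk['title'])}.md"
        (out_dir / name).write_text(chunk["content"], encoding="utf-8")
```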
## Output Format
Each chapter file includes YAML frontmatter designed for downstream enrichment:
```yaml
---
knowledge_id: ""
tipo: "livro_chunk"
titulo: "Contratos Bilaterais e Unilaterais"
livro_titulo: "Contratos"
livro_arquivo_original: "contratos-orlando-gomes.pdf"
chunk_numero: 5
chunk_total: 42
fonte_primaria: ""
autor: ""
editora: ""
edicao: ""
ano: ""
paginas: ""
confianca: "UNVERIFIED"
confidencialidade: "Publico"
area_direito: []
teses_extraidas: []
casos_vinculados: []
tags: []
data_criacao: "2026-02-28T14:30:00"
data_ultima_modificacao: ""
status_enriquecimento: "pendente"
---
```
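One way to produce such a block, assuming PyYAML and an abbreviated field set (the script may instead build the string by hand; field names follow the frontmatter above):

```python
from datetime import datetime

import yaml  # pip install pyyaml

def build_frontmatter(titulo: str, livro_titulo: str, arquivo: str,
                      numero: int, total: int) -> str:
    meta = {
        "knowledge_id": "",
        "tipo": "livro_chunk",
        "titulo": titulo,
        "livro_titulo": livro_titulo,
        "livro_arquivo_original": arquivo,
        "chunk_numero": numero,
        "chunk_total": total,
        "confianca": "UNVERIFIED",
        "confidencialidade": "Publico",
        "tags": [],
        "data_criacao": datetime.now().isoformat(timespec="seconds"),
        "status_enriquecimento": "pendente",
    }
    # sort_keys=False preserves field order; allow_unicode keeps accents intact
    return "---\n" + yaml.safe_dump(meta, sort_keys=False, allow_unicode=True) + "---\n"
```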
## Configuration

### Environment Variables
| Variable | Required | Description |
|---|---|---|
| `LLAMA_CLOUD_API_KEY` | Yes | LlamaParse API authentication (via env or `.env`) |
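The key can come from the shell environment or a `.env` file; a minimal loading sketch, assuming `python-dotenv` (whether the script uses that package is not confirmed here):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # picks up a .env file in the working directory, if present
api_key = os.environ["LLAMA_CLOUD_API_KEY"]  # raises KeyError when unset
```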
### CLI Arguments
```bash
# Process all unprocessed PDFs in staging
python3 pipeline/process_books.py

# Process a specific PDF file
python3 pipeline/process_books.py contratos-orlando-gomes.pdf

# Use a different LlamaParse tier
python3 pipeline/process_books.py --tier fast
python3 pipeline/process_books.py --tier agentic

# Default tier is cost_effective
python3 pipeline/process_books.py --tier cost_effective
```
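The flags above suggest an argument parser roughly like this (a sketch; the script's actual parser is not shown here):

```python
import argparse

parser = argparse.ArgumentParser(
    description="Convert staged PDFs into chaptered markdown."
)
parser.add_argument(
    "filename", nargs="?", default=None,
    help="a specific PDF in _staging/input/ (default: all unprocessed PDFs)",
)
parser.add_argument(
    "--tier", choices=["fast", "cost_effective", "agentic"],
    default="cost_effective", help="LlamaParse extraction tier",
)
args = parser.parse_args()
```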
## Directory Structure

```
vault/Knowledge/_staging/
  input/                      # Place PDFs here
    livro.pdf
    .processed_livro.pdf      # Marker (auto-created)
  processed/
    contratos-orlando-gomes/
      _RAW_FULL.md
      _INDEX.md
      001-introducao.md
      002-contratos-bilaterais.md
      ...
  failed/                     # PDFs that failed processing
```
## Logging

All events are appended to `_staging/processing_log.jsonl`:
{"op": "process_success", "file": "contratos.pdf", "book_title": "Contratos", "chunks": 42, "chars": 185000, "tier": "cost_effective", "output_dir": "/path/to/output", "ts": "2026-02-28T14:30:00"}{"op": "skip", "file": "contratos.pdf", "reason": "already_processed", "ts": "2026-02-28T14:35:00"}{"op": "process_failed", "file": "scanned-book.pdf", "error": "LlamaParse retornou vazio", "tier": "fast", "ts": "2026-02-28T14:40:00"}Known Limitations
## Known Limitations

- No `--dry-run` mode — unlike `rechunk_v3.py` and `enrich_chunks.py`, this script cannot preview operations without making API calls.
- LlamaParse quality varies — scanned PDFs or those with complex layouts (tables, multi-column) may produce poor markdown. Use `--tier agentic` for difficult PDFs.
- Chapter splitting assumes H1/H2 structure — books without markdown headers (e.g., flat extracted text) become a single chunk. Non-hierarchical legal texts (dictionaries, compilations) produce poor results.
- No OCR quality validation — there is no post-extraction check for garbled text or extraction artifacts.
- Title detection is heuristic — the book title is derived from the filename (`pdf_path.stem.replace('-', ' ').title()`), which may not match the actual title.
- `slugify()` is duplicated — the same function exists in `rechunk_v3.py`. Tracked as F23 for extraction into `pipeline/utils.py`.