Feature: Embedding Generation
Embedding Generation (F04)
Section titled “Embedding Generation (F04)”pipeline/embed_doutrina.py — Generates 768-dimensional semantic embeddings using Legal-BERTimbau, with a metadata-enriched text composition strategy that embeds not just the chunk body but also its legal classification. This is what enables semantic search over the doctrinal corpus.
Overview
Section titled “Overview”| Property | Value |
|---|---|
| Script | pipeline/embed_doutrina.py (345 lines) |
| Input | Enriched chunks with status_enriquecimento: "completo" |
| Output | Three JSON files: embeddings, search corpus, BM25 index |
| Model | rufimelo/Legal-BERTimbau-sts-base (768-dim) |
| Max tokens | 512 (MAX_TOKENS) |
| Batch size | 32 (BATCH_SIZE) |
| Device | CUDA if available, otherwise CPU |
Text Composition Strategy
Section titled “Text Composition Strategy”This is the most important design decision in the embedding stage. Instead of embedding raw chunk text, Douto composes a structured prefix that captures the legal classification within the BERT token limit.
Format
Section titled “Format”[categoria] | [institutos] | [tipo_conteudo] | [titulo] | [corpo_truncado]Why This Matters
Section titled “Why This Matters”Legal-BERTimbau has a hard 512-token limit (~2,000 characters in Portuguese). A raw chunk body would fill the entire token budget with text, losing the semantic signal from its classification. By prefixing with structured metadata, the embedding captures both what the text is about (instituto, categoria) and what kind of content it is (definicao vs. jurisprudencia vs. exemplo) in the vector space.
This means a search for “requisitos da exceptio non adimpleti contractus” will match chunks classified as instituto: ["exceptio_non_adimpleti_contractus"] + tipo_conteudo: ["requisitos"] even if the body text uses different phrasing.
The Actual Code
Section titled “The Actual Code”def compose_embedding_text(fm: dict, body: str) -> str: parts = []
# 1. Categoria geral cat = fm.get("categoria", "") if cat: parts.append(cat)
# 2. Institutos juridicos (most important for search) institutos = fm.get("instituto", []) if isinstance(institutos, list) and institutos: parts.append(", ".join(i.replace("_", " ") for i in institutos[:5]))
# 3. Tipo de conteudo (definicao, jurisprudencia, etc.) tipos = fm.get("tipo_conteudo", []) if isinstance(tipos, list) and tipos: parts.append(", ".join(t.replace("_", " ") for t in tipos[:3]))
# 4. Titulo da secao titulo = fm.get("titulo", "") if titulo: titulo_clean = re.sub(r'\(cont\.\s*\d+\)', '', titulo).strip() titulo_clean = re.sub(r'^\d+\.\s*', '', titulo_clean).strip() if titulo_clean: parts.append(titulo_clean[:150])
# 5. Corpo — truncate to fit model (512 tokens ~ 2000 chars PT) body_clean = body.strip() body_clean = re.sub(r'^>.*\n', '', body_clean, flags=re.MULTILINE).strip() body_clean = re.sub(r'^#+\s+.*\n', '', body_clean, count=1).strip()
prefix = " | ".join(parts)
# BERTimbau: ~512 tokens, ~4 chars/token PT -> ~2000 chars max_body = 1800 - len(prefix) if max_body < 200: max_body = 200
text = f"{prefix} | {body_clean[:max_body]}" return textExample Output
Section titled “Example Output”contratos | contrato bilateral, exceptio non adimpleti contractus |definicao, requisitos | Contratos bilaterais e unilaterais |A exceptio non adimpleti contractus e a defesa que pode ser opostapelo contratante demandado quando o outro nao cumpriu...Corpus Entry Structure
Section titled “Corpus Entry Structure”For each chunk, build_corpus_entry() creates a metadata document used for search result display and metadata filtering:
def build_corpus_entry(fm: dict, body: str, filepath: Path) -> dict: return { "id": filepath.stem, # e.g., "026-contratos-bilaterais" "livro": fm.get("livro_titulo", ""), "autor": fm.get("autor", ""), "titulo": fm.get("titulo", ""), "chunk_numero": fm.get("chunk_numero", 0), "chunk_total": fm.get("chunk_total", 0), "categoria": fm.get("categoria", ""), "instituto": fm.get("instituto", []), "sub_instituto": fm.get("sub_instituto", []), "tipo_conteudo": fm.get("tipo_conteudo", []), "fase": fm.get("fase", []), "ramo": fm.get("ramo", ""), "fontes_normativas": fm.get("fontes_normativas", []), "confiabilidade": fm.get("confiabilidade", ""), "livro_dir": filepath.parent.name, "texto": body[:3000], # Preview for display }Output Files
Section titled “Output Files”The script produces three files, all JSON, designed for compatibility with the existing Juca/Valter infrastructure.
| File | Contents | Used By |
|---|---|---|
embeddings_doutrina.json | doc_ids[] + embeddings[][] (768-dim float32, normalized) + model metadata | Semantic search (search_doutrina_v2.py) |
search_corpus_doutrina.json | Full metadata per chunk (15 fields from build_corpus_entry) | Result display and metadata filtering |
bm25_index_doutrina.json | doc_ids[] + documents[] (composed text, same as embedding input) | BM25 keyword search |
Embeddings JSON structure
Section titled “Embeddings JSON structure”{ "model": "rufimelo/Legal-BERTimbau-sts-base", "dimension": 768, "num_docs": 9365, "min_val": -0.123, "max_val": 0.456, "created_at": "2026-02-28T14:30:00", "doc_ids": ["book-dir/001-chapter-slug", "..."], "embeddings": [[0.012, -0.034, ...], ...]}Chunk Collection and Filtering
Section titled “Chunk Collection and Filtering”Before embedding, collect_chunks() filters the corpus:
- Skip
_prefixed files (index files like_INDEX.md,_RAW_FULL.md) - Skip
status_enriquecimento: "lixo"(noise chunks) - Skip chunks with body < 200 characters
- Skip unenriched chunks (no
institutoand notipo_conteudo— not yet classified)
This ensures only substantive, classified chunks enter the embedding index.
Configuration
Section titled “Configuration”Environment Variables
Section titled “Environment Variables”| Variable | Required | Default | Description |
|---|---|---|---|
VAULT_PATH | No | /mnt/c/Users/sensd/vault | Base directory with enriched chunks |
OUTPUT_PATH | No | /home/sensd/.openclaw/workspace/juca/data | Where to write output JSON files |
CLI Arguments
Section titled “CLI Arguments”# Generate embeddings with default settingspython3 pipeline/embed_doutrina.py
# Specify output directorypython3 pipeline/embed_doutrina.py --output /path/to/output
# Preview without generating embeddingspython3 pipeline/embed_doutrina.py --dry-run
# Limit chunks for testingpython3 pipeline/embed_doutrina.py --limit 100
# Adjust batch size (default: 32)python3 pipeline/embed_doutrina.py --batch-size 16The --dry-run mode is especially useful: it shows chunk counts per book and sample composed texts without loading the BERT model or generating embeddings.
Performance
Section titled “Performance”The embedding process has two phases:
- Model loading — 5-15 seconds depending on hardware and caching
- Encoding — speed depends on device:
| Device | Approximate Speed |
|---|---|
| CUDA (GPU) | ~500-1000 chunks/s |
| CPU | ~30-50 chunks/s |
For the full corpus (~9,000+ chunks per area), CPU encoding takes approximately 3-5 minutes.
Embeddings are normalized (normalize_embeddings=True) so that cosine similarity can be computed as a simple dot product, avoiding the need for normalization at search time.
Known Limitations
Section titled “Known Limitations”- Flat JSON storage — no HNSW/FAISS/Qdrant indexing. Search is brute-force dot product over the entire matrix. Works for the current scale but will not scale past ~100 books. Tracked as mitigation M12.
- 512-token truncation — chunks longer than ~2,000 characters (after prefix) lose tail content. The prefix takes 100-400 characters, leaving 1,400-1,700 characters for the body. Long legal analyses may have important content truncated.
- Metadata pollution — if enrichment metadata is wrong (e.g.,
institutomisclassified), the embedding captures incorrect semantics. Since the prefix is weighted first in the token sequence, bad metadata actively harms search quality. This is why the metadata quality gate (M06) is critical. - Legal-BERTimbau trained on PT-PT — the model was trained on Portuguese from Portugal, not Brazilian Portuguese. While the languages are mutually intelligible, there may be subtle vocabulary mismatches for Brazilian legal terminology.
- No incremental update — adding a new book requires regenerating the entire embedding file. There is no append-only or incremental index update mechanism.
- BM25 index reuses composed text — the BM25 index uses the same metadata-prefixed text as embeddings. This means BM25 keyword search matches against metadata terms (instituto names, categories) in addition to body text, which may inflate relevance of metadata-matching but content-irrelevant results.