Late chunking for long‑context RAG: precision without the price tag

Cut costs, keep precision in long‑context RAG

Late chunking flips the standard retrieval workflow: it embeds the full document first, then chunks the token embeddings, which preserves cross‑chunk context without ColBERT‑level storage costs. It offers the storage efficiency of naive chunking while retaining much of the precision of late interaction approaches. For teams scaling long‑context RAG, it’s a pragmatic path to better relevance, lower latency, and fewer LLM tokens.
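The pooling step described above can be sketched in a few lines. This is a minimal illustration, not JinaAI's or Weaviate's implementation: it assumes mean pooling over fixed-size token spans, and the names `fixed_size_spans` and `late_chunk_pool` are hypothetical. The toy `token_embeddings` list stands in for the per-token output of a long-context embedding model run once over the whole document.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def fixed_size_spans(num_tokens: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) token spans that cover num_tokens in chunk_size steps."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


def late_chunk_pool(
    token_embeddings: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
) -> List[Vector]:
    """Mean-pool token embeddings inside each span into one vector per chunk.

    Assumption: mean pooling. The token embeddings were computed over the
    full document, so each one already carries cross-chunk context.
    """
    chunk_vectors: List[Vector] = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span ({start}, {end})")
        dim = len(window[0])
        chunk_vectors.append(
            [sum(token[d] for token in window) / len(window) for d in range(dim)]
        )
    return chunk_vectors


# Toy stand-in for model output: 5 tokens, embedding dimension 2.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
spans = fixed_size_spans(len(token_embeddings), chunk_size=2)
chunk_vectors = late_chunk_pool(token_embeddings, spans)
```

Naive chunking would instead run the model separately on each chunk's text, so the only difference in this sketch is where the chunk boundaries are applied: after the forward pass rather than before it. The resulting chunk vectors can be stored and searched exactly like ordinary chunk embeddings.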

Key points

The piece was published on September 5, 2024 and authored by Charles Pierse, Connor Shorten, and Akanksha Sharma at Weaviate.
JinaAI introduced “late chunking,” which embeds the entire document first, then splits and pools token‑level embeddings into chunk vectors.
Naive chunking embeds chunks independently, losing cross‑chunk context; ColBERT’s late interaction retains token context but is storage‑intensive.
Storage comparison for 100,000 documents of 8,000 tokens each (float32, dim=768): late interaction stores one vector per token, about 800 million vectors (~2.46 TB).
With 512‑token chunks, naive chunking needs ~1.6 million vectors (~4.9 GB), about 1/500th of the storage, and late chunking matches this footprint.
Late chunking requires only a modified pooling step (reported as under 30 lines of code) and no changes to the retrieval pipeline.
Long‑context models are required; JinaAI’s jina‑embeddings‑v2‑small‑en supports up to 8,192 tokens and ranks well on MTEB’s LongEmbed retrieval benchmark.
In an example with 128‑token splits, for the query “what do customers need to prioritise?”, late chunking returned the two neighboring paragraphs that answer it, while naive chunking surfaced off‑topic text.
Reported benefits include fewer documents required at query time and more efficient, focused LLM calls with reduced latency and hallucination risk.
ColBERT can be extended to long context via sliding windows, but this heuristic may still lose global context compared with document‑wide embedding before pooling.
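The storage figures in the list above follow from simple arithmetic, reproduced here as a quick check. The helper name `vector_storage_bytes` is illustrative, and the sketch assumes every document is exactly 8,000 tokens long, the last partial chunk is rounded up to a full vector, and sizes use decimal units (1 GB = 10^9 bytes).

```python
import math


def vector_storage_bytes(num_vectors: int, dim: int = 768, bytes_per_value: int = 4) -> int:
    """Raw storage for num_vectors float32 vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value


docs = 100_000
tokens_per_doc = 8_000
chunk_size = 512

# Late interaction keeps one vector per token.
late_interaction_vectors = docs * tokens_per_doc

# Naive and late chunking both keep one vector per chunk.
chunks_per_doc = math.ceil(tokens_per_doc / chunk_size)
chunked_vectors = docs * chunks_per_doc

late_interaction_bytes = vector_storage_bytes(late_interaction_vectors)
chunked_bytes = vector_storage_bytes(chunked_vectors)

print(f"late interaction: {late_interaction_vectors:,} vectors, {late_interaction_bytes / 1e12:.2f} TB")
print(f"chunked:          {chunked_vectors:,} vectors, {chunked_bytes / 1e9:.1f} GB")
print(f"ratio:            {late_interaction_bytes // chunked_bytes}x")
```

The 500x ratio depends only on the number of tokens per chunk after rounding (8,000 tokens over 16 chunks), so it holds for any embedding dimension or numeric precision shared by both approaches.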

Takeaway

If your RAG stack swings between “cheap but clueless” and “precise but pricier than your cloud bill,” late chunking is your Goldilocks option. Start by using a long‑context embedding model, swap in chunk‑level pooling after full‑document inference, and keep your retrieval pipeline as‑is—yes, really. Then monitor storage, top‑k hits, and LLM token usage; if your results improve and your bill doesn’t explode, pat yourself on the back and pretend it was always your plan.
