Building AI agents is 5% AI and 100% software engineering: how to ship reliable doc-to-chat systems at scale

Data plumbing beats model choice

Production AI agents rise or fall on data engineering, governance, and observability, not on model swapping. This doc-to-chat blueprint hardens RAG with ACID storage, vector+SQL retrieval, guardrails, human-in-the-loop gates, and OpenTelemetry traces to meet enterprise reliability and compliance requirements. Invest in controls first; models can be swapped later.

Key points

The core thesis: building AI agents is “5% AI and 100% software engineering,” because outages mostly stem from data quality, permissions, retrieval decay, and missing telemetry rather than from model regressions
A doc-to-chat pipeline standardizes enterprise documents, enforces governance, indexes embeddings with relational features, and serves retrieval+generation behind authenticated APIs with human‑in‑the‑loop checkpoints
Integration guidance: use standard service boundaries (REST/JSON, gRPC) and trusted storage; Apache Iceberg provides ACID, schema/partition evolution, and snapshots for reproducible retrieval and backfills
Vector strategy: pgvector in PostgreSQL co-locates embeddings with business keys and ACL tags for policy‑aware joins; Milvus offers horizontally scalable ANN for high‑QPS similarity search, and many teams run both
Human coordination: Amazon Augmented AI (A2I) manages HITL approval loops, while LangGraph models approvals as first‑class graph steps that gate actions like publishing summaries or committing code, with all artifacts persisted for auditability
Reliability controls before the model: apply guardrails (Amazon Bedrock Guardrails, NVIDIA NeMo Guardrails, Guardrails AI, Llama Guard), run PII detection/redaction with Microsoft Presidio, and enforce access control/lineage via Databricks Unity Catalog
Retrieval quality gates and hybrid search: evaluate RAG with Ragas (faithfulness, context precision/recall), tune chunk sizes and overlaps, and use hybrid retrieval (BM25 + ANN + rerankers such as Cohere Rerank) to fuse structured and unstructured results
Scaling and serving: ingest to Iceberg with versioned snapshots, embed asynchronously, use Milvus with disaggregated storage/compute (index types such as HNSW, IVF, and FLAT), and push SQL+vector policy filters server‑side to avoid N+1 round trips
Observability and evaluations: emit OpenTelemetry traces end‑to‑end and use platforms like LangSmith, Arize Phoenix, Langfuse, or Datadog for tracing, cost, and evals; schedule continuous RAG evaluations (Ragas/DeepEval/MLflow) and track grounding drift
Reference flow: connectors → text extraction → normalization → Iceberg (ACID) → PII scan/redact → catalog with ACLs → embedding → pgvector+Milvus → hybrid retrieval → guardrails → LLM+tools → HITL gates → OpenTelemetry traces and scheduled evaluations
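
To make the "hybrid retrieval" and "policy filter" steps above concrete, here is a minimal, library-free sketch in Python. It assumes two already-computed ranked lists of chunk IDs (one from BM25, one from ANN search) and fuses them with Reciprocal Rank Fusion (RRF), which is one common fusion method and is not named in the source. The `Chunk` schema, the `acl_tags` field, and all function names are hypothetical. The ACL check runs in-process here only for illustration; in a real deployment it would be pushed into the SQL or vector query, as the key points recommend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    acl_tags: frozenset  # groups allowed to read this chunk (hypothetical schema)


def allowed(chunk, user_groups):
    """Policy filter: the user must share at least one ACL tag with the chunk."""
    return bool(chunk.acl_tags & user_groups)


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by id so ties resolve deterministically.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_retrieve(bm25_ranking, ann_ranking, chunks, user_groups, top_n=3):
    """Drop chunks the user may not read, then fuse the two rankings."""
    def visible(ranking):
        return [cid for cid in ranking if allowed(chunks[cid], user_groups)]

    return rrf_fuse([visible(bm25_ranking), visible(ann_ranking)])[:top_n]


# Toy corpus: four chunks with different ACL tags (illustrative data only).
CHUNKS = {
    "a": Chunk("a", frozenset({"finance"})),
    "b": Chunk("b", frozenset({"finance", "legal"})),
    "c": Chunk("c", frozenset({"hr"})),
    "d": Chunk("d", frozenset({"legal"})),
}

result = hybrid_retrieve(
    bm25_ranking=["c", "a", "b"],
    ann_ranking=["b", "c", "d"],
    chunks=CHUNKS,
    user_groups={"finance"},
)
print(result)  # chunks "c" and "d" are filtered out before fusion
```

A reranker such as the one mentioned above would typically run on the fused list afterwards; it is left out here to keep the sketch self-contained.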

Takeaway

Start with the unglamorous bits: put your data in ACID tables, wire up guardrails, add pgvector/Milvus where they shine, and trace everything with OpenTelemetry. Slip in HITL approvals for low‑confidence calls, then continuously evaluate your RAG so “grounded” means something measurable. And sure, you can swap to the shiny new model next week, once you can actually see what your current one is doing.
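
As a rough illustration of HITL approvals for low-confidence calls, the sketch below routes a draft answer either to automatic approval or to a human review queue and records every decision for audit. It is a plain-Python stand-in, not the API of Amazon A2I or LangGraph; the `Draft` and `ApprovalGate` names, the `confidence` score, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Draft:
    answer: str
    confidence: float  # hypothetical score in [0, 1] from an upstream evaluator


@dataclass
class ApprovalGate:
    threshold: float = 0.8  # assumed cutoff; tune against your own evaluations
    review_queue: List[Draft] = field(default_factory=list)
    audit_log: List[Tuple[str, float, str]] = field(default_factory=list)

    def route(self, draft: Draft) -> str:
        """Auto-approve confident drafts; queue the rest for a human reviewer."""
        if draft.confidence >= self.threshold:
            decision = "auto_approved"
        else:
            decision = "needs_review"
            self.review_queue.append(draft)
        # Persist every decision so approvals can be audited later.
        self.audit_log.append((draft.answer, draft.confidence, decision))
        return decision


gate = ApprovalGate()
first = gate.route(Draft("Q3 revenue summary", confidence=0.95))
second = gate.route(Draft("Contract clause interpretation", confidence=0.40))
print(first, second, len(gate.review_queue))
```

In practice the audit log would be written to durable storage and the queue would be backed by a review tool, but the control flow stays the same.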
