Building AI agents is 5% AI and 100% software engineering

 

Production-grade agents live or die on data plumbing, controls, and observability—not on model choice. The doc-to-chat pipeline below maps the concrete layers and why they matter.

What is a “doc-to-chat” pipeline?

A doc-to-chat pipeline ingests enterprise documents, standardizes them, enforces governance, indexes embeddings alongside relational features, and serves retrieval + generation behind authenticated APIs with human-in-the-loop (HITL) checkpoints. It’s the reference architecture for agentic Q&A, copilots, and workflow automation where answers must respect permissions and be audit-ready. Production implementations are variations of RAG (retrieval-augmented generation) hardened with LLM guardrails, governance, and OpenTelemetry-backed tracing.

How do you integrate cleanly with the existing stack?

Use standard service boundaries (REST/JSON, gRPC) over a storage layer your org already trusts. For tables, Iceberg gives ACID, schema evolution, partition evolution, and snapshots—critical for reproducible retrieval and backfills. For vectors, use a system that coexists with SQL filters: pgvector collocates embeddings with business keys and ACL tags in PostgreSQL; dedicated engines like Milvus handle high-QPS ANN with disaggregated storage/compute. In practice, many teams run both: SQL+pgvector for transactional joins and Milvus for heavy retrieval.

Key properties

  • Iceberg tables: ACID, hidden partitioning, snapshot isolation; vendor support across warehouses (a point-in-time read sketch follows this list).
  • pgvector: SQL + vector similarity in one query plan for precise joins and policy enforcement.
  • Milvus: layered, horizontally scalable architecture for large-scale similarity search.
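
A minimal sketch of the reproducibility the Iceberg layer buys you, assuming PyIceberg with a REST catalog; the catalog URI and the docs.normalized_chunks table name are illustrative stand-ins, not part of any specific deployment:

```python
# Pin an embedding/backfill job to a specific Iceberg snapshot (PyIceberg).
# Catalog name, URI, and table identifier are assumptions for illustration.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("docs.normalized_chunks")

# Inspect available snapshots so a re-index can target a known table version.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms)

# Point-in-time scan: re-embed exactly the rows that existed at that snapshot.
pinned_id = table.snapshots()[0].snapshot_id
rows = table.scan(snapshot_id=pinned_id).to_arrow()
print(rows.num_rows)
```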

How do agents, humans, and workflows coordinate on one “knowledge fabric”?

Production agents require explicit coordination points where humans approve, correct, or escalate. AWS A2I provides managed HITL loops (private workforces, flow definitions) and is a concrete blueprint for gating low-confidence outputs. Frameworks like LangGraph model these human checkpoints inside agent graphs so approvals are first-class steps in the DAG, not ad hoc callbacks. Use them to gate actions like publishing summaries, filing tickets, or committing code.

Pattern: LLM → confidence/guardrail checks → HITL gate → side-effects. Persist every artifact (prompt, retrieval set, decision) for auditability and future re-runs.
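
A minimal, framework-agnostic sketch of that gate; the llm_call, guardrail_check, approval_queue, and audit_store hooks are hypothetical stand-ins for your model client, guardrail layer, A2I/LangGraph approval step, and artifact store:

```python
# Sketch of the LLM -> guardrail -> HITL gate -> side-effect pattern.
# All injected dependencies are hypothetical; swap in your real components.
import json, time, uuid

CONFIDENCE_THRESHOLD = 0.8  # assumed policy threshold

def handle_request(prompt, retrieved_chunks, llm_call, guardrail_check,
                   approval_queue, audit_store):
    answer, confidence = llm_call(prompt, retrieved_chunks)

    artifact = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt": prompt,
        "retrieval_set": [c["id"] for c in retrieved_chunks],
        "answer": answer,
        "confidence": confidence,
    }
    audit_store.append(json.dumps(artifact))   # persist every artifact for audit and re-runs

    if not guardrail_check(answer):            # policy/safety validation before anything else
        artifact["decision"] = "blocked"
        return artifact

    if confidence < CONFIDENCE_THRESHOLD:      # low confidence -> human gate, no side-effects yet
        artifact["decision"] = "pending_human_approval"
        approval_queue.put(artifact)
        return artifact

    artifact["decision"] = "auto_approved"     # high confidence -> side-effects may proceed
    return artifact
```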

How is reliability enforced before anything reaches the model?

Treat reliability as layered defenses:

  1. Language + content guardrails: Pre-validate inputs/outputs for safety and policy. Options span managed (Bedrock Guardrails) and OSS (NeMo Guardrails, Guardrails AI; Llama Guard). Independent comparisons and a position paper catalog the trade-offs.
  2. PII detection/redaction: Run analyzers on both source docs and model I/O. Microsoft Presidio offers recognizers and masking, with explicit caveats to combine it with additional controls (see the redaction sketch after this list).
  3. Access control and lineage: Enforce row-/column-level ACLs and audit across catalogs (Unity Catalog) so retrieval respects permissions; unify lineage and access policies across workspaces.
  4. Retrieval quality gates: Evaluate RAG with reference-free metrics (faithfulness, context precision/recall) using Ragas/related tooling; block or down-rank poor contexts.
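
For the PII layer in item 2, a minimal Presidio sketch; it assumes the default analyzer configuration (and its spaCy model) is installed, and the same pass is run over source documents and over model inputs/outputs:

```python
# Detect and redact PII with Microsoft Presidio before text reaches the model.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact(text: str) -> str:
    findings = analyzer.analyze(text=text, language="en")
    result = anonymizer.anonymize(
        text=text,
        analyzer_results=findings,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
    )
    return result.text

print(redact("Contact Jane Doe at jane.doe@example.com or 555-123-4567."))
```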

How do you scale indexing and retrieval under real traffic?

Two axes matter: ingest throughput and query concurrency.

  • Ingest: Normalize at the lakehouse edge; write to Iceberg for versioned snapshots, then embed asynchronously. This enables deterministic rebuilds and point-in-time re-indexing.
  • Vector serving: Milvus’s shared-storage, disaggregated compute architecture supports horizontal scaling with independent failure domains; use HNSW/IVF/Flat hybrids and replica sets to balance recall/latency.
  • SQL + vector: Keep business joins server-side (pgvector), e.g., WHERE tenant_id = ? AND acl_tag @> ... ORDER BY embedding <-> :q LIMIT k (expanded in the sketch after this list). This avoids N+1 trips and respects policies.
  • Chunking/embedding strategy: Tune chunk size/overlap and semantic boundaries; bad chunking is the silent killer of recall.
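
Expanding the pgvector query from the SQL + vector bullet, a single-round-trip sketch with psycopg; the doc_chunks table, column names, and DSN are illustrative assumptions:

```python
# Policy-aware ANN query: tenant and ACL filters are applied server-side
# before vector ordering, in one round trip. Schema names are illustrative.
import psycopg

def search(dsn, tenant_id, required_acl_tags, query_embedding, k=8):
    sql = """
        SELECT chunk_id, content
        FROM doc_chunks
        WHERE tenant_id = %s
          AND acl_tags @> %s              -- row carries the required policy tag(s); acl_tags text[]
        ORDER BY embedding <-> %s::vector -- L2 distance; use <=> for cosine
        LIMIT %s
    """
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg.connect(dsn) as conn:
        return conn.execute(sql, (tenant_id, required_acl_tags, vec_literal, k)).fetchall()
```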

For structured+unstructured fusion, prefer hybrid retrieval (BM25 + ANN + reranker) and store structured features next to vectors to support filters and re-ranking features at query time.
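
A minimal sketch of the fusion step, using reciprocal rank fusion over a lexical and a vector ranking before an optional cross-encoder rerank; the retrievers are assumed to return document ID lists, best first:

```python
# Fuse BM25 (lexical) and ANN (vector) rankings with reciprocal rank fusion.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked ID lists; k dampens the impact of low ranks."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc7", "doc2", "doc9"]   # from the lexical index (illustrative)
ann_hits = ["doc2", "doc5", "doc7"]    # from the vector index (illustrative)
candidates = reciprocal_rank_fusion([bm25_hits, ann_hits])
# candidates then go to the reranker, alongside structured features stored next to the vectors
print(candidates)
```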

How do you monitor beyond logs?

You need traces, metrics, and evaluations stitched together:

  • Distributed tracing: Emit OpenTelemetry spans across ingestion, retrieval, model calls, and tools; LangSmith natively ingests OTEL traces and interoperates with external APMs (Jaeger, Datadog, Elastic). This gives end-to-end timing, prompts, contexts, and costs per request (a minimal span sketch follows this list).
  • LLM observability platforms: Compare options (LangSmith, Arize Phoenix, LangFuse, Datadog) on tracing, evals, cost tracking, and enterprise readiness. Independent roundups and comparison matrices are available.
  • Continuous evaluation: Schedule RAG evals (Ragas/DeepEval/MLflow) on canary sets and live traffic replays; track faithfulness and grounding drift over time.
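
A minimal OpenTelemetry sketch for the tracing bullet above; the console exporter stands in for an OTLP exporter pointed at LangSmith or your APM, and the span and attribute names are illustrative:

```python
# Emit spans around the retrieval and model-call steps of a request.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for an OTLP exporter
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("doc_to_chat")

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.request") as root:
        root.set_attribute("rag.question_length", len(question))
        with tracer.start_as_current_span("rag.retrieval") as span:
            contexts = ["..."]                      # placeholder retrieval
            span.set_attribute("rag.num_contexts", len(contexts))
        with tracer.start_as_current_span("rag.llm_call") as span:
            completion = "..."                      # placeholder model call
            span.set_attribute("rag.completion_length", len(completion))
        return completion
```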

Add schema profiling/mapping on ingestion to keep observability attached to data shape changes (e.g., new templates, table evolution) and to explain retrieval regressions when upstream sources shift.

Example: doc-to-chat reference flow (signals and gates)

  1. Ingest: connectors → text extraction → normalization → Iceberg write (ACID, snapshots).
  2. Govern: PII scan (Presidio) → redact/mask → catalog registration with ACL policies.
  3. Index: embedding jobs → pgvector (policy-aware joins) and Milvus (high-QPS ANN).
  4. Serve: REST/gRPC → hybrid retrieval → guardrails → LLM → tool use.
  5. HITL: low-confidence paths route to A2I/LangGraph approval steps.
  6. Observe: OTEL traces to LangSmith/APM + scheduled RAG evaluations.
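
A sketch of step 6's scheduled evaluation using Ragas' classic evaluate() entry point (newer releases wrap the dataset differently); the canary row and column names follow Ragas conventions, and an LLM/embeddings backend is assumed to be configured via environment variables:

```python
# Scheduled RAG evaluation on a small canary set with Ragas.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, context_recall

canary = Dataset.from_dict({
    "question": ["What does the refund policy cover?"],
    "answer": ["Refunds cover unused licenses within 30 days."],
    "contexts": [["Refunds are available for unused licenses within 30 days of purchase."]],
    "ground_truth": ["Unused licenses are refundable within 30 days."],
})

report = evaluate(canary, metrics=[faithfulness, context_precision, context_recall])
print(report)  # track these scores over time to catch grounding drift
```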

Why is “5% AI, 100% software engineering” accurate in practice?

Most outages and trust failures in agent systems are not model regressions; they’re data quality, permissioning, retrieval decay, or missing telemetry. The controls above—ACID tables, ACL catalogs, PII guardrails, hybrid retrieval, OTEL traces, and human gates—determine whether the same base model is safe, fast, and credibly correct for your users. Invest in these first; swap models later if needed.



Meta AI Researchers Release MapAnything: An End-to-End Transformer Architecture that Directly Regresses Factored, Metric 3D Scene Geometry

 

A team of researchers from Meta Reality Labs and Carnegie Mellon University has introduced MapAnything, an end-to-end transformer architecture that directly regresses factored metric 3D scene geometry from images and optional sensor inputs. Released under Apache 2.0 with full training and benchmarking code, MapAnything advances beyond specialist pipelines by supporting over 12 distinct 3D vision tasks in a single feed-forward pass.

Paper: https://map-anything.github.io/assets/MapAnything.pdf

Why a Universal Model for 3D Reconstruction?

Image-based 3D reconstruction has historically relied on fragmented pipelines: feature detection, two-view pose estimation, bundle adjustment, multi-view stereo, or monocular depth inference. While effective, these modular solutions require task-specific tuning, optimization, and heavy post-processing.

Recent transformer-based feed-forward models such as DUSt3R, MASt3R, and VGGT simplified parts of this pipeline but remained limited: fixed numbers of views, rigid camera assumptions, or reliance on coupled representations that needed expensive optimization.

MapAnything overcomes these constraints by:

  • Accepting up to 2,000 input images in a single inference run.
  • Flexibly using auxiliary data such as camera intrinsics, poses, and depth maps.
  • Producing direct metric 3D reconstructions without bundle adjustment.

The model’s factored scene representation—composed of ray maps, depth, poses, and a global scale factor—provides modularity and generality unmatched by prior approaches.

Architecture and Representation

At its core, MapAnything employs a multi-view alternating-attention transformer. Each input image is encoded with DINOv2 ViT-L features, while optional inputs (rays, depth, poses) are encoded into the same latent space via shallow CNNs or MLPs. A learnable scale token enables metric normalization across views.

The network outputs a factored representation:

  • Per-view ray directions (camera calibration).
  • Depth along rays, predicted up-to-scale.
  • Camera poses relative to a reference view.
  • A single metric scale factor converting local reconstructions into a globally consistent frame.

This explicit factorization avoids redundancy, allowing the same model to handle monocular depth estimation, multi-view stereo, structure-from-motion (SfM), or depth completion without specialized heads.
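
To make the factorization concrete, the sketch below composes per-view rays, up-to-scale depth, a relative pose, and the global metric scale into reference-frame points. It illustrates the representation described above; the shapes and function names are illustrative, not the repository's API:

```python
# Compose factored outputs (rays, depth, pose, scale) into metric points.
import numpy as np

def compose_metric_points(ray_dirs, depth, R, t, metric_scale):
    """
    ray_dirs:     (H, W, 3) unit ray directions in the camera frame
    depth:        (H, W)    up-to-scale depth along each ray
    R, t:         (3, 3), (3,) pose of this view relative to the reference view
    metric_scale: scalar converting the up-to-scale reconstruction to metres
    """
    points_cam = depth[..., None] * ray_dirs   # back-project along rays
    points_ref = points_cam @ R.T + t          # move into the reference frame
    return metric_scale * points_ref           # apply the global metric scale

# Toy check: unit depth, identity pose, scale 2 puts points two units along the rays.
rays = np.tile(np.array([0.0, 0.0, 1.0]), (2, 2, 1))
pts = compose_metric_points(rays, np.ones((2, 2)), np.eye(3), np.zeros(3), metric_scale=2.0)
print(pts[0, 0])  # -> [0. 0. 2.]
```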


Training Strategy

MapAnything was trained across 13 diverse datasets spanning indoor, outdoor, and synthetic domains, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++, and TartanAirV2. Two variants are released:

  • Apache 2.0 licensed model trained on six datasets.
  • CC BY-NC model trained on all thirteen datasets for stronger performance.

Key training strategies include:

  • Probabilistic input dropout: During training, geometric inputs (rays, depth, pose) are provided with varying probabilities, enabling robustness across heterogeneous configurations.
  • Covisibility-based sampling: Ensures input views have meaningful overlap, supporting reconstruction up to 100+ views.
  • Factored losses in log-space: Depth, scale, and pose are optimized using scale-invariant and robust regression losses to improve stability.
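
The paper's exact losses are not reproduced here; as an illustration of regressing depth in log-space with scale invariance, a standard scale-invariant log-depth loss looks like the following (a stand-in, not the authors' implementation):

```python
# Scale-invariant regression loss in log-depth space (illustrative stand-in).
import numpy as np

def scale_invariant_log_loss(pred_depth, gt_depth, eps=1e-6):
    valid = gt_depth > 0                                   # ignore invalid/missing depth
    d = np.log(pred_depth[valid] + eps) - np.log(gt_depth[valid] + eps)
    return np.mean(d ** 2) - np.mean(d) ** 2               # invariant to a global scale shift

pred = np.array([[1.0, 2.0], [4.0, 8.0]])
gt = 2.0 * pred                                            # same geometry, different scale
print(scale_invariant_log_loss(pred, gt))                  # ~0: the scale difference is ignored
```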

Training was performed on 64 H200 GPUs with mixed precision, gradient checkpointing, and curriculum scheduling, scaling from 4 to 24 input views.

Benchmarking Results

Multi-View Dense Reconstruction

On ETH3D, ScanNet++ v2, and TartanAirV2-WB, MapAnything achieves state-of-the-art (SoTA) performance across pointmaps, depth, pose, and ray estimation. It surpasses baselines like VGGT and Pow3R even when limited to images only, and improves further with calibration or pose priors.

For example:

  • Pointmap relative error (rel) improves to 0.16 with only images, compared to 0.20 for VGGT.
  • With images + intrinsics + poses + depth, the error drops to 0.01, while achieving >90% inlier ratios.

Two-View Reconstruction

Against DUSt3R, MASt3R, and Pow3R, MapAnything consistently outperforms across scale, depth, and pose accuracy. Notably, with additional priors, it achieves >92% inlier ratios on two-view tasks, significantly beyond prior feed-forward models.

Single-View Calibration

Despite not being trained specifically for single-image calibration, MapAnything achieves an average angular error of 1.18°, outperforming AnyCalib (2.01°) and MoGe-2 (1.95°).

Depth Estimation

On the Robust-MVD benchmark:

  • MapAnything sets new SoTA for multi-view metric depth estimation.
  • With auxiliary inputs, its error rates rival or surpass specialized depth models such as MVSA and Metric3D v2.

Overall, the benchmarks confirm a 2× improvement over prior SoTA methods on many tasks, validating the benefits of unified training.

Key Contributions

The research team highlights four major contributions:

  1. Unified Feed-Forward Model capable of handling more than 12 problem settings, from monocular depth to SfM and stereo.
  2. Factored Scene Representation enabling explicit separation of rays, depth, pose, and metric scale.
  3. State-of-the-Art Performance across diverse benchmarks with fewer redundancies and higher scalability.
  4. Open-Source Release including data processing, training scripts, benchmarks, and pretrained weights under Apache 2.0.

Conclusion

MapAnything establishes a new benchmark in 3D vision by unifying multiple reconstruction tasks—SfM, stereo, depth estimation, and calibration—under a single transformer model with a factored scene representation. It not only outperforms specialist methods across benchmarks but also adapts seamlessly to heterogeneous inputs, including intrinsics, poses, and depth. With open-source code, pretrained models, and support for over 12 tasks, MapAnything lays the groundwork for a truly general-purpose 3D reconstruction backbone.



Google AI Introduces Agent Payments Protocol (AP2): An Open Protocol for Interoperable AI Agent Checkout Across Merchants and Wallets

 

Your shopping agent auto-purchases a $499 Pro plan instead of the $49 Basic tier—who’s on the hook: the user, the agent’s developer, or the merchant? This trust gap is a primary blocker for agent-led checkout on today’s payment rails. Google’s Agent Payments Protocol (AP2) addresses it with an open, interoperable specification for agent-initiated payments, defining a cryptographically verifiable common language so any compliant agent can transact with any compliant merchant globally.

Google’s Agent Payments Protocol (AP2) is an open, vendor-neutral specification for executing payments initiated by AI agents with cryptographic, auditable proof of user intent. AP2 extends existing open protocols—Agent2Agent (A2A) and Model Context Protocol (MCP)—to define how agents, merchants, and payment processors exchange verifiable evidence across the “intent → cart → payment” pipeline. The goal is to close the trust gap in agent-led commerce without fragmenting the payments ecosystem.

https://github.com/google-agentic-commerce/AP2

Why do agents need a payments protocol?

Today’s rails assume a human is the one clicking “buy” on a trusted surface. When an autonomous or semi-autonomous agent initiates checkout, merchants and issuers face three unresolved questions: (1) was the user’s authority truly delegated (authorization), (2) does the request reflect what the user meant and approved (authenticity), and (3) who is responsible if something goes wrong (accountability). AP2 formalizes the data, cryptography, and messaging to answer those questions consistently across providers and payment types.

How does AP2 establish trust?

AP2 uses Verifiable Credentials (VCs)—tamper-evident, cryptographically signed digital objects—to carry evidence through a transaction. The protocol standardizes three mandate types:

  • Intent Mandate (human-not-present): captures the constraints under which an agent may transact (e.g., brand/category, price caps, timing windows), signed by the user.
  • Cart Mandate (human-present): binds the user’s explicit approval to a merchant-signed cart (items, amounts, currency), producing non-repudiable proof of “what you saw is what you paid.”
  • Payment Mandate: conveys to networks/issuers that an AI agent was involved, including modality (human-present vs not present) and risk-relevant context.

These VCs form an audit trail that unambiguously links user authorization to the final charge request.
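
As a purely hypothetical illustration of how a signed mandate binds user authorization to a specific cart (AP2's real types live under src/ap2/types and use verifiable-credential signatures, not a shared demo key, so every name below is illustrative only):

```python
# Hypothetical signed-cart mandate: tampering with items or amount breaks verification.
import hashlib, hmac, json
from dataclasses import asdict, dataclass, field

SHARED_DEMO_KEY = b"demo-key"  # stand-in for real verifiable-credential signing keys

@dataclass
class CartMandate:
    user_id: str
    merchant_id: str
    items: list
    total_minor_units: int        # e.g. cents
    currency: str
    signature: str = field(default="")

    def payload(self) -> bytes:
        body = {k: v for k, v in asdict(self).items() if k != "signature"}
        return json.dumps(body, sort_keys=True).encode()

    def sign(self) -> "CartMandate":
        self.signature = hmac.new(SHARED_DEMO_KEY, self.payload(), hashlib.sha256).hexdigest()
        return self

    def verify(self) -> bool:
        expected = hmac.new(SHARED_DEMO_KEY, self.payload(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, self.signature)

cart = CartMandate("user-1", "merchant-9", [{"sku": "basic-plan", "qty": 1}], 4900, "USD").sign()
print(cart.verify())  # True; any change to the cart invalidates the signature
```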

What are the core roles and trust boundaries?

AP2 defines a role-based architecture to separate concerns and minimize data exposure:

  • User delegates a task to an agent.
  • User/Shopping Agent (the interface the user interacts with) interprets the task, negotiates carts, and collects approvals.
  • Credentials Provider (e.g., wallet) holds payment methods and issues method-specific artifacts.
  • Merchant Endpoint exposes catalog/quoting and signs carts.
  • Merchant Payment Processor constructs the network authorization object.
  • Network & Issuer evaluate and authorize the payment.

Human-present vs human-not-present: what changes on the wire?

AP2 defines clear, testable flows:

  • Human-present: the merchant signs a final cart; the user approves it in a trusted UI, generating a signed Cart Mandate. The processor submits the network authorization alongside the Payment Mandate. If needed, step-up (e.g., 3DS) occurs on a trusted surface.
  • Human-not-present: the user pre-authorizes an Intent Mandate (e.g., “buy when price < $100”); the agent later converts it to a Cart Mandate when conditions are satisfied, or the merchant can force re-confirmation.
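
A hypothetical sketch of the human-not-present check: the agent converts an Intent Mandate into a cart only when the offer satisfies the signed constraints. Field names are illustrative, not AP2's schema:

```python
# Convert a pre-authorized intent into a cart only when its constraints hold.
import time

def intent_allows(intent: dict, offer: dict, now: float | None = None) -> bool:
    now = now or time.time()
    return (
        offer["sku"] in intent["allowed_skus"]
        and offer["price_minor_units"] <= intent["max_price_minor_units"]
        and intent["valid_from"] <= now <= intent["valid_until"]
    )

intent = {
    "allowed_skus": {"basic-plan"},
    "max_price_minor_units": 10_000,   # "buy when price < $100", expressed in cents
    "valid_from": 0,
    "valid_until": 4_102_444_800,      # far-future expiry for the example
}
offer = {"sku": "basic-plan", "price_minor_units": 4_900}

if intent_allows(intent, offer):
    print("constraints satisfied: build merchant-signed cart and proceed to authorization")
else:
    print("constraints not satisfied: re-confirm with the user")
```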

How does AP2 compose with A2A and MCP?

AP2 is specified as an extension to A2A (for inter-agent messaging) and interoperates with MCP (for tool access) so developers can reuse established capabilities for discovery, negotiation, and execution. AP2 specializes the payments layer—standardizing mandate objects, signatures, and accountability signals—while leaving collaboration and tool invocation to A2A/MCP.

Which payment methods are in scope?

The protocol is payment-method agnostic. The initial focus covers common pull-based instruments (credit/debit cards), with roadmap support for real-time push transfers (e.g., UPI, PIX) and digital assets. For the web3 path, Google and partners have released an A2A x402 extension to operationalize agent-initiated crypto payments, aligning x402 with AP2’s mandate constructs.

What does this look like for developers?

Google has published a public repository (Apache-2.0) with reference documentation, Python types, and runnable samples:

  • Samples demonstrate human-present card flows, an x402 variant, and Android digital payment credentials, showing how to issue/verify mandates and move from agent negotiation to network authorization.
  • Types package: core protocol objects are available under src/ap2/types for integration.
  • Framework choice: while samples use Google’s ADK and Gemini 2.5 Flash, AP2 is framework-agnostic; any agent stack can generate/verify mandates and speak the protocol.

How does AP2 address privacy and security?

AP2’s role separation ensures sensitive data (e.g., PANs, tokens) remains with the Credentials Provider and never needs to flow through general-purpose agent surfaces. Mandates are signed with verifiable identities and can embed risk signals without exposing full credentials to counterparties. This aligns with existing controls (e.g., step-up authentication) and provides networks with explicit markers of agent involvement to support risk and dispute logic.

What about ecosystem readiness?

Google cites collaboration with 60+ organizations, spanning networks, issuers, gateways, and technology vendors (e.g., American Express, Mastercard, PayPal, Coinbase, Intuit, ServiceNow, UnionPay International, Worldpay, Adyen). The objective is to avoid one-off integrations by aligning on common mandate semantics and accountability signals across platforms.

Implementation notes and edge cases

  • Determinism over inference: merchants receive cryptographic evidence of what the user approved (cart) or pre-authorized (intent), rather than model-generated summaries.
  • Disputes: the credential chain functions as evidentiary material for networks/issuers; accountability can be assigned based on which mandate was signed and by whom.
  • Challenges: the issuer or merchant can trigger step-up; AP2 requires challenges to be completed on trusted surfaces and linked to the mandate trail.
  • Multiple agents: when more than one agent participates (e.g., travel metasearch + airline + hotel), A2A coordinates tasks; AP2 ensures each cart is merchant-signed and user-authorized before payment submission.

What comes next?

The AP2 team plans to evolve the spec in the open and continue adding reference implementations, including deeper integrations across networks and web3, and alignment with standards bodies for VC formats and identity primitives. Developers can start today by running the sample scenarios, integrating mandate types, and validating flows against their agent/merchant stacks.

Summary

AP2 gives the agent ecosystem a concrete, cryptographically grounded way to prove user authorization, bind it to merchant-signed carts, and present issuers with an auditable record—without locking developers into a single stack or payment method. If agents are going to buy things on our behalf, this is the kind of evidence trail the payments system needs.

