IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model


IBM has released Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction—tables, code, equations, lists, captions, and reading order—emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and MLX build for Apple Silicon.

What’s new compared to SmolDocling?

Granite-Docling is the product-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512) while retaining the Idefics3-style connector (pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).

Architecture and training pipeline

  • Backbone: Idefics3-derived stack with SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM.
  • Training framework: nanoVLM (lightweight, pure-PyTorch VLM training toolkit).
  • Representation: Outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.
  • Compute: Trained on IBM’s Blue Vela H100 cluster.

Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)

Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:

  • Layout: MAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.
  • Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.
  • Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.
  • Equation recognition: F1 0.968 vs. 0.947.
  • Table recognition (FinTabNet @150dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.
  • Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.
  • Stability: “Avoids infinite loops more effectively” (production-oriented fix).

Multilingual support

Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.

How the DocTags pathway changes Document AI

Conventional OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling emits DocTags—a compact, LLM-friendly structural grammar—which Docling converts into Markdown/HTML/JSON. This preserves table topology, inline/floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.
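To make this concrete, here is a minimal sketch of driving the conversion through the Docling SDK, assuming the docling Python package is installed; the exact options for pinning the VLM pipeline to Granite-Docling may vary across Docling versions, so treat the snippet as illustrative rather than canonical.

from docling.document_converter import DocumentConverter  # pip install docling

# Default pipeline; swap in the VLM/Granite-Docling pipeline options if your
# installed Docling version exposes them explicitly.
converter = DocumentConverter()
result = converter.convert("report.pdf")   # hypothetical input document

doc = result.document                      # structured DoclingDocument built from DocTags
print(doc.export_to_markdown())            # human-readable export
# HTML/JSON exports preserve tables, reading order, and coordinates for RAG indexing.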

Inference and integration

  • Docling Integration (recommended): The docling CLI/SDK automatically pulls Granite-Docling and converts PDFs/office docs/images to multiple formats. IBM positions the model as a component inside Docling pipelines rather than a general VLM.
  • Runtimes: Works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).
  • License: Apache-2.0.

Why Granite-Docling?

For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces multiple single-purpose models (layout, OCR, table, code, equations) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains—in TEDS for tables, F1 for code/equations, and reduced instability—make it a practical upgrade from SmolDocling for production workflows.


Summary

Granite-Docling-258M marks a significant advancement in compact, structure-preserving document AI. By combining IBM’s Granite backbone, SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text—all while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document conversion and RAG workflows where precision and reliability are critical.



Meta AI Researchers Release MapAnything: An End-to-End Transformer Architecture that Directly Regresses Factored, Metric 3D Scene Geometry


A team of researchers from Meta Reality Labs and Carnegie Mellon University has introduced MapAnything, an end-to-end transformer architecture that directly regresses factored metric 3D scene geometry from images and optional sensor inputs. Released under Apache 2.0 with full training and benchmarking code, MapAnything advances beyond specialist pipelines by supporting over 12 distinct 3D vision tasks in a single feed-forward pass.

https://map-anything.github.io/assets/MapAnything.pdf

Why a Universal Model for 3D Reconstruction?

Image-based 3D reconstruction has historically relied on fragmented pipelines: feature detection, two-view pose estimation, bundle adjustment, multi-view stereo, or monocular depth inference. While effective, these modular solutions require task-specific tuning, optimization, and heavy post-processing.

Recent transformer-based feed-forward models such as DUSt3R, MASt3R, and VGGT simplified parts of this pipeline but remained limited: fixed numbers of views, rigid camera assumptions, or reliance on coupled representations that needed expensive optimization.

MapAnything overcomes these constraints by:

  • Accepting up to 2,000 input images in a single inference run.
  • Flexibly using auxiliary data such as camera intrinsics, poses, and depth maps.
  • Producing direct metric 3D reconstructions without bundle adjustment.

The model’s factored scene representation—composed of ray maps, depth, poses, and a global scale factor—provides modularity and generality unmatched by prior approaches.

Architecture and Representation

At its core, MapAnything employs a multi-view alternating-attention transformer. Each input image is encoded with DINOv2 ViT-L features, while optional inputs (rays, depth, poses) are encoded into the same latent space via shallow CNNs or MLPs. A learnable scale token enables metric normalization across views.

The network outputs a factored representation:

  • Per-view ray directions (camera calibration).
  • Depth along rays, predicted up-to-scale.
  • Camera poses relative to a reference view.
  • A single metric scale factor converting local reconstructions into a globally consistent frame.

This explicit factorization avoids redundancy, allowing the same model to handle monocular depth estimation, multi-view stereo, structure-from-motion (SfM), or depth completion without specialized heads.
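To make the factorization concrete, the following sketch (illustrative only, not the released MapAnything API; names and shapes are assumptions) shows how per-view rays, up-to-scale depth, camera poses, and the global metric scale compose into world-space metric points:

import numpy as np

def compose_metric_points(ray_dirs, depth, R, t, metric_scale):
    # ray_dirs: (H, W, 3) unit ray directions in the camera frame
    # depth:    (H, W) up-to-scale depth along each ray
    # R, t:     camera-to-world rotation (3, 3) and translation (3,)
    # metric_scale: single global scalar; applied to depth here for simplicity
    pts_cam = ray_dirs * (metric_scale * depth)[..., None]  # camera-frame points
    return pts_cam @ R.T + t                                # rigid transform to world frame

# Toy usage with random rays
H, W = 4, 5
rays = np.random.randn(H, W, 3)
rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
pts = compose_metric_points(rays, np.ones((H, W)), np.eye(3), np.zeros(3), metric_scale=2.0)
print(pts.shape)  # (4, 5, 3)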


Training Strategy

MapAnything was trained across 13 diverse datasets spanning indoor, outdoor, and synthetic domains, including BlendedMVS, Mapillary Planet-Scale Depth, ScanNet++, and TartanAirV2. Two variants are released:

  • Apache 2.0 licensed model trained on six datasets.
  • CC BY-NC model trained on all thirteen datasets for stronger performance.

Key training strategies include:

  • Probabilistic input dropout: During training, geometric inputs (rays, depth, pose) are provided with varying probabilities, enabling robustness across heterogeneous configurations.
  • Covisibility-based sampling: Ensures input views have meaningful overlap, supporting reconstruction up to 100+ views.
  • Factored losses in log-space: Depth, scale, and pose are optimized using scale-invariant and robust regression losses to improve stability.

Training was performed on 64 H200 GPUs with mixed precision, gradient checkpointing, and curriculum scheduling, scaling from 4 to 24 input views.
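As a rough illustration of the log-space, scale-invariant losses mentioned above (a simplified sketch, not the authors' implementation, which factors losses over depth, scale, and pose):

import torch

def scale_invariant_log_depth_loss(pred_depth, gt_depth, eps=1e-6):
    # Robust L1 regression in log-space after removing the global log-scale offset,
    # so predictions that differ from ground truth only by a scale factor incur ~zero loss.
    valid = gt_depth > 0
    diff = torch.log(pred_depth[valid] + eps) - torch.log(gt_depth[valid] + eps)
    diff = diff - diff.median()
    return diff.abs().mean()

pred = torch.rand(2, 128, 128) + 0.5
gt = 3.0 * pred            # same structure as pred, different global scale
print(scale_invariant_log_depth_loss(pred, gt).item())  # ~0: only the scale differs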

Benchmarking Results

Multi-View Dense Reconstruction

On ETH3D, ScanNet++ v2, and TartanAirV2-WB, MapAnything achieves state-of-the-art (SoTA) performance across pointmaps, depth, pose, and ray estimation. It surpasses baselines like VGGT and Pow3R even when limited to images only, and improves further with calibration or pose priors.

For example:

  • Pointmap relative error (rel) improves to 0.16 with only images, compared to 0.20 for VGGT.
  • With images + intrinsics + poses + depth, the error drops to 0.01, while achieving >90% inlier ratios.

Two-View Reconstruction

Against DUSt3R, MASt3R, and Pow3R, MapAnything consistently outperforms across scale, depth, and pose accuracy. Notably, with additional priors, it achieves >92% inlier ratios on two-view tasks, significantly beyond prior feed-forward models.

Single-View Calibration

Despite not being trained specifically for single-image calibration, MapAnything achieves an average angular error of 1.18°, outperforming AnyCalib (2.01°) and MoGe-2 (1.95°).

Depth Estimation

On the Robust-MVD benchmark:

  • MapAnything sets new SoTA for multi-view metric depth estimation.
  • With auxiliary inputs, its error rates rival or surpass specialized depth models such as MVSA and Metric3D v2.

Overall, the benchmarks confirm a 2× improvement over prior SoTA methods on many tasks, validating the benefits of unified training.

Key Contributions

The research team highlights four major contributions:

  1. Unified Feed-Forward Model capable of handling more than 12 problem settings, from monocular depth to SfM and stereo.
  2. Factored Scene Representation enabling explicit separation of rays, depth, pose, and metric scale.
  3. State-of-the-Art Performance across diverse benchmarks with fewer redundancies and higher scalability.
  4. Open-Source Release including data processing, training scripts, benchmarks, and pretrained weights under Apache 2.0.

Conclusion

MapAnything establishes a new benchmark in 3D vision by unifying multiple reconstruction tasks—SfM, stereo, depth estimation, and calibration—under a single transformer model with a factored scene representation. It not only outperforms specialist methods across benchmarks but also adapts seamlessly to heterogeneous inputs, including intrinsics, poses, and depth. With open-source code, pretrained models, and support for over 12 tasks, MapAnything lays the groundwork for a truly general-purpose 3D reconstruction backbone.




How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face Pipelines?


In this tutorial, we build an advanced voice AI agent using Hugging Face’s freely available models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We combine Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing this, we avoid heavy dependencies, API keys, or complicated setups, and we focus on showing how we can turn voice input into meaningful conversation and get back natural-sounding voice responses in real time.

!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile


import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM


DEVICE = 0 if torch.cuda.is_available() else -1


asr = pipeline(
   "automatic-speech-recognition",
   model="openai/whisper-small.en",
   device=DEVICE,
   chunk_length_s=30,
   return_timestamps=False
)


LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")


tts = pipeline("text-to-speech", model="suno/bark-small")

We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, FLAN-T5 for generating responses, and Bark for text-to-speech. We set the device automatically so that we can use GPU if available.

SYSTEM_PROMPT = (
   "You are a helpful, concise voice assistant. "
   "Prefer direct, structured answers. "
   "If the user asks for steps or code, use short bullet points."
)


def format_dialog(history, user_text):
   turns = []
   for u, a in history:
       if u: turns.append(f"User: {u}")
       if a: turns.append(f"Assistant: {a}")
   turns.append(f"User: {user_text}")
   prompt = (
       "Instruction:\n"
       f"{SYSTEM_PROMPT}\n\n"
       "Dialog so far:\n" + "\n".join(turns) + "\n\n"
       "Assistant:"
   )
   return prompt

We define a system prompt that guides our agent to stay concise and structured, and we implement a format_dialog function that takes past conversation history along with the user input and builds a prompt string for the model to generate the assistant’s reply.

def transcribe(filepath):
   out = asr(filepath)
   text = out["text"].strip()
   return text


def generate_reply(history, user_text, max_new_tokens=256):
   prompt = format_dialog(history, user_text)
   inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
   with torch.no_grad():
       ids = llm.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           temperature=0.7,
           do_sample=True,
           top_p=0.9,
           repetition_penalty=1.05,
       )
   reply = tok.decode(ids[0], skip_special_tokens=True).strip()
   return reply


def synthesize_speech(text):
    out = tts(text)
    audio = out["audio"]
    sr = out["sampling_rate"]
    # Bark may return audio shaped (1, n); flatten to 1-D so Gradio can play it
    audio = np.asarray(audio, dtype=np.float32).squeeze()
    return (sr, audio)

We create three core functions for our voice agent: transcribe converts recorded audio into text using Whisper, generate_reply builds a context-aware response from FLAN-T5, and synthesize_speech turns that response back into spoken audio with Bark.

def clear_history():
   return [], []


def voice_to_voice(mic_file, history):
   history = history or []
   if not mic_file:
       return history, None, "Please record something!"
   try:
       user_text = transcribe(mic_file)
   except Exception as e:
       return history, None, f"ASR error: {e}"


   if not user_text:
       return history, None, "Didn't catch that. Try again?"


   try:
       reply = generate_reply(history, user_text)
   except Exception as e:
       return history, None, f"LLM error: {e}"


   try:
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history + [(user_text, reply)], None, f"TTS error: {e}"


   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def text_to_voice(user_text, history):
   history = history or []
   user_text = (user_text or "").strip()
   if not user_text:
       return history, None, "Type a message first."
   try:
       reply = generate_reply(history, user_text)
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history, None, f"Error: {e}"
   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def export_chat(history):
   lines = []
   for u, a in history or []:
       lines += [f"User: {u}", f"Assistant: {a}", ""]
   text = "\n".join(lines).strip() or "No conversation yet."
   with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
       f.write(text)
       path = f.name
   return path

We add interactive functions for our agent: clear_history resets the conversation, voice_to_voice handles speech input and returns a spoken reply, text_to_voice processes typed input and speaks back, and export_chat saves the entire dialog into a downloadable text file.

with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
   gr.Markdown(
       "## 🎙 Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
       "- **ASR**: openai/whisper-small.en\n"
       "- **LLM**: google/flan-t5-base\n"
       "- **TTS**: suno/bark-small\n"
       "Speak or type; the agent replies with voice + text."
   )


   with gr.Row():
       with gr.Column(scale=1):
           mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
           say_btn = gr.Button("🎤 Speak")
           text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything…")
           text_btn = gr.Button("💬 Send")
           export_btn = gr.Button("⬇ Export Chat (.txt)")
           reset_btn = gr.Button("♻ Reset")
       with gr.Column(scale=1):
           audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
           transcript = gr.Textbox(label="Transcript", lines=6)
           chat = gr.Chatbot(height=360)
   state = gr.State([])


   def update_chat(history):
       return [(u, a) for u, a in (history or [])]


   say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   reset_btn.click(clear_history, None, [chat, state])
   export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))


demo.launch(debug=False)

We build a clean Gradio UI that lets us speak or type and then hear the agent’s response. We wire buttons to our callbacks, maintain chat state, and stream results into a chatbot, transcript, and audio player, all launched in one Colab app.

In conclusion, we see how seamlessly Hugging Face pipelines enable us to create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates intelligent responses, and returns speech output, all inside Colab. With this foundation, we can experiment with larger models, add multilingual support, or even extend the system with custom logic. Still, the core idea remains the same: we can bring together ASR, LLM, and TTS into one smooth workflow for an interactive voice AI experience.




Ai2 Researchers are Changing the Benchmarking Game by Introducing Fluid Benchmarking that Enhances Evaluation along Several Dimensions


A team of researchers from the Allen Institute for Artificial Intelligence (Ai2), the University of Washington, and CMU introduces Fluid Benchmarking, an adaptive LLM evaluation method that replaces static accuracy with two-parameter IRT ability estimation and Fisher-information-driven item selection. By asking only the questions that are most informative at a model’s current ability, it yields smoother training curves, delays benchmark saturation, improves external validity at small budgets, and filters out mislabeled items.

Fluid Benchmarking replaces static accuracy with an adaptive, psychometrics-grounded procedure. A two-parameter logistic IRT model maps responses to a latent ability score and selects each next item by maximizing Fisher information at the model’s current ability estimate. Across six popular benchmarks and multiple model checkpoints, it improves validity (smaller rank distance), reduces variance (lower normalized total variation), delays saturation (more monotonic training curves), and encounters roughly 100× fewer mislabeled items than random sampling at an equal budget.

What problem does Fluid Benchmarking solve?

Static subsets and plain accuracy conflate item quality and item difficulty, inflate step-to-step variance, and hit benchmark saturation early (training curves flatten while the model still improves). Fluid Benchmarking reframes both aggregation and selection: score in a latent ability space and adapt the item subset to the current ability, rather than treating all items equally or fixing them a priori.

How does it work?

1) Ability, not accuracy

Fit a two-parameter logistic (2PL) IRT model on historical LM responses: for item $j$ with discrimination $a_j$ and difficulty $b_j$, the probability that a model with ability $\theta_i$ answers correctly is

$$p(u_{ij} = 1) = \mathrm{logistic}\big(a_j(\theta_i - b_j)\big)$$

At evaluation time, estimate the MAP ability $\hat{\theta}_i$ for the candidate LM by maximizing the 2PL likelihood over its observed right/wrong responses on the administered items. Items are weighted by their discrimination and difficulty, unlike plain accuracy, which weights all items equally.

2) Dynamic item selection via Fisher information

At each step $t$, select the next item $q_j$ that maximizes the Fisher information at the current ability estimate $\hat{\theta}^{(t)}$:

$$I(\theta_i, a_j, b_j) = a_j^2\,\mathrm{logistic}\big(a_j(\theta_i - b_j)\big)\Big(1 - \mathrm{logistic}\big(a_j(\theta_i - b_j)\big)\Big)$$

High-information items minimize the variance of the ability estimate. As training progresses, the most informative items shift from easy to hard, so the administered subset evolves with model capability.
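For intuition, here is a compact sketch of the two steps above: MAP ability estimation under the 2PL model (by grid search) followed by Fisher-information item selection. It assumes item parameters have already been fit on historical responses; it illustrates the procedure rather than reproducing Ai2’s implementation.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def map_ability(responses, a, b, prior_sd=1.0, grid=np.linspace(-4, 4, 801)):
    # responses: 0/1 outcomes on administered items; a, b: their 2PL parameters
    p = logistic(a[None, :] * (grid[:, None] - b[None, :]))
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    logpost = loglik - 0.5 * (grid / prior_sd) ** 2   # Gaussian prior on ability -> MAP
    return grid[np.argmax(logpost)]

def next_item(theta_hat, a, b, administered):
    p = logistic(a * (theta_hat - b))
    info = a ** 2 * p * (1 - p)                       # Fisher information per item
    info[list(administered)] = -np.inf                # never re-administer an item
    return int(np.argmax(info))

# Toy session: 200 synthetic items, 10 adaptively chosen questions
rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 2.0, 200), rng.normal(0.0, 1.0, 200)
true_theta, asked, answers = 0.7, [0], []
for _ in range(10):
    j = asked[-1]
    answers.append(int(rng.random() < logistic(a[j] * (true_theta - b[j]))))
    theta_hat = map_ability(np.array(answers), a[asked], b[asked])
    asked.append(next_item(theta_hat, a, b, set(asked)))
# The estimate's standard error (≈ 1/sqrt(total information)) can drive dynamic stopping.
print(f"estimated ability: {theta_hat:.2f}")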

What does “better evaluation” mean here?

Fluid evaluates four dimensions with concrete metrics:

  • Validity: external agreement with “true” model ranking; measured by mean rank distance (lower is better).
  • Variance: normalized total variation of the training curve across checkpoints (lower is better).
  • Saturation: monotonicity (Spearman rank correlation between checkpoint index and predicted performance; higher is better).
  • Efficiency: quality at small item budgets.

How strong are the results?

Across six benchmarks (e.g., ARC-C, GSM8K, HellaSwag, MMLU, TruthfulQA, WinoGrande) and six LMs with 61–94 checkpoints each:

  • Validity: On the smallest subset (AP-10), mean rank distance drops from 20.0 → 10.1; on AP-50, 15.2 → 8.8.
  • Variance: Total variation shrinks markedly; e.g., 28.3 → 10.7 (AP-10) and 19.1 → 6.5 (AP-50).
  • Saturation: Monotonicity improves from 0.48 → 0.76 (AP-10) and 0.62 → 0.86 (AP-50).
  • Small-budget efficiency: With 10 items, Fluid improves mean rank distance by 9.9 vs. random; at 500 items, the improvement is 0.8—consistent with diminishing returns as budget grows.

In pretraining runs, accuracy space often looks flat late in training, but ability space continues to rise, delaying apparent saturation (e.g., HellaSwag monotonicity 0.91 → 0.99 for random vs. Fluid).

Fluid also avoids mislabeled items: on MMLU-Redux with 100-item budgets, mislabeled items per session drop from 0.75 (random) to 0.01 (Fluid)—about two orders of magnitude fewer.

Ablations isolate where the gains come from: IRT aggregation raises validity, but only dynamic selection lowers variance; “RANDOM-IRT” can even exceed random’s variance at large budgets, underscoring selection as the key lever.

Does it stop early when confident?

Yes. Fluid supports dynamic stopping using the standard error of the ability estimate; terminate when SE falls below the average ability gap between rank-adjacent LMs on the Open LLM Leaderboard. In practice, required items vary widely over training (≈20 early, >80 mid-run), showing why fixed budgets are suboptimal.

Where does it fit in the evaluation stack?

Fluid is a benchmark-refinement method: it does not invent new tasks; it re-weights and re-orders existing items to maximize information against a latent ability metric. It generalizes beyond pretraining to post-training and to other modalities, assuming enough responses to fit/update an IRT model. As models improve, IRT parameters must be refreshed to resolve difficulty among items that were previously “too hard,” otherwise the top of the scale compresses.

Summary

Fluid Benchmarking makes LLM evaluation budget-efficient and stable by scoring models in ability space and selecting items by Fisher information, yielding lower variance, better rank validity, and delayed saturation with far fewer questions. The trade-offs are operational: maintain fresh response matrices, periodically refit IRT parameters, and ensure reliable right/wrong binarization for open-ended tasks. As these practices standardize, Fluid becomes a practical default for in-loop pretraining and post-training evals across evolving benchmarks.



Google AI Introduces Agent Payments Protocol (AP2): An Open Protocol for Interoperable AI Agent Checkout Across Merchants and Wallets


Your shopping agent auto-purchases a $499 Pro plan instead of the $49 Basic tier—who’s on the hook: the user, the agent’s developer, or the merchant? This trust gap is a primary blocker for agent-led checkout on today’s payment rails. AP2 addresses it with an open, interoperable specification for agent-initiated payments, defining a cryptographically verifiable common language so any compliant agent can transact with any compliant merchant globally.

Google’s Agent Payments Protocol (AP2) is an open, vendor-neutral specification for executing payments initiated by AI agents with cryptographic, auditable proof of user intent. AP2 extends existing open protocols—Agent2Agent (A2A) and Model Context Protocol (MCP)—to define how agents, merchants, and payment processors exchange verifiable evidence across the “intent → cart → payment” pipeline. The goal is to close the trust gap in agent-led commerce without fragmenting the payments ecosystem.

https://github.com/google-agentic-commerce/AP2

Why do agents need a payments protocol?

Today’s rails assume a human is the one clicking “buy” on a trusted surface. When an autonomous or semi-autonomous agent initiates checkout, merchants and issuers face three unresolved questions: (1) was the user’s authority truly delegated (authorization), (2) does the request reflect what the user meant and approved (authenticity), and (3) who is responsible if something goes wrong (accountability). AP2 formalizes the data, cryptography, and messaging to answer those questions consistently across providers and payment types.

How does AP2 establish trust?

AP2 uses Verifiable Credentials (VCs)—tamper-evident, cryptographically signed digital objects—to carry evidence through a transaction. The protocol standardizes three mandate types:

  • Intent Mandate (human-not-present): captures the constraints under which an agent may transact (e.g., brand/category, price caps, timing windows), signed by the user.
  • Cart Mandate (human-present): binds the user’s explicit approval to a merchant-signed cart (items, amounts, currency), producing non-repudiable proof of “what you saw is what you paid.”
  • Payment Mandate: conveys to networks/issuers that an AI agent was involved, including modality (human-present vs not present) and risk-relevant context.

These VCs form an audit trail that unambiguously links user authorization to the final charge request.
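For intuition only, a hypothetical Intent Mandate might carry information along these lines; the field names below are invented for illustration, and the normative schema lives in the AP2 spec and the src/ap2/types package.

from dataclasses import dataclass
from typing import List

@dataclass
class IntentMandate:
    # Hypothetical fields for illustration; the real AP2 types differ.
    user_id: str                     # verifiable identity of the delegating user
    agent_id: str                    # the shopping agent authorized to act
    allowed_merchants: List[str]     # brand/category constraints
    max_price_usd: float             # spending cap the agent may not exceed
    valid_until: str                 # ISO-8601 expiry of the delegation
    human_present: bool = False      # human-not-present modality
    user_signature: str = ""         # cryptographic signature over the constraints

mandate = IntentMandate(
    user_id="did:example:alice",
    agent_id="agent:shopping-assistant",
    allowed_merchants=["example-airline"],
    max_price_usd=100.0,
    valid_until="2025-12-31T23:59:59Z",
)
print(mandate)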

What are the core roles and trust boundaries?

AP2 defines a role-based architecture to separate concerns and minimize data exposure:

  • User delegates a task to an agent.
  • User/Shopping Agent (the interface the user interacts with) interprets the task, negotiates carts, and collects approvals.
  • Credentials Provider (e.g., wallet) holds payment methods and issues method-specific artifacts.
  • Merchant Endpoint exposes catalog/quoting and signs carts.
  • Merchant Payment Processor constructs the network authorization object.
  • Network & Issuer evaluate and authorize the payment.

Human-present vs human-not-present: what changes on the wire?

AP2 defines clear, testable flows:

  • Human-present: the merchant signs a final cart; the user approves it in a trusted UI, generating a signed Cart Mandate. The processor submits the network authorization alongside the Payment Mandate. If needed, step-up (e.g., 3DS) occurs on a trusted surface.
  • Human-not-present: the user pre-authorizes an Intent Mandate (e.g., “buy when price < $100”); the agent later converts it to a Cart Mandate when conditions are satisfied, or the merchant can force re-confirmation.

How does AP2 compose with A2A and MCP?

AP2 is specified as an extension to A2A (for inter-agent messaging) and interoperates with MCP (for tool access) so developers can reuse established capabilities for discovery, negotiation, and execution. AP2 specializes the payments layer—standardizing mandate objects, signatures, and accountability signals—while leaving collaboration and tool invocation to A2A/MCP.

Which payment methods are in scope?

The protocol is payment-method agnostic. The initial focus covers common pull-based instruments (credit/debit cards), with roadmap support for real-time push transfers (e.g., UPI, PIX) and digital assets. For the web3 path, Google and partners have released an A2A x402 extension to operationalize agent-initiated crypto payments, aligning x402 with AP2’s mandate constructs.

What does this look like for developers?

Google has published a public repository (Apache-2.0) with reference documentation, Python types, and runnable samples:

  • Samples demonstrate human-present card flows, an x402 variant, and Android digital payment credentials, showing how to issue/verify mandates and move from agent negotiation to network authorization.
  • Types package: core protocol objects are available under src/ap2/types for integration.
  • Framework choice: while samples use Google’s ADK and Gemini 2.5 Flash, AP2 is framework-agnostic; any agent stack can generate/verify mandates and speak the protocol.

How does AP2 address privacy and security?

AP2’s role separation ensures sensitive data (e.g., PANs, tokens) remains with the Credentials Provider and never needs to flow through general-purpose agent surfaces. Mandates are signed with verifiable identities and can embed risk signals without exposing full credentials to counterparties. This aligns with existing controls (e.g., step-up authentication) and provides networks with explicit markers of agent involvement to support risk and dispute logic.

What about ecosystem readiness?

Google cites collaboration with 60+ organizations, spanning networks, issuers, gateways, and technology vendors (e.g., American Express, Mastercard, PayPal, Coinbase, Intuit, ServiceNow, UnionPay International, Worldpay, Adyen). The objective is to avoid one-off integrations by aligning on common mandate semantics and accountability signals across platforms.

Implementation notes and edge cases

  • Determinism over inference: merchants receive cryptographic evidence of what the user approved (cart) or pre-authorized (intent), rather than model-generated summaries.
  • Disputes: the credential chain functions as evidentiary material for networks/issuers; accountability can be assigned based on which mandate was signed and by whom.
  • Challenges: the issuer or merchant can trigger step-up; AP2 requires challenges to be completed on trusted surfaces and linked to the mandate trail.
  • Multiple agents: when more than one agent participates (e.g., travel metasearch + airline + hotel), A2A coordinates tasks; AP2 ensures each cart is merchant-signed and user-authorized before payment submission.

What comes next?

The AP2 team plans to evolve the spec in the open and continue adding reference implementations, including deeper integrations across networks and web3, and alignment with standards bodies for VC formats and identity primitives. Developers can start today by running the sample scenarios, integrating mandate types, and validating flows against their agent/merchant stacks.

Summary

AP2 gives the agent ecosystem a concrete, cryptographically grounded way to prove user authorization, bind it to merchant-signed carts, and present issuers with an auditable record—without locking developers into a single stack or payment method. If agents are going to buy things on our behalf, this is the kind of evidence trail the payments system needs.




A Coding Guide to Implement Zarr for Large-Scale Data: Chunking, Compression, Indexing, and Visualization Techniques


In this tutorial, we take a deep dive into the capabilities of Zarr, a library designed for efficient storage and manipulation of large, multidimensional arrays. We begin by exploring the basics: creating arrays, setting chunking strategies, and modifying values directly on disk. From there, we expand into more advanced operations such as experimenting with chunk sizes for different access patterns, applying multiple compression codecs to optimize both speed and storage efficiency, and comparing their performance on synthetic datasets. We also build hierarchical structures enriched with metadata, simulate realistic workflows with time-series and volumetric data, and demonstrate advanced indexing to extract meaningful subsets.

!pip install zarr numcodecs -q
import zarr
import numpy as np
import matplotlib.pyplot as plt
from numcodecs import Blosc, Delta, FixedScaleOffset
import tempfile
import shutil
import os
from pathlib import Path


print(f"Zarr version: {zarr.__version__}")
print(f"NumPy version: {np.__version__}")


print("=== BASIC ZARR OPERATIONS ===")

We begin our tutorial by installing Zarr and Numcodecs, along with essential libraries like NumPy and Matplotlib. We then set up the environment and verify the versions, preparing ourselves to dive into basic Zarr operations.

tutorial_dir = Path(tempfile.mkdtemp(prefix="zarr_tutorial_"))
print(f"Working directory: {tutorial_dir}")


z1 = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f4',
               store=str(tutorial_dir / 'basic_array.zarr'), zarr_format=2)
z2 = zarr.ones((500, 500, 10), chunks=(100, 100, 5), dtype='i4',
              store=str(tutorial_dir / 'multi_dim.zarr'), zarr_format=2)


print(f"2D Array shape: {z1.shape}, chunks: {z1.chunks}, dtype: {z1.dtype}")
print(f"3D Array shape: {z2.shape}, chunks: {z2.chunks}, dtype: {z2.dtype}")


z1[100:200, 100:200] = np.random.random((100, 100)).astype('f4')
z2[:, :, 0] = np.arange(500*500).reshape(500, 500)


print(f"Memory usage estimate: {z1.nbytes_stored() / 1024**2:.2f} MB")

We create our working directory and initialize Zarr arrays: a 2D array of zeros and a 3D array of ones. We then fill them with random and sequential values, while also checking their shapes, chunk sizes, and memory usage in real time.

print("\n=== ADVANCED CHUNKING ===")


time_steps, height, width = 365, 1000, 2000
time_series = zarr.zeros(
   (time_steps, height, width),
   chunks=(30, 250, 500),
   dtype='f4',
   store=str(tutorial_dir / 'time_series.zarr'),
   zarr_format=2
)


for t in range(0, time_steps, 30):
   end_t = min(t + 30, time_steps)
   seasonal = np.sin(2 * np.pi * np.arange(t, end_t) / 365)[:, None, None]
   spatial = np.random.normal(20, 5, (end_t - t, height, width))
   time_series[t:end_t] = (spatial + 10 * seasonal).astype('f4')


print(f"Time series created: {time_series.shape}")
print(f"Approximate chunks created")


import time
start = time.time()
temporal_slice = time_series[:, 500, 1000]
temporal_time = time.time() - start


start = time.time()
spatial_slice = time_series[100, :200, :200]
spatial_time = time.time() - start


print(f"Temporal access time: {temporal_time:.4f}s")
print(f"Spatial access time: {spatial_time:.4f}s")

In this step, we simulate a year-long time-series dataset with optimized chunking for both temporal and spatial access. We add seasonal patterns and spatial noise, then measure access speeds, allowing us to see firsthand how chunking impacts performance in real-world data exploration.

print("\n=== COMPRESSION AND CODECS ===")


data = np.random.randint(0, 1000, (1000, 1000), dtype='i4')


from zarr.codecs import BloscCodec, BytesCodec


z_none = zarr.array(data, chunks=(100, 100),
                  codecs=[BytesCodec()],
                  store=str(tutorial_dir / 'no_compress.zarr'))


z_lz4 = zarr.array(data, chunks=(100, 100),
                  codecs=[BytesCodec(), BloscCodec(cname='lz4', clevel=5)],
                  store=str(tutorial_dir / 'lz4_compress.zarr'))


z_zstd = zarr.array(data, chunks=(100, 100),
                   codecs=[BytesCodec(), BloscCodec(cname='zstd', clevel=9)],
                   store=str(tutorial_dir / 'zstd_compress.zarr'))


sequential_data = np.cumsum(np.random.randint(-5, 6, (1000, 1000)), axis=1)
z_delta = zarr.array(sequential_data, chunks=(100, 100),
                    codecs=[BytesCodec(), BloscCodec(cname='zstd', clevel=5)],
                    store=str(tutorial_dir / 'sequential_compress.zarr'))


sizes = {
   'No compression': z_none.nbytes_stored(),
   'LZ4': z_lz4.nbytes_stored(),
   'ZSTD': z_zstd.nbytes_stored(),
   'Sequential+ZSTD': z_delta.nbytes_stored()
}


print("Compression comparison:")
original_size = data.nbytes
for name, size in sizes.items():
   ratio = size / original_size
   print(f"{name}: {size/1024**2:.2f} MB (ratio: {ratio:.3f})")


print("\n=== HIERARCHICAL DATA ORGANIZATION ===")


root = zarr.open_group(str(tutorial_dir / 'experiment.zarr'), mode='w')


raw_data = root.create_group('raw_data')
processed = root.create_group('processed')
metadata = root.create_group('metadata')


raw_data.create_dataset('images', shape=(100, 512, 512), chunks=(10, 128, 128), dtype='u2')
raw_data.create_dataset('timestamps', shape=(100,), dtype='datetime64[ns]')


processed.create_dataset('normalized', shape=(100, 512, 512), chunks=(10, 128, 128), dtype='f4')
processed.create_dataset('features', shape=(100, 50), chunks=(20, 50), dtype='f4')


root.attrs['experiment_id'] = 'EXP_2024_001'
root.attrs['description'] = 'Advanced Zarr tutorial demonstration'
root.attrs['created'] = str(np.datetime64('2024-01-01'))


raw_data.attrs['instrument'] = 'Synthetic Camera'
raw_data.attrs['resolution'] = [512, 512]
processed.attrs['normalization'] = 'z-score'


timestamps = np.datetime64('2024-01-01') + np.arange(100) * np.timedelta64(1, 'h')
raw_data['timestamps'][:] = timestamps


for i in range(100):
   frame = np.random.poisson(100 + 50 * np.sin(2 * np.pi * i / 100), (512, 512)).astype('u2')
   raw_data['images'][i] = frame


print(f"Created hierarchical structure with {len(list(root.group_keys()))} groups")
print(f"Data arrays and groups created successfully")


print("\n=== ADVANCED INDEXING ===")


volume_data = zarr.zeros((50, 20, 256, 256), chunks=(5, 5, 64, 64), dtype='f4',
                       store=str(tutorial_dir / 'volume.zarr'), zarr_format=2)


for t in range(50):
   for z in range(20):
       y, x = np.ogrid[:256, :256]
       center_y, center_x = 128 + 20*np.sin(t*0.1), 128 + 20*np.cos(t*0.1)
       focus_quality = 1 - abs(z - 10) / 10
      
       signal = focus_quality * np.exp(-((y-center_y)**2 + (x-center_x)**2) / (50**2))
       noise = 0.1 * np.random.random((256, 256))
       volume_data[t, z] = (signal + noise).astype('f4')


print("Various slicing operations:")


max_projection = np.max(volume_data[:, 10], axis=0)
print(f"Max projection shape: {max_projection.shape}")


z_stack = volume_data[25, :, 100:156, 100:156]
print(f"Z-stack subset: {z_stack.shape}")


bright_pixels = volume_data[volume_data > 0.5]
print(f"Pixels above threshold: {len(bright_pixels)}")

We benchmark compression by writing the same data with no compression, LZ4, and ZSTD, then compare on-disk sizes to see practical savings. Next, we organize an experiment as a Zarr group hierarchy with rich attributes, images, and timestamps. Finally, we generate a synthetic 4D volume and perform advanced indexing, max projections, sub-stacks, and thresholding, to validate fast, slice-wise access.

print("\n=== PERFORMANCE OPTIMIZATION ===")


def process_chunk_serial(data, func):
   results = []
   for i in range(0, len(data), 100):
       chunk = data[i:i+100]
       results.append(func(chunk))
   return np.concatenate(results)


def gaussian_filter_1d(x, sigma=1.0):
   kernel_size = int(4 * sigma)
   if kernel_size % 2 == 0:
       kernel_size += 1
   kernel = np.exp(-0.5 * ((np.arange(kernel_size) - kernel_size//2) / sigma)**2)
   kernel = kernel / kernel.sum()
   return np.convolve(x.astype(float), kernel, mode='same')


large_array = zarr.array(np.random.random(10000), chunks=(1000,),
                         store=str(tutorial_dir / 'large.zarr'), zarr_format=2)


start_time = time.time()
chunk_size = 1000
filtered_data = []
for i in range(0, len(large_array), chunk_size):
   end_idx = min(i + chunk_size, len(large_array))
   chunk_data = large_array[i:end_idx]
   smoothed = np.convolve(chunk_data, np.ones(5)/5, mode='same')
   filtered_data.append(smoothed)


result = np.concatenate(filtered_data)
processing_time = time.time() - start_time


print(f"Chunk-aware processing time: {processing_time:.4f}s")
print(f"Processed {len(large_array):,} elements")


print("\n=== VISUALIZATION ===")


fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Advanced Zarr Tutorial - Data Visualization', fontsize=16)


axes[0,0].plot(temporal_slice)
axes[0,0].set_title('Temporal Evolution (Single Pixel)')
axes[0,0].set_xlabel('Day of Year')
axes[0,0].set_ylabel('Temperature')


im1 = axes[0,1].imshow(spatial_slice, cmap='viridis')
axes[0,1].set_title('Spatial Pattern (Day 100)')
plt.colorbar(im1, ax=axes[0,1])


methods = list(sizes.keys())
ratios = [sizes[m]/original_size for m in methods]
axes[0,2].bar(range(len(methods)), ratios)
axes[0,2].set_xticks(range(len(methods)))
axes[0,2].set_xticklabels(methods, rotation=45)
axes[0,2].set_title('Compression Ratios')
axes[0,2].set_ylabel('Size Ratio')


axes[1,0].imshow(max_projection, cmap='hot')
axes[1,0].set_title('Max Intensity Projection')


z_profile = np.mean(volume_data[25, :, 120:136, 120:136], axis=(1,2))
axes[1,1].plot(z_profile, 'o-')
axes[1,1].set_title('Z-Profile (Center Region)')
axes[1,1].set_xlabel('Z-slice')
axes[1,1].set_ylabel('Mean Intensity')


axes[1,2].plot(result[:1000])
axes[1,2].set_title('Processed Signal (First 1000 points)')
axes[1,2].set_xlabel('Sample')
axes[1,2].set_ylabel('Amplitude')


plt.tight_layout()
plt.show()

We optimize performance by processing data in chunk-sized batches, applying simple smoothing filters without loading everything into memory. We then visualize temporal trends, spatial patterns, compression effects, and volume profiles, allowing us to see at a glance how our choices in chunking and compression shape the results.

print("\n=== TUTORIAL SUMMARY ===")
print("Zarr features demonstrated:")
print("✓ Multi-dimensional array creation and manipulation")
print("✓ Optimal chunking strategies for different access patterns")
print("✓ Advanced compression with multiple codecs")
print("✓ Hierarchical data organization with metadata")
print("✓ Advanced indexing and data views")
print("✓ Performance optimization techniques")
print("✓ Integration with visualization tools")


def show_tree(path, prefix="", max_depth=3, current_depth=0):
   if current_depth > max_depth:
       return
   items = sorted(path.iterdir())
   for i, item in enumerate(items):
       is_last = i == len(items) - 1
       current_prefix = "└── " if is_last else "├── "
       print(f"{prefix}{current_prefix}{item.name}")
       if item.is_dir() and current_depth < max_depth:
           next_prefix = prefix + ("    " if is_last else "│   ")
           show_tree(item, next_prefix, max_depth, current_depth + 1)


print(f"\nFiles created in {tutorial_dir}:")
show_tree(tutorial_dir)


print(f"\nTotal disk usage: {sum(f.stat().st_size for f in tutorial_dir.rglob('*') if f.is_file()) / 1024**2:.2f} MB")


print("\n🎉 Advanced Zarr tutorial completed successfully!")

We wrap up the tutorial by highlighting everything we explored: array creation, chunking, compression, hierarchical organization, indexing, performance tuning, and visualization. We also review the files generated during the session and confirm total disk usage, giving us a complete picture of how Zarr handles large-scale data efficiently from start to finish.

In conclusion, we move beyond the fundamentals and gain a comprehensive view of how Zarr fits into modern data workflows. We see how it handles storage optimization through compression, organizes complex experiments through hierarchical groups, and enables smooth access to slices of large datasets with minimal overhead. Performance enhancements, such as chunk-aware processing and integration with visualization tools, bring additional depth, demonstrating how theory is directly translated into practice.


