Case Studies: Real-World Applications of Context Engineering


Context engineering has become a transformative force in moving from experimental AI demos to robust, production-grade systems across various industries. Below are distilled examples and evidence of real-world impact:

1. Insurance: Five Sigma & Agentic Underwriting

  • Five Sigma Insurance achieved an 80% reduction in claim processing errors and a 25% increase in adjuster productivity by architecting AI systems that ingest policy data, claims history, and regulations simultaneously. The system leveraged advanced retrieval-augmented generation (RAG) and dynamic context assembly (a minimal sketch follows this list), enabling automation that previously wasn’t possible.
  • In insurance underwriting, tailored schema creation and SME-guided context templates ensured that agents handled diverse formats and business rules, reaching over 95% accuracy after deployment feedback cycles.
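
A minimal Python sketch of this kind of dynamic context assembly; all function names and data sources here are hypothetical stand-ins, not Five Sigma’s actual stack:

from dataclasses import dataclass

@dataclass
class Claim:
    claim_id: str
    description: str

def retrieve_policy(claim: Claim) -> str:
    # Hypothetical lookup against a policy store (vector DB, document index, etc.)
    return f"Policy terms relevant to claim {claim.claim_id}"

def retrieve_history(claim: Claim) -> str:
    return "Prior claims and adjuster notes for this policyholder"

def retrieve_regulations(claim: Claim) -> str:
    return "Jurisdiction-specific claims-handling rules"

def build_context(claim: Claim) -> str:
    # Assemble only the sections this specific claim needs, instead of a static prompt.
    sections = {
        "POLICY": retrieve_policy(claim),
        "CLAIMS HISTORY": retrieve_history(claim),
        "REGULATIONS": retrieve_regulations(claim),
        "CLAIM": claim.description,
    }
    return "\n\n".join(f"## {name}\n{text}" for name, text in sections.items())

prompt = build_context(Claim("C-1042", "Water damage to kitchen ceiling")) + \
    "\n\nAssess coverage and flag any missing documentation."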

2. Financial Services: Block (Square) & Major Banks

  • Block (formerly Square) implemented Anthropic’s Model Context Protocol (MCP) to tie LLMs to live payment and merchant data, moving from static prompts to a dynamic, information-rich environment that improved operational automation and bespoke problem-solving. MCP has since been recognized by OpenAI and Microsoft as a backbone for connecting AIs to real-world workflows.
  • Financial service bots increasingly combine user financial history, market data, and regulatory knowledge in real-time, delivering personalized investment advice and reducing user frustration by 40% compared to earlier generations.

3. Healthcare & Customer Support

  • Healthcare virtual assistants with context engineering now consider patients’ health records, medication schedules, and live appointment tracking—delivering accurate, safe advice and dramatically reducing administrative overhead.
  • Customer service bots with dynamic context integration seamlessly pull up prior tickets, account state, and product info, enabling agents and AI to resolve issues without repetitive questioning. This reduces average handle times and improves satisfaction scores.

4. Software Engineering & Coding Assistants

  • At Microsoft, deploying AI code helpers with architectural and organizational context delivered a 26% increase in completed software tasks and a measurable jump in code quality. Teams with well-engineered context windows experienced 65% fewer errors and significantly reduced hallucinations in code generation.
  • Enterprise developer platforms that incorporated user project history, coding standards, and documentation context saw up to 55% faster onboarding for new engineers and 70% better output quality.

5. Ecommerce & Recommendation Systems

  • Ecommerce AI leveraging browsing history, inventory status, and seasonality data provides users with highly relevant recommendations, leading to a measurable increase in conversions over generic prompt-based systems.
  • Retailers report 10x improvements in personalized offer success rates and reductions in abandoned carts after deploying context-engineered agents.

6. Enterprise Knowledge & Legal AI

  • Legal teams using context-aware AI tools to draft contracts and identify risk factors saw work acceleration and fewer missed compliance risks, since systems could dynamically fetch relevant precedent and legal frameworks.
  • Internal enterprise knowledge search, enhanced with multi-source context blocks (policies, client data, service histories), resulted in faster issue resolution and more consistent, high-quality responses for both employees and customers.

Quantifiable Outcomes Across Industries

  • Task success rates improved up to 10x in some applications.
  • Cost reductions of 40% and time savings of 75–99% have been reported when context engineering is applied at scale.
  • User satisfaction and engagement metrics rise substantially when systems move beyond isolated prompts to contextual, adaptive information flows.

Context engineering is now central to enterprise AI, enabling reliable automation, rapid scaling, and next-level personalization that isolated prompt engineering cannot match. These case studies showcase how systematically designing and managing context turns large language models and agents from “clever toy” to “business-critical infrastructure”.



The post Case Studies: Real-World Applications of Context Engineering appeared first on MarkTechPost.

Zhipu AI Releases GLM-4.5V: Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Zhipu AI has officially released and open-sourced GLM-4.5V, a next-generation vision-language model (VLM) that significantly advances the state of open multimodal AI. Based on Zhipu’s 106-billion parameter GLM-4.5-Air architecture—with 12 billion active parameters via a Mixture-of-Experts (MoE) design—GLM-4.5V delivers strong real-world performance and unmatched versatility across visual and textual content.

Key Features and Design Innovations

1. Comprehensive Visual Reasoning

  • Image Reasoning: GLM-4.5V achieves advanced scene understanding, multi-image analysis, and spatial recognition. It can interpret detailed relationships in complex scenes (such as distinguishing product defects, analyzing geographical clues, or inferring context from multiple images simultaneously).
  • Video Understanding: It processes long videos, performing automatic segmentation and recognizing nuanced events thanks to a 3D convolutional vision encoder. This enables applications like storyboarding, sports analytics, surveillance review, and lecture summarization.
  • Spatial Reasoning: Integrated 3D Rotational Positional Encoding (3D-RoPE) gives the model a robust perception of three-dimensional spatial relationships, crucial for interpreting visual scenes and grounding visual elements.

2. Advanced GUI and Agent Tasks

  • Screen Reading & Icon Recognition: The model excels at reading desktop/app interfaces, localizing buttons and icons, and assisting with automation—essential for RPA (robotic process automation) and accessibility tools.
  • Desktop Operation Assistance: Through detailed visual understanding, GLM-4.5V can plan and describe GUI operations, assisting users in navigating software or performing complex workflows.

3. Complex Chart and Document Parsing

  • Chart Understanding: GLM-4.5V can analyze charts, infographics, and scientific diagrams within PDFs or PowerPoint files, extracting summarized conclusions and structured data even from dense, long documents.
  • Long Document Interpretation: With support for up to 64,000 tokens of multimodal context, it can parse and summarize extended, image-rich documents (such as research papers, contracts, or compliance reports), making it ideal for business intelligence and knowledge extraction.

4. Grounding and Visual Localization

  • Precise Grounding: The model can accurately localize and describe visual elements—such as objects, bounding boxes, or specific UI elements—using world knowledge and semantic context, not just pixel-level cues. This enables detailed analysis for quality control, AR applications, and image annotation workflows.

Architectural Highlights

  • Hybrid Vision-Language Pipeline: The system integrates a powerful visual encoder, MLP adapter, and a language decoder, allowing seamless fusion of visual and textual information. Static images, videos, GUIs, charts, and documents are all treated as first-class inputs.
  • Mixture-of-Experts (MoE) Efficiency: While housing 106B total parameters, the MoE design activates only 12B per inference, ensuring high throughput and affordable deployment without sacrificing accuracy.
  • 3D Convolution for Video & Images: Video inputs are processed using temporal downsampling and 3D convolution, enabling the analysis of high-resolution videos and native aspect ratios, while maintaining efficiency.
  • Adaptive Context Length: Supports up to 64K tokens, allowing robust handling of multi-image prompts, concatenated documents, and lengthy dialogues in one pass.
  • Innovative Pretraining and RL: The training regime combines massive multimodal pretraining, supervised fine-tuning, and Reinforcement Learning with Curriculum Sampling (RLCS) for long-chain reasoning mastery and real-world task robustness.

“Thinking Mode” for Tunable Reasoning Depth

A prominent feature is the “Thinking Mode” toggle:

  • Thinking Mode ON: Prioritizes deep, step-by-step reasoning, suitable for complex tasks (e.g., logical deduction, multi-step chart or document analysis).
  • Thinking Mode OFF: Delivers faster, direct answers for routine lookups or simple Q&A.

The user can control the model’s reasoning depth at inference, balancing speed against interpretability and rigor; a hedged request sketch follows below.
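
A sketch of toggling this at inference time through an OpenAI-compatible chat endpoint; the endpoint URL and the thinking field are illustrative assumptions, so check Zhipu’s API documentation for the actual parameter names:

import requests

# Hypothetical endpoint and parameter names; verify against the official GLM-4.5V docs.
resp = requests.post(
    "https://example-glm-endpoint/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "glm-4.5v",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize the main trend in this chart."},
            ],
        }],
        "thinking": {"type": "enabled"},  # assumed flag: deep reasoning on; "disabled" for fast answers
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])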

Benchmark Performance and Real-World Impact

  • State-of-the-Art Results: GLM-4.5V achieves state-of-the-art results across more than 40 public multimodal benchmarks, including MMBench, AI2D, MMStar, and MathVista, outperforming open models and even some premium proprietary models in categories such as STEM QA, chart understanding, GUI operation, and video comprehension.
  • Practical Deployments: Businesses and researchers report transformative results in defect detection, automated report analysis, digital assistant creation, and accessibility technology with GLM-4.5V.
  • Democratizing Multimodal AI: Open-sourced under the MIT license, the model equalizes access to cutting-edge multimodal reasoning that was previously gated by exclusive proprietary APIs.

Example Use Cases

| Feature | Example Use | Description |
| --- | --- | --- |
| Image Reasoning | Defect detection, content moderation | Scene understanding, multiple-image summarization |
| Video Analysis | Surveillance, content creation | Long video segmentation, event recognition |
| GUI Tasks | Accessibility, automation, QA | Screen/UI reading, icon location, operation suggestion |
| Chart Parsing | Finance, research reports | Visual analytics, data extraction from complex charts |
| Document Parsing | Law, insurance, science | Analyze & summarize long illustrated documents |
| Grounding | AR, retail, robotics | Target object localization, spatial referencing |

Summary

GLM-4.5V by Zhipu AI is a flagship open-source vision-language model setting new performance and usability standards for multimodal reasoning. With its powerful architecture, context length, real-time “thinking mode”, and broad capability spectrum, GLM-4.5V is redefining what’s possible for enterprises, researchers, and developers working at the intersection of vision and language.


Check out the Paper, Model on Hugging Face and GitHub Page here.

The post Zhipu AI Releases GLM-4.5V: Versatile Multimodal Reasoning with Scalable Reinforcement Learning appeared first on MarkTechPost.

Meet LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index

Embedding-based search outperforms traditional keyword-based methods across various domains by capturing semantic similarity with dense vector representations and approximate nearest neighbor (ANN) search. However, the ANN data structure brings excessive storage overhead, often 1.5 to 7 times the size of the original raw data. This overhead is manageable in large-scale web applications but becomes impractical for personal devices or large datasets. Reducing storage to under 5% of the original data size is critical for edge deployment, but existing solutions fall short. Techniques like product quantization (PQ) can reduce storage, but they either degrade accuracy or increase search latency.


Vector search methods rely on IVF and proximity graphs. Graph-based approaches like HNSW, NSG, and Vamana are considered state-of-the-art due to their balance of accuracy and efficiency. Efforts to reduce graph size, such as learned neighbor selection, face limitations due to high training costs and dependency on labeled data. For resource-constrained environments, DiskANN and Starling store data on disk, while FusionANNS optimizes hardware usage. Methods like AiSAQ and EdgeRAG attempt to minimize memory usage but still suffer from high storage overhead or performance degradation at scale. Embedding compression techniques like PQ and RabitQ provide quantization with theoretical error bounds, but struggle to maintain accuracy under tight budgets.

Researchers from UC Berkeley, CUHK, Amazon Web Services, and UC Davis have developed LEANN, a storage-efficient ANN search index optimized for resource-limited personal devices. It integrates a compact graph-based structure with an on-the-fly recomputation strategy, enabling fast and accurate retrieval while minimizing storage overhead. LEANN achieves storage up to 50 times smaller than standard indexes by reducing the index size to under 5% of the original raw data, and it maintains 90% top-3 recall in under 2 seconds on real-world question-answering benchmarks. To reduce latency, LEANN utilizes a two-level traversal algorithm and dynamic batching that combines embedding computations across search hops, enhancing GPU utilization.

LEANN’s architecture combines graph-based recomputation, two core latency- and storage-reduction techniques, and a simple system workflow. Built on the HNSW framework, it observes that each query needs embeddings for only a limited subset of nodes, prompting on-demand computation instead of pre-storing all embeddings. To address the challenges above, LEANN introduces two techniques: (a) a two-level graph traversal with dynamic batching to lower recomputation latency, and (b) a high-degree-preserving graph pruning method to reduce metadata storage. In the system workflow, LEANN begins by computing embeddings for all dataset items and then constructs a vector index using an off-the-shelf graph-based indexing approach; a small sketch of the recomputation idea follows below.
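
The following rough Python sketch shows the general idea, graph traversal that recomputes embeddings on demand and batches them per hop, not LEANN’s actual implementation (the embedding function is a random stand-in):

import numpy as np

def embed_batch(texts):
    # Stand-in for a real embedding model; LEANN-style systems recompute these
    # on the fly instead of storing every vector on disk.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384)).astype("float32")

def search(query, graph, texts, entry, beam=8, hops=3):
    # Best-first traversal that embeds only the nodes it actually visits.
    q = embed_batch([query])[0]
    frontier, visited, scored = [entry], {entry}, {}
    for _ in range(hops):
        # Dynamic batching: embed the whole frontier in one call per hop
        vecs = embed_batch([texts[i] for i in frontier])
        for node, vec in zip(frontier, vecs):
            scored[node] = float(q @ vec)
        best = sorted(frontier, key=lambda n: -scored[n])[:beam]
        frontier = [nb for n in best for nb in graph[n] if nb not in visited]
        visited.update(frontier)
        if not frontier:
            break
    return sorted(scored, key=lambda n: -scored[n])[:beam]

The only persistent state here is the adjacency lists in graph plus the raw texts; no embedding matrix is ever stored, which is where the storage savings come from.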

In terms of storage and latency, LEANN outperforms EdgeRAG, an IVF-based recomputation method, achieving latency reductions ranging from 21.17 to 200.60 times across various datasets and hardware platforms. This advantage is from LEANN’s polylogarithmic recomputation complexity, which scales more efficiently than EdgeRAG’s √𝑁 growth. In terms of accuracy for downstream RAG tasks, LEANN achieves higher performance across most datasets, except GPQA, where a distributional mismatch limits its effectiveness. Similarly, on HotpotQA, the single-hop retrieval setup limits accuracy gains, as the dataset demands multi-hop reasoning. Despite these limitations, LEANN shows strong performance across diverse benchmarks.

In this paper, researchers introduced LEANN, a storage-efficient neural retrieval system that combines graph-based recomputation with innovative optimizations. By integrating a two-level search algorithm and dynamic batching, it eliminates the need to store full embeddings, achieving significant reductions in storage overhead while maintaining high accuracy. Despite its strengths, LEANN faces limitations, such as high peak storage usage during index construction, which could be addressed through pre-clustering or other techniques. Future work may focus on reducing latency and enhancing responsiveness, opening the path for broader adoption in resource-constrained environments.


Check out the Paper and GitHub Page here.

The post Meet LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index appeared first on MarkTechPost.

A Practical Prompt Engineering Guide for GPT-5 with Examples

GPT-5 is OpenAI’s latest and most powerful large language model (LLM) to date. The company calls it its smartest, fastest, most useful model yet, with built-in thinking and PhD-level intelligence. Even when a model is this capable, its effectiveness depends on how people apply it to everyday tasks and how well they take advantage of its intelligence. One of the best ways to use an AI model efficiently is to write clear, effective prompts that steer it toward accurate answers rather than hallucinations.

To help you get the best out of GPT-5, OpenAI has released a GPT-5 Prompting Guide in its cookbook: a practical prompt engineering reference for the new model.

This guide will help you get a more predictable, more useful AI that can plan work, follow instructions precisely, and ship clean code without drama. The model is more proactive than older systems, so the guide leans into controls that help you decide when GPT-5 should go on its own and when it should slow down and ask.

A Smarter, More Capable AI

So, what makes GPT-5 such a game-changer? It’s not just about being able to answer more questions or write longer blocks of text. The improvements are far more profound. One of the most significant advancements is the massive reduction in “hallucinations,” the term for when an AI generates false or misleading information. This means we can trust the answers we get from GPT-5 to a much greater degree, making it a more reliable tool for everything from research to creative writing.

What’s actually new—and why it matters

We always hear companies make claims, so what’s actually new and why does that matter?

Agentic control without chaos

GPT-5 can operate anywhere on the spectrum from tightly scripted helper to independent problem-solver. The guide shows how to dial “agentic eagerness” up or down.

For example: Reduce exploration with a lower reasoning_effort, or push persistence with prompts that tell the model to keep going until the task is truly done. That makes long, tool-heavy workflows feel less random and more repeatable.

Clearly defining criteria in your prompt for how you want the AI model to explore the problem space can reduce the model’s need to explore and reason about too many ideas.

Example prompt:

You are a research assistant helping me create a 2-page market trends summary.
- Use reasoning_effort: low
- Do not explore topics outside the 2023–2025 data.
- Stop if you cannot find at least 3 credible sources.
- Keep going until you've fully summarized the 3 trends and written the final draft.

Progress you can follow (tool preambles)

Long tasks build trust when the model explains what it’s doing. The guide encourages short “preambles” before and during tool calls: restate the goal, outline the plan, narrate key steps, then summarize what changed. This doesn’t add fluff; it helps humans review and step in without derailing momentum.

Example prompt:

Your job is to clean and analyze a CSV file.
Before you start, restate the task in one sentence and outline the steps you will take.
When using a tool, explain what you're doing in 1–2 sentences before calling it.
After each step, summarize the result in under 50 words.

Right-sized thinking (reasoning effort)

The reasoning_effort parameter is your depth dial: keep it moderate for routine tasks, raise it for multi-step or tricky problems, and crucially break big tasks into distinct turns so the model plans and checks work in stages. That structure improves both quality and speed.

Example prompt:

You are solving a logic puzzle.
- Use reasoning_effort: high
- Explain your reasoning before giving the final answer.
- If the puzzle takes more than 5 steps to solve, break it into stages and confirm with me after each stage.

Better multi-step flows with the Responses API

If you’re building tools or agents around GPT-5, the Responses API lets the model reuse its prior reasoning instead of re-planning from scratch. OpenAI reports measurable gains just by making that switch (they cite a Tau-Bench Retail bump from ~73.9% to ~78.2% when passing prior reasoning), which in practice means lower cost, lower latency, and more stable behavior.

Example prompt:

{
  "model": "gpt-5",
  "previous_response_id": "resp_12345",
  "input": "Now summarize the insights from the analysis above in bullet points."
}
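
The same call as a Python sketch with the OpenAI SDK’s Responses API, combining previous_response_id with the reasoning and verbosity dials covered in this guide (parameter names follow OpenAI’s published docs; confirm against your current SDK version):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.responses.create(
    model="gpt-5",
    previous_response_id="resp_12345",  # reuse prior reasoning instead of re-planning
    input="Now summarize the insights from the analysis above in bullet points.",
    reasoning={"effort": "low"},        # the depth dial
    text={"verbosity": "low"},          # the length dial for the final answer
)
print(resp.output_text)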

Coding with GPT-5

GPT-5 can build new apps or make large, multi-file edits, but the standout tip is to codify taste and standards:

  • Frameworks: Next.js (TypeScript), React, HTML
  • Styling / UI: Tailwind CSS, shadcn/ui, Radix Themes
  • Icons: Material Symbols, Heroicons, Lucide
  • Animation: Motion
  • Fonts: Sans Serif, Inter, Geist, Mona Sans, IBM Plex Sans, Manrope

The GPT-5 model can hunt for context (like installed packages) without needing special prompts, and a short “code editing rules” block encourages it to follow your house style.

Example prompt:

You are editing an existing React app.
- Follow BEM CSS naming.
- Use Tailwind for styling.
- Place all new components in /src/components/ui.
- Maintain camelCase for function names.
- Use only hooks for state management.
Now, add a responsive navbar with a dropdown menu that matches the existing color palette.

A practical pattern from Cursor

Cursor, an AI code editor that tested GPT-5 early, found a useful balance:

  • Set verbosity low for regular text so the assistant stays concise, but ask for high verbosity inside code tools so diffs are readable with clear variable names.

They also learned to soften old “analyze everything” prompts, which made earlier models thorough but nudged GPT-5 to overuse tools. The fix—structured sections and clearer scope—reduced unnecessary calls and kept autonomy high.

Example prompt:

Global setting: verbosity=low
When providing code edits:
- Switch verbosity to high
- Include full diffs with comments explaining each change
- Use descriptive variable names

Two separate dials: thinking vs. talking (verbosity)

GPT-5 adds a verbosity parameter that controls the length of the final answer (separate from how hard it thinks). Keep a concise global default, then override locally where detail matters, for example in code explanations or audit trails. You can also steer verbosity with plain language inside the prompt.

Example prompt:

Verbosity: low
Task: Summarize this 20-page report in 200 words.
If I type "explain more", increase verbosity, and give a detailed breakdown.

Instruction precision matters

GPT-5 follows directions with “surgical” accuracy, which is powerful and at the same time unforgiving. If your prompt includes conflicts (“never schedule without consent” next to “auto-assign before contacting”), the model wastes tokens reconciling rules. Clean hierarchies and explicit exceptions fix that. The guide even shows a healthcare example and how rewriting it makes reasoning faster and clearer. Use OpenAI’s prompt optimizer to spot these issues.

Example prompt (❌ bad):

Never schedule meetings without consent.
Always schedule the earliest available time.

Fixed prompt (✅ good):

Only schedule meetings with explicit consent.
If consent is given, choose the earliest available time.

Minimal reasoning for speed

There’s also a minimal-effort mode: the fastest option that still benefits from the reasoning model pattern. It shines on latency-sensitive tasks when paired with a short “why” summary, clear tool preambles, and an explicit plan.

The model spends fewer reasoning tokens on figuring out how to solve the problem before writing the answer, and won’t do as much step-by-step planning inside its hidden reasoning process. If you still want it to complete a multi-step task or follow a structured approach, you have to build that structure into your prompt. For example:

  • Explicitly list the steps you want it to take.
  • Provide templates, headings, or formats that it should fill in.
  • State exactly what information goes where and in what order.

This “scaffolding” works like the outline of a building: it gives the model a clear frame to follow when it’s not spending much time figuring things out for itself.

If you don’t provide that, minimal reasoning mode might give you a shallow, incomplete, or disorganized answer—because it’s skipping the deeper planning phase.

Example prompt:

Minimal reasoning
Extract all email addresses from the following text and output as a comma-separated list.
Do not include any other text.

Formatting defaults and overrides

By default, API responses aren’t in Markdown. If your app expects headings, lists, and code fences, say so—briefly and precisely—and refresh that instruction every few turns in long chats to keep the formatting consistent.

Example prompt:

Output your response in Markdown format.
- Use H2 headings
- Bullet points for lists
- Code blocks with language tags for code snippets

Metaprompting with GPT-5

When a prompt underperforms, ask GPT-5 to critique it: what to add, delete, or clarify to elicit the behavior you want, without removing everything that already works. It’s a low-effort way to improve the prompts you rely on daily.

Example prompt:

Here is my current prompt: "Write a friendly product description for our coffee shop."
Critique this prompt and suggest 3 specific changes that would make the output warmer, more detailed, and in a consistent brand voice.

Why this matters

The main idea of the guide is to maintain control without micromanaging. You set a small number of dials—how much to think, how much to say, how persistent to be—and describe your environment with just enough detail for the model to blend in. In return, you get better hand-offs, fewer stalls, and code or documents that feel like they were made by your team. Most importantly, the fixes are the kind you can try this afternoon: trim a conflicting line, add a persistence clause, split a big job into steps, or move to the Responses API if you’re doing tool-based work.

In Conclusion:

Good prompts aren’t fancy—they’re specific. GPT-5 rewards that specificity: clear stop conditions, right-sized reasoning, a steady narration of what it’s doing, and a tidy set of rules for how work should look when it’s done. Treat the guide as a checklist, not a lecture. Start small, measure, and keep your prompts short, scoped, and consistent. The payoff is an assistant that feels less like a chat window and more like a reliable teammate, one you can trust to plan, execute, and finish the job.



↗️ GPT-5 Prompting Guide

NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models with Extended Reinforcement Learning RL

What Is ProRLv2?

ProRLv2 is the latest version of NVIDIA’s Prolonged Reinforcement Learning (ProRL), designed specifically to push the boundaries of reasoning in large language models (LLMs). By scaling reinforcement learning (RL) steps from 2,000 up to 3,000, ProRLv2 systematically tests how extended RL can unlock new solution spaces, creativity, and high-level reasoning that were previously inaccessible—even with smaller models like the 1.5B-parameter Nemotron-Research-Reasoning-Qwen-1.5B-v2.

Key Innovations in ProRLv2

ProRLv2 incorporates several innovations to overcome common RL limitations in LLM training:

  • REINFORCE++-Baseline: A robust RL algorithm that enables long-horizon optimization over thousands of steps, handling the instability typical in RL for LLMs.
  • KL Divergence Regularization & Reference Policy Reset: Periodically refreshes the reference model with the current best checkpoint, allowing stable progress and continued exploration by preventing the RL objective from dominating too early.
  • Decoupled Clipping & Dynamic Sampling (DAPO): Encourages diverse solution discovery by boosting unlikely tokens and focusing learning signals on prompts of intermediate difficulty.
  • Scheduled Length Penalty: Cyclically applied, helping maintain diversity and prevent entropy collapse as training lengthens.
  • Scaling Training Steps: ProRLv2 moves the RL training horizon from 2,000 to 3,000 steps, directly testing how much longer RL can expand reasoning abilities.

How ProRLv2 Expands LLM Reasoning

Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained with ProRLv2 for 3,000 RL steps, sets a new standard for open-weight 1.5B models on reasoning tasks, including math, code, science, and logic puzzles:

  • Performance surpasses previous versions and competitors like DeepSeek-R1-1.5B.
  • Sustained gains with more RL steps: Longer training leads to continual improvements, especially on tasks where base models perform poorly, demonstrating genuine expansion in reasoning boundaries.
  • Generalization: Not only does ProRLv2 boost pass@1 accuracy, but it also enables novel reasoning and solution strategies on tasks not seen during training.
  • Benchmarks: Gains include average pass@1 improvements of 14.7% in math, 13.9% in coding, 54.8% in logic puzzles, 25.1% in STEM reasoning, and 18.1% in instruction-following tasks, with further improvements in v2 on unseen and harder benchmarks.

Why It Matters

The major finding of ProRLv2 is that continued RL training, with careful exploration and regularization, reliably expands what LLMs can learn and generalize. Rather than hitting an early plateau or overfitting, prolonged RL allows smaller models to rival much larger ones in reasoning—demonstrating that scaling RL itself is as important as model or dataset size.

Using Nemotron-Research-Reasoning-Qwen-1.5B-v2

The latest checkpoint is available for testing on Hugging Face. Loading the model:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
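
A short follow-on sketch for generating an answer with the loaded checkpoint (standard transformers generation; the prompt and decoding settings are purely illustrative):

import torch

prompt = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))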

Conclusion

ProRLv2 redefines the limits of reasoning in language models by showing that RL scaling laws matter as much as size or data. Through advanced regularization and smart training schedules, it enables deep, creative, and generalizable reasoning even in compact architectures. The future lies in how far RL can push—not just how big models can get.


Check out the Unofficial Blog and Model on Hugging Face here.

The post NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models with Extended Reinforcement Learning RL appeared first on MarkTechPost.

AI Agent Trends of 2025: A Transformative Landscape

The year 2025 marks a defining moment in the evolution of artificial intelligence, ushering in an era where agentic systems—autonomous AI agents capable of complex reasoning and coordinated action—are transforming enterprise workflows, research, software development, and day-to-day user experiences. This article focuses on the core AI agent trends of 2025: Agentic RAG, Voice Agents, AI Agent Protocols, DeepResearch Agents, Coding Agents, and Computer-Using Agents (CUA).

1. Agentic RAG: Reasoning-Driven AI Workflows

Agentic Retrieval-Augmented Generation (RAG) stands as the cornerstone use case for real-world AI agents in 2025. Building on the standard RAG architecture, Agentic RAG introduces goal-driven autonomy, memory, and planning. Here’s how the agentic approach refines classical RAG (a minimal sketch follows the list):

  • Memory & Context Retention: Agents track user queries across sessions, building short-term and long-term memory for seamless context management.
  • Planning & Tool Use: Agents dynamically select retrieval strategies (vector DBs, APIs) and coordinate the right tool for the task.
  • Multi-Step Reasoning: They orchestrate complex workflows—involving dynamic data fetching, prompt optimization, and leveraging diverse sources—before generating responses via LLMs.
  • Accuracy and Adaptability: Enhanced post-generation verification and learning loop improve output quality and domain adaptability, creating systems that can synthesize and reason over vast data sets, not just retrieve answers.
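
A condensed Python sketch of that loop; the router, retrievers, memory store, and llm callable are generic placeholders rather than any particular product:

def select_retriever(query: str, tools: dict):
    # Planning & tool use: route the query to the most suitable retrieval strategy.
    if any(k in query.lower() for k in ("latest", "today", "price")):
        return tools["web_api"]
    return tools["vector_db"]

def agentic_rag(query: str, tools: dict, memory: list, llm) -> str:
    retriever = select_retriever(query, tools)
    evidence = retriever(query)
    # Memory & context retention: prior turns are carried into the prompt.
    prompt = (
        "Conversation so far:\n" + "\n".join(memory)
        + f"\n\nEvidence:\n{evidence}\n\nQuestion: {query}\nAnswer:"
    )
    answer = llm(prompt)                       # generation grounded in retrieved evidence
    memory.append(f"Q: {query}\nA: {answer}")  # update long-term memory for later turns
    return answer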

Enterprise adoption of Agentic RAG is sweeping across sectors, powering smart assistants, search engines, and collaborative platforms that rely on multi-source data retrieval and reasoning.

2. Voice Agents: Natural Language Interfaces

Voice-controlled agents are reaching new heights, seamlessly blending speech-to-text (STT) and text-to-speech (TTS) technologies with agentic reasoning pipelines. These agents interact conversationally with users, retrieve data from diverse sources, and even execute tasks such as placing calls or managing calendars—all through spoken language.

  • Intelligent Telephony: Agents can participate in live phone conversations, interpret natural queries, and deliver informed responses based on enterprise databases.
  • Context-Aware Interaction: Deep integration with agentic workflows ensures voice agents adapt to context, understand intent, and use planning to fulfill spoken tasks beyond simple command-and-response.

3. AI Agent Protocols: Coordination at Scale

With the proliferation of multi-agent systems, open communication protocols are vital. The most prominent ones include:

  • MCP (Model Context Protocol): Shares workflow states, tools, and memory across agents.
  • ACP (Agent Communication Protocol): Enables reliable message exchange, workflow orchestration, context management, and observability.
  • A2A (Agent-to-Agent Protocol): Facilitates seamless, decentralized collaboration and task delegation among agents—even across platform or vendor boundaries.

These protocols are rapidly adopted to enable scalable, interoperable, and secure agentic ecosystems in the enterprise—supporting everything from customer support to supply chain automation.

4. DeepResearch Agents: Advanced Collaborative Analysis

A new category of agents, DeepResearch Agents, is architected for tackling multi-step research problems. These AI systems aggregate and analyze vast swathes of structured and unstructured information from the web and databases, synthesizing analytical reports and actionable insights.

  • Long-Horizon Planning: Capable of breaking down research tasks into sub-queries, aggregating results, and iteratively refining outputs with reasoned analysis.
  • Multi-Agent Collaboration: Specialized agents—for citation, aggregation, verification—work together to generate thoroughly researched deliverables.
  • Tool Integration: DeepResearch agents leverage APIs, browsers, code execution tools, and context protocols to drive high-depth reports at a speed impossible for human researchers.

Business, science, and finance sectors are rapidly integrating DeepResearch architecture, reshaping how teams approach knowledge-intensive work.

5. Coding Agents & CUA: Autonomous Software Engineering

Coding Agents are revolutionizing application development, debugging, and testing:

  • Code Generation: Agents propose solutions, architect systems, and write code based on abstract queries or specifications.
  • Autonomous Debugging: They diagnose issues, apply fixes, and even run test suites iteratively.
  • Testing & Continuous Integration: Agents manage testing environments, execute test runners, and ensure code quality at scale.

Computer-Using Agents (CUA) bridge the gap between human-computer interaction and autonomous interfaces. These agents operate desktop sandboxes, manipulate files and data, and use third-party tools—fully automating tasks as a human would.

The Bigger Picture: Autonomous, Collaborative, and Context-Aware AI

The AI agent revolution of 2025 is defined by several key themes:

  • Autonomy: Agents plan and execute complex tasks with minimal human intervention.
  • Collaboration: Robust protocols unlock federated, large-scale coordination between agents and platforms.
  • Memory & Reasoning: Enhanced long-term memory and advanced reasoning deliver higher-quality, more relevant results.
  • Accessibility: Low-code and no-code tools are democratizing agent development, enabling non-technical users to harness agentic AI.

With ongoing innovations, human oversight remains critical. As agents become more capable, establishing boundaries around agent autonomy—and ensuring transparency and safety—are vital for responsible adoption.

In Summary

The agentic AI trend of 2025 is not about single-purpose bots but about sophisticated, task-oriented systems capable of holistic reasoning, collaboration, and learning. These advances are redefining how we work, research, build, and interact with technology.



The post AI Agent Trends of 2025: A Transformative Landscape appeared first on MarkTechPost.

From 100,000 to Under 500 Labels: How Google AI Cuts LLM Training Data by Orders of Magnitude

Google Research has unveiled a groundbreaking method for fine-tuning large language models (LLMs) that slashes the amount of required training data by up to 10,000x while maintaining or even improving model quality. The approach centers on active learning, focusing expert labeling effort on the most informative examples—the “boundary cases” where model uncertainty peaks.

The Traditional Bottleneck

Fine-tuning LLMs for tasks demanding deep contextual and cultural understanding—like ad content safety or moderation—has typically required massive, high-quality labeled datasets. Most data is benign, meaning that for policy violation detection, only a small fraction of examples matter, driving up the cost and complexity of data curation. Standard methods also struggle to keep up when policies or problematic patterns shift, necessitating expensive retraining.

Google’s Active Learning Breakthrough

How It Works:

  • LLM-as-Scout: The LLM is used to scan a vast corpus (hundreds of billions of examples) and identify cases it’s least certain about.
  • Targeted Expert Labeling: Instead of labeling thousands of random examples, human experts only annotate those borderline, confusing items.
  • Iterative Curation: This process repeats, with each batch of new “problematic” examples informed by the latest model’s confusion points (a minimal sketch of the loop follows below).
  • Rapid Convergence: Models are fine-tuned in multiple rounds, and the iteration continues until the model’s output aligns closely with expert judgment—measured by Cohen’s Kappa, which compares agreement between annotators beyond chance.
Image source: https://research.google/blog/achieving-10000x-training-data-reduction-with-high-fidelity-labels/
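
A minimal Python sketch of the selection loop described above; the scoring and labeling functions are stand-ins, not Google’s implementation:

import random

def uncertainty(score: float) -> float:
    # 1.0 = maximally uncertain (predicted probability near 0.5), 0.0 = confident
    return 1.0 - abs(score - 0.5) * 2.0

def score_example(example: str) -> float:
    # Stand-in for the LLM's predicted probability that `example` violates policy.
    return random.random()

def expert_label(example: str) -> int:
    # Stand-in for a human expert's judgment (1 = violation, 0 = benign).
    return 0

def curation_round(corpus: list, budget: int = 100) -> list:
    # Scout: rank a large candidate pool by model uncertainty, keep the borderline cases.
    ranked = sorted(corpus, key=lambda ex: uncertainty(score_example(ex)), reverse=True)
    batch = ranked[:budget]
    # Targeted labeling: experts annotate only these; the model is then fine-tuned on
    # them and the loop repeats until model/expert agreement (e.g. Cohen's kappa) plateaus.
    return [(ex, expert_label(ex)) for ex in batch]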

Impact:

  • Data Needs Plummet: In experiments with Gemini Nano-1 and Nano-2 models, alignment with human experts reached parity or better using 250–450 well-chosen examples rather than ~100,000 random crowdsourced labels—a reduction of three to four orders of magnitude.
  • Model Quality Rises: For more complex tasks and larger models, performance improvements reached 55–65% over baseline, demonstrating more reliable alignment with policy experts.
  • Label Efficiency: For reliable gains using tiny datasets, high label quality was consistently necessary (Cohen’s Kappa > 0.8).

Why It Matters

This approach flips the traditional paradigm. Rather than drowning models in vast pools of noisy, redundant data, it leverages both LLMs’ ability to identify ambiguous cases and the domain expertise of human annotators where their input is most valuable. The benefits are profound:

  • Cost Reduction: Vastly fewer examples to label, dramatically lowering labor and capital expenditure.
  • Faster Updates: The ability to retrain models on a handful of examples makes adaptation to new abuse patterns, policy changes, or domain shifts rapid and feasible.
  • Societal Impact: Enhanced capacity for contextual and cultural understanding increases the safety and reliability of automated systems handling sensitive content.

In Summary

Google’s new methodology enables LLM fine-tuning on complex, evolving tasks with just hundreds (not hundreds of thousands) of targeted, high-fidelity labels—ushering in far leaner, more agile, and cost-effective model development.


Check out the technical article from the Google blog.

The post From 100,000 to Under 500 Labels: How Google AI Cuts LLM Training Data by Orders of Magnitude appeared first on MarkTechPost.

Graph-R1: An Agentic GraphRAG Framework for Structured, Multi-Turn Reasoning with Reinforcement Learning

Introduction

Large Language Models (LLMs) have set new benchmarks in natural language processing, but their tendency to hallucinate—generating inaccurate outputs—remains a critical issue for knowledge-intensive applications. Retrieval-Augmented Generation (RAG) frameworks attempt to solve this by incorporating external knowledge into language generation. However, traditional RAG approaches rely on chunk-based retrieval, which limits their ability to represent complex semantic relationships. Entity-relation graph-based RAG methods (GraphRAG) address some structural limitations, but they still face high construction costs, inflexible one-shot retrieval, and dependence on long-context reasoning and carefully crafted prompts.

Researchers from Nanyang Technological University, National University of Singapore, Beijing Institute of Computer Technology and Application, and Beijing Anzhen Hospital have introduced Graph-R1, an agentic GraphRAG framework powered by end-to-end reinforcement learning.

Image source: https://arxiv.org/pdf/2507.21892v1

Core Innovations of Graph-R1

1. Lightweight Knowledge Hypergraph Construction

Graph-R1 constructs knowledge as a hypergraph, where each knowledge segment is extracted using LLM-driven n-ary relation extraction. This approach encodes richer and more semantically grounded relationships, boosting agentic reasoning capabilities while maintaining manageable cost and computational requirements.

  • Efficiency: Only 5.69s and $2.81 per 1,000 tokens for construction (vs. $3.35 for GraphRAG and $4.14 for HyperGraphRAG), while generating semantically rich graphs with 120,499 nodes and 98,073 edges.

2. Multi-Turn Agentic Retrieval Process

Graph-R1 models retrieval as a multi-turn interaction loop (“think-retrieve-rethink-generate”), allowing the agent to adaptively query and refine its knowledge path, unlike previous methods that rely on one-shot retrieval. A schematic sketch of this loop appears after the bullet below.

  • Dynamic Reasoning: The agent decides at each step whether to continue exploring or terminate with an answer. Entity-based and direct hyperedge retrieval are fused through reciprocal rank aggregation, improving the chances of retrieving the most relevant knowledge.
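
The following toy Python sketch captures the shape of the loop only; in Graph-R1 the think step is an RL-trained LLM policy and retrieval fuses entity-based and hyperedge-based lookups over the knowledge hypergraph:

def toy_think(question, context):
    # Stand-in policy: query once, then answer from whatever was retrieved.
    if not context:
        return {"action": "query", "query": question}
    return {"action": "answer", "text": f"Answer based on: {context[-1]}"}

def toy_retrieve(query):
    # Stand-in for fused entity/hyperedge retrieval over the hypergraph.
    return f"facts related to '{query}'"

def agentic_answer(question, think, retrieve, max_turns=4):
    context = []
    for _ in range(max_turns):
        step = think(question, context)          # think / rethink over retrieved evidence
        if step["action"] == "answer":           # terminate once the agent is confident
            return step["text"]
        context.append(retrieve(step["query"]))  # query the knowledge source again
    return think(question, context)["text"]      # force an answer at the turn limit

print(agentic_answer("Which country hosted the 1992 Winter Olympics?", toy_think, toy_retrieve))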

3. End-to-End Reinforcement Learning Optimization

Graph-R1 uses Group Relative Policy Optimization (GRPO) for end-to-end RL, integrating rewards for format adherence, relevance, and answer correctness. This unified reward guides agents to develop generalizable reasoning strategies tightly aligned with both the knowledge structure and output quality; a simplified reward sketch follows the bullet below.

  • Outcome-directed reward mechanism: Combines format rewards (structural coherence) and answer rewards (semantic accuracy) for effective optimization, only rewarding answers embedded in structurally valid reasoning trajectories.
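
A simplified Python version of that outcome-directed reward; the tag format, weights, and F1 scoring here are illustrative rather than the paper’s exact formulation:

import re

def format_reward(trajectory: str) -> float:
    # Reward structurally valid reasoning, e.g. <think>...</think> followed by <answer>...</answer>.
    has_think = bool(re.search(r"<think>.*?</think>", trajectory, re.S))
    has_answer = bool(re.search(r"<answer>.*?</answer>", trajectory, re.S))
    return 1.0 if (has_think and has_answer) else 0.0

def answer_reward(trajectory: str, gold: str) -> float:
    # Token-overlap F1 between the predicted answer span and the gold answer.
    m = re.search(r"<answer>(.*?)</answer>", trajectory, re.S)
    pred = set((m.group(1) if m else "").lower().split())
    gold_tokens = set(gold.lower().split())
    overlap = len(pred & gold_tokens)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(gold_tokens)
    return 2 * p * r / (p + r)

def total_reward(trajectory: str, gold: str) -> float:
    # Answers only count when they sit inside a structurally valid trajectory.
    return format_reward(trajectory) * (0.1 + 0.9 * answer_reward(trajectory, gold))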

Key Findings

Benchmarking on RAG QA Tasks

Graph-R1 was evaluated across six standard QA datasets (2WikiMultiHopQA, HotpotQA, Musique, Natural Questions, PopQA, TriviaQA).

| Method | Avg. F1 (Qwen2.5-7B) |
| --- | --- |
| NaiveGeneration | 13.87 |
| StandardRAG | 15.89 |
| GraphRAG | 24.87 |
| HyperGraphRAG | 29.40 |
| Search-R1 | 46.19 |
| R1-Searcher | 42.29 |
| Graph-R1 | 57.82 |
  • Graph-R1 achieves up to 57.82 average F1 with Qwen2.5-7B, surpassing all previous baselines by a wide margin. Larger base models amplify its performance gains.

Ablation Analysis

Component ablation demonstrates that removing hypergraph construction, multi-turn reasoning, or RL optimization dramatically reduces performance, validating the necessity of each module within Graph-R1.

Retrieval and Efficiency

  • Graph-R1’s retrieval is more concise and effective: it achieves high F1 scores with moderate average content lengths (~1,200–1,500 tokens per exchange) and supports more interaction turns (2.3–2.5 on average), facilitating stable and accurate knowledge extraction.
  • Generation cost is minimal: despite the richer representation, Graph-R1’s response time per query (7.0s) and per-query generation cost (reported as $0) outperform graph-based competitors such as HyperGraphRAG (9.6s, $8.76).

Generation Quality

Graph-R1’s generation quality is evaluated across seven dimensions—comprehensiveness, knowledgeability, correctness, relevance, diversity, logical coherence, factuality—and consistently outperforms all RL-based and graph-based baselines, achieving top scores in correctness (86.9), relevance (95.2), and coherence (88.5).

Generalizability

Cross-validation on out-of-distribution (O.O.D.) settings reveals that Graph-R1 maintains robust performance across datasets, with O.O.D./I.I.D. ratios often above 85%, demonstrating strong domain generalization properties.

Theoretical Guarantees

Graph-R1 is supported by information-theoretic analyses:

  • Graph-structured knowledge provides higher information density per retrieval and faster convergence to correct answers compared to chunk-based retrieval.
  • Multi-turn interaction enables the agent to achieve higher retrieval efficiency by dynamically focusing on high-impact graph regions.
  • End-to-end RL optimization bridges graph-structured evidence and language generation, reducing output entropy and error rates.

Algorithmic Workflow (High-Level)

  1. Knowledge Hypergraph Extraction: LLM extracts n-ary relations to build entity and hyperedge sets.
  2. Multi-turn Agentic Reasoning: The agent alternates between reflective thinking, querying, hypergraph retrieval (entity and hyperedge dual paths), and synthesis.
  3. GRPO Optimization: RL policy is updated using sampled trajectories and reward normalization, enforcing structure and answer correctness.

Conclusion

Graph-R1 demonstrates that integrating hypergraph-based knowledge representation, agentic multi-turn reasoning, and end-to-end RL delivers unprecedented gains in factual QA performance, retrieval efficiency, and generation quality, charting the path for next-generation agentic and knowledge-driven LLM systems.


FAQ 1: What is the key innovation of Graph-R1 compared to earlier GraphRAG and RAG systems?

Graph-R1 introduces an agentic framework where retrieval is modeled as a multi-turn interaction rather than a single one-shot process. Its main innovations are:

  • Hypergraph Knowledge Representation: Instead of simple entity-relation graphs or text chunks, Graph-R1 constructs a semantic hypergraph that enables more expressive, n-ary relationships between entities.
  • Multi-Turn Reasoning Loop: The agent operates in repeated cycles of “think–retrieve–rethink–generate” over the hypergraph, dynamically focusing queries rather than retrieving everything at once.
  • End-to-End Reinforcement Learning (RL): The agent is trained with a reward function that simultaneously optimizes for step-wise logical reasoning and final answer correctness, enabling tighter alignment between structured knowledge and natural language answers.

FAQ 2: How does Graph-R1’s retrieval and generation efficiency compare to previous methods?

Graph-R1 is significantly more efficient and effective in both retrieval and answer generation:

  • Lower Construction & Retrieval Cost: For building the knowledge hypergraph, Graph-R1 takes only 5.69 seconds and costs $2.81 per 1,000 tokens (on the 2Wiki dataset), outperforming similar graph-based methods.
  • Faster and Cheaper Generation: Query response times (average 7 seconds per query) and generation costs ($0 per query) are better than prior graph-RAG systems, such as HyperGraphRAG.
  • Conciseness & Robustness: Graph-R1 answers are both more concise (usually 1,200–1,500 tokens) and more accurate due to the multi-turn interaction, with state-of-the-art F1 scores across six QA datasets.

FAQ 3: In which scenarios or domains is the Graph-R1 framework most applicable?

Graph-R1 is ideal for complex knowledge-intensive applications demanding both factual accuracy and reasoning transparency, such as:

  • Healthcare and Medical AI: Where multi-hop reasoning, traceability, and reliability are essential.
  • Legal and Regulatory Domains: That require precise grounded answers and interpretable multi-step reasoning.
  • Enterprise Knowledge Automation: For tasks needing scalable, dynamic querying and retrieval across large document or data corpora.
    The model’s architecture also allows for easy adaptation to other fields that benefit from agentic, multi-turn knowledge search anchored in structured representations.

Check out the Paper and GitHub Page here.

The post Graph-R1: An Agentic GraphRAG Framework for Structured, Multi-Turn Reasoning with Reinforcement Learning appeared first on MarkTechPost.
