How to Build a Conversational Research AI Agent with LangGraph: Step Replay and Time-Travel Checkpoints

 

In this tutorial, we explore how LangGraph manages conversation flows in a structured way while giving us the power to “time travel” through checkpoints. By building a chatbot that integrates a free Gemini model and a Wikipedia tool, we add multiple steps to a dialogue, record each checkpoint, replay the full state history, and even resume from a past state. This hands-on approach shows, in real time, how LangGraph’s design lets us track and manipulate conversation progression with clarity and control.

!pip -q install -U langgraph langchain langchain-google-genai google-generativeai typing_extensions
!pip -q install "requests==2.32.4"


import os
import json
import textwrap
import getpass
import time
from typing import Annotated, List, Dict, Any, Optional


from typing_extensions import TypedDict


from langchain.chat_models import init_chat_model
from langchain_core.messages import BaseMessage
from langchain_core.tools import tool


from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.checkpoint.memory import InMemorySaver
from langgraph.prebuilt import ToolNode, tools_condition


import requests
from requests.adapters import HTTPAdapter, Retry


if not os.environ.get("GOOGLE_API_KEY"):
   os.environ["GOOGLE_API_KEY"] = getpass.getpass("🔑 Enter your Google API Key (Gemini): ")


llm = init_chat_model("google_genai:gemini-2.0-flash")

We start by installing the required libraries, setting up our Gemini API key, and importing all the necessary modules. We then initialize the Gemini model using LangChain so that we can use it as the core LLM in our LangGraph workflow.

WIKI_SEARCH_URL = "https://en.wikipedia.org/w/api.php"


_session = requests.Session()
_session.headers.update({
   "User-Agent": "LangGraph-Colab-Demo/1.0 (contact: example@example.com)",
   "Accept": "application/json",
})
retry = Retry(
   total=5, connect=5, read=5, backoff_factor=0.5,
   status_forcelist=(429, 500, 502, 503, 504),
   allowed_methods=("GET", "POST")
)
_session.mount("https://", HTTPAdapter(max_retries=retry))
_session.mount("http://", HTTPAdapter(max_retries=retry))


def _wiki_search_raw(query: str, limit: int = 3) -> List[Dict[str, str]]:
   """
   Use MediaWiki search API with:
     - origin='*' (good practice for CORS)
     - Polite UA + retries
   Returns compact list of {title, snippet_html, url}.
   """
   params = {
       "action": "query",
       "list": "search",
       "format": "json",
       "srsearch": query,
       "srlimit": limit,
       "srprop": "snippet",
       "utf8": 1,
       "origin": "*",
   }
   r = _session.get(WIKI_SEARCH_URL, params=params, timeout=15)
   r.raise_for_status()
   data = r.json()
   out = []
   for item in data.get("query", {}).get("search", []):
       title = item.get("title", "")
       page_url = f"https://en.wikipedia.org/wiki/{title.replace(' ', '_')}"
       snippet = item.get("snippet", "")
       out.append({"title": title, "snippet_html": snippet, "url": page_url})
   return out


@tool
def wiki_search(query: str) -> List[Dict[str, str]]:
   """Search Wikipedia and return up to 3 results with title, snippet_html, and url."""
   try:
       results = _wiki_search_raw(query, limit=3)
       return results if results else [{"title": "No results", "snippet_html": "", "url": ""}]
   except Exception as e:
       return [{"title": "Error", "snippet_html": str(e), "url": ""}]


TOOLS = [wiki_search]

We set up a Wikipedia search tool with a custom session, retries, and a polite user-agent. We define _wiki_search_raw to query the MediaWiki API and then wrap it as a LangChain tool, allowing us to seamlessly call it within our LangGraph workflow.

class State(TypedDict):
   messages: Annotated[list, add_messages]


graph_builder = StateGraph(State)


llm_with_tools = llm.bind_tools(TOOLS)


SYSTEM_INSTRUCTIONS = textwrap.dedent("""
You are ResearchBuddy, a careful research assistant.
- If the user asks you to "research", "find info", "latest", "web", or references a library/framework/product,
 you SHOULD call the `wiki_search` tool at least once before finalizing your answer.
- When you call tools, be concise in the text you produce around the call.
- After receiving tool results, cite at least the page titles you used in your summary.
""").strip()


def chatbot(state: State) -> Dict[str, Any]:
   """Single step: call the LLM (with tools bound) on the current messages."""
   return {"messages": [llm_with_tools.invoke(state["msgs"])]}


graph_builder.add_node("chatbot", chatbot)

# Route tool calls from the chatbot to a ToolNode, then back to the chatbot.
tool_node = ToolNode(tools=TOOLS)
graph_builder.add_node("tools", tool_node)
graph_builder.add_conditional_edges("chatbot", tools_condition)
graph_builder.add_edge("tools", "chatbot")
graph_builder.add_edge(START, "chatbot")
memory = InMemorySaver()
graph = graph_builder.compile(checkpointer=memory)

We define our graph state to store the running message thread and bind our Gemini model to the wiki_search tool so it can call the tool when needed. We add a chatbot node and a tools node, wire them with conditional edges, and enable checkpointing with an in-memory saver. We now compile the graph so we can add steps, replay history, and resume from any checkpoint.

def print_last_message(event: Dict[str, Any]):
   """Pretty-print the last message in an event if available."""
   if "messages" in event and event["messages"]:
       msg = event["messages"][-1]
       try:
           if isinstance(msg, BaseMessage):
               msg.pretty_print()
           else:
               role = msg.get("role", "unknown")
               content = msg.get("content", "")
               print(f"\n[{role.upper()}]\n{content}\n")
       except Exception:
           print(str(msg))


def show_state_history(cfg: Dict[str, Any]) -> List[Any]:
   """Print a concise view of checkpoints; return the list as well."""
   history = list(graph.get_state_history(cfg))
   print("n=== 📜 State history (most recent first) ===")
   for i, st in enumerate(history):
       n = st.next
       n_txt = f"{n}" if n else "()"
       print(f"{i:02d}) NumMessages={len(st.values.get('messages', []))}  Next={n_txt}")
   print("=== End history ===n")
   return history


def pick_checkpoint_by_next(history: List[Any], node_name: str = "tools") -> Optional[Any]:
   """Pick the first checkpoint whose `next` includes a given node (e.g., 'tools')."""
   for st in history:
       nxt = tuple(st.next) if st.next else tuple()
       if node_name in nxt:
           return st
   return None

We add utility functions to make our LangGraph workflow easier to inspect and control. We use print_last_message to neatly display the most recent response, show_state_history to list all saved checkpoints, and pick_checkpoint_by_next to locate a checkpoint where the graph is about to run a specific node, such as the tools step.

config = {"configurable": {"thread_id": "demo-thread-1"}}


first_turn = {
   "messages": [
       {"role": "system", "content": SYSTEM_INSTRUCTIONS},
       {"role": "user", "content": "I'm learning LangGraph. Could you do some research on it for me?"},
   ]
}


print("n==================== 🟢 STEP 1: First user turn ====================")
events = graph.stream(first_turn, config, stream_mode="values")
for ev in events:
   print_last_message(ev)


second_turn = {
   "messages": [
       {"role": "user", "content": "Ya. Maybe I'll build an agent with it!"}
   ]
}


print("n==================== 🟢 STEP 2: Second user turn ====================")
events = graph.stream(second_turn, config, stream_mode="values")
for ev in events:
   print_last_message(ev)

We simulate two user interactions in the same thread by streaming events through the graph. We first provide system instructions and ask the assistant to research LangGraph, then follow up with a second user message about building an agent with it. Each step is checkpointed, allowing us to replay or resume from these states later.

print("n==================== 🔁 REPLAY: Full state history ====================")
history = show_state_history(config)


to_replay = pick_checkpoint_by_next(history, node_name="tools")
if to_replay is None:
   to_replay = history[min(2, len(history) - 1)]


print("Chosen checkpoint to resume from:")
print("  Next:", to_replay.next)
print("  Config:", to_replay.config)


print("n==================== ⏪ RESUME from chosen checkpoint ====================")
for ev in graph.stream(None, to_replay.config, stream_mode="vals"):
   print_last_message(ev)


MANUAL_INDEX = None  # e.g., set to a checkpoint index printed by show_state_history to resume from it manually
if MANUAL_INDEX is not None and 0 <= MANUAL_INDEX < len(history):
   chosen = history[MANUAL_INDEX]
   print(f"n==================== 🧭 MANUAL RESUME @ index {MANUAL_INDEX} ====================")
   print("Next:", chosen.next)
   print("Config:", chosen.config)
   for ev in graph.stream(None, chosen.config, stream_mode="values"):
       print_last_message(ev)


print("n✅ Done. You added steps, replayed history, and resumed from a prior checkpoint.")

We replay the full checkpoint history to see how our conversation evolves across steps and identify a useful point to resume. We then “time travel” by restarting from a selected checkpoint, and optionally from any manual index, so we continue the dialogue exactly from that saved state.
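
As a further illustration, here is a minimal, hypothetical sketch (not part of the notebook above) of how we could fork the conversation from the chosen checkpoint instead of merely resuming it: streaming a new user message with that checkpoint's config makes LangGraph branch the thread's history from that point.

# Hypothetical fork: supply NEW input together with a past checkpoint's config.
fork_input = {
   "messages": [
       {"role": "user", "content": "Actually, compare LangGraph with a plain LangChain agent loop."}
   ]
}

# `to_replay.config` carries both the thread_id and the checkpoint_id to branch from.
for ev in graph.stream(fork_input, to_replay.config, stream_mode="values"):
   print_last_message(ev)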

In conclusion, we have gained a clearer picture of how LangGraph’s checkpointing and time-travel capabilities bring flexibility and transparency to conversation management. By stepping through multiple user turns, replaying state history, and resuming from earlier points, we can experience firsthand the power of this framework in building reliable research agents or autonomous assistants. We recognize that this workflow is not just a demo, but a foundation that we can extend into more complex applications, where reproducibility and traceability are as important as the answers themselves.




Chunking vs. Tokenization: Key Differences in AI Text Processing

 


Introduction

When you’re working with AI and natural language processing, you’ll quickly encounter two fundamental concepts that often get confused: tokenization and chunking. While both involve breaking down text into smaller pieces, they serve completely different purposes and work at different scales. If you’re building AI applications, understanding these differences isn’t just academic—it’s crucial for creating systems that actually work well.

Think of it this way: if you’re making a sandwich, tokenization is like cutting your ingredients into bite-sized pieces, while chunking is like organizing those pieces into logical groups that make sense to eat together. Both are necessary, but they solve different problems.

Source: marktechpost.com

What is Tokenization?

Tokenization is the process of breaking text into the smallest meaningful units that AI models can understand. These units, called tokens, are the basic building blocks that language models work with. You can think of tokens as the “words” in an AI’s vocabulary, though they’re often smaller than actual words.

There are several ways to create tokens:

Word-level tokenization splits text at spaces and punctuation. It’s straightforward but creates problems with rare words that the model has never seen before.

Subword tokenization is more sophisticated and widely used today. Methods like Byte Pair Encoding (BPE), WordPiece, and SentencePiece break words into smaller chunks based on how frequently character combinations appear in training data. This approach handles new or rare words much better.

Character-level tokenization treats each letter as a token. It’s simple but creates very long sequences that are harder for models to process efficiently.

Here’s a practical example:

  • Original text: “AI models process text efficiently.”
  • Word tokens: [“AI”, “models”, “process”, “text”, “efficiently”]
  • Subword tokens: [“AI”, “model”, “s”, “process”, “text”, “efficient”, “ly”]

Notice how subword tokenization splits “models” into “model” and “s” because this pattern appears frequently in training data. This helps the model understand related words like “modeling” or “modeled” even if it hasn’t seen them before.
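
If you want to see this on real text, here is a small, hedged example using the Hugging Face transformers tokenizer for GPT-2 (any BPE-style tokenizer would work); the exact token boundaries depend on the tokenizer's learned vocabulary, so treat the printed pieces as illustrative rather than canonical.

from transformers import AutoTokenizer

# GPT-2 ships with a byte-level BPE tokenizer; other models use WordPiece or SentencePiece.
tok = AutoTokenizer.from_pretrained("gpt2")

text = "AI models process text efficiently."
ids = tok.encode(text)

print(tok.convert_ids_to_tokens(ids))  # subword pieces (a leading 'Ġ' marks a preceding space)
print(len(ids))                        # token count: what context limits and API pricing are measured in
print(tok.decode(ids))                 # decoding round-trips back to the original text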

What is Chunking?

Chunking takes a completely different approach. Instead of breaking text into tiny pieces, it groups text into larger, coherent segments that preserve meaning and context. When you’re building applications like chatbots or search systems, you need these larger chunks to maintain the flow of ideas.

Think about reading a research paper. You wouldn’t want each sentence scattered randomly—you’d want related sentences grouped together so the ideas make sense. That’s exactly what chunking does for AI systems.

Here’s how it works in practice:

  • Original text: “AI models process text efficiently. They rely on tokens to capture meaning and context. Chunking allows better retrieval.”
  • Chunk 1: “AI models process text efficiently.”
  • Chunk 2: “They rely on tokens to capture meaning and context.”
  • Chunk 3: “Chunking allows better retrieval.”

Modern chunking strategies have become quite sophisticated:

Fixed-length chunking creates chunks of a specific size (like 500 words or 1000 characters). It’s predictable but sometimes breaks up related ideas awkwardly.

Semantic chunking is smarter—it looks for natural breakpoints where topics change, using AI to understand when ideas shift from one concept to another.

Recursive chunking works hierarchically, first trying to split at paragraph breaks, then sentences, then smaller units if needed.

Sliding window chunking creates overlapping chunks to ensure important context isn’t lost at boundaries.
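
As a rough sketch (not from the original article), the two simplest strategies above can be implemented in a few lines; the chunk size and overlap values below are arbitrary placeholders you would tune for your own documents.

def fixed_length_chunks(text: str, size: int = 1000) -> list:
   """Fixed-size character chunks; predictable, but may split related ideas."""
   return [text[i:i + size] for i in range(0, len(text), size)]

def sliding_window_chunks(text: str, size: int = 1000, overlap: int = 200) -> list:
   """Overlapping chunks, so context at chunk boundaries is not lost."""
   step = size - overlap
   return [text[i:i + size] for i in range(0, len(text), step)]

doc = "AI models process text efficiently. They rely on tokens to capture meaning and context. " * 40
chunks = sliding_window_chunks(doc)
print(len(chunks), "chunks,", len(chunks[0]), "characters in the first one")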

The Key Differences That Matter

Understanding when to use each approach makes all the difference in your AI applications:

  • Size: tokenization works with tiny pieces (words, parts of words); chunking works with bigger pieces (sentences, paragraphs).
  • Goal: tokenization makes text digestible for AI models; chunking keeps meaning intact for humans and AI.
  • When you use it: tokenization for training models and processing input; chunking for search systems and question answering.
  • What you optimize for: tokenization for processing speed and vocabulary size; chunking for context preservation and retrieval accuracy.

Why This Matters for Real Applications

For AI Model Performance

When you’re working with language models, tokenization directly affects how much you pay and how fast your system runs. Models like GPT-4 charge by the token, so efficient tokenization saves money. Current models have different limits:

  • GPT-4: Around 128,000 tokens
  • Claude 3.5: Up to 200,000 tokens
  • Gemini 2.0 Pro: Up to 2 million tokens

Recent research shows that larger models actually work better with bigger vocabularies. For example, while LLaMA-2 70B uses about 32,000 different tokens, it would probably perform better with around 216,000. This matters because the right vocabulary size affects both performance and efficiency.

For Search and Question-Answering Systems

Chunking strategy can make or break your RAG (Retrieval-Augmented Generation) system. If your chunks are too small, you lose context. Too big, and you overwhelm the model with irrelevant information. Get it right, and your system provides accurate, helpful answers. Get it wrong, and you get hallucinations and poor results.

Companies building enterprise AI systems have found that smart chunking strategies significantly reduce those frustrating cases where AI makes up facts or gives nonsensical answers.

Where You’ll Use Each Approach

Tokenization is Essential For:

Training new models – You can’t train a language model without first tokenizing your training data. The tokenization strategy affects everything about how well the model learns.

Fine-tuning existing models – When you adapt a pre-trained model for your specific domain (like medical or legal text), you need to carefully consider whether the existing tokenization works for your specialized vocabulary.

Cross-language applications – Subword tokenization is particularly helpful when working with languages that have complex word structures or when building multilingual systems.

Chunking is Critical For:

Building company knowledge bases – When you want employees to ask questions and get accurate answers from your internal documents, proper chunking ensures the AI retrieves relevant, complete information.

Document analysis at scale – Whether you’re processing legal contracts, research papers, or customer feedback, chunking helps maintain document structure and meaning.

Search systems – Modern search goes beyond keyword matching. Semantic chunking helps systems understand what users really want and retrieve the most relevant information.

Current Best Practices (What Actually Works)

After watching many real-world implementations, here’s what tends to work:

For Chunking:

  • Start with 512-1024 token chunks for most applications (see the sketch after this list)
  • Add 10-20% overlap between chunks to preserve context
  • Use semantic boundaries when possible (end of sentences, paragraphs)
  • Test with your actual use cases and adjust based on results
  • Monitor for hallucinations and tweak your approach accordingly
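
For instance, a minimal token-count-based chunker along the lines of the first two recommendations might look like the sketch below; the tokenizer choice, sizes, and overlap are assumptions you would adapt to your own model and data.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in; use the tokenizer that matches your model

def token_chunks(text: str, max_tokens: int = 512, overlap: int = 64) -> list:
   """Chunk by token count with overlap, then decode each chunk back to text."""
   ids = tok.encode(text)
   step = max_tokens - overlap
   return [tok.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), step)]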

For Tokenization:

  • Use established methods (BPE, WordPiece, SentencePiece) rather than building your own
  • Consider your domain—medical or legal text might need specialized approaches
  • Monitor out-of-vocabulary rates in production
  • Balance between compression (fewer tokens) and meaning preservation

Summary

Tokenization and chunking aren’t competing techniques—they’re complementary tools that solve different problems. Tokenization makes text digestible for AI models, while chunking preserves meaning for practical applications.

As AI systems become more sophisticated, both techniques continue evolving. Context windows are getting larger, vocabularies are becoming more efficient, and chunking strategies are getting smarter about preserving semantic meaning.

The key is understanding what you’re trying to accomplish. Building a chatbot? Focus on chunking strategies that preserve conversational context. Training a model? Optimize your tokenization for efficiency and coverage. Building an enterprise search system? You’ll need both—smart tokenization for efficiency and intelligent chunking for accuracy.



A Coding Guide to Building a Brain-Inspired Hierarchical Reasoning AI Agent with Hugging Face Models

 

In this tutorial, we set out to recreate the spirit of the Hierarchical Reasoning Model (HRM) using a free Hugging Face model that runs locally. We walk through the design of a lightweight yet structured reasoning agent, where we act as both architects and experimenters. By breaking problems into subgoals, solving them with Python, critiquing the outcomes, and synthesizing a final answer, we can experience how hierarchical planning and execution can enhance reasoning performance. This process enables us to see, in real-time, how a brain-inspired workflow can be implemented without requiring massive model sizes or expensive APIs.

!pip -q install -U transformers accelerate bitsandbytes rich


import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

We begin by installing the required libraries and loading the Qwen2.5-1.5B-Instruct model from Hugging Face. We set the data type based on GPU availability to ensure efficient model execution in Colab.

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   device_map="auto",
   torch_dtype=DTYPE,
   load_in_4bit=torch.cuda.is_available()  # 4-bit quantization requires a CUDA GPU with bitsandbytes
)
gen = pipeline(
   "text-generation",
   model=model,
   tokenizer=tok,
   return_full_text=False
)

We load the tokenizer and model, configure it to run in 4-bit for efficiency, and wrap everything in a text-generation pipeline so we can interact with the model easily in Colab.

def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
   msgs = []
   if system:
       msgs.append({"role":"system","content":system})
   msgs.append({"role":"user","content":prompt})
   inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
   out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature>0), temperature=temperature, top_p=0.9)
   return out[0]["generated_text"].strip()


def extract_json(txt: str) -> Dict[str, Any]:
   m = re.search(r"\{[\s\S]*\}$", txt.strip())
   if not m:
       m = re.search(r"\{[\s\S]*?\}", txt)
   try:
       return json.loads(m.group(0)) if m else {}
   except Exception:
       # fallback: strip code fences
       s = re.sub(r"^```.*?\n|\n```$", "", txt, flags=re.S)
       try:
           return json.loads(s)
       except Exception:
           return {}

We define helper functions: the chat function allows us to send prompts to the model with optional system instructions and sampling controls, while extract_json helps us parse structured JSON outputs from the model reliably, even if the response includes code fences or additional text.

def extract_code(txt: str) -> str:
   m = re.search(r"```(?:python)?s*([sS]*?)```", txt, flags=re.I)
   return (m.group(1) if m else txt).strip()


def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
   import io, contextlib
   g = {"__name__": "__main__"}; l = {}
   if env: g.update(env)
   buf = io.StringIO()
   try:
       with contextlib.redirect_stdout(buf):
           exec(code, g, l)
       out = l.get("RESULT", g.get("RESULT"))
       return {"ok": True, "result": out, "stdout": buf.getvalue()}
   except Exception as e:
       return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}


PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2–4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":"<one-line answer format>"}."""


SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""


CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...", "fix_hint":"<if revise>"}."""


SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""

We add two important pieces: utility functions and system prompts. The extract_code function pulls Python snippets from the model’s output, while run_python safely executes those snippets and captures their results. Alongside, we define four role prompts, Planner, Solver, Critic, and Synthesizer, which guide the model to break tasks into subgoals, solve them with code, verify correctness, and finally produce a clean answer.

def plan(task: str) -> Dict[str, Any]:
   p = f"TASK:n{task}nReturn JSON only."
   return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))


def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
   prompt = f"SUBGOAL:n{subgoal}nCONTEXT vars: {list(context.keys())}nReturn Python code only."
   code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
   res = run_python(code, env=context)
   return {"subgoal": subgoal, "code": code, "run": res}


def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(pl, ensure_ascii=False, indent=2)+"nReturn JSON only.",
              CRITIC_SYS, temperature=0.1, max_new_tokens=250)
   return extract_json(out)


def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(logs, ensure_ascii=False)+"nReturn JSON only.",
              sys, temperature=0.2, max_new_tokens=250)
   j = extract_json(out)
   return j if j.get("subgoals") else {}


def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
   packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
   return chat("TASK:n"+task+"nLOGS:n"+json.dumps(packed, ensure_ascii=False)+
               f"nfinal_format: {final_format}nOnly the final answer.",
               SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()


def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
   ctx = dict(context or {})
   trace, plan_json = [], plan(task)
   for round_id in range(1, budget+1):
       logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
       for L in logs:
           ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal']))%9999}"
           ctx[ctx_key] = L["run"].get("result")
       verdict = critic(task, logs)
       trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
       if verdict.get("action") == "submit": break
       plan_json = refine(task, logs) or plan_json
   final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: <value>"))
   return {"final": final, "trace": trace}

We implement the full HRM loop: we plan subgoals, solve each by generating and running Python (capturing RESULT), then we critique, optionally refine the plan, and synthesize a clean final answer. We orchestrate these rounds in hrm_agent, carrying forward intermediate results as context so we iteratively improve and stop once the critic says “submit.”

ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: <grid>", where <grid> is a Python list of lists of ints.
""").strip()
ARC_DATA = {
   "train": [
       {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
       {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
   ],
   "test": [[0,0],[0,1]]
}
res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("n[bold]Demo 1 — ARC-like Toy[/bold]")
rprint(res1["final"])


WM_TASK = "A tank holds 1200 L. It leaks 2% per hour for 3 hours, then is refilled by 150 L. Return exactly: 'Answer: <liters>'."
res2 = hrm_agent(WM_TASK, context={}, budget=2)
rprint("n[bold]Demo 2 — Word Math[/bold]")
rprint(res2["final"])


rprint("n[dim]Rounds executed (Demo 1):[/dim]", len(res1["trace"]))

We run two demos to validate the agent: an ARC-style task where we infer a transformation from train pairs and apply it to a test grid, and a word-math problem that checks numeric reasoning. We call hrm_agent with each task, print the final answers, and also display the number of reasoning rounds the ARC run takes.

In conclusion, we recognize that what we have built is more than a simple demonstration; it is a window into how hierarchical reasoning can make smaller models punch above their weight. By layering planning, solving, and critiquing, we empower a free Hugging Face model to perform tasks with surprising robustness. We leave with a deeper appreciation of how brain-inspired structures, when paired with practical, open-source tools, enable us to explore reasoning benchmarks and experiment creatively without incurring high costs. This hands-on journey shows us that advanced cognitive-like workflows are accessible to anyone willing to tinker, iterate, and learn.




Microsoft AI Introduces rStar2-Agent: A 14B Math Reasoning Model Trained with Agentic Reinforcement Learning to Achieve Frontier-Level Performance

 


The Problem with “Thinking Longer”

Large language models have made impressive strides in mathematical reasoning by extending their Chain-of-Thought (CoT) processes—essentially “thinking longer” through more detailed reasoning steps. However, this approach has fundamental limitations. When models encounter subtle errors in their reasoning chains, they often compound these mistakes rather than detecting and correcting them. Internal self-reflection frequently fails, especially when the initial reasoning approach is fundamentally flawed.

Microsoft’s new research report introduces rStar2-Agent, which takes a different approach: instead of just thinking longer, it teaches models to think smarter by actively using coding tools to verify, explore, and refine their reasoning process.

https://arxiv.org/abs/2508.20722

The Agentic Approach

rStar2-Agent represents a shift toward agentic reinforcement learning, where a 14B parameter model interacts with a Python execution environment throughout its reasoning process. Rather than relying solely on internal reflection, the model can write code, execute it, analyze the results, and adjust its approach based on concrete feedback.

This creates a dynamic problem-solving process. When the model encounters a complex mathematical problem, it might generate initial reasoning, write Python code to test hypotheses, analyze execution results, and iterate toward a solution. The approach mirrors how human mathematicians often work—using computational tools to verify intuitions and explore different solution paths.

Infrastructure Challenges and Solutions

Scaling agentic RL presents significant technical hurdles. During training, a single batch can generate tens of thousands of concurrent code execution requests, creating bottlenecks that can stall GPU utilization. The researchers addressed this with two key infrastructure innovations.

First, they built a distributed code execution service capable of handling 45,000 concurrent tool calls with sub-second latency. The system isolates code execution from the main training process while maintaining high throughput through careful load balancing across CPU workers.

Second, they developed a dynamic rollout scheduler that allocates computational work based on real-time GPU cache availability rather than static assignment. This prevents GPU idle time caused by uneven workload distribution—a common problem when some reasoning traces require significantly more computation than others.

These infrastructure improvements enabled the entire training process to complete in just one week using 64 AMD MI300X GPUs, demonstrating that frontier-level reasoning capabilities don’t require massive computational resources when efficiently orchestrated.

GRPO-RoC: Learning from High-Quality Examples

The core algorithmic innovation is Group Relative Policy Optimization with Resampling on Correct (GRPO-RoC). Traditional reinforcement learning in this context faces a quality problem: models receive positive rewards for correct final answers even when their reasoning process includes multiple code errors or inefficient tool usage.

GRPO-RoC addresses this by implementing an asymmetric sampling strategy. During training, the algorithm:

  • Oversamples initial rollouts to create a larger pool of reasoning traces
  • Preserves diversity in failed attempts to maintain learning from various error modes
  • Filters positive examples to emphasize traces with minimal tool errors and cleaner formatting

This approach ensures the model learns from high-quality successful reasoning while still being exposed to diverse failure patterns. The result is more efficient tool usage and shorter, more focused reasoning traces.
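
To make the asymmetric sampling idea concrete, here is a minimal, hypothetical sketch of the resample-on-correct step; the field names, keep sizes, and selection heuristics are illustrative assumptions, not the paper's actual implementation.

import random

def resample_on_correct(rollouts, keep=8):
   """Down-sample one problem's group of rollouts asymmetrically (illustrative only).

   Each rollout is assumed to be a dict like:
     {"correct": bool, "tool_errors": int, "trace": ...}
   """
   positives = [r for r in rollouts if r["correct"]]
   negatives = [r for r in rollouts if not r["correct"]]

   # Failures: keep a random subset to preserve diverse error modes.
   kept_neg = random.sample(negatives, min(keep, len(negatives)))

   # Successes: prefer traces with the fewest tool errors (cleaner reasoning).
   kept_pos = sorted(positives, key=lambda r: r["tool_errors"])[:keep]

   return kept_pos + kept_neg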

https://arxiv.org/abs/2508.20722

Training Strategy: From Simple to Complex

The training process unfolds in three carefully designed stages, starting with non-reasoning supervised fine-tuning that focuses purely on instruction following and tool formatting—deliberately avoiding complex reasoning examples that might create early biases.

Stage 1 constrains responses to 8,000 tokens, forcing the model to develop concise reasoning strategies. Despite this limitation, performance jumps dramatically—from near-zero to over 70% on challenging benchmarks.

Stage 2 extends the token limit to 12,000, allowing for more complex reasoning while maintaining the efficiency gains from the first stage.

Stage 3 shifts focus to the most difficult problems by filtering out those the model has already mastered, ensuring continued learning from challenging cases.

This progression from concise to extended reasoning, combined with increasing problem difficulty, maximizes learning efficiency while minimizing computational overhead.

Breakthrough Results

The results are striking. rStar2-Agent-14B achieves 80.6% accuracy on AIME24 and 69.8% on AIME25, surpassing much larger models including the 671B parameter DeepSeek-R1. Perhaps more importantly, it accomplishes this with significantly shorter reasoning traces—averaging around 10,000 tokens compared to over 17,000 for comparable models.

The efficiency gains extend beyond mathematics. Despite training exclusively on math problems, the model demonstrates strong transfer learning, outperforming specialized models on scientific reasoning benchmarks and maintaining competitive performance on general alignment tasks.

https://arxiv.org/abs/2508.20722

Understanding the Mechanisms

Analysis of the trained model reveals fascinating behavioral patterns. High-entropy tokens in reasoning traces fall into two categories: traditional “forking tokens” that trigger self-reflection and exploration, and a new category of “reflection tokens” that emerge specifically in response to tool feedback.

These reflection tokens represent a form of environment-driven reasoning where the model carefully analyzes code execution results, diagnoses errors, and adjusts its approach accordingly. This creates more sophisticated problem-solving behavior than pure CoT reasoning can achieve.

Summary

rStar2-Agent demonstrates that moderate-sized models can achieve frontier-level reasoning through sophisticated training rather than brute-force scaling. The approach suggests a more sustainable path toward advanced AI capabilities—one that emphasizes efficiency, tool integration, and smart training strategies over raw computational power.

The success of this agentic approach also points toward future AI systems that can seamlessly integrate multiple tools and environments, moving beyond static text generation toward dynamic, interactive problem-solving capabilities.



