Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation


Embodied AI agents that can perceive, think, and act in the real world mark a key step toward the future of robotics. A central challenge is building scalable, reliable robotic manipulation, the skill of deliberately interacting with and controlling objects through selective contact. While progress spans analytic methods, model-based approaches, and large-scale data-driven learning, most systems still operate in disjoint stages of data collection, training, and evaluation. These stages often require custom setups, manual curation, and task-specific tweaks, creating friction that slows progress, hides failure patterns, and hampers reproducibility. This highlights the need for a unified framework to streamline learning and assessment. 

Robotic manipulation research has progressed from analytical models to neural world models that learn dynamics directly from sensory inputs, using both pixel and latent spaces. Large-scale video generation models can produce realistic visuals but often lack action conditioning, long-term temporal consistency, and multi-view reasoning needed for control. Vision-language-action models follow instructions but are limited by imitation-based learning, preventing error recovery and planning. Policy evaluation remains challenging, as physics simulators require heavy tuning, and real-world testing is resource-intensive. Existing evaluation metrics often emphasize visual quality over task success, highlighting the need for benchmarks that better capture real-world manipulation performance. 

The Genie Envisioner (GE), developed by researchers from AgiBot Genie Team, NUS LV-Lab, and BUAA, is a unified platform for robotic manipulation that combines policy learning, simulation, and evaluation in a video-generative framework. Its core, GE-Base, is a large-scale, instruction-driven video diffusion model capturing spatial, temporal, and semantic dynamics of real-world tasks. GE-Act maps these representations to precise action trajectories, while GE-Sim offers fast, action-conditioned video-based simulation. The EWMBench benchmark evaluates visual realism, physical accuracy, and instruction-action alignment. Trained on over a million episodes, GE generalizes across robots and tasks, enabling scalable, memory-aware, and physically grounded embodied intelligence research. 

GE’s design unfolds in three key parts. GE-Base is a multi-view, instruction-conditioned video diffusion model trained on over 1 million robotic manipulation episodes. It learns latent trajectories that capture how scenes evolve under given commands. Building on that, GE-Act translates these latent video representations into real action signals via a lightweight, flow-matching decoder, offering quick, precise motor control even on robots not in the training data. GE-Sim repurposes GE-Base’s generative power into an action-conditioned neural simulator, enabling closed-loop, video-based rollout at speeds far beyond real hardware. The EWMBench suite then evaluates the system holistically across video realism, physical consistency, and alignment between instructions and resulting actions.

In evaluations, Genie Envisioner showed strong real-world and simulated performance across varied robotic manipulation tasks. GE-Act achieved rapid control generation (54-step trajectories in 200 ms) and consistently outperformed leading vision-language-action baselines in both step-wise and end-to-end success rates. It adapted to new robot types, like Agilex Cobot Magic and Dual Franka, with only an hour of task-specific data, excelling in complex deformable object tasks. GE-Sim delivered high-fidelity, action-conditioned video simulations for scalable, closed-loop policy testing. The EWMBench benchmark confirmed GE-Base’s superior temporal alignment, motion consistency, and scene stability over state-of-the-art video models, aligning closely with human quality judgments. 

In conclusion, Genie Envisioner is a unified, scalable platform for dual-arm robotic manipulation that merges policy learning, simulation, and evaluation into one video-generative framework. Its core, GE-Base, is an instruction-guided video diffusion model capturing the spatial, temporal, and semantic patterns of real-world robot interactions. GE-Act builds on this by converting these representations into precise, adaptable action plans, even on new robot types with minimal retraining. GE-Sim offers high-fidelity, action-conditioned simulation for closed-loop policy refinement, while EWMBench provides rigorous evaluation of realism, alignment, and consistency. Extensive real-world tests highlight the system’s superior performance, making it a strong foundation for general-purpose, instruction-driven embodied intelligence. 


Check out the Paper and GitHub Page.

The post Genie Envisioner: A Unified Video-Generative Platform for Scalable, Instruction-Driven Robotic Manipulation appeared first on MarkTechPost.

NuMind AI Releases NuMarkdown-8B-Thinking: A Reasoning Breakthrough in OCR and Document-to-Markdown Conversion


NuMind AI has officially released NuMarkdown-8B-Thinking, an open-source (MIT License) reasoning OCR Vision-Language Model (VLM) that redefines how complex documents are digitized and structured. Unlike traditional OCR systems, NuMarkdown-8B-Thinking doesn’t just extract text—it thinks about a document’s layout, structure, and formatting before generating a precise, ready-to-use Markdown file.

This makes it the first reasoning VLM purpose-built for converting PDFs, scanned documents, and spreadsheets into clean, structured Markdown—ideal for Retrieval-Augmented Generation (RAG) workflows, AI-powered knowledge bases, and large-scale document archiving.

How Is NuMarkdown-8B-Thinking Different?

The model introduces a reasoning-first approach to OCR. Instead of directly rendering extracted text, NuMarkdown-8B-Thinking generates “thinking tokens” — internal reasoning steps that help it understand document layouts before producing the final output.

This capability allows it to handle formats and structures that stump most conventional and even AI-powered OCR systems, including:

  • Multi-column layouts with complex reading orders
  • Tables with merged, nested, or irregular cells
  • Mixed visual elements (images, decorative headers, watermarks)
  • Historical or degraded scans where layout inference is crucial

The number of reasoning tokens varies with complexity—anywhere from 20% to 500% of the final Markdown length—showing how much the model “thinks” before it “writes.”

Training and Architecture

NuMarkdown-8B-Thinking is a fine-tuned version of Qwen 2.5-VL-7B from Alibaba—one of the strongest open-source multi-modal models available.

Its training pipeline involved two key phases:

  1. Supervised Fine-Tuning (SFT) on synthetic document samples where each example included:
    • Raw document input
    • Intermediate reasoning steps (layout parsing, structure inference)
    • Final Markdown representation
  2. Reinforcement Learning with GRPO, using a layout-centric reward that encouraged accurate reconstruction of document formatting and spatial relationships.

This two-stage process gave NuMarkdown-8B-Thinking the ability to maintain high accuracy even on challenging layouts that typically require human-level judgment.

Benchmark Results: Outperforming OCR Heavyweights

In independent evaluations and user testing, NuMarkdown-8B-Thinking demonstrates state-of-the-art reasoning for OCR-to-Markdown tasks:

  • Beats:
    • Generalist models like GPT-4o
    • Specialized OCR-focused models like OCRFlux
  • Competitive with:
    • Large closed-source reasoning models like Gemini 2.5
    • Just behind elite models like Gemini Flash Reasoning in blind, multi-model user rankings

Users particularly highlight its ability to:

  • Correctly infer reading order in non-linear layouts
  • Preserve intricate table formatting
  • Output clean, parsing-friendly Markdown for RAG ingestion without further post-processing

Example in Action

Imagine a scanned annual report page with:

  • Multi-level headings
  • Sidebars and multiple columns
  • A financial table with merged cells and uneven row spacing
  • A footer with legal disclaimers

NuMarkdown-8B-Thinking first produces reasoning tokens outlining the structure (“Column 1: Intro paragraph… Column 2: Continue paragraph… Footer text at bottom… Table spans two columns…”), then outputs Markdown that accurately reflects both content and layout.

This transparent reasoning layer makes the model’s decisions auditable—a major plus in enterprise, legal, and archival contexts.

Deployment Options

Whether you’re a researcher, developer, or enterprise AI engineer, NuMarkdown-8B-Thinking is ready to slot into your workflow:

  • Hugging Face: Available for direct testing and integration.
  • Local Execution: Model weights and quantized GGUF versions are published for CPU/GPU-friendly deployment (see the inference sketch after this list).
  • API-friendly: Compatible with OpenAI-style APIs and Hugging Face Transformers for rapid integration into pipelines.
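
For the local-execution route above, here is a minimal inference sketch. Treat it as a sketch only: the Hugging Face repo id (numind/NuMarkdown-8B-Thinking), the Auto classes, and the prompt format are assumptions based on standard Qwen2.5-VL usage in recent transformers releases, so check the model card for the exact identifiers and recommended settings.

# Minimal local-inference sketch; repo id, classes, and prompt format are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "numind/NuMarkdown-8B-Thinking"  # assumed repo id; confirm on Hugging Face
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

page = Image.open("scanned_page.png")  # any document page image
messages = [{"role": "user", "content": [{"type": "image"}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# The decoded text contains the reasoning ("thinking") tokens followed by the final Markdown.
print(processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))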

Its MIT License ensures full freedom for commercial, academic, or personal projects—no vendor lock-in or costly API gates.

Why This Matters

For industries that rely on accurate document digitization—finance, legal, healthcare, government archives—layout fidelity is as important as textual accuracy. Most OCR systems treat layout as an afterthought; NuMarkdown-8B-Thinking treats it as a reasoning problem.

By combining open-sourcing, layout reasoning, and RAG-optimized Markdown output, NuMarkdown-8B-Thinking offers a transparent, verifiable, and high-performance alternative to proprietary document AI solutions.


Check out the Model on Hugging Face and GitHub Page.

The post NuMind AI Releases NuMarkdown-8B-Thinking: A Reasoning Breakthrough in OCR and Document-to-Markdown Conversion appeared first on MarkTechPost.

How to Make Perfect Songs for Any Moment Using This New AI Music Generator


Music is often called the universal language of mankind. You can agree or disagree, but one thing is for sure: music has the ability to change our mood. You could be feeling sad, but then you hear your favourite song, and suddenly you feel better. Music does make a difference, and that is why it is everywhere: in ads, films, YouTube intros, and meditation apps. ElevenLabs*, best known for its realistic AI voice generation, has now introduced Eleven Music.

What is Eleven Music all about?

Eleven Music is an AI music generator that lets anyone create studio-quality songs for any moment, in any genre or style, with vocals or as an instrumental, in minutes using simple text prompts. It also lets you tweak parts of the song, like editing a chorus or verse, through prompts, and it supports multiple languages, including English, Spanish, German, and Japanese.

Crucially, Eleven Music comes with commercial clearance. That’s thanks to licensing deals with Merlin Network and Kobalt Music Group, so creators, from freelancers to indie filmmakers, can use the tracks in films, ads, games, social videos, podcasts, and more without legal worry. Plus, built‑in safeguards prevent the AI from mimicking known artists, using copyrighted lyrics, or generating hateful or violent content.

According to company docs, music is generated in MP3 format at studio quality (44.1 kHz), tracks range from 10 seconds to 5 minutes, and both playlists and APIs are rolling out soon.

Top 3 features of Eleven Music that make it a game-changer:

  • Text-to-Music: Being a prompt-based tool, you can generate a complete musical piece simply by describing it and letting the AI compose a track based on your input.
  • Vocal and Instrumental Tracks: This AI music generator by ElevenLabs can generate both purely instrumental music and tracks with AI-generated vocals in different languages like English, Spanish, and Japanese.
  • Fine Control: You aren’t just stuck with the first thing the AI creates. The platform allows for section-by-section editing. You can generate an intro, then a verse, and fine-tune each part to build a complete song with seamless transitions.

We often see creators facing licensing and copyright issues. In such cases, if an AI tool can create the perfect track for your work, why not give it a try?

How to make perfect songs using Eleven Music:

I eagerly wanted to test this AI music generator, being an ElevenLabs user, and show how you can create your first song using it.

Step 1: Visit ElevenLabs’ website*. Click on the platforms option at the top and then select the Music option.

  • Scroll down a bit and click on Get Started. You’ll need to sign up for ElevenLabs if you haven’t already.

Step 2: If you are new to ElevenLabs, you will be asked a few quick questions to optimize your experience.

Step 3: Once you are in, you can get started immediately with your 10,000 free credits. 


For my song, I went for the following setting:

  • 2 variants
  • 30 seconds
  • Prompt: A fun UK drill rap song to help increase work productivity

Step 4: Eleven Music will generate songs for you fast. As I generated two variants, I could choose from two. I personally like the chorus on the second one.





Step 5: You can edit the song using a text prompt. As I loved the chorus on the second variant, I wanted the verse to be similar to that to match the vibe.

  • I am happy I decided to edit the verse, and this time I liked the first verse of the first variant of the edited version.





If you are happy with the result, just like me, you can either download or share the Eleven Music-generated song.

In Conclusion:

ElevenLabs* was already a capable AI voice generator, but by adding this new AI music generation ability, it has only gotten better. I wouldn’t say it’s perfect, and it has to compete against Suno AI and Udio, which have been in the game for quite some time now and are very capable. I would say it is a solid AI music generator that makes studio-quality music production accessible for everyone. Eleven Music allows everyone to make perfect songs for any moment.



*Affiliate: We do make a small profit from the sales of this AI product through affiliate marketing.

AI Tools Club

Building a Secure and Memory-Enabled Cipher Workflow for AI Agents with Dynamic LLM Selection and API Integration


In this tutorial, we walk through building a compact but fully functional Cipher-based workflow. We start by securely capturing our Gemini API key in the Colab UI without exposing it in code. We then implement a dynamic LLM selection function that can automatically switch between OpenAI, Gemini, or Anthropic based on which API key is available. The setup phase ensures Node.js and the Cipher CLI are installed, after which we programmatically generate a cipher.yml configuration to enable a memory agent with long-term recall. We create helper functions to run Cipher commands directly from Python, store key project decisions as persistent memories, retrieve them on demand, and finally spin up Cipher in API mode for external integration. Check out the FULL CODES here.

import os, getpass
os.environ["GEMINI_API_KEY"] = getpass.getpass("Enter your Gemini API key: ").strip()


import subprocess, tempfile, pathlib, textwrap, time, requests, shlex


def choose_llm():
   if os.getenv("OPENAI_API_KEY"):
       return "openai", "gpt-4o-mini", "OPENAI_API_KEY"
   if os.getenv("GEMINI_API_KEY"):
       return "gemini", "gemini-2.5-flash", "GEMINI_API_KEY"
   if os.getenv("ANTHROPIC_API_KEY"):
       return "anthropic", "claude-3-5-haiku-20241022", "ANTHROPIC_API_KEY"
   raise RuntimeError("Set one API key before running.")

We start by securely entering our Gemini API key using getpass so it stays hidden in the Colab UI. We then define a choose_llm() function that checks our environment variables and automatically selects the appropriate LLM provider, model, and key based on what is available. Check out the FULL CODES here.

def run(cmd, check=True, env=None):
   print("▸", cmd)
   p = subprocess.run(cmd, shell=True, text=True, capture_output=True, env=env)
   if p.stdout: print(p.stdout)
   if p.stderr: print(p.stderr)
   if check and p.returncode != 0:
       raise RuntimeError(f"Command failed: {cmd}")
   return p

We create a run() helper function that executes shell commands, prints both stdout and stderr for visibility, and raises an error if the command fails when check is enabled, making our workflow execution more transparent and reliable. Check out the FULL CODES here.

def ensure_node_and_cipher():
   run("sudo apt-get update -y && sudo apt-get install -y nodejs npm", check=False)
   run("npm install -g @byterover/cipher")

We define ensure_node_and_cipher() to install Node.js, npm, and the Cipher CLI globally, ensuring our environment has all the necessary dependencies before running any Cipher-related commands. Check out the FULL CODES here.

def write_cipher_yml(workdir, provider, model, key_env):
   cfg = """
llm:
 provider: {provider}
 model: {model}
 apiKey: ${key_env}
systemPrompt:
 enabled: true
 content: |
   You are an AI programming assistant with long-term memory of prior decisions.
embedding:
 disabled: true
mcpServers:
 filesystem:
   type: stdio
   command: npx
   args: ['-y','@modelcontextprotocol/server-filesystem','.']
""".format(provider=provider, model=model, key_env=key_env)


   (workdir / "memAgent").mkdir(parents=True, exist_ok=True)
   (workdir / "memAgent" / "cipher.yml").write_text(cfg.strip() + "\n")

We implement write_cipher_yml() to generate a cipher.yml configuration file inside a memAgent folder, setting the chosen LLM provider, model, and API key, enabling a system prompt with long-term memory, and registering a filesystem MCP server for file operations. Check out the FULL CODES here.

def cipher_once(text, env=None, cwd=None):
   cmd = f'cipher {shlex.quote(text)}'
   p = subprocess.run(cmd, shell=True, text=True, capture_output=True, env=env, cwd=cwd)
   print("Cipher says:\n", p.stdout or p.stderr)
   return p.stdout.strip() or p.stderr.strip()

We define cipher_once() to run a single Cipher CLI command with the provided text, capture and display its output, and return the response, allowing us to interact with Cipher programmatically from Python. Check out the FULL CODES here.

def start_api(env, cwd):
   proc = subprocess.Popen("cipher --mode api", shell=True, env=env, cwd=cwd,
                           stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for _ in range(30):
       try:
           r = requests.get("http://127.0.0.1:3000/health", timeout=2)
           if r.ok:
               print("API /health:", r.text)
               break
       except: pass
       time.sleep(1)
   return proc

We create start_api() to launch Cipher in API mode as a subprocess, then repeatedly poll its /health endpoint until it responds, ensuring the API server is ready before proceeding. Check out the FULL CODES here.

def main():
   provider, model, key_env = choose_llm()
   ensure_node_and_cipher()
   workdir = pathlib.Path(tempfile.mkdtemp(prefix="cipher_demo_"))
   write_cipher_yml(workdir, provider, model, key_env)
   env = os.environ.copy()


   cipher_once("Store decision: use pydantic for config validation; pytest fixtures for testing.", env, str(workdir))
   cipher_once("Remember: follow conventional commits; enforce black + isort in CI.", env, str(workdir))


   cipher_once("What did we standardize for config validation and Python formatting?", env, str(workdir))


   api_proc = start_api(env, str(workdir))
   time.sleep(3)
   api_proc.terminate()


if __name__ == "__main__":
   main()

In main(), we select the LLM provider, install dependencies, and create a temporary working directory with a cipher.yml configuration. We then store key project decisions in Cipher’s memory, query them back, and finally start the Cipher API server briefly before shutting it down, demonstrating both CLI and API-based interactions.

In conclusion, we have a working Cipher environment that securely manages API keys, selects the right LLM provider automatically, and configures a memory-enabled agent entirely through Python automation. Our implementation includes decision logging, memory retrieval, and a live API endpoint, all orchestrated in a Notebook/Colab-friendly workflow. This makes the setup reusable for other AI-assisted development pipelines, allowing us to store and query project knowledge programmatically while keeping the environment lightweight and easy to redeploy.


Check out the FULL CODES here.

The post Building a Secure and Memory-Enabled Cipher Workflow for AI Agents with Dynamic LLM Selection and API Integration appeared first on MarkTechPost.

NVIDIA AI Introduces End-to-End AI Stack, Cosmos Physical AI Models and New Omniverse Libraries for Advanced Robotics


Nvidia made major waves at SIGGRAPH 2025 by unveiling a suite of new Cosmos world models, robust simulation libraries, and cutting-edge infrastructure—all designed to accelerate the next era of physical AI for robotics, autonomous vehicles, and industrial applications. Let’s break down the technological details, what this means for developers, and why it matters to the future of embodied intelligence and simulation.

Cosmos World Foundation Models: Reasoning for Robots

Cosmos Reason: Vision-Language Model for Physical AI

At the heart of the announcement is Cosmos Reason, a 7-billion-parameter reasoning vision-language model. This AI is engineered for robots and embodied agents tackling real-world tasks:

  • Memory and Physics Awareness: Cosmos Reason incorporates advanced memory for spatial and temporal reasoning, plus an understanding of physical laws. This lets robots and AI agents actually “plan” step-by-step actions in complex environments—making it ideal for data curation, robot planning, and video analytics.
  • Planning Capability: The model feeds structured video and sensor data (like segmentation maps and LIDAR) into a reasoning engine that decides what moves an agent should take next. It supports both high-level instruction parsing and low-level action generation, mimicking human-like logic for navigation and manipulation.

Cosmos Transfer Models: Turbocharging Synthetic Data Generation

  • Cosmos Transfer-2: Accelerates generation of synthetic datasets from 3D simulation scenes or spatial control inputs, vastly reducing the time and cost to produce realistic robot training data. This is especially helpful for reinforcement learning and policy model validation—where edge cases, diverse lighting, and weather scenarios must be modeled at scale.
  • Distilled Transfer Variant: Optimized for speed, letting developers iterate fast on dataset creation.

Practical Impact

The Cosmos WFM family spans three categories (Nano, Super, Ultra), ranging from 4 billion to 14 billion parameters, and can be fine-tuned for varied latency, fidelity, and use cases from real-time streaming to photorealistic rendering.

Simulation and Rendering Libraries: Creating Virtual Worlds for Training

Nvidia’s Omniverse platform gets a major update, adding:

  • Neural Reconstruction Libraries: These tools allow developers to import sensor data and simulate the physical world in 3D with lifelike photorealism, powered by neural rendering techniques.
  • Integration with OpenUSD and CARLA Simulator: The addition of new conversion tools and rendering capabilities helps standardize complex simulation workflows, making it easier to interoperate between robotics frameworks (like Mujoco) and Nvidia’s USD-based pipeline.
  • SimReady Materials Library: Offers thousands of substrate materials for creating highly realistic virtual environments, boosting the fidelity of robotics training and simulation.

Isaac Sim 5.0.0: Nvidia’s simulation engine now includes enhanced actuator models, broader Python and ROS support, and new neural rendering for better synthetic data.

Infrastructure for Robotics Workflows

  • RTX Pro Blackwell Servers: Purpose-built for robotic development workloads, providing unified architecture for simulation, training, and inference tasks.
  • DGX Cloud: Enables cloud-based management and scaling of physical AI workflows, so teams can develop, train, and deploy AI agents remotely.

Industry Adoption and Open Innovation

Industry leaders—including Amazon Devices, Agility Robotics, Figure AI, Uber, Boston Dynamics, and more—are already piloting Cosmos models and Omniverse tools to generate training data, build digital twins, and accelerate the deployment of robotics in manufacturing, transportation, and logistics.

Cosmos models are broadly available through Nvidia’s API and developer catalogs, with a permissive license supporting both research and commercial usage.

A New Era for Physical AI

Nvidia’s vision is clear: physical AI is a full-stack challenge, demanding smarter models, richer simulation, and scalable infrastructure. With the Cosmos model suite, Omniverse libraries, and Blackwell-powered servers, Nvidia is closing the gap between virtual training and real-world deployment—reducing costly trial-and-error and unlocking new levels of autonomy for robots and intelligent agents.


Check out the technical article from NVIDIA blog.

The post NVIDIA AI Introduces End-to-End AI Stack, Cosmos Physical AI Models and New Omniverse Libraries for Advanced Robotics appeared first on MarkTechPost.

Case Studies: Real-World Applications of Context Engineering


Context engineering has become a transformative force in moving from experimental AI demos to robust, production-grade systems across various industries. Below are distilled examples and evidence of real-world impact:

1. Insurance: Five Sigma & Agentic Underwriting

  • Five Sigma Insurance achieved an 80% reduction in claim processing errors and a 25% increase in adjustor productivity by architecting AI systems that ingest policy data, claims history, and regulations simultaneously. The system leveraged advanced retrieval-augmented generation (RAG) and dynamic context assembly, enabling automation that previously wasn’t possible.
  • In insurance underwriting, tailored schema creation and SME-guided context templates ensured that agents handled diverse formats and business rules, reaching over 95% accuracy after deployment feedback cycles.

2. Financial Services: Block (Square) & Major Banks

  • Block (formerly Square) implemented Anthropic’s Model Context Protocol (MCP) to tie LLMs to live payment and merchant data, moving from static prompts to a dynamic, information-rich environment that improved operational automation and bespoke problem-solving. MCP has since been recognized by OpenAI and Microsoft as a backbone for connecting AIs to real-world workflows.
  • Financial service bots increasingly combine user financial history, market data, and regulatory knowledge in real-time, delivering personalized investment advice and reducing user frustration by 40% compared to earlier generations.

3. Healthcare & Customer Support

  • Healthcare virtual assistants with context engineering now consider patients’ health records, medication schedules, and live appointment tracking—delivering accurate, safe advice and dramatically reducing administrative overhead.
  • Customer service bots with dynamic context integration seamlessly pull up prior tickets, account state, and product info, enabling agents and AI to resolve issues without repetitive questioning. This reduces average handle times and improves satisfaction scores.

4. Software Engineering & Coding Assistants

  • At Microsoft, deploying AI code helpers with architectural and organizational context delivered a 26% increase in completed software tasks and a measurable jump in code quality. Teams with well-engineered context windows experienced 65% fewer errors and significantly reduced hallucinations in code generation.
  • Enterprise developer platforms that incorporated user project history, coding standards, and documentation context saw up to 55% faster onboarding for new engineers and 70% better output quality.

5. Ecommerce & Recommendation Systems

  • Ecommerce AI leveraging browsing history, inventory status, and seasonality data provides users with highly relevant recommendations, leading to a measurable increase in conversions over generic prompt-based systems.
  • Retailers report 10x improvements in personalized offer success rates and reductions in abandoned carts after deploying context-engineered agents.

6. Enterprise Knowledge & Legal AI

  • Legal teams using context-aware AI tools to draft contracts and identify risk factors saw work acceleration and fewer missed compliance risks, since systems could dynamically fetch relevant precedent and legal frameworks.
  • Internal enterprise knowledge search, enhanced with multi-source context blocks (policies, client data, service histories), resulted in faster issue resolution and more consistent, high-quality responses for both employees and customers.

Quantifiable Outcomes Across Industries

  • Task success rates improved up to 10x in some applications.
  • Cost reductions of 40% and time savings of 75%-99% are reported when context engineering is applied at scale.
  • User satisfaction and engagement metrics rise substantially when systems move beyond isolated prompts to contextual, adaptive information flows.

Context engineering is now central to enterprise AI, enabling reliable automation, rapid scaling, and next-level personalization that isolated prompt engineering cannot match. These case studies showcase how systematically designing and managing context turns large language models and agents from “clever toy” to “business-critical infrastructure”.



The post Case Studies: Real-World Applications of Context Engineering appeared first on MarkTechPost.

Zhipu AI Releases GLM-4.5V: Versatile Multimodal Reasoning with Scalable Reinforcement Learning


Zhipu AI has officially released and open-sourced GLM-4.5V, a next-generation vision-language model (VLM) that significantly advances the state of open multimodal AI. Based on Zhipu’s 106-billion parameter GLM-4.5-Air architecture—with 12 billion active parameters via a Mixture-of-Experts (MoE) design—GLM-4.5V delivers strong real-world performance and unmatched versatility across visual and textual content.

Key Features and Design Innovations

1. Comprehensive Visual Reasoning

  • Image Reasoning: GLM-4.5V achieves advanced scene understanding, multi-image analysis, and spatial recognition. It can interpret detailed relationships in complex scenes (such as distinguishing product defects, analyzing geographical clues, or inferring context from multiple images simultaneously).
  • Video Understanding: It processes long videos, performing automatic segmentation and recognizing nuanced events thanks to a 3D convolutional vision encoder. This enables applications like storyboarding, sports analytics, surveillance review, and lecture summarization.
  • Spatial Reasoning: Integrated 3D Rotational Positional Encoding (3D-RoPE) gives the model a robust perception of three-dimensional spatial relationships, crucial for interpreting visual scenes and grounding visual elements.

2. Advanced GUI and Agent Tasks

  • Screen Reading & Icon Recognition: The model excels at reading desktop/app interfaces, localizing buttons and icons, and assisting with automation—essential for RPA (robotic process automation) and accessibility tools.
  • Desktop Operation Assistance: Through detailed visual understanding, GLM-4.5V can plan and describe GUI operations, assisting users in navigating software or performing complex workflows.

3. Complex Chart and Document Parsing

  • Chart Understanding: GLM-4.5V can analyze charts, infographics, and scientific diagrams within PDFs or PowerPoint files, extracting summarized conclusions and structured data even from dense, long documents.
  • Long Document Interpretation: With support for up to 64,000 tokens of multimodal context, it can parse and summarize extended, image-rich documents (such as research papers, contracts, or compliance reports), making it ideal for business intelligence and knowledge extraction.

4. Grounding and Visual Localization

  • Precise Grounding: The model can accurately localize and describe visual elements—such as objects, bounding boxes, or specific UI elements—using world knowledge and semantic context, not just pixel-level cues. This enables detailed analysis for quality control, AR applications, and image annotation workflows.

Architectural Highlights

  • Hybrid Vision-Language Pipeline: The system integrates a powerful visual encoder, MLP adapter, and a language decoder, allowing seamless fusion of visual and textual information. Static images, videos, GUIs, charts, and documents are all treated as first-class inputs.
  • Mixture-of-Experts (MoE) Efficiency: While housing 106B total parameters, the MoE design activates only 12B per inference, ensuring high throughput and affordable deployment without sacrificing accuracy.
  • 3D Convolution for Video & Images: Video inputs are processed using temporal downsampling and 3D convolution, enabling the analysis of high-resolution videos and native aspect ratios, while maintaining efficiency.
  • Adaptive Context Length: Supports up to 64K tokens, allowing robust handling of multi-image prompts, concatenated documents, and lengthy dialogues in one pass.
  • Innovative Pretraining and RL: The training regime combines massive multimodal pretraining, supervised fine-tuning, and Reinforcement Learning with Curriculum Sampling (RLCS) for long-chain reasoning mastery and real-world task robustness.

“Thinking Mode” for Tunable Reasoning Depth

A prominent feature is the “Thinking Mode” toggle:

  • Thinking Mode ON: Prioritizes deep, step-by-step reasoning, suitable for complex tasks (e.g., logical deduction, multi-step chart or document analysis).
  • Thinking Mode OFF: Delivers faster, direct answers for routine lookups or simple Q&A. The user can control the model’s reasoning depth at inference, balancing speed against interpretability and rigor.

Benchmark Performance and Real-World Impact

  • State-of-the-Art Results: GLM-4.5V achieves SOTA across 41–42 public multimodal benchmarks, including MMBench, AI2D, MMStar, MathVista, and more, outperforming both open and some premium proprietary models in categories like STEM QA, chart understanding, GUI operation, and video comprehension.
  • Practical Deployments: Businesses and researchers report transformative results in defect detection, automated report analysis, digital assistant creation, and accessibility technology with GLM-4.5V.
  • Democratizing Multimodal AI: Open-sourced under the MIT license, the model equalizes access to cutting-edge multimodal reasoning that was previously gated by exclusive proprietary APIs.

Example Use Cases

  • Image Reasoning (defect detection, content moderation): scene understanding, multiple-image summarization
  • Video Analysis (surveillance, content creation): long video segmentation, event recognition
  • GUI Tasks (accessibility, automation, QA): screen/UI reading, icon location, operation suggestion
  • Chart Parsing (finance, research reports): visual analytics, data extraction from complex charts
  • Document Parsing (law, insurance, science): analyzing and summarizing long illustrated documents
  • Grounding (AR, retail, robotics): target object localization, spatial referencing

Summary

GLM-4.5V by Zhipu AI is a flagship open-source vision-language model setting new performance and usability standards for multimodal reasoning. With its powerful architecture, context length, real-time “thinking mode”, and broad capability spectrum, GLM-4.5V is redefining what’s possible for enterprises, researchers, and developers working at the intersection of vision and language.


Check out the Paper, Model on Hugging Face and GitHub Page here.

The post Zhipu AI Releases GLM-4.5V: Versatile Multimodal Reasoning with Scalable Reinforcement Learning appeared first on MarkTechPost.

Meet LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index


Embedding-based search outperforms traditional keyword-based methods across various domains by capturing semantic similarity using dense vector representations and approximate nearest neighbor (ANN) search. However, the ANN data structure brings excessive storage overhead, often 1.5 to 7 times the size of the original raw data. This overhead is manageable in large-scale web applications but becomes impractical for personal devices or large datasets. Reducing storage to under 5% of the original data size is critical for edge deployment, but existing solutions fall short. Techniques like product quantization (PQ) can reduce storage, but they either reduce accuracy or increase search latency.


Vector search methods depend on IVF and proximity graphs. Graph-based approaches like HNSW, NSG, and Vamana are considered state-of-the-art due to their balance of accuracy and efficiency. Efforts to reduce graph size, such as learned neighbor selection, face limitations due to high training costs and dependency on labeled data. For resource-constrained environments, DiskANN and Starling store data on disk, while FusionANNS optimizes hardware usage. Methods like AiSAQ and EdgeRAG attempt to minimize memory usage but still suffer from high storage overhead or performance degradation at scale. Embedding compression techniques like PQ and RabitQ provide quantization with theoretical error bounds, but struggle to maintain accuracy under tight budgets.
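
For context, the kind of conventional graph index these systems build can be reproduced in a few lines with hnswlib. This is the standard HNSW baseline, not LEANN itself: the index keeps every embedding, which is exactly the storage overhead LEANN removes by pruning the graph and recomputing embeddings at query time.

# Conventional HNSW baseline with hnswlib (not LEANN): the index stores all embeddings,
# illustrating the overhead that LEANN's pruning-plus-recomputation strategy avoids.
import numpy as np
import hnswlib

dim, n = 384, 10_000                                   # embedding dimension, corpus size
data = np.random.rand(n, dim).astype(np.float32)       # stand-in for real document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # M controls graph degree
index.add_items(data, np.arange(n))
index.set_ef(64)                                       # query-time accuracy/speed trade-off

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=3)
print(labels, distances)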

Researchers from UC Berkeley, CUHK, Amazon Web Services, and UC Davis have developed LEANN, a storage-efficient ANN search index optimized for resource-limited personal devices. It integrates a compact graph-based structure with an on-the-fly recomputation strategy, enabling fast and accurate retrieval while minimizing storage overhead. LEANN achieves up to 50 times smaller storage than standard indexes by reducing the index size to under 5% of the original raw data. It maintains 90% top-3 recall in under 2 seconds on real-world question-answering benchmarks. To reduce latency, LEANN utilizes a two-level traversal algorithm and dynamic batching that combines embedding computations across search hops, enhancing GPU utilization.

LEANN’s design combines a core graph-based recomputation method, supporting techniques, and an end-to-end system workflow. Built on the HNSW framework, it observes that each query needs embeddings for only a limited subset of nodes, prompting on-demand computation instead of pre-storing all embeddings. To address earlier challenges, LEANN introduces two techniques: (a) a two-level graph traversal with dynamic batching to lower recomputation latency, and (b) a high-degree-preserving graph pruning method to reduce metadata storage. In the system workflow, LEANN begins by computing embeddings for all dataset items and then constructs a vector index using an off-the-shelf graph-based indexing approach.

In terms of storage and latency, LEANN outperforms EdgeRAG, an IVF-based recomputation method, achieving latency reductions ranging from 21.17 to 200.60 times across various datasets and hardware platforms. This advantage stems from LEANN’s polylogarithmic recomputation complexity, which scales more efficiently than EdgeRAG’s √N growth. In terms of accuracy for downstream RAG tasks, LEANN achieves higher performance across most datasets, except GPQA, where a distributional mismatch limits its effectiveness. Similarly, on HotpotQA, the single-hop retrieval setup limits accuracy gains, as the dataset demands multi-hop reasoning. Despite these limitations, LEANN shows strong performance across diverse benchmarks.

In this paper, researchers introduced LEANN, a storage-efficient neural retrieval system that combines graph-based recomputation with innovative optimizations. By integrating a two-level search algorithm and dynamic batching, it eliminates the need to store full embeddings, achieving significant reductions in storage overhead while maintaining high accuracy. Despite its strengths, LEANN faces limitations, such as high peak storage usage during index construction, which could be addressed through pre-clustering or other techniques. Future work may focus on reducing latency and enhancing responsiveness, opening the path for broader adoption in resource-constrained environments.


Check out the Paper and GitHub Page here.

The post Meet LEANN: The Tiniest Vector Database that Democratizes Personal AI with Storage-Efficient Approximate Nearest Neighbor (ANN) Search Index appeared first on MarkTechPost.

A Practical Prompt Engineering Guide for GPT-5 with Examples


GPT-5 is OpenAI’s latest and most powerful AI large language model (LLM) to date. The company calls it the smartest, fastest, most useful model yet, with built-in thinking and PhD‑level intelligence. When an AI model is truly capable, its effectiveness depends on how users apply it to their everyday tasks and how well they take advantage of its intelligence. One of the best ways to use an artificial intelligence (AI) model efficiently is to write effective prompts that the AI can follow rather than prompts that lead it to hallucinate.

Thankfully, to help you get the best out of the latest GPT-5 model, OpenAI has released a GPT-5 Prompting Guide. The cookbook is a practical prompt engineering guide by OpenAI for the latest GPT-5 model to help you get the best results.

This guide will help you get a more predictable, more useful AI that can plan work, follow instructions precisely, and ship clean code without drama. The model is more proactive than older systems, so the guide leans into controls that help you decide when GPT-5 should go on its own and when it should slow down and ask.

A Smarter, More Capable AI

So, what makes GPT-5 such a game-changer? It’s not just about being able to answer more questions or write longer blocks of text. The improvements are far more profound. One of the most significant advancements is the massive reduction in “hallucinations,” the term for when an AI generates false or misleading information. This means we can trust the answers we get from GPT-5 to a much greater degree, making it a more reliable tool for everything from research to creative writing.

What’s actually new—and why it matters

We always hear companies make claims, so what’s actually new and why does that matter?

Agentic control without chaos

GPT-5 can operate anywhere on the spectrum from tightly scripted helper to independent problem-solver. The guide shows how to dial “agentic eagerness” up or down.

For example: Reduce exploration with a lower reasoning_effort, or push persistence with prompts that tell the model to keep going until the task is truly done. That makes long, tool-heavy workflows feel less random and more repeatable.

Clearly defining criteria in your prompt for how you want the AI model to explore the problem space can reduce the model’s need to explore and reason about too many ideas.

Example prompt:

You are a research assistant helping me create a 2-page market trends summary.
- Use reasoning_effort: low
- Do not explore topics outside the 2023–2025 data.
- Stop if you cannot find at least 3 credible sources.
- Keep going until you've fully summarized the 3 trends and written the final draft.

Progress you can follow (tool preambles)

Long tasks build trust when the model explains what it’s doing. The guide encourages short “preambles” before and during tool calls: restate the goal, outline the plan, narrate key steps, then summarize what changed. This doesn’t add fluff; it helps humans review and step in without derailing momentum.

Example prompt:

Your job is to clean and analyze a CSV file.
Before you start, restate the task in one sentence and outline the steps you will take.
When using a tool, explain what you're doing in 1–2 sentences before calling it.
After each step, summarize the result in under 50 words.

Right-sized thinking (reasoning effort)

The reasoning_effort parameter is your depth dial: keep it moderate for routine tasks, raise it for multi-step or tricky problems, and crucially break big tasks into distinct turns so the model plans and checks work in stages. That structure improves both quality and speed.

Example prompt:

You are solving a logic puzzle.
- Use reasoning_effort: high
- Explain your reasoning before giving the final answer.
- If the puzzle takes more than 5 steps to solve, break it into stages and confirm with me after each stage.

Better multi-step flows with the Responses API

If you’re building tools or agents around GPT-5, the Responses API lets the model reuse its prior reasoning instead of re-planning from scratch. OpenAI reports measurable gains just by making that switch (they cite a Tau-Bench Retail bump from ~73.9% to ~78.2% when passing prior reasoning), which in practice means lower cost, faster responses, and more stable behavior.

Example prompt:

{
  "model": "gpt-5",
  "previous_response_id": "resp_12345",
  "input": "Now summarize the insights from the analysis above in bullet points."
}
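
In the Python SDK, the same chaining looks roughly like the sketch below. Treat it as a sketch: it assumes a recent openai package with Responses API support and access to a "gpt-5" model, and the reasoning settings follow the guide, so verify parameter names against the current API reference.

# Responses API chaining sketch (assumes the openai Python SDK with Responses support
# and access to "gpt-5"; parameter names should be checked against OpenAI's docs).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

first = client.responses.create(
    model="gpt-5",
    input="Analyze this quarter's sales data and list the three biggest insights.",
    reasoning={"effort": "medium"},
)

followup = client.responses.create(
    model="gpt-5",
    previous_response_id=first.id,  # reuse prior reasoning instead of re-planning from scratch
    input="Now summarize the insights from the analysis above in bullet points.",
)
print(followup.output_text)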

Coding with GPT-5

GPT-5 can build new apps or make large, multi-file edits, but the standout tip is to codify taste and standards:

  • Frameworks: Next.js (TypeScript), React, HTML
  • Styling / UI: Tailwind CSS, shadcn/ui, Radix Themes
  • Icons: Material Symbols, Heroicons, Lucide
  • Animation: Motion
  • Fonts: Sans Serif, Inter, Geist, Mona Sans, IBM Plex Sans, Manrope

The GPT-5 model can hunt for context (like installed packages) without needing special prompts, and a short “code editing rules” block encourages it to follow your house style.

Example prompt:

You are editing an existing React app.
- Follow BEM CSS naming.
- Use Tailwind for styling.
- Place all new components in /src/components/ui.
- Maintain camelCase for function names.
- Use only hooks for state management.
Now, add a responsive navbar with a dropdown menu that matches the existing color palette.

A practical pattern from Cursor

Cursor, an AI code editor that tested GPT-5 early, found a useful balance:

  • Set verbosity low for regular text so the assistant stays concise, but ask for high verbosity inside code tools so diffs are readable with clear variable names.

They also learned to soften old “analyze everything” prompts, which made earlier models thorough but nudged GPT-5 to overuse tools. The fix—structured sections and clearer scope—reduced unnecessary calls and kept autonomy high.

Example prompt:

Global setting: verbosity=low
When providing code edits:
- Switch verbosity to high
- Include full diffs with comments explaining each change
- Use descriptive variable names

Two separate dials: thinking vs. talking (verbosity)

GPT-5 adds a verbosity parameter that controls the length of the final answer (separate from how hard it thinks). Keep a concise global default, then override locally where detail matters. For example, tasks such as code explanations or audit trails. You can also steer verbosity with plain language inside the prompt.

Example prompt:

Verbosity: low
Task: Summarize this 20-page report in 200 words.
If I type "explain more", increase verbosity, and give a detailed breakdown.

Instruction precision matters

GPT-5 follows directions with “surgical” accuracy, which is powerful and at the same time unforgiving. If your prompt includes conflicts (“never schedule without consent” next to “auto-assign before contacting”), the model wastes tokens reconciling rules. Clean hierarchies and explicit exceptions fix that. The guide even shows a healthcare example and how rewriting it makes reasoning faster and clearer. Use OpenAI’s prompt optimizer to spot these issues.

Example prompt (❌ bad):

Never schedule meetings without consent.
Always schedule the earliest available time.

Fixed prompt (✅ good):

Only schedule meetings with explicit consent.
If consent is given, choose the earliest available time.

Minimal reasoning for speed

There’s also a minimal-effort mode: the fastest option that still benefits from the reasoning model pattern. It shines on latency-sensitive tasks when paired with a short “why” summary, clear tool preambles, and an explicit plan.

The model spends fewer reasoning tokens on figuring out how to solve the problem before writing the answer, and won’t do as much step-by-step planning inside its hidden reasoning process. If you still want it to complete a multi-step task or follow a structured approach, you have to build that structure into your prompt. For example:

  • Explicitly list the steps you want it to take.
  • Provide templates, headings, or formats that it should fill in.
  • State exactly what information goes where and in what order.

This “scaffolding” works like the outline of a building: it gives the model a clear frame to follow when it’s not spending much time figuring things out for itself.

If you don’t provide that, minimal reasoning mode might give you a shallow, incomplete, or disorganized answer—because it’s skipping the deeper planning phase.

Example prompt:

Minimal reasoning
Extract all email addresses from the following text and output as a comma-separated list.
Do not include any other text.

Formatting defaults and overrides

By default, API responses aren’t in Markdown. If your app expects headings, lists, and code fences, say so—briefly and precisely—and refresh that instruction every few turns in long chats to keep the formatting consistent.

Example prompt:

Output your response in Markdown format.
- Use H2 headings
- Bullet points for lists
- Code blocks with language tags for code snippets

Metaprompting with GPT-5

When a prompt underperforms, ask GPT-5 to critique it: what to add, delete, or clarify to elicit the behavior you want, without removing everything that already works. It’s a low-effort way to improve the prompts you rely on daily.

Example prompt:

Here is my current prompt: "Write a friendly product description for our coffee shop."
Critique this prompt and suggest 3 specific changes that would make the output warmer, more detailed, and in a consistent brand voice.

Why this matters

The main idea of the guide is to maintain control without micromanaging. You set a small number of dials—how much to think, how much to say, how persistent to be—and describe your environment with just enough detail for the model to blend in. In return, you get better hand-offs, fewer stalls, and code or documents that feel like they were made by your team. Most importantly, the fixes are the kind you can try this afternoon: trim a conflicting line, add a persistence clause, split a big job into steps, or move to the Responses API if you’re doing tool-based work.

In Conclusion:

Good prompts aren’t fancy—they’re specific. GPT-5 rewards that specificity: clear stop conditions, right-sized reasoning, a steady narration of what it’s doing, and a tidy set of rules for how work should look when it’s done. Treat the guide as a checklist, not a lecture. Start small, measure, and keep your prompts short, scoped, and consistent. The payoff is an assistant that feels less like a chat window and more like a reliable teammate, one you can trust to plan, execute, and finish the job.



↗️ GPT-5 Prompting Guide

AI Tools Club

NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models with Extended Reinforcement Learning RL


What Is ProRLv2?

ProRLv2 is the latest version of NVIDIA’s Prolonged Reinforcement Learning (ProRL), designed specifically to push the boundaries of reasoning in large language models (LLMs). By scaling reinforcement learning (RL) steps from 2,000 up to 3,000, ProRLv2 systematically tests how extended RL can unlock new solution spaces, creativity, and high-level reasoning that were previously inaccessible—even with smaller models like the 1.5B-parameter Nemotron-Research-Reasoning-Qwen-1.5B-v2.

Key Innovations in ProRLv2

ProRLv2 incorporates several innovations to overcome common RL limitations in LLM training:

  • REINFORCE++-Baseline: A robust RL algorithm that enables long-horizon optimization over thousands of steps, handling the instability typical in RL for LLMs.
  • KL Divergence Regularization & Reference Policy Reset: Periodically refreshes the reference model with the current best checkpoint, allowing stable progress and continued exploration by preventing the RL objective from dominating too early.
  • Decoupled Clipping & Dynamic Sampling (DAPO): Encourages diverse solution discovery by boosting unlikely tokens and focusing learning signals on prompts of intermediate difficulty.
  • Scheduled Length Penalty: Cyclically applied, helping maintain diversity and prevent entropy collapse as training lengthens.
  • Scaling Training Steps: ProRLv2 moves the RL training horizon from 2,000 to 3,000 steps, directly testing how much longer RL can expand reasoning abilities.

How ProRLv2 Expands LLM Reasoning

Nemotron-Research-Reasoning-Qwen-1.5B-v2, trained with ProRLv2 for 3,000 RL steps, sets a new standard for open-weight 1.5B models on reasoning tasks, including math, code, science, and logic puzzles:

  • Performance surpasses previous versions and competitors like DeepSeek-R1-1.5B.
  • Sustained gains with more RL steps: Longer training leads to continual improvements, especially on tasks where base models perform poorly, demonstrating genuine expansion in reasoning boundaries.
  • Generalization: Not only does ProRLv2 boost pass@1 accuracy, but it also enables novel reasoning and solution strategies on tasks not seen during training.
  • Benchmarks: Gains include average pass@1 improvements of 14.7% in math, 13.9% in coding, 54.8% in logic puzzles, 25.1% in STEM reasoning, and 18.1% in instruction-following tasks, with further improvements in v2 on unseen and harder benchmarks.

Why It Matters

The major finding of ProRLv2 is that continued RL training, with careful exploration and regularization, reliably expands what LLMs can learn and generalize. Rather than hitting an early plateau or overfitting, prolonged RL allows smaller models to rival much larger ones in reasoning—demonstrating that scaling RL itself is as important as model or dataset size.

Using Nemotron-Research-Reasoning-Qwen-1.5B-v2

The latest checkpoint is available for testing on Hugging Face. Loading the model:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Research-Reasoning-Qwen-1.5B")
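
A quick generation check might then look like the following (standard transformers usage; the prompt and decoding settings are illustrative, not the paper's evaluation setup):

# Illustrative generation with the loaded model; not the benchmark configuration.
prompt = "Solve step by step: if 3x + 5 = 20, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))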

Conclusion

ProRLv2 redefines the limits of reasoning in language models by showing that RL scaling laws matter as much as size or data. Through advanced regularization and smart training schedules, it enables deep, creative, and generalizable reasoning even in compact architectures. The future lies in how far RL can push—not just how big models can get.


Check out the Unofficial Blog and Model on Hugging Face here.

The post NVIDIA AI Releases ProRLv2: Advancing Reasoning in Language Models with Extended Reinforcement Learning RL appeared first on MarkTechPost.
