Biomni-R0: New Agentic LLMs Trained End-to-End with Multi-Turn Reinforcement Learning for Expert-Level Intelligence in Biomedical Research

The Growing Role of AI in Biomedical Research

The field of biomedical artificial intelligence is evolving rapidly, with increasing demand for agents capable of performing tasks that span genomics, clinical diagnostics, and molecular biology. These agents aren’t merely designed to retrieve facts; they are expected to reason through complex biological problems, interpret patient data, and extract meaningful insights from vast biomedical databases. Unlike general-purpose AI models, biomedical agents must interface with domain-specific tools, comprehend biological hierarchies, and simulate workflows similar to those of researchers to effectively support modern biomedical research.

The Core Challenge: Matching Expert-Level Reasoning

However, achieving expert-level performance in these tasks is far from trivial. Most large language models fall short when dealing with the nuance and depth of biomedical reasoning. They may succeed on surface-level retrieval or pattern recognition tasks, but often fail when challenged with multi-step reasoning, rare disease diagnosis, or gene prioritization, areas that require not just data access, but contextual understanding and domain-specific judgment. This limitation has created a clear gap: how to train biomedical AI agents that can think and act like domain experts.

Why Traditional Approaches Fall Short

While some solutions leverage supervised learning on curated biomedical datasets or retrieval-augmented generation to ground responses in literature or databases, these approaches have drawbacks. They often rely on static prompts and pre-defined behaviors that lack adaptability. Furthermore, many of these agents struggle to effectively execute external tools, and their reasoning chains collapse when faced with unfamiliar biomedical structures. This fragility makes them ill-suited for dynamic or high-stakes environments, where interpretability and accuracy are non-negotiable.

Biomni-R0: A New Paradigm Using Reinforcement Learning

Researchers from Stanford University and UC Berkeley introduced a new family of models called Biomni-R0, built by applying reinforcement learning (RL) to a biomedical agent foundation. These models, Biomni-R0-8B and Biomni-R0-32B, were trained in an RL environment specifically tailored for biomedical reasoning, using both expert-annotated tasks and a novel reward structure. The collaboration combines Stanford’s Biomni agent and environment platform with UC Berkeley’s SkyRL reinforcement learning infrastructure, aiming to push biomedical agents past human-level capabilities.

Training Strategy and System Design

The research introduced a two-phase training process. First, they used supervised fine-tuning (SFT) on high-quality trajectories sampled from Claude-4 Sonnet using rejection sampling, effectively bootstrapping the agent’s ability to follow structured reasoning formats. Next, they fine-tuned the models using reinforcement learning, optimizing for two kinds of rewards: one for correctness (e.g., selecting the right gene or diagnosis), and another for response formatting (e.g., using structured <think> and <answer> tags correctly).
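
The formatting half of that reward is straightforward to make concrete. Below is a minimal, illustrative sketch of how a combined reward might be computed; the tag names follow the article, but the weighting and the correctness check are assumptions rather than the authors' released code.

import re

def format_reward(response: str) -> float:
    # Reward well-formed responses: a <think>...</think> block followed by an <answer>...</answer> block.
    pattern = r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response, flags=re.DOTALL) else 0.0

def correctness_reward(response: str, gold_answer: str) -> float:
    # Toy correctness check: exact match on the extracted answer span.
    # A real system would use task-specific scoring (e.g., gene symbol or diagnosis match).
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted.lower() == gold_answer.lower() else 0.0

def total_reward(response: str, gold_answer: str, w_format: float = 0.2) -> float:
    # Weighted sum of the two reward components (weights are illustrative).
    return w_format * format_reward(response) + (1 - w_format) * correctness_reward(response, gold_answer)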

To ensure computational efficiency, the team developed asynchronous rollout scheduling that minimized bottlenecks caused by external tool delays. They also expanded the context length to 64k tokens, allowing the agent to manage long multi-step reasoning conversations effectively.
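
Conceptually, asynchronous rollout scheduling amounts to overlapping slow tool calls across many rollouts instead of blocking the whole batch on the slowest one. The plain-asyncio sketch below only illustrates that idea; it is not the SkyRL implementation.

import asyncio, random

async def run_tool(rollout_id: int, tool: str) -> str:
    # Simulate an external tool call (database query, code execution) with variable latency.
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return f"rollout {rollout_id}: result of {tool}"

async def rollout(rollout_id: int) -> str:
    # Each rollout alternates between fast model steps and slow tool calls.
    observation = await run_tool(rollout_id, "variant_annotation")
    return observation

async def main():
    # Launch many rollouts concurrently so a slow tool never idles the rest of the batch.
    results = await asyncio.gather(*(rollout(i) for i in range(8)))
    for r in results:
        print(r)

asyncio.run(main())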

Results That Outperform Frontier Models

The performance gains were significant. Biomni-R0-32B achieved a score of 0.669, a jump from the base model’s 0.346. Even Biomni-R0-8B, the smaller version, scored 0.588, outperforming general-purpose models like Claude 4 Sonnet and GPT-5, which are both much larger. On a task-by-task basis, Biomni-R0-32B scored highest on 7 out of 10 tasks, while GPT-5 led in 2, and Claude 4 in just 1. One of the most striking results was in rare disease diagnosis, where Biomni-R0-32B reached 0.67, compared to Qwen-32B’s 0.03, a more than 20× improvement. Similarly, in GWAS variant prioritization, the model’s score increased from 0.16 to 0.74, demonstrating the value of domain-specific reasoning.

Designing for Scalability and Precision

Training large biomedical agents requires dealing with resource-heavy rollouts involving external tool execution, database queries, and code evaluation. To manage this, the system decoupled environment execution from model inference, allowing more flexible scaling and reducing idle GPU time. This innovation ensured efficient use of resources, even with tools that had varying execution latencies. Longer reasoning sequences also proved beneficial. The RL-trained models consistently produced lengthier, structured responses, which strongly correlated with better performance, highlighting that depth and structure in reasoning are key indicators of expert-level understanding in biomedicine.

Key Takeaways from the research include:

  • Biomedical agents must perform deep reasoning, not just retrieval, across genomics, diagnostics, and molecular biology.
  • The central problem is achieving expert-level task performance, mainly in complex areas such as rare diseases and gene prioritization.
  • Traditional methods, including supervised fine-tuning and retrieval-based models, often fall short in terms of robustness and adaptability.
  • Biomni-R0, developed by Stanford and UC Berkeley, uses reinforcement learning with expert-based rewards and structured output formatting.
  • The two-phase training pipeline, SFT followed by RL, proved highly effective in optimizing performance and reasoning quality.
  • Biomni-R0-8B delivers strong results with a smaller architecture, while Biomni-R0-32B sets new benchmarks, outperforming Claude 4 and GPT-5 on 7 of 10 tasks.
  • Reinforcement learning enabled the agent to generate longer, more coherent reasoning traces, a key trait of expert behavior.
  • This work lays the foundation for super-expert biomedical agents, capable of automating complex research workflows with precision.

Google AI Releases EmbeddingGemma: A 308M Parameter On-Device Embedding Model with State-of-the-Art MTEB Results

EmbeddingGemma is Google’s new open text embedding model optimized for on-device AI, designed to balance efficiency with state-of-the-art retrieval performance.

How compact is EmbeddingGemma compared to other models?

At just 308 million parameters, EmbeddingGemma is lightweight enough to run on mobile devices and offline environments. Despite its size, it performs competitively with much larger embedding models. Inference latency is low (sub-15 ms for 256 tokens on EdgeTPU), making it suitable for real-time applications.

How well does it perform on multilingual benchmarks?

EmbeddingGemma was trained across 100+ languages and achieved the highest ranking on the Massive Text Embedding Benchmark (MTEB) among models under 500M parameters. Its performance rivals or exceeds embedding models nearly twice its size, particularly in cross-lingual retrieval and semantic search.

https://developers.googleblog.com/en/introducing-embeddinggemma/

What is the underlying architecture?

EmbeddingGemma is built on a Gemma 3–based encoder backbone with mean pooling. Importantly, the architecture does not use the multimodal-specific bidirectional attention layers that Gemma 3 applies for image inputs. Instead, EmbeddingGemma employs a standard transformer encoder stack with full-sequence self-attention, which is typical for text embedding models.

This encoder produces 768-dimensional embeddings and supports sequences up to 2,048 tokens, making it well-suited for retrieval-augmented generation (RAG) and long-document search. The mean pooling step ensures fixed-length vector representations regardless of input size.

https://developers.googleblog.com/en/introducing-embeddinggemma/

What makes its embeddings flexible?

EmbeddingGemma employs Matryoshka Representation Learning (MRL). This allows embeddings to be truncated from 768 dimensions down to 512, 256, or even 128 dimensions with minimal loss of quality. Developers can tune the trade-off between storage efficiency and retrieval precision without retraining.

Can it run entirely offline?

Yes. EmbeddingGemma was specifically designed for on-device, offline-first use cases. Since it shares a tokenizer with Gemma 3n, the same embeddings can directly power compact retrieval pipelines for local RAG systems, with privacy benefits from avoiding cloud inference.

What tools and frameworks support EmbeddingGemma?

It integrates seamlessly with:

  • Hugging Face (transformers, Sentence-Transformers, transformers.js)
  • LangChain and LlamaIndex for RAG pipelines
  • Weaviate and other vector databases
  • ONNX Runtime for optimized deployment across platforms

This ecosystem ensures developers can slot it directly into existing workflows.

How can it be implemented in practice?

(1) Load and Embed

from sentence_transformers import SentenceTransformer

# Load the 308M-parameter model from Hugging Face (cached locally after the first download).
model = SentenceTransformer("google/embeddinggemma-300m")
# encode() returns one 768-dimensional vector per input text.
emb = model.encode(["example text to embed"])

(2) Adjust Embedding Size
Use full 768 dims for maximum accuracy or truncate to 512/256/128 dims for lower memory or faster retrieval.
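
A minimal way to apply the truncation manually is to slice and re-normalize, as sketched below. Newer versions of sentence-transformers also expose a truncate_dim option at load time, but the explicit route avoids depending on a specific library version.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
emb = np.asarray(model.encode(["example text to embed"]))

def truncate_embeddings(vectors: np.ndarray, dim: int = 256) -> np.ndarray:
    # Keep the first `dim` MRL dimensions, then re-normalize to unit length
    # so cosine similarity remains meaningful after truncation.
    truncated = vectors[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

emb_256 = truncate_embeddings(emb, dim=256)   # shape: (1, 256)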

(3) Integrate into RAG
Run similarity search locally (cosine similarity) and feed top results into Gemma 3n for generation. This enables a fully offline RAG pipeline.
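
Here is a bare-bones sketch of that local retrieval step, assuming unit-normalized embeddings so cosine similarity reduces to a dot product; the document corpus and the Gemma 3n generation step are placeholders.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

docs = [
    "Gemma models can run on-device.",
    "EmbeddingGemma produces 768-dimensional vectors.",
    "RAG retrieves context before generation.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2):
    # Cosine similarity via dot product on normalized vectors, fully offline.
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_emb @ q
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]

context = retrieve("What dimensionality are EmbeddingGemma vectors?")
# The retrieved passages would then be inserted into a Gemma 3n prompt for generation.
print(context)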

Why EmbeddingGemma?

  1. Efficiency at scale – High multilingual retrieval accuracy in a compact footprint.
  2. Flexibility – Adjustable embedding dimensions via MRL.
  3. Privacy – End-to-end offline pipelines without external dependencies.
  4. Accessibility – Open weights, permissive licensing, and strong ecosystem support.

EmbeddingGemma proves that smaller embedding models can achieve best-in-class retrieval performance while being light enough for offline deployment. It marks an important step toward efficient, privacy-conscious, and scalable on-device AI.


Google DeepMind Finds a Fundamental Bug in RAG: Embedding Limits Break Retrieval at Scale

Retrieval-Augmented Generation (RAG) systems generally rely on dense embedding models that map queries and documents into fixed-dimensional vector spaces. While this approach has become the default for many AI applications, recent research from the Google DeepMind team identifies a fundamental architectural limitation that cannot be solved by larger models or better training alone.

What Is the Theoretical Limit of Embedding Dimensions?

At the core of the issue is the representational capacity of fixed-size embeddings. An embedding of dimension d cannot represent all possible combinations of relevant documents once the database grows beyond a critical size. This follows from results in communication complexity and sign-rank theory.

  • For embeddings of size 512, retrieval breaks down around 500K documents.
  • For 1024 dimensions, the limit extends to about 4 million documents.
  • For 4096 dimensions, the theoretical ceiling is 250 million documents.

These values are best-case estimates derived under free embedding optimization, where vectors are directly optimized against test labels. Real-world language-constrained embeddings fail even earlier.

https://arxiv.org/pdf/2508.21038

How Does the LIMIT Benchmark Expose This Problem?

To test this limitation empirically, the Google DeepMind team introduced LIMIT (Limitations of Embeddings in Information Retrieval), a benchmark dataset specifically designed to stress-test embedders. LIMIT has two configurations:

  • LIMIT full (50K documents): In this large-scale setup, even strong embedders collapse, with recall@100 often falling below 20%.
  • LIMIT small (46 documents): Despite the simplicity of this toy-sized setup, models still fail to solve the task. Performance varies widely but remains far from reliable:
    • Promptriever Llama3 8B: 54.3% recall@2 (4096d)
    • GritLM 7B: 38.4% recall@2 (4096d)
    • E5-Mistral 7B: 29.5% recall@2 (4096d)
    • Gemini Embed: 33.7% recall@2 (3072d)

Even with just 46 documents, no embedder reaches full recall, highlighting that the limitation is not dataset size alone but the single-vector embedding architecture itself.

In contrast, BM25, a classical sparse lexical model, does not suffer from this ceiling. Sparse models operate in effectively unbounded dimensional spaces, allowing them to capture combinations that dense embeddings cannot.

https://arxiv.org/pdf/2508.21038

Why Does This Matter for RAG?

Current RAG implementations typically assume that embeddings can scale indefinitely with more data. The Google DeepMind research team shows that this assumption is incorrect: embedding size inherently constrains retrieval capacity. This affects:

  • Enterprise search engines handling millions of documents.
  • Agentic systems that rely on complex logical queries.
  • Instruction-following retrieval tasks, where queries define relevance dynamically.

Even advanced benchmarks like MTEB fail to capture these limitations because they test only a narrow slice of possible query-document combinations.

What Are the Alternatives to Single-Vector Embeddings?

The research team suggested that scalable retrieval will require moving beyond single-vector embeddings:

  • Cross-Encoders: Achieve perfect recall on LIMIT by directly scoring query-document pairs, but at the cost of high inference latency.
  • Multi-Vector Models (e.g., ColBERT): Offer more expressive retrieval by assigning multiple vectors per sequence, improving performance on LIMIT tasks.
  • Sparse Models (BM25, TF-IDF, neural sparse retrievers): Scale better in high-dimensional search but lack semantic generalization.

The key insight is that architectural innovation is required, not simply larger embedders.
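
To make the multi-vector idea concrete, here is a minimal ColBERT-style MaxSim scoring sketch. It is illustrative only; real systems add token-level encoders, vector compression, and indexing on top of this scoring rule.

import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim), both L2-normalized.
    # For each query token, take its best-matching document token, then sum the maxima.
    sim = query_vecs @ doc_vecs.T            # (q_tokens, d_tokens) cosine similarities
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(50, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))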

What is the Key Takeaway?

The research team’s analysis shows that dense embeddings, despite their success, are bound by a mathematical limit: they cannot capture all possible relevance combinations once corpus sizes exceed limits tied to embedding dimensionality. The LIMIT benchmark demonstrates this failure concretely:

  • On LIMIT full (50K docs): recall@100 drops below 20%.
  • On LIMIT small (46 docs): even the best models max out at ~54% recall@2.

Classical techniques like BM25, or newer architectures such as multi-vector retrievers and cross-encoders, remain essential for building reliable retrieval engines at scale.


What is OLMoASR and How Does It Compare to OpenAI’s Whisper in Speech Recognition?

The Allen Institute for AI (AI2) has released OLMoASR, a suite of open automatic speech recognition (ASR) models that rival closed-source systems such as OpenAI’s Whisper. Beyond just releasing model weights, AI2 has published training data identifiers, filtering steps, training recipes, and benchmark scripts, an unusually transparent move in the ASR space. This makes OLMoASR one of the most open and extensible platforms for speech recognition research.

Why Open Automatic Speech Recognition (ASR)?

Most speech recognition models available today—whether from OpenAI, Google, or Microsoft—are only accessible via APIs. While these services provide high performance, they operate as black boxes: the training datasets are opaque, the filtering methods are undocumented, and the evaluation protocols are not always aligned with research standards.

This lack of transparency poses challenges for reproducibility and scientific progress. Researchers cannot verify claims, test variations, or adapt models to new domains without re-building large datasets themselves. OLMoASR addresses this problem by opening the entire pipeline. The release is not just about enabling practical transcription—it’s about pushing ASR toward a more open, scientific foundation.

Model Architecture and Scaling

OLMoASR uses a transformer encoder–decoder architecture, the dominant paradigm in modern ASR.

  • The encoder ingests audio waveforms and produces hidden representations.
  • The decoder generates text tokens conditioned on the encoder’s outputs.

This design is similar to Whisper, but OLMoASR makes the implementation fully open.

The family of models covers six sizes, all trained on English:

  • tiny.en – 39M parameters, designed for lightweight inference
  • base.en – 74M parameters
  • small.en – 244M parameters
  • medium.en – 769M parameters
  • large.en-v1 – 1.5B parameters, trained on 440K hours
  • large.en-v2 – 1.5B parameters, trained on 680K hours

This range allows developers to trade off between inference cost and accuracy. Smaller models are suited for embedded devices or real-time transcription, while the larger models maximize accuracy for research or batch workloads.

Data: From Web Scraping to Curated Mixes

One of the core contributions of OLMoASR is the open release of training datasets, not just the models.

OLMoASR-Pool (~3M hours)

This massive collection contains weakly supervised speech paired with transcripts scraped from the web. It includes around 3 million hours of audio and 17 million text transcripts. Like Whisper’s original dataset, it is noisy, containing misaligned captions, duplicates, and transcription errors.

OLMoASR-Mix (~1M hours)

To address quality issues, AI2 applied rigorous filtering:

  • Alignment heuristics to ensure audio and transcripts match
  • Fuzzy deduplication to remove repeated or low-diversity examples
  • Cleaning rules to eliminate duplicate lines and mismatched text

The result is a high-quality, 1M-hour dataset that boosts zero-shot generalization—critical for real-world tasks where data may differ from training distributions.

This two-tiered data strategy mirrors practices in large-scale language model pretraining: use vast noisy corpora for scale, then refine with filtered subsets to improve quality.

Performance Benchmarks

AI2 benchmarked OLMoASR against Whisper across both short-form and long-form speech tasks, using datasets like LibriSpeech, TED-LIUM3, Switchboard, AMI, and VoxPopuli.

Medium Model (769M)

  • 12.8% WER (word error rate) on short-form speech
  • 11.0% WER on long-form speech

This nearly matches Whisper-medium.en, which achieves 12.4% and 10.5% respectively.

Large Models (1.5B)

  • large.en-v1 (440K hours): 13.0% WER short-form vs Whisper large-v1 at 12.2%
  • large.en-v2 (680K hours): 12.6% WER, closing the gap to less than 0.5%

Smaller Models

Even the tiny and base versions perform competitively:

  • tiny.en: ~20.5% WER short-form, ~15.6% WER long-form
  • base.en: ~16.6% WER short-form, ~12.9% WER long-form

This gives developers flexibility to choose models based on compute and latency requirements.

How to use?

Transcribing audio takes just a few lines of code:

import olmoasr

# Load the 769M-parameter medium English model in inference mode.
model = olmoasr.load_model("medium", inference=True)
# Transcribe a local audio file; the result includes the text plus time-aligned segments.
result = model.transcribe("audio.mp3")
print(result)

The output includes both the transcription and time-aligned segments, making it useful for captioning, meeting transcription, or downstream NLP pipelines.
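
As an illustration of the captioning use case, the sketch below converts time-aligned segments into SubRip (.srt) captions. The segment structure assumed here (a list of dicts with start, end, and text keys, similar to Whisper's output) is an assumption for illustration, not a documented OLMoASR schema.

def to_srt(segments) -> str:
    # segments: assumed list of {"start": seconds, "end": seconds, "text": str}
    def ts(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int((t - int(t)) * 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    lines = []
    for i, seg in enumerate(segments, start=1):
        lines += [str(i), f"{ts(seg['start'])} --> {ts(seg['end'])}", seg["text"].strip(), ""]
    return "\n".join(lines)

print(to_srt([{"start": 0.0, "end": 2.5, "text": "Hello world."}]))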

Fine-Tuning and Domain Adaptation

Since AI2 provides full training code and recipes, OLMoASR can be fine-tuned for specialized domains:

  • Medical speech recognition – adapting models on datasets like MIMIC-III or proprietary hospital recordings
  • Legal transcription – training on courtroom audio or legal proceedings
  • Low-resource accents – fine-tuning on dialects not well covered in OLMoASR-Mix

This adaptability is critical: ASR performance often drops when models are used in specialized domains with domain-specific jargon. Open pipelines make domain adaptation straightforward.

Applications

OLMoASR opens up exciting opportunities across academic research and real-world AI development:

  • Educational Research: Researchers can explore the intricate relationships between model architecture, dataset quality, and filtering techniques to understand their effects on speech recognition performance.
  • Human-Computer Interaction: Developers gain the freedom to embed speech recognition capabilities directly into conversational AI systems, real-time meeting transcription platforms, and accessibility applications—all without dependency on proprietary APIs or external services.
  • Multimodal AI Development: When combined with large language models, OLMoASR enables the creation of advanced multimodal assistants that can seamlessly process spoken input and generate intelligent, contextually-aware responses.
  • Research Benchmarking: The open availability of both training data and evaluation metrics positions OLMoASR as a standardized reference point, allowing researchers to compare new approaches against a consistent, reproducible baseline in future ASR studies.

Conclusion

The release of OLMoASR shows that high-quality speech recognition can be developed and released in a way that prioritizes transparency and reproducibility. While the models are currently limited to English and still demand significant compute for training, they provide a solid foundation for adaptation and extension. This release sets a clear reference point for future work in open ASR and makes it easier for researchers and developers to study, benchmark, and apply speech recognition models in different domains.



Google Brings Gemini CLI to GitHub Actions: Secure, Free, and Enterprise-Ready AI Integration

 

How can developers integrate AI coding capabilities directly into their GitHub repositories? Google has recently introduced Gemini CLI GitHub Actions, a new way to bring Gemini’s coding capabilities into GitHub workflows. Built on top of GitHub’s workflow automation framework, the release turns Gemini from a terminal-only coding assistant into a collaborative teammate that participates in issue triage, pull request reviews, and repository maintenance.

How is it different from Microsoft’s GitHub Copilot? Unlike Copilot, which requires paid subscriptions for advanced functionality, Google’s integration is available at no cost. This is particularly useful for open-source developers, small teams, and enterprises that want to embed AI into their workflows without additional licensing overhead.

From Terminal to Repository Integration

Google first released Gemini CLI earlier this year as a command-line interface that connected developers directly to the Gemini 2.5 Pro model. With a one-million-token context window, built-in tools, and open-source licensing, Gemini CLI was designed for local, developer-focused workflows.

The new GitHub Actions integration extends those capabilities to collaborative environments. Instead of operating only on a developer’s machine, Gemini can now participate in repository-level automation, assisting teams during code reviews, issue management, and continuous integration, which saves maintainers time and speeds up code deployment.

Core Capabilities

Gemini CLI GitHub Actions comes with three key use cases:

  1. Automated Issue Triage
    New issues are automatically labeled, categorized, and prioritized. This reduces the time maintainers spend manually managing backlogs and helps teams focus on critical bugs or features.
  2. AI-Powered Pull Request Reviews
    Every new pull request can be reviewed by Gemini before human reviewers step in. The system checks code for style adherence, potential bugs, and correctness, allowing human reviewers to focus on design-level concerns rather than surface-level errors.
  3. On-Demand Collaboration via Commands
    Developers can interact with Gemini directly in GitHub comments. By mentioning @gemini-cli and issuing commands such as /review, /triage, or /write-tests, they can trigger specific actions. This makes Gemini act like a conversational collaborator inside the repository, much as developers interact with each other in Slack or Jira.

Setup and Configuration

Integrating Gemini CLI GitHub Actions is very straightforward. Developers need Gemini CLI version 0.1.18 or higher. Running the command /setup-github inside the CLI scaffolds the necessary workflow files under .github/workflows and ensures configuration settings are properly managed.

For authentication, Google provides two methods:

  • API Key Authentication: Developers can store a GEMINI_API_KEY in GitHub Secrets. This method is simple and sufficient for most individual and team projects.
  • Workload Identity Federation (WIF): For enterprise users, WIF provides a more secure option by replacing long-lived credentials with short-lived, federated tokens. This approach aligns with modern security best practices for CI/CD pipelines.

Gemini’s behavior can be further customized using a GEMINI.md file placed in the repository. This file can contain coding guidelines, documentation links, or project-specific rules. The AI model then uses this context to tailor its reviews and responses.

Security Model

Beyond these capabilities, how secure is Gemini CLI GitHub Actions? Commands executed by the model run in isolated environments, with support for multiple sandboxing technologies: Docker, Podman, and macOS Seatbelt.

Additionally, since version 0.1.14 of Gemini CLI, all executions are logged for auditability. Any commands flagged as unusual or potentially unsafe require explicit developer confirmation before execution. For production environments, Google strongly recommends using WIF authentication to avoid risks associated with static API keys.

Example Workflow

The following minimal YAML configuration enables Gemini to automatically review pull requests. This workflow ensures that every new or updated pull request is analyzed by Gemini before merging, providing consistent automated review across the repository.

name: Gemini Pull Request Review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  gemini-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/run-gemini-cli@v0.1
        with:
          args: review --files .
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}

Summary

Gemini CLI GitHub Actions represents a significant step in Google’s effort to embed AI into collaborative software development. By combining free access, flexible configuration, and strong security practices, the release lowers the barrier for teams to experiment with AI-driven automation inside their repositories.

AI and the Brain: How DINOv3 Models Reveal Insights into Human Visual Processing

Introduction

Understanding how the brain builds internal representations of the visual world is one of the most fascinating challenges in neuroscience. Over the past decade, deep learning has reshaped computer vision, producing neural networks that not only perform at human-level accuracy on recognition tasks but also seem to process information in ways that resemble our brains. This unexpected overlap raises an intriguing question: can studying AI models help us better understand how the brain itself learns to see?

Researchers at Meta AI and École Normale Supérieure set out to explore this question by focusing on DINOv3, a self-supervised vision transformer trained on billions of natural images. They compared DINOv3’s internal activations with human brain responses to the same images, using two complementary neuroimaging techniques. fMRI provided high-resolution spatial maps of cortical activity, while MEG captured the precise timing of brain responses. Together, these datasets offered a rich view of how the brain processes visual information.

https://arxiv.org/pdf/2508.18226

Technical Details

The research team explores three factors that might drive brain-model similarity: model size, the amount of training data, and the type of images used for training. To do this, the team trained multiple versions of DINOv3, varying these factors independently.

Brain-Model Similarity

The research team found strong evidence of convergence while looking at how well DINOv3 matched brain responses. The model’s activations predicted fMRI signals in both early visual regions and higher-order cortical areas. Peak voxel correlations reached R = 0.45, and MEG results showed that alignment started as early as 70 milliseconds after image onset and lasted up to three seconds. Importantly, early DINOv3 layers aligned with regions like V1 and V2, while deeper layers matched activity in higher-order regions, including parts of the prefrontal cortex.
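
The analysis described here follows the standard voxel-wise encoding recipe: regress brain responses onto model activations, then correlate predictions with held-out data. The sketch below is a schematic of that method class, using synthetic data, and is not the authors' released analysis code.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# X: DINOv3 activations for each image (n_images, n_features)
# Y: fMRI responses to the same images (n_images, n_voxels); both synthetic here.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))
Y = X @ rng.normal(size=(512, 50)) * 0.1 + rng.normal(size=(200, 50))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)
encoder = Ridge(alpha=100.0).fit(X_tr, Y_tr)
pred = encoder.predict(X_te)

# Per-voxel Pearson correlation between predicted and measured responses.
r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y.shape[1])]
print("peak voxel correlation:", max(r))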

Training Trajectories

Tracking these similarities over the course of training revealed a developmental trajectory. Low-level visual alignments emerged very early, after only a small fraction of training, while higher-level alignments required billions of images. This mirrors the way the human brain develops, with sensory areas maturing earlier than associative cortices. The study showed that temporal alignment emerged fastest, spatial alignment more slowly, and encoding similarity in between, highlighting the layered nature of representational development.

Role of Model Factors

The role of model factors was equally telling. Larger models consistently achieved higher similarity scores, especially in higher-order cortical regions. Longer training improved alignment across the board, with high-level representations benefiting most from extended exposure. The type of images mattered as well: models trained on human-centric images produced the strongest alignment. Those trained on satellite or cellular images showed partial convergence in early visual regions but much weaker similarity in higher-level brain areas. This suggests that ecologically relevant data are crucial for capturing the full range of human-like representations.

Links to Cortical Properties

Interestingly, the timing of when DINOv3’s representations emerged also lined up with structural and functional properties of the cortex. Regions with greater developmental expansion, thicker cortex, or slower intrinsic timescales aligned later in training. Conversely, highly myelinated regions aligned earlier, reflecting their role in fast information processing. These correlations suggest that AI models can offer clues about the biological principles underlying cortical organization.

Nativism vs. Empiricism

The study highlights a balance between innate structure and learning. DINOv3’s architecture gives it a hierarchical processing pipeline, but full brain-like similarity only emerged with prolonged training on ecologically valid data. This interplay between architectural priors and experience echoes debates in cognitive science about nativism and empiricism.

Developmental Parallels

The parallels to human development are striking. Just as sensory cortices in the brain mature quickly and associative areas develop more slowly, DINOv3 aligned with sensory regions early in training and with prefrontal areas much later. This suggests that training trajectories in large-scale AI models may serve as computational analogues for the staged maturation of human brain functions.

Beyond the Visual Pathway

The results also extended beyond traditional visual pathways. DINOv3 showed alignment in prefrontal and multimodal regions, raising questions about whether such models capture higher-order features relevant for reasoning and decision-making. While this study focused only on DINOv3, it points toward exciting possibilities for using AI as a tool to test hypotheses about brain organization and development.

https://arxiv.org/pdf/2508.18226

Conclusion

In conclusion, this research shows that self-supervised vision models like DINOv3 are more than just powerful computer vision systems. They also approximate aspects of human visual processing, revealing how size, training, and data shape convergence between brains and machines. By studying how models learn to “see,” we gain valuable insights into how the human brain itself develops the ability to perceive and interpret the world.


Tencent Hunyuan Open-Sources Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B: State-of-the-Art Multilingual Translation Models

Introduction

Tencent’s Hunyuan team has released Hunyuan-MT-7B (a translation model) and Hunyuan-MT-Chimera-7B (an ensemble model). Both models are designed specifically for multilingual machine translation and were introduced in conjunction with Tencent’s participation in the WMT2025 General Machine Translation shared task, where Hunyuan-MT-7B ranked first in 30 out of 31 language pairs.

https://github.com/Tencent-Hunyuan/Hunyuan-MT/blob/main/Hunyuan_MT_Technical_Report.pdf

Model Overview

Hunyuan-MT-7B

  • A 7B parameter translation model.
  • Supports mutual translation across 33 languages, including Chinese ethnic minority languages such as Tibetan, Mongolian, Uyghur, and Kazakh.
  • Optimized for both high-resource and low-resource translation tasks, achieving state-of-the-art results among models of comparable size.

Hunyuan-MT-Chimera-7B

  • An integrated weak-to-strong fusion model.
  • Combines multiple translation outputs at inference time and produces a refined translation using reinforcement learning and aggregation techniques.
  • Represents the first open-source translation model of this type, improving translation quality beyond single-system outputs.
https://github.com/Tencent-Hunyuan/Hunyuan-MT/blob/main/Hunyuan_MT_Technical_Report.pdf

Training Framework

The models were trained using a five-stage framework designed for translation tasks:

  1. General Pre-training
    • 1.3 trillion tokens covering 112 languages and dialects.
    • Multilingual corpora assessed for knowledge value, authenticity, and writing style.
    • Diversity maintained through disciplinary, industry, and thematic tagging systems.
  2. MT-Oriented Pre-training
    • Monolingual corpora from mC4 and OSCAR, filtered using fastText (language ID), minLSH (deduplication), and KenLM (perplexity filtering).
    • Parallel corpora from OPUS and ParaCrawl, filtered with CometKiwi.
    • Replay of general pre-training data (20%) to avoid catastrophic forgetting.
  3. Supervised Fine-Tuning (SFT)
    • Stage I: ~3M parallel pairs (Flores-200, WMT test sets, curated Mandarin–minority data, synthetic pairs, instruction-tuning data).
    • Stage II: ~268k high-quality pairs selected through automated scoring (CometKiwi, GEMBA) and manual verification.
  4. Reinforcement Learning (RL)
    • Algorithm: GRPO.
    • Reward functions:
      • XCOMET-XXL and DeepSeek-V3-0324 scoring for quality.
      • Terminology-aware rewards (TAT-R1).
      • Repetition penalties to avoid degenerate outputs.
  5. Weak-to-Strong RL
    • Multiple candidate outputs are generated and aggregated through reward-based output selection (see the sketch after this list).
    • Applied in Hunyuan-MT-Chimera-7B, improving translation robustness and reducing repetitive errors.
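
To illustrate the weak-to-strong aggregation idea, here is a minimal sketch of reward-based candidate selection. The candidate generator and reward model are placeholders, and Chimera's actual fusion is learned with RL rather than a simple argmax, so treat this as a conceptual outline only.

from typing import Callable, List

def aggregate_translations(
    source: str,
    generate: Callable[[str, int], List[str]],   # returns n candidate translations
    score: Callable[[str, str], float],          # reward model: higher means better translation
    n_candidates: int = 6,
) -> str:
    # Generate several candidate translations, score each against the source,
    # and return the highest-reward candidate.
    candidates = generate(source, n_candidates)
    return max(candidates, key=lambda cand: score(source, cand))

# Toy usage with stand-in functions.
fake_generate = lambda src, n: [f"translation variant {i} of: {src}" for i in range(n)]
fake_score = lambda src, cand: len(cand)   # placeholder reward
print(aggregate_translations("你好，世界", fake_generate, fake_score))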

Benchmark Results

Automatic Evaluation

  • WMT24pp (English⇔XX): Hunyuan-MT-7B achieved 0.8585 (XCOMET-XXL), surpassing larger models like Gemini-2.5-Pro (0.8250) and Claude-Sonnet-4 (0.8120).
  • FLORES-200 (33 languages, 1056 pairs): Hunyuan-MT-7B scored 0.8758 (XCOMET-XXL), outperforming open-source baselines including Qwen3-32B (0.7933).
  • Mandarin⇔Minority Languages: Scored 0.6082 (XCOMET-XXL), higher than Gemini-2.5-Pro (0.5811), showing significant improvements in low-resource settings.

Comparative Results

  • Outperforms Google Translate by 15–65% across evaluation categories.
  • Outperforms specialized translation models such as Tower-Plus-9B and Seed-X-PPO-7B despite having fewer parameters.
  • Chimera-7B adds ~2.3% improvement on FLORES-200, particularly in Chinese⇔Other and non-English⇔non-Chinese translations.

Human Evaluation

A custom evaluation set (covering social, medical, legal, and internet domains) compared Hunyuan-MT-7B with state-of-the-art models:

  • Hunyuan-MT-7B: Avg. 3.189
  • Gemini-2.5-Pro: Avg. 3.223
  • DeepSeek-V3: Avg. 3.219
  • Google Translate: Avg. 2.344

This shows that Hunyuan-MT-7B, despite being smaller at 7B parameters, approaches the quality of much larger proprietary models.

Case Studies

The report highlights several real-world cases:

  • Cultural References: Correctly translates “小红薯” as the platform “REDnote,” unlike Google Translate’s “sweet potatoes.”
  • Idioms: Interprets “You are killing me” as “你真要把我笑死了” (expressing amusement), avoiding literal misinterpretation.
  • Medical Terms: Translates “uric acid kidney stones” precisely, while baselines generate malformed outputs.
  • Minority Languages: For Kazakh and Tibetan, Hunyuan-MT-7B produces coherent translations, where baselines fail or output nonsensical text.
  • Chimera Enhancements: Adds improvements in gaming jargon, intensifiers, and sports terminology.

Conclusion

Tencent’s release of Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B establishes a new standard for open-source translation. By combining a carefully designed training framework with a specialized focus on low-resource and minority language translation, the models achieve quality on par with or exceeding larger closed-source systems. The launch of these two models provides the AI research community with accessible, high-performance tools for multilingual translation research and deployment.



Google AI Introduces Stax: A Practical AI Tool for Evaluating Large Language Models (LLMs)

 

Evaluating large language models (LLMs) is not straightforward. Unlike traditional software testing, LLMs are probabilistic systems. This means they can generate different responses to identical prompts, which complicates testing for reproducibility and consistency. To address this challenge, Google has introduced Stax, an experimental developer tool that provides a structured way to assess and compare LLMs with custom and pre-built autoraters.

Stax is built for developers who want to understand how a model or a specific prompt performs for their use cases rather than relying solely on broad benchmarks or leaderboards.

Why Standard Evaluation Approaches Fall Short

Leaderboards and general-purpose benchmarks are useful for tracking model progress at a high level, but they don’t reflect domain-specific requirements. A model that does well on open-domain reasoning tasks may not handle specialized use cases such as compliance-oriented summarization, legal text analysis, or enterprise-specific question answering.

Stax addresses this by letting developers define the evaluation process in terms that matter to them. Instead of abstract global scores, developers can measure quality and reliability against their own criteria.

Key Capabilities of Stax

Quick Compare for Prompt Testing

The Quick Compare feature allows developers to test different prompts across models side by side. This makes it easier to see how variations in prompt design or model choice affect outputs, reducing time spent on trial-and-error.

Projects and Datasets for Larger Evaluations

When testing needs to go beyond individual prompts, Projects & Datasets provide a way to run evaluations at scale. Developers can create structured test sets and apply consistent evaluation criteria across many samples. This approach supports reproducibility and makes it easier to evaluate models under more realistic conditions.

Custom and Pre-Built Evaluators

At the center of Stax is the concept of autoraters. Developers can either build custom evaluators tailored to their use cases or use the pre-built evaluators provided. The built-in options cover common evaluation categories such as:

  • Fluency – grammatical correctness and readability.
  • Groundedness – factual consistency with reference material.
  • Safety – ensuring the output avoids harmful or unwanted content.

This flexibility helps align evaluations with real-world requirements rather than one-size-fits-all metrics.
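
To make the autorater concept concrete, here is a generic, framework-agnostic sketch of a custom evaluator. It does not use Stax's actual API; the rubric, judge call, and return format are all illustrative assumptions.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    score: float        # 0.0 to 1.0
    rationale: str

def groundedness_autorater(judge: Callable[[str], str]) -> Callable[[str, str], EvalResult]:
    # `judge` is any text-completion function (e.g., an LLM call); it is a placeholder here.
    def evaluate(response: str, reference: str) -> EvalResult:
        prompt = (
            "Rate from 0 to 1 how well the RESPONSE is supported by the REFERENCE.\n"
            f"REFERENCE:\n{reference}\n\nRESPONSE:\n{response}\n\n"
            "Answer with a single number."
        )
        raw = judge(prompt)
        try:
            score = min(max(float(raw.strip()), 0.0), 1.0)
        except ValueError:
            score = 0.0
        return EvalResult(score=score, rationale=raw)
    return evaluate

# Toy usage with a stand-in judge that always returns 0.8.
rater = groundedness_autorater(lambda prompt: "0.8")
print(rater("The sky is blue.", "The sky appears blue due to Rayleigh scattering."))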

Analytics for Model Behavior Insights

The Analytics dashboard in Stax makes results easier to interpret. Developers can view performance trends, compare outputs across evaluators, and analyze how different models perform on the same dataset. The focus is on providing structured insights into model behavior rather than single-number scores.

Practical Use Cases

  • Prompt iteration – refining prompts to achieve more consistent results.
  • Model selection – comparing different LLMs before choosing one for production.
  • Domain-specific validation – testing outputs against industry or organizational requirements.
  • Ongoing monitoring – running evaluations as datasets and requirements evolve.

Summary

Stax provides a systematic way to evaluate generative models with criteria that reflect actual use cases. By combining quick comparisons, dataset-level evaluations, customizable evaluators, and clear analytics, it gives developers tools to move from ad-hoc testing toward structured evaluation.

For teams deploying LLMs in production environments, Stax offers a way to better understand how models behave under specific conditions and to track whether outputs meet the standards required for real applications.

Meet Chatterbox Multilingual: An Open-Source Zero-Shot Text To Speech (TTS) Multilingual Model with Emotion Control and Watermarking

Resemble AI has recently released Chatterbox Multilingual, a production-grade open-source text-to-speech (TTS) model designed for zero-shot voice cloning in 23 languages. It is distributed under the MIT license, making it freely available for integration and modification. The system builds on the original Chatterbox framework and adds multilingual capability, expressive controls, and built-in watermarking for traceability.

What does Chatterbox Multilingual offer?

Chatterbox Multilingual enables voice cloning without retraining by leveraging zero-shot learning. You can generate a synthetic voice from a short audio sample that captures the speaker’s vocal characteristics. It supports 23 languages, including Arabic, Hindi, Chinese, Swahili, and other widely spoken languages, giving it coverage across diverse linguistic families.

Apart from basic voice cloning, the model integrates emotion and intensity controls, which allow users to specify not just what is said, but also how it is delivered. The model also includes PerTh watermarking by default to ensure that every output can be authenticated through neural watermark extraction. These features make the model suitable for tasks where both accuracy and security are important.

How does it compare with commercial systems?

Evaluations indicate that Chatterbox Multilingual performs competitively with most commercial TTS models. In a blind listener preference study run on Podonos, listeners expressed a 63.75% preference for Chatterbox over ElevenLabs. This suggests that in certain conditions, users found Chatterbox outputs closer to natural or accurate speech reproduction.

https://www.resemble.ai/chatterbox/

It is worth noting that while some reported numbers compare performance on specific languages such as German, the only verifiable public metric is the Podonos listener preference result. This makes preference-based benchmarking the most reliable evidence currently available.

How is expressive control implemented?

Chatterbox Multilingual not only reproduces voice identity but also provides tools for controlling delivery style. The model allows adjustment of emotion categories such as happy, sad, or angry, and includes an exaggeration parameter to regulate intensity. This means a cloned voice can be made more enthusiastic, subdued, or dramatic depending on context.

Such flexibility is useful in interactive media, dialog agents, gaming, and assistive technologies, where emotional nuance affects the effectiveness of communication. Rather than producing static or neutral speech, the system can generate output that adapts to context-specific needs.
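
For orientation, here is a short usage sketch based on the original Chatterbox repository's Python interface; the exact class names and arguments for the multilingual release may differ, so treat this as an assumption to be checked against the project's README.

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS   # class name assumed from the original Chatterbox repo

model = ChatterboxTTS.from_pretrained(device="cuda")

# Zero-shot cloning: condition on a short reference clip and raise expressiveness
# via the exaggeration control described above (parameter name assumed).
wav = model.generate(
    "Synthetic speech with a cloned voice.",
    audio_prompt_path="reference_speaker.wav",
    exaggeration=0.7,
)
ta.save("cloned_output.wav", wav, model.sr)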

How does watermarking contribute to responsible AI usage?

Every file generated by Chatterbox Multilingual contains PerTh (Perceptual Threshold) watermarking, a neural technique developed by Resemble AI. The watermark is inaudible to listeners but can be extracted using the provided open-source detector. This enables traceability and verification of generated content, an increasingly important factor as synthetic audio becomes more widespread.

By embedding watermarking at the system level and keeping it always active, Chatterbox helps mitigate risks of misuse without requiring external enforcement mechanisms. This design choice aligns with ongoing discussions about the ethics of generative audio systems.

What deployment options are available?

The open-source release provides a baseline system that can be installed and run by researchers, developers, or hobbyists under the permissive MIT license. For environments where high concurrency, latency targets, or compliance guarantees are necessary, Resemble AI offers a managed variant called Chatterbox Multilingual Pro.

This hosted version supports sub-200 ms latency, fine-tuned voices, and includes SLAs (service-level agreements) along with compliance features required in enterprise deployments. While the open-source project serves as a general foundation, the Pro service is aimed at production workloads with operational constraints.

What is the significance of Chatterbox Multilingual open release?

Chatterbox Multilingual contributes a multilingual, open, and controllable voice cloning system to the speech synthesis community. It integrates zero-shot cloning, expressivity controls, and watermarking in a framework that is both technically advanced and freely available.

Performance studies suggest it is competitive with leading proprietary solutions, offering a practical platform for further research and application development. Its open-source license makes it accessible to a broad range of users, from academic researchers to independent developers, strengthening the ecosystem of multilingual speech synthesis tools.

