Inside the Man vs. Machine Hackathon
At a weekend hackathon in San Francisco, more than 100 coders gathered to test whether they could beat AI—and win a $12,500 cash prize.
In her new book, feminist author Laura Bates explores how sexbots, AI assistants, and deepfakes are reinventing misogyny and harming women.
K2 Think compares well with reasoning models from OpenAI and DeepSeek but is smaller and more efficient, say researchers based in Abu Dhabi.
Anthropic agreed to a $1.5 billion settlement for authors whose books were used to train its AI model. As an author who fits that description, I’ve come around to the idea.
Mustafa Suleyman says that designing AI systems to exceed human intelligence—and to mimic behavior that suggests consciousness—would be “dangerous and misguided.”
The company behind ChatGPT is putting together a team capable of developing algorithms to control robots and appears to be hiring roboticists who work specifically on humanoids.
The field of biomedical artificial intelligence is evolving rapidly, with increasing demand for agents capable of performing tasks that span genomics, clinical diagnostics, and molecular biology. These agents aren’t merely designed to retrieve facts; they are expected to reason through complex biological problems, interpret patient data, and extract meaningful insights from vast biomedical databases. Unlike general-purpose AI models, biomedical agents must interface with domain-specific tools, comprehend biological hierarchies, and simulate workflows similar to those of researchers to effectively support modern biomedical research.
However, achieving expert-level performance in these tasks is far from trivial. Most large language models fall short when dealing with the nuance and depth of biomedical reasoning. They may succeed on surface-level retrieval or pattern recognition tasks, but often fail when challenged with multi-step reasoning, rare disease diagnosis, or gene prioritization, areas that require not just data access, but contextual understanding and domain-specific judgment. This limitation has created a clear gap: how to train biomedical AI agents that can think and act like domain experts.
While some solutions leverage supervised learning on curated biomedical datasets or retrieval-augmented generation to ground responses in literature or databases, these approaches have drawbacks. They often rely on static prompts and pre-defined behaviors that lack adaptability. Furthermore, many of these agents struggle to effectively execute external tools, and their reasoning chains collapse when faced with unfamiliar biomedical structures. This fragility makes them ill-suited for dynamic or high-stakes environments, where interpretability and accuracy are non-negotiable.
Researchers from Stanford University and UC Berkeley introduced a new family of models called Biomni-R0, built by applying reinforcement learning (RL) to a biomedical agent foundation. These models, Biomni-R0-8B and Biomni-R0-32B, were trained in an RL environment specifically tailored for biomedical reasoning, using both expert-annotated tasks and a novel reward structure. The collaboration combines Stanford’s Biomni agent and environment platform with UC Berkeley’s SkyRL reinforcement learning infrastructure, aiming to push biomedical agents past human-level capabilities.
The research introduced a two-phase training process. First, they used supervised fine-tuning (SFT) on high-quality trajectories sampled from Claude-4 Sonnet using rejection sampling, effectively bootstrapping the agent’s ability to follow structured reasoning formats. Next, they fine-tuned the models using reinforcement learning, optimizing for two kinds of rewards: one for correctness (e.g., selecting the right gene or diagnosis), and another for response formatting (e.g., using structured <think> and <answer> tags correctly).
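To make the reward design concrete, here is a minimal, hypothetical sketch of such a two-part reward; the function name and weighting are illustrative assumptions, not the published Biomni-R0 implementation:

import re

def compute_reward(response: str, gold_answer: str) -> float:
    # Format reward: reasoning must be wrapped in <think> tags and the final
    # output in <answer> tags, as described above.
    has_think = re.search(r"<think>.*?</think>", response, re.DOTALL) is not None
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_reward = 0.5 if (has_think and answer) else 0.0

    # Correctness reward: exact match against the expert-annotated label,
    # e.g. the right gene symbol or diagnosis.
    predicted = answer.group(1).strip().lower() if answer else ""
    correctness_reward = 1.0 if predicted == gold_answer.strip().lower() else 0.0

    return correctness_reward + format_reward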
To ensure computational efficiency, the team developed asynchronous rollout scheduling that minimized bottlenecks caused by external tool delays. They also expanded the context length to 64k tokens, allowing the agent to manage long multi-step reasoning conversations effectively.
The performance gains were significant. Biomni-R0-32B achieved a score of 0.669, a jump from the base model’s 0.346. Even Biomni-R0-8B, the smaller version, scored 0.588, outperforming general-purpose models like Claude 4 Sonnet and GPT-5, which are both much larger. On a task-by-task basis, Biomni-R0-32B scored highest on 7 out of 10 tasks, while GPT-5 led in 2, and Claude 4 in just 1. One of the most striking results was in rare disease diagnosis, where Biomni-R0-32B reached 0.67, compared to Qwen-32B’s 0.03, a more than 20× improvement. Similarly, in GWAS variant prioritization, the model’s score increased from 0.16 to 0.74, demonstrating the value of domain-specific reasoning.
Training large biomedical agents requires dealing with resource-heavy rollouts involving external tool execution, database queries, and code evaluation. To manage this, the system decoupled environment execution from model inference, allowing more flexible scaling and reducing idle GPU time. This innovation ensured efficient use of resources, even with tools that had varying execution latencies. Longer reasoning sequences also proved beneficial. The RL-trained models consistently produced lengthier, structured responses, which strongly correlated with better performance, highlighting that depth and structure in reasoning are key indicators of expert-level understanding in biomedicine.
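The decoupling idea can be sketched in generic asyncio terms; the structure below is illustrative only and is not the actual SkyRL scheduler:

import asyncio
import random

async def call_tool(step: int) -> str:
    # Stand-in for an external tool call (database query, code execution)
    # with unpredictable latency.
    await asyncio.sleep(random.uniform(0.1, 2.0))
    return f"tool-result-{step}"

async def rollout(episode: int) -> str:
    # One agent episode: awaiting slow tools frees the event loop so other
    # rollouts (and GPU inference batches) keep making progress.
    observation = f"episode-{episode}"
    for step in range(3):  # illustrative fixed horizon; model calls omitted
        observation = await call_tool(step)
    return observation

async def main():
    # Rollouts are scheduled concurrently and collected as they finish,
    # rather than in lockstep with the slowest tool.
    print(await asyncio.gather(*(rollout(i) for i in range(8))))

asyncio.run(main())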
EmbeddingGemma is Google’s new open text embedding model optimized for on-device AI, designed to balance efficiency with state-of-the-art retrieval performance.
At just 308 million parameters, EmbeddingGemma is lightweight enough to run on mobile devices and offline environments. Despite its size, it performs competitively with much larger embedding models. Inference latency is low (sub-15 ms for 256 tokens on EdgeTPU), making it suitable for real-time applications.
EmbeddingGemma was trained across 100+ languages and achieved the highest ranking on the Massive Text Embedding Benchmark (MTEB) among models under 500M parameters. Its performance rivals or exceeds embedding models nearly twice its size, particularly in cross-lingual retrieval and semantic search.
EmbeddingGemma is built on a Gemma 3–based encoder backbone with mean pooling. Importantly, the architecture does not use the multimodal-specific bidirectional attention layers that Gemma 3 applies for image inputs. Instead, EmbeddingGemma employs a standard transformer encoder stack with full-sequence self-attention, which is typical for text embedding models.
This encoder produces 768-dimensional embeddings and supports sequences up to 2,048 tokens, making it well-suited for retrieval-augmented generation (RAG) and long-document search. The mean pooling step ensures fixed-length vector representations regardless of input size.
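Mean pooling itself is simple to illustrate; the snippet below is a generic sketch (not EmbeddingGemma’s internal code) of averaging token-level hidden states into one fixed-length vector while ignoring padding:

import numpy as np

def mean_pool(token_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # token_states: (seq_len, hidden_dim) encoder outputs
    # attention_mask: (seq_len,) with 1 for real tokens, 0 for padding
    mask = attention_mask[:, None]
    summed = (token_states * mask).sum(axis=0)   # ignore padded positions
    count = max(mask.sum(), 1)                   # number of real tokens
    return summed / count                        # (hidden_dim,) fixed-length vector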
EmbeddingGemma employs Matryoshka Representation Learning (MRL). This allows embeddings to be truncated from 768 dimensions down to 512, 256, or even 128 dimensions with minimal loss of quality. Developers can tune the trade-off between storage efficiency and retrieval precision without retraining.
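In practice, Matryoshka truncation is just slicing and re-normalizing the vector; a minimal sketch, assuming cosine-similarity retrieval over unit-length vectors:

import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    # Keep the leading `dim` MRL dimensions, then re-normalize for cosine search.
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(768)             # stand-in for a 768-dim embedding
small = truncate_embedding(full, 256)   # 256-dim variant, roughly 3x less storage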
EmbeddingGemma was specifically designed for on-device, offline-first use cases. Since it shares a tokenizer with Gemma 3n, the same embeddings can directly power compact retrieval pipelines for local RAG systems, with privacy benefits from avoiding cloud inference.
It integrates seamlessly with common embedding and retrieval tooling; the quick-start below uses the sentence-transformers library:
(1) Load and Embed
from sentence_transformers import SentenceTransformer

# Download EmbeddingGemma from the Hugging Face Hub and embed a batch of texts.
model = SentenceTransformer("google/embeddinggemma-300m")
emb = model.encode(["example text to embed"])
print(emb.shape)  # one 768-dim vector per input text
(2) Adjust Embedding Size
Use full 768 dims for maximum accuracy or truncate to 512/256/128 dims for lower memory or faster retrieval.
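As a hedged usage sketch, recent versions of sentence-transformers expose a truncate_dim option for Matryoshka-style models; if your installed version lacks it, slicing and re-normalizing the vectors manually (as above) achieves the same effect:

from sentence_transformers import SentenceTransformer

# Ask the library for 256-dim vectors instead of the full 768 dims.
model = SentenceTransformer("google/embeddinggemma-300m", truncate_dim=256)
emb = model.encode(["example text to embed"])
print(emb.shape)  # expected: (1, 256)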
(3) Integrate into RAG
Run similarity search locally (cosine similarity) and feed top results into Gemma 3n for generation. This enables a fully offline RAG pipeline.
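A minimal sketch of that retrieval step follows; the toy documents and the final Gemma 3n generation call are placeholders, and only the embedding model name comes from the example above:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

docs = [
    "EmbeddingGemma produces 768-dimensional text embeddings.",
    "Gemma 3n is a compact generative model for on-device use.",
    "Mean pooling turns token states into one fixed-length vector.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)

query = "What size are EmbeddingGemma vectors?"
q_emb = model.encode([query], normalize_embeddings=True)[0]

# With unit-normalized vectors, cosine similarity is a plain dot product.
scores = doc_emb @ q_emb
top_k = np.argsort(-scores)[:2]
context = "\n".join(docs[i] for i in top_k)
# `context` would then be handed to Gemma 3n as grounding text for generation.
print(context)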
EmbeddingGemma proves that smaller embedding models can achieve best-in-class retrieval performance while being light enough for offline deployment. It marks an important step toward efficient, privacy-conscious, and scalable on-device AI.
Retrieval-Augmented Generation (RAG) systems generally rely on dense embedding models that map queries and documents into fixed-dimensional vector spaces. While this approach has become the default for many AI applications, recent research from the Google DeepMind team identifies a fundamental architectural limitation that cannot be solved by larger models or better training alone.
At the core of the issue is the representational capacity of fixed-size embeddings. An embedding of dimension d cannot represent all possible combinations of relevant documents once the database grows beyond a critical size. This follows from results in communication complexity and sign-rank theory.
These critical database sizes are best-case estimates derived under free embedding optimization, where vectors are directly optimized against test labels. Real-world language-constrained embeddings fail even earlier.
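One simplified way to state the underlying result (a paraphrase of the sign-rank argument, not the paper’s exact theorem): write A for the binary query-document relevance matrix, u_i for query embeddings, and v_j for document embeddings in R^d. Then

\[
\langle u_i, v_j \rangle \;>\; \langle u_i, v_k \rangle
\quad \text{whenever } A_{ij} = 1,\; A_{ik} = 0
\;\;\Longrightarrow\;\;
d \;\gtrsim\; \operatorname{rank}_{\pm}\!\left(2A - \mathbf{1}\right),
\]

i.e., embeddings that rank every relevant document above every irrelevant one can exist only when d is at least on the order of the sign rank of the ±1 relevance matrix, and for any fixed d there are relevance patterns over large enough document sets whose sign rank exceeds it.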
To test this limitation empirically, the Google DeepMind team introduced LIMIT (Limitations of Embeddings in Information Retrieval), a benchmark dataset specifically designed to stress-test embedders. LIMIT comes in two configurations: a larger full setup and a small setup with only 46 documents.
Even with just 46 documents, no embedder reaches full recall, highlighting that the limitation is not dataset size alone but the single-vector embedding architecture itself.
In contrast, BM25, a classical sparse lexical model, does not suffer from this ceiling. Sparse models operate in effectively unbounded dimensional spaces, allowing them to capture combinations that dense embeddings cannot.
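For contrast, a minimal BM25 sketch using the rank_bm25 package (the package choice is an assumption; any lexical BM25 implementation would behave similarly):

from rank_bm25 import BM25Okapi  # third-party package: pip install rank-bm25

corpus = [
    "quokkas like apples and leaves",
    "the patient likes apples",
    "embedding models map text to vectors",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "who likes apples".split()

# Sparse lexical scoring: every distinct term is effectively its own dimension,
# so there is no fixed-d ceiling on which relevance patterns can be expressed.
print(bm25.get_scores(query))
print(bm25.get_top_n(query, corpus, n=2))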
Current RAG implementations typically assume that embeddings can scale indefinitely with more data. The Google DeepMind research team shows why this assumption is incorrect: embedding dimensionality inherently constrains retrieval capacity, and that constraint touches any system built on single-vector retrieval. Even advanced benchmarks like MTEB fail to capture these limitations because they test only a narrow slice of query-document combinations.
The research team suggested that scalable retrieval will require moving beyond single-vector embeddings, toward approaches such as multi-vector retrievers, sparse lexical models, and cross-encoders.
The key insight is that architectural innovation is required, not simply larger embedders.
The research team’s analysis shows that dense embeddings, despite their success, are bound by a mathematical limit: they cannot capture all possible relevance combinations once corpus sizes exceed limits tied to embedding dimensionality. The LIMIT benchmark demonstrates this failure concretely: even state-of-the-art embedders fall short of full recall on a corpus of just 46 documents.
Classical techniques like BM25, or newer architectures such as multi-vector retrievers and cross-encoders, remain essential for building reliable retrieval engines at scale.
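To make the multi-vector idea concrete, here is a generic ColBERT-style late-interaction (MaxSim) scoring sketch; it is illustrative rather than any specific system’s implementation:

import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (num_query_tokens, d) unit-normalized token embeddings
    # doc_vecs:   (num_doc_tokens, d)   unit-normalized token embeddings
    # Each query token is matched to its best document token and the maxima
    # are summed; keeping one vector per token sidesteps the single-vector
    # capacity ceiling discussed above.
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

q = np.random.randn(4, 128);  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = np.random.randn(40, 128); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))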
The Allen Institute for AI (AI2) has released OLMoASR, a suite of open automatic speech recognition (ASR) models that rival closed-source systems such as OpenAI’s Whisper. Beyond just releasing model weights, AI2 has published training data identifiers, filtering steps, training recipes, and benchmark scripts—an unusually transparent move in the ASR space. This makes OLMoASR one of the most open and extensible platforms for speech recognition research.
Most speech recognition models available today—whether from OpenAI, Google, or Microsoft—are only accessible via APIs. While these services provide high performance, they operate as black boxes: the training datasets are opaque, the filtering methods are undocumented, and the evaluation protocols are not always aligned with research standards.
This lack of transparency poses challenges for reproducibility and scientific progress. Researchers cannot verify claims, test variations, or adapt models to new domains without re-building large datasets themselves. OLMoASR addresses this problem by opening the entire pipeline. The release is not just about enabling practical transcription—it’s about pushing ASR toward a more open, scientific foundation.
OLMoASR uses a transformer encoder–decoder architecture, the dominant paradigm in modern ASR.
This design is similar to Whisper, but OLMoASR makes the implementation fully open.
The family covers six model sizes, all trained on English.
This range allows developers to trade off between inference cost and accuracy. Smaller models are suited for embedded devices or real-time transcription, while the larger models maximize accuracy for research or batch workloads.
One of the core contributions of OLMoASR is the open release of training datasets, not just the models.
This massive collection contains weakly supervised speech paired with transcripts scraped from the web. It includes around 3 million hours of audio and 17 million text transcripts. Like Whisper’s original dataset, it is noisy, containing misaligned captions, duplicates, and transcription errors.
To address quality issues, AI2 applied a rigorous filtering pipeline to this pool.
The result is a high-quality, 1M-hour dataset that boosts zero-shot generalization—critical for real-world tasks where data may differ from training distributions.
This two-tiered data strategy mirrors practices in large-scale language model pretraining: use vast noisy corpora for scale, then refine with filtered subsets to improve quality.
AI2 benchmarked OLMoASR against Whisper across both short-form and long-form speech tasks, using datasets like LibriSpeech, TED-LIUM3, Switchboard, AMI, and VoxPopuli.
OLMoASR’s medium model nearly matches Whisper-medium.en, which achieves 12.4% WER on short-form and 10.5% WER on long-form benchmarks.
Even the tiny and base versions perform competitively.
This gives developers flexibility to choose models based on compute and latency requirements.
Transcribing audio takes just a few lines of code:
import olmoasr

# Load the medium English model in inference mode.
model = olmoasr.load_model("medium", inference=True)

# Transcribe a local audio file; the result carries the full transcript
# plus time-aligned segments.
result = model.transcribe("audio.mp3")
print(result)
The output includes both the transcription and time-aligned segments, making it useful for captioning, meeting transcription, or downstream NLP pipelines.
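Assuming the result follows a Whisper-style layout with a "text" field and a list of timed "segments" (an assumption to verify against the OLMoASR documentation), the segments could be consumed like this:

# Hypothetical output handling; check the OLMoASR documentation for the
# exact keys before relying on this structure.
for seg in result.get("segments", []):
    print(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text']}")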
Since AI2 provides full training code and recipes, OLMoASR can be fine-tuned for specialized domains.
This adaptability is critical: ASR performance often drops when models are used in specialized domains with domain-specific jargon. Open pipelines make domain adaptation straightforward.
OLMoASR opens up exciting opportunities across academic research and real-world AI development.
The release of OLMoASR shows that high-quality speech recognition can be developed and released in a way that prioritizes transparency and reproducibility. While the models are currently limited to English and still demand significant compute for training, they provide a solid foundation for adaptation and extension. This release sets a clear reference point for future work in open ASR and makes it easier for researchers and developers to study, benchmark, and apply speech recognition models in different domains.