InstaDeep Introduces Nucleotide Transformer v3 (NTv3): A New Multi-Species Genomics Foundation Model, Designed for 1 Mb Context Lengths at Single-Nucleotide Resolution


Genomic prediction and design now require models that connect local motifs with megabase scale regulatory context and that operate across many organisms. Nucleotide Transformer v3, or NTv3, is InstaDeep’s new multi species genomics foundation model for this setting. It unifies representation learning, functional track and genome annotation prediction, and controllable sequence generation in a single backbone that runs on 1 Mb contexts at single nucleotide resolution.

Earlier Nucleotide Transformer models already showed that self supervised pretraining on thousands of genomes yields strong features for molecular phenotype prediction. The original series included models from 50M to 2.5B parameters trained on 3,200 human genomes and 850 additional genomes from diverse species. NTv3 keeps this sequence only pretraining idea but extends it to longer contexts and adds explicit functional supervision and a generative mode.

https://huggingface.co/spaces/InstaDeepAI/ntv3

Architecture for 1 Mb genomic windows

NTv3 uses a U-Net style architecture that targets very long genomic windows. A convolutional downsampling tower compresses the input sequence, a transformer stack models long range dependencies in that compressed space, and a deconvolution tower restores base level resolution for prediction and generation. Inputs are tokenized at the character level over A, T, C, G, N with special tokens such as <unk>, <pad>, <mask>, <cls>, <eos>, and <bos>. Sequence length must be a multiple of 128 tokens, and the reference implementation uses padding to enforce this constraint. All public checkpoints use single base tokenization with a vocabulary size of 11 tokens.
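
The tokenization and padding rules translate into a few lines of plain Python. The sketch below is illustrative only, in particular the ordering of the 11-token vocabulary is an assumption rather than the ordering used by the released checkpoints.

# Character level vocabulary: 6 special tokens plus 5 bases gives 11 tokens (ordering is assumed)
VOCAB = ["<unk>", "<pad>", "<mask>", "<cls>", "<eos>", "<bos>", "A", "T", "C", "G", "N"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

def encode(sequence: str, multiple: int = 128) -> list:
    # Map each base to its id, falling back to <unk> for unexpected characters
    ids = [TOKEN_TO_ID.get(base, TOKEN_TO_ID["<unk>"]) for base in sequence.upper()]
    # Right pad with <pad> so the length is a multiple of 128 tokens
    remainder = len(ids) % multiple
    if remainder:
        ids += [TOKEN_TO_ID["<pad>"]] * (multiple - remainder)
    return ids

ids = encode("ACGTN" * 50)   # 250 bases are padded to 384 tokens
assert len(ids) % 128 == 0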

The smallest public model, NTv3 8M pre, has about 7.69M parameters with hidden dimension 256, FFN dimension 1,024, 2 transformer layers, 8 attention heads, and 7 downsample stages. At the high end, NTv3 650M uses hidden dimension 1,536, FFN dimension 6,144, 12 transformer layers, 24 attention heads, and 7 downsample stages, and adds conditioning layers for species specific prediction heads.
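
For quick reference, the two ends of the released range can be written down as plain configuration dictionaries. The checkpoint names and field names below are illustrative, not the official config keys.

# Approximate hyperparameters of the smallest and largest public checkpoints (names illustrative)
NTV3_CONFIGS = {
    "ntv3-8m-pre": {
        "params": "7.69M", "hidden_dim": 256, "ffn_dim": 1024,
        "transformer_layers": 2, "attention_heads": 8, "downsample_stages": 7,
    },
    "ntv3-650m": {
        "params": "650M", "hidden_dim": 1536, "ffn_dim": 6144,
        "transformer_layers": 12, "attention_heads": 24, "downsample_stages": 7,
        "species_conditioning": True,  # conditioning layers for species specific heads
    },
}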

Training data

The NTv3 model is pretrained on 9 trillion base pairs from the OpenGenome2 resource using base resolution masked language modeling. After this stage, the model is post trained with a joint objective that integrates continued self supervision with supervised learning on approximately 16,000 functional tracks and annotation labels from 24 animal and plant species.

Performance and the NTv3 Benchmark

After post training, NTv3 achieves state of the art accuracy for functional track prediction and genome annotation across species. It outperforms strong sequence to function models and previous genomic foundation models on existing public benchmarks and on the new NTv3 Benchmark, a controlled downstream fine tuning suite with standardized 32 kb input windows and base resolution outputs.

The NTv3 Benchmark currently consists of 106 long range, single nucleotide, cross assay, cross species tasks. Because NTv3 sees thousands of tracks across 24 species during post training, the model learns a shared regulatory grammar that transfers between organisms and assays and supports coherent long range genome to function inference.

From prediction to controllable sequence generation

Beyond prediction, NTv3 can be fine tuned into a controllable generative model via masked diffusion language modeling. In this mode the model receives conditioning signals that encode desired enhancer activity levels and promoter selectivity, and it fills masked spans in the DNA sequence in a way that is consistent with those conditions.
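
Conceptually, generation proceeds by iteratively unmasking a masked span under the conditioning signal. The toy loop below only illustrates that sampling pattern, it is not InstaDeep's implementation: score_positions stands in for the model's per position predictions, and the conditioning dictionary is hypothetical.

import random

def score_positions(tokens, condition):
    # Stand-in for the model: propose a base and a confidence for every masked position.
    # A real model would condition on the desired enhancer activity and promoter selectivity.
    return [(i, random.choice("ACGT"), random.random())
            for i, t in enumerate(tokens) if t == "<mask>"]

def unmask_iteratively(tokens, condition, fraction_per_step=0.25):
    # Masked diffusion style sampling: commit the most confident fraction of masked
    # positions at each step, then re-score the remaining masks.
    while any(t == "<mask>" for t in tokens):
        proposals = sorted(score_positions(tokens, condition), key=lambda p: p[2], reverse=True)
        keep = max(1, int(len(proposals) * fraction_per_step))
        for pos, base, _ in proposals[:keep]:
            tokens[pos] = base
    return tokens

sequence = list("ACGT" * 8) + ["<mask>"] * 16 + list("TTGACA")
print("".join(unmask_iteratively(sequence, {"enhancer_activity": "high"})))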

In experiments described in the launch materials, the team designs 1,000 enhancer sequences with specified activity and promoter specificity and validates them in vitro using STARR seq assays in collaboration with the Stark Lab. The results show that these generated enhancers recover the intended ordering of activity levels and achieve more than a twofold improvement in promoter specificity compared with baselines.

Key Takeaways

  1. NTv3 is a long range, multi species genomics foundation model: It unifies representation learning, functional track prediction, genome annotation, and controllable sequence generation in a single U-Net style architecture that supports 1 Mb contexts at single nucleotide resolution across 24 animal and plant species.
  2. The model is trained on 9 trillion base pairs with joint self supervised and supervised objectives: NTv3 is pretrained on 9 trillion base pairs from OpenGenome2 with base resolution masked language modeling, then post trained on more than 16,000 functional tracks and annotation labels from 24 species using a joint objective that mixes continued self supervision with supervised learning.
  3. NTv3 achieves state of the art performance on the NTv3 Benchmark: After post training, NTv3 reaches state of the art accuracy for functional track prediction and genome annotation across species and outperforms previous sequence to function models and genomics foundation models on public benchmarks and on the NTv3 Benchmark, which contains 106 standardized long range downstream tasks with 32 kb input and base resolution outputs.
  4. The same backbone supports controllable enhancer design validated with STARR seq: NTv3 can be fine tuned as a controllable generative model using masked diffusion language modeling to design enhancer sequences with specified activity levels and promoter selectivity, and these designs are validated experimentally with STARR seq assays that confirm the intended activity ordering and improved promoter specificity.


Read More

Google Health AI Releases MedASR: a Conformer Based Medical Speech to Text Model for Clinical Dictation

 

The Google Health AI team has released MedASR, an open weights medical speech to text model that targets clinical dictation and physician patient conversations and is designed to plug directly into modern AI workflows.

What is MedASR and where does it fit?

MedASR is a speech to text model based on the Conformer architecture and is pre trained for medical dictation and transcription. It is positioned as a starting point for developers who want to build healthcare voice applications such as radiology dictation tools or visit note capture systems.

The model has 105 million parameters and accepts mono channel audio at 16000 hertz with 16 bit integer waveforms. It produces text only output, so it drops directly into downstream natural language processing or generative models such as MedGemma.

MedASR sits inside the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP and other domain specific medical models that share common terms of use and a consistent governance story.

Training data and domain specialization

MedASR is trained on a diverse corpus of de identified medical speech. The dataset includes about 5000 hours of physician dictations and clinical conversations across radiology, internal medicine and family medicine.

The training pairs audio segments with transcripts and metadata. Subsets of the conversational data are annotated with medical named entities including symptoms, medications and conditions. This gives the model strong coverage of clinical vocabulary and phrasing patterns that appear in routine documentation.

The model is English only, and most training audio comes from speakers for whom English is a first language and who were raised in the United States. The documentation notes that performance may be lower for other speaker profiles or noisy microphones and recommends fine tuning for such settings.

Architecture and decoding

MedASR follows the Conformer encoder design. Conformer combines convolution blocks with self attention layers so it can capture local acoustic patterns and longer range temporal dependencies in the same stack.

The model is exposed as an automatic speech recognition model with a CTC style interface. In the reference implementation, developers use AutoProcessor to create input features from waveform audio and AutoModelForCTC to produce token sequences. Decoding is greedy by default. The model can also be paired with an external six gram language model and beam search of size 8 to improve word error rate.

MedASR training uses JAX and ML Pathways on TPUv4p, TPUv5p and TPUv5e hardware. These systems provide the scale needed for large speech models and align with Google’s broader foundation model training stack.

Performance on medical speech tasks

Key results, reported as word error rate with greedy decoding and with a six gram language model, are:

  • RAD DICT, radiologist dictation: MedASR greedy 6.6 percent, MedASR plus language model 4.6 percent, Gemini 2.5 Pro 10.0 percent, Gemini 2.5 Flash 24.4 percent, Whisper v3 Large 25.3 percent.
  • GENERAL DICT, general and internal medicine: MedASR greedy 9.3 percent, MedASR plus language model 6.9 percent, Gemini 2.5 Pro 16.4 percent, Gemini 2.5 Flash 27.1 percent, Whisper v3 Large 33.1 percent.
  • FM DICT, family medicine: MedASR greedy 8.1 percent, MedASR plus language model 5.8 percent, Gemini 2.5 Pro 14.6 percent, Gemini 2.5 Flash 19.9 percent, Whisper v3 Large 32.5 percent.
  • Eye Gaze, dictation on 998 MIMIC chest X ray cases: MedASR greedy 6.6 percent, MedASR plus language model 5.2 percent, Gemini 2.5 Pro 5.9 percent, Gemini 2.5 Flash 9.3 percent, Whisper v3 Large 12.5 percent.

Developer workflow and deployment options

A minimal pipeline example is:

from transformers import pipeline
import huggingface_hub

audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")
pipe = pipeline("automatic-speech-recognition", model="google/medasr")
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)

For more control, developers load AutoProcessor and AutoModelForCTC, resample audio to 16000 hertz with librosa, move tensors to CUDA if available and call model.generate followed by processor.batch_decode.
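
A minimal sketch of that lower level path is shown below. It follows the generic Hugging Face CTC pattern, a forward pass, an argmax over the logits, then batch_decode; the exact preprocessing arguments and decoding call for google/medasr may differ, so treat this as an assumption rather than the official recipe.

import librosa
import torch
from transformers import AutoModelForCTC, AutoProcessor

processor = AutoProcessor.from_pretrained("google/medasr")
model = AutoModelForCTC.from_pretrained("google/medasr")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Resample to the 16 kHz mono input the model expects
waveform, _ = librosa.load("dictation.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits            # (batch, time, vocab)
predicted_ids = torch.argmax(logits, dim=-1)   # greedy decoding
print(processor.batch_decode(predicted_ids)[0])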

Key Takeaways

  1. MedASR is a lightweight, open weights Conformer based medical ASR model: It has 105M parameters, is trained specifically for medical dictation and transcription, and is released under the Health AI Developer Foundations program as an English only model for healthcare developers.
  2. Domain specific training on about 5000 hours of de identified medical audio: MedASR is pre trained on physician dictations and clinical conversations across specialties like radiology, internal medicine and family medicine, which gives it strong coverage of clinical terminology compared to general purpose ASR systems.
  3. Competitive or better word error rates on medical dictation benchmarks: On internal radiology, general medicine, family medicine and Eye Gaze datasets, MedASR with greedy or language model decoding matches or outperforms large general models such as Gemini 2.5 Pro, Gemini 2.5 Flash and Whisper v3 Large on word error rate for English medical speech.


Read More

How to Build a Proactive Pre-Emptive Churn Prevention Agent with Intelligent Observation and Strategy Formation

 

In this tutorial, we build a fully functional Pre-Emptive Churn Agent that proactively identifies at-risk users and drafts personalized re-engagement emails before they cancel. Rather than waiting for churn to occur, we design an agentic loop in which we observe user inactivity, analyze behavioral patterns, strategize incentives, and generate human-ready email drafts using Gemini. We orchestrate the entire process step by step, ensuring each component, from data simulation to manager approval, works seamlessly together.

import os
import time
import json
import random
from datetime import datetime, timedelta
from typing import List, Dict, Any
import textwrap


try:
   import google.generativeai as genai
except ImportError:
   !pip install -q -U google-generativeai
   import google.generativeai as genai


from google.colab import userdata
import getpass

We set up our environment, import all required libraries, and ensure Gemini is available for use. We keep the initialization minimal so the rest of the system loads cleanly. As we run it, we prepare the foundation for the agent-driven workflow that follows.

def setup_gemini():
   print("--- 🔐 Security Check ---")
   try:
       api_key = userdata.get('GEMINI_API_KEY')
   except:
       print("Please enter your Google Gemini API Key:")
       api_key = getpass.getpass("API Key: ")
   if not api_key:
       raise ValueError("API Key is required to run the agent.")
   genai.configure(api_key=api_key)
   return genai.GenerativeModel('gemini-2.5-flash')


class MockCustomerDB:
   def __init__(self):
       self.today = datetime.now()
       self.users = self._generate_mock_users()


   def _generate_mock_users(self) -> List[Dict]:
       profiles = [
           {"id": "U001", "name": "Sarah Connor", "plan": "Enterprise",
            "last_login_days_ago": 2, "top_features": ["Reports", "Admin Panel"], "total_spend": 5000},
           {"id": "U002", "name": "John Smith", "plan": "Basic",
            "last_login_days_ago": 25, "top_features": ["Image Editor"], "total_spend": 50},
           {"id": "U003", "name": "Emily Chen", "plan": "Pro",
            "last_login_days_ago": 16, "top_features": ["API Access", "Data Export"], "total_spend": 1200},
           {"id": "U004", "name": "Marcus Aurelius", "plan": "Enterprise",
            "last_login_days_ago": 45, "top_features": ["Team Management"], "total_spend": 8000}
       ]
       return profiles


   def fetch_at_risk_users(self, threshold_days=14) -> List[Dict]:
       return [u for u in self.users if u['last_login_days_ago'] >= threshold_days]

We configure authentication for Gemini and construct a mock customer database that behaves like a real system. We simulate users with varying levels of inactivity to generate realistic churn scenarios.

class ChurnPreventionAgent:
   def __init__(self, model):
       self.model = model


   def analyze_and_strategize(self, user: Dict) -> Dict:
       print(f"   ... 🧠 Analyzing strategy for {user['name']}...")
       prompt = f"""
       You are a Customer Success AI Specialist.
       Analyze this user profile and determine the best 'Win-Back Strategy'.
       USER PROFILE:
       - Name: {user['name']}
       - Plan: {user['plan']}
       - Days Inactive: {user['last_login_days_ago']}
       - Favorite Features: {', '.join(user['top_features'])}
       - Total Spend: ${user['total_spend']}
       TASK:
       1. Determine the 'Churn Probability' (Medium/High/Critical).
       2. Select a specific INCENTIVE.
       3. Explain your reasoning briefly.
       OUTPUT FORMAT:
       {{
           "risk_level": "High",
           "incentive_type": "Specific Incentive",
           "reasoning": "One sentence explanation."
       }}
       """
       try:
           response = self.model.generate_content(prompt)
           clean_json = response.text.replace("```json", "").replace("```", "").strip()
           return json.loads(clean_json)
       except Exception as e:
           return {
               "risk_level": "Unknown",
               "incentive_type": "General Check-in",
               "reasoning": f"Analysis failed: {str(e)}"
           }

We build the analytical core of our churn agent to evaluate user behavior and select win-back strategies. We let Gemini interpret signals, such as inactivity and usage patterns, to determine risk and incentives.

# Method of ChurnPreventionAgent, shown here as a separate snippet continuing the class above
def draft_engagement_email(self, user: Dict, strategy: Dict) -> str:
       print(f"   ... ✍  Drafting email for {user['name']} using '{strategy['incentive_type']}'...")
       prompt = f"""
       Write a short, empathetic, professional re-engagement email.
       TO: {user['name']}
       CONTEXT: They haven't logged in for {user['last_login_days_ago']} days.
       STRATEGY: {strategy['incentive_type']}
       REASONING: {strategy['reasoning']}
       USER HISTORY: They love {', '.join(user['top_features'])}.
       TONE: Helpful and concise.
       """
       response = self.model.generate_content(prompt)
       return response.text

We generate personalized re-engagement emails based on the strategy output from the previous step. We use Gemini to craft concise, empathetic messaging that aligns with each user’s history.

class ManagerDashboard:
   def review_draft(self, user_name, strategy, draft_text):
       print("n" + "="*60)
       print(f"🚨 REVIEW REQUIRED: Re-engagement for {user_name}")
       print(f"🎯 Strategy: {strategy['incentive_type']}")
       print(f"📝 Risk Level: {strategy['risk_level']}")
       print("-" * 60)
       print("📨 DRAFT EMAIL:n")
       print(textwrap.indent(draft_text, '    '))
       print("-" * 60)
       print("n[Auto-Simulation] Manager reviewing...")
       time.sleep(1.5)
       if strategy['risk_level'] == "Critical":
           print("✅ MANAGER DECISION: Approved (Priority Send)")
           return True
       else:
           print("✅ MANAGER DECISION: Approved")
           return True

We simulate a manager dashboard where human oversight approves or rejects the drafted email. We keep the flow simple but realistic, ensuring the agent’s actions remain aligned with human judgment.

def main():
   print("Initializing Agentic System...")
   try:
       model = setup_gemini()
       db = MockCustomerDB()
       agent = ChurnPreventionAgent(model)
       manager = ManagerDashboard()
   except Exception as e:
       print(f"Setup failed: {e}")
       return


   print("n🔍 AGENT STATUS: Scanning Database for inactive users (>14 days)...")
   at_risk_users = db.fetch_at_risk_users(threshold_days=14)
   print(f"Found {len(at_risk_users)} at-risk users.n")


   for user in at_risk_users:
       print(f"--- Processing Case: {user['id']} ({user['name']}) ---")
       strategy = agent.analyze_and_strategize(user)
       email_draft = agent.draft_engagement_email(user, strategy)
       approved = manager.review_draft(user['name'], strategy, email_draft)
       if approved:
           print(f"🚀 ACTION: Email queued for sending to {user['name']}.")
       else:
           print(f"🛑 ACTION: Email rejected.")
       print("n")
       time.sleep(1)


if __name__ == "__main__":
   main()

We orchestrate the full system: scanning for at-risk users, analyzing them, drafting messages, and routing everything for approval. We bring all components together into one continuous loop. 

In conclusion, we have completed a churn-prevention pipeline that observes, reasons, drafts, and involves a human reviewer before action. We watch the agent detect risk patterns, craft tailored strategies, and generate professional emails, all while maintaining human oversight for final decisions. This implementation demonstrates how agentic workflows can transform customer success operations by enabling timely, personalized, and scalable interventions. We now have a modular foundation we can expand further, connecting it to real databases, CRMs, web dashboards, or automation systems, to build a truly production-ready churn prevention engine.



Read More

Google DeepMind Researchers Release Gemma Scope 2 as a Full Stack Interpretability Suite for Gemma 3 Models

 

Google DeepMind researchers introduce Gemma Scope 2, an open suite of interpretability tools that exposes how Gemma 3 language models, from 270M to 27B parameters, process and represent information across all layers.

Its core goal is simple: give AI safety and alignment teams a practical way to trace model behavior back to internal features instead of relying only on input output analysis. When a Gemma 3 model jailbreaks, hallucinates, or shows sycophantic behavior, Gemma Scope 2 lets researchers inspect which internal features fired and how those activations flowed through the network.

What is Gemma Scope 2?

Gemma Scope 2 is a comprehensive, open suite of sparse autoencoders and related tools trained on internal activations of the Gemma 3 model family. Sparse autoencoders, SAEs, act as a microscope on the model. They decompose high dimensional activations into a sparse set of human inspectable features that correspond to concepts or behaviors.
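
To make the microscope analogy concrete, a minimal sparse autoencoder over a single activation vector looks like the sketch below. The dimensions and the L1 penalty are illustrative choices, not the Gemma Scope 2 training recipe, which uses the Matryoshka technique described later.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Decomposes a d_model activation into n_features sparse, human inspectable features.
    def __init__(self, d_model=2304, n_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activation):
        features = torch.relu(self.encoder(activation))   # sparse feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
activations = torch.randn(4, 2304)        # a batch of residual stream activations
features, reconstruction = sae(activations)
loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()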

Training Gemma Scope 2 required storing around 110 Petabytes of activation data and fitting over 1 trillion total parameters across all interpretability models.

The suite targets every Gemma 3 variant, including 270M, 1B, 4B, 12B and 27B parameter models, and covers the full depth of the network. This is important because many safety relevant behaviors only appear at larger scales.

What is new compared to the original Gemma Scope?

The first Gemma Scope release focused on Gemma 2 and already enabled research on model hallucination, identifying secrets known by a model and training safer models.

Gemma Scope 2 extends that work in four main ways:

  1. The tools now span the entire Gemma 3 family up to 27B parameters, which is needed to study emergent behaviors observed only in larger models, such as the behavior previously analyzed in the 27B size C2S Scale model for scientific discovery tasks.
  2. Gemma Scope 2 includes SAEs and transcoders trained on every layer of Gemma 3. Skip transcoders and cross layer transcoders help trace multi step computations that are distributed across layers.
  3. The suite applies the Matryoshka training technique so that SAEs learn more useful and stable features and mitigate some flaws identified in the earlier Gemma Scope release.
  4. There are dedicated interpretability tools for Gemma 3 models tuned for chat, which make it possible to analyze multi step behaviors such as jailbreaks, refusal mechanisms and chain of thought faithfulness.

Key Takeaways

  1. Gemma Scope 2 is an open interpretability suite for all Gemma 3 models, from 270M to 27B parameters, with SAEs and transcoders on every layer of both pretrained and instruction tuned variants.
  2. The suite uses sparse autoencoders as a microscope that decomposes internal activations into sparse, concept like features, plus transcoders that track how these features propagate across layers.
  3. Gemma Scope 2 is explicitly positioned for AI safety work to study jailbreaks, hallucinations, sycophancy, refusal mechanisms and discrepancies between internal state and communicated reasoning in Gemma 3.


Read More
Meta AI Open-Sourced Perception Encoder Audiovisual (PE-AV): The Audiovisual Encoder Powering SAM Audio And Large Scale Multimodal Retrieval


Meta researchers have introduced Perception Encoder Audiovisual, PEAV, as a new family of encoders for joint audio and video understanding. The model learns aligned audio, video, and text representations in a single embedding space using large scale contrastive training on about 100M audio video pairs with text captions.

From Perception Encoder to PEAV

Perception Encoder, PE, is the core vision stack in Meta’s Perception Models project. It is a family of encoders for images, video, and audio that reaches state of the art on many vision and audio benchmarks using a unified contrastive pretraining recipe. PE core surpasses SigLIP2 on image tasks and InternVideo2 on video tasks. PE lang powers Perception Language Model for multimodal reasoning. PE spatial is tuned for dense prediction tasks such as detection and depth estimation.

PEAV builds on this backbone and extends it to full audio video text alignment. In the Perception Models repository, PE audio visual is listed as the branch that embeds audio, video, audio video, and text into a single joint embedding space for cross modal understanding.

https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/

Architecture, Separate Towers and Fusion

The PEAV architecture is composed of a frame encoder, a video encoder, an audio encoder, an audio video fusion encoder, and a text encoder.

  • The video path uses the existing PE frame encoder on RGB frames, then applies a temporal video encoder on top of frame level features.
  • The audio path uses DAC VAE as a codec to convert raw waveforms into discrete audio tokens at fixed frame rate, about one embedding every 40 milliseconds.

These towers feed an audio video fusion encoder that learns a shared representation for both streams. The text encoder projects text queries into several specialized spaces. In practice this gives you a single backbone that can be queried in many ways. You can retrieve video from text, audio from text, audio from video, or retrieve text descriptions conditioned on any combination of modalities without retraining task specific heads.
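
As a rough illustration of that query flexibility, retrieval in a joint embedding space reduces to cosine similarity between a query embedding and candidate embeddings, whichever modalities produced them. The encode step is omitted and the embeddings below are random placeholders, this is not the released PE-AV API.

import torch
import torch.nn.functional as F

def retrieve(query_emb, candidate_embs, top_k=5):
    # Cosine similarity between one query and N candidates in the shared space
    query = F.normalize(query_emb, dim=-1)
    candidates = F.normalize(candidate_embs, dim=-1)
    scores = candidates @ query
    return torch.topk(scores, k=top_k)

# Placeholder embeddings: a text query scored against 1,000 audio video clips
text_emb = torch.randn(1024)
clip_embs = torch.randn(1000, 1024)
scores, indices = retrieve(text_emb, clip_embs)
print(indices.tolist())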


Data Engine, Synthetic Audiovisual Captions At Scale

The research team proposed a two stage audiovisual data engine that generates high quality synthetic captions for unlabeled clips. The pipeline first uses several weak audio caption models, their confidence scores, and separate video captioners as input to a large language model. This LLM produces three caption types per clip: one for audio content, one for visual content, and one for joint audio visual content. An initial PEAV model is trained on this synthetic supervision.

In the second stage, this initial PEAV is paired with a Perception Language Model decoder. Together they refine the captions to better exploit audiovisual correspondences. The two stage engine yields reliable captions for about 100M audio video pairs and uses about 92M unique clips for stage 1 pretraining and 32M additional unique clips for stage 2 fine tuning.

Compared to prior work that often focuses on speech or narrow sound domains, this corpus is designed to be balanced across speech, general sounds, music, and diverse video domains, which is important for general audio visual retrieval and understanding.

Contrastive Objective Across Ten Modality Pairs

PEAV uses a sigmoid based contrastive loss across audio, video, text, and fused representations. The research team explains that the model uses eight contrastive loss pairs during pretraining. These cover combinations such as audio text, video text, audio video text, and fusion related pairs. During fine tuning, two extra pairs are added, which brings the total to ten loss pairs among the different modality and caption types.

This objective is similar in form to contrastive objectives used in recent vision language encoders but generalized to audio video text tri modal training. By aligning all these views in one space, the same encoder can support classification, retrieval, and correspondence tasks with simple dot product similarities.
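
A minimal version of that sigmoid pairwise objective, written for a single modality pair such as audio and text, is sketched below; PEAV applies the same form across its eight to ten modality pairs, and the temperature and bias values here are assumptions.

import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(emb_a, emb_b, temperature=10.0, bias=-10.0):
    # emb_a, emb_b: (batch, dim) embeddings from two modalities, matched by index
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T * temperature + bias        # all pairwise similarities
    labels = 2 * torch.eye(emb_a.size(0)) - 1            # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

audio = torch.randn(8, 1024)
text = torch.randn(8, 1024)
print(sigmoid_contrastive_loss(audio, text))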

Performance Across Audio, Speech, Music And Video

On benchmarks, PEAV targets zero shot retrieval and classification across multiple domains. It achieves state of the art performance on several audio and video benchmarks compared with recent audio text and audio video text models such as CLAP, Audio Flamingo, ImageBind, and LanguageBind.

Concrete gains include:

  • On AudioCaps, text to audio retrieval improves from 35.4 R@1 to 45.8 R@1.
  • On VGGSound, clip level classification accuracy improves from 36.0 to 47.1.
  • For speech retrieval on VCTK style tasks, PEAV reaches 85.6 accuracy while earlier models are near 0.
  • On ActivityNet, text to video retrieval improves from 60.4 R@1 to 66.5 R@1.
  • On Kinetics 400, zero shot video classification improves from 76.9 to 78.9, beating models 2 to 4 times larger.

PEA-Frame, Frame Level Audio Text Alignment

Alongside PEAV, Meta releases Perception Encoder Audio Frame, PEA-Frame, for sound event localization. PEA-Frame is an audio text embedding model that outputs one audio embedding per 40 millisecond frame and a single text embedding per query. The model can return temporal spans that mark where in the audio each described event occurs.

PEA-Frame uses frame level contrastive learning to align audio frames with text. This enables precise localization of events such as specific speakers, instruments, or transient sounds in long audio sequences.
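
Given one embedding every 40 milliseconds, localization can be sketched as scoring each frame against the text embedding and turning contiguous above threshold frames into time spans. The threshold and the random embeddings below are illustrative assumptions.

import torch
import torch.nn.functional as F

def localize_event(frame_embs, text_emb, threshold=0.5, frame_ms=40):
    # frame_embs: (T, D), one embedding per 40 ms frame; text_emb: (D,) query embedding
    scores = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=0)
    active = (scores > threshold).tolist()
    spans, start = [], None
    for i, flag in enumerate(active + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start * frame_ms / 1000, i * frame_ms / 1000))  # seconds
            start = None
    return spans

frames = torch.randn(250, 1024)   # 10 seconds of audio at one frame per 40 ms
query = torch.randn(1024)
print(localize_event(frames, query))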

Role In The Perception Models And SAM Audio Ecosystem

PEAV and PEA-Frame sit inside the broader Perception Models stack, which combines PE encoders with Perception Language Model for multimodal generation and reasoning.

PEAV is also the core perception engine behind Meta’s new SAM Audio model and its Judge evaluator. SAM Audio uses PEAV embeddings to connect visual prompts and text prompts to sound sources in complex mixtures and to score the quality of separated audio tracks.

Key Takeaways

  • PEAV is a unified encoder for audio, video, and text, trained with contrastive learning on over 100M videos, and embeds audio, video, audio video, and text into a single joint space for cross modal retrieval and understanding.
  • The architecture uses separate video and audio towers, with PE based visual encoding and DAC VAE audio tokenization, followed by an audio visual fusion encoder and specialized text heads aligned to different modality pairs.
  • A 2 stage data engine generates synthetic audio, visual, and audio visual captions using weaker captioners plus an LLM in stage 1 and PEAV plus Perception Language Model in stage 2, enabling large scale multimodal supervision without manual labels.
  • PEAV establishes new state of the art on a wide range of audio and video benchmarks through a sigmoid contrastive objective over multiple modality pairs, with six public checkpoints from small 16 frame to large all frame variants, where average retrieval improves from about 45 to 51.6.
  • PEAV, together with the frame level PEA-Frame variant, forms the perception backbone for Meta’s SAM Audio system, providing the embeddings used for prompt based audio separation and fine grained sound event localization across speech, music, and general sounds.

