xAI launches Grok-4-Fast: Unified Reasoning and Non-Reasoning Model with 2M-Token Context and Trained End-to-End with Tool-Use Reinforcement Learning (RL)


xAI introduced Grok-4-Fast, a cost-optimized successor to Grok-4 that merges “reasoning” and “non-reasoning” behaviors into a single set of weights controllable via system prompts. The model targets high-throughput search, coding, and Q&A with a 2M-token context window and native tool-use RL that decides when to browse the web, execute code, or call tools.

Architecture note

Previous Grok releases split long-chain “reasoning” and short “non-reasoning” responses across separate models. Grok-4-Fast’s unified weight space reduces end-to-end latency and tokens by steering behavior via system prompts, which is relevant for real-time applications (search, assistive agents, and interactive coding) where switching models penalizes both latency and cost.

Search and agentic use

Grok-4-Fast was trained end-to-end with tool-use reinforcement learning and shows gains on search-centric agent benchmarks: BrowseComp 44.9%, SimpleQA 95.0%, Reka Research 66.0%, plus higher scores on Chinese variants (e.g., BrowseComp-zh 51.2%). xAI also cites private battle-testing on LMArena where grok-4-fast-search (codename “menlo”) ranks #1 in the Search Arena with 1163 Elo, and the text variant (codename “tahoe”) sits at #8 in the Text Arena, roughly on par with grok-4-0709.

Performance and efficiency deltas

On internal and public benchmarks, Grok-4-Fast posts frontier-class scores while cutting token usage. xAI reports pass@1 results of 92.0% (AIME 2025, no tools), 93.3% (HMMT 2025, no tools), 85.7% (GPQA Diamond), and 80.0% (LiveCodeBench Jan–May), approaching or matching Grok-4 but using ~40% fewer “thinking” tokens on average. The company frames this as “intelligence density,” claiming a ~98% reduction in price to reach the same benchmark performance as Grok-4 when the lower token count and new per-token pricing are combined.

Deployment and price

The model is generally available to all users in Grok’s Fast and Auto modes across web and mobile; Auto will select Grok-4-Fast for difficult queries to improve latency without losing quality, and—for the first time—free users access xAI’s latest model tier. For developers, xAI exposes two SKUs, grok-4-fast-reasoning and grok-4-fast-non-reasoning, both with 2M context. Pricing (xAI API) is $0.20 / 1M input tokens (<128k), $0.40 / 1M input tokens (≥128k), $0.50 / 1M output tokens (<128k), $1.00 / 1M output tokens (≥128k), and $0.05 / 1M cached input tokens.
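
As a quick orientation for developers, the snippet below sketches how the two SKUs could be called and what a small request costs at the listed rates. It assumes xAI's OpenAI-compatible chat completions endpoint and the openai Python SDK; the environment variable name is illustrative.

import os
from openai import OpenAI

# Assumes xAI's OpenAI-compatible endpoint; the key variable name is illustrative.
client = OpenAI(api_key=os.environ["XAI_API_KEY"], base_url="https://api.x.ai/v1")

response = client.chat.completions.create(
    model="grok-4-fast-reasoning",  # or "grok-4-fast-non-reasoning"
    messages=[
        {"role": "system", "content": "Think step by step before answering."},
        {"role": "user", "content": "Summarize the trade-offs of tool-use RL."},
    ],
)
print(response.choices[0].message.content)

# Rough cost at the <128k rates above: 10k input + 2k output tokens
# = 10_000 * $0.20/1e6 + 2_000 * $0.50/1e6 ≈ $0.003 per request.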

https://x.ai/news/grok-4-fast

5 Technical Takeaways:

  • Unified model + 2M context. Grok-4-Fast uses a single weight space for “reasoning” and “non-reasoning,” prompt-steered, with a 2,000,000-token window across both SKUs.
  • Pricing for scale. API pricing starts at $0.20/M input, $0.50/M output, with cached input at $0.05/M and higher rates only beyond 128K context.
  • Efficiency claims. xAI reports ~40% fewer “thinking” tokens at comparable accuracy vs Grok-4, yielding a ~98% lower price to match Grok-4 performance on frontier benchmarks.
  • Benchmark profile. Reported pass@1: AIME-2025 92.0%, HMMT-2025 93.3%, GPQA-Diamond 85.7%, LiveCodeBench (Jan–May) 80.0%.
  • Agentic/search use. Post-training with tool-use RL; positioned for browsing/search workflows with documented search-agent metrics and live-search billing in docs.

Summary

Grok-4-Fast packages Grok-4-level capability into a single, prompt-steerable model with a 2M-token window, tool-use RL, and pricing tuned for high-throughput search and agent workloads. Early public signals (LMArena #1 in Search, competitive Text placement) align with xAI’s claim of similar accuracy using ~40% fewer “thinking” tokens, translating to lower latency and unit cost in production.


Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens

 

Xiaomi’s MiMo team released MiMo-Audio, a 7-billion-parameter audio-language model that runs a single next-token objective over interleaved text and discretized speech, scaling pretraining beyond 100 million hours of audio.

What’s actually new?

Instead of relying on task-specific heads or lossy acoustic tokens, MiMo-Audio uses a bespoke RVQ (residual vector quantization) tokenizer that targets both semantic fidelity and high-quality reconstruction. The tokenizer runs at 25 Hz and outputs 8 RVQ layers (≈200 tokens/s), giving the LM access to “lossless” speech features it can model autoregressively alongside text.

Architecture: patch encoder → 7B LLM → patch decoder

To handle the audio/text rate mismatch, the system packs four timesteps per patch for LM consumption (downsampling 25 Hz → 6.25 Hz), then reconstructs full-rate RVQ streams with a causal patch decoder. A delayed multi-layer RVQ generation scheme staggers predictions per codebook to stabilize synthesis and respect inter-layer dependencies. All three parts—patch encoder, MiMo-7B backbone, and patch decoder—are trained under a single next-token objective.
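
To make the rate math concrete, here is a back-of-the-envelope calculation using only the numbers above (25 Hz tokenizer, 8 RVQ layers, 4 timesteps per patch); it is an illustrative sketch, not code from the MiMo-Audio release.

TOKENIZER_HZ = 25   # RVQ frames per second
RVQ_LAYERS = 8      # codebooks per frame (≈200 tokens/s total)
PATCH_SIZE = 4      # timesteps grouped into one LM patch

def audio_lengths(seconds):
    frames = TOKENIZER_HZ * seconds        # acoustic frames
    rvq_tokens = frames * RVQ_LAYERS       # tokens the patch decoder must emit
    lm_patches = frames / PATCH_SIZE       # sequence length seen by the 7B LLM (6.25/s)
    return frames, rvq_tokens, lm_patches

print(audio_lengths(60))  # one minute of audio -> 1500 frames, 12000 RVQ tokens, 375 patches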

https://xiaomimimo.github.io/MiMo-Audio-Demo/

Scale is the algorithm

Training proceeds in two big phases: (1) an “understanding” stage that optimizes text-token loss over interleaved speech-text corpora, and (2) a joint “understanding + generation” stage that turns on audio losses for speech continuation, S2T/T2S tasks, and instruction-style data. The report emphasizes a compute/data threshold where few-shot behavior appears to “switch on,” echoing emergence curves seen in large text-only LMs.

Benchmarks: speech intelligence and general audio

MiMo-Audio is evaluated on speech-reasoning suites (e.g., SpeechMMLU) and broad audio understanding benchmarks (e.g., MMAU), reporting strong scores across speech, sound, and music and a reduced “modality gap” between text-only and speech-in/speech-out settings. Xiaomi also releases MiMo-Audio-Eval, a public toolkit to reproduce these results. Listen-and-respond demos (speech continuation, voice/emotion conversion, denoising, and speech translation) are available online.


Why is this important?

The approach is intentionally simple—no multi-head task tower, no bespoke ASR/TTS objectives at pretraining time—just GPT-style next-token prediction over lossless audio tokens plus text. The key engineering ideas are (i) a tokenizer the LM can actually use without throwing away prosody and speaker identity; (ii) patchification to keep sequence lengths manageable; and (iii) delayed RVQ decoding to preserve quality at generation time. For teams building spoken agents, those design choices translate into few-shot speech-to-speech editing and robust speech continuation with minimal task-specific finetuning.

6 Technical Takeaways:

  1. High-Fidelity Tokenization
    MiMo-Audio uses a custom RVQ tokenizer operating at 25 Hz with 8 active codebooks, ensuring speech tokens preserve prosody, timbre, and speaker identity while keeping them LM-friendly.
  2. Patchified Sequence Modeling
    The model reduces sequence length by grouping 4 timesteps into one patch (25 Hz → 6.25 Hz), letting the 7B LLM handle long speech efficiently without discarding detail.
  3. Unified Next-Token Objective
    Rather than separate heads for ASR, TTS, or dialogue, MiMo-Audio trains under a single next-token prediction loss across interleaved text and audio, simplifying architecture while supporting multi-task generalization.
  4. Emergent Few-Shot Abilities
    Few-shot behaviors such as speech continuation, voice conversion, emotion transfer, and speech translation emerge once training surpasses a large-scale data threshold (~100M hours, trillions of tokens).
  5. Benchmark Leadership
    MiMo-Audio sets state-of-the-art scores on SpeechMMLU (S2S 69.1, T2S 71.5) and MMAU (66.0 overall), while minimizing the text-to-speech modality gap to just 3.4 points.
  6. Open Ecosystem Release
    Xiaomi provides the tokenizer, 7B checkpoints (base and instruct), MiMo-Audio-Eval toolkit, and public demos, enabling researchers and developers to test and extend speech-to-speech intelligence in open-source pipelines.

Summary

MiMo-Audio demonstrates that high-fidelity, RVQ-based “lossless” tokenization combined with patchified next-token pretraining at scale is sufficient to unlock few-shot speech intelligence without task-specific heads. The 7B stack—tokenizer → patch encoder → LLM → patch decoder—bridges the audio/text rate gap (25→6.25 Hz) and preserves prosody and speaker identity via delayed multi-layer RVQ decoding. Empirically, the model narrows the text↔speech modality gap, generalizes across speech/sound/music benchmarks, and supports in-context S2S editing and continuation.



Run MATLAB-Style Code Inside Python by Connecting Octave with the oct2py Library

 

In this tutorial, we explore how we can seamlessly run MATLAB-style code inside Python by connecting Octave with the oct2py library. We set up the environment on Google Colab, exchange data between NumPy and Octave, write and call .m files, visualize plots generated in Octave within Python, and even work with toolboxes, structs, and .mat files. By doing this, we gain the flexibility of Python’s ecosystem while continuing to leverage the familiar syntax and numerical power of MATLAB/Octave in a single workflow.

!apt-get -qq update
!apt-get -qq install -y octave gnuplot octave-signal octave-control > /dev/null
!python -m pip -q install oct2py scipy matplotlib pillow


from oct2py import Oct2Py, Oct2PyError
import numpy as np, matplotlib.pyplot as plt, textwrap
from scipy.io import savemat, loadmat
from PIL import Image


oc = Oct2Py()
print("Octave version:", oc.eval("version"))


def show_png(path, title=None):
   img = Image.open(path)
   plt.figure(figsize=(5,4)); plt.imshow(img); plt.axis("off")
   if title: plt.title(title)
   plt.show()

We begin by setting up Octave and essential libraries in Google Colab, ensuring that we have Octave-Forge packages and Python dependencies ready. We then initialize an Oct2Py session and define a helper function so we can display Octave-generated plots directly in our Python workflow.

print("n--- Basic eval ---")
print(oc.eval("A = magic(4); A"))
print("eig(A) diag:", oc.eval("[V,D]=eig(A); diag(D)'"))
print("sin(pi/4):", oc.eval("sin(pi/4)"))


print("n--- NumPy exchange ---")
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + 0.1*np.random.randn(x.size)
y_filt = oc.feval("conv", y, np.ones(5)/5.0, "same") 
print("y_filt shape:", np.asarray(y_filt).shape)


print("n--- Cells & Structs ---")
cells = ["hello", 42, [1,2,3]]
oc.push("C", cells)
oc.eval("s = struct('name','Ada','score',99,'tags',{C});")
s = oc.pull("s")
print("Struct from Octave -> Python:", s)

We test the bridge between Python and Octave by running basic matrix operations, eigenvalue decomposition, and trigonometric evaluations directly in Octave. We then exchange NumPy arrays with Octave to perform a convolution filter and verify its shape. Finally, we demonstrate how to push Python lists into Octave as cell arrays, build a struct, and retrieve it back into Python for seamless data sharing.

print("n--- Writing and calling .m files ---")
gd_code = r"""
function [w, hist] = gradient_descent(X, y, alpha, iters)
 % X: (n,m), y: (n,1). Adds bias; returns weights and loss history.
 if size(X,2) == 0, error('X must be 2D'); end
 n = rows(X);
 Xb = [ones(n,1), X];
 m = columns(Xb);
 w = zeros(m,1);
 hist = zeros(iters,1);
 for t=1:iters
   yhat = Xb*w;
   g = (Xb'*(yhat - y))/n;
   w = w - alpha * g;
   hist(t) = (sum((yhat - y).^2)/(2*n));
 endfor
endfunction
"""
with open("gradient_descent.m","w") as f: f.write(textwrap.dedent(gd_code))


np.random.seed(0)
X = np.random.randn(200, 3)
true_w = np.array([2.0, -1.0, 0.5, 3.0])    
y = true_w[0] + X @ true_w[1:] + 0.3*np.random.randn(200)
w_est, hist = oc.gradient_descent(X, y.reshape(-1,1), 0.1, 100, nout=2)
print("Estimated w:", np.ravel(w_est))
print("Final loss:", float(np.ravel(hist)[-1]))


print("n--- Octave plotting -> PNG -> Python display ---")
oc.eval("x = linspace(0,2*pi,400); y = sin(2*x) .* exp(-0.2*x);")
oc.eval("figure('visible','off'); plot(x,y,'linewidth',2); grid on; title('Damped Sine (Octave)');")
plot_path = "/content/oct_plot.png"
oc.eval(f"print('{plot_path}','-dpng'); close all;")
show_png(plot_path, title="Octave-generated Plot")

We write a custom gradient_descent.m in Octave, call it from Python with nout=2, and confirm that we recover reasonable weights and a decreasing loss. We then render a damped sine plot in an off-screen Octave figure and display the saved PNG inline in our Python notebook, keeping the whole workflow seamless.

print("n--- Packages (signal/control) ---")
signal_ok = True
try:
   oc.eval("pkg load signal; pkg load control;")
   print("Loaded: signal, control")
except Oct2PyError as e:
   signal_ok = False
   print("Could not load signal/control, skipping package demo.nReason:", str(e).splitlines()[0])


if signal_ok:
   oc.push("t", np.linspace(0,1,800))
   oc.eval("x = sin(2*pi*5*t) + 0.5*sin(2*pi*40*t);")
   oc.eval("[b,a] = butter(4, 10/(800/2)); xf = filtfilt(b,a,x);")
   xf = oc.pull("xf")
   plt.figure(); plt.plot(xf); plt.title("Octave signal package: filtered"); plt.show()


print("n--- Function handles ---")
oc.eval("""
f = @(z) z.^2 + 3*z + 2;
vals = feval(f, [0 1 2 3]);
""")
vals = oc.pull("vals")
print("f([0,1,2,3]) =", np.ravel(vals))


quadfun_code = r"""
function y = quadfun(z)
 y = z.^2 + 3*z + 2;
end
"""
with open("quadfun.m","w") as f: f.write(textwrap.dedent(quadfun_code))
vals2 = oc.quadfun(np.array([0,1,2,3], dtype=float))
print("quadfun([0,1,2,3]) =", np.ravel(vals2))

We load the signal and control packages so we can design a Butterworth filter in Octave and visualize the filtered waveform back in Python. We also work with function handles by evaluating an anonymous quadratic inside Octave and, for robustness, define a named quadfun.m that we call from Python, showing both handle-based and file-based function calls in the same flow.

print("n--- .mat I/O ---")
data_py = {"A": np.arange(9).reshape(3,3), "label": "demo"}
savemat("demo.mat", data_py)
oc.eval("load('demo.mat'); A2 = A + 1;")
oc.eval("save('-mat','demo_from_octave.mat','A2','label');")
back = loadmat("demo_from_octave.mat")
print("Keys from Octave-saved mat:", list(back.keys()))


print("n--- Error handling ---")
try:
   oc.eval("no_such_function(1,2,3);")
except Oct2PyError as e:
   print("Caught Octave error as Python exception:n", str(e).splitlines()[0])


print("n--- Simple Octave benchmark ---")
oc.eval("N = 2e6; a = rand(N,1);")


oc.eval("tic; s1 = sum(a); tv = toc;")
t_vec = float(oc.pull("tv"))


oc.eval("tic; s2 = 0; for i=1:length(a), s2 += a(i); end; tl = toc;")
t_loop = float(oc.pull("tl"))


print(f"Vectorized sum: {t_vec:.4f}s | Loop sum: {t_loop:.4f}s")


print("n--- Multi-file pipeline ---")
pipeline_m = r"""
function out = mini_pipeline(x, fs)
 try, pkg load signal; catch, end
 [b,a] = butter(6, 0.2);
 y = filtfilt(b,a,x);
 y_env = abs(hilbert(y));
 out = struct('rms', sqrt(mean(y.^2)), 'peak', max(abs(y)), 'env', y_env(1:10));
end
"""
with open("mini_pipeline.m","w") as f: f.write(textwrap.dedent(pipeline_m))


fs = 200.0
sig = np.sin(2*np.pi*3*np.linspace(0,3,int(3*fs))) + 0.1*np.random.randn(int(3*fs))
out = oc.mini_pipeline(sig, fs, nout=1)
print("mini_pipeline -> keys:", list(out.keys()))
print("RMS ~", float(out["rms"]), "| Peak ~", float(out["peak"]), "| env head:", np.ravel(out["env"])[:5])


print("nAll sections executed. You are now running MATLAB/Octave code from Python!")

We exchange .mat files between Python and Octave, confirming that data flows both ways without issues. We also test error handling by catching an Octave error as a Python exception. Next, we benchmark vectorized versus looped summations in Octave, showing the performance edge of vectorization. Finally, we build a multi-file pipeline that applies filtering and envelope detection, returning key statistics to Python, which demonstrates how we can organize Octave code into reusable components within our Python workflow.

In conclusion, we see how we can integrate Octave’s MATLAB-compatible features directly into Python and Colab. We successfully test data exchange, custom functions, plotting, package usage, and performance benchmarking, demonstrating that we can mix MATLAB/Octave workflows with Python without leaving our notebook. By combining the strengths of both environments, we put ourselves in a position to solve problems more efficiently and with greater flexibility.


Google’s Sensible Agent Reframes Augmented Reality (AR) Assistance as a Coupled “what+how” Decision—So What does that Change?

 

Sensible Agent is an AI research framework and prototype from Google that chooses both the action an augmented reality (AR) agent should take and the interaction modality to deliver/confirm it, conditioned on real-time multimodal context (e.g., whether hands are busy, ambient noise, social setting). Rather than treating “what to suggest” and “how to ask” as separate problems, it computes them jointly to minimize friction and social awkwardness in the wild.

https://research.google/pubs/sensible-agent-a-framework-for-unobtrusive-interaction-with-proactive-ar-agent/

What interaction failure modes is it targeting?

Voice-first prompting is brittle: it’s slow under time pressure, unusable with busy hands/eyes, and awkward in public. Sensible Agent’s core bet is that a high-quality suggestion delivered through the wrong channel is effectively noise. The framework explicitly models the joint decision of (a) what the agent proposes (recommend/guide/remind/automate) and (b) how it’s presented and confirmed (visual, audio, or both; inputs via head nod/shake/tilt, gaze dwell, finger poses, short-vocabulary speech, or non-lexical conversational sounds). By binding content selection to modality feasibility and social acceptability, the system aims to lower perceived effort while preserving utility.

How is the system architected at runtime?

A prototype on an Android-class XR headset implements a pipeline with three main stages. First, context parsing fuses egocentric imagery (vision-language inference for scene/activity/familiarity) with an ambient audio classifier (YAMNet) to detect conditions like noise or conversation. Second, a proactive query generator prompts a large multimodal model with few-shot exemplars to select the action, query structure (binary / multi-choice / icon-cue), and presentation modality. Third, the interaction layer enables only those input methods compatible with the sensed I/O availability, e.g., head nod for “yes” when whispering isn’t acceptable, or gaze dwell when hands are occupied.

Where do the few-shot policies come from—designer instinct or data?

The team seeded the policy space with two studies: an expert workshop (n=12) to enumerate when proactive help is useful and which micro-inputs are socially acceptable; and a context mapping study (n=40; 960 entries) across everyday scenarios (e.g., gym, grocery, museum, commuting, cooking) where participants specified desired agent actions and chose a preferred query type and modality given the context. These mappings ground the few-shot exemplars used at runtime, shifting the choice of “what+how” from ad-hoc heuristics to data-derived patterns (e.g., multi-choice in unfamiliar environments, binary under time pressure, icon + visual in socially sensitive settings).

What concrete interaction techniques does the prototype support?

For binary confirmations, the system recognizes head nod/shake; for multi-choice, a head-tilt scheme maps left/right/back to options 1/2/3. Finger-pose gestures support numeric selection and thumbs up/down; gaze dwell triggers visual buttons where raycast pointing would be fussy; short-vocabulary speech (e.g., “yes,” “no,” “one,” “two,” “three”) provides a minimal dictation path; and non-lexical conversational sounds (“mm-hm”) cover noisy or whisper-only contexts. Crucially, the pipeline only offers modalities that are feasible under current constraints (e.g., suppress audio prompts in quiet spaces; avoid gaze dwell if the user isn’t looking at the HUD).


Does the joint decision actually reduce interaction cost?

A preliminary within-subjects user study (n=10) comparing the framework to a voice-prompt baseline across AR and 360° VR reported lower perceived interaction effort and lower intrusiveness while maintaining usability and preference. This is a small sample typical of early HCI validation; it’s directional evidence rather than product-grade proof, but it aligns with the thesis that coupling intent and modality reduces overhead.

How does the audio side work, and why YAMNet?

YAMNet is a lightweight, MobileNet-v1–based audio event classifier trained on Google’s AudioSet, predicting 521 classes. In this context it’s a practical choice to detect rough ambient conditions—speech presence, music, crowd noise—fast enough to gate audio prompts or to bias toward visual/gesture interaction when speech would be awkward or unreliable. The model’s ubiquity in TensorFlow Hub and Edge guides makes it straightforward to deploy on device.
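
The snippet below is a minimal sketch of that gating step using YAMNet's published TensorFlow Hub interface (a 16 kHz mono float32 waveform in, a [frames, 521] score matrix out); the speech-class index and the threshold are assumptions chosen for illustration.

import numpy as np
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def speech_present(waveform_16k, threshold=0.3):
    """Return True when the 'Speech' class dominates enough frames."""
    scores, embeddings, spectrogram = yamnet(waveform_16k.astype(np.float32))
    speech_scores = scores.numpy()[:, 0]   # index 0 is "Speech" in the AudioSet class map
    return float(speech_scores.mean()) > threshold

# e.g., suppress spoken prompts and prefer visual cues plus head-nod confirmation
# whenever speech_present(frame_buffer) indicates an ongoing conversation.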

How can you integrate it into an existing AR or mobile assistant stack?

A minimal adoption plan looks like this: (1) instrument a lightweight context parser (VLM on egocentric frames + ambient audio tags) to produce a compact state; (2) build a few-shot table of context→(action, query type, modality) mappings from internal pilots or user studies; (3) prompt an LMM to emit both the “what” and the “how” at once; (4) expose only feasible input methods per state and keep confirmations binary by default; (5) log choices and outcomes for offline policy learning. The Sensible Agent artifacts show this is feasible in WebXR/Chrome on Android-class hardware, so migrating to a native HMD runtime or even a phone-based HUD is mostly an engineering exercise.
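
A hedged sketch of steps (2) and (4) follows: a hand-written context→(action, query type, modality) table plus a feasibility filter over input methods. All names and mappings here are illustrative stand-ins; the actual framework derives these policies from its study data and prompts an LMM with few-shot exemplars at runtime.

from dataclasses import dataclass

@dataclass
class ContextState:
    hands_busy: bool
    noisy: bool
    socially_sensitive: bool
    unfamiliar_environment: bool

# context -> (action, query type, output modality, confirmation input)
FEWSHOT_POLICY = [
    (lambda c: c.unfamiliar_environment, ("recommend", "multi_choice", "visual", "head_tilt")),
    (lambda c: c.socially_sensitive,     ("remind",    "icon_cue",     "visual", "gaze_dwell")),
    (lambda c: True,                     ("recommend", "binary",       "audio",  "head_nod")),
]

def feasible_inputs(c):
    inputs = {"head_nod", "head_tilt", "gaze_dwell", "speech", "finger_pose"}
    if c.noisy or c.socially_sensitive:
        inputs.discard("speech")        # don't force the user to talk
    if c.hands_busy:
        inputs.discard("finger_pose")
    return inputs

def decide(c):
    for predicate, (action, query, modality, confirm) in FEWSHOT_POLICY:
        if predicate(c) and confirm in feasible_inputs(c):
            return action, query, modality, confirm
    return "recommend", "binary", "visual", "gaze_dwell"   # conservative fallback

print(decide(ContextState(hands_busy=True, noisy=True,
                          socially_sensitive=False, unfamiliar_environment=True)))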

Summary

Sensible Agent operationalizes proactive AR as a coupled policy problem—selecting the action and the interaction modality in a single, context-conditioned decision—and validates the approach with a working WebXR prototype and small-N user study showing lower perceived interaction effort relative to a voice baseline. The framework’s contribution is not a product but a reproducible recipe: a dataset of context→(what/how) mappings, few-shot prompts to bind them at runtime, and low-effort input primitives that respect social and I/O constraints.



Top Computer Vision CV Blogs & News Websites (2025)

 

Computer vision moved fast in 2025: new multimodal backbones, larger open datasets, and tighter model–systems integration. Practitioners need sources that publish rigorously, link code and benchmarks, and track deployment patterns—not marketing posts. This list prioritizes primary research hubs, lab blogs, and production-oriented engineering outlets with consistent update cadence. Use it to monitor SOTA shifts, grab reproducible code paths, and translate papers into deployable pipelines.

Primary source for advances from Google/DeepMind teams, including vision architectures (e.g., V-MoE) and periodic research year-in-review posts across CV and multimodal. Posts typically include method summaries, figures, and links to papers/code.

Consistent reporting on new computer-vision models, datasets, and benchmarks with links to papers, code, and demos. Dedicated CV category plus frequent deep-dives (e.g., DINOv3 releases and analysis). Useful for staying on top of weekly research drops without wading through raw feeds.

High-signal posts with preprints and open-source drops. Recent examples include DINOv3—scaled self-supervised backbones with SOTA across dense prediction tasks—which provide technical detail and artifacts.

Production-oriented content on VLM-powered analytics, optimized inference, and GPU pipelines. Category feed for Computer Vision includes blueprints, SDK usage, and performance guidance relevant to enterprise deployments.

The canonical preprint feed for CV. Use the recent or new views for daily updates; taxonomy confirms scope (image processing, pattern recognition, scene understanding). Best paired with RSS + custom filters.

Final versions of main-conference papers and workshops, searchable and citable. CVPR 2025 proceedings and workshop menus are already live, making this the authoritative archive post-acceptance.

Occasional but deep posts on frontier topics (e.g., extremely large image modeling, robotics-vision crossovers). Good for conceptual clarity directly from authors.

Technical explainers and lab roundups (e.g., SAIL at CVPR 2025) with links to papers/talks. Useful to scan emerging directions across perception, generative models, and embodied vision.

High-frequency, implementation-focused posts (labeling, training, deployment, apps, and trend reports). Strong for practitioners who need working pipelines and edge deployments.

Hands-on guides (VLMs, FiftyOne integrations) and ecosystem notes across Transformers, Diffusers, and timm; good for rapid prototyping and fine-tuning CV/VLM stacks.

Change logs, APIs, and recipes affecting CV training/inference (Transforms V2, multi-weight support, FX feature extraction). Read when upgrading training stacks.


Qwen3-ASR-Toolkit: An Advanced Open Source Python Command-Line Toolkit for Using the Qwen-ASR API Beyond the 3 Minutes/10 MB Limit

 

Qwen has released Qwen3-ASR-Toolkit, an MIT-licensed Python CLI that programmatically bypasses the Qwen3-ASR-Flash API’s 3-minute/10 MB per-request limit by performing VAD-aware chunking, parallel API calls, and automatic resampling/format normalization via FFmpeg. The result is stable, hour-scale transcription pipelines with configurable concurrency, context injection, and clean text post-processing. Python ≥3.8 is a prerequisite; install with:

pip install qwen3-asr-toolkit

What the toolkit adds on top of the API

  • Long-audio handling. The toolkit slices input using voice activity detection (VAD) at natural pauses, keeping each chunk under the API’s hard duration/size caps, then merges outputs in order.
  • Parallel throughput. A thread pool dispatches multiple chunks concurrently to DashScope endpoints, improving wall-clock latency for hour-long inputs. You control concurrency via -j/--num-threads.
  • Format & rate normalization. Any common audio/video container (MP4/MOV/MKV/MP3/WAV/M4A, etc.) is converted to the API’s required mono 16 kHz before submission. Requires FFmpeg installed on PATH.
  • Text cleanup & context. The tool includes post-processing to reduce repetitions/hallucinations and supports context injection to bias recognition toward domain terms; the underlying API also exposes language detection and inverse text normalization (ITN) toggles.

The official Qwen3-ASR-Flash API is single-turn and enforces ≤3 min duration and ≤10 MB payloads per call. That is reasonable for interactive requests but awkward for long media. The toolkit operationalizes best practices—VAD-aware segmentation + concurrent calls—so teams can batch large archives or live capture dumps without writing orchestration from scratch.

Quick start

  1. Install prerequisites
# System: FFmpeg must be available
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install -y ffmpeg
  2. Install the CLI
pip install qwen3-asr-toolkit
  3. Configure credentials
# International endpoint key
export DASHSCOPE_API_KEY="sk-..."
  4. Run
# Basic: local video, default 4 threads
qwen3-asr -i "/path/to/lecture.mp4"

# Faster: raise parallelism and pass key explicitly (optional if env var set)
qwen3-asr -i "/path/to/podcast.wav" -j 8 -key "sk-..."

# Improve domain accuracy with context
qwen3-asr -i "/path/to/earnings_call.m4a" 
  -c "tickers, CFO name, product names, Q3 revenue guidance"

Arguments you’ll actually use:
-i/--input-file (file path or http/https URL), -j/--num-threads, -c/--context, -key/--dashscope-api-key, -t/--tmp-dir, -s/--silence. Output is printed and saved as <input_basename>.txt.

Minimal pipeline architecture

  1) Load local file or URL → 2) VAD to find silence boundaries → 3) Chunk under API caps → 4) Resample to 16 kHz mono → 5) Parallel submit to DashScope → 6) Aggregate segments in order → 7) Post-process text (dedupe, reduce repetitions) → 8) Emit .txt transcript.
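
The sketch below condenses that pipeline; the FFmpeg normalization and the ordered thread-pool dispatch mirror what the toolkit automates, while the per-chunk Qwen3-ASR-Flash request is left as a placeholder since the toolkit handles that call for you.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def normalize(src, dst="normalized.wav"):
    # mono, 16 kHz, as the API requires
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst],
                   check=True, capture_output=True)
    return dst

def transcribe_chunk(chunk_path, context=""):
    """Placeholder for the per-chunk DashScope request (<=3 min, <=10 MB)."""
    raise NotImplementedError("call Qwen3-ASR-Flash here")

def transcribe_long(chunk_paths, context="", threads=4):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        pieces = list(pool.map(lambda p: transcribe_chunk(p, context), chunk_paths))
    return "\n".join(pieces)   # map() preserves chunk order when merging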

Summary

Qwen3-ASR-Toolkit turns Qwen3-ASR-Flash into a practical long-audio pipeline by combining VAD-based segmentation, FFmpeg normalization (mono/16 kHz), and parallel API dispatch under the 3-minute/10 MB caps. Teams get deterministic chunking, configurable throughput, and optional context/LID/ITN controls without custom orchestration. For production, pin the package version, verify region endpoints/keys, and tune thread count to your network and QPS—then pip install qwen3-asr-toolkit and ship.



Physical AI: Bridging Robotics, Material Science, and Artificial Intelligence for Next-Gen Embodied Systems

 

What Do We Mean by “Physical AI”?

Artificial intelligence in robotics is not just a matter of clever algorithms. Robots operate in the physical world, and their intelligence emerges from the co-design of body and brain. Physical AI describes this integration, where materials, actuation, sensing, and computation shape how learning policies function. The term was introduced in Nature Machine Intelligence and reinforced by research on “physical intelligence,” emphasizing that a robot’s body is as much a locus of intelligence as its software.

How Do Materials Contribute to Intelligence?

Materials define how a robot moves and interacts with its environment. Dielectric elastomer actuators (DEAs) deliver high strain and power density, with 3D-printable multilayer designs that are scalable to production. Liquid crystal elastomers (LCEs) offer programmable contraction and deformation via fiber alignment, enabling novel morphologies in soft robotics. Engineers are also exploring impulsive actuation, where latching and snap-through mechanics produce explosive movements like jumps or rapid grasping. Beyond actuation, computing metamaterials embed logic and memory into structures themselves, hinting at a future where the body performs part of the computation.

What New Sensing Technologies Are Powering Embodiment?

Perception is central to embodied intelligence. Event cameras update pixels asynchronously with microsecond latency and high dynamic range, ideal for high-speed tasks under changing lighting. Vision-based tactile skins, derived from GelSight, can detect slip and capture high-resolution contact geometry. Meanwhile, flexible e-skins spread tactile sensing across large robot surfaces, enabling whole-body awareness. Together, these sensors give robots the ability to “see” and “feel” the world in real time.

Why Is Neuromorphic Computing Relevant for Physical AI?

Robots cannot rely on energy-hungry datacenter GPUs alone. Neuromorphic hardware, such as Intel’s Loihi 2 chips and the Hala Point system (1.15 billion neurons, 140,544 neuromorphic cores), executes spiking neural networks with extreme energy efficiency. These event-driven architectures align naturally with sensors like event cameras, supporting low-power reflexes and always-on perception. In practice, this frees GPUs and NPUs to run foundation models while neuromorphic substrates handle real-time safety and control.

How Are Foundation Policies Changing Robot Learning?

The old model of programming robots task-by-task is giving way to generalist robot policies. Massive datasets like Open X-Embodiment (OXE)—with over one million robot trajectories across 22 embodiments—provide the training substrate. On top of OXE, policies such as Octo (~800,000 episodes) and OpenVLA 7B (~970,000 episodes) demonstrate transferable skills across robots. Google’s RT-2 further shows how grounding robot policies in web-scale vision-language data enables generalization to novel tasks. This signals a shift toward shared foundation controllers for robots, much like foundation models transformed natural language processing.

How Does Differentiable Physics Enable Co-Design?

Traditionally, robots were built as hardware first and programmed later. With differentiable physics engines like DiffTaichi and Brax, designers can now compute gradients through simulations of deformable bodies and rigid dynamics. This allows morphology, materials, and policies to be optimized jointly, reducing the “sim-to-real” gap that has slowed soft robotics. Differentiable co-design accelerates iteration, aligning physical design with learned behaviors from the start.

How Can We Assure Safety in Physical AI?

Learned policies can behave unpredictably, making safety a core concern. Control Barrier Functions (CBFs) enforce mathematical safety constraints at runtime, ensuring robots remain within safe state spaces. Shielded reinforcement learning adds another layer by filtering unsafe actions before execution. Embedding these safeguards beneath vision-language-action or diffusion policies ensures robots can adapt while staying safe in dynamic, human-centered environments.
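
As a toy illustration of the idea (not any specific robot stack), consider a 1-D single integrator x' = u with safe set h(x) = x_max - x ≥ 0. The CBF condition h' ≥ -α·h reduces to u ≤ α·(x_max - x), so the runtime "shield" is just a clamp on the learned policy's command; real systems solve a small quadratic program over ∇h(x)·(f(x) + g(x)u) instead.

def cbf_filter(u_nominal, x, x_max=1.0, alpha=5.0):
    # Largest velocity that still satisfies h' >= -alpha * h for h = x_max - x
    u_safe_max = alpha * (x_max - x)
    return min(u_nominal, u_safe_max)

x, dt = 0.0, 0.01
for _ in range(300):
    u = cbf_filter(u_nominal=2.0, x=x)   # the learned policy keeps asking to rush forward
    x += dt * u                          # filtered command never pushes past x_max
print(round(x, 3))                       # approaches 1.0 from below, staying in the safe set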

What Benchmarks Are Used to Evaluate Physical AI?

Evaluation is shifting toward embodied competence. The BEHAVIOR benchmark tests robots on long-horizon household tasks requiring mobility and manipulation. Ego4D provides ~3,670 hours of egocentric video from hundreds of participants, while Ego-Exo4D adds ~1,286 hours of synchronized egocentric and exocentric recordings with rich 3D annotations. These benchmarks emphasize adaptability, perception, and long-horizon reasoning in real-world contexts, not just short scripted tasks.

Where Is Physical AI Headed Next?

A practical Physical AI stack is beginning to emerge: smart actuators like DEAs and LCEs, tactile and event-based sensors, hybrid compute that combines GPU inference with neuromorphic reflex cores, generalist policies trained on cross-embodiment data, safety enforced through CBFs and shields, and design loops informed by differentiable physics. Each of these components exists today, though many are still in early stages.

The significance is clear: robots are evolving beyond narrow automation. With embodied intelligence distributed across body and brain, Physical AI represents a paradigm shift as profound for robotics as deep learning was for software AI.

Summary

Physical AI distributes intelligence across materials, morphology, sensors, compute, and learning policies. Advances in soft actuators, tactile/event-based sensing, neuromorphic hardware, and generalist robot policies are enabling robots that adapt across tasks and platforms. Safety frameworks like control barrier functions and shielded reinforcement learning ensure these systems can be deployed reliably in real-world environments.

FAQs

1. What is Physical AI?
Physical AI refers to embodied intelligence that emerges from the co-design of materials, actuation, sensing, compute, and learning policies—not just software.

2. How do materials like DEAs and LCEs impact robotics?
Dielectric elastomer actuators (DEAs) and liquid crystal elastomers (LCEs) act as artificial muscles, enabling high strain, programmable motion, and dynamic soft robotics.

3. Why are event cameras important in Physical AI?
Event cameras provide microsecond latency and high dynamic range, supporting low-power, high-speed perception for real-time control in robots.

4. What role does neuromorphic hardware play?
Neuromorphic chips like Intel Loihi 2 enable energy-efficient, event-driven processing, complementing GPUs by handling reflexes and always-on safety perception.

5. How is safety guaranteed in Physical AI systems?
Control Barrier Functions (CBFs) and shielded reinforcement learning filter unsafe actions and enforce state constraints during robot operation.
