Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens

Xiaomi’s MiMo team released MiMo-Audio, a 7-billion-parameter audio-language model that runs a single next-token objective over interleaved text and discretized speech, scaling pretraining beyond 100 million hours of audio.

What’s actually new?

Instead of relying on task-specific heads or lossy acoustic tokens, MiMo-Audio uses a bespoke RVQ (residual vector quantization) tokenizer that targets both semantic fidelity and high-quality reconstruction. The tokenizer runs at 25 Hz and outputs 8 RVQ layers (≈200 tokens/s), giving the LM access to “lossless” speech features it can model autoregressively alongside text.
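
For a sense of the raw data rate, the arithmetic is straightforward (a back-of-the-envelope sketch; the 25 Hz frame rate, 8 RVQ layers, and 100M-hour corpus are the report's figures, everything else is plain multiplication):

frame_rate_hz = 25        # tokenizer frames per second of audio
rvq_layers = 8            # RVQ codebooks emitted per frame
tokens_per_second = frame_rate_hz * rvq_layers          # 25 * 8 = 200 tokens/s

corpus_hours = 100_000_000                              # 100M+ hours of pretraining audio
raw_audio_tokens = corpus_hours * 3600 * tokens_per_second
print(f"{tokens_per_second} tokens/s")                  # 200
print(f"~{raw_audio_tokens:.1e} raw audio tokens")      # ~7.2e+13, i.e., tens of trillions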

Architecture: patch encoder → 7B LLM → patch decoder

To handle the audio/text rate mismatch, the system packs four timesteps per patch for LM consumption (downsampling 25 Hz → 6.25 Hz), then reconstructs full-rate RVQ streams with a causal patch decoder. A delayed multi-layer RVQ generation scheme staggers predictions per codebook to stabilize synthesis and respect inter-layer dependencies. All three parts—patch encoder, MiMo-7B backbone, and patch decoder—are trained under a single next-token objective.
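
The delay scheme can be pictured as a staggered grid in which codebook k's token for frame t is emitted k steps after codebook 0's, so shallower layers are always available when deeper ones are predicted. A minimal sketch of such a schedule, assuming a one-step delay per layer (the report does not spell out the exact offsets):

import numpy as np

def delayed_rvq_schedule(tokens, delay=1, pad=-1):
    # tokens: (n_frames, n_layers) grid of RVQ codebook ids.
    # Shift layer k right by k*delay steps, padding the gaps, so each
    # generation step only depends on already-emitted shallower layers.
    n_frames, n_layers = tokens.shape
    out = np.full((n_frames + (n_layers - 1) * delay, n_layers), pad, dtype=tokens.dtype)
    for k in range(n_layers):
        out[k * delay : k * delay + n_frames, k] = tokens[:, k]
    return out

toy = np.arange(20).reshape(5, 4)   # 5 frames, 4 RVQ layers of toy token ids
print(delayed_rvq_schedule(toy))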

https://xiaomimimo.github.io/MiMo-Audio-Demo/

Scale is the algorithm

Training proceeds in two big phases: (1) an “understanding” stage that optimizes text-token loss over interleaved speech-text corpora, and (2) a joint “understanding + generation” stage that turns on audio losses for speech continuation, S2T/T2S tasks, and instruction-style data. The report emphasizes a compute/data threshold where few-shot behavior appears to “switch on,” echoing emergence curves seen in large text-only LMs.
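
In next-token terms, the two stages differ only in which positions contribute to the loss. A toy sketch of that masking, assuming token-type ids mark text vs. audio positions (the actual implementation details are not public):

import numpy as np

def stage_loss_mask(token_types, stage):
    # token_types: 0 = text token, 1 = audio token.
    # Stage 1 ("understanding") backpropagates only through text positions;
    # stage 2 ("understanding + generation") turns on the audio loss too.
    if stage == 1:
        return (token_types == 0).astype(float)
    return np.ones_like(token_types, dtype=float)

types = np.array([0, 0, 1, 1, 1, 0, 1])
print(stage_loss_mask(types, stage=1))  # [1. 1. 0. 0. 0. 1. 0.]
print(stage_loss_mask(types, stage=2))  # all ones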

Benchmarks: speech intelligence and general audio

MiMo-Audio is evaluated on speech-reasoning suites (e.g., SpeechMMLU) and broad audio understanding benchmarks (e.g., MMAU), reporting strong scores across speech, sound, and music and a reduced “modality gap” between text-only and speech-in/speech-out settings. Xiaomi also releases MiMo-Audio-Eval, a public toolkit to reproduce these results. Listen-and-respond demos (speech continuation, voice/emotion conversion, denoising, and speech translation) are available online.


Why does this matter?

The approach is intentionally simple—no multi-head task tower, no bespoke ASR/TTS objectives at pretraining time—just GPT-style next-token prediction over lossless audio tokens plus text. The key engineering ideas are (i) a tokenizer the LM can actually use without throwing away prosody and speaker identity; (ii) patchification to keep sequence lengths manageable; and (iii) delayed RVQ decoding to preserve quality at generation time. For teams building spoken agents, those design choices translate into few-shot speech-to-speech editing and robust speech continuation with minimal task-specific finetuning.

6 Technical Takeaways:

  1. High-Fidelity Tokenization
    MiMo-Audio uses a custom RVQ tokenizer operating at 25 Hz with 8 active codebooks, ensuring speech tokens preserve prosody, timbre, and speaker identity while keeping them LM-friendly.
  2. Patchified Sequence Modeling
    The model reduces sequence length by grouping 4 timesteps into one patch (25 Hz → 6.25 Hz), letting the 7B LLM handle long speech efficiently without discarding detail.
  3. Unified Next-Token Objective
    Rather than separate heads for ASR, TTS, or dialogue, MiMo-Audio trains under a single next-token prediction loss across interleaved text and audio, simplifying architecture while supporting multi-task generalization.
  4. Emergent Few-Shot Abilities
    Few-shot behaviors such as speech continuation, voice conversion, emotion transfer, and speech translation emerge once training surpasses a large-scale data threshold (~100M hours, trillions of tokens).
  5. Benchmark Leadership
    MiMo-Audio achieves state-of-the-art scores on SpeechMMLU (S2S 69.1, T2S 71.5) and MMAU (66.0 overall), while narrowing the text↔speech modality gap to just 3.4 points.
  6. Open Ecosystem Release
    Xiaomi provides the tokenizer, 7B checkpoints (base and instruct), MiMo-Audio-Eval toolkit, and public demos, enabling researchers and developers to test and extend speech-to-speech intelligence in open-source pipelines.

Summary

MiMo-Audio demonstrates that high-fidelity, RVQ-based “lossless” tokenization combined with patchified next-token pretraining at scale is sufficient to unlock few-shot speech intelligence without task-specific heads. The 7B stack—tokenizer → patch encoder → LLM → patch decoder—bridges the audio/text rate gap (25→6.25 Hz) and preserves prosody and speaker identity via delayed multi-layer RVQ decoding. Empirically, the model narrows the text↔speech modality gap, generalizes across speech/sound/music benchmarks, and supports in-context S2S editing and continuation.



Run MATLAB-Style Code Inside Python by Connecting Octave with the oct2py Library


In this tutorial, we explore how we can seamlessly run MATLAB-style code inside Python by connecting Octave with the oct2py library. We set up the environment on Google Colab, exchange data between NumPy and Octave, write and call .m files, visualize plots generated in Octave within Python, and even work with toolboxes, structs, and .mat files. By doing this, we gain the flexibility of Python’s ecosystem while continuing to leverage the familiar syntax and numerical power of MATLAB/Octave in a single workflow.

!apt-get -qq update
!apt-get -qq install -y octave gnuplot octave-signal octave-control > /dev/null
!python -m pip -q install oct2py scipy matplotlib pillow


from oct2py import Oct2Py, Oct2PyError
import numpy as np, matplotlib.pyplot as plt, textwrap
from scipy.io import savemat, loadmat
from PIL import Image


oc = Oct2Py()
print("Octave version:", oc.eval("version"))


def show_png(path, title=None):
    img = Image.open(path)
    plt.figure(figsize=(5,4)); plt.imshow(img); plt.axis("off")
    if title: plt.title(title)
    plt.show()

We begin by setting up Octave and essential libraries in Google Colab, ensuring that we have Octave-Forge packages and Python dependencies ready. We then initialize an Oct2Py session and define a helper function so we can display Octave-generated plots directly in our Python workflow.

print("n--- Basic eval ---")
print(oc.eval("A = magic(4); A"))
print("eig(A) diag:", oc.eval("[V,D]=eig(A); diag(D)'"))
print("sin(pi/4):", oc.eval("sin(pi/4)"))


print("n--- NumPy exchange ---")
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + 0.1*np.random.randn(x.size)
y_filt = oc.feval("conv", y, np.ones(5)/5.0, "same") 
print("y_filt shape:", np.asarray(y_filt).shape)


print("n--- Cells & Structs ---")
cells = ["hello", 42, [1,2,3]]
oc.push("C", cells)
oc.eval("s = struct('name','Ada','score',99,'tags',{C});")
s = oc.pull("s")
print("Struct from Octave -> Python:", s)

We test the bridge between Python and Octave by running basic matrix operations, eigenvalue decomposition, and trigonometric evaluations directly in Octave. We then exchange NumPy arrays with Octave to perform a convolution filter and verify its shape. Finally, we demonstrate how to push Python lists into Octave as cell arrays, build a struct, and retrieve it back into Python for seamless data sharing.

print("n--- Writing and calling .m files ---")
gd_code = r"""
function [w, hist] = gradient_descent(X, y, alpha, iters)
 % X: (n,m), y: (n,1). Adds bias; returns weights and loss history.
 if size(X,2) == 0, error('X must be 2D'); end
 n = rows(X);
 Xb = [ones(n,1), X];
 m = columns(Xb);
 w = zeros(m,1);
 hist = zeros(iters,1);
 for t=1:iters
   yhat = Xb*w;
   g = (Xb'*(yhat - y))/n;
   w = w - alpha * g;
   hist(t) = (sum((yhat - y).^2)/(2*n));
 endfor
endfunction
"""
with open("gradient_descent.m","w") as f: f.write(textwrap.dedent(gd_code))


np.random.seed(0)
X = np.random.randn(200, 3)
true_w = np.array([2.0, -1.0, 0.5, 3.0])    
y = true_w[0] + X @ true_w[1:] + 0.3*np.random.randn(200)
w_est, hist = oc.gradient_descent(X, y.reshape(-1,1), 0.1, 100, nout=2)
print("Estimated w:", np.ravel(w_est))
print("Final loss:", float(np.ravel(hist)[-1]))


print("n--- Octave plotting -> PNG -> Python display ---")
oc.eval("x = linspace(0,2*pi,400); y = sin(2*x) .* exp(-0.2*x);")
oc.eval("figure('visible','off'); plot(x,y,'linewidth',2); grid on; title('Damped Sine (Octave)');")
plot_path = "/content/oct_plot.png"
oc.eval(f"print('{plot_path}','-dpng'); close all;")
show_png(plot_path, title="Octave-generated Plot")

We write a custom gradient_descent.m in Octave, call it from Python with nout=2, and confirm that we recover reasonable weights and a decreasing loss. We then render a damped sine plot in an off-screen Octave figure and display the saved PNG inline in our Python notebook, keeping the whole workflow seamless.

print("n--- Packages (signal/control) ---")
signal_ok = True
try:
   oc.eval("pkg load signal; pkg load control;")
   print("Loaded: signal, control")
except Oct2PyError as e:
   signal_ok = False
   print("Could not load signal/control, skipping package demo.nReason:", str(e).splitlines()[0])


if signal_ok:
    oc.push("t", np.linspace(0,1,800))
    oc.eval("x = sin(2*pi*5*t) + 0.5*sin(2*pi*40*t);")
    oc.eval("[b,a] = butter(4, 10/(800/2)); xf = filtfilt(b,a,x);")
    xf = oc.pull("xf")
    plt.figure(); plt.plot(xf); plt.title("Octave signal package: filtered"); plt.show()


print("n--- Function handles ---")
oc.eval("""
f = @(z) z.^2 + 3*z + 2;
vals = feval(f, [0 1 2 3]);
""")
vals = oc.pull("vals")
print("f([0,1,2,3]) =", np.ravel(vals))


quadfun_code = r"""
function y = quadfun(z)
 y = z.^2 + 3*z + 2;
end
"""
with open("quadfun.m","w") as f: f.write(textwrap.dedent(quadfun_code))
vals2 = oc.quadfun(np.array([0,1,2,3], dtype=float))
print("quadfun([0,1,2,3]) =", np.ravel(vals2))

We load the signal and control packages so we can design a Butterworth filter in Octave and visualize the filtered waveform back in Python. We also work with function handles by evaluating an anonymous quadratic inside Octave and, for robustness, define a named quadfun.m that we call from Python, showing both handle-based and file-based function calls in the same flow.

print("n--- .mat I/O ---")
data_py = {"A": np.arange(9).reshape(3,3), "label": "demo"}
savemat("demo.mat", data_py)
oc.eval("load('demo.mat'); A2 = A + 1;")
oc.eval("save('-mat','demo_from_octave.mat','A2','label');")
back = loadmat("demo_from_octave.mat")
print("Keys from Octave-saved mat:", list(back.keys()))


print("n--- Error handling ---")
try:
   oc.eval("no_such_function(1,2,3);")
except Oct2PyError as e:
   print("Caught Octave error as Python exception:n", str(e).splitlines()[0])


print("n--- Simple Octave benchmark ---")
oc.eval("N = 2e6; a = rand(N,1);")


oc.eval("tic; s1 = sum(a); tv = toc;")
t_vec = float(oc.pull("tv"))


oc.eval("tic; s2 = 0; for i=1:length(a), s2 += a(i); end; tl = toc;")
t_loop = float(oc.pull("tl"))


print(f"Vectorized sum: {t_vec:.4f}s | Loop sum: {t_loop:.4f}s")


print("n--- Multi-file pipeline ---")
pipeline_m = r"""
function out = mini_pipeline(x, fs)
 try, pkg load signal; catch, end
 [b,a] = butter(6, 0.2);
 y = filtfilt(b,a,x);
 y_env = abs(hilbert(y));
 out = struct('rms', sqrt(mean(y.^2)), 'peak', max(abs(y)), 'env', y_env(1:10));
end
"""
with open("mini_pipeline.m","w") as f: f.write(textwrap.dedent(pipeline_m))


fs = 200.0
sig = np.sin(2*np.pi*3*np.linspace(0,3,int(3*fs))) + 0.1*np.random.randn(int(3*fs))
out = oc.mini_pipeline(sig, fs, nout=1)
print("mini_pipeline -> keys:", list(out.keys()))
print("RMS ~", float(out["rms"]), "| Peak ~", float(out["peak"]), "| env head:", np.ravel(out["env"])[:5])


print("nAll sections executed. You are now running MATLAB/Octave code from Python!")

We exchange .mat files between Python and Octave, confirming that data flows both ways without issues. We also test error handling by catching an Octave error as a Python exception. Next, we benchmark vectorized versus looped summations in Octave, showing the performance edge of vectorization. Finally, we build a multi-file pipeline that applies filtering and envelope detection, returning key statistics to Python, demonstrating how we can organize Octave code into reusable components within our Python workflow.

In conclusion, we see how we can integrate Octave’s MATLAB-compatible features directly into Python and Colab. We successfully test data exchange, custom functions, plotting, package usage, and performance benchmarking, demonstrating that we can mix MATLAB/Octave workflows with Python without leaving our notebook. By combining the strengths of both environments, we put ourselves in a position to solve problems more efficiently and with greater flexibility.



Building AI agents is 5% AI and 100% software engineering


Production-grade agents live or die on data plumbing, controls, and observability—not on model choice. The doc-to-chat pipeline below maps the concrete layers and why they matter.

What is a “doc-to-chat” pipeline?

A doc-to-chat pipeline ingests enterprise documents, standardizes them, enforces governance, indexes embeddings alongside relational features, and serves retrieval + generation behind authenticated APIs with human-in-the-loop (HITL) checkpoints. It’s the reference architecture for agentic Q&A, copilots, and workflow automation where answers must respect permissions and be audit-ready. Production implementations are variations of RAG (retrieval-augmented generation) hardened with LLM guardrails, governance, and OpenTelemetry-backed tracing.

How do you integrate cleanly with the existing stack?

Use standard service boundaries (REST/JSON, gRPC) over a storage layer your org already trusts. For tables, Iceberg gives ACID, schema evolution, partition evolution, and snapshots—critical for reproducible retrieval and backfills. For vectors, use a system that coexists with SQL filters: pgvector collocates embeddings with business keys and ACL tags in PostgreSQL; dedicated engines like Milvus handle high-QPS ANN with disaggregated storage/compute. In practice, many teams run both: SQL+pgvector for transactional joins and Milvus for heavy retrieval.

Key properties

  • Iceberg tables: ACID, hidden partitioning, snapshot isolation; vendor support across warehouses.
  • pgvector: SQL + vector similarity in one query plan for precise joins and policy enforcement.
  • Milvus: layered, horizontally scalable architecture for large-scale similarity search.

How do agents, humans, and workflows coordinate on one “knowledge fabric”?

Production agents require explicit coordination points where humans approve, correct, or escalate. AWS A2I provides managed HITL loops (private workforces, flow definitions) and is a concrete blueprint for gating low-confidence outputs. Frameworks like LangGraph model these human checkpoints inside agent graphs so approvals are first-class steps in the DAG, not ad hoc callbacks. Use them to gate actions like publishing summaries, filing tickets, or committing code.

Pattern: LLM → confidence/guardrail checks → HITL gate → side-effects. Persist every artifact (prompt, retrieval set, decision) for auditability and future re-runs.
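
A minimal sketch of that pattern (the AgentStep fields, the 0.8 threshold, and the passes_guardrails/queue_for_human_review hooks are illustrative placeholders, not any framework's API):

from dataclasses import dataclass, asdict
import json, uuid

@dataclass
class AgentStep:
    prompt: str
    retrieved_ids: list
    answer: str
    confidence: float

def passes_guardrails(text: str) -> bool:
    # Stand-in for a real check (Bedrock Guardrails, NeMo Guardrails, Llama Guard).
    return "forbidden" not in text.lower()

def queue_for_human_review(step: AgentStep) -> str:
    # Stand-in for an A2I flow definition or a LangGraph interrupt node.
    return "escalated"

def persist(step: AgentStep, decision: str) -> None:
    # Persist every artifact (prompt, retrieval set, decision) for audit and re-runs.
    record = {"id": str(uuid.uuid4()), "decision": decision, **asdict(step)}
    with open("audit_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

def hitl_gate(step: AgentStep, threshold: float = 0.8) -> str:
    # Guardrail check first, then a confidence gate, before any side-effects.
    if not passes_guardrails(step.answer):
        persist(step, "blocked")
        return "blocked"
    if step.confidence < threshold:
        persist(step, "escalated")
        return queue_for_human_review(step)
    persist(step, "auto_approved")
    return "auto_approved"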

How is reliability enforced before anything reaches the model?

Treat reliability as layered defenses:

  1. Language + content guardrails: Pre-validate inputs/outputs for safety and policy. Options span managed (Bedrock Guardrails) and OSS (NeMo Guardrails, Guardrails AI; Llama Guard). Independent comparisons and a position paper catalog the trade-offs.
  2. PII detection/redaction: Run analyzers on both source docs and model I/O. Microsoft Presidio offers recognizers and masking, with explicit caveats to combine with additional controls; a minimal sketch follows this list.
  3. Access control and lineage: Enforce row-/column-level ACLs and audit across catalogs (Unity Catalog) so retrieval respects permissions; unify lineage and access policies across workspaces.
  4. Retrieval quality gates: Evaluate RAG with reference-free metrics (faithfulness, context precision/recall) using Ragas/related tooling; block or down-rank poor contexts.
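
As flagged in item 2, here is a minimal PII-redaction sketch built on Presidio's analyzer/anonymizer pair, using defaults only (production deployments add custom recognizers and the extra controls Presidio's docs call for):

# pip install presidio-analyzer presidio-anonymizer (plus a spaCy model for the default NLP engine)
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe at jane.doe@example.com or 212-555-0123."
findings = analyzer.analyze(text=text, language="en")   # PERSON, EMAIL_ADDRESS, PHONE_NUMBER, ...
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)  # entities replaced with type placeholders, e.g. <PERSON>

Run the same pass over source documents at ingestion and over model inputs/outputs at serve time.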

How do you scale indexing and retrieval under real traffic?

Two axes matter: ingest throughput and query concurrency.

  • Ingest: Normalize at the lakehouse edge; write to Iceberg for versioned snapshots, then embed asynchronously. This enables deterministic rebuilds and point-in-time re-indexing.
  • Vector serving: Milvus’s shared-storage, disaggregated compute architecture supports horizontal scaling with independent failure domains; use HNSW/IVF/Flat hybrids and replica sets to balance recall/latency.
  • SQL + vector: Keep business joins server-side (pgvector), e.g., WHERE tenant_id = ? AND acl_tag @> ... ORDER BY embedding <-> :q LIMIT k. This avoids N+1 trips and respects policies (sketched after this list).
  • Chunking/embedding strategy: Tune chunk size/overlap and semantic boundaries; bad chunking is the silent killer of recall.
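
Here is the SQL + vector bullet as a single server-side query (a sketch: the chunks table, its columns, and psycopg as the driver are assumptions; the <-> distance and @> containment operators are standard pgvector/PostgreSQL):

import psycopg  # assumes PostgreSQL with the pgvector extension enabled

def policy_aware_search(conn, tenant_id, acl_tags, query_vec, k=5):
    # ACL filter and ANN ordering in one query plan: no N+1 trips, and rows
    # the caller is not permitted to see never leave the database.
    sql = """
        SELECT doc_id, chunk_text, embedding <-> %s::vector AS distance
        FROM chunks
        WHERE tenant_id = %s AND acl_tags @> %s
        ORDER BY distance
        LIMIT %s
    """
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(sql, (vec_literal, tenant_id, list(acl_tags), k))
        return cur.fetchall()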

For structured+unstructured fusion, prefer hybrid retrieval (BM25 + ANN + reranker) and store structured features next to vectors to support filters and re-ranking features at query time.

How do you monitor beyond logs?

You need traces, metrics, and evaluations stitched together:

  • Distributed tracing: Emit OpenTelemetry spans across ingestion, retrieval, model calls, and tools; LangSmith natively ingests OTEL traces and interoperates with external APMs (Jaeger, Datadog, Elastic). This gives end-to-end timing, prompts, contexts, and costs per request (a minimal sketch follows this list).
  • LLM observability platforms: Compare options (LangSmith, Arize Phoenix, LangFuse, Datadog) by tracing, evals, cost tracking, and enterprise readiness. Independent roundups and matrixes are available.
  • Continuous evaluation: Schedule RAG evals (Ragas/DeepEval/MLflow) on canary sets and live traffic replays; track faithfulness and grounding drift over time.
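
A minimal OpenTelemetry sketch of the tracing item above, using the console exporter for illustration (in practice you would configure an OTLP exporter pointed at LangSmith, Jaeger, or Datadog; the span names and placeholder calls are ours):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("doc_to_chat")

def answer(question: str) -> str:
    # One span tree per request: ingestion-to-generation timing without bespoke logging.
    with tracer.start_as_current_span("request") as span:
        span.set_attribute("question.chars", len(question))
        with tracer.start_as_current_span("retrieval"):
            contexts = ["..."]            # placeholder retriever call
        with tracer.start_as_current_span("llm_call") as llm:
            llm.set_attribute("context.count", len(contexts))
            return "placeholder answer"   # placeholder model call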

Add schema profiling/mapping on ingestion to keep observability attached to data shape changes (e.g., new templates, table evolution) and to explain retrieval regressions when upstream sources shift.

Example: doc-to-chat reference flow (signals and gates)

  1. Ingest: connectors → text extraction → normalization → Iceberg write (ACID, snapshots).
  2. Govern: PII scan (Presidio) → redact/mask → catalog registration with ACL policies.
  3. Index: embedding jobs → pgvector (policy-aware joins) and Milvus (high-QPS ANN).
  4. Serve: REST/gRPC → hybrid retrieval → guardrails → LLM → tool use.
  5. HITL: low-confidence paths route to A2I/LangGraph approval steps.
  6. Observe: OTEL traces to LangSmith/APM + scheduled RAG evaluations.

Why is “5% AI, 100% software engineering” accurate in practice?

Most outages and trust failures in agent systems are not model regressions; they’re data quality, permissioning, retrieval decay, or missing telemetry. The controls above—ACID tables, ACL catalogs, PII guardrails, hybrid retrieval, OTEL traces, and human gates—determine whether the same base model is safe, fast, and credibly correct for your users. Invest in these first; swap models later if needed.

