Cloudflare AI Team Just Open-Sourced ‘VibeSDK’ that Lets Anyone Build and Deploy a Full AI Vibe Coding Platform with a Single Click


 

Cloudflare’s AI team just open-sourced VibeSDK, a full-stack “vibe coding” platform that you can deploy end-to-end with a single click on Cloudflare’s network or by forking the GitHub repo. It packages code generation, safe execution, live preview, and multi-tenant deployment so teams can run their own internal or customer-facing AI app builder without stitching together infrastructure.

What’s actually in the box?

VibeSDK is a production-oriented reference implementation, not a toy UI. The repo (MIT-licensed) ships a React+Vite front end, Workers back end with Durable Objects for agent coordination, D1 (SQLite) via Drizzle, R2 for template storage, KV for sessions, and a “Deploy to Cloudflare” flow. It integrates Cloudflare Sandboxes/Containers for isolated builds and previews, and uses Workers for Platforms to publish each generated app as an isolated Worker with its own URL.

https://blog.cloudflare.com/deploy-your-own-ai-vibe-coding-platform/

How does code move through the system?

  1. A user describes the app; the agent generates files and writes them into a per-user sandbox.
  2. The sandbox installs deps and starts a dev server; the SDK exposes a public preview URL.
  3. Logs/errors stream back to the agent for iterative fixes.
  4. A deployment sandbox runs wrangler deploy to publish the app into a Workers-for-Platforms dispatch namespace, giving each app its own tenant-isolated Worker.

Models and routing

By default, VibeSDK uses Google’s Gemini 2.5 family for planning, codegen, and debugging, but all LLM calls go through Cloudflare AI Gateway. That enables unified routing across providers (OpenAI/Anthropic/Google/etc.), response caching for common requests, per-provider token/latency observability, and cost tracking. Swapping or mixing models is a config choice, not an architectural rewrite.
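
Because every call already flows through AI Gateway, pointing the platform at a different provider is mostly a base-URL and model-name change. Below is a minimal illustrative sketch of that pattern from a Python client (not VibeSDK’s internal code); ACCOUNT_ID, GATEWAY_ID, and the model name are placeholders, and the gateway URL shape should be confirmed against Cloudflare’s AI Gateway documentation.

from openai import OpenAI

ACCOUNT_ID = "your-cloudflare-account-id"   # placeholder
GATEWAY_ID = "vibesdk-gateway"              # placeholder: the gateway created in your dashboard

# The gateway proxies OpenAI-compatible requests, so the only change versus
# calling the provider directly is the base_url.
client = OpenAI(
    api_key="sk-...",  # upstream provider key; the gateway forwards the request
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY_ID}/openai",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # swapping models/providers is a config change, not a rewrite
    messages=[{"role": "user", "content": "Scaffold a React todo app"}],
)
print(resp.choices[0].message.content)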

https://blog.cloudflare.com/deploy-your-own-ai-vibe-coding-platform/

Safety and multitenancy

The design assumes untrusted, AI-generated code: every build runs in an isolated container or sandbox with fast start, controlled egress, and preview URLs; production deployment is multi-tenant by design (per-app Worker isolation, usage limits, and optional outbound firewalling). This model scales to “thousands or millions” of user apps without cross-tenant access.

Is it really one click—and can I take my code to GitHub or my own account?

Cloudflare provides a one-click deploy button. Once running, users can export generated projects to their own Cloudflare account or a GitHub repo for continued development—useful if you want to move work off the hosted instance or bring your own CI.

Why should platform teams care about “vibe coding” now?

“Vibe coding” shifts effort from hand-coding to supervising generative agents. VibeSDK hardens that pattern with a concrete, reproducible architecture: safe code execution, preview feedback loops, and cheap global deployment. For companies exploring AI builders for customers or internal teams, this replaces a weeks-to-months integration project with a baseline platform you can fork and specialize. For context, Cloudflare also documents the approach as a formal reference architecture so you can swap pieces (e.g., containers vs. sandboxes) without losing the system’s guarantees.

https://marktechpost.com/

Summary

Cloudflare’s VibeSDK turns “vibe coding” from demo to deployable substrate: a one-click stack that routes LLM calls through AI Gateway, executes AI-generated code in isolated sandboxes/containers, and publishes each app as a tenant-scoped Worker via Workers for Platforms; paired with project export and a formal reference architecture, it gives teams a reproducible path to ship AI app builders without re-inventing the runtime or safety model.



Read More
Google AI Research Introduces a Novel Machine Learning Approach that Transforms TimesFM into a Few-Shot Learner


 


Google Research introduces in-context fine-tuning (ICF) for time-series forecasting, named TimesFM-ICF: a continued-pretraining recipe that teaches TimesFM to exploit multiple related series provided directly in the prompt at inference time. The result is a few-shot forecaster that matches supervised fine-tuning while delivering +6.8% accuracy over the base TimesFM across an OOD benchmark—no per-dataset training loop required.

What pain point in forecasting is being eliminated?

Most production workflows still trade off between (a) one model per dataset via supervised fine-tuning (accuracy, but heavy MLOps) and (b) zero-shot foundation models (simple, but not domain-adapted). Google’s new approach keeps a single pre-trained TimesFM checkpoint but lets it adapt on the fly using a handful of in-context examples from related series during inference, avoiding per-tenant training pipelines.

How does in-context fine-tuning work under the hood?

Start with TimesFM—a patched, decoder-only transformer that tokenizes 32-point input patches and de-tokenizes 128-point outputs via a shared MLP—and continue pre-training it on sequences that interleave the target history with multiple “support” series. The key change is a learnable common separator token, so cross-example causal attention can mine structure across examples without conflating trends. The training objective remains next-token prediction; what’s new is the context construction that teaches the model to reason across multiple related series at inference time.

https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/

What exactly is “few-shot” here?

At inference, the user concatenates the target history with k additional time-series snippets (e.g., similar SKUs, adjacent sensors), each delimited by the separator token. The model’s attention layers are now explicitly trained to leverage those in-context examples, analogous to LLM few-shot prompting—but for numeric sequences rather than text tokens. This shifts adaptation from parameter updates to prompt engineering over structured series.
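
A minimal sketch of that prompt construction is shown below. The SEP marker and the timesfm_icf.forecast call are hypothetical placeholders (the real model operates on tokenized patches, not raw Python lists); the point is the shape of the prompt: support series as in-context examples, each bounded by the separator, with the target history last so causal attention over the whole context informs the forecast.

import numpy as np

SEP = "<sep>"  # stands in for the learned separator token in the real model

def build_icf_context(target_history, support_series, k=5):
    """Few-shot prompt: k related series as in-context examples, each bounded
    by the separator, followed by the target history to be continued."""
    context = []
    for series in support_series[:k]:
        context.extend(series)
        context.append(SEP)
    context.extend(target_history)
    return context

# Hypothetical usage: a few similar SKUs as support examples for one target SKU.
target = np.sin(np.linspace(0, 8, 256)).tolist()
supports = [np.sin(np.linspace(0, 8, 256) + p).tolist() for p in (0.3, 0.6, 0.9)]
prompt = build_icf_context(target, supports, k=3)
# forecast = timesfm_icf.forecast(prompt, horizon=128)   # hypothetical model call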

Does it actually match supervised fine-tuning?

On a 23-dataset out-of-domain suite, TimesFM-ICF equals the performance of per-dataset TimesFM-FT while being 6.8% more accurate than TimesFM-Base (geometric mean of scaled MASE). The blog also shows the expected accuracy–latency trade-off: more in-context examples yield better forecasts at the cost of longer inference. A “just make the context longer” ablation indicates that structured in-context examples beat naive long-context alone.
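
For readers unfamiliar with the aggregate metric, the sketch below shows one common way to compute a geometric mean of scaled MASE: each dataset’s MASE is divided by a reference model’s MASE on the same dataset, and the ratios are combined with a geometric mean so no single dataset dominates. The numbers are toy values, not results from the paper, and the exact scaling used by the authors may differ.

import numpy as np

def geomean_scaled_mase(mase_model, mase_reference):
    # Scale per-dataset MASE by the reference model, then take the geometric mean.
    ratios = np.asarray(mase_model) / np.asarray(mase_reference)
    return float(np.exp(np.mean(np.log(ratios))))

# Toy numbers for three datasets (illustrative only):
base = [0.92, 1.10, 0.75]   # reference (e.g., TimesFM-Base) MASE per dataset
icf  = [0.85, 1.01, 0.72]   # candidate (e.g., TimesFM-ICF) MASE per dataset
print(geomean_scaled_mase(icf, base))   # < 1.0 means the candidate is better on average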

How is this different from Chronos-style approaches?

Chronos tokenizes values into a discrete vocabulary and demonstrated strong zero-shot accuracy and fast variants (e.g., Chronos-Bolt). Google’s contribution here is not another tokenizer or headroom on zero-shot; it’s making a time-series FM behave like an LLM few-shot learner—learning from cross-series context at inference. That capability closes the gap between “train-time adaptation” and “prompt-time adaptation” for numeric forecasting.

What are the architectural specifics to watch?

The research team highlights: (1) separator tokens to mark boundaries, (2) causal self-attention over mixed histories and examples, (3) persisted patching and shared MLP heads, and (4) continued pre-training to instill cross-example behavior. Collectively, these enable the model to treat support series as informative exemplars rather than background noise.

Summary

Google’s in-context fine-tuning turns TimesFM into a practical few-shot forecaster: a single pretrained checkpoint that adapts at inference via curated support series, delivering fine-tuning-level accuracy without per-dataset training overhead—useful for multi-tenant, latency-bounded deployments where selection of support sets becomes the main control surface.


FAQs

1) What is Google’s “in-context fine-tuning” (ICF) for time series?
ICF is continued pre-training that conditions TimesFM to use multiple related series placed in the prompt at inference, enabling few-shot adaptation without per-dataset gradient updates.

2) How does ICF differ from standard fine-tuning and zero-shot use?
Standard fine-tuning updates weights per dataset; zero-shot uses a fixed model with only the target history. ICF keeps weights fixed at deployment but learns during pre-training how to leverage extra in-context examples, matching per-dataset fine-tuning on reported benchmarks.

3) What architectural or training changes were introduced?
TimesFM is continued-pretrained with sequences that interleave target history and support series, separated by special boundary tokens so causal attention can exploit cross-series structure; the rest of the decoder-only TimesFM stack stays intact.

4) What do the results show relative to baselines?
On out-of-domain suites, ICF improves over TimesFM base and reaches parity with supervised fine-tuning; it is evaluated against strong TS baselines (e.g., PatchTST) and prior FMs (e.g., Chronos).


Read More

A Coding Implementation of End-to-End Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization

 

In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines, including plain PyTorch, torch.compile, ONNX Runtime, and quantized ONNX. By doing this step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all inside a Google Colab environment.

!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate


from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig


os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")


MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR  = Path("onnx-distilbert")
Q_DIR    = Path("onnx-distilbert-quant")
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
BATCH    = 16
MAXLEN   = 128
N_WARM   = 3
N_ITERS  = 8


print(f"Device: {DEVICE} | torch={torch.__version__}")

We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU.

ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)


def make_batches(texts, max_len=MAXLEN, batch=BATCH):
   for i in range(0, len(texts), batch):
       yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                       max_length=max_len, return_tensors="pt")


def run_eval(predict_fn, texts, labels):
   preds = []
   for toks in make_batches(texts):
       preds.extend(predict_fn(toks))
   return metric.compute(predictions=preds, references=labels)["accuracy"]


def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
   for _ in range(n_warm):
       for toks in make_batches(texts[:BATCH*2]):
           predict_fn(toks)
   times = []
   for _ in range(n_iters):
       t0 = time.time()
       for toks in make_batches(texts):
           predict_fn(toks)
       times.append((time.time() - t0) * 1000)
   return float(np.mean(times)), float(np.std(times))

We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines using identical data and batching.

torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()


@torch.no_grad()
def pt_predict(toks):
   toks = {k: v.to(DEVICE) for k, v in toks.items()}
   logits = torch_model(**toks).logits
   return logits.argmax(-1).detach().cpu().tolist()


pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager]   {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")


compiled_model = torch_model
compile_ok = False
try:
   compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
   compile_ok = True
except Exception as e:
   print("torch.compile unavailable or failed -> skipping:", repr(e))


@torch.no_grad()
def ptc_predict(toks):
   toks = {k: v.to(DEVICE) for k, v in toks.items()}
   logits = compiled_model(**toks).logits
   return logits.argmax(-1).detach().cpu().tolist()


if compile_ok:
   ptc_ms, ptc_sd = bench(ptc_predict, texts)
   ptc_acc = run_eval(ptc_predict, texts, labels)
   print(f"[torch.compile]   {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")

We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark/score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup.

provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
ort_model = ORTModelForSequenceClassification.from_pretrained(
   MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR
)


@torch.no_grad()
def ort_predict(toks):
   logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
   return logits.argmax(-1).cpu().tolist()


ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime]    {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")


Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(quantization_config=qconfig, save_dir=Q_DIR)


ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)


@torch.no_grad()
def ortq_predict(toks):
   logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
   return logits.argmax(-1).cpu().tolist()


oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized]   {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")

We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum’s ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable.

pt_pipe  = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE=="cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
   "What a fantastic movie—performed brilliantly!",
   "This was a complete waste of time.",
   "I’m not sure how I feel about this one."
]
print("nSample predictions (PT | ORT):")
for s in samples:
   a = pt_pipe(s)[0]["label"]
   b = ort_pipe(s)[0]["label"]
   print(f"- {s}n  PT={a} | ORT={b}")


import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
       ["ONNX Runtime",  ort_ms, ort_sd, ort_acc],
       ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok: rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)


print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, build a config with is_static=True and provide a calibration dataset.
""")

We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting torch.compile results when available. We conclude with practical notes, allowing us to extend the workflow to other backends and quantization modes.

In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.



Read More
Google AI Introduces the Public Preview of Chrome DevTools MCP: Making Your Coding Agent Control and Inspect a Live Chrome Browser


 

Google has released the public preview of “Chrome DevTools MCP,” a Model Context Protocol (MCP) server that lets AI coding agents control and inspect a real Chrome instance—recording performance traces, inspecting the DOM and CSS, executing JavaScript, reading console output, and automating user flows. The launch directly targets a well-known limitation in code-generating agents: they usually cannot observe the runtime behavior of the pages they create or modify. By wiring agents into Chrome’s DevTools via MCP, Google is turning static suggestion engines into closed-loop debuggers that run measurements in the browser before proposing fixes.

What exactly is Chrome DevTools MCP?

MCP is an open protocol for connecting LLMs to tools and data. Google’s DevTools MCP acts as a specialized server that exposes Chrome’s debugging surface to MCP-compatible clients. Google’s developer blog positions this as “bringing the power of Chrome DevTools to AI coding assistants,” with concrete workflows like initiating a performance trace (e.g., performance_start_trace) against a target URL, then having the agent analyze the resulting trace to suggest optimizations (for example, diagnosing high Largest Contentful Paint).

Capabilities and tool surface

The official GitHub repository documents a broad tool set. Beyond performance tracing (performance_start_trace, performance_stop_trace, performance_analyze_insight), agents can run navigation primitives (navigate_page, new_page, wait_for), simulate user input (click, fill, drag, hover), and interrogate runtime state (list_console_messages, evaluate_script, list_network_requests, get_network_request). Screenshot and snapshot utilities provide visual and DOM-state capture to support diffs and regressions. The server uses Puppeteer under the hood for reliable automation and waiting semantics, and it speaks to Chrome via the Chrome DevTools Protocol (CDP).

Installation

Setup is intentionally minimal for MCP clients. Google recommends adding a single config stanza that shells out to npx, always tracking the latest server build:

{
  "mcpServers": {
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}

This server integrates with multiple agent front ends: Gemini CLI, Claude Code, Cursor, and GitHub Copilot’s MCP support. For VS Code/Copilot, the repo documents a code --add-mcp one-liner; for Claude Code, a claude mcp add command mirrors the same npx target. The package targets Node.js ≥22 and current Chrome.
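
Beyond IDE clients, the server can be exercised programmatically. The sketch below uses the official MCP Python SDK (pip install mcp) to launch the server over stdio and enumerate its tools; the performance_start_trace call is commented out because its exact argument schema should be taken from the chrome-devtools-mcp repository rather than assumed here.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the DevTools MCP server the same way the JSON config above does.
    params = StdioServerParameters(command="npx", args=["chrome-devtools-mcp@latest"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # expect navigate_page, evaluate_script, ...
            # Hypothetical call; check the repo for the real parameter names:
            # await session.call_tool("performance_start_trace", {"url": "https://example.com"})

asyncio.run(main())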

Example agent workflows

Google’s announcement highlights pragmatic prompts that demonstrate end-to-end loops: verify a proposed fix in a live browser; analyze network failures (e.g., CORS or blocked image requests); simulate user behaviors like form submission to reproduce bugs; inspect layout issues by reading DOM/CSS in context; and run automated performance audits to reduce LCP and other Core Web Vitals. These are all operations agents can now validate with actual measurements rather than heuristics.

https://developer.chrome.com/blog/chrome-devtools-mcp?hl=en

Summary

Chrome DevTools MCP’s public preview is a practical inflection point for agentic frontend tooling: it grounds AI assistants in real browser telemetry—performance traces, DOM/CSS state, network and console data—so recommendations are driven by measurements rather than guesswork. The first-party server, shipped by the Chrome DevTools team, is installable via npx and targets MCP-capable clients, with Chrome/CDP under the hood. Expect shorter diagnose-fix loops for regressions and flaky UI flows, plus tighter validation of performance work.



Read More
Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word


 

Real-time agents, live dubbing, and simultaneous translation die by a thousand milliseconds. Most “streaming” TTS (Text to Speech) stacks still wait for a chunk of text before they emit sound, so the human hears a beat of silence before the voice starts. VoXtream—released by KTH’s Speech, Music and Hearing group—attacks this head-on: it begins speaking after the first word, outputs audio in 80 ms frames, and reports 102 ms first-packet latency (FPL) on a modern GPU (with PyTorch compile).

What exactly is “full-stream” TTS and how is it different from “output streaming”?

Output-streaming systems decode speech in chunks but still require the entire input text upfront; the clock starts late. Full-stream systems consume text as it arrives (word-by-word from an LLM) and emit audio in lockstep. VoXtream implements the latter: it ingests a word stream and generates audio frames continuously, eliminating input-side buffering while maintaining low per-frame compute. The architecture explicitly targets first-word onset rather than only steady-state throughput.
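
The distinction is easy to see in pseudocode. The sketch below is illustrative only; the synthesize/feed_word/flush interfaces are hypothetical stand-ins, not VoXtream’s API, and the point is where the waiting happens.

def speak_output_streaming(tts, word_stream):
    # Output streaming: audio is chunked, but the full text must arrive first.
    text = " ".join(word_stream)           # blocks until the LLM finishes its sentence
    for frame in tts.synthesize(text):     # first audio only after all text is known
        yield frame

def speak_full_stream(tts, word_stream):
    # Full streaming: consume words as they arrive, emit 80 ms frames in lockstep.
    for word in word_stream:               # words trickle in from the upstream LLM
        for frame in tts.feed_word(word):  # speech can start after the first word
            yield frame
    yield from tts.flush()                 # emit any remaining buffered frames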

https://arxiv.org/pdf/2509.15969

How does VoXtream start speaking without waiting for future words?

The core trick is a dynamic phoneme look-ahead inside an incremental Phoneme Transformer (PT). PT may peek up to 10 phonemes to stabilize prosody, but it does not wait for that context; generation can start immediately after the first word enters the buffer. This avoids fixed look-ahead windows that add onset delay.

What’s the model stack under the hood?

VoXtream is a single, fully-autoregressive (AR) pipeline with three transformers:

  • Phoneme Transformer (PT): decoder-only, incremental; dynamic look-ahead ≤ 10 phonemes; phonemization via g2pE at the word level.
  • Temporal Transformer (TT): AR predictor over Mimi codec semantic tokens plus a duration token that encodes a monotonic phoneme-to-audio alignment (“stay/go” and {1, 2} phonemes per frame). Mimi runs at 12.5 Hz (→ 80 ms frames).
  • Depth Transformer (DT): AR generator for the remaining Mimi acoustic codebooks, conditioned on TT outputs and a ReDimNet speaker embedding for zero-shot voice prompting. The Mimi decoder reconstructs the waveform frame-by-frame, enabling continuous emission.

Mimi’s streaming codec design and dual-stream tokenization are well documented; VoXtream uses its first codebook as “semantic” context and the rest for high-fidelity reconstruction.

Is it actually fast in practice—or just “fast on paper”?

The repository includes a benchmark script that measures both FPL and real-time factor (RTF). On an A100, the research team reports 171 ms / 1.00 RTF without compile and 102 ms / 0.17 RTF with compile; on an RTX 3090, 205 ms / 1.19 RTF uncompiled and 123 ms / 0.19 RTF compiled.
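
As a rough guide to what those two numbers measure: FPL is the wall-clock time until the first audio frame exists, and RTF is synthesis time divided by the duration of audio produced, so RTF < 1.0 means faster than real time. The sketch below uses a hypothetical tts.stream interface and is not the repository’s benchmark script.

import time

def measure_fpl_and_rtf(tts, text, sample_rate=24000):
    t0 = time.perf_counter()
    first_packet_ms, samples = None, 0
    for frame in tts.stream(text):             # hypothetical streaming interface
        if first_packet_ms is None:            # time to the first 80 ms packet
            first_packet_ms = (time.perf_counter() - t0) * 1000
        samples += len(frame)
    total_s = time.perf_counter() - t0
    rtf = total_s / (samples / sample_rate)    # 0.17 is roughly 5.9x faster than real time
    return first_packet_ms, rtf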

How does it compare to today’s popular streaming baselines?

The research team evaluates short-form output streaming and full-stream scenarios. On LibriSpeech-long full-stream (where text arrives word-by-word), VoXtream shows lower WER (3.24 %) than CosyVoice2 (6.11 %) and a significant naturalness preference for VoXtream in listener studies (p ≤ 5e-10), while CosyVoice2 scores higher on speaker-similarity—consistent with its flow-matching decoder. In runtime, VoXtream has the lowest FPL among the compared public streaming systems, and with compile it operates >5× faster than real time (RTF ≈ 0.17).

https://arxiv.org/pdf/2509.15969

Why does this AR design beat diffusion/flow stacks on onset?

Diffusion/flow vocoders typically generate audio in chunks, so even if the text-audio interleaving is clever, the vocoder imposes a floor on first-packet latency. VoXtream keeps every stage AR and frame-synchronous—PT→TT→DT→Mimi decoder—so the first 80 ms packet emerges after one pass through the stack rather than a multi-step sampler. The introduction surveys prior interleaved and chunked approaches and explains how NAR flow-matching decoders used in IST-LM and CosyVoice2 impede low FPL despite strong offline quality.

Did they get here with huge data—or something smaller and cleaner?

VoXtream trains on a ~9k-hour mid-scale corpus: roughly 4.5k h Emilia and 4.5k h HiFiTTS-2 (22 kHz subset). The team diarized to remove multi-speaker clips, filtered transcripts using ASR, and applied NISQA to drop low-quality audio. Everything is resampled to 24 kHz, and the dataset card spells out the preprocessing pipeline and alignment artifacts (Mimi tokens, MFA alignments, duration labels, and speaker templates).

Are the headline quality metrics holding up outside cherry-picked clips?

Table 1 (zero-shot TTS) shows VoXtream is competitive on WER, UTMOS (MOS predictor), and speaker similarity across SEED-TTS test-en and LibriSpeech test-clean; the research team also runs an ablation: adding the CSM Depth Transformer and speaker encoder notably improves similarity without a significant WER penalty relative to a stripped baseline. The subjective study uses a MUSHRA-like protocol and a second-stage preference test tailored to full-stream generation.


Where does this land in the TTS landscape?

The research paper positions VoXtream among recent interleaved AR + NAR vocoder approaches and LM-codec stacks. The core contribution isn’t a new codec or a giant model—it’s a latency-focused AR arrangement plus a duration-token alignment that preserves input-side streaming. If you build live agents, the important trade-off is explicit: a small drop in speaker similarity vs. order-of-magnitude lower FPL than chunked NAR vocoders in full-stream conditions.



Read More
How to Create Reliable Conversational AI Agents Using Parlant?


 

Parlant is a framework designed to help developers build production-ready AI agents that behave consistently and reliably. A common challenge when deploying large language model (LLM) agents is that they often perform well in testing but fail when interacting with real users. They may ignore carefully designed system prompts, generate inaccurate or irrelevant responses at critical moments, struggle with edge cases, or produce inconsistent behavior from one conversation to another. 

Parlant addresses these challenges by shifting the focus from prompt engineering to principle-driven development. Instead of relying on prompts alone, it provides mechanisms to define clear rules and tool integrations, ensuring that an agent can access and process real-world data safely and predictably.

In this tutorial, we will create an insurance agent that can retrieve open claims, file new claims, and provide detailed policy information, demonstrating how to integrate domain-specific tools into a Parlant-powered AI system for consistent and reliable customer support.

Installing & importing the dependencies

pip install parlant


import asyncio
from datetime import datetime
import parlant.sdk as p

Defining the tools

The following code block introduces three tools that simulate interactions an insurance assistant might need. 

  • The get_open_claims tool represents an asynchronous function that retrieves a list of open insurance claims, allowing the agent to provide users with up-to-date information about pending or approved claims. 
  • The file_claim tool accepts claim details as input and simulates the process of filing a new insurance claim, returning a confirmation message to the user. 

Finally, the get_policy_details tool provides essential policy information, such as the policy number and coverage limits, enabling the agent to respond accurately to questions about insurance coverage.

@p.tool
async def get_open_claims(context: p.ToolContext) -> p.ToolResult:
    return p.ToolResult(data=["Claim #123 - Pending", "Claim #456 - Approved"])
@p.tool
async def file_claim(context: p.ToolContext, claim_details: str) -> p.ToolResult:
    return p.ToolResult(data=f"New claim filed: {claim_details}")
@p.tool
async def get_policy_details(context: p.ToolContext) -> p.ToolResult:
    return p.ToolResult(data={
        "policy_number": "POL-7788",
        "coverage": "Covers accidental damage and theft up to $50,000"
    })

Defining Glossary & Journeys

In this section, we define the glossary and journeys that shape how the agent handles domain knowledge and conversations. The glossary contains important business terms, such as the customer service number and operating hours, allowing the agent to reference them accurately when needed. 

The journeys describe step-by-step processes for specific tasks. In this example, one journey walks a customer through filing a new insurance claim, while another focuses on retrieving and explaining policy coverage details.

async def add_domain_glossary(agent: p.Agent):
    await agent.create_term(
        name="Customer Service Number",
        description="You can reach us at +1-555-INSURE",
    )
    await agent.create_term(
        name="Operating Hours",
        description="We are available Mon-Fri, 9AM-6PM",
    )
async def create_claim_journey(agent: p.Agent) -> p.Journey:
    journey = await agent.create_journey(
        title="File an Insurance Claim",
        description="Helps customers report and submit a new claim.",
        conditions=["The customer wants to file a claim"],
    )
    s0 = await journey.initial_state.transition_to(chat_state="Ask for accident details")
    s1 = await s0.target.transition_to(tool_state=file_claim, condition="Customer provides details")
    s2 = await s1.target.transition_to(chat_state="Confirm claim was submitted")
    await s2.target.transition_to(state=p.END_JOURNEY)
    return journey
async def create_policy_journey(agent: p.Agent) -> p.Journey:
    journey = await agent.create_journey(
        title="Explain Policy Coverage",
        description="Retrieves and explains customer's insurance coverage.",
        conditions=["The customer asks about their policy"],
    )
    s0 = await journey.initial_state.transition_to(tool_state=get_policy_details)
    await s0.target.transition_to(
        chat_state="Explain the policy coverage clearly",
        condition="Policy info is available",
    )
    await agent.create_guideline(
        condition="Customer presses for legal interpretation of coverage",
        action="Politely explain that legal advice cannot be provided",
    )
    return journey

Defining the main runner

The main runner ties together all the components defined in earlier cells and launches the agent. It starts a Parlant server, creates the insurance support agent, and loads its glossary, journeys, and global guidelines. It also handles edge cases such as ambiguous customer intent by prompting the agent to choose between relevant journeys. Finally, once the script is executed, the server becomes active and prints a confirmation message. You can then open your browser and navigate to http://localhost:8800 to access the Parlant UI and begin interacting with the insurance agent in real time.

async def main():
    async with p.Server() as server:
        agent = await server.create_agent(
            name="Insurance Support Agent",
            description="Friendly and professional; helps with claims and policy queries.",
        )
        await add_domain_glossary(agent)
        claim_journey = await create_claim_journey(agent)
        policy_journey = await create_policy_journey(agent)
        # Disambiguation: if intent is unclear
        status_obs = await agent.create_observation(
            "Customer mentions an issue but doesn't specify if it's a claim or policy"
        )
        await status_obs.disambiguate([claim_journey, policy_journey])
        # Global guideline
        await agent.create_guideline(
            condition="Customer asks about unrelated topics",
            action="Kindly redirect them to insurance-related support only",
        )
        print("✅ Insurance Agent is ready! Open the Parlant UI to chat.")
if __name__ == "__main__":
    import asyncio
    asyncio.run(main())


Read More
Microsoft Brings MCP to Azure Logic Apps (Standard) in Public Preview, Turning Connectors into Agent Tools


 

Microsoft has released a public preview that enables Azure Logic Apps (Standard) to run as Model Context Protocol (MCP) servers, exposing Logic Apps workflows as agent tools discoverable and callable by MCP-capable clients (e.g., VS Code + Copilot).

What’s actually shipping

  • Remote MCP server on Logic Apps (Standard): You configure a Standard logic app to host an MCP endpoint (/api/mcp) and surface HTTP Request/Response workflows as tools. Authentication is front-doored by Easy Auth; MCP endpoints default to OAuth 2.0. VS Code (≥1.102) includes GA MCP client support for testing.
  • API Center registration path (preview): You can also create/register MCP servers in Azure API Center, where selected managed connector actions become tools with cataloging and governance.
https://learn.microsoft.com/en-us/azure/logic-apps/set-up-model-context-protocol-server-standard

Key requirements and transport details

  • Workflow shape: Tools must be implemented as HTTP Request trigger (“When a HTTP request is received”) plus a Response action.
  • Auth & access control: By default, MCP uses OAuth 2.0; Easy Auth enforces client/identity/tenant restrictions. During setup, App Service authentication must allow unauthenticated requests (the MCP flow still performs OAuth).
  • Transports: Streamable HTTP works out of the box. SSE additionally requires VNET integration and host.json setting
    Runtime.Backend.EdgeWorkflowRuntimeTriggerListener.AllowCrossWorkerCommunication=true.
  • Enablement switch: MCP APIs are enabled by adding extensions.workflow.McpServerEndpoints.enable=true in host.json.

API Center path: preview limitations that matter

When creating MCP servers via API Center backed by Logic Apps, the current preview imposes the following limits:

  • Start with an empty Standard logic app resource.
  • One connector per MCP server.
  • Built-in service-provider and custom connectors aren’t supported in this path (managed connectors only).
  • One action per tool.

These constraints materially affect tool granularity and server layout in larger estates.

Why Standard (single-tenant) is the target?

Standard runs on the single-tenant Logic Apps runtime (on Azure Functions), supports multiple workflows per app, and integrates directly with virtual networks and private endpoints—all relevant for exposing private systems safely to agents and for predictable throughput/latency. By contrast, Consumption is multitenant, single-workflow per app, and pay-per-execution.

Tooling semantics and discoverability

Microsoft recommends adding trigger descriptions, parameter schemas/descriptions, and required markers to improve agent tool selection and invocation reliability. These annotations are read by MCP clients and influence calling behavior.

Connectors and enterprise reach

Organizations can front existing workflows and a large catalog of Logic Apps connectors (cloud and on-prem) through MCP, turning them into callable agent tools.

Operations, governance, and testing

Run history plus Application Insights/Log Analytics are available for diagnostics and auditability. VS Code provides quick client validation via MCP: Add Server, including OAuth sign-in and tool enumeration. Registering via API Center brings discovery/governance to MCP servers across teams.

Production notes (preview)

  • SSE requires both VNET and the cross-worker setting; without these, use streamable HTTP.
  • Easy Auth must be configured precisely (including the “allow unauthenticated” toggle) or client sign-in flows will fail despite OAuth expectations.
  • Throttling, idempotency, and schema versioning remain your responsibility when wrapping connectors as tools (not new, but now in the agent path). InfoQ highlights similar operational concerns from early adopters.

Summary

The preview cleanly MCP-enables Logic Apps (Standard): you expose HTTP-based workflows as OAuth-protected tools; you can catalog them in API Center; and you can reach private systems through single-tenant networking. For teams already invested in Logic Apps, this is a low-friction, standards-aligned route to operationalize enterprise agent tooling—just mind the API Center limits, SSE prerequisites, and Easy Auth nuances during rollout.


