Google AI Ships TimesFM-2.5: Smaller, Longer-Context Foundation Model That Now Leads GIFT-Eval (Zero-Shot Forecasting)


Google Research has released TimesFM-2.5, a 200M-parameter, decoder-only time-series foundation model with a 16K context length and native probabilistic forecasting support. The new checkpoint is live on Hugging Face. On GIFT-Eval, TimesFM-2.5 now tops the leaderboard across accuracy metrics (MASE, CRPS) among zero-shot foundation models.
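
For context on the leaderboard metrics, the sketch below shows one standard way to compute MASE against a seasonal-naive baseline and to approximate CRPS from quantile forecasts via the pinball loss. The arrays are synthetic and this is an illustrative sketch, not GIFT-Eval's evaluation code.

import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    # Mean Absolute Scaled Error: forecast MAE divided by the in-sample MAE
    # of a seasonal-naive forecast on the training history.
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

def crps_from_quantiles(y_true, q_levels, q_preds):
    # Approximate CRPS by averaging the quantile (pinball) loss over levels;
    # q_preds has shape (num_levels, horizon).
    losses = []
    for tau, q in zip(q_levels, q_preds):
        diff = y_true - q
        losses.append(np.mean(np.maximum(tau * diff, (tau - 1) * diff)))
    return 2 * np.mean(losses)

# Toy usage with synthetic numbers (purely illustrative).
history = np.sin(np.arange(200) / 10.0)
actuals = np.sin(np.arange(200, 230) / 10.0)
point_forecast = actuals + np.random.normal(0, 0.05, size=30)
levels = np.arange(0.1, 1.0, 0.1)
quantile_forecast = np.stack([point_forecast + (tau - 0.5) * 0.2 for tau in levels])

print("MASE:", mase(actuals, point_forecast, history, season=1))
print("CRPS (approx):", crps_from_quantiles(actuals, levels, quantile_forecast))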

What is Time-Series Forecasting?

Time-series forecasting is the practice of analyzing sequential data points collected over time to identify patterns and predict future values. It underpins critical applications across industries, including forecasting product demand in retail, monitoring weather and precipitation trends, and optimizing large-scale systems such as supply chains and energy grids. By capturing temporal dependencies and seasonal variations, time-series forecasting enables data-driven decision-making in dynamic environments.

What changed in TimesFM-2.5 vs v2.0?

  • Parameters: 200M (down from 500M in 2.0).
  • Max context: 16,384 points (up from 2,048).
  • Quantiles: Optional 30M-param quantile head for continuous quantile forecasts up to 1K horizon.
  • Inputs: No “frequency” indicator required; new inference flags (flip-invariance, positivity inference, quantile-crossing fix; a generic illustration of such a fix appears after this list).
  • Roadmap: Upcoming Flax implementation for faster inference; covariates support slated to return; docs being expanded.
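
As a generic illustration of the quantile-crossing issue mentioned above (not TimesFM's internal implementation), a common fix is to sort the predicted quantiles at each time step so higher levels never fall below lower ones:

import numpy as np

def fix_quantile_crossing(quantile_forecasts):
    # quantile_forecasts: shape (num_quantile_levels, horizon), rows ordered
    # from the lowest to the highest quantile level. Sorting along the level
    # axis enforces monotonicity at every time step, removing crossed quantiles.
    return np.sort(quantile_forecasts, axis=0)

# Toy example: the 0.9-quantile row dips below the 0.5-quantile row at step 1.
q = np.array([
    [1.0, 1.1, 1.2],   # 0.1 quantile
    [2.0, 2.2, 2.1],   # 0.5 quantile
    [3.0, 1.9, 3.3],   # 0.9 quantile (crosses at step 1)
])
print(fix_quantile_crossing(q))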

Why does a longer context matter?

16K historical points allow a single forward pass to capture multi-seasonal structure, regime breaks, and low-frequency components without tiling or hierarchical stitching. In practice, that reduces pre-processing heuristics and improves stability for domains where context >> horizon (e.g., energy load, retail demand). The longer context is a core design change explicitly noted for 2.5.

What’s the research context?

TimesFM’s core thesis—a single, decoder-only foundation model for forecasting—was introduced in the ICML 2024 paper and Google’s research blog. GIFT-Eval (Salesforce) emerged to standardize evaluation across domains, frequencies, horizon lengths, and univariate/multivariate regimes, with a public leaderboard hosted on Hugging Face.

Key Takeaways

  • Smaller, Faster Model: TimesFM-2.5 runs with 200M parameters (down from 500M in 2.0) while improving accuracy.
  • Longer Context: Supports 16K input length, enabling forecasts with deeper historical coverage.
  • Benchmark Leader: Now ranks #1 among zero-shot foundation models on GIFT-Eval for both MASE (point accuracy) and CRPS (probabilistic accuracy).
  • Production-Ready: Efficient design and quantile forecasting support make it suitable for real-world deployments across industries.
  • Broad Availability: The model is live on Hugging Face.

Summary

TimesFM-2.5 shows that foundation models for forecasting are moving past proof-of-concept into practical, production-ready tools. By cutting parameters in half while extending context length and leading GIFT-Eval across both point and probabilistic accuracy, it marks a step-change in efficiency and capability. With Hugging Face access already live and BigQuery/Model Garden integration on the way, the model is positioned to accelerate adoption of zero-shot time-series forecasting in real-world pipelines.



Read More
Stanford Researchers Introduced MedAgentBench: A Real-World Benchmark for Healthcare AI Agents


A team of Stanford University researchers has released MedAgentBench, a new benchmark suite designed to evaluate large language model (LLM) agents in healthcare contexts. Unlike prior question-answering datasets, MedAgentBench provides a virtual electronic health record (EHR) environment where AI systems must interact, plan, and execute multi-step clinical tasks. This marks a significant shift from testing static reasoning to assessing agentic capabilities in live, tool-based medical workflows.

https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

Why Do We Need Agentic Benchmarks in Healthcare?

Recent LLMs have moved beyond static chat-based interactions toward agentic behavior—interpreting high-level instructions, calling APIs, integrating patient data, and automating complex processes. In medicine, this evolution could help address staff shortages, documentation burden, and administrative inefficiencies.

While general-purpose agent benchmarks (e.g., AgentBench, AgentBoard, tau-bench) exist, healthcare lacked a standardized benchmark that captures the complexity of medical data, FHIR interoperability, and longitudinal patient records. MedAgentBench fills this gap by offering a reproducible, clinically relevant evaluation framework.

What Does MedAgentBench Contain?

How Are the Tasks Structured?

MedAgentBench consists of 300 tasks across 10 categories, written by licensed physicians. These tasks include patient information retrieval, lab result tracking, documentation, test ordering, referrals, and medication management. Tasks average 2–3 steps and mirror workflows encountered in inpatient and outpatient care.

What Patient Data Supports the Benchmark?

The benchmark leverages 100 realistic patient profiles extracted from Stanford’s STARR data repository, comprising over 700,000 records including labs, vitals, diagnoses, procedures, and medication orders. Data was de-identified and jittered for privacy while preserving clinical validity.

How Is the Environment Built?

The environment is FHIR-compliant, supporting both retrieval (GET) and modification (POST) of EHR data. AI systems can simulate realistic clinical interactions such as documenting vitals or placing medication orders. This design makes the benchmark directly translatable to live EHR systems.
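
To make the GET/POST interaction concrete, here is a minimal sketch of the kind of FHIR calls an agent might issue against such an environment. The base URL, patient ID, and query parameters are placeholders, not MedAgentBench's actual configuration.

import requests

FHIR_BASE = "http://localhost:8080/fhir"  # placeholder base URL for a sandbox EHR

# Retrieval (GET): fetch the most recent serum glucose observations for a patient.
resp = requests.get(
    f"{FHIR_BASE}/Observation",
    params={"patient": "example-patient-id", "code": "2345-7", "_sort": "-date", "_count": 5},
)
observations = resp.json().get("entry", [])

# Modification (POST): record a new vital-sign observation (heart rate).
new_obs = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}]},
    "subject": {"reference": "Patient/example-patient-id"},
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}
resp = requests.post(f"{FHIR_BASE}/Observation", json=new_obs)
print(resp.status_code)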

How Are Models Evaluated?

  • Metric: Task success rate (SR), measured with strict pass@1 to reflect real-world safety requirements.
  • Models Tested: 12 leading LLMs including GPT-4o, Claude 3.5 Sonnet, Gemini 2.0, DeepSeek-V3, Qwen2.5, and Llama 3.3.
  • Agent Orchestrator: A baseline orchestration setup with nine FHIR functions, limited to eight interaction rounds per task (a minimal loop sketch follows this list).
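
Below is a minimal sketch of how such an orchestrator loop and strict pass@1 scoring could be wired together. call_llm, call_fhir_function, and check_success are hypothetical stand-ins, not MedAgentBench's real interfaces.

MAX_ROUNDS = 8  # each task gets at most eight interaction rounds

def run_task(task, call_llm, call_fhir_function, check_success):
    # call_llm, call_fhir_function, and check_success are hypothetical stand-ins
    # for the model interface, the FHIR tool layer, and the task-specific grader.
    history = [{"role": "user", "content": task["instruction"]}]
    for _ in range(MAX_ROUNDS):
        action = call_llm(history)                 # model proposes a tool call or a final answer
        if action["type"] == "final_answer":
            return check_success(task, action["content"])
        result = call_fhir_function(action["name"], action["arguments"])
        history.append({"role": "tool", "content": result})
    return False  # ran out of rounds without a final answer

def success_rate(tasks, **helpers):
    # Strict pass@1: a single attempt per task, no retries.
    return sum(run_task(t, **helpers) for t in tasks) / len(tasks)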

Which Models Performed Best?

  • Claude 3.5 Sonnet v2: Best overall with 69.67% success, especially strong in retrieval tasks (85.33%).
  • GPT-4o: 64.0% success, showing balanced retrieval and action performance.
  • DeepSeek-V3: 62.67% success, leading among open-weight models.
  • Observation: Most models excelled at query tasks but struggled with action-based tasks requiring safe multi-step execution.
https://ai.nejm.org/doi/full/10.1056/AIdbp2500144

What Errors Did Models Make?

Two dominant failure patterns emerged:

  1. Instruction adherence failures — invalid API calls or incorrect JSON formatting.
  2. Output mismatch — providing full sentences when structured numerical values were required.

These errors highlight gaps in precision and reliability, both critical in clinical deployment.
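
Both failure patterns can be caught mechanically before any clinical action is taken. The sketch below shows the kind of output validation a deployment might add; the field names are illustrative, not taken from the paper.

import json

def validate_agent_output(raw_output, expects_number=True):
    """Guard against the two dominant error patterns:
    (1) malformed JSON / invalid tool calls, and
    (2) free-text answers where a structured numeric value is required."""
    try:
        payload = json.loads(raw_output)          # catches malformed JSON up front
    except json.JSONDecodeError as err:
        return False, f"invalid JSON: {err}"

    if expects_number:
        value = payload.get("value")
        if not isinstance(value, (int, float)):   # rejects prose like "The heart rate is 72 bpm"
            return False, "expected a numeric 'value' field"
    return True, payload

print(validate_agent_output('{"value": 72}'))
print(validate_agent_output("The heart rate is 72 bpm"))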

Summary

MedAgentBench establishes the first large-scale benchmark for evaluating LLM agents in realistic EHR settings, pairing 300 clinician-authored tasks with a FHIR-compliant environment and 100 patient profiles. Results show strong potential but limited reliability—Claude 3.5 Sonnet v2 leads at 69.67%—highlighting the gap between query success and safe action execution. While constrained by single-institution data and EHR-focused scope, MedAgentBench provides an open, reproducible framework to drive the next generation of dependable healthcare AI agents.



Read More
MoonshotAI Released Checkpoint-Engine: A Simple Middleware to Update Model Weights in LLM Inference Engines, Effective for Reinforcement Learning


MoonshotAI has open-sourced checkpoint-engine, a lightweight middleware aimed at solving one of the key bottlenecks in large language model (LLM) deployment: rapidly updating model weights across thousands of GPUs without disrupting inference.

The library is particularly designed for reinforcement learning (RL) and reinforcement learning with human feedback (RLHF), where models are updated frequently and downtime directly impacts system throughput.

https://github.com/MoonshotAI/checkpoint-engine

How Fast can LLMs be updated?

Checkpoint-engine delivers a significant breakthrough by updating a 1-trillion parameter model across thousands of GPUs in roughly 20 seconds.

Traditional distributed inference pipelines can take several minutes to reload models of this size. By reducing the update time by an order of magnitude, checkpoint-engine directly addresses one of the largest inefficiencies in large-scale serving.

The system achieves this through:

  • Broadcast updates for static clusters.
  • Peer-to-peer (P2P) updates for dynamic clusters.
  • Overlapped communication and memory copy for reduced latency.

What does the Architecture look like?

Checkpoint-engine sits between training engines and inference clusters. Its design includes:

  • A Parameter Server that coordinates updates.
  • Worker Extensions that integrate with inference frameworks such as vLLM.

The weight update pipeline runs in three stages:

  1. Host-to-Device (H2D): Parameters are copied into GPU memory.
  2. Broadcast: Weights are distributed across workers using CUDA IPC buffers.
  3. Reload: Each inference shard reloads only the subset of weights it needs.

This staged pipeline is optimized for overlap, ensuring GPUs remain active throughout the update process.
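
As a rough illustration of that three-stage flow (not checkpoint-engine's actual code), the sketch below uses plain PyTorch collectives and reduces the sharding and overlap logic to their simplest form; the real system overlaps H2D copies with broadcasts and uses CUDA IPC buffers.

import torch
import torch.distributed as dist

def update_weights(model, new_state_dict, src_rank=0):
    # Assumes torch.distributed has already been initialized (init_process_group).
    # Stage 1 (H2D): the source rank copies fresh parameters from host to GPU memory.
    # Stage 2 (Broadcast): every rank receives the staged tensor via a collective.
    # Stage 3 (Reload): each rank copies the received weights into its serving model.
    for name, param in model.named_parameters():
        if dist.get_rank() == src_rank:
            staged = new_state_dict[name].to(param.device, non_blocking=True)
        else:
            staged = torch.empty_like(param)
        dist.broadcast(staged, src=src_rank)
        with torch.no_grad():
            param.copy_(staged)

In the real pipeline, each inference shard would reload only the subset of weights it serves rather than the full state dict.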

How does it perform in practice?

Benchmarking results confirm checkpoint-engine’s scalability:

  • GLM-4.5-Air (BF16, 8×H800): 3.94s (broadcast), 8.83s (P2P).
  • Qwen3-235B-Instruct (BF16, 8×H800): 6.75s (broadcast), 16.47s (P2P).
  • DeepSeek-V3.1 (FP8, 16×H20): 12.22s (broadcast), 25.77s (P2P).
  • Kimi-K2-Instruct (FP8, 256×H20): ~21.5s (broadcast), 34.49s (P2P).

Even at trillion-parameter scale with 256 GPUs, broadcast updates complete in about 20 seconds, validating its design goal.

What are some trade-offs?

Checkpoint-engine introduces notable advantages, but also comes with limitations:

  • Memory Overhead: Overlapped pipelines require additional GPU memory; insufficient memory triggers slower fallback paths.
  • P2P Latency: Peer-to-peer updates support elastic clusters but at a performance cost.
  • Compatibility: Officially tested with vLLM only; broader engine support requires engineering work.
  • Quantization: FP8 support exists but remains experimental.

Where does it fit in deployment scenarios?

Checkpoint-engine is most valuable for:

  • Reinforcement learning pipelines where frequent weight updates are required.
  • Large inference clusters serving 100B–1T+ parameter models.
  • Elastic environments with dynamic scaling, where P2P flexibility offsets latency trade-offs.

Summary

Checkpoint-engine represents a focused solution to one of the hardest problems in large-scale LLM deployment: rapid weight synchronization without halting inference. With demonstrated updates at trillion-parameter scale in around 20 seconds, flexible support for both broadcast and P2P modes, and an optimized communication pipeline, it provides a practical path forward for reinforcement learning pipelines and high-performance inference clusters. While still limited to vLLM and requiring refinements in quantization and dynamic scaling, it establishes an important foundation for efficient, continuous model updates in production AI systems.



Read More

Building an Advanced Convolutional Neural Network with Attention for DNA Sequence Classification and Interpretability

 

In this tutorial, we take a hands-on approach to building an advanced convolutional neural network for DNA sequence classification. We focus on simulating real biological tasks, such as promoter prediction, splice site detection, and regulatory element identification. By combining one-hot encoding, multi-scale convolutional layers, and an attention mechanism, we design a model that not only learns complex motifs but also provides interpretability. As we progress, we generate synthetic data, train with robust callbacks, and visualize results to ensure we fully understand the strengths and limitations of our approach.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import random


np.random.seed(42)
tf.random.set_seed(42)
random.seed(42)

We begin by importing the libraries for deep learning, data handling, and visualization. We set random seeds to ensure reproducibility so that our experiments run consistently each time.

class DNASequenceClassifier:
   def __init__(self, sequence_length=200, num_classes=2):
       self.sequence_length = sequence_length
       self.num_classes = num_classes
       self.model = None
       self.history = None
      
   def one_hot_encode(self, sequences):
       mapping = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
       encoded = np.zeros((len(sequences), self.sequence_length, 4))
      
       for i, seq in enumerate(sequences):
           for j, nucleotide in enumerate(seq[:self.sequence_length]):
               if nucleotide in mapping:
                   encoded[i, j, mapping[nucleotide]] = 1
       return encoded
  
   def attention_layer(self, inputs, name="attention"):
       attention_weights = layers.Dense(1, activation='tanh', name=f"{name}_weights")(inputs)
       attention_weights = layers.Flatten()(attention_weights)
       attention_weights = layers.Activation('softmax', name=f"{name}_softmax")(attention_weights)
       attention_weights = layers.RepeatVector(inputs.shape[-1])(attention_weights)
       attention_weights = layers.Permute([2, 1])(attention_weights)
      
       attended = layers.Multiply(name=f"{name}_multiply")([inputs, attention_weights])
       return layers.GlobalMaxPooling1D()(attended)
  
   def build_model(self):
       inputs = layers.Input(shape=(self.sequence_length, 4), name="dna_input")
      
       conv_layers = []
       filter_sizes = [3, 7, 15, 25]
      
       for i, filter_size in enumerate(filter_sizes):
           conv = layers.Conv1D(
               filters=64,
               kernel_size=filter_size,
               activation='relu',
               padding='same',
               name=f"conv_{filter_size}"
           )(inputs)
           conv = layers.BatchNormalization(name=f"bn_conv_{filter_size}")(conv)
           conv = layers.Dropout(0.2, name=f"dropout_conv_{filter_size}")(conv)
          
           attended = self.attention_layer(conv, name=f"attention_{filter_size}")
           conv_layers.append(attended)
      
       if len(conv_layers) > 1:
           merged = layers.Concatenate(name="concat_multiscale")(conv_layers)
       else:
           merged = conv_layers[0]
      
       dense = layers.Dense(256, activation='relu', name="dense_1")(merged)
       dense = layers.BatchNormalization(name="bn_dense_1")(dense)
       dense = layers.Dropout(0.5, name="dropout_dense_1")(dense)
      
       dense = layers.Dense(128, activation='relu', name="dense_2")(dense)
       dense = layers.BatchNormalization(name="bn_dense_2")(dense)
       dense = layers.Dropout(0.3, name="dropout_dense_2")(dense)
      
       if self.num_classes == 2:
           outputs = layers.Dense(1, activation='sigmoid', name="output")(dense)
           loss = 'binary_crossentropy'
           metrics = ['accuracy', 'precision', 'recall']
       else:
           outputs = layers.Dense(self.num_classes, activation='softmax', name="output")(dense)
           loss = 'categorical_crossentropy'
           metrics = ['accuracy']
      
       self.model = keras.Model(inputs=inputs, outputs=outputs, name="DNA_CNN_Classifier")
      
       optimizer = keras.optimizers.Adam(
           learning_rate=0.001,
           beta_1=0.9,
           beta_2=0.999,
           epsilon=1e-7
       )
      
       self.model.compile(
           optimizer=optimizer,
           loss=loss,
           metrics=metrics
       )
      
       return self.model
  
   def generate_synthetic_data(self, n_samples=10000):
       sequences = []
       labels = []
      
       positive_motifs = ['TATAAA', 'CAAT', 'GGGCGG', 'TTGACA']
       negative_motifs = ['AAAAAAA', 'TTTTTTT', 'CCCCCCC', 'GGGGGGG']
      
       nucleotides = ['A', 'T', 'G', 'C']
      
       for i in range(n_samples):
           sequence = ''.join(random.choices(nucleotides, k=self.sequence_length))
          
           if i < n_samples // 2:
               motif = random.choice(positive_motifs)
               pos = random.randint(0, self.sequence_length - len(motif))
               sequence = sequence[:pos] + motif + sequence[pos + len(motif):]
               label = 1
           else:
               if random.random() < 0.3:
                   motif = random.choice(negative_motifs)
                   pos = random.randint(0, self.sequence_length - len(motif))
                   sequence = sequence[:pos] + motif + sequence[pos + len(motif):]
               label = 0
          
           sequences.append(sequence)
           labels.append(label)
      
       return sequences, np.array(labels)
  
   def train(self, X_train, y_train, X_val, y_val, epochs=50, batch_size=32):
       callbacks = [
           keras.callbacks.EarlyStopping(
               monitor='val_loss',
               patience=10,
               restore_best_weights=True
           ),
           keras.callbacks.ReduceLROnPlateau(
               monitor='val_loss',
               factor=0.5,
               patience=5,
               min_lr=1e-6
           )
       ]
      
       self.history = self.model.fit(
           X_train, y_train,
           validation_data=(X_val, y_val),
           epochs=epochs,
           batch_size=batch_size,
           callbacks=callbacks,
           verbose=1
       )
      
       return self.history
  
   def evaluate_and_visualize(self, X_test, y_test):
       y_pred_proba = self.model.predict(X_test)
       y_pred = (y_pred_proba > 0.5).astype(int).flatten()
      
       print("Classification Report:")
       print(classification_report(y_test, y_pred))
      
       fig, axes = plt.subplots(2, 2, figsize=(15, 10))
      
       axes[0,0].plot(self.history.history['loss'], label='Training Loss')
       axes[0,0].plot(self.history.history['val_loss'], label='Validation Loss')
       axes[0,0].set_title('Training History - Loss')
       axes[0,0].set_xlabel('Epoch')
       axes[0,0].set_ylabel('Loss')
       axes[0,0].legend()
      
       axes[0,1].plot(self.history.history['accuracy'], label='Training Accuracy')
       axes[0,1].plot(self.history.history['val_accuracy'], label='Validation Accuracy')
       axes[0,1].set_title('Training History - Accuracy')
       axes[0,1].set_xlabel('Epoch')
       axes[0,1].set_ylabel('Accuracy')
       axes[0,1].legend()
      
       cm = confusion_matrix(y_test, y_pred)
       sns.heatmap(cm, annot=True, fmt='d', ax=axes[1,0], cmap='Blues')
       axes[1,0].set_title('Confusion Matrix')
       axes[1,0].set_ylabel('Actual')
       axes[1,0].set_xlabel('Predicted')
      
       axes[1,1].hist(y_pred_proba[y_test==0], bins=50, alpha=0.7, label='Negative', density=True)
       axes[1,1].hist(y_pred_proba[y_test==1], bins=50, alpha=0.7, label='Positive', density=True)
       axes[1,1].set_title('Prediction Score Distribution')
       axes[1,1].set_xlabel('Prediction Score')
       axes[1,1].set_ylabel('Density')
       axes[1,1].legend()
      
       plt.tight_layout()
       plt.show()
      
       return y_pred, y_pred_proba

We define a DNASequenceClassifier that encodes sequences, learns multi-scale motifs with CNNs, and applies an attention mechanism for interpretability. We build and compile the model, generate synthetic motif-rich data, and then train with robust callbacks and visualize performance to evaluate classification quality.

def main():
   print("🧬 Advanced DNA Sequence Classification with CNN")
   print("=" * 50)
  
   classifier = DNASequenceClassifier(sequence_length=200, num_classes=2)
  
   print("Generating synthetic DNA sequences...")
   sequences, labels = classifier.generate_synthetic_data(n_samples=10000)
  
   print("Encoding DNA sequences...")
   X = classifier.one_hot_encode(sequences)
  
    # Split off a held-out test set, then carve a validation set out of the training data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=42, stratify=labels
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
    )
  
   print(f"Training set: {X_train.shape}")
   print(f"Validation set: {X_val.shape}")
   print(f"Test set: {X_test.shape}")
  
   print("Building CNN model...")
   model = classifier.build_model()
   print(model.summary())
  
   print("Training model...")
   classifier.train(X_train, y_train, X_val, y_val, epochs=30, batch_size=64)
  
   print("Evaluating model...")
   y_pred, y_pred_proba = classifier.evaluate_and_visualize(X_test, y_test)
  
   print("✅ Training and evaluation complete!")


if __name__ == "__main__":
   main()

We wrap up the workflow in the main() function, where we generate synthetic DNA data, encode it, split it into training, validation, and test sets, then build, train, and evaluate our CNN model. We conclude by visualizing the performance and confirming that the classification pipeline runs successfully from start to finish.

In conclusion, we successfully demonstrate how a carefully designed CNN with attention can classify DNA sequences with high accuracy and interpretability. We see how synthetic biological motifs help validate the model’s capacity for pattern recognition, and how visualization techniques provide meaningful insights into training dynamics and predictions. Through this journey, we enhance our ability to integrate deep learning architectures with biological data, laying the groundwork for applying these methods to real-world genomics research.



Read More
OpenAI Introduces GPT-5-Codex: An Advanced Version of GPT-5 Further Optimized for Agentic Coding in Codex


OpenAI has just released GPT-5-Codex, a version of GPT-5 further optimized for “agentic coding” tasks within the Codex ecosystem. The goal: improve reliability, speed, and autonomous behavior so that Codex acts more like a teammate, not just a prompt-executor.

Codex is now available across the full developer workflow: CLI, IDE extensions, web, mobile, GitHub code reviews. It integrates well with cloud environments and developer tools.

https://openai.com/index/introducing-upgrades-to-codex/

Key Capabilities / Improvements

  1. Agentic behavior
    GPT-5-Codex can take on long, complex, multi-step tasks more autonomously. It balances “interactive” sessions (short feedback loops) with “independent execution” (long refactors, tests, etc.).
  2. Steerability & style compliance
    Less need for developers to micro-specify style / hygiene. The model better understands high-level instructions (“do this”, “follow cleanliness guidelines”) without being told every detail each time.
  3. Code review improvements
    • Trained to catch critical bugs, not just surface or stylistic issues.
    • It examines the full context: codebase, dependencies, tests.
    • Can run code & tests to validate behavior.
    • Evaluated on pull requests and commits from popular open-source repositories; feedback from actual engineers confirms fewer “incorrect/unimportant” comments.
  4. Performance & efficiency
    • For small requests, the model is “snappier”.
    • For big tasks, it “thinks more”—spends more compute/time reasoning, editing, iterating.
    • On internal testing: bottom-10% of user turns (by tokens) use ~93.7% fewer tokens than vanilla GPT-5. Top-10% use roughly twice as much reasoning/iteration.
  5. Tooling & integration improvements
    • Codex CLI: better tracking of progress (to-do lists), ability to embed/share images (wireframes, screenshots), upgraded terminal UI, improved permission modes.
    • IDE Extension: works in VSCode, Cursor (and forks); maintains context of open files / selection; allows switching between cloud/local work seamlessly; preview local code changes directly.
    • Cloud environment enhancements:
      • Cached containers → median completion time for new tasks / follow-ups ↓ ~90%.
      • Automatic setup of environments (scanning for setup scripts, installing dependencies).
      • Configurable network access and ability to run pip installs etc. at runtime.
  6. Visual & front-end context
    The model now accepts image or screenshot inputs (e.g. UI designs or bugs) and can show visual output, e.g. screenshots of its work. Better human preference performance in mobile web / front-end tasks.
  7. Safety, trust, and deployment controls
    • Default sandboxed execution (network access disabled unless explicitly permitted).
    • Approval modes in tools: read-only vs auto access vs full access.
    • Support for reviewing agent work, terminal logs, test results.
    • Marked as “High capability” in Biological / Chemical domains; extra safeguards.

Use Cases & Scenarios

  • Large scale refactoring: changing architecture, propagating context (e.g. threading a variable through many modules) in multiple languages (Python, Go, OCaml) as demonstrated.
  • Feature additions with tests: generate new functionality and tests, fixing broken tests, handling test failures.
  • Continuous code reviews: PR review suggestions, catching regressions or security flaws earlier.
  • Front-end / UI design workflows: prototype or debug UI from specs/screenshots.
  • Hybrid workflows human + agent: human gives high-level instruction; Codex manages sub-tasks, dependencies, iteration.
https://openai.com/index/introducing-upgrades-to-codex/

Implications

  • For engineering teams: can shift more burden to Codex for repetitive / structurally heavy work (refactoring, test scaffolding), freeing human time for architectural decisions, design, etc.
  • For codebases: maintaining consistency in style, dependencies, test coverage could be easier since Codex consistently applies patterns.
  • For hiring / workflow: teams may need to adjust roles: reviewer focus may shift from “spotting minor errors” to oversight of agent suggestions.
  • Tool ecosystem: tighter IDE integrations mean workflows become more seamless; code reviews via bots may become more common & expected.
  • Risk management: organizations will need policy & audit controls for agentic code tasks, esp. for production-critical or high-security code.

Comparison: GPT-5 vs GPT-5-Codex

  • Autonomy on long tasks: GPT-5 (base) is less autonomous and more interactive / prompt-heavy; GPT-5-Codex sustains longer independent execution and iterative work.
  • Use in agentic coding environments: GPT-5 (base) is possible but not optimized; GPT-5-Codex is purpose-built and tuned for Codex workflows.
  • Steerability & instruction compliance: GPT-5 (base) requires more detailed directions; GPT-5-Codex adheres better to high-level style and code-quality instructions.
  • Efficiency (token usage, latency): GPT-5 (base) uses more tokens and passes and is slower on big tasks; GPT-5-Codex is more efficient on small tasks and spends extra reasoning only when needed.

Conclusion

GPT-5-Codex represents a meaningful step forward in AI-assisted software engineering. By optimizing for long tasks, autonomous work, and integrating deeply into developer workflows (CLI, IDE, cloud, code review), it offers tangible improvements in speed, quality, and efficiency. But it does not eliminate the need for expert oversight; safe usage requires policies, review loops, and understanding of the system’s limitations.



Read More
NVIDIA AI Open-Sources ViPE (Video Pose Engine): A Powerful and Versatile 3D Video Annotation Tool for Spatial AI


How do you create 3D datasets to train AI for robotics without expensive traditional approaches? A team of researchers from NVIDIA has released ViPE (Video Pose Engine), bringing a key improvement to Spatial AI. It addresses a central bottleneck that has constrained the field of 3D computer vision for years.

ViPE is a robust, versatile engine designed to process raw, unconstrained, “in-the-wild” video footage and automatically output the critical elements of 3D reality:

  • Camera Intrinsics (sensor calibration parameters)
  • Precise Camera Motion (pose)
  • Dense, Metric Depth Maps (real-world distances for every pixel)

To appreciate the magnitude of this breakthrough, we must first understand the difficulty of the problem it solves.

The challenge: Unlocking 3D Reality from 2D Video 

The ultimate goal of Spatial AI is to enable machines such as robots, autonomous vehicles, and AR glasses to perceive and interact with the world in 3D. We live in a 3D world, but the vast majority of our recorded data, from smartphone clips to cinematic footage, is trapped in 2D.

The Core Problem: How do we reliably and scalably reverse-engineer the 3D reality hidden inside these flat video streams?

Achieving this accurately from everyday video, which features shaky movements, dynamic objects, and unknown camera types, is notoriously difficult, yet it is the essential first step for virtually any advanced spatial application.

Problems with Existing Approaches

For decades, the field has been forced to choose between two powerful yet flawed paradigms.

1. The Precision Trap (Classical SLAM/SfM) 

Traditional methods like Simultaneous Localization and Mapping (SLAM) and Structure-from-Motion (SfM) rely on sophisticated geometric optimization. They are capable of pinpoint accuracy under ideal conditions.

The Fatal Flaw: Brittleness. These systems generally assume the world is static. Introduce a moving car, a textureless wall, or use an unknown camera, and the entire reconstruction can shatter. They are too delicate for the messy reality of everyday video.

2. The Scalability Wall (End-to-End Deep Learning) 

Recently, powerful end-to-end deep learning models have emerged. By training on vast datasets, they learn robust “priors” about the world and are impressively resilient to noise and dynamism.

The Fatal Flaw: Intractability. These models are computationally hungry. Their memory requirements explode as video length increases, making the processing of long videos practically impossible. They simply do not scale.

This deadlock created a dilemma. The future of advanced AI demands massive datasets annotated with perfect 3D geometry, but the tools required to generate that data were either too brittle or too slow to deploy at scale.

Meet ViPE: NVIDIA’s Hybrid Breakthrough Shatters the Mold 

This is where ViPE changes the game. It is not merely an incremental improvement; it is a well-designed and well-integrated hybrid pipeline that successfully fuses the best of both worlds. It takes the efficient, mathematically rigorous optimization framework of classical SLAM and injects it with the powerful, learned intuition of modern deep neural networks.

This synergy allows ViPE to be both robust and efficient, delivering a solution that scales without compromising on precision.

How it Works: Inside the ViPE Engine 

ViPE’s architecture uses a keyframe-based design for efficiency.

Here are the Key Innovations:

Key Innovation 1: A Synergy of Powerful Constraints

ViPE achieves its accuracy by balancing three critical inputs:

  • Dense Flow (Learned Robustness): Uses a learned optical flow network for robust correspondences between frames, even in tough conditions.
  • Sparse Tracks (Classical Precision): Incorporates high-resolution, traditional feature tracking to capture fine-grained details, drastically improving localization accuracy.
  • Metric Depth Regularization (Real-World Scale): ViPE integrates priors from state-of-the-art monocular depth models to produce results in true, real-world metric scale (a toy scale-and-shift alignment sketch follows this list).
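
As a toy illustration of the third constraint (not ViPE's actual implementation), aligning a relative depth map to a metric prior is commonly posed as a least-squares scale-and-shift fit:

import numpy as np

def align_to_metric(relative_depth, metric_prior, valid_mask):
    # Solve for scale s and shift t minimizing || s*relative + t - metric ||^2
    # over valid pixels, then apply them to the whole relative depth map.
    x = relative_depth[valid_mask].ravel()
    y = metric_prior[valid_mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * relative_depth + t

# Toy example: a synthetic relative map off by scale 2.5 and shift 0.3 metres.
rel = np.random.rand(64, 64)
metric = 2.5 * rel + 0.3
mask = np.ones_like(rel, dtype=bool)
aligned = align_to_metric(rel, metric, mask)
print(np.abs(aligned - metric).max())  # ~0: the fit recovers the metric scale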

Key Innovation 2: Mastering Dynamic, Real-World Scenes 

To handle the chaos of real-world video, ViPE employs foundation segmentation tools, GroundingDINO and Segment Anything (SAM), to identify and mask out moving objects (e.g., people, cars). By ignoring these dynamic regions, ViPE ensures the camera motion is calculated only from the static environment.

Key Innovation 3: Fast Speed & General Versatility 

ViPE operates at a remarkable 3-5 FPS on a single GPU, making it significantly faster than comparable methods. Furthermore, ViPE is universally applicable, supporting diverse camera models including standard, wide-angle/fisheye, and even 360° panoramic videos, automatically optimizing the intrinsics for each.

Key Innovation 4: High-Fidelity Depth Maps

The final output is enhanced by a sophisticated post-processing step. ViPE smoothly aligns high-detail depth maps with the geometrically consistent maps from its core process. The result is stunning: depth maps that are both high-fidelity and temporally stable.

The results hold up even on complex scenes.

Proven Performance

ViPE demonstrates superior performance, outperforming existing uncalibrated pose-estimation baselines by:

  • 18% on the TUM dataset (indoor dynamics)
  • 50% on the KITTI dataset (outdoor driving)

Crucially, the evaluations confirm that ViPE delivers consistent, real-world metric scale, while other approaches often produce inconsistent, unusable scales.

The Real Innovation: A Data Explosion for Spatial AI

The most significant contribution of this work is not just the engine itself, but its deployment as a large-scale data annotation factory to fuel the future of AI. The lack of massive, diverse, geometrically annotated video data has been the primary bottleneck for training robust 3D models. ViPE was built to remove that bottleneck.

The research team used ViPE to create and release datasets totaling approximately 96 million annotated frames:

  • Dynpose-100K++: Nearly 100,000 real-world internet videos (15.7M frames) with high-quality poses and dense geometry.
  • Wild-SDG-1M: A massive collection of 1 million high-quality, AI-generated videos (78M frames).
  • Web360: A specialized dataset of annotated panoramic videos.

This massive release provides the necessary fuel for the next generation of 3D geometric foundation models and is already proving instrumental in training advanced world generation models such as NVIDIA’s Cosmos.

By resolving the fundamental conflicts between accuracy, robustness, and scalability, ViPE provides the practical, efficient, and universal tool needed to unlock the 3D structure of almost any video. Its release is poised to dramatically accelerate innovation across the entire landscape of Spatial AI, robotics, and AR/VR.

NVIDIA AI has released the datasets publicly; links are provided below.

Sources / Links

Datasets:

  • https://huggingface.co/datasets/nvidia/vipe-dynpose-100kpp
  • https://huggingface.co/datasets/nvidia/vipe-wild-sdg-1m
  • https://huggingface.co/datasets/nvidia/vipe-web360
  • https://www.nvidia.com/en-us/ai/cosmos/

Thanks to the NVIDIA team for the thought leadership and resources behind this article. The NVIDIA team supported and sponsored this content.
