Building an End-to-End Object Tracking and Analytics System with Roboflow Supervision

In this advanced Roboflow Supervision tutorial, we build a complete object detection, tracking, and analytics pipeline with the Supervision library. We begin by setting up real-time object tracking with ByteTrack, adding detection smoothing, and defining polygon zones to monitor specific regions of a video stream. As we process the frames, we annotate them with bounding boxes, object IDs, and speed data, enabling us to track and analyze object behavior over time. Our goal is to show how detection, tracking, zone-based analytics, and visual annotation combine into a seamless and intelligent video analysis workflow. Check out the Full Codes here.

!pip install supervision ultralytics opencv-python
!pip install --upgrade supervision 


import cv2
import numpy as np
import supervision as sv
from ultralytics import YOLO
import matplotlib.pyplot as plt
from collections import defaultdict


model = YOLO('yolov8n.pt')

We start by installing the necessary packages, including Supervision, Ultralytics, and OpenCV. After ensuring we have the latest version of Supervision, we import all required libraries. We then initialize the YOLOv8n model, which serves as the core detector in our pipeline. Check out the Full Codes here.
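
As an optional sanity check that is not part of the original notebook, we can run the detector on a single image and inspect the result through Supervision's Detections API before moving on to video; the file name sample.jpg below is a placeholder for any image on disk, and the snippet assumes the imports and model initialization above have already been run.

image = cv2.imread('sample.jpg')  # hypothetical test image; replace with any image on disk
if image is None:
    print("sample.jpg not found - point cv2.imread at a real file to run this check")
else:
    results = model(image, verbose=False)[0]               # single YOLOv8 inference
    detections = sv.Detections.from_ultralytics(results)   # convert to Supervision Detections
    print(f"Detected {len(detections)} objects")
    print("Class IDs:", detections.class_id)
    print("Confidences:", detections.confidence)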

try:
   tracker = sv.ByteTrack()
except AttributeError:
   try:
       tracker = sv.ByteTracker()
   except AttributeError:
       print("Using basic tracking - install latest supervision for advanced tracking")
       tracker = None


try:
   smoother = sv.DetectionsSmoother(length=5)
except AttributeError:
   smoother = None
   print("DetectionsSmoother not available in this version")


try:
   box_annotator = sv.BoundingBoxAnnotator(thickness=2)
   label_annotator = sv.LabelAnnotator()
   if hasattr(sv, 'TraceAnnotator'):
       trace_annotator = sv.TraceAnnotator(thickness=2, trace_length=30)
   else:
       trace_annotator = None
except AttributeError:
   try:
       box_annotator = sv.BoxAnnotator(thickness=2)
       label_annotator = sv.LabelAnnotator()
       trace_annotator = None
   except AttributeError:
       print("Using basic annotators - some features may be limited")
       box_annotator = None
       label_annotator = None 
       trace_annotator = None


def create_zones(frame_shape):
   h, w = frame_shape[:2]
  
   try:
       entry_zone = sv.PolygonZone(
           polygon=np.array([[0, h//3], [w//3, h//3], [w//3, 2*h//3], [0, 2*h//3]]),
           frame_resolution_wh=(w, h)
       )
      
       exit_zone = sv.PolygonZone(
           polygon=np.array([[2*w//3, h//3], [w, h//3], [w, 2*h//3], [2*w//3, 2*h//3]]),
           frame_resolution_wh=(w, h)
       )
   except TypeError:
       entry_zone = sv.PolygonZone(
           polygon=np.array([[0, h//3], [w//3, h//3], [w//3, 2*h//3], [0, 2*h//3]])
       )
       exit_zone = sv.PolygonZone(
           polygon=np.array([[2*w//3, h//3], [w, h//3], [w, 2*h//3], [2*w//3, 2*h//3]])
       )
  
   return entry_zone, exit_zone

We set up essential components from the Supervision library, including object tracking with ByteTrack, optional smoothing using DetectionsSmoother, and flexible annotators for bounding boxes, labels, and traces. To ensure compatibility across versions, we use try-except blocks to fall back to alternative classes or basic functionality when needed. Additionally, we define dynamic polygon zones within the frame to monitor specific regions like entry and exit areas, enabling advanced spatial analytics. Check out the Full Codes here.
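
To see what the zones give us before wiring them into the full loop, the short sketch below (our own illustration, not part of the tutorial code) runs one frame through the detector and tracker and uses PolygonZone.trigger, which returns a boolean mask marking the detections that currently fall inside a zone. It assumes the model, tracker, and create_zones defined above; sample_frame.jpg is a placeholder for any image or video frame.

frame = cv2.imread('sample_frame.jpg')  # hypothetical frame; any BGR image works
if frame is not None:
    results = model(frame, verbose=False)[0]
    detections = sv.Detections.from_ultralytics(results)
    detections = detections[detections.class_id == 0]          # people only
    if tracker is not None:
        detections = tracker.update_with_detections(detections)

    entry_zone, exit_zone = create_zones(frame.shape)
    inside_entry = entry_zone.trigger(detections)               # boolean mask, one value per detection
    inside_exit = exit_zone.trigger(detections)
    print(f"People in entry zone: {int(np.sum(inside_entry))}")
    print(f"People in exit zone:  {int(np.sum(inside_exit))}")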

class AdvancedAnalytics:
   def __init__(self):
       self.track_history = defaultdict(list)
       self.zone_crossings = {"entry": 0, "exit": 0}
       self.speed_data = defaultdict(list)
      
   def update_tracking(self, detections):
       if hasattr(detections, 'tracker_id') and detections.tracker_id is not None:
           for i in range(len(detections)):
               track_id = detections.tracker_id[i]
               if track_id is not None:
                   bbox = detections.xyxy[i]
                   center = np.array([(bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2])
                   self.track_history[track_id].append(center)
                  
                   if len(self.track_history[track_id]) >= 2:
                       prev_pos = self.track_history[track_id][-2]
                       curr_pos = self.track_history[track_id][-1]
                       speed = np.linalg.norm(curr_pos - prev_pos)
                       self.speed_data[track_id].append(speed)
  
   def get_statistics(self):
       total_tracks = len(self.track_history)
       # Average the per-track mean speeds; guard the empty case to avoid a NaN warning
       per_track_means = [np.mean(speeds) for speeds in self.speed_data.values() if speeds]
       avg_speed = float(np.mean(per_track_means)) if per_track_means else 0.0
       return {
           "total_objects": total_tracks,
           "zone_entries": self.zone_crossings["entry"],
           "zone_exits": self.zone_crossings["exit"],
           "avg_speed": avg_speed
       }


def process_video(source=0, max_frames=300):
   """
   Process video source with advanced supervision features
   source: video path or 0 for webcam
   max_frames: limit processing for demo
   """
   cap = cv2.VideoCapture(source)
   analytics = AdvancedAnalytics()
  
   ret, frame = cap.read()
   if not ret:
       print("Failed to read video source")
       return
  
   entry_zone, exit_zone = create_zones(frame.shape)
  
   try:
       entry_zone_annotator = sv.PolygonZoneAnnotator(
           zone=entry_zone,
           color=sv.Color.GREEN,
           thickness=2
       )
       exit_zone_annotator = sv.PolygonZoneAnnotator(
           zone=exit_zone,
           color=sv.Color.RED,
           thickness=2
       )
   except (AttributeError, TypeError):
       entry_zone_annotator = sv.PolygonZoneAnnotator(zone=entry_zone)
       exit_zone_annotator = sv.PolygonZoneAnnotator(zone=exit_zone)
  
   frame_count = 0
   results_frames = []
  
   cap.set(cv2.CAP_PROP_POS_FRAMES, 0) 
  
   while ret and frame_count < max_frames:
       ret, frame = cap.read()
       if not ret:
           break
          
       results = model(frame, verbose=False)[0]
       detections = sv.Detections.from_ultralytics(results)
      
       # Keep only the 'person' class (COCO class id 0) so the analytics focus on people
       detections = detections[detections.class_id == 0]
      
       if tracker is not None:
           detections = tracker.update_with_detections(detections)
      
       if smoother is not None:
           detections = smoother.update_with_detections(detections)
      
       analytics.update_tracking(detections)
      
       # Count detections currently inside each zone (per-frame presence, not unique crossings)
       analytics.zone_crossings["entry"] += int(np.sum(entry_zone.trigger(detections)))
       analytics.zone_crossings["exit"] += int(np.sum(exit_zone.trigger(detections)))
      
       labels = []
       for i in range(len(detections)):
           confidence = detections.confidence[i] if detections.confidence is not None else 0.0
          
           if hasattr(detections, 'tracker_id') and detections.tracker_id is not None:
               track_id = detections.tracker_id[i]
               if track_id is not None:
                   speed = analytics.speed_data[track_id][-1] if analytics.speed_data[track_id] else 0
                   label = f"ID:{track_id} | Conf:{confidence:.2f} | Speed:{speed:.1f}"
               else:
                   label = f"Conf:{confidence:.2f}"
           else:
               label = f"Conf:{confidence:.2f}"
           labels.append(label)
      
       annotated_frame = frame.copy()
      
       annotated_frame = entry_zone_annotator.annotate(annotated_frame)
       annotated_frame = exit_zone_annotator.annotate(annotated_frame)
      
       if trace_annotator is not None:
           annotated_frame = trace_annotator.annotate(annotated_frame, detections)
      
       if box_annotator is not None:
           annotated_frame = box_annotator.annotate(annotated_frame, detections)
       else:
           for i in range(len(detections)):
               bbox = detections.xyxy[i].astype(int)
               cv2.rectangle(annotated_frame, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (0, 255, 0), 2)
      
       if label_annotator is not None:
           annotated_frame = label_annotator.annotate(annotated_frame, detections, labels)
       else:
           for i, label in enumerate(labels):
               if i < len(detections):
                   bbox = detections.xyxy[i].astype(int)
                   cv2.putText(annotated_frame, label, (bbox[0], bbox[1]-10),
                              cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)
      
       stats = analytics.get_statistics()
       y_offset = 30
       for key, value in stats.items():
           text = f"{key.replace('_', ' ').title()}: {value:.1f}"
           cv2.putText(annotated_frame, text, (10, y_offset),
                      cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
           y_offset += 30
      
       if frame_count % 30 == 0:
           results_frames.append(cv2.cvtColor(annotated_frame, cv2.COLOR_BGR2RGB))
      
       frame_count += 1
      
       if frame_count % 50 == 0:
           print(f"Processed {frame_count} frames...")
  
   cap.release()
  
   if results_frames:
       fig, axes = plt.subplots(2, 2, figsize=(15, 10))
       axes = axes.flatten()
      
       for i, (ax, frame) in enumerate(zip(axes, results_frames[:4])):
           ax.imshow(frame)
           ax.set_title(f"Frame {i*30}")
           ax.axis('off')
      
       plt.tight_layout()
       plt.show()
  
   final_stats = analytics.get_statistics()
   print("n=== FINAL ANALYTICS ===")
   for key, value in final_stats.items():
       print(f"{key.replace('_', ' ').title()}: {value:.2f}")
  
   return analytics


print("Starting advanced supervision demo...")
print("Features: Object detection, tracking, zones, speed analysis, smoothing")

We define the AdvancedAnalytics class to track object movement, calculate speed, and count zone crossings, enabling rich real-time video insights. Inside the process_video function, we read each frame from the video source and run it through our detection, tracking, and smoothing pipeline. We annotate frames with bounding boxes, labels, zone overlays, and live statistics, giving us a powerful, flexible system for object monitoring and spatial analytics. Throughout the loop, we also collect data for visualization and print final statistics, showcasing the effectiveness of Roboflow Supervision’s end-to-end capabilities. Check out the Full Codes here.
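
The speeds stored in speed_data are raw pixel displacements per frame. If we know the video frame rate and an approximate ground-plane scale, we can convert them into real-world units. The helper below is a minimal sketch of that conversion, not part of the tutorial code; the FPS and METERS_PER_PIXEL values are placeholder assumptions that depend entirely on the camera setup.

FPS = 20.0                # frames per second of the source video (assumed)
METERS_PER_PIXEL = 0.05   # ground-plane scale from calibration (assumed)

def pixels_per_frame_to_kmh(speed_px_per_frame):
    """Convert a per-frame pixel displacement into an approximate speed in km/h."""
    meters_per_second = speed_px_per_frame * METERS_PER_PIXEL * FPS
    return meters_per_second * 3.6

# Example: a track that moved 12 pixels between consecutive frames
print(f"{pixels_per_frame_to_kmh(12):.1f} km/h")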

def create_demo_video():
   """Create a simple demo video with moving objects"""
   fourcc = cv2.VideoWriter_fourcc(*'mp4v')
   out = cv2.VideoWriter('demo.mp4', fourcc, 20.0, (640, 480))
  
   for i in range(100):
       frame = np.zeros((480, 640, 3), dtype=np.uint8)
      
       x1 = int(50 + i * 2)
       y1 = 200
       x2 = int(100 + i * 1.5)
       y2 = 250
      
       cv2.rectangle(frame, (x1, y1), (x1+50, y1+50), (0, 255, 0), -1)
       cv2.rectangle(frame, (x2, y2), (x2+50, y2+50), (255, 0, 0), -1)
      
       out.write(frame)
  
   out.release()
   return 'demo.mp4'


demo_video = create_demo_video()
analytics = process_video(demo_video, max_frames=100)


print("nTutorial completed! Key features demonstrated:")
print("✓ YOLO integration with Supervision")
print("✓ Multi-object tracking with ByteTracker")
print("✓ Detection smoothing")
print("✓ Polygon zones for area monitoring")
print("✓ Advanced annotations (boxes, labels, traces)")
print("✓ Real-time analytics and statistics")
print("✓ Speed calculation and tracking history")

To test our full pipeline, we generate a synthetic demo video with two moving rectangles simulating tracked objects. Because the pre-trained YOLO model is unlikely to classify plain rectangles as people, this run mainly confirms that the detection, tracking, zone-monitoring, and speed-analysis stages execute end to end; for meaningful detections, point process_video at a real clip or a webcam. We then run the process_video function on the generated clip and, at the end, print a summary of all key features we have implemented, showcasing the power of Roboflow Supervision for real-time visual analytics.
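
If we want a shareable result instead of matplotlib snapshots, Supervision's video utilities can write every annotated frame to disk. The function below is an optional sketch that reuses the model, tracker, and box annotator defined earlier; it relies on sv.VideoInfo, sv.get_video_frames_generator, and sv.VideoSink, which are available in recent supervision releases but may differ in older ones.

def save_annotated_video(source_path, target_path='annotated.mp4'):
    """Run detection and tracking on source_path and write annotated frames to target_path."""
    video_info = sv.VideoInfo.from_video_path(source_path)
    with sv.VideoSink(target_path=target_path, video_info=video_info) as sink:
        for frame in sv.get_video_frames_generator(source_path=source_path):
            results = model(frame, verbose=False)[0]
            detections = sv.Detections.from_ultralytics(results)
            if tracker is not None:
                detections = tracker.update_with_detections(detections)
            annotated = frame.copy()
            if box_annotator is not None:
                annotated = box_annotator.annotate(annotated, detections)
            sink.write_frame(annotated)

# Usage: save_annotated_video('demo.mp4')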

In conclusion, we have successfully implemented a full pipeline that brings together object detection, tracking, zone monitoring, and real-time analytics. We demonstrate how to visualize key insights like object speed, zone crossings, and tracking history with annotated video frames. This setup empowers us to go beyond basic detection and build a smart surveillance or analytics system using open-source tools. Whether for research or production use, we now have a powerful foundation to expand upon with even more advanced capabilities.




MarkTechPost

New algorithms enable efficient machine learning with symmetric data

If you rotate an image of a molecular structure, a human can tell the rotated image is still the same molecule, but a machine-learning model might think it is a new data point. In computer science parlance, the molecule is “symmetric,” meaning the fundamental structure of that molecule remains the same if it undergoes certain transformations, like rotation.

If a drug discovery model doesn’t understand symmetry, it could make inaccurate predictions about molecular properties. But despite some empirical successes, it’s been unclear whether there is a computationally efficient method to train a good model that is guaranteed to respect symmetry.

A new study by MIT researchers answers this question, and shows the first method for machine learning with symmetry that is provably efficient in terms of both the amount of computation and data needed.

These results clarify a foundational question, and they could aid researchers in the development of more powerful machine-learning models that are designed to handle symmetry. Such models would be useful in a variety of applications, from discovering new materials to identifying astronomical anomalies to unraveling complex climate patterns.

“These symmetries are important because they are some sort of information that nature is telling us about the data, and we should take it into account in our machine-learning models. We’ve now shown that it is possible to do machine-learning with symmetric data in an efficient way,” says Behrooz Tahmasebi, an MIT graduate student and co-lead author of this study.

He is joined on the paper by co-lead author and MIT graduate student Ashkan Soleymani; Stefanie Jegelka, an associate professor of electrical engineering and computer science (EECS) and a member of the Institute for Data, Systems, and Society (IDSS) and the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Patrick Jaillet, the Dugald C. Jackson Professor of Electrical Engineering and Computer Science and a principal investigator in the Laboratory for Information and Decision Systems (LIDS). The research was recently presented at the International Conference on Machine Learning.

Studying symmetry

Symmetric data appear in many domains, especially the natural sciences and physics. A model that recognizes symmetries is able to identify an object, like a car, no matter where that object is placed in an image, for example.

Unless a machine-learning model is designed to handle symmetry, it could be less accurate and prone to failure when faced with new symmetric data in real-world situations. On the flip side, models that take advantage of symmetry could be faster and require fewer data for training.

But training a model to process symmetric data is no easy task.

One common approach is called data augmentation, where researchers transform each symmetric data point into multiple data points to help the model generalize better to new data. For instance, one could rotate a molecular structure many times to produce new training data, but if researchers want the model to be guaranteed to respect symmetry, this can be computationally prohibitive.
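
As a rough, generic illustration of that idea (not the MIT team's method), rotation-based augmentation can be as simple as duplicating each training example under several rotations so the model sees the symmetry explicitly; the toy Python sketch below does this for a small 2D point cloud standing in for atom coordinates.

import numpy as np

def augment_with_rotations(points, n_rotations=8):
    """Return rotated copies of an (N, 2) point cloud about the origin."""
    augmented = []
    for k in range(n_rotations):
        theta = 2 * np.pi * k / n_rotations
        rotation = np.array([[np.cos(theta), -np.sin(theta)],
                             [np.sin(theta),  np.cos(theta)]])
        augmented.append(points @ rotation.T)
    return augmented

molecule_2d = np.random.rand(5, 2)                 # toy stand-in for 2D atom coordinates
print(len(augment_with_rotations(molecule_2d)))    # 8 rotated copies of the same structure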

An alternative approach is to encode symmetry into the model’s architecture. A well-known example of this is a graph neural network (GNN), which inherently handles symmetric data because of how it is designed.

“Graph neural networks are fast and efficient, and they take care of symmetry quite well, but nobody really knows what these models are learning or why they work. Understanding GNNs is a main motivation of our work, so we started with a theoretical evaluation of what happens when data are symmetric,” Tahmasebi says.

They explored the statistical-computational tradeoff in machine learning with symmetric data. This tradeoff means methods that require fewer data can be more computationally expensive, so researchers need to find the right balance.

Building on this theoretical evaluation, the researchers designed an efficient algorithm for machine learning with symmetric data.

Mathematical combinations

To do this, they borrowed ideas from algebra to shrink and simplify the problem. Then, they reformulated the problem using ideas from geometry that effectively capture symmetry.

Finally, they combined the algebra and the geometry into an optimization problem that can be solved efficiently, resulting in their new algorithm.

“Most of the theory and applications were focusing on either algebra or geometry. Here we just combined them,” Tahmasebi says.

The algorithm requires fewer data samples for training than classical approaches, which would improve a model’s accuracy and ability to adapt to new applications.

By proving that scientists can develop efficient algorithms for machine learning with symmetry, and demonstrating how it can be done, these results could lead to the development of new neural network architectures that could be more accurate and less resource-intensive than current models.

Scientists could also use this analysis as a starting point to examine the inner workings of GNNs, and how their operations differ from the algorithm the MIT researchers developed.

“Once we know that better, we can design more interpretable, more robust, and more efficient neural network architectures,” adds Soleymani.

This research is funded, in part, by the National Research Foundation of Singapore, DSO National Laboratories of Singapore, the U.S. Office of Naval Research, the U.S. National Science Foundation, and an Alexander von Humboldt Professorship.

MIT News – Artificial intelligence

“FUTURE PHASES” showcases new frontiers in music technology and interactive performance

Music technology took center stage at MIT during “FUTURE PHASES,” an evening of works for string orchestra and electronics, presented by the MIT Music Technology and Computation Graduate Program as part of the 2025 International Computer Music Conference (ICMC). 

The well-attended event was held last month in the Thomas Tull Concert Hall within the new Edward and Joyce Linde Music Building. Produced in collaboration with the MIT Media Lab’s Opera of the Future Group and Boston’s self-conducted chamber orchestra A Far Cry, “FUTURE PHASES” was the first event to be presented by the MIT Music Technology and Computation Graduate Program in MIT Music’s new space.

“FUTURE PHASES” offerings included two new works by MIT composers: the world premiere of “EV6,” by MIT Music’s Kenan Sahin Distinguished Professor Evan Ziporyn and professor of the practice Eran Egozy; and the U.S. premiere of “FLOW Symphony,” by the MIT Media Lab’s Muriel R. Cooper Professor of Music and Media Tod Machover. Three additional works were selected by a jury from an open call for works: “The Wind Will Carry Us Away,” by Ali Balighi; “A Blank Page,” by Celeste Betancur Gutiérrez and Luna Valentin; and “Coastal Portrait: Cycles and Thresholds,” by Peter Lane. Each work was performed by Boston’s own multi-Grammy-nominated string orchestra, A Far Cry.

“The ICMC is all about presenting the latest research, compositions, and performances in electronic music,” says Egozy, director of the new Music Technology and Computation Graduate Program at MIT. When approached to be a part of this year’s conference, “it seemed the perfect opportunity to showcase MIT’s commitment to music technology, and in particular the exciting new areas being developed right now: a new master’s program in music technology and computation, the new Edward and Joyce Linde Music Building with its enhanced music technology facilities, and new faculty arriving at MIT with joint appointments between MIT Music and Theater Arts (MTA) and the Department of Electrical Engineering and Computer Science (EECS).” These recently hired professors include Anna Huang, a keynote speaker for the conference and creator of the machine learning model Coconet that powered Google’s first AI Doodle, the Bach Doodle.

Egozy emphasizes the uniqueness of this occasion: “You have to understand that this is a very special situation. Having a full 18-member string orchestra [A Far Cry] perform new works that include electronics does not happen very often. In most cases, ICMC performances consist either entirely of electronics and computer-generated music, or perhaps a small ensemble of two-to-four musicians. So the opportunity we could present to the larger community of music technology was particularly exciting.”

To take advantage of this exciting opportunity, an open call was put out internationally to select the other pieces that would accompany Ziporyn and Egozy’s “EV6” and Machover’s “FLOW Symphony.” Three pieces were selected from a total of 46 entries to be a part of the evening’s program by a panel of judges that included Egozy, Machover, and other distinguished composers and technologists.

“We received a huge variety of works from this call,” says Egozy. “We saw all kinds of musical styles and ways that electronics would be used. No two pieces were very similar to each other, and I think because of that, our audience got a sense of how varied and interesting a concert can be for this format. A Far Cry was really the unifying presence. They played all pieces with great passion and nuance. They have a way of really drawing audiences into the music. And, of course, with the Thomas Tull Concert Hall being in the round, the audience felt even more connected to the music.”

Egozy continues, “we took advantage of the technology built into the Thomas Tull Concert Hall, which has 24 built-in speakers for surround sound allowing us to broadcast unique, amplified sound to every seat in the house. Chances are that every person might have experienced the sound slightly differently, but there was always some sense of a multidimensional evolution of sound and music as the pieces unfolded.”

The five works of the evening employed a range of technological components that included playing synthesized, prerecorded, or electronically manipulated sounds; attaching microphones to instruments for use in real-time signal processing algorithms; broadcasting custom-generated musical notation to the musicians; utilizing generative AI to process live sound and play it back in interesting and unpredictable ways; and audience participation, where spectators use their cellphones as musical instruments to become a part of the ensemble.

Ziporyn and Egozy’s piece, “EV6,” took particular advantage of this last innovation: “Evan and I had previously collaborated on a system called Tutti, which means ‘together’ in Italian. Tutti gives an audience the ability to use their smartphones as musical instruments so that we can all play together.” Egozy developed the technology, which was first used in the MIT Campaign for a Better World in 2017. The original application involved a three-minute piece for cellphones only. “But for this concert,” Egozy explains, “Evan had the idea that we could use the same technology to write a new piece — this time, for audience phones and a live string orchestra as well.”

To explain the piece’s title, Ziporyn says, “I drive an EV6; it’s my first electric car, and when I first got it, it felt like I was driving an iPhone. But of course it’s still just a car: it’s got wheels and an engine, and it gets me from one place to another. It seemed like a good metaphor for this piece, in which a lot of the sound is literally played on cellphones, but still has to work like any other piece of music. It’s also a bit of an homage to David Bowie’s song ‘TVC 15,’ which is about falling in love with a robot.”

Egozy adds, “We wanted audience members to feel what it is like to play together in an orchestra. Through this technology, each audience member becomes a part of an orchestral section (winds, brass, strings, etc.). As they play together, they can hear their whole section playing similar music while also hearing other sections in different parts of the hall play different music. This allows an audience to feel a responsibility to their section, hear how music can move between different sections of an orchestra, and experience the thrill of live performance. In ‘EV6,’ this experience was even more electrifying because everyone in the audience got to play with a live string orchestra — perhaps for the first time in recorded history.”

After the concert, guests were treated to six music technology demonstrations that showcased the research of undergraduate and graduate students from both the MIT Music program and the MIT Media Lab. These included a gamified interface for harnessing just intonation systems (Antonis Christou); insights from a human-AI co-created concert (Lancelot Blanchard and Perry Naseck); a system for analyzing piano playing data across campus (Ayyub Abdulrezak ’24, MEng ’25); capturing music features from audio using latent frequency-masked autoencoders (Mason Wang); a device that turns any surface into a drum machine (Matthew Caren ’25); and a play-along interface for learning traditional Senegalese rhythms (Mariano Salcedo ’25). This last example led to the creation of Senegroove, a drumming-based application specifically designed for an upcoming edX online course taught by ethnomusicologist and MIT associate professor in music Patricia Tang, and world-renowned Senegalese drummer and MIT lecturer in music Lamine Touré, who provided performance videos of the foundational rhythms used in the system.

Ultimately, Egozy muses, “’FUTURE PHASES’ showed how having the right space — in this case, the new Edward and Joyce Linde Music Building — really can be a driving force for new ways of thinking, new projects, and new ways of collaborating. My hope is that everyone in the MIT community, the Boston area, and beyond soon discovers what a truly amazing place and space we have built, and are still building here, for music and music technology at MIT.”

MIT News – Artificial intelligence

Robot, know thyself: New vision-based system teaches machines to understand their bodies

In an office at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), a soft robotic hand carefully curls its fingers to grasp a small object. The intriguing part isn’t the mechanical design or embedded sensors — in fact, the hand contains none. Instead, the entire system relies on a single camera that watches the robot’s movements and uses that visual data to control it.

This capability comes from a new system CSAIL scientists developed, offering a different perspective on robotic control. Rather than using hand-designed models or complex sensor arrays, it allows robots to learn how their bodies respond to control commands, solely through vision. The approach, called Neural Jacobian Fields (NJF), gives robots a kind of bodily self-awareness. An open-access paper about the work was published in Nature on June 25.

“This work points to a shift from programming robots to teaching robots,” says Sizhe Lester Li, MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and lead researcher on the work. “Today, many robotics tasks require extensive engineering and coding. In the future, we envision showing a robot what to do, and letting it learn how to achieve the goal autonomously.”

The motivation stems from a simple but powerful reframing: The main barrier to affordable, flexible robotics isn’t hardware — it’s control of capability, which could be achieved in multiple ways. Traditional robots are built to be rigid and sensor-rich, making it easier to construct a digital twin, a precise mathematical replica used for control. But when a robot is soft, deformable, or irregularly shaped, those assumptions fall apart. Rather than forcing robots to match our models, NJF flips the script — giving robots the ability to learn their own internal model from observation.

Look and learn

This decoupling of modeling and hardware design could significantly expand the design space for robotics. In soft and bio-inspired robots, designers often embed sensors or reinforce parts of the structure just to make modeling feasible. NJF lifts that constraint. The system doesn’t need onboard sensors or design tweaks to make control possible. Designers are freer to explore unconventional, unconstrained morphologies without worrying about whether they’ll be able to model or control them later.

“Think about how you learn to control your fingers: you wiggle, you observe, you adapt,” says Li. “That’s what our system does. It experiments with random actions and figures out which controls move which parts of the robot.”

The system has proven robust across a range of robot types. The team tested NJF on a pneumatic soft robotic hand capable of pinching and grasping, a rigid Allegro hand, a 3D-printed robotic arm, and even a rotating platform with no embedded sensors. In every case, the system learned both the robot’s shape and how it responded to control signals, just from vision and random motion.

The researchers see potential far beyond the lab. Robots equipped with NJF could one day perform agricultural tasks with centimeter-level localization accuracy, operate on construction sites without elaborate sensor arrays, or navigate dynamic environments where traditional methods break down.

At the core of NJF is a neural network that captures two intertwined aspects of a robot’s embodiment: its three-dimensional geometry and its sensitivity to control inputs. The system builds on neural radiance fields (NeRF), a technique that reconstructs 3D scenes from images by mapping spatial coordinates to color and density values. NJF extends this approach by learning not only the robot’s shape, but also a Jacobian field, a function that predicts how any point on the robot’s body moves in response to motor commands.
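
As a toy numerical analogy (not the NJF system itself), a Jacobian in this sense is simply the matrix of sensitivities of a point's position to each control input, and it can be estimated by finite differences against any forward model; the Python sketch below uses an invented two-control stand-in to make the idea concrete.

import numpy as np

def point_position(controls):
    """Stand-in forward model: maps two control inputs to a tracked point's (x, y)."""
    u1, u2 = controls
    return np.array([0.5 * u1 + 0.1 * u2, 0.3 * u2])

def estimate_jacobian(forward, controls, eps=1e-4):
    """Finite-difference estimate of d(position)/d(controls)."""
    base = forward(controls)
    jacobian = np.zeros((base.size, len(controls)))
    for j in range(len(controls)):
        perturbed = np.array(controls, dtype=float)
        perturbed[j] += eps
        jacobian[:, j] = (forward(perturbed) - base) / eps
    return jacobian

print(estimate_jacobian(point_position, [0.0, 0.0]))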

To train the model, the robot performs random motions while multiple cameras record the outcomes. No human supervision or prior knowledge of the robot’s structure is required — the system simply infers the relationship between control signals and motion by watching.

Once training is complete, the robot only needs a single monocular camera for real-time closed-loop control, running at about 12 Hertz. This allows it to continuously observe itself, plan, and act responsively. That speed makes NJF more viable than many physics-based simulators for soft robots, which are often too computationally intensive for real-time use.

In early simulations, even simple 2D fingers and sliders were able to learn this mapping using just a few examples. By modeling how specific points deform or shift in response to action, NJF builds a dense map of controllability. That internal model allows it to generalize motion across the robot’s body, even when the data are noisy or incomplete.

“What’s really interesting is that the system figures out on its own which motors control which parts of the robot,” says Li. “This isn’t programmed — it emerges naturally through learning, much like a person discovering the buttons on a new device.”

The future is soft

For decades, robotics has favored rigid, easily modeled machines — like the industrial arms found in factories — because their properties simplify control. But the field has been moving toward soft, bio-inspired robots that can adapt to the real world more fluidly. The trade-off? These robots are harder to model.

“Robotics today often feels out of reach because of costly sensors and complex programming. Our goal with Neural Jacobian Fields is to lower the barrier, making robotics affordable, adaptable, and accessible to more people. Vision is a resilient, reliable sensor,” says senior author and MIT Assistant Professor Vincent Sitzmann, who leads the Scene Representation group. “It opens the door to robots that can operate in messy, unstructured environments, from farms to construction sites, without expensive infrastructure.”

“Vision alone can provide the cues needed for localization and control — eliminating the need for GPS, external tracking systems, or complex onboard sensors. This opens the door to robust, adaptive behavior in unstructured environments, from drones navigating indoors or underground without maps to mobile manipulators working in cluttered homes or warehouses, and even legged robots traversing uneven terrain,” says co-author Daniela Rus, MIT professor of electrical engineering and computer science and director of CSAIL. “By learning from visual feedback, these systems develop internal models of their own motion and dynamics, enabling flexible, self-supervised operation where traditional localization methods would fail.”

While training NJF currently requires multiple cameras and must be redone for each robot, the researchers are already imagining a more accessible version. In the future, hobbyists could record a robot’s random movements with their phone, much like you’d take a video of a rental car before driving off, and use that footage to create a control model, with no prior knowledge or special equipment required.

The system doesn’t yet generalize across different robots, and it lacks force or tactile sensing, limiting its effectiveness on contact-rich tasks. But the team is exploring new ways to address these limitations: improving generalization, handling occlusions, and extending the model’s ability to reason over longer spatial and temporal horizons.

“Just as humans develop an intuitive understanding of how their bodies move and respond to commands, NJF gives robots that kind of embodied self-awareness through vision alone,” says Li. “This understanding is a foundation for flexible manipulation and control in real-world environments. Our work, essentially, reflects a broader trend in robotics: moving away from manually programming detailed models toward teaching robots through observation and interaction.”

This paper brought together the computer vision and self-supervised learning work from the Sitzmann lab and the expertise in soft robots from the Rus lab. Li, Sitzmann, and Rus co-authored the paper with CSAIL affiliates Annan Zhang SM ’22, a PhD student in electrical engineering and computer science (EECS); Boyuan Chen, a PhD student in EECS; Hanna Matusik, an undergraduate researcher in mechanical engineering; and Chao Liu, a postdoc in the Senseable City Lab at MIT. 

The research was supported by the Solomon Buchsbaum Research Fund through MIT’s Research Support Committee, an MIT Presidential Fellowship, the National Science Foundation, and the Gwangju Institute of Science and Technology.

MIT News – Artificial intelligence

MIT tool visualizes and edits “physically impossible” objects

M.C. Escher’s artwork is a gateway into a world of depth-defying optical illusions, featuring “impossible objects” that break the laws of physics with convoluted geometries. What you perceive his illustrations to be depends on your point of view — for example, a person seemingly walking upstairs may be heading down the steps if you tilt your head sideways.

Computer graphics scientists and designers can recreate these illusions in 3D, but only by bending or cutting a real shape and positioning it at a particular angle. This workaround has downsides, though: Changing the smoothness or lighting of the structure will expose that it isn’t actually an optical illusion, which also means you can’t accurately solve geometry problems on it.

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a unique approach to represent “impossible” objects in a more versatile way. Their “Meschers” tool converts images and 3D models into 2.5-dimensional structures, creating Escher-like depictions of things like windows, buildings, and even donuts. The approach helps users relight, smooth out, and study unique geometries while preserving their optical illusion.

This tool could assist geometry researchers with calculating the distance between two points on a curved impossible surface (“geodesics”) and simulating how heat dissipates over it (“heat diffusion”). It could also help artists and computer graphics scientists create physics-breaking designs in multiple dimensions.

Lead author and MIT PhD student Ana Dodik aims to design computer graphics tools that aren’t limited to replicating reality, enabling artists to express their intent independently of whether a shape can be realized in the physical world. “Using Meschers, we’ve unlocked a new class of shapes for artists to work with on the computer,” she says. “They could also help perception scientists understand the point at which an object truly becomes impossible.”

Dodik and her colleagues will present their paper at the SIGGRAPH conference in August.

Making impossible objects possible

Impossible objects can’t be fully replicated in 3D. Their constituent parts often look plausible, but these parts don’t glue together properly when assembled in 3D. But what can be computationally imitated, as the CSAIL researchers found out, is the process of how we perceive these shapes.

Take the Penrose Triangle, for instance. The object as a whole is physically impossible because the depths don’t “add up,” but we can recognize real-world 3D shapes (like its three L-shaped corners) within it. These smaller regions can be realized in 3D — a property called “local consistency” — but when we try to assemble them together, they don’t form a globally consistent shape.

The Meschers approach models’ locally consistent regions without forcing them to be globally consistent, piecing together an Escher-esque structure. Behind the scenes, Meschers represents impossible objects as if we know their x and y coordinates in the image, as well as differences in z coordinates (depth) between neighboring pixels; the tool uses these differences in depth to reason about impossible objects indirectly.

The many uses of Meschers

In addition to rendering impossible objects, Meschers can subdivide their structures into smaller shapes for more precise geometry calculations and smoothing operations. This process enabled the researchers to reduce visual imperfections of impossible shapes, such as a red heart outline they thinned out.

The researchers also tested their tool on an “impossibagel,” where a bagel is shaded in a physically impossible way. Meschers helped Dodik and her colleagues simulate heat diffusion and calculate geodesic distances between different points of the model.

“Imagine you’re an ant traversing this bagel, and you want to know how long it’ll take you to get across, for example,” says Dodik. “In the same way, our tool could help mathematicians analyze the underlying geometry of impossible shapes up close, much like how we study real-world ones.”

Much like a magician, the tool can create optical illusions out of otherwise practical objects, making it easier for computer graphics artists to create impossible objects. It can also use “inverse rendering” tools to convert drawings and images of impossible objects into high-dimensional designs. 

“Meschers demonstrates how computer graphics tools don’t have to be constrained by the rules of physical reality,” says senior author Justin Solomon, associate professor of electrical engineering and computer science and leader of the CSAIL Geometric Data Processing Group. “Incredibly, artists using Meschers can reason about shapes that we will never find in the real world.”

Meschers can also aid computer graphics artists with tweaking the shading of their creations, while still preserving an optical illusion. This versatility would allow creatives to change the lighting of their art to depict a wider variety of scenes (like a sunrise or sunset) — as Meschers demonstrated by relighting a model of a dog on a skateboard.

Despite its versatility, Meschers is just the start for Dodik and her colleagues. The team is considering designing an interface to make the tool easier to use while building more elaborate scenes. They’re also working with perception scientists to see how the computer graphics tool can be used more broadly.

Dodik and Solomon wrote the paper with CSAIL affiliates Isabella Yu ’24, SM ’25; PhD student Kartik Chandra SM ’23; MIT professors Jonathan Ragan-Kelley and Joshua Tenenbaum; and MIT Assistant Professor Vincent Sitzmann. 

Their work was supported, in part, by the MIT Presidential Fellowship, the Mathworks Fellowship, the Hertz Foundation, the U.S. National Science Foundation, the Schmidt Sciences AI2050 fellowship, MIT Quest for Intelligence, the U.S. Army Research Office, U.S. Air Force Office of Scientific Research, SystemsThatLearn@CSAIL initiative, Google, the MIT–IBM Watson AI Laboratory, from the Toyota–CSAIL Joint Research Center, Adobe Systems, the Singapore Defence Science and Technology Agency, and the U.S. Intelligence Advanced Research Projects Activity.

MIT News – Artificial intelligence

Now It’s Claude’s World: How Anthropic Overtook OpenAI in the Enterprise AI Race

The tides have turned in the enterprise AI landscape. According to Menlo Ventures’ 2025 “Mid-Year LLM Market Update,” Anthropic’s Claude has overtaken OpenAI as the leading language model provider for enterprise, now capturing 32% of market share compared to OpenAI’s 25%—a dramatic reversal from OpenAI’s dominant 50% share just one year ago. This is more than a leaderboard shuffle: it’s a testament to the maturation of enterprise AI and a signal for what businesses truly value in this next phase.

Anthropic’s Strategic Acceleration

Anthropic has charted a meteoric rise, catapulting revenues from $1B to $4B in just six months—largely on the strength of enterprise adoption by discerning, high-value customers. Rather than chasing ubiquity, Anthropic doubled down on the complex needs of large organizations, focusing on areas where AI adoption is not a curiosity but a necessity. With robust logic, structured reasoning, and rigorous regulatory compliance, Claude has become the preferred partner for industries where stakes are highest and trust is non-negotiable.

These efforts are evident in the suite of enterprise-tailored features that Claude now offers: advanced data privacy, granular user management, seamless integration with legacy IT, and sector-specific governance controls. The result? Anthropic’s dominance in code generation, where it now commands a remarkable 42% of the category—twice that of its nearest rival.

Why Enterprise Buyers Are Changing Course

The days when AI adoption decisions were swayed by splashy benchmarks or marginal gains in test scores are behind us. The Menlo Ventures report makes clear that, in 2025, enterprises are investing in outcomes, not outputs. They seek models that don’t merely process language, but power complex workflows, comply with stringent regulations, and snap natively into their existing digital fabric.

Enterprise leaders now prioritize:

  • Code generation tools to fuel innovation and productivity—now a $1.9B market and steadily rising;
  • Agent-first architectures that enable autonomous, business-aware solutions;
  • Production-grade inference that moves AI from experimentation to mission-critical workloads;
  • Seamless integration with enterprise systems and data, rather than siloed “chatbots.”

The Paradox of Scale: Plummeting Costs, Surging Spend

Since 2022, model costs have plummeted a spectacular 280-fold, yet enterprise AI spending has never been higher. Investment is exploding at a 44% annual pace, headed toward $371B globally in 2025, driven by wide-scale deployment and real-world impact—not just experiments in the lab.

Why the paradox? Enterprises are no longer buying tokens; they are investing in transformation. They pay, and pay handsomely, for platforms that can be molded to their unique needs, that offer trust and compliance, and that promise lasting operational lift.

Model Parity, Workflow Primacy

With model performance now at near parity between Claude and OpenAI, the competitive edge has shifted decisively toward reliability, governance, and successful enterprise integration—not tiny improvements in general intelligence.

Image source: Marktechpost.com

The Road Ahead: Where Enterprise AI Will Win

As the Menlo report affirms, forward-thinking leaders must now orient their teams toward:

  • Advanced code generation with demonstrable business value;
  • Autonomous agent frameworks that embed AI deeply into workflow;
  • Optimization for live, always-on production inference;
  • Relentless focus on integration and compliance across the entire enterprise stack.

The New Playbook for Enterprise AI

The AI race is no longer about having the largest, fastest, or cheapest model—it’s about trust, results, and partnership. Anthropic’s rapid ascent proves that understanding and serving enterprise needs is the true competitive differentiator. In an era of technological parity, the winner will be the one who best translates model capabilities into business transformation, system-level integration, and operational trust.

As enterprise AI budgets continue to swell, the crown will belong not to the loudest innovator, but to the one that delivers quantifiable value at scale. In 2025, Anthropic wears that crown.


Sources:

  1. https://www.linkedin.com/posts/matt-murphy-0415543_2025-mid-year-llm-market-update-foundation-activity-7356682316062056448-ZBNN
  2. https://www.cnbc.com/2025/05/30/anthropic-hits-3-billion-in-annualized-revenue-on-business-demand-for-ai.html
  3. https://beginswithai.com/claude-for-enterprise/
  4. https://www.emarketer.com/content/anthropic-s-claude-enterprise-takes-on-openai-with-business-focused-ai-capabilities
  5. https://menlovc.com/perspective/2025-mid-year-llm-market-update/
  6. https://explodingtopics.com/blog/ai-statistics
  7. https://www.wsj.com/tech/ai/tech-ai-spending-company-valuations-7b92104b

MarkTechPost