A Coding Guide to Design a Complete Agentic Workflow in Gemini for Automated Medical Evidence Gathering and Prior Authorization Submission

 

In this tutorial, we devise how to orchestrate a fully functional, tool-using medical prior-authorization agent powered by Gemini. We walk through each component step by step, from securely configuring the model to building realistic external tools and finally constructing an intelligent agent loop that reasons, acts, and responds entirely through structured JSON. As we progress, we see how the system thinks, retrieves evidence, and interacts with simulated medical systems to complete a complex workflow. Check out the .

!pip install -q -U google-generative-ai


import google.generativeai as genai
from google.colab import userdata
import os
import getpass
import json
import time


try:
   GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
except:
   print("Please enter your Google API Key:")
   GOOGLE_API_KEY = getpass.getpass("API Key: ")


genai.configure(api_key=GOOGLE_API_KEY)


print("n🔍 Scanning for available models...")
available_models = [m.name for m in genai.list_models()]
target_model = ""


if 'models/gemini-1.5-flash' in available_models:
   target_model = 'gemini-1.5-flash'
elif 'models/gemini-1.5-flash-001' in available_models:
   target_model = 'gemini-1.5-flash-001'
elif 'models/gemini-pro' in available_models:
   target_model = 'gemini-pro'
else:
   for m in available_models:
       if 'generateContent' in genai.get_model(m).supported_generation_methods:
           target_model = m
           break


if not target_model:
   raise ValueError("❌ No text generation models found for this API key.")


print(f"✅ Selected Model: {target_model}")
model = genai.GenerativeModel(target_model)

We set up our environment and automatically detect the best available Gemini model. We configure the API key securely and let the system choose the most capable model without hardcoding anything. This ensures that we start the tutorial with a clean, flexible, and reliable foundation. Check out the .

class MedicalTools:
   def __init__(self):
       self.ehr_docs = [
           "Patient: John Doe | DOB: 1980-05-12",
           "Visit 2023-01-10: Diagnosed with Type 2 Diabetes. Prescribed Metformin.",
           "Visit 2023-04-15: Patient reports severe GI distress with Metformin. Discontinued.",
           "Visit 2023-04-20: BMI recorded at 32.5. A1C is 8.4%.",
           "Visit 2023-05-01: Doctor recommends starting Ozempic (Semaglutide)."
       ]


   def search_ehr(self, query):
       print(f"   🔎 [Tool] Searching EHR for: '{query}'...")
       results = [doc for doc in self.ehr_docs if any(q.lower() in doc.lower() for q in query.split())]
       if not results:
           return "No records found."
       return "n".join(results)


   def submit_prior_auth(self, drug_name, justification):
       print(f"   📤 [Tool] Submitting claim for {drug_name}...")
       justification_lower = justification.lower()
       if "metformin" in justification_lower and ("discontinued" in justification_lower or "intolerance" in justification_lower):
           if "bmi" in justification_lower and "32" in justification_lower:
               return "SUCCESS: Authorization Approved. Auth ID: #998877"
       return "DENIED: Policy requires proof of (1) Metformin failure and (2) BMI > 30."

We define the medical tools that our agent can use during the workflow. We simulate an EHR search and a prior-authorization submission system so the agent has real actions to perform. By doing this, we ground the agent’s reasoning in tool-enabled interactions rather than plain text generation. Check out the .

class AgenticSystem:
   def __init__(self, model, tools):
       self.model = model
       self.tools = tools
       self.history = []
       self.max_steps = 6
      
       self.system_prompt = """
       You are an expert Medical Prior Authorization Agent.
       Your goal is to get approval for a medical procedure/drug.
      
       You have access to these tools:
       1. search_ehr(query)
       2. submit_prior_auth(drug_name, justification)


       RULES:
       1. ALWAYS think before you act.
       2. You MUST output your response in STRICT JSON format:
          {
            "thought": "Your reasoning here",
            "action": "tool_name_or_finish",
            "action_input": "argument_string_or_dict"
          }
       3. Do not guess patient data. Use 'search_ehr'.
       4. If you have the evidence, use 'submit_prior_auth'.
       5. If the task is done, use action "finish".
       """

We initialize the agent and provide its full system prompt. We define the rules, the JSON response format, and the expectation that the agent must think before acting. This gives us a controlled, deterministic structure for building a safe and traceable agent loop. Check out the .

 def execute_tool(self, action_name, action_input):
       if action_name == "search_ehr":
           return self.tools.search_ehr(action_input)
       elif action_name == "submit_prior_auth":
           if isinstance(action_input, str):
               return "Error: submit_prior_auth requires a dictionary."
           return self.tools.submit_prior_auth(**action_input)
       else:
           return "Error: Unknown tool."


   def run(self, objective):
       print(f"🤖 AGENT STARTING. Objective: {objective}n" + "-"*50)
       self.history.append(f"User: {objective}")


       for i in range(self.max_steps):
           print(f"n🔄 STEP {i+1}")
           prompt = self.system_prompt + "nnHistory:n" + "n".join(self.history) + "nnNext JSON:"
          
           try:
               response = self.model.generate_content(prompt)
               text_response = response.text.strip().replace("```json", "").replace("```", "")
               agent_decision = json.loads(text_response)
           except Exception as e:
               print(f"   ⚠ Error parsing AI response. Retrying... ({e})")
               continue


           print(f"   🧠 THOUGHT: {agent_decision['thought']}")
           print(f"   👉 ACTION: {agent_decision['action']}")


           if agent_decision['action'] == "finish":
               print(f"n✅ TASK COMPLETED: {agent_decision['action_input']}")
               break
          
           tool_result = self.execute_tool(agent_decision['action'], agent_decision['action_input'])
           print(f"   👁 OBSERVATION: {tool_result}")


           self.history.append(f"Assistant: {text_response}")
           self.history.append(f"System: {tool_result}")
          
           if "SUCCESS" in str(tool_result):
               print("n🎉 SUCCESS! The Agent successfully navigated the insurance portal.")
               break

We implement the core agent loop where reasoning, tool execution, and observations happen step by step. We watch the agent decide its next action, execute tools, update history, and evaluate success conditions. This is where the agent truly comes alive and performs iterative reasoning. Check out the .

tools_instance = MedicalTools()
agent = AgenticSystem(model, tools_instance)
agent.run("Please get prior authorization for Ozempic for patient John Doe.")

We instantiate the tools and agent, then run the entire system end-to-end with a real objective. We see the full workflow unfold as the agent navigates through medical history, validates evidence, and attempts prior authorization. This final snippet demonstrates the complete pipeline working seamlessly.

In conclusion, we reflect on how this compact yet powerful framework enables us to design real-world agentic behaviors that go beyond simple text responses. We watch our agent plan, consult tools, gather evidence, and ultimately complete a structured insurance authorization task, entirely through autonomous reasoning. It provides confidence that we can now expand the system with additional tools, stronger policies, domain-specific logic, or even multi-agent collaboration.


Check out the . Feel free to check out our . Also, feel free to follow us on  and don’t forget to join our  and Subscribe to .

The post appeared first on .

Read More
Mistral AI Releases OCR 3: A Smaller Optical Character Recognition (OCR) Model for Structured Document AI at Scale

Mistral AI Releases OCR 3: A Smaller Optical Character Recognition (OCR) Model for Structured Document AI at Scale

 

Mistral AI has released Mistral OCR 3, its latest optical character recognition service that powers the company’s Document AI stack. The model, named as mistral-ocr-2512, is built to extract interleaved text and images from PDFs and other documents while preserving structure, and it does this at an aggressive price of $2 per 1,000 pages with a 50% discount when used through the Batch API.

What Mistral OCR 3 is Optimized for?

Mistral OCR 3 targets typical enterprise document workloads. The model is tuned for forms, scanned documents, complex tables, and handwriting. It is evaluated on internal benchmarks drawn from real business use cases, where it achieves a 74% overall win rate over Mistral OCR 2 across these document categories using a fuzzy match metric against ground truth.

The model outputs markdown that preserves document layout, and when table formatting is enabled, it enriches the output with HTML based table representations. This combination gives downstream systems both the content and the structural information that is needed for retrieval pipelines, analytics, and agent workflows.

Role in Mistral Document AI

OCR 3 sits inside Mistral Document AI, the company’s document processing capability that combines OCR with structured data extraction and Document QnA.

It now powers the Document AI Playground in Mistral AI Studio. In this interface, users upload PDFs or images and get back either clean text or structured JSON without writing code. The same underlying OCR pipeline is accessible via the public API, which allows teams to move from interactive exploration to production workloads without changing the core model.

Inputs, Outputs, And Structure

The OCR processor accepts multiple document formats through a single API. The document field can point to:

  • document_url for PDFs, pptx, docx and more
  • image_url for image types such as png, jpeg or avif
  • Uploaded or base64 encoded PDFs or images through the same schema

This is documented in the OCR Processor section of Mistral’s Document AI docs.

The response is a JSON object with a pages array. Each page contains an index, a markdown string, a list of images, a list of tables when table_format="html" is used, detected hyperlinks, optional header and footer fields when header or footer extraction is enabled, and a dimensions object with page size. There is also a document_annotation field for structured annotations and a usage_info block for accounting information.

When images and HTML tables are extracted, the markdown includes placeholders such as ![img-0.jpeg](img-0.jpeg) and [tbl-3.html](tbl-3.html). These placeholders are mapped back to actual content using the images and tables arrays in the response, which simplifies downstream reconstruction.

Upgrades Over Mistral OCR 2

Mistral OCR 3 introduces several concrete upgrades relative to OCR 2. The public release notes emphasize four main areas.

  • Handwriting Mistral OCR 3 more accurately interprets cursive, mixed content annotations, and handwritten text placed on top of printed templates.
  • Forms It improves detection of boxes, labels, and handwritten entries in dense layouts such as invoices, receipts, compliance forms, and government documents.
  • Scanned and complex documents The model is more robust to compression artifacts, skew, distortion, low DPI, and background noise in scanned pages.
  • Complex tables It reconstructs table structures with headers, merged cells, multi row blocks, and column hierarchies, and it can return HTML tables with proper colspan and rowspan tags so that layout is preserved.
https://mistral.ai/news/mistral-ocr-3

Pricing, Batch Inference, And Annotations

The OCR 3 model card lists pricing at $2 per 1,000 pages for standard OCR and $3 per 1,000 annotated pages when structured annotations are used.

Mistral also exposes OCR 3 through its Batch Inference API /v1/batch, which is documented under the batching section of the platform. Batch processing halves the effective OCR price to $1 per 1,000 pages by applying a 50% discount for jobs that run through the batch pipeline.

The model integrates with two important features on the same endpoint, Annotations – Structured and BBox Extraction. These allow developers to attach schema driven labels to regions of a document and get bounding boxes for text and other elements, which is useful when mapping content into downstream systems or UI overlays.

Key Takeaways

  1. Model and role: Mistral OCR 3, named as mistral-ocr-2512, is the new OCR service that powers Mistral’s Document AI stack for page based document understanding.
  2. Accuracy gains: On internal benchmarks covering forms, scanned documents, complex tables, and handwriting, OCR 3 achieves a 74% overall win rate over Mistral OCR 2, and Mistral positions it as state of the art against both traditional and AI native OCR systems.
  3. Structured outputs for RAG: The service extracts interleaved text and embedded images and returns markdown enriched with HTML reconstructed tables, preserving layout and table structure so outputs can feed directly into RAG, agents, and search pipelines with minimal extra parsing.
  4. API and document formats: Developers access OCR 3 via the /v1/ocr endpoint or SDK, passing PDFs as document_url and images such as png or jpeg as image_url, and can enable options like HTML table output, header or footer extraction, and base64 images in the response.
  5. Pricing and batch processing: OCR 3 is priced at 2 dollars per 1,000 pages and 3 dollars per 1,000 annotated pages, and when used through the Batch API the effective price for standard OCR drops to 1 dollar per 1,000 pages for large scale processing.

Check out the . Feel free to check out our . Also, feel free to follow us on  and don’t forget to join our  and Subscribe to .

The post appeared first on .

Read More

How to Build a High-Performance Distributed Task Routing System Using Kombu with Topic Exchanges and Concurrent Workers

 

In this tutorial, we build a fully functional event-driven workflow using , treating messaging as a core architectural capability. We walk through step by step the setup of exchanges, routing keys, background workers, and concurrent producers, allowing us to observe a real distributed system. As we implement each component, we see how clean message flow, asynchronous processing, and routing patterns give us the same power that production microservices rely on every day. Check out the .

!pip install kombu


import threading
import time
import logging
import uuid
import datetime
import sys


from kombu import Connection, Exchange, Queue, Producer, Consumer
from kombu.mixins import ConsumerMixin


logging.basicConfig(
   level=logging.INFO,
   format='%(message)s',
   handlers=[logging.StreamHandler(sys.stdout)],
   force=True
)
logger = logging.getLogger(__name__)


BROKER_URL = "memory://localhost/"

We begin by installing Kombu, importing dependencies, and configuring logging so we can clearly see every message flowing through the system. We also set the in-memory broker URL, allowing us to run everything locally in Colab without needing RabbitMQ. This setup forms the foundation for our distributed messaging workflow. Check out the .

media_exchange = Exchange('media_exchange', type='topic', durable=True)


task_queues = [
   Queue('video_queue', media_exchange, routing_key='video.#'),
   Queue('audit_queue', media_exchange, routing_key='#'),
]

We define a topic exchange to flexibly route messages using wildcard patterns. We also create two queues: one dedicated to video-related tasks and another audit queue that listens to everything. Using topic routing, we can precisely control how messages flow across the system. Check out the .

class Worker(ConsumerMixin):
   def __init__(self, connection, queues):
       self.connection = connection
       self.queues = queues
       self.should_stop = False


   def get_consumers(self, Consumer, channel):
       return [
           Consumer(queues=self.queues,
                    callbacks=[self.on_message],
                    accept=['json'],
                    prefetch_count=1)
       ]


   def on_message(self, body, message):
       routing_key = message.delivery_info['routing_key']
       payload_id = body.get('id', 'unknown')


       logger.info(f"n⚡ RECEIVED MSG via key: [{routing_key}]")
       logger.info(f"   Payload ID: {payload_id}")
      
       try:
           if 'video' in routing_key:
               self.process_video(body)
           elif 'audit' in routing_key:
               logger.info("   🔍 [Audit] Logging event...")
          
           message.ack()
           logger.info(f"   ✅ ACKNOWLEDGED")


       except Exception as e:
           logger.error(f"   ❌ ERROR: {e}")


   def process_video(self, body):
       logger.info("   ⚙  [Processor] Transcoding video (Simulating work...)")
       time.sleep(0.5)

We implement a custom worker using Kombu’s ConsumerMixin to run it in a background thread. In the message callback, we inspect the routing key, invoke the appropriate processing function, and acknowledge the message. This worker architecture gives us clean, concurrent message consumption with full control. Check out the .

def publish_messages(connection):
   producer = Producer(connection)
  
   tasks = [
       ('video.upload', {'file': 'movie.mp4'}),
       ('user.login', {'user': 'admin'}),
   ]


   logger.info("n🚀 PRODUCER: Starting to publish messages...")
  
   for r_key, data in tasks:
       data['id'] = str(uuid.uuid4())[:8]
      
       logger.info(f"📤 SENDING: {r_key} -> {data}")
      
       producer.publish(
           data,
           exchange=media_exchange,
           routing_key=r_key,
           serializer='json'
       )
       time.sleep(1.5)


   logger.info("🏁 PRODUCER: Done.")

We now build a producer that sends structured JSON payloads into the exchange with different routing keys. We generate unique IDs for each event and observe how they are routed to other queues. This mirrors real-world microservice event publishing, where producers and consumers remain decoupled. Check out the .

def run_example():
   with Connection(BROKER_URL) as conn:
       worker = Worker(conn, task_queues)
       worker_thread = threading.Thread(target=worker.run)
       worker_thread.daemon = True
       worker_thread.start()
      
       logger.info("✅ SYSTEM: Worker thread started.")
       time.sleep(1)


       try:
           publish_messages(conn)
           time.sleep(2)
       except KeyboardInterrupt:
           pass
       finally:
           worker.should_stop = True
           logger.info("n👋 SYSTEM: Execution complete.")


if __name__ == "__main__":
   run_example()

We start the worker in a background thread and fire the producer in the main thread. This structure gives us a mini distributed system running in Colab. By observing the logs, we see messages published → routed → consumed → acknowledged, completing the full event-processing lifecycle.

In conclusion, we orchestrated a dynamic, distributed task-routing pipeline that processes real-time events with clarity and precision. We witnessed how Kombu abstracts away the complexity of messaging systems while still giving us fine-grained control over routing, consumption, and worker concurrency. As we see messages move from producer to exchange to queue to worker, we gained a deeper appreciation for the elegance of event-driven system design, and we are now well-equipped to scale this foundation into robust microservices, background processors, and enterprise-grade workflows.


Check out the . Feel free to check out our . Also, feel free to follow us on  and don’t forget to join our  and Subscribe to .

The post appeared first on .

Read More

A Complete Workflow for Automated Prompt Optimization Using Gemini Flash, Few-Shot Selection, and Evolutionary Instruction Search

 

In this tutorial, we shift from traditional prompt crafting to a more systematic, programmable approach by treating prompts as tunable parameters rather than static text. Instead of guessing which instruction or example works best, we build an optimization loop around Gemini 2.0 Flash that experiments, evaluates, and automatically selects the strongest prompt configuration. In this implementation, we watch our model improve step by step, demonstrating how prompt engineering becomes far more powerful when we orchestrate it with data-driven search rather than intuition. Check out the .

import google.generativeai as genai
import json
import random
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import numpy as np
from collections import Counter


def setup_gemini(api_key: str = None):
   if api_key is None:
       api_key = input("Enter your Gemini API key: ").strip()
   genai.configure(api_key=api_key)
   model = genai.GenerativeModel('gemini-2.0-flash-exp')
   print("✓ Gemini 2.0 Flash configured")
   return model


@dataclass
class Example:
   text: str
   sentiment: str
   def to_dict(self):
       return {"text": self.text, "sentiment": self.sentiment}


@dataclass
class Prediction:
   sentiment: str
   reasoning: str = ""
   confidence: float = 1.0

We import all required libraries and define the setup_gemini helper to configure Gemini 2.0 Flash. We also create the Example and Prediction data classes to represent dataset entries and model outputs in a clean, structured way. Check out the .

def create_dataset() -> Tuple[List[Example], List[Example]]:
   train_data = [
       Example("This movie was absolutely fantastic! Best film of the year.", "positive"),
       Example("Terrible experience, waste of time and money.", "negative"),
       Example("The product works as expected, nothing special.", "neutral"),
       Example("I'm blown away by the quality and attention to detail!", "positive"),
       Example("Disappointing and overpriced. Would not recommend.", "negative"),
       Example("It's okay, does the job but could be better.", "neutral"),
       Example("Incredible customer service and amazing results!", "positive"),
       Example("Complete garbage, broke after one use.", "negative"),
       Example("Average product, met my basic expectations.", "neutral"),
       Example("Revolutionary! This changed everything for me.", "positive"),
       Example("Frustrating bugs and poor design choices.", "negative"),
       Example("Decent quality for the price point.", "neutral"),
       Example("Exceeded all my expectations, truly remarkable!", "positive"),
       Example("Worst purchase I've ever made, avoid at all costs.", "negative"),
       Example("It's fine, nothing to complain about really.", "neutral"),
       Example("Absolutely stellar performance, 5 stars!", "positive"),
       Example("Broken and unusable, total disaster.", "negative"),
       Example("Meets requirements, standard quality.", "neutral"),
   ]
   val_data = [
       Example("Absolutely love it, couldn't be happier!", "positive"),
       Example("Broken on arrival, very upset.", "negative"),
       Example("Works fine, no major issues.", "neutral"),
       Example("Outstanding performance and great value!", "positive"),
       Example("Regret buying this, total letdown.", "negative"),
       Example("Adequate for basic use.", "neutral"),
   ]
   return train_data, val_data


class PromptTemplate:
   def __init__(self, instruction: str = "", examples: List[Example] = None):
       self.instruction = instruction
       self.examples = examples or []
   def format(self, text: str) -> str:
       prompt_parts = []
       if self.instruction:
           prompt_parts.append(self.instruction)
       if self.examples:
           prompt_parts.append("nExamples:")
           for ex in self.examples:
               prompt_parts.append(f"nText: {ex.text}")
               prompt_parts.append(f"Sentiment: {ex.sentiment}")
       prompt_parts.append(f"nText: {text}")
       prompt_parts.append("Sentiment:")
       return "n".join(prompt_parts)
   def clone(self):
       return PromptTemplate(self.instruction, self.examples.copy())

We generate a small but diverse sentiment dataset for training and validation using the create_dataset function. We then define PromptTemplate, which lets us assemble instructions, a few-shot examples, and a current query into a single prompt string. We treat the template as a programmable object so we can swap instructions and examples during optimization. Check out the .

class SentimentModel:
   def __init__(self, model, prompt_template: PromptTemplate):
       self.model = model
       self.prompt_template = prompt_template


   def predict(self, text: str) -> Prediction:
       prompt = self.prompt_template.format(text)
       try:
           response = self.model.generate_content(prompt)
           result = response.text.strip().lower()
           for sentiment in ['positive', 'negative', 'neutral']:
               if sentiment in result:
                   return Prediction(sentiment=sentiment, reasoning=result)
           return Prediction(sentiment='neutral', reasoning=result)
       except Exception as e:
           return Prediction(sentiment='neutral', reasoning=str(e))


   def evaluate(self, dataset: List[Example]) -> float:
       correct = 0
       for example in dataset:
           pred = self.predict(example.text)
           if pred.sentiment == example.sentiment:
               correct += 1
       return (correct / len(dataset)) * 100

We wrap Gemini in the SentimentModel class so we can call it like a regular classifier. We format prompts via the template, call generate_content, and post-process the text to extract one of three sentiments. We also add an evaluate method so we can measure accuracy over any dataset with a single call. Check out the .

class PromptOptimizer:
   def __init__(self, model):
       self.model = model
       self.instruction_candidates = [
           "Analyze the sentiment of the following text. Classify as positive, negative, or neutral.",
           "Classify the sentiment: positive, negative, or neutral.",
           "Determine if this text expresses positive, negative, or neutral sentiment.",
           "What is the emotional tone? Answer: positive, negative, or neutral.",
           "Sentiment classification (positive/negative/neutral):",
           "Evaluate sentiment and respond with exactly one word: positive, negative, or neutral.",
       ]


   def select_best_examples(self, train_data: List[Example], val_data: List[Example], n_examples: int = 3) -> List[Example]:
       best_examples = None
       best_score = 0
       for _ in range(10):
           examples_by_sentiment = {
               'positive': [e for e in train_data if e.sentiment == 'positive'],
               'negative': [e for e in train_data if e.sentiment == 'negative'],
               'neutral': [e for e in train_data if e.sentiment == 'neutral']
           }
           selected = []
           for sentiment in ['positive', 'negative', 'neutral']:
               if examples_by_sentiment[sentiment]:
                   selected.append(random.choice(examples_by_sentiment[sentiment]))
           remaining = [e for e in train_data if e not in selected]
           while len(selected) < n_examples and remaining:
               selected.append(random.choice(remaining))
               remaining.remove(selected[-1])
           template = PromptTemplate(instruction=self.instruction_candidates[0], examples=selected)
           test_model = SentimentModel(self.model, template)
           score = test_model.evaluate(val_data[:3])
           if score > best_score:
               best_score = score
               best_examples = selected
       return best_examples


   def optimize_instruction(self, examples: List[Example], val_data: List[Example]) -> str:
       best_instruction = self.instruction_candidates[0]
       best_score = 0
       for instruction in self.instruction_candidates:
           template = PromptTemplate(instruction=instruction, examples=examples)
           test_model = SentimentModel(self.model, template)
           score = test_model.evaluate(val_data)
           if score > best_score:
               best_score = score
               best_instruction = instruction
       return best_instruction

We introduce the PromptOptimizer class and define a pool of candidate instructions to test. We implement select_best_examples to search for a small, diverse set of few-shot examples and optimize_instruction to score each instruction variant on validation data. We are effectively turning prompt design into a lightweight search problem over examples and instructions. Check out the .

  def compile(self, train_data: List[Example], val_data: List[Example], n_examples: int = 3) -> PromptTemplate:
       best_examples = self.select_best_examples(train_data, val_data, n_examples)
       best_instruction = self.optimize_instruction(best_examples, val_data)
       optimized_template = PromptTemplate(instruction=best_instruction, examples=best_examples)
       return optimized_template


def main():
   print("="*70)
   print("Prompt Optimization Tutorial")
   print("Stop Writing Prompts, Start Programming Them!")
   print("="*70)


   model = setup_gemini()
   train_data, val_data = create_dataset()
   print(f"✓ {len(train_data)} training examples, {len(val_data)} validation examples")


   baseline_template = PromptTemplate(
       instruction="Classify sentiment as positive, negative, or neutral.",
       examples=[]
   )
   baseline_model = SentimentModel(model, baseline_template)
   baseline_score = baseline_model.evaluate(val_data)


   manual_examples = train_data[:3]
   manual_template = PromptTemplate(
       instruction="Classify sentiment as positive, negative, or neutral.",
       examples=manual_examples
   )
   manual_model = SentimentModel(model, manual_template)
   manual_score = manual_model.evaluate(val_data)


   optimizer = PromptOptimizer(model)
   optimized_template = optimizer.compile(train_data, val_data, n_examples=4)

We add the compile method to combine the best examples and best instructions into a final optimized PromptTemplate. Inside main, we configure Gemini, build the dataset, and evaluate both a zero-shot baseline and a simple manual few-shot prompt. We then call the optimizer to produce our compiled, optimized prompt for sentiment analysis. Check out the .

optimized_model = SentimentModel(model, optimized_template)
   optimized_score = optimized_model.evaluate(val_data)


   print(f"Baseline (zero-shot):     {baseline_score:.1f}%")
   print(f"Manual few-shot:          {manual_score:.1f}%")
   print(f"Optimized (compiled):     {optimized_score:.1f}%")


   print(f"nInstruction: {optimized_template.instruction}")
   print(f"nSelected Examples ({len(optimized_template.examples)}):")
   for i, ex in enumerate(optimized_template.examples, 1):
       print(f"n{i}. Text: {ex.text}")
       print(f"   Sentiment: {ex.sentiment}")


   test_cases = [
       "This is absolutely amazing, I love it!",
       "Completely broken and unusable.",
       "It works as advertised, no complaints."
   ]


   for test_text in test_cases:
       print(f"nInput: {test_text}")
       pred = optimized_model.predict(test_text)
       print(f"Predicted: {pred.sentiment}")


   print("✓ Tutorial Complete!")


if __name__ == "__main__":
   main()

We evaluate the optimized model and compare its accuracy against the baseline and manual few-shot setups. We print the chosen instruction and the selected examples so we can inspect what the optimizer discovers, and then we run a few live test sentences to see predictions in action. We finish by summarizing the improvements and reinforcing the idea that prompts can be tuned programmatically rather than written by hand.

In conclusion, we implemented how programmatic prompt optimization provides a repeatable, evidence-driven workflow for designing high-performing prompts. We began with a fragile baseline, then iteratively tested instructions, selected diverse examples, and compiled an optimized template that outperforms manual attempts. This process shows that we no longer rely on trial-and-error prompting; instead, we orchestrated a controlled optimization cycle. Also, we can extend this pipeline to new tasks, richer datasets, and more advanced scoring methods, allowing us to engineer prompts with precision, confidence, and scalability.


Check out the . Feel free to check out our . Also, feel free to follow us on  and don’t forget to join our  and Subscribe to . Wait! are you on telegram? 

The post appeared first on .

Read More

Unsloth AI and NVIDIA are Revolutionizing Local LLM Fine-Tuning: From RTX Desktops to DGX Spark

 

Fine-tune popular AI models faster with on NVIDIA RTX AI PCs such as to and the new to build personalized assistants for coding, creative work, and complex agentic workflows.

The landscape of modern AI is shifting. We are moving away from a total reliance on massive, generalized cloud models and entering the era of local, agentic AI. Whether it is tuning a chatbot to handle hyper-specific product support or building a personal assistant that manages intricate schedules, the potential for generative AI on local hardware is boundless.

However, developers face a persistent bottleneck: How do you get a Small Language Model (SLM) to punch above its weight class and respond with high accuracy for specialized tasks?

The answer is Fine-Tuning, and the tool of choice is .

Unsloth provides an easy and high-speed method to customize models. Optimized for efficient, low-memory training on NVIDIA GPUs, Unsloth scales effortlessly from all the way to the , the world’s smallest AI supercomputer.

The Fine-Tuning Paradigm

Think of fine-tuning as a high-intensity boot camp for your AI. By feeding the model examples tied to a specific workflow, it learns new patterns, adapts to specialized tasks, and dramatically improves accuracy.

Depending on your hardware and goals, developers generally utilize one of three main methods:

1. Parameter-Efficient Fine-Tuning (PEFT)

  • The Tech: LoRA (Low-Rank Adaptation) or QLoRA.
  • How it Works: Instead of retraining the whole brain, this updates only a small portion of the model. It is the most efficient way to inject domain knowledge without breaking the bank.
  • Best For: Improving coding accuracy, legal/scientific adaptation, or tone alignment.
  • Data Needed: Small datasets (100–1,000 prompt-sample pairs).

2. Full Fine-Tuning

  • The Tech: Updating all model parameters.
  • How it Works: This is a total overhaul. It is essential when the model needs to rigidly adhere to specific formats or strict guardrails.
  • Best For: Advanced AI agents and distinct persona constraints.
  • Data Needed: Large datasets (1,000+ prompt-sample pairs).

3. Reinforcement Learning (RL)

  • The Tech: Preference optimization (RLHF/DPO).
  • How it Works: The model learns by interacting with an environment and receiving feedback signals to improve behavior over time.
  • Best For: High-stakes domains (Law, Medicine) or autonomous agents.
  • Data Needed: Action model + Reward model + RL Environment.

The Hardware Reality: VRAM Management Guide

One of the most critical factors in local fine-tuning is Video RAM (VRAM). Unsloth is magic, but physics still applies. Here is the breakdown of what hardware you need based on your target model size and tuning method.

For PEFT (LoRA/QLoRA)

This is where most hobbyists and individual developers will live.

  • <12B Parameters: ~8GB VRAM (Standard GeForce RTX GPUs).
  • 12B–30B Parameters: ~24GB VRAM (Perfect for GeForce RTX 5090).
  • 30B–120B Parameters: ~80GB VRAM (Requires DGX Spark or RTX PRO).

For Full Fine-Tuning

For when you need total control over the model weights.

  • <3B Parameters: ~25GB VRAM (GeForce RTX 5090 or RTX PRO).
  • 3B–15B Parameters: ~80GB VRAM (DGX Spark territory).

For Reinforcement Learning

The cutting edge of agentic behavior.

  • <12B Parameters: ~12GB VRAM (GeForce RTX 5070).
  • 12B–30B Parameters: ~24GB VRAM (GeForce RTX 5090).
  • 30B–120B Parameters: ~80GB VRAM (DGX Spark).

Unsloth: The “Secret Sauce” of Speed

Why is Unsloth winning the fine-tuning race? It comes down to math.

LLM fine-tuning involves billions of matrix multiplications, the kind of math well suited for parallel, GPU-accelerated computing. Unsloth excels by translating the complex matrix multiplication operations into efficient, custom kernels on NVIDIA GPUs. This optimization allows Unsloth to boost the performance of the Hugging Face transformers library by 2.5x on NVIDIA GPUs.

By combining raw speed with ease of use, Unsloth is democratizing high-performance AI, making it accessible to everyone from a student on a laptop to a researcher on a DGX system.

Representative Use Case Study 1: The “Personal Knowledge Mentor”

The Goal: Take a base model (like Llama 3.2 ) and teach it to respond in a specific, high-value style, acting as a mentor who explains complex topics using simple analogies and always ends with a thought-provoking question to encourage critical thinking.

The Problem: Standard system prompts are brittle. To get a high-quality “Mentor” persona, you must provide a 500+ token instruction block. This creates a “Token Tax” that slows down every response and eats up valuable memory. Over long conversations, the model suffers from “Persona Drift,” eventually forgetting its rules and reverting to a generic, robotic assistant. Furthermore, it is nearly impossible to “prompt” a specific verbal rhythm or subtle “vibe” without the model sounding like a forced caricature.

The Solution: sing Unsloth to run a local QLoRA fine-tune on a GeForce RTX GPU, powered by a curated dataset of 50–100 high-quality “Mentor” dialogue examples. This process “bakes” the personality directly into the model’s neural weights rather than relying on the temporary memory of a prompt. 

The Result: A standard model might miss the analogy or forget the closing question when the topic gets difficult. The fine-tuned model acts as a “Native Mentor.” It maintains its persona indefinitely without a single line of system instructions. It picks up on implicit patterns, the specific way a mentor speaks, making the interaction feel authentic and fluid.

Representative use Case Study 2: The “Legacy Code” Architect

To see the power of local fine-tuning, look no further than the banking sector.

The Problem: Banks run on ancient code (COBOL, Fortran). Standard 7B models hallucinate when trying to modernize this logic, and sending proprietary banking code to GPT-4 is a massive security violation.

The Solution: Using Unsloth to fine-tune a 32B model (like Qwen 2.5 Coder) specifically on the company’s 20-year-old “spaghetti code.”

The Result: A standard 7B model translates line-by-line. The fine-tuned 32B model acts as a “Senior Architect.” It holds entire files in context, refactoring 2,000-line monoliths into clean microservices while preserving exact business logic, all performed securely on local NVIDIA hardware.

Representative use Case Study 3: The Privacy-First “AI Radiologist”

While text is powerful, the next frontier of local AI is Vision. Medical institutions sit on mountains of imaging data (X-rays, CT scans) that cannot legally be uploaded to public cloud models due to HIPAA/GDPR compliance.

The Problem: Radiologists are overwhelmed, and standard Vision Language Models (VLMs) like Llama 3.2 Vision are too generalized, identifying a “person” easily, but missing subtle hairline fractures or early-stage anomalies in low-contrast X-rays.

The Solution: A healthcare research team utilizes . Instead of training from scratch (costing millions), they take a pre-trained Llama 3.2 Vision (11B) model and fine-tune it locally on an NVIDIA DGX Spark or dual-RTX 6000 Ada workstation. They feed the model a curated, private dataset of 5,000 anonymized X-rays paired with expert radiologist reports, using LoRA to update vision encoders specifically for medical anomalies.

The Outcome: The result is a specialized “AI Resident” operating entirely offline.

  • Accuracy: Detection of specific pathologies improves over the base model.
  • Privacy: No patient data ever leaves the on-premise hardware.
  • Speed: Unsloth optimizes the vision adapters, cutting training time from weeks to hours, allowing for weekly model updates as new data arrives.

Here is the technical breakdown of how to build this solution using Unsloth based on the Unsloth.

For a tutorial on how to fine-tune vision models using Llama 3.2 click . 

Ready to Start?

Unsloth and NVIDIA have provided comprehensive guides to get you running immediately.

  • For Desktop Users:
  • For Vision Models:
  • For Pros: Learn how to .

Thanks to the NVIDIA AI team for the thought leadership/ Resources for this article. NVIDIA AI team has supported this content/article.

The post appeared first on .

Read More

How to Orchestrate a Fully Autonomous Multi-Agent Research and Writing Pipeline Using CrewAI and Gemini for Real-Time Intelligent Collaboration

 

In this tutorial, we implement how we build a small but powerful two-agent system that collaborates using the Gemini Flash model. We set up our environment, authenticate securely, define specialized agents, and orchestrate tasks that flow from research to structured writing. As we run the crew, we observe how each component works together in real time, giving us a hands-on understanding of modern agentic workflows powered by LLMs. With these steps, we clearly see how multi-agent pipelines become practical, modular, and developer-friendly. Check out the .

import os
import sys
import getpass
from textwrap import dedent


print("Installing CrewAI and tools... (this may take 1-2 mins)")
!pip install -q crewai crewai-tools


from crewai import Agent, Task, Crew, Process, LLM

We set up our environment and installed the required CrewAI packages so we can run everything smoothly in Colab. We import the necessary modules and lay the foundation for our multi-agent workflow. This step ensures that our runtime is clean and ready for the agents we create next. Check out the .

print("n--- API Authentication ---")
api_key = None


try:
   from google.colab import userdata
   api_key = userdata.get('GEMINI_API_KEY')
   print("✅ Found GEMINI_API_KEY in Colab Secrets.")
except Exception:
   pass


if not api_key:
   print("ℹ  Key not found in Secrets.")
   api_key = getpass.getpass("🔑 Enter your Google Gemini API Key: ")


os.environ["GEMINI_API_KEY"] = api_key


if not api_key:
   sys.exit("❌ Error: No API Key provided. Please restart and enter a key.")

We authenticate ourselves securely by retrieving or entering the Gemini API key. We ensure the key is securely stored in the environment so the model can operate without interruption. This step gives us confidence that our agent framework can communicate reliably with the LLM. Check out the .

gemini_flash = LLM(
   model="gemini/gemini-2.0-flash",
   temperature=0.7
)

We configure the Gemini Flash model that our agents rely on for reasoning and generation. We choose the temperature and model variant to balance creativity and precision. This configuration becomes the shared intelligence that drives all agent tasks ahead. Check out the .

researcher = Agent(
   role='Tech Researcher',
   goal='Uncover cutting-edge developments in AI Agents',
   backstory=dedent("""You are a veteran tech analyst with a knack for finding emerging trends before they become mainstream. You specialize in Autonomous AI Agents and Large Language Models."""),
   verbose=True,
   allow_delegation=False,
   llm=gemini_flash
)


writer = Agent(
   role='Technical Writer',
   goal='Write a concise, engaging blog post about the researcher's findings',
   backstory=dedent("""You transform complex technical concepts into compelling narratives. You write for a developer audience who wants practical insights without fluff."""),
   verbose=True,
   allow_delegation=False,
   llm=gemini_flash
)

We define two specialized agents, a researcher and a writer, each with a clear role and backstory. We design them so they complement one another, allowing one to discover insights while the other transforms them into polished writing. Here, we begin to see how multi-agent collaboration takes shape. Check out the .

research_task = Task(
   description=dedent("""Conduct a simulated research analysis on 'The Future of Agentic AI in 2025'. Identify three key trends: 1. Multi-Agent Orchestration 2. Neuro-symbolic AI 3. On-device Agent execution Provide a summary for each based on your 'expert knowledge'."""),
   expected_output="A structured list of 3 key AI trends with brief descriptions.",
   agent=researcher
)


write_task = Task(
   description=dedent("""Using the researcher's findings, write a short blog post (approx 200 words). The post should have: - A catchy title - An intro - The three bullet points - A conclusion on why developers should care."""),
   expected_output="A markdown-formatted blog post.",
   agent=writer,
   context=[research_task]
)

We create two tasks that assign specific responsibilities to our agents. We let the researcher generate structured insights and then pass the output to the writer to create a complete blog post. This step shows how we orchestrate sequential task dependencies cleanly within CrewAI. Check out the .

tech_crew = Crew(
   agents=[researcher, writer],
   tasks=[research_task, write_task],
   process=Process.sequential,
   verbose=True
)


print("n--- 🤖 Starting the Crew ---")
result = tech_crew.kickoff()


from IPython.display import Markdown
print("nn########################")
print("##   FINAL OUTPUT     ##")
print("########################n")
display(Markdown(str(result)))

We assemble the agents and tasks into a crew and run the entire multi-agent workflow. We watch how the system executes step by step, producing the final markdown output. This is where everything comes together, and we see our agents collaborating in real time.

In conclusion, we appreciate how seamlessly CrewAI allows us to create coordinated agent systems that think, research, and write together. We experience firsthand how defining roles, tasks, and process flows lets us modularize complex work and achieve coherent outputs with minimal code. This framework empowers us to build richer, more autonomous agentic applications, and we walk away confident in extending this foundation into larger multi-agent systems, production pipelines, or more creative AI collaborations.


Check out the . Feel free to check out our . Also, feel free to follow us on  and don’t forget to join our  and Subscribe to . Wait! are you on telegram? 

The post appeared first on .

Read More

How to Design a Gemini-Powered Self-Correcting Multi-Agent AI System with Semantic Routing, Symbolic Guardrails, and Reflexive Orchestration

 

In this tutorial, we explore how we design and run a full agentic AI orchestration pipeline powered by semantic routing, symbolic guardrails, and self-correction loops using Gemini. We walk through how we structure agents, dispatch tasks, enforce constraints, and refine outputs using a clean, modular architecture. As we progress through each snippet, we see how the system intelligently chooses the right agent, validates its output, and improves itself through iterative reflection. Check out the .

import os
import json
import time
import typing
from dataclasses import dataclass, asdict
from google import genai
from google.genai import types


API_KEY = os.environ.get("GEMINI_API_KEY", "API Key")
client = genai.Client(api_key=API_KEY)


@dataclass
class AgentMessage:
   source: str
   target: str
   content: str
   metadata: dict
   timestamp: float = time.time()

We set up our core environment by importing essential libraries, defining the API key, and initializing the Gemini client. We also establish the AgentMessage structure, which acts as the shared communication format between agents. Check out the .

class CognitiveEngine:
   @staticmethod
   def generate(prompt: str, system_instruction: str, json_mode: bool = False) -> str:
       config = types.GenerateContentConfig(
           temperature=0.1,
           response_mime_type="application/json" if json_mode else "text/plain"
       )
       try:
           response = client.models.generate_content(
               model="gemini-2.0-flash",
               contents=prompt,
               config=config
           )
           return response.text
       except Exception as e:
           raise ConnectionError(f"Gemini API Error: {e}")


class SemanticRouter:
   def __init__(self, agents_registry: dict):
       self.registry = agents_registry


   def route(self, user_query: str) -> str:
       prompt = f"""
       You are a Master Dispatcher. Analyze the user request and map it to the ONE best agent.
       AVAILABLE AGENTS:
       {json.dumps(self.registry, indent=2)}
       USER REQUEST: "{user_query}"
       Return ONLY a JSON object: {{"selected_agent": "agent_name", "reasoning": "brief reason"}}
       """
       response_text = CognitiveEngine.generate(prompt, "You are a routing system.", json_mode=True)
       try:
           decision = json.loads(response_text)
           print(f"   [Router] Selected: {decision['selected_agent']} (Reason: {decision['reasoning']})")
           return decision['selected_agent']
       except:
           return "general_agent"

We build the cognitive layer using Gemini, allowing us to generate both text and JSON outputs depending on the instruction. We also implement the semantic router, which analyzes queries and selects the most suitable agent. Check out the .

class Agent:
   def __init__(self, name: str, instruction: str):
       self.name = name
       self.instruction = instruction


   def execute(self, message: AgentMessage) -> str:
       return CognitiveEngine.generate(
           prompt=f"Input: {message.content}",
           system_instruction=self.instruction
       )


class Orchestrator:
   def __init__(self):
       self.agents_info = {
           "analyst_bot": "Analyzes data, logic, and math. Returns structured JSON summaries.",
           "creative_bot": "Writes poems, stories, and creative text. Returns plain text.",
           "coder_bot": "Writes Python code snippets."
       }
       self.workers = {
           "analyst_bot": Agent("analyst_bot", "You are a Data Analyst. output strict JSON."),
           "creative_bot": Agent("creative_bot", "You are a Creative Writer."),
           "coder_bot": Agent("coder_bot", "You are a Python Expert. Return only code.")
       }
       self.router = SemanticRouter(self.agents_info)

We construct the worker agents and the central orchestrator. Each agent receives a clear role, analyst, creative, or coder, and we configure the orchestrator to manage them. As we review this section, we see how we define the agent ecosystem and prepare it for intelligent task delegation. Check out the .

 def validate_constraint(self, content: str, constraint_type: str) -> tuple[bool, str]:
       if constraint_type == "json_only":
           try:
               json.loads(content)
               return True, "Valid JSON"
           except:
               return False, "Output was not valid JSON."
       if constraint_type == "no_markdown":
           if "```" in content:
               return False, "Output contains Markdown code blocks, which are forbidden."
           return True, "Valid Text"
       return True, "Pass"


   def run_task(self, user_input: str, constraint: str = None, max_retries: int = 2):
       print(f"n--- New Task: {user_input} ---")
       target_name = self.router.route(user_input)
       worker = self.workers.get(target_name)
       current_input = user_input
       history = []
       for attempt in range(max_retries + 1):
           try:
               msg = AgentMessage(source="User", target=target_name, content=current_input, metadata={})
               print(f"   [Exec] {worker.name} working... (Attempt {attempt+1})")
               result = worker.execute(msg)
               if constraint:
                   is_valid, error_msg = self.validate_constraint(result, constraint)
                   if not is_valid:
                       print(f"   [Guardrail] VIOLATION: {error_msg}")
                       current_input = f"Your previous answer failed a check.nOriginal Request: {user_input}nYour Answer: {result}nError: {error_msg}nFIX IT immediately."
                       continue
               print(f"   [Success] Final Output:n{result[:100]}...")
               return result
           except Exception as e:
               print(f"   [System Error] {e}")
               time.sleep(1)
       print("   [Failed] Max retries reached or self-correction failed.")
       return None

We implement symbolic guardrails and a self-correction loop to enforce constraints like strict JSON or no Markdown. We run iterative refinement whenever outputs violate requirements, allowing our agents to fix their own mistakes. Check out the .

if __name__ == "__main__":
   orchestrator = Orchestrator()
   orchestrator.run_task(
       "Compare the GDP of France and Germany in 2023.",
       constraint="json_only"
   )
   orchestrator.run_task(
       "Write a Python function for Fibonacci numbers.",
       constraint="no_markdown"
   )

We execute two complete scenarios, showcasing routing, agent execution, and constraint validation in action. We run a JSON-enforced analytical task and a coding task with Markdown restrictions to observe the reflexive behavior. 

In conclusion, we now see how multiple components, routing, worker agents, guardrails, and self-correction, come together to create a reliable and intelligent agentic system. We witness how each part contributes to robust task execution, ensuring that outputs remain accurate, aligned, and constraint-aware. As we reflect on the architecture, we recognize how easily we can expand it with new agents, richer constraints, or more advanced reasoning strategies.


Check out the . Feel free to check out our . Also, feel free to follow us on  and don’t forget to join our  and Subscribe to . Wait! are you on telegram? 

The post appeared first on .

Read More