Sakana AI Released ShinkaEvolve: An Open-Source Framework that Evolves Programs for Scientific Discovery with Unprecedented Sample-Efficiency

Sakana AI has released ShinkaEvolve, an open-source framework that uses large language models (LLMs) as mutation operators in an evolutionary loop to evolve programs for scientific and engineering problems—while drastically cutting the number of evaluations needed to reach strong solutions. On the canonical circle-packing benchmark (n=26 in a unit square), ShinkaEvolve reports a new SOTA configuration using ~150 program evaluations, where prior systems typically burned thousands. The project ships under Apache-2.0, with a research report and public code.

https://sakana.ai/shinka-evolve/

What problem is it actually solving?

Most “agentic” code-evolution systems explore by brute force: they mutate code, run it, score it, and repeat—consuming enormous sampling budgets. ShinkaEvolve targets that waste explicitly with three interacting components (a minimal sketch of how they fit together follows the list):

  1. Adaptive parent sampling to balance exploration/exploitation. Parents are drawn from “islands” via fitness- and novelty-aware policies (power-law or weighted by performance and offspring counts) rather than always climbing the current best.
  2. Novelty-based rejection filtering to avoid re-evaluating near-duplicates. Mutable code segments are embedded; if cosine similarity exceeds a threshold, a secondary LLM acts as a “novelty judge” before execution.
  3. Bandit-based LLM ensembling so the system learns which model (e.g., GPT/Gemini/Claude/DeepSeek families) is yielding the biggest relative fitness jumps and routes future mutations accordingly (UCB1-style update on improvement over parent/baseline).
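
A minimal sketch of how these three mechanisms could interact around a single mutation proposal, assuming hypothetical class and function names rather than ShinkaEvolve's actual API:

import numpy as np

class UCB1Router:
    """UCB1-style bandit over candidate mutation LLMs (illustrative, not the framework's code)."""
    def __init__(self, models):
        self.models = list(models)
        self.counts = {m: 1.0 for m in self.models}   # optimistic initial pull per arm
        self.gains = {m: 0.0 for m in self.models}    # cumulative fitness improvement over parents

    def pick(self):
        total = sum(self.counts.values())
        def ucb(m):
            return self.gains[m] / self.counts[m] + np.sqrt(2.0 * np.log(total) / self.counts[m])
        return max(self.models, key=ucb)

    def update(self, model, parent_fitness, child_fitness):
        self.counts[model] += 1
        self.gains[model] += max(0.0, child_fitness - parent_fitness)


def passes_novelty_filter(candidate_emb, archive_embs, ask_judge=None, threshold=0.95):
    """Reject near-duplicates of archived programs via cosine similarity of code embeddings;
    defer borderline cases to a secondary LLM acting as a novelty judge (illustrative)."""
    if len(archive_embs) == 0:
        return True
    sims = archive_embs @ candidate_emb / (
        np.linalg.norm(archive_embs, axis=1) * np.linalg.norm(candidate_emb) + 1e-9)
    if sims.max() < threshold:
        return True
    return ask_judge() if ask_judge is not None else False   # consult the judge before executing

The reward here is the clipped improvement over the parent, which mirrors the idea of routing future mutations toward whichever model has recently produced the largest relative fitness jumps; the exact reward shaping and threshold are assumptions.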

Does the sample-efficiency claim hold beyond toy problems?

The research team evaluates four distinct domains and shows consistent gains with small budgets:

  • Circle packing (n=26): reaches an improved configuration in roughly 150 evaluations; the research team also validates the result with stricter exact-constraint checking.
  • AIME math reasoning (2024 set): evolves agentic scaffolds that trace out a Pareto frontier of accuracy vs. LLM-call budget, outperforming hand-built baselines under limited query budgets and transferring to other AIME years and LLMs.
  • Competitive programming (ALE-Bench LITE): starting from ALE-Agent solutions, ShinkaEvolve delivers ~2.3% mean improvement across 10 tasks and pushes one task’s solution from 5th → 2nd in an AtCoder leaderboard counterfactual.
  • LLM training (Mixture-of-Experts): evolves a new load-balancing loss that improves perplexity and downstream accuracy at multiple regularization strengths vs. the widely-used global-batch LBL.
https://sakana.ai/shinka-evolve/

How does the evolutionary loop look in practice?

ShinkaEvolve maintains an archive of evaluated programs with fitness, public metrics, and textual feedback. For each generation: sample an island and parent(s); construct a mutation context with top-K and random “inspiration” programs; then propose edits via three operators—diff edits, full rewrites, and LLM-guided crossovers—while protecting immutable code regions with explicit markers. Executed candidates update both the archive and the bandit statistics that steer subsequent LLM/model selection. The system periodically produces a meta-scratchpad that summarizes recently successful strategies; those summaries are fed back into prompts to accelerate later generations.
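
Condensed into pseudocode, one generation of the loop described above might look roughly like the following; every callable and attribute here is a placeholder injected by the caller, not part of ShinkaEvolve's actual API:

def run_generation(archive, islands, router, sample_island, sample_parent,
                   build_context, choose_operator, propose_edit, is_novel, evaluate):
    """One evolutionary step, loosely following the loop described above (illustrative sketch)."""
    island = sample_island(islands)                      # fitness/novelty-aware island choice
    parent = sample_parent(island)                       # power-law or performance-weighted parent
    context = build_context(parent,
                            inspirations=archive.top_k(3) + archive.random_sample(2))
    model = router.pick()                                # bandit-selected mutation LLM
    operator = choose_operator(("diff_edit", "full_rewrite", "crossover"))
    child = propose_edit(model, operator, context)       # immutable regions protected by markers
    if not is_novel(child, archive):                     # embedding + LLM novelty rejection
        return None                                      # skip execution of a near-duplicate
    fitness, metrics, feedback = evaluate(child)
    archive.add(child, fitness, metrics, feedback)       # archive keeps programs plus textual feedback
    router.update(model, parent.fitness, fitness)        # steer subsequent LLM selection
    return child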

What are the concrete results?

  • Circle packing: the winning program combines structured initialization (e.g., golden-angle patterns), hybrid global–local search (simulated annealing + SLSQP), and escape mechanisms (temperature reheating, ring rotations), all discovered by the system rather than hand-coded a priori.
  • AIME scaffolds: a three-stage expert ensemble (generation → critical peer review → synthesis) that hits the accuracy/cost sweet spot at ~7 calls while retaining robustness when swapped to different LLM backends (a bare-bones sketch follows this list).
  • ALE-Bench: targeted engineering wins (e.g., kd-tree subtree stats; “targeted edge moves” toward misclassified items) that push scores without wholesale rewrites.
  • MoE loss: adds an entropy-modulated under-use penalty to the global-batch objective; empirically reduces mis-routing and improves perplexity/benchmarks as layer routing concentrates.
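
As noted in the AIME bullet above, a bare-bones generation → critique → synthesis scaffold takes only a few lines; the ask_llm callable and prompts below are stand-ins for illustration, not the scaffold ShinkaEvolve actually evolved:

def three_stage_ensemble(problem, ask_llm, n_experts=3):
    """Parallel drafts, critical peer review, then one synthesis call (~7 LLM calls; illustrative)."""
    drafts = [ask_llm(f"Solve step by step:\n{problem}") for _ in range(n_experts)]            # 3 calls
    reviews = [ask_llm(f"Critically review this solution for errors:\n{d}") for d in drafts]   # 3 calls
    bundle = "\n\n".join(f"Draft {i+1}:\n{d}\nReview {i+1}:\n{r}"
                         for i, (d, r) in enumerate(zip(drafts, reviews)))
    return ask_llm("Synthesize the most reliable final answer from the drafts and reviews "
                   f"below. Return only the final answer.\n\n{bundle}")                         # 1 call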

How does this compare to AlphaEvolve and related systems?

AlphaEvolve demonstrated strong closed-source results but at higher evaluation counts. ShinkaEvolve reproduces and surpasses the circle-packing result with orders-of-magnitude fewer samples and releases all components open-source. The research team also contrasts variants (single-model vs. fixed ensemble vs. bandit ensemble) and ablates parent selection and novelty filtering, showing that each contributes to the observed efficiency.

Summary

ShinkaEvolve is an Apache-2.0 framework for LLM-driven program evolution that cuts evaluations from thousands to hundreds by combining fitness/novelty-aware parent sampling, embedding-plus-LLM novelty rejection, and a UCB1-style adaptive LLM ensemble. It sets a new SOTA on circle packing (~150 evals), finds stronger AIME scaffolds under strict query budgets, improves ALE-Bench solutions (~2.3% mean gain, 5th→2nd on one task), and discovers a new MoE load-balancing loss that improves perplexity and downstream accuracy. Code and report are public.


FAQs — ShinkaEvolve

1) What is ShinkaEvolve?
An open-source framework that couples LLM-driven program mutations with evolutionary search to automate algorithm discovery and optimization. Code and report are public.

2) How does it achieve higher sample-efficiency than prior evolutionary systems?
Three mechanisms: adaptive parent sampling (explore/exploit balance), novelty-based rejection to avoid duplicate evaluations, and a bandit-based selector that routes mutations to the most promising LLMs.

3) What supports the results?
It reaches state-of-the-art circle packing with ~150 evaluations; on AIME-2024 it evolves scaffolds under a 10-query cap per problem; it improves ALE-Bench solutions over strong baselines.

4) Where can I run it and what’s the license?
The GitHub repo provides a WebUI and examples; ShinkaEvolve is released under Apache-2.0.


Google AI Ships a Model Context Protocol (MCP) Server for Data Commons, Giving AI Agents First-Class Access to Public Stats


Google released a Model Context Protocol (MCP) server for Data Commons, exposing the project’s interconnected public datasets—census, health, climate, economics—through a standards-based interface that agentic systems can query in natural language. The Data Commons MCP Server is available now with quickstarts for the Gemini CLI and Google’s Agent Development Kit (ADK).

What was released

  • An MCP server that lets any MCP-capable client or AI agent discover variables, resolve entities, fetch time series, and generate reports from Data Commons without hand-coding API calls. Google positions it as “from initial discovery to generative reports,” with example prompts spanning exploratory, analytical, and generative workflows.
  • Developer on-ramps: a PyPI package, a Gemini CLI flow, and an ADK sample/Colab to embed Data Commons queries inside agent pipelines.

Why MCP now?

MCP is an open protocol for connecting LLM agents to external tools and data with consistent capabilities (tools, prompts, resources) and transport semantics. By shipping a first-party MCP server, Google makes Data Commons addressable through the same interface that agents already use for other sources, reducing per-integration glue code and enabling registry-based discovery alongside other servers.
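
For a sense of what "the same interface agents already use" means in code, here is a minimal client sketch using the open-source MCP Python SDK; the launch command, package name, tool name, and arguments are assumptions for illustration, and the real invocation is documented in the Data Commons quickstart:

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder launch command; substitute the Data Commons MCP server's documented invocation.
server = StdioServerParameters(command="uvx", args=["datacommons-mcp"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()              # registry-style capability discovery
            print([tool.name for tool in tools.tools])
            # Hypothetical tool name and arguments, purely to show the call shape.
            result = await session.call_tool(
                "get_observations",
                arguments={"variable": "LifeExpectancy", "entity": "country/BRA"},
            )
            print(result.content)

asyncio.run(main())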

What can you do with it?

  • Exploratory: “What health data do you have for Africa?” → enumerate variables, coverage, and sources.
  • Analytical: “Compare life expectancy, inequality, and GDP growth for BRICS nations.” → retrieve series, normalize geos, align vintages, and return a table or chart payload.
  • Generative: “Generate a concise report on income vs. diabetes in US counties.” → fetch measures, compute correlations, include provenance.

Integration surface

  • Gemini CLI / any MCP client: install the Data Commons MCP package, point the client at the server, and issue NL queries; the client coordinates tool calls behind the scenes.
  • ADK agents: use Google’s sample agent to compose Data Commons calls with your own tools (e.g., visualization, storage) and return sourced outputs.
  • Docs entry point: “MCP — Query data interactively with an AI agent,” with links to the quickstart and user guide.

Real-world use case

Google highlights an agent built with the Data Commons MCP Server for the ONE Campaign. It lets policy analysts query tens of millions of health-financing datapoints via natural language, visualize results, and export clean datasets for downstream work.

Summary

In short, Google’s Data Commons MCP Server turns a sprawling corpus of public statistics into a first-class, protocol-native data source for agents—reducing custom glue code, preserving provenance, and fitting cleanly into existing MCP clients like Gemini CLI and ADK.



OpenAI Releases ChatGPT ‘Pulse’: Proactive, Personalized Daily Briefings for Pro Users

 

OpenAI introduced ChatGPT Pulse, a proactive experience that compiles personalized, research-backed updates each morning. In preview on mobile and limited to $200/month Pro subscribers, Pulse surfaces topical cards built from a user’s chats, explicit feedback, and opt-in connected apps (e.g., calendar/email), shifting ChatGPT from a request-driven tool to a context-aware assistant.

What Pulse Actually Does Under the Hood

Each day, Pulse performs background research anchored to user signals: recent conversations, long-term interests, thumbs-up/down feedback, and data from connected apps where enabled. The output appears as scannable visual cards (briefs and deep links) rather than an infinite feed, designed for quick triage and drill-down. Early examples include targeted news roundups and context-conditioned suggestions (e.g., travel planning aligned with calendar events).

Data Sources and Controls

Integrations are off by default and can be toggled. When granted, Pulse may use Gmail/Google Calendar context to tailor cards (e.g., meeting prep, itinerary nudges). OpenAI positions this as a user-level personalization layer; reporting notes emphasize optionality and in-app settings for managing connected accounts and memory.

Availability and Rollout Plan

Pulse is rolling out now to Pro subscribers on the ChatGPT mobile app as a dedicated tab. OpenAI says it wants broader availability “soon,” with access targeted after product and efficiency improvements. The company reiterated the Pro-first gating due to compute costs.

Product Positioning: Toward Agentic, Goal-Oriented Workflows

OpenAI frames Pulse as the first step toward agent-like behavior where the model tracks goals and initiates updates without prompts. External coverage highlights the shift from chat to assistant workflows that reason over user state and schedule. This aligns with OpenAI’s recent emphasis on agents and proactive help, not passive Q&A.

The Signal from Leadership

Sam Altman summarized the intent succinctly: Pulse is his “favorite feature” to date, starting with Pro. His post also underscores the model’s use of interests and recent chats, hinting at broader personalization as users share preferences over time. OpenAI’s official announcement on X mirrors the blog language around daily, proactive updates.

Competitive Context

Pulse lands in a crowded “morning brief” space but differs by tying briefs to your live context and chats rather than generic headlines. It also inches ChatGPT toward hands-on assistant territory seen in agent platforms that watch calendars, draft emails, and pre-stage tasks—yet packaged for consumers inside the ChatGPT app rather than a separate agent runner.

Summary

Pulse formalizes ChatGPT as a proactive system: it reads your signals, checks your day, and delivers a compact, personalized brief—first for Pro on mobile, with Plus on the roadmap once the system is optimized. The implementation details (APIs, enterprise knobs, retention policies) will determine how far it goes beyond morning cards into full agent workflows.

OpenAI Introduces GDPval: A New Evaluation Suite that Measures AI on Real-World Economically Valuable Tasks


OpenAI introduced GDPval, a new evaluation suite designed to measure how AI models perform on real-world, economically valuable tasks across 44 occupations in nine GDP-dominant U.S. sectors. Unlike academic benchmarks, GDPval centers on authentic deliverables—presentations, spreadsheets, briefs, CAD artifacts, audio/video—graded by occupational experts through blinded pairwise comparisons. OpenAI also released a 220-task “gold” subset and an experimental automated grader hosted at evals.openai.com.

From Benchmarks to Billables: How GDPval Builds Tasks

GDPval aggregates 1,320 tasks sourced from industry professionals averaging 14 years of experience. Tasks map to O*NET work activities and include multi-modal file handling (docs, slides, images, audio, video, spreadsheets, CAD), with up to dozens of reference files per task. The gold subset provides public prompts and references; primary scoring still relies on expert pairwise judgments due to subjectivity and format requirements.

https://openai.com/index/gdpval/

What the Data Says: Model vs. Expert

On the gold subset, frontier models approach expert quality on a substantial fraction of tasks under blind expert review, with model progress trending roughly linearly across releases. Reported model-vs-human win/tie rates approach parity for top models, and error profiles cluster around instruction-following, formatting, data usage, and hallucinations. Increased reasoning effort and stronger scaffolding (e.g., format checks, artifact rendering for self-inspection) yield predictable gains.

Time–Cost Math: Where AI Pays Off

GDPval runs scenario analyses comparing human-only to model-assisted workflows with expert review. It quantifies (i) human completion time and wage-based cost, (ii) reviewer time/cost, (iii) model latency and API cost, and (iv) empirically observed win rates. Results indicate potential time/cost reductions for many task classes once review overhead is included.
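
A toy version of that arithmetic, with made-up numbers purely to show the structure of the comparison (GDPval's actual wage, latency, and win-rate figures live in the report):

def per_task_costs(win_rate, human_hours, wage, review_hours, model_api_cost):
    """Expected cost of a model-first workflow in which an expert reviews every draft and
    redoes the task from scratch whenever the draft is not acceptable (illustrative only)."""
    human_only = human_hours * wage
    model_assisted = model_api_cost + review_hours * wage + (1 - win_rate) * human_only
    return human_only, model_assisted

# Entirely hypothetical inputs: a 6-hour task at $60/hour, 1 hour of review, $2 of API usage.
baseline, assisted = per_task_costs(win_rate=0.45, human_hours=6, wage=60,
                                    review_hours=1, model_api_cost=2.0)
print(f"human-only ≈ ${baseline:.0f} per task, model-first ≈ ${assisted:.0f} per task")

The structure, not the numbers, is the point: review overhead and the probability of redoing the work are exactly the factors the scenario analyses fold in alongside model latency and API cost.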

Automated Judging: Useful Proxy, Not Oracle

For the gold subset, an automated pairwise grader shows ~66% agreement with human experts, within ~5 percentage points of human–human agreement (~71%). It’s positioned as an accessible proxy for rapid iteration, not a replacement for expert review.

https://openai.com/index/gdpval/

Why This Isn’t Yet Another Benchmark

  • Occupational breadth: Spans top GDP sectors and a wide slice of O*NET work activities, not just narrow domains.
  • Deliverable realism: Multi-file, multi-modal inputs/outputs stress structure, formatting, and data handling.
  • Moving ceiling: Uses human preference win rate against expert deliverables, enabling re-baselining as models improve.

Boundary Conditions: Where GDPval Doesn’t Reach

GDPval-v0 targets computer-mediated knowledge work. Physical labor, long-horizon interactivity, and organization-specific tooling are out of scope. Tasks are one-shot and precisely specified; ablations show performance drops with reduced context. Construction and grading are resource-intensive, motivating the automated grader—whose limits are documented—and future expansion.

Fit in the Stack: How GDPval Complements Other Evals

GDPval augments existing OpenAI evals with occupational, multi-modal, file-centric tasks and reports human preference outcomes, time/cost analyses, and ablations on reasoning effort and agent scaffolding. v0 is versioned and expected to broaden coverage and realism over time.

Summary

GDPval formalizes evaluation for economically relevant knowledge work by pairing expert-built tasks with blinded human preference judgments and an accessible automated grader. The framework quantifies model quality and practical time/cost trade-offs while exposing failure modes and the effects of scaffolding and reasoning effort. Scope remains v0—computer-mediated, one-shot tasks with expert review—yet it establishes a reproducible baseline for tracking real-world capability gains across occupations.


Meta FAIR Released Code World Model (CWM): A 32-Billion-Parameter Open-Weights LLM, to Advance Research on Code Generation with World Models


Meta FAIR released Code World Model (CWM), a 32-billion-parameter dense decoder-only LLM that injects world modeling into code generation by training on execution traces and long-horizon agent–environment interactions—not just static source text.

What’s new: learning code by predicting execution?

CWM mid-trains on two large families of observation–action trajectories: (1) Python interpreter traces that record local variable states after each executed line, and (2) agentic interactions inside Dockerized repositories that capture edits, shell commands, and test feedback. This grounding is intended to teach semantics (how state evolves) rather than only syntax.

To scale collection, the research team built executable repository images from thousands of GitHub projects and foraged multi-step trajectories via a software-engineering agent (“ForagerAgent”). The release reports ~3M trajectories across ~10k images and 3.15k repos, with mutate-fix and issue-fix variants.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Model and context window

CWM is a dense, decoder-only Transformer (no MoE) with 64 layers, GQA (48Q/8KV), SwiGLU, RMSNorm, and Scaled RoPE. Attention alternates local 8k and global 131k sliding-window blocks, enabling an effective context of 131k tokens; training uses document-causal masking.
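
Those numbers can be summarized as a small illustrative config; the field names are ours, not Meta's, and the interleave pattern assumes the 3:1 local:global layout described further below:

from dataclasses import dataclass

@dataclass
class CWMArchitectureSketch:
    """Illustrative summary of the reported architecture, not Meta's actual configuration object."""
    n_layers: int = 64
    n_query_heads: int = 48          # GQA: 48 query heads ...
    n_kv_heads: int = 8              # ... sharing 8 key/value heads
    activation: str = "SwiGLU"
    norm: str = "RMSNorm"
    positional_encoding: str = "Scaled RoPE"
    local_window_tokens: int = 8_192       # local sliding-window attention blocks
    global_window_tokens: int = 131_072    # interleaved global blocks -> 131k effective context

    def attention_layout(self):
        # Assumes the 3:1 local:global interleave repeated across depth (see notes below).
        return ["global" if (i % 4) == 3 else "local" for i in range(self.n_layers)]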

Training recipe (pre → mid → post)

  • General pretraining: 8T tokens (code-heavy) at 8k context.
  • Mid-training: +5T tokens, long-context (131k) with Python execution traces, ForagerAgent data, PR-derived diffs, IR/compilers, Triton kernels, and Lean math.
  • Post-training: 100B-token SFT for instruction + reasoning, then multi-task RL (~172B tokens) across verifiable coding, math, and multi-turn SWE environments using a GRPO-style algorithm and a minimal toolset (bash/edit/create/submit).
  • Quantized inference fits on a single 80 GB H100.

Benchmarks

The research team cites the following pass@1 / scores (test-time scaling noted where applicable):

  • SWE-bench Verified: 65.8% (with test-time scaling).
  • LiveCodeBench-v5: 68.6%; LCB-v6: 63.5%.
  • Math-500: 96.6%; AIME-24: 76.0%; AIME-25: 68.2%.
  • CruxEval-Output: 94.3%.

The research team positions CWM as competitive with similarly sized open-weights baselines and even with larger or closed models on SWE-bench Verified.

https://ai.meta.com/research/publications/cwm-an-open-weights-llm-for-research-on-code-generation-with-world-models/

Why world modeling matters for code?

The release emphasizes two operational capabilities:

  1. Execution-trace prediction: given a function and a trace start, CWM predicts stack frames (locals) and the executed line at each step via a structured format—usable as a “neural debugger” for grounded reasoning without live execution (a generic illustration of collecting such traces follows this list).
  2. Agentic coding: multi-turn reasoning with tool use against real repos, verified by hidden tests and patch similarity rewards; the setup trains the model to localize faults and generate end-to-end patches (git diff) rather than snippets.
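
To make the first capability concrete, the standard library can already record the kind of line-level locals trace described above; the snippet below only illustrates the general idea of such observation data, not CWM's actual trace format:

import sys

def trace_locals(fn, *args, **kwargs):
    """Record (line number, locals snapshot) for each line executed inside fn (illustrative)."""
    steps, code = [], fn.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None                      # ignore frames outside fn
        if event == "line":
            # The 'line' event fires before a line runs, so each snapshot reflects
            # the local state after the previously executed line.
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args, **kwargs)
    finally:
        sys.settrace(None)
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

value, trace = trace_locals(gcd, 48, 18)
for lineno, local_vars in trace:
    print(lineno, local_vars)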

Some details worth noting

  • Tokenizer: Llama-3 family with reserved control tokens; reserved IDs are used to demarcate trace and reasoning segments during SFT.
  • Attention layout: the 3:1 local:global interleave is repeated across the depth; long-context training occurs at large token batch sizes to stabilize gradients.
  • Compute scaling: learning-rate/batch size schedules are derived from internal scaling-law sweeps tailored for long-context overheads.

Summary

CWM is a pragmatic step toward grounded code generation: Meta ties a 32B dense transformer to execution-trace learning and agentic, test-verified patching, releases intermediate/post-trained checkpoints, and gates usage under the FAIR Non-Commercial Research License—making it a useful platform for reproducible ablations on long-context, execution-aware coding without conflating research with production deployment.



How to Build an End-to-End Data Science Workflow with Machine Learning, Interpretability, and Gemini AI Assistance?

 

In this tutorial, we walk through an advanced end-to-end data science workflow where we combine traditional machine learning with the power of Gemini. We begin by preparing and modeling the diabetes dataset, then we dive into evaluation, feature importance, and partial dependence. Along the way, we bring in Gemini as our AI data scientist to explain results, answer exploratory questions, and highlight risks. By doing this, we build a predictive model while also enhancing our insights and decision-making through natural language interaction.

!pip install -qU google-generativeai scikit-learn matplotlib pandas numpy
from getpass import getpass
import os, json, numpy as np, pandas as pd, matplotlib.pyplot as plt


if not os.environ.get("GOOGLE_API_KEY"):
   os.environ["GOOGLE_API_KEY"] = getpass("🔑 Enter your Gemini API key (hidden): ")


import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
LLM = genai.GenerativeModel("gemini-1.5-flash")


def ask_llm(prompt, sys=None):
   p = prompt if sys is None else f"System:\n{sys}\n\nUser:\n{prompt}"
   r = LLM.generate_content(p)
   return (getattr(r, "text", "") or "").strip()


from sklearn.datasets import load_diabetes
raw = load_diabetes(as_frame=True)
df  = raw.frame.rename(columns={"target":"disease_progression"})
print("Shape:", df.shape); display(df.head())


from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline


X = df.drop(columns=["disease_progression"]); y = df["disease_progression"]
num_cols = X.columns.tolist()
pre = ColumnTransformer(
   [("scale", StandardScaler(), num_cols),
    ("rank",  QuantileTransformer(n_quantiles=min(200, len(X)), output_distribution="normal"), num_cols)],
   remainder="drop", verbose_feature_names_out=False)
model = HistGradientBoostingRegressor(max_depth=3, learning_rate=0.07,
                                     l2_regularization=0.0, max_iter=500,
                                     early_stopping=True, validation_fraction=0.15)
pipe  = Pipeline([("prep", pre), ("hgbt", model)])


Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.20, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_mse = -cross_val_score(pipe, Xtr, ytr, scoring="neg_mean_squared_error", cv=cv).mean()
cv_rmse = float(cv_mse ** 0.5)
pipe.fit(Xtr, ytr)

We load the diabetes dataset, preprocess the features, and build a robust pipeline using scaling, quantile transformation, and gradient boosting. We split the data, perform cross-validation to estimate RMSE, and then fit the final model to see how well it generalizes.

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

pred_tr = pipe.predict(Xtr); pred_te = pipe.predict(Xte)
rmse_tr = mean_squared_error(ytr, pred_tr) ** 0.5
rmse_te = mean_squared_error(yte, pred_te) ** 0.5
mae_te  = mean_absolute_error(yte, pred_te)
r2_te   = r2_score(yte, pred_te)
print(f"CV RMSE={cv_rmse:.2f} | Train RMSE={rmse_tr:.2f} | Test RMSE={rmse_te:.2f} | Test MAE={mae_te:.2f} | R²={r2_te:.3f}")


plt.figure(figsize=(5,4))
plt.scatter(pred_te, yte - pred_te, s=12)
plt.axhline(0, lw=1); plt.xlabel("Predicted"); plt.ylabel("Residual"); plt.title("Residuals (Test)")
plt.show()


from sklearn.inspection import permutation_importance
imp = permutation_importance(pipe, Xte, yte, scoring="neg_mean_squared_error", n_repeats=10, random_state=0)
imp_df = pd.DataFrame({"feature": X.columns, "importance": imp.importances_mean}).sort_values("importance", ascending=False)
display(imp_df.head(10))


plt.figure(figsize=(6,4))
top10 = imp_df.head(10).iloc[::-1]
plt.barh(top10["feature"], top10["importance"])
plt.title("Permutation Importance (Top 10)"); plt.xlabel("Δ(MSE)"); plt.tight_layout(); plt.show()

We evaluate our model by computing train, test, and cross-validation metrics, and visualize residuals to check prediction errors. We then calculate permutation importance to identify which features drive the model most, and display the top contributors using a clear bar plot.

def compute_pdp(pipe, Xref: pd.DataFrame, feat: str, grid=40):
   xs = np.linspace(np.percentile(Xref[feat], 5), np.percentile(Xref[feat], 95), grid)
   Xtmp = Xref.copy()
   ys = []
   for v in xs:
       Xtmp[feat] = v
       ys.append(pipe.predict(Xtmp).mean())
   return xs, np.array(ys)


top_feats = imp_df["feature"].head(3).tolist()
plt.figure(figsize=(6,4))
for f in top_feats:
   xs, ys = compute_pdp(pipe, Xte.copy(), f, grid=40)
   plt.plot(xs, ys, label=f)
plt.legend(); plt.xlabel("Feature value"); plt.ylabel("Predicted target"); plt.title("Manual PDP (Top 3)")
plt.tight_layout(); plt.show()




report_obj = {
   "dataset": {"rows": int(df.shape[0]), "cols": int(df.shape[1]-1), "target": "disease_progression"},
   "metrics": {"cv_rmse": float(cv_rmse), "train_rmse": float(rmse_tr),
               "test_rmse": float(rmse_te), "test_mae": float(mae_te), "r2": float(r2_te)},
   "top_importances": imp_df.head(10).to_dict(orient="records")
}
print(json.dumps(report_obj, indent=2))


sys_msg = ("You are a senior data scientist. Return: (1) ≤120-word executive summary, "
          "(2) key risks/assumptions bullets, (3) 5 prioritized next experiments w/ rationale, "
          "(4) quick-win feature engineering ideas as Python pseudocode.")
summary = ask_llm(f"Dataset + metrics + importances:n{json.dumps(report_obj)}", sys=sys_msg)
print("n📊 Gemini Executive Briefn" + "-"*80 + f"n{summary}n")

We compute the manual partial dependence for the top three features and visualize how changing each one affects the predictions. We then assemble a compact JSON report of dataset statistics, metrics, and importances, and ask Gemini to generate an executive brief that includes risks, next experiments, and quick-win feature engineering ideas.

SAFE_GLOBALS = {"pd": pd, "np": np}
def run_generated_pandas(code: str, df_local: pd.DataFrame):
   banned = ["__", "import", "open(", "exec(", "eval(", "os.", "sys.", "pd.read", "to_csv", "to_pickle", "to_sql"]
   if any(b in code for b in banned): raise ValueError("Unsafe code rejected.")
   loc = {"df": df_local.copy()}
   exec(code, SAFE_GLOBALS, loc)
   return {k:v for k,v in loc.items() if k not in ("df",)}


def eda_qa(question: str):
   prompt = f"""You are a Python+Pandas analyst. DataFrame `df` columns:
{list(df.columns)}. Write a SHORT pandas snippet (no comments/prints) that computes the answer to:
"{question}". Use only pd/np/df; assign the final result to a variable named `answer`."""
   code = ask_llm(prompt, sys="Return only code. No prose.")
   try:
       out = run_generated_pandas(code, df)
       return code, out.get("answer", None)
   except Exception as e:
       return code, f"[Execution error: {e}]"


questions = [
   "What is the Pearson correlation between BMI and disease_progression?",
   "Show mean target by tertiles of BMI (low/med/high).",
   "Which single feature correlates most with the target (absolute value)?"
]
for q in questions:
   code, ans = eda_qa(q)
   print("nQ:", q, "nCode:n", code, "nAnswer:n", ans)

We build a safe sandbox to execute pandas code that Gemini generates for exploratory data analysis. We then ask natural language questions about correlations and feature relationships, let Gemini write the pandas snippets, and automatically run them to get direct answers from the dataset.

critique = ask_llm(
   f"""Metrics: {report_obj['metrics']}
Top importances: {report_obj['top_importances']}
Identify risks around leakage, overfitting, calibration, OOD robustness, and fairness (even proxy-only).
Propose quick checks (concise Python sketches)."""
)
print("n🧪 Gemini Risk & Robustness Reviewn" + "-"*80 + f"n{critique}n")


def what_if(pipe, Xref: pd.DataFrame, feat: str, delta: float = 0.05):
   x0 = Xref.median(numeric_only=True).to_dict()
   x1, x2 = x0.copy(), x0.copy()
   if feat not in x1: return np.nan
   x2[feat] = x1[feat] + delta
   X1 = pd.DataFrame([x1], columns=X.columns)
   X2 = pd.DataFrame([x2], columns=X.columns)
   return float(pipe.predict(X2)[0] - pipe.predict(X1)[0])


for f in top_feats:
   print(f"Estimated Δtarget if {f} increases by +0.05 ≈ {what_if(pipe, Xte, f, 0.05):.2f}")


print("n✅ Done: Train → Explain → Query with Gemini → Review risks → What-if analysis. "
     "Swap the dataset or tweak model params to extend this notebook.")

We ask Gemini to review our model for risks like leakage, overfitting, and fairness, and get quick Python checks as suggestions. We then run simple “what-if” analyses to see how small changes in top features affect predictions, helping us interpret the model’s behavior more clearly.

In conclusion, we see how seamlessly we can blend machine learning pipelines with Gemini’s reasoning to make data science more interactive and insightful. We train, evaluate, and interpret a model, then ask Gemini to summarize findings, suggest improvements, and critique risks. Through this journey, we establish a workflow that enables us to achieve both predictive performance and interpretability, while also benefiting from having an AI collaborator in our data analysis process.

