In this tutorial, we show how we treat prompts as first-class, versioned artifacts and apply rigorous regression testing to large language model behavior using MLflow. We design an evaluation pipeline that logs prompt versions, prompt diffs, model outputs, and multiple quality metrics in a fully reproducible manner. By combining classical text metrics with semantic similarity and automated regression flags, we demonstrate how we can systematically detect performance drift caused by seemingly small prompt changes. Throughout the tutorial, we focus on building a workflow that mirrors real software engineering practices, applied to prompt engineering and LLM evaluation.
!pip -q install -U "openai>=1.0.0" mlflow rouge-score nltk sentence-transformers scikit-learn pandas
import os, json, time, difflib, re
from typing import List, Dict, Any, Tuple
import mlflow
import pandas as pd
import numpy as np
from openai import OpenAI
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
if not os.getenv("OPENAI_API_KEY"):
    try:
        from google.colab import userdata  # type: ignore
        k = userdata.get("OPENAI_API_KEY")
        if k:
            os.environ["OPENAI_API_KEY"] = k
    except Exception:
        pass
if not os.getenv("OPENAI_API_KEY"):
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ").strip()
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is required."
We set up the execution environment by installing all required dependencies and importing the core libraries used throughout the tutorial. We securely load the OpenAI API key at runtime, ensuring credentials are never hard-coded in the notebook. We also initialize essential NLP resources to ensure the evaluation pipeline runs reliably across different environments.
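As a quick optional check, we can confirm that the API key, the NLTK tokenizer data, and the embedding model are all available before running the longer evaluation loop. The snippet below is a minimal sanity-check sketch that assumes the setup cells above have already run; it is not part of the core pipeline.

# Optional sanity check (assumes the setup cells above have run successfully).
import os
import nltk
from sentence_transformers import SentenceTransformer

assert os.getenv("OPENAI_API_KEY"), "API key not loaded"
print(nltk.word_tokenize("MLflow tracks runs."))   # expects ['MLflow', 'tracks', 'runs', '.']
probe = SentenceTransformer("all-MiniLM-L6-v2")    # same embedder used later in the tutorial
print(probe.encode(["hello world"]).shape)         # expects (1, 384) for MiniLM-L6-v2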
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.2
MAX_OUTPUT_TOKENS = 250
ABS_SEM_SIM_MIN = 0.78
DELTA_SEM_SIM_MAX_DROP = 0.05
DELTA_ROUGE_L_MAX_DROP = 0.08
DELTA_BLEU_MAX_DROP = 0.10
mlflow.set_tracking_uri("file:/content/mlruns")
mlflow.set_experiment("prompt_versioning_llm_regression")
client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
EVAL_SET = [
    {
        "id": "q1",
        "input": "Summarize in one sentence: MLflow tracks experiments, runs, parameters, metrics, and artifacts.",
        "reference": "MLflow helps track machine learning experiments by logging runs with parameters, metrics, and artifacts."
    },
    {
        "id": "q2",
        "input": "Rewrite professionally: 'this model is kinda slow but it works ok.'",
        "reference": "The model is somewhat slow, but it performs reliably."
    },
    {
        "id": "q3",
        "input": "Extract key fields as JSON: 'Order 5531 by Alice costs $42.50 and ships to Toronto.'",
        "reference": '{"order_id":"5531","customer":"Alice","amount_usd":42.50,"city":"Toronto"}'
    },
    {
        "id": "q4",
        "input": "Answer briefly: What is prompt regression testing?",
        "reference": "Prompt regression testing checks whether prompt changes degrade model outputs compared to a baseline."
    },
]
PROMPTS = [
    {
        "version": "v1_baseline",
        "prompt": (
            "You are a precise assistant.\n"
            "Follow the user request carefully.\n"
            "If asked for JSON, output valid JSON only.\n"
            "User: {user_input}"
        )
    },
    {
        "version": "v2_formatting",
        "prompt": (
            "You are a helpful, structured assistant.\n"
            "Respond clearly and concisely.\n"
            "Prefer clean formatting.\n"
            "User request: {user_input}"
        )
    },
    {
        "version": "v3_guardrailed",
        "prompt": (
            "You are a rigorous assistant.\n"
            "Rules:\n"
            "1) If user asks for JSON, output ONLY valid minified JSON.\n"
            "2) Otherwise, keep the answer short and factual.\n"
            "User: {user_input}"
        )
    },
]
We define all experimental configurations, including model parameters, regression thresholds, and MLflow tracking settings. We construct the evaluation dataset and explicitly declare multiple prompt versions to compare and test for regressions. By centralizing these definitions, we ensure that prompt changes and evaluation logic remain controlled and reproducible.
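If we later want to run the same suite in CI, the thresholds do not have to stay hard-coded; the sketch below is one optional way to read them from environment variables instead. The variable names and the helper are our own illustration, not part of MLflow or the tutorial code.

# Optional: override the regression thresholds from the environment (hypothetical variable names).
import os

def _env_float(name: str, default: float) -> float:
    """Return float(os.environ[name]) if the variable is set, else the default."""
    raw = os.getenv(name)
    return float(raw) if raw else default

ABS_SEM_SIM_MIN = _env_float("ABS_SEM_SIM_MIN", 0.78)
DELTA_SEM_SIM_MAX_DROP = _env_float("DELTA_SEM_SIM_MAX_DROP", 0.05)
DELTA_ROUGE_L_MAX_DROP = _env_float("DELTA_ROUGE_L_MAX_DROP", 0.08)
DELTA_BLEU_MAX_DROP = _env_float("DELTA_BLEU_MAX_DROP", 0.10)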
def call_llm(formatted_prompt: str) -> str:
    resp = client.responses.create(
        model=MODEL,
        input=formatted_prompt,
        temperature=TEMPERATURE,
        max_output_tokens=MAX_OUTPUT_TOKENS,
    )
    out = getattr(resp, "output_text", None)
    if out:
        return out.strip()
    try:
        texts = []
        for item in resp.output:
            if getattr(item, "type", "") == "message":
                for c in item.content:
                    if getattr(c, "type", "") in ("output_text", "text"):
                        texts.append(getattr(c, "text", ""))
        return "\n".join(texts).strip()
    except Exception:
        return ""
smooth = SmoothingFunction().method3
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
def safe_tokenize(s: str) -> List[str]:
    s = (s or "").strip().lower()
    if not s:
        return []
    try:
        return nltk.word_tokenize(s)
    except LookupError:
        return re.findall(r"\b\w+\b", s)

def bleu_score(ref: str, hyp: str) -> float:
    r = safe_tokenize(ref)
    h = safe_tokenize(hyp)
    if len(h) == 0 or len(r) == 0:
        return 0.0
    return float(sentence_bleu([r], h, smoothing_function=smooth))

def rougeL_f1(ref: str, hyp: str) -> float:
    scores = rouge.score(ref or "", hyp or "")
    return float(scores["rougeL"].fmeasure)

def semantic_sim(ref: str, hyp: str) -> float:
    embs = embedder.encode([ref or "", hyp or ""], normalize_embeddings=True)
    return float(cosine_similarity([embs[0]], [embs[1]])[0][0])
We implement the core LLM invocation and evaluation metrics used to assess prompt quality. We compute BLEU, ROUGE-L, and semantic similarity scores to capture both surface-level and semantic differences in model outputs. This allows us to evaluate prompt changes from multiple complementary perspectives rather than relying on a single metric.
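Before wiring the metrics into the full pipeline, we can try them on a single reference/hypothesis pair to get a feel for how they differ. The example below is purely illustrative and assumes the metric helpers above have been defined.

# Quick illustration of the three metrics on a paraphrase (illustrative only).
ref = "MLflow helps track machine learning experiments."
hyp = "MLflow is used to track ML experiments and their metrics."
print("BLEU:        ", round(bleu_score(ref, hyp), 3))
print("ROUGE-L F1:  ", round(rougeL_f1(ref, hyp), 3))
print("Semantic sim:", round(semantic_sim(ref, hyp), 3))
# The n-gram metrics penalize wording changes more heavily, while the
# embedding-based similarity tends to stay high for faithful paraphrases.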
def evaluate_prompt(prompt_template: str) -> Tuple[pd.DataFrame, Dict[str, float], str]:
    rows = []
    for ex in EVAL_SET:
        p = prompt_template.format(user_input=ex["input"])
        y = call_llm(p)
        ref = ex["reference"]
        rows.append({
            "id": ex["id"],
            "input": ex["input"],
            "reference": ref,
            "output": y,
            "bleu": bleu_score(ref, y),
            "rougeL_f1": rougeL_f1(ref, y),
            "semantic_sim": semantic_sim(ref, y),
        })
    df = pd.DataFrame(rows)
    agg = {
        "bleu_mean": float(df["bleu"].mean()),
        "rougeL_f1_mean": float(df["rougeL_f1"].mean()),
        "semantic_sim_mean": float(df["semantic_sim"].mean()),
    }
    outputs_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
    return df, agg, outputs_jsonl
def log_text_artifact(text: str, artifact_path: str):
    mlflow.log_text(text, artifact_path)

def prompt_diff(old: str, new: str) -> str:
    a = old.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile="previous_prompt", tofile="current_prompt"))

def compute_regression_flags(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, Any]:
    d_sem = baseline["semantic_sim_mean"] - current["semantic_sim_mean"]
    d_rouge = baseline["rougeL_f1_mean"] - current["rougeL_f1_mean"]
    d_bleu = baseline["bleu_mean"] - current["bleu_mean"]
    flags = {
        "abs_semantic_fail": current["semantic_sim_mean"] < ABS_SEM_SIM_MIN,
        "drop_semantic_fail": d_sem > DELTA_SEM_SIM_MAX_DROP,
        "drop_rouge_fail": d_rouge > DELTA_ROUGE_L_MAX_DROP,
        "drop_bleu_fail": d_bleu > DELTA_BLEU_MAX_DROP,
        "delta_semantic": float(d_sem),
        "delta_rougeL": float(d_rouge),
        "delta_bleu": float(d_bleu),
    }
    flags["regression"] = any([flags["abs_semantic_fail"], flags["drop_semantic_fail"], flags["drop_rouge_fail"], flags["drop_bleu_fail"]])
    return flags
We build the evaluation and regression logic that runs each prompt against the evaluation set and aggregates results. We log prompt artifacts, prompt diffs, and evaluation outputs to MLflow, ensuring every experiment remains auditable. We also compute regression flags that automatically identify whether a prompt version degrades performance relative to the baseline.
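To see how the flags behave in isolation, we can feed compute_regression_flags two hand-made metric dictionaries. With the synthetic numbers below, only the semantic-similarity drop (0.07) exceeds its 0.05 threshold, so the overall regression flag trips; the values are made up for illustration, not real results.

# Synthetic example (not real results): a 0.07 semantic-similarity drop exceeds the 0.05 threshold.
baseline_example = {"bleu_mean": 0.30, "rougeL_f1_mean": 0.45, "semantic_sim_mean": 0.85}
current_example  = {"bleu_mean": 0.28, "rougeL_f1_mean": 0.43, "semantic_sim_mean": 0.78}
example_flags = compute_regression_flags(baseline_example, current_example)
print(example_flags["drop_semantic_fail"], example_flags["regression"])  # True True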
print("Running prompt versioning + regression testing with MLflow...")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {mlflow.get_experiment_by_name('prompt_versioning_llm_regression').name}")
run_summary = []
baseline_metrics = None
baseline_prompt = None
baseline_df = None
baseline_metrics_name = None
with mlflow.start_run(run_name=f"prompt_regression_suite_{int(time.time())}") as parent_run:
mlflow.set_tag("task", "prompt_versioning_regression_testing")
mlflow.log_param("model", MODEL)
mlflow.log_param("temperature", TEMPERATURE)
mlflow.log_param("max_output_tokens", MAX_OUTPUT_TOKENS)
mlflow.log_param("eval_set_size", len(EVAL_SET))
for pv in PROMPTS:
ver = pv["version"]
prompt_t = pv["prompt"]
with mlflow.start_run(run_name=ver, nested=True) as child_run:
mlflow.log_param("prompt_version", ver)
log_text_artifact(prompt_t, f"prompts/{ver}.txt")
if baseline_prompt is not None and baseline_metrics_name is not None:
diff = prompt_diff(baseline_prompt, prompt_t)
log_text_artifact(diff, f"prompt_diffs/{baseline_metrics_name}_to_{ver}.diff")
else:
log_text_artifact("BASELINE_PROMPT (no diff)", f"prompt_diffs/{ver}.diff")
df, agg, outputs_jsonl = evaluate_prompt(prompt_t)
mlflow.log_dict(agg, f"metrics/{ver}_agg.json")
log_text_artifact(outputs_jsonl, f"outputs/{ver}_outputs.jsonl")
mlflow.log_metric("bleu_mean", agg["bleu_mean"])
mlflow.log_metric("rougeL_f1_mean", agg["rougeL_f1_mean"])
mlflow.log_metric("semantic_sim_mean", agg["semantic_sim_mean"])
if baseline_metrics is None:
baseline_metrics = agg
baseline_prompt = prompt_t
baseline_df = df
baseline_metrics_name = ver
flags = {"regression": False, "delta_bleu": 0.0, "delta_rougeL": 0.0, "delta_semantic": 0.0}
mlflow.set_tag("regression", "false")
else:
flags = compute_regression_flags(baseline_metrics, agg)
mlflow.log_metric("delta_bleu", flags["delta_bleu"])
mlflow.log_metric("delta_rougeL", flags["delta_rougeL"])
mlflow.log_metric("delta_semantic", flags["delta_semantic"])
mlflow.set_tag("regression", str(flags["regression"]).lower())
for k in ["abs_semantic_fail","drop_semantic_fail","drop_rouge_fail","drop_bleu_fail"]:
mlflow.set_tag(k, str(flags[k]).lower())
run_summary.append({
"prompt_version": ver,
"bleu_mean": agg["bleu_mean"],
"rougeL_f1_mean": agg["rougeL_f1_mean"],
"semantic_sim_mean": agg["semantic_sim_mean"],
"delta_bleu_vs_baseline": float(flags.get("delta_bleu", 0.0)),
"delta_rougeL_vs_baseline": float(flags.get("delta_rougeL", 0.0)),
"delta_semantic_vs_baseline": float(flags.get("delta_semantic", 0.0)),
"regression_flag": bool(flags["regression"]),
"mlflow_run_id": child_run.info.run_id,
})
summary_df = pd.DataFrame(run_summary).sort_values("prompt_version")
print("n=== Aggregated Results (higher is better) ===")
display(summary_df)
regressed = summary_df[summary_df["regression_flag"] == True]
if len(regressed) > 0:
print("n
Regressions detected:")
display(regressed[["prompt_version","delta_bleu_vs_baseline","delta_rougeL_vs_baseline","delta_semantic_vs_baseline","mlflow_run_id"]])
else:
print("n
No regressions detected under current thresholds.")
if len(regressed) > 0 and baseline_df is not None:
worst_ver = regressed.sort_values("delta_semantic_vs_baseline", ascending=False).iloc[0]["prompt_version"]
worst_prompt = next(p["prompt"] for p in PROMPTS if p["version"] == worst_ver)
worst_df, _, _ = evaluate_prompt(worst_prompt)
merged = baseline_df[["id","output","bleu","rougeL_f1","semantic_sim"]].merge(
worst_df[["id","output","bleu","rougeL_f1","semantic_sim"]],
on="id",
suffixes=("_baseline", f"_{worst_ver}")
)
merged["delta_semantic"] = merged["semantic_sim_baseline"] - merged[f"semantic_sim_{worst_ver}"]
merged["delta_rougeL"] = merged["rougeL_f1_baseline"] - merged[f"rougeL_f1_{worst_ver}"]
merged["delta_bleu"] = merged["bleu_baseline"] - merged[f"bleu_{worst_ver}"]
print(f"n=== Per-example deltas: baseline vs {worst_ver} (positive delta = worse) ===")
display(
merged[["id","delta_semantic","delta_rougeL","delta_bleu","output_baseline",f"output_{worst_ver}"]]
.sort_values("delta_semantic", ascending=False)
)
print("nOpen MLflow UI (optional) by running:")
print("!mlflow ui --backend-store-uri file:/content/mlruns --host 0.0.0.0 --port 5000")
We orchestrate the full prompt regression testing workflow using nested MLflow runs. We compare each prompt version against the baseline, log metric deltas, and record regression outcomes in a structured summary table. This completes a repeatable, engineering-grade pipeline for prompt versioning and regression testing that we can extend to larger datasets and real-world applications.
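Once the runs are logged, we can also query them back programmatically instead of opening the UI. The snippet below is a small optional sketch that uses MLflow's standard search_runs API to list any nested runs we tagged as regressions.

# Optional: query the tracking store for runs tagged as regressions.
import mlflow

regressed_runs = mlflow.search_runs(
    experiment_names=["prompt_versioning_llm_regression"],
    filter_string="tags.regression = 'true'",
)
cols = [c for c in ["run_id", "params.prompt_version", "metrics.semantic_sim_mean"] if c in regressed_runs.columns]
print(regressed_runs[cols] if len(regressed_runs) else "No regressed runs found under the current thresholds.")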
In conclusion, we established a practical, research-oriented framework for prompt versioning and regression testing that enables us to evaluate LLM behavior with discipline and transparency. We showed how MLflow enables us to track prompt evolution, compare outputs across versions, and automatically flag regressions based on well-defined thresholds. This approach helps us move away from ad hoc prompt tuning and toward measurable, repeatable experimentation. By adopting this workflow, we ensure that prompt updates improve model behavior intentionally rather than introducing hidden performance regressions.