
Master AI-900 exam metrics with real-world examples — confusion matrix, accuracy, precision, recall, F1, ROC/AUC, and GenAI metrics explained simply.

AI-900 Metrics Explained — Confusion Matrix, Accuracy, Precision, Recall, F1, and ROC/AUC

Quick takeaways:
  • The confusion matrix is the foundation — it shows where your model gets things right and wrong.
  • Accuracy alone can be misleading with imbalanced data — use precision, recall, or F1 instead.
  • ROC/AUC measures overall model discrimination ability across all thresholds.
  • GenAI models need different metrics like BLEU, ROUGE, perplexity, and hallucination rate.

Preparing for the Microsoft AI-900 (Azure AI Fundamentals) exam? One of the most important skills is understanding how to evaluate machine learning models. This post explains the key metrics you’ll see in the exam—the confusion matrix, accuracy, precision, recall, F1 score, and ROC/AUC—in plain English, with practical examples.

Why Do These Metrics Matter?

When you build or use an AI model, you need to know how well it’s performing. These metrics help you measure performance, spot problems, and make better decisions. They’re also core topics in the AI-900 exam.


The Confusion Matrix

A confusion matrix is a table that shows how well your model is classifying things. It compares what the model predicted to what actually happened.

Example: Suppose you’re building a model to detect spam emails. Here’s a simple confusion matrix:

|                  | Predicted: Spam | Predicted: Not Spam |
|------------------|-----------------|---------------------|
| Actual: Spam     | 80              | 20                  |
| Actual: Not Spam | 10              | 90                  |

Another Example: (Medical Diagnosis) Imagine a model that predicts whether a patient has a disease:

  • True Positives (TP): Sick patients correctly diagnosed as sick
  • False Positives (FP): Healthy patients incorrectly diagnosed as sick
  • False Negatives (FN): Sick patients missed by the model
  • True Negatives (TN): Healthy patients correctly identified as healthy

This helps doctors see not just how many patients were diagnosed, but also where mistakes happen (missed cases or false alarms).
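These four counts can be tallied directly from paired lists of actual and predicted labels. A minimal sketch in Python, with illustrative labels constructed to reproduce the spam table above:

```python
# Illustrative labels, arranged to match the spam confusion matrix above
# (TP=80, FN=20, FP=10, TN=90).
actual    = ["spam"] * 100 + ["not spam"] * 100
predicted = ["spam"] * 80 + ["not spam"] * 20 + ["spam"] * 10 + ["not spam"] * 90

# Tally each cell by comparing actual vs. predicted label pairs.
tp = sum(a == "spam"     and p == "spam"     for a, p in zip(actual, predicted))
fn = sum(a == "spam"     and p == "not spam" for a, p in zip(actual, predicted))
fp = sum(a == "not spam" and p == "spam"     for a, p in zip(actual, predicted))
tn = sum(a == "not spam" and p == "not spam" for a, p in zip(actual, predicted))

print(tp, fn, fp, tn)  # 80 20 10 90
```

In practice a library routine (for example scikit-learn's `confusion_matrix`) does the same bookkeeping for you; the loop above just makes the definitions concrete.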


Accuracy

Accuracy measures how often the model is right overall.

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

Example: From the matrix above, Accuracy = (80 + 90) / (80 + 20 + 10 + 90) = 170 / 200 = 85%.

When to use: Good when classes are balanced. Not always helpful if one class is much more common than the other.

Another Example: (Fraud Detection) Suppose a credit card fraud model correctly classifies 950 out of 1,000 transactions (fraud and non-fraud combined); its accuracy is 95%. But if only 10 out of the 1,000 are actually fraud, a model that always says “not fraud” would still be 99% accurate—so accuracy alone can be misleading when classes are imbalanced.

Use accuracy when both classes (e.g., fraud and not fraud) are about equally common.


Precision

Precision tells you, out of all the times the model said “Spam,” how many were actually spam.

$$ Precision = \frac{TP}{TP + FP} $$

Example: From the matrix above, Precision = 80 / (80 + 10) = 80 / 90 ≈ 89%.

When to use: Important when the cost of a false positive is high (e.g., marking important emails as spam).

Another Example: (Fraud Alerts) If a bank’s fraud system flags 100 transactions as fraud, but only 60 are truly fraudulent, precision is 60%. High precision means fewer false alarms—so fewer customers are bothered by unnecessary account freezes.

Use precision when you want to avoid false alarms that annoy users or cause extra work.


Recall

Recall (also called Sensitivity) tells you, out of all the actual spam emails, how many the model caught.

$$ Recall = \frac{TP}{TP + FN} $$

Example: From the matrix above, Recall = 80 / (80 + 20) = 80 / 100 = 80%.

When to use: Important when missing a positive case is costly (e.g., missing spam that could be dangerous).

Another Example: (Medical Screening) If a cancer screening test finds 90 out of 100 real cases, recall is 90%. High recall means fewer missed cases, which is critical in healthcare.

Use recall when missing a positive case (like a disease or a security threat) is much worse than a false alarm.


F1 Score

F1 Score balances precision and recall. It’s the harmonic mean of the two.

$$ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $$

Example: From the matrix above, F1 = 2 × (0.89 × 0.80) / (0.89 + 0.80) ≈ 0.84, or 84%.

When to use: Useful when you need a balance between precision and recall.

Another Example: (Customer Churn Prediction) Suppose a telecom company wants to predict which customers will leave. F1 helps balance catching as many true churners as possible (recall) while not annoying loyal customers with false alarms (precision).

Use F1 when you want a single score that balances both types of errors, especially if your data is imbalanced.
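As a quick sanity check, all four formulas can be computed from the spam-filter counts above in a few lines of Python:

```python
# Counts from the spam-filter confusion matrix above.
tp, fn, fp, tn = 80, 20, 10, 90

accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 0.85
precision = tp / (tp + fp)                                 # ≈ 0.889
recall    = tp / (tp + fn)                                 # 0.80
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.842

print(f"{accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}")
```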


ROC Curve and AUC

ROC (Receiver Operating Characteristic) Curve shows how well the model separates classes at different thresholds. The AUC (Area Under the Curve) measures the overall ability to distinguish between classes.

  • AUC = 1.0: Perfect model
  • AUC = 0.5: No better than random guessing

Example: If your spam filter has an AUC of 0.95, it’s very good at telling spam from not spam.
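AUC also has a handy interpretation: it is the probability that a randomly chosen positive example receives a higher model score than a randomly chosen negative one (ties count as half). A minimal sketch using made-up scores:

```python
# Made-up model scores for illustration.
pos_scores = [0.9, 0.8, 0.7, 0.6]   # scores for actual spam
neg_scores = [0.5, 0.4, 0.3, 0.65]  # scores for actual not-spam

# Compare every positive/negative pair; a win counts 1, a tie 0.5.
pairs = [(p, n) for p in pos_scores for n in neg_scores]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(auc)  # 0.9375
```

Library functions (such as scikit-learn's `roc_auc_score`) compute the same quantity more efficiently via ranking, but this pairwise version shows what the number means.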


Summary Table

| Metric    | What It Measures                    | Example Value | When to Use                  |
|-----------|-------------------------------------|---------------|------------------------------|
| Accuracy  | Overall correctness                 | 85%           | Classes are balanced         |
| Precision | Correctness of positive predictions | 89%           | False positives are costly   |
| Recall    | Coverage of actual positives        | 80%           | Missing positives is costly  |
| F1 Score  | Balance of precision and recall     | 84%           | Need a trade-off             |
| ROC/AUC   | Ability to separate classes         | 0.95          | Overall model discrimination |

Evaluation Metrics for GenAI and LLMs

When working with Generative AI (GenAI) and Large Language Models (LLMs), traditional metrics like accuracy and F1 are not always enough. Here are some common evaluation metrics for GenAI/LLM use cases, with simple explanations and practical examples:

BLEU (Bilingual Evaluation Understudy)

  • What it measures: How closely a generated text (like a translation) matches one or more reference texts.
  • Example:
    • Machine translation: If Google Translate outputs “The cat sits on the mat” and the reference is “The cat is sitting on the mat,” BLEU measures the overlap in words and phrases.
  • When to use: Useful for translation, summarization, and text generation tasks.
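The core of BLEU can be sketched as clipped unigram precision: what fraction of the candidate's words also appear in the reference, with repeated words counted at most as often as they occur in the reference. (Real BLEU additionally combines higher-order n-grams across a corpus and applies a brevity penalty.) Using the translation example above:

```python
from collections import Counter

candidate = "the cat sits on the mat".split()
reference = "the cat is sitting on the mat".split()

# Clip each candidate word's count at its count in the reference,
# so repeating a word can't inflate the score.
cand_counts = Counter(candidate)
ref_counts  = Counter(reference)
clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())

bleu1 = clipped / len(candidate)
print(round(bleu1, 3))  # 0.833 — "sits" has no match in the reference
```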

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

  • What it measures: Overlap of n-grams (word sequences) between generated and reference texts. Focuses on recall.
  • Example:
    • Text summarization: If an LLM-generated summary covers most of the key points from a human-written summary, it will have a high ROUGE score.
  • When to use: Common for evaluating summaries and paraphrasing.
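A minimal ROUGE-1 recall sketch, using made-up summary sentences: it asks what fraction of the reference's unigrams the generated text recovers.

```python
from collections import Counter

# Made-up summaries for illustration.
generated = "the model detects spam emails well".split()
reference = "the model detects spam emails very well".split()

# Count reference words recovered by the generated text (counts clipped).
gen_counts = Counter(generated)
ref_counts = Counter(reference)
overlap = sum(min(c, gen_counts[w]) for w, c in ref_counts.items())

rouge1_recall = overlap / len(reference)
print(round(rouge1_recall, 3))  # 0.857 — only "very" is missed
```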

Perplexity

  • What it measures: How well a language model predicts a sample of text. Lower perplexity means the model assigns higher probability to the actual text, i.e., it is less “surprised” by what it sees.
  • Example:
    • If a chatbot has low perplexity on customer support queries, it means it’s good at predicting what comes next in a conversation.
  • When to use: Used to evaluate language models during training.
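Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each token. A sketch with made-up per-token probabilities:

```python
import math

# Made-up probabilities the model assigned to each actual next token.
token_probs = [0.5, 0.25, 0.5, 0.125]

# Average negative log-probability, then exponentiate.
avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log)
print(round(perplexity, 3))  # 3.364
```

Intuitively, a perplexity of about 3.4 means the model was, on average, as uncertain as if it were choosing uniformly among roughly 3–4 tokens at each step.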

Human Evaluation

  • What it measures: Real people rate the quality, helpfulness, or correctness of generated outputs.
  • Example:
    • Asking users to rate chatbot responses as “helpful” or “not helpful.”
  • When to use: Essential for tasks where quality is subjective or hard to measure automatically (e.g., creative writing, open-ended answers).

Hallucination Rate

  • What it measures: How often the model generates information that is factually incorrect or made up (“hallucinations”).
  • Example:
    • If an LLM answers 100 questions and 8 contain made-up facts, the hallucination rate is 8%.
  • When to use: Important for applications where factual accuracy matters, like medical or legal advice.

Key Takeaways

  • Know what each metric means and when to use it.
  • Practice with real examples—like spam detection—to make the ideas stick.
  • For AI-900, focus on understanding the intuition, not just memorizing formulas.

Good luck with your exam prep!