# Calibration Analysis
Learn about calibration metrics with statistical confidence intervals. Related topics:
- **[Fairness Metrics](fairness-metrics.md)** - Statistical rigor for fairness analysis
- **[Configuration Guide](../getting-started/configuration.md)** - How to enable calibration analysis
- **[SR 11-7 §III.B.2](../compliance/sr-11-7-mapping.md)** - Validation testing requirements
Model calibration measures whether predicted probabilities match observed outcomes. A well-calibrated model predicting 70% confidence should be correct 70% of the time.
## Why Calibration Matters
Poor calibration misleads decision-makers even when classification accuracy is high.
Example: Loan Approval
- Model predicts 90% approval probability
- Well-calibrated: 90% of these applicants would repay
- Poorly calibrated: Only 70% would repay
- Impact: The bank absorbs 20 percentage points more defaults than expected
Regulatory requirement: SR 11-7 Section III.B.2 requires validation testing, including assessment of probability accuracy.
## Calibration with Confidence Intervals (E10+)
GlassAlpha provides statistical rigor for calibration analysis:
- Expected Calibration Error (ECE) with 95% confidence intervals
- Brier Score with 95% confidence intervals
- Bin-wise calibration curves with error bars
- Deterministic bootstrap for reproducibility
## Metrics

### Expected Calibration Error (ECE)
Definition: The weighted average absolute difference between mean predicted probability and observed frequency across probability bins.
Formula:

$$
\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{n} \,\bigl|\, \mathrm{accuracy}_b - \mathrm{confidence}_b \,\bigr|
$$

Where:

- $n_b$ = number of samples in bin $b$
- $n$ = total samples
- $B$ = number of bins
- $\mathrm{accuracy}_b$ = observed accuracy in bin $b$
- $\mathrm{confidence}_b$ = mean predicted probability in bin $b$
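Worked example (hypothetical numbers): with 100 samples falling into two occupied bins, where bin 1 holds 60 samples with $\mathrm{confidence}_1 = 0.30$ and $\mathrm{accuracy}_1 = 0.25$, and bin 2 holds 40 samples with $\mathrm{confidence}_2 = 0.80$ and $\mathrm{accuracy}_2 = 0.90$, the result is $\mathrm{ECE} = 0.6 \cdot 0.05 + 0.4 \cdot 0.10 = 0.07$.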
Interpretation:
| ECE | Interpretation | Status |
|---|---|---|
| < 0.05 | Well calibrated | ✅ Excellent |
| 0.05-0.10 | Acceptable | ⚠️ Fair |
| > 0.10 | Poorly calibrated | 🔴 Poor |
Example:
- ECE = 0.03: Predictions are off by 3% on average (good)
- ECE = 0.15: Predictions are off by 15% on average (poor)
### Brier Score
Definition: Mean squared difference between predicted probability and actual outcome.
Formula:

$$
\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} \left( p_i - y_i \right)^2
$$

Where:

- $p_i$ = predicted probability for sample $i$
- $y_i$ = actual outcome (0 or 1)
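Worked example (hypothetical numbers): for three samples with predicted probabilities 0.9, 0.2, 0.7 and outcomes 1, 0, 1, the score is $\frac{(0.9 - 1)^2 + (0.2 - 0)^2 + (0.7 - 1)^2}{3} = \frac{0.14}{3} \approx 0.047$.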
Interpretation:
| Brier Score | Interpretation |
|---|---|
| < 0.10 | Excellent |
| 0.10-0.20 | Good |
| 0.20-0.30 | Fair |
| > 0.30 | Poor |
Properties:
- Lower is better (0 = perfect)
- Combines calibration and discrimination
- Sensitive to both over- and under-confidence
## Confidence Intervals

### Why CIs Matter
Point estimates can be misleading with small samples:
Example:
- ECE = 0.08 (seems acceptable)
- But with 95% CI = [0.02, 0.18]
- Interpretation: Could be excellent (0.02) or poor (0.18)
- Action: Collect more data for precise estimate
### Bootstrap Method

Process:

1. Resample data with replacement (1,000 times by default)
2. Compute ECE/Brier for each resample
3. CI bounds = 2.5th and 97.5th percentiles (95% CI)
Deterministic: Seeded random sampling ensures byte-identical results.
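To make the process concrete, here is a minimal sketch of a seeded percentile bootstrap (not GlassAlpha's internal implementation); `ece_fn` is a hypothetical callable that computes ECE for one set of labels and probabilities:

```python
# Minimal sketch of a seeded percentile bootstrap for a calibration metric.
# `ece_fn` is a hypothetical callable (labels, probabilities) -> float.
import numpy as np

def bootstrap_ci(y_true, y_prob, ece_fn, n_bootstrap=1000, confidence_level=0.95, seed=42):
    rng = np.random.default_rng(seed)       # explicit seed -> deterministic resampling
    n = len(y_true)
    estimates = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)     # resample with replacement
        estimates[i] = ece_fn(y_true[idx], y_prob[idx])
    alpha = (1.0 - confidence_level) / 2.0
    lower, upper = np.quantile(estimates, [alpha, 1.0 - alpha])
    return lower, upper                      # e.g. 2.5th and 97.5th percentiles for a 95% CI
```

Because the generator is explicitly seeded, rerunning the function with the same inputs and seed reproduces the same interval.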
## Configuration

```yaml
metrics:
  calibration:
    enabled: true

    # Binning strategy
    n_bins: 10                          # Fixed bins (default)
    bin_strategy: fixed                 # fixed or adaptive

    # Confidence intervals
    compute_confidence_intervals: true  # Default: true
    n_bootstrap: 1000                   # Bootstrap samples (default: 1000)
    confidence_level: 0.95              # CI level (default: 0.95)

    # Bin-wise CIs
    compute_bin_wise_ci: true           # Error bars for calibration curve
```
## Calibration Curves

### What They Show

A calibration curve plots predicted probability vs observed frequency:

- X-axis: Predicted probability bin (e.g., 0.0-0.1, 0.1-0.2, ..., 0.9-1.0)
- Y-axis: Observed frequency (fraction of positives in bin)
- Diagonal: Perfect calibration line

Well-calibrated model: Points cluster near the diagonal.

Poorly calibrated model: Points far from the diagonal.
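For intuition, the points on such a curve can be computed with scikit-learn's `calibration_curve`; the labels and scores below are synthetic stand-ins, and GlassAlpha renders its own plot in the audit PDF:

```python
# Sketch: compute calibration-curve points (observed frequency vs mean predicted probability).
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)                                    # synthetic labels
y_prob = np.clip(0.7 * y_true + rng.normal(0.2, 0.2, size=500), 0, 1)    # synthetic scores

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
for pred, obs in zip(prob_pred, prob_true):
    print(f"mean predicted {pred:.2f} -> observed frequency {obs:.2f}")
```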
### Bin-Wise Confidence Intervals

Error bars show uncertainty in observed frequency for each bin:

```text
Bin         | Mean Pred | Observed | 95% CI        | Samples
------------|-----------|----------|---------------|--------
[0.0, 0.1)  | 0.05      | 0.08     | [0.02, 0.14]  | 25
[0.1, 0.2)  | 0.15      | 0.12     | [0.06, 0.18]  | 30
[0.2, 0.3)  | 0.25      | 0.24     | [0.18, 0.30]  | 42
...
[0.9, 1.0)  | 0.95      | 0.91     | [0.82, 1.00]  | 18
```

Wide CI: Small sample in bin, uncertain estimate.

Narrow CI: Large sample, precise estimate.

Skipped bins: Bins with <10 samples are skipped (insufficient for bootstrap).
### PDF Visualization
Calibration curves in the audit PDF include:
- Scatter plot of predicted vs observed
- Perfect calibration diagonal (reference line)
- Error bars (bin-wise 95% CIs)
- ECE annotation
- Bin sample sizes
## PDF Output

Calibration analysis appears as a dedicated section in the audit PDF:

### Calibration Metrics Table

```text
Calibration Analysis with Confidence Intervals

Metric                    | Value  | 95% CI         | Interpretation
--------------------------|--------|----------------|----------------
Expected Calibration      | 0.042  | [0.028, 0.058] | Well Calibrated
Error (ECE)               |        |                |
Brier Score               | 0.156  | [0.142, 0.171] | Good

Sample size: 200 | Bins: 10 | Bootstrap samples: 1,000
```
### Bin-Wise Calibration Error

```text
Bin         | Mean Pred | Observed | 95% CI        | |Pred - Obs| | Samples
------------|-----------|----------|---------------|-------------|--------
[0.0, 0.1)  | 0.05      | 0.08     | [0.02, 0.14]  | 0.03        | 25
[0.1, 0.2)  | 0.15      | 0.12     | [0.06, 0.18]  | 0.03        | 30
[0.2, 0.3)  | 0.25      | 0.24     | [0.18, 0.30]  | 0.01        | 42
[0.3, 0.4)  | 0.35      | 0.40     | [0.31, 0.49]  | 0.05        | 28
...
```
### Calibration Curve Plot
Visual display with:
- Scatter points (predicted vs observed by bin)
- Diagonal reference line (y = x)
- Error bars (95% CIs)
- ECE value annotated
## JSON Export

All calibration results in audit manifest:

```json
{
  "calibration_ci": {
    "ece": 0.042,
    "ece_ci": {
      "ci_lower": 0.028,
      "ci_upper": 0.058,
      "confidence_level": 0.95,
      "n_bootstrap": 1000
    },
    "brier_score": 0.156,
    "brier_ci": {
      "ci_lower": 0.142,
      "ci_upper": 0.171,
      "confidence_level": 0.95,
      "n_bootstrap": 1000
    },
    "bin_calibration": [
      {
        "bin_range": [0.0, 0.1],
        "mean_predicted": 0.05,
        "observed_frequency": 0.08,
        "ci_lower": 0.02,
        "ci_upper": 0.14,
        "n_samples": 25
      },
      ...
    ],
    "n_bins": 10,
    "n_samples": 200
  }
}
```
## Binning Strategies

### Fixed Bins (Default)

Divide the probability range [0, 1] into N equal-width bins:

```text
Bins: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]
```

Pros: Simple, consistent across models.

Cons: Empty bins are possible if predictions concentrate in a narrow range.
### Adaptive Bins

Adjust bin edges based on the prediction distribution:

Method: Bins contain an equal number of samples (quantiles), as in the sketch below.

Pros: All bins have samples; robust to skewed predictions.

Cons: Bin edges differ across models (harder to compare).
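A minimal sketch of the difference between the two strategies, using synthetic skewed predictions (nothing here is GlassAlpha-specific):

```python
# Sketch: fixed (equal-width) vs adaptive (quantile / equal-count) bin edges.
import numpy as np

y_prob = np.random.default_rng(0).beta(2, 5, size=1000)   # synthetic, skewed toward low values

fixed_edges = np.linspace(0.0, 1.0, 11)                           # 10 equal-width bins
adaptive_edges = np.quantile(y_prob, np.linspace(0.0, 1.0, 11))   # 10 equal-count bins

print("fixed:   ", np.round(fixed_edges, 2))
print("adaptive:", np.round(adaptive_edges, 2))
```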
## Interpreting Results

### Well-Calibrated Model

```text
ECE:   0.035 [0.022, 0.048]
Brier: 0.142 [0.128, 0.156]

Calibration curve: Points cluster near diagonal
All bin CIs include diagonal
```
Interpretation: Predictions are reliable probability estimates. Can be used directly for decision-making.
Action: Proceed with deployment.
### Poorly Calibrated Model

```text
ECE:   0.18 [0.14, 0.22]
Brier: 0.34 [0.30, 0.38]

Calibration curve: Points systematically below diagonal (underconfident)
Or: Points above diagonal (overconfident)
```
Interpretation: Predictions are unreliable. Model may be accurate (good AUC) but probabilities are miscalibrated.
Actions:
- Apply calibration (Platt scaling, isotonic regression)
- Retrain with better probability estimates
- Do NOT use raw probabilities for decision-making
### Overconfident Model
Pattern: Observed frequency < predicted probability
Example: Model predicts 85% approval, but only 65% actually approved.
Risk: False confidence → poor decisions
Causes:
- Overfitting
- Class imbalance
- Uncalibrated algorithms (e.g., tree ensembles)
### Underconfident Model
Pattern: Observed frequency > predicted probability
Example: Model predicts 65% approval, but 82% actually approved.
Risk: Missed opportunities (reject good applicants)
Causes:
- Overregularization
- Conservative probability estimates
## Improving Calibration

### 1. Platt Scaling (Logistic Calibration)
Fit logistic regression on model outputs:
```python
from sklearn.calibration import CalibratedClassifierCV

calibrated_model = CalibratedClassifierCV(
    base_model,
    method='sigmoid',  # Platt scaling
    cv='prefit'
)
calibrated_model.fit(X_val, y_val)
```
Best for: Overconfident models (e.g., XGBoost, Random Forest)
### 2. Isotonic Regression
Non-parametric calibration (piecewise constant):
```python
calibrated_model = CalibratedClassifierCV(
    base_model,
    method='isotonic',  # Isotonic regression
    cv='prefit'
)
calibrated_model.fit(X_val, y_val)
```
Best for: Non-monotonic miscalibration
### 3. Temperature Scaling

Scale logits before softmax:
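A minimal sketch for a binary classifier, assuming `val_logits`/`y_val` (validation logits and labels) and `test_logits` already exist; this illustrates the general technique and is not a GlassAlpha API:

```python
# Sketch: fit a single temperature T on validation logits, then rescale at inference time.
import numpy as np
from scipy.optimize import minimize_scalar

def nll(temperature, logits, y):
    """Negative log-likelihood of sigmoid(logits / temperature)."""
    probs = 1.0 / (1.0 + np.exp(-logits / temperature))
    probs = np.clip(probs, 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(probs) + (1.0 - y) * np.log(1.0 - probs))

result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded",
                         args=(val_logits, y_val))   # val_logits, y_val: assumed validation data
temperature = result.x

calibrated_probs = 1.0 / (1.0 + np.exp(-test_logits / temperature))  # test_logits: assumed
```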
Best for: Deep learning models
### After Calibration

Re-run the GlassAlpha audit to verify improvement, comparing ECE/Brier before and after calibration.
## Deterministic Execution

All calibration metrics are fully deterministic with explicit seeds.
What's deterministic:
- Bootstrap sample selection
- Bin assignment
- CI computation
Guarantee: Same config + same seed = byte-identical results
## Complete Example

### Configuration

```yaml
# audit_config.yaml
# Direct configuration

reproducibility:
  random_seed: 42

data:
  dataset: german_credit
  target_column: credit_risk

model:
  type: xgboost
  params:
    n_estimators: 100
    random_state: 42

metrics:
  calibration:
    enabled: true
    n_bins: 10
    bin_strategy: fixed
    compute_confidence_intervals: true
    n_bootstrap: 1000
    confidence_level: 0.95
    compute_bin_wise_ci: true
```
### Run Audit

Run the GlassAlpha audit using the configuration above.
### Expected Output

```text
Running calibration analysis...
✓ Computed probabilities for 200 samples
✓ ECE: 0.042 [0.028, 0.058] (Well Calibrated)
✓ Brier: 0.156 [0.142, 0.171] (Good)
✓ Bin-wise CIs: 10 bins (8 with n≥10)
✓ Bootstrap: 1,000 samples

Calibration Summary:
  Status: PASS (ECE < 0.05)
  Confidence: High (narrow CIs)
```
## Troubleshooting

### "Calibration section missing from PDF"

Cause: Calibration metrics not enabled or model doesn't output probabilities.

Fixes:

- Enable `metrics.calibration.enabled: true`
- Check model supports `predict_proba()` (not just `predict()`)
"Wide confidence intervals"¶
Cause: Small sample size or high variance.
Fixes:
- Collect more data (preferred)
- Increase bootstrap samples:
n_bootstrap: 5000 - Use fewer bins:
n_bins: 5(more samples per bin)
"Many bins skipped (n<10)"¶
Cause: Predictions concentrated in narrow probability range.
Fixes:
- Use adaptive bins:
bin_strategy: adaptive - Reduce number of bins:
n_bins: 5 - Check if model is overly confident (all predictions near 0 or 1)
"ECE and Brier disagree"¶
Example: ECE low (0.04) but Brier high (0.28)
Explanation:
- ECE: Measures calibration only
- Brier: Measures calibration + discrimination (accuracy)
Interpretation: Model is well-calibrated but has poor discrimination (low AUC). Probabilities are reliable but model can't separate classes well.
"Calibration looks good but fairness fails"¶
Possible issue: Calibration may differ across protected groups.
Solution: Check group-specific calibration:
Currently: Export predictions and compute group-specific ECE manually.
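A minimal sketch of that manual workaround, assuming an exported file with `y_true`, `y_prob`, and `group` columns (the file name and column names are placeholders):

```python
# Sketch: group-specific ECE computed from exported predictions (column names are placeholders).
import numpy as np
import pandas as pd

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Fixed-bin ECE: weighted mean |observed frequency - mean predicted probability|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

preds = pd.read_csv("predictions.csv")   # placeholder export path
for group, frame in preds.groupby("group"):
    ece = expected_calibration_error(frame["y_true"].to_numpy(), frame["y_prob"].to_numpy())
    print(f"{group}: ECE = {ece:.3f}")
```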
## Best Practices

### 1. Always Check Calibration for Probability-Based Decisions
If you use predicted probabilities (not just binary predictions), calibration is critical:
Use cases requiring calibration:
- Risk scoring (credit, insurance, healthcare)
- Resource allocation (based on probability thresholds)
- Cost-sensitive decisions (expected value calculations)
### 2. Calibration ≠ Accuracy
Example:
- Model A: 90% accuracy, ECE = 0.15 (poor calibration)
- Model B: 85% accuracy, ECE = 0.03 (good calibration)
Which to use?
- For binary decisions: Model A (higher accuracy)
- For probability-based decisions: Model B (reliable probabilities)
### 3. Validate on Holdout Set

Calibration can overfit to the validation set during tuning.

Best practice: Reserve a separate holdout set for the final calibration assessment, for example with a split like the one sketched below.
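A minimal sketch of such a split, using a synthetic dataset as a stand-in for your own data:

```python
# Sketch: train / validation / holdout split so calibration is fitted on validation data
# and assessed only once on the untouched holdout set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)   # stand-in for your dataset

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_holdout, y_val, y_holdout = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Fit the model on (X_train, y_train), fit calibration on (X_val, y_val),
# and run the final calibration assessment only on (X_holdout, y_holdout).
```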
### 4. Document Calibration Method

If applying calibration, record in audit manifest:

```yaml
model:
  type: xgboost
  calibration:
    method: platt_scaling  # or isotonic, temperature
    fitted_on: validation_set
    ece_before: 0.18
    ece_after: 0.04
```
### 5. Monitor Calibration Drift
Calibration can degrade over time as data distribution shifts:
Recommendation: Re-run calibration analysis quarterly or after major data changes.
## Related Features
- Fairness Metrics: Group-level performance with CIs
- Robustness Testing: Stability under perturbations
- Shift Testing: Robustness to demographic changes
- SR 11-7 Mapping: Section III.B.2 validation testing
## Implementation Details

Modules:

- `glassalpha.metrics.calibration.quality`: ECE and Brier computation
- `glassalpha.metrics.calibration.confidence`: Bootstrap CI computation
- `glassalpha.metrics.calibration.binning`: Binning strategies

API:

- `assess_calibration_quality()`: Main entry point
- `compute_calibration_with_ci()`: Calibration with CIs
- `compute_bin_wise_ci()`: Per-bin error bars
Test Coverage: 25+ contract tests + German Credit integration tests validating determinism, accuracy, and edge cases.
## Summary
Calibration analysis with statistical rigor:
- ✅ ECE with 95% CIs: Point estimate + uncertainty quantification
- ✅ Brier Score with CIs: Combined calibration + discrimination metric
- ✅ Bin-wise error bars: Visualize uncertainty in calibration curve
- ✅ Deterministic bootstrap: Reproducible with random seeds
- ✅ Flexible binning: Fixed or adaptive strategies
Critical for: Risk scoring, probability-based decisions, regulatory compliance (SR 11-7 Section III.B.2).
Remember: A model can be accurate but poorly calibrated. Always check both.