Audit Report Contents¶
Complete breakdown of what's included in every GlassAlpha audit report.
Visual walkthrough: anatomy of an audit report¶
This section shows you exactly what auditors, regulators, and validators look for in each section of a GlassAlpha audit report.
graph TB
Report[GlassAlpha<br/>Audit Report]
Report --> Header[Report Header<br/>ID, Timestamp, Profile]
Report --> Perf[Performance<br/>Metrics & Errors]
Report --> Fair[Fairness<br/>Bias Detection]
Report --> Explain[Explainability<br/>Feature Importance]
Report --> Cal[Calibration<br/>Probability Accuracy]
Report --> Manifest[Reproducibility<br/>Hashes & Seeds]
Header --> H[Model ID<br/>Timestamp<br/>Compliance Profile]
Perf --> P[Confusion Matrix<br/>Accuracy/AUC<br/>Precision/Recall]
Fair --> F[Demographic Parity<br/>Equal Opportunity<br/>Group Breakdown]
Explain --> E[Feature Importance<br/>SHAP Values<br/>Sample Cases]
Cal --> C[Calibration Curve<br/>ECE < 0.05<br/>Confidence Intervals]
Manifest --> M[Config Hash<br/>Data Hash<br/>Random Seeds<br/>Package Versions]
style Report fill:#e1f5ff
style Header fill:#e1f5ff
style Perf fill:#d4edda
style Fair fill:#fff3cd
style Explain fill:#e7d4f5
style Cal fill:#ffd7e5
style Manifest fill:#d1ecf1
What auditors look for: Section-by-section checklist¶
Report Header (5 seconds)¶
Critical checks:
- ✅ Model ID matches registry entry
- ✅ Timestamp is recent (not stale analysis)
- ✅ Compliance profile matches regulatory requirement (e.g., SR 11-7 for credit)
- ✅ Author/submitter identified
Red flags:
- ❌ Missing model identifier
- ❌ Timestamp >6 months old
- ❌ Wrong compliance profile for use case
- ❌ No contact information
Typical auditor question: "Is this the same model deployed in production?"
Performance Section (30 seconds)¶
Critical checks:
- ✅ Accuracy >80% (banking/insurance standard)
- ✅ AUC-ROC >0.80 (discrimination ability)
- ✅ Confusion matrix shows acceptable error rates
- ✅ Test set size n≥1000 (statistical validity)
Red flags:
- ❌ Accuracy <70% (model not production-ready)
- ❌ AUC-ROC <0.70 (barely better than random)
- ❌ High false negative rate (missed good customers)
- ❌ Test set <100 samples (not statistically valid)
Typical auditor questions:
- "What's the cost of false positives vs false negatives?"
- "How does this compare to the baseline?"
- "Was this evaluated on held-out data?"
What they're really checking: Is the model accurate enough to justify its complexity and potential bias risks?
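To make these checks concrete, here is a minimal sketch (synthetic data, not GlassAlpha output) of computing the headline performance numbers with scikit-learn on a held-out test set:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Synthetic held-out labels and predicted probabilities (illustrative only).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=2000)
y_prob = np.clip(0.35 * y_true + 0.65 * rng.random(2000), 0, 1)
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"n = {len(y_true)} (want n >= 1000 for statistical validity)")
print(f"accuracy = {accuracy:.3f} (red flag if < 0.70)")
print(f"AUC-ROC  = {auc:.3f} (red flag if < 0.70)")
print(f"false negative rate = {fn / (fn + tp):.3f}")
```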
Fairness Section (2 minutes - most scrutinized)¶
Critical checks:
- ✅ Demographic parity <10% (ECOA standard for credit)
- ✅ Equal opportunity <10% (qualified applicants treated equally)
- ✅ Group sample sizes n≥30 (statistical validity)
- ✅ Statistical significance documented (p-values)
- ✅ Protected attributes identified (race, gender, age)
Red flags:
- ❌ Disparity >10% without business justification
- ❌ Sample size <30 for any group (unreliable)
- ❌ No intersectional analysis (single-attribute only)
- ❌ Missing statistical tests (could be sampling noise)
- ❌ Protected attributes used as model inputs (illegal for credit)
Typical auditor questions:
- "Why is there a 12% disparity in equal opportunity?"
- "What's the business justification for this difference?"
- "Have you tested intersectional bias (e.g., Black women)?"
- "What mitigation steps were attempted?"
What they're really checking: Does this model systematically disadvantage protected groups? If yes, can you justify it?
Regulatory thresholds:
- ECOA/FCRA (Credit): <10% disparity in approval rates
- EEOC (Employment): 80% rule (20% relative difference)
- GDPR Article 22: No automated decision if discriminatory
- EU AI Act: High-risk systems must demonstrate fairness testing
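As a reference point, here is a minimal sketch (illustrative, not the GlassAlpha implementation) of the two headline group-fairness metrics and the EEOC four-fifths check:

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """Max difference in positive-prediction (approval) rate between groups."""
    rates = {g: y_pred[group == g].mean() for g in np.unique(group)}
    return max(rates.values()) - min(rates.values()), rates

def equal_opportunity_diff(y_true, y_pred, group):
    """Max difference in true positive rate among qualified (y_true == 1) applicants."""
    tprs = {}
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)
        tprs[g] = y_pred[mask].mean()
    return max(tprs.values()) - min(tprs.values()), tprs

# Synthetic predictions with a small built-in disparity (illustrative only).
rng = np.random.default_rng(0)
group = rng.choice(["A", "B"], size=2000)
y_true = rng.integers(0, 2, size=2000)
y_pred = (rng.random(2000) < np.where(group == "A", 0.55, 0.48)).astype(int)

dp_diff, rates = demographic_parity_diff(y_pred, group)
eo_diff, tprs = equal_opportunity_diff(y_true, y_pred, group)
ratio = min(rates.values()) / max(rates.values())  # EEOC 80% (four-fifths) rule

print(f"demographic parity difference: {dp_diff:.3f} (ECOA flag if > 0.10)")
print(f"equal opportunity difference:  {eo_diff:.3f} (flag if > 0.10)")
print(f"selection-rate ratio: {ratio:.2f} (EEOC flag if < 0.80)")
print("group sizes:", {g: int((group == g).sum()) for g in np.unique(group)})
```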
Explainability Section (1 minute)¶
Critical checks:
- ✅ Top features make business sense (not spurious correlations)
- ✅ No protected attributes in top 10 (race, gender, etc.)
- ✅ Explainer method documented (TreeSHAP, Coefficients, etc.)
- ✅ Sample explanations provided (3-5 representative cases)
- ✅ Feature contributions sum to prediction (additivity property)
Red flags:
- ❌ Top feature is zip code (proxy for race)
- ❌ Protected attributes appear in feature importance
- ❌ Explainer method not documented
- ❌ Explanations don't match business logic
- ❌ SHAP values don't satisfy additivity
Typical auditor questions:
- "Why is 'first name' the #2 most important feature?" (gender proxy)
- "Can you explain this specific denial to the applicant?"
- "How do you know these explanations are accurate?"
What they're really checking: Are the model's reasons legitimate business factors, or is it learning to discriminate via proxies?
Regulatory requirements:
- SR 11-7: Model must be explainable to validators
- ECOA: Adverse action notices must cite specific reasons
- GDPR Article 22: Right to explanation for automated decisions
- EU AI Act: High-risk systems must provide explanations
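Here is a minimal sketch of the additivity and proxy checks above, shown with a linear model where exact per-feature contributions are the coefficients times the centered feature values; the feature names and proxy list are hypothetical, and for tree models the same check applies to TreeSHAP values:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature set; zip_code stands in for a known proxy.
rng = np.random.default_rng(7)
feature_names = ["income", "debt_ratio", "zip_code", "tenure_months"]
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
model = LogisticRegression().fit(X, y)

x = X[0]                                    # one applicant
baseline = X.mean(axis=0)                   # background reference point
contribs = model.coef_[0] * (x - baseline)  # per-feature contribution (log-odds)

# Additivity: contributions must sum to the prediction shift from the baseline.
pred_shift = model.decision_function([x])[0] - model.decision_function([baseline])[0]
assert np.isclose(contribs.sum(), pred_shift, atol=1e-8)

# Scan the top-ranked features for protected attributes or known proxies.
PROTECTED_OR_PROXY = {"race", "gender", "age", "zip_code"}  # illustrative list
ranked = sorted(zip(feature_names, np.abs(contribs)), key=lambda t: -t[1])
for name, weight in ranked:
    flag = "  <-- review: protected attribute or proxy" if name in PROTECTED_OR_PROXY else ""
    print(f"{name:15s} {weight:.3f}{flag}")
```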
Calibration Section (30 seconds)¶
Critical checks:
- ✅ Expected Calibration Error (ECE) <0.05 (well-calibrated)
- ✅ Calibration curve close to diagonal (predicted = actual)
- ✅ Brier score reported (lower is better)
- ✅ Confidence intervals provided (statistical uncertainty)
Red flags:
- ❌ ECE >0.10 (poorly calibrated)
- ❌ Calibration curve far from diagonal (over/under-confident)
- ❌ No confidence intervals (can't assess reliability)
- ❌ Calibration not tested per group (could be calibrated overall but not within groups)
Typical auditor questions:
- "If the model says 80% probability, is it right 80% of the time?"
- "Is calibration consistent across demographic groups?"
What they're really checking: Can we trust the model's confidence scores for high-stakes decisions?
Why calibration matters:
- Insurance pricing: Premiums based on predicted probabilities
- Credit scoring: Interest rates tied to default probability
- Healthcare: Treatment decisions based on risk scores
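A minimal sketch (illustrative, not the GlassAlpha implementation) of the two calibration numbers auditors check: Expected Calibration Error over equal-width probability bins and the Brier score.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between mean predicted probability and observed frequency per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])  # bin index 0 .. n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Synthetic scores that are well calibrated by construction (illustrative only).
rng = np.random.default_rng(1)
y_prob = rng.random(5000)
y_true = (rng.random(5000) < y_prob).astype(int)

print(f"ECE = {expected_calibration_error(y_true, y_prob):.4f} (pass if < 0.05, flag if > 0.10)")
print(f"Brier score = {brier_score_loss(y_true, y_prob):.4f} (lower is better)")
```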
Reproducibility Manifest (15 seconds)¶
Critical checks:
- ✅ Config hash provided (can reproduce exact run)
- ✅ Data hash provided (tamper detection)
- ✅ Random seeds documented (all sources of randomness)
- ✅ Package versions listed (environment reproducibility)
- ✅ Git commit SHA provided (exact code version)
Red flags:
- ❌ No config hash (can't verify exact settings)
- ❌ No random seeds (non-reproducible)
- ❌ Missing package versions (environment drift risk)
- ❌ No git commit (can't inspect source code)
- ❌ Timestamp mismatch (manifest generated at different time)
Typical auditor questions:
- "Can I reproduce this audit byte-for-byte?"
- "What happens if I run this again?"
- "How do I know this data wasn't tampered with?"
What they're really checking: Is this audit trustworthy? Can it be independently validated?
Regulatory requirements:
- SR 11-7: Model validation must be reproducible
- FDA (Medical Devices): Clinical validation must be reproducible
- EU AI Act: High-risk systems must maintain audit trails
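A minimal sketch (assumed workflow, not the GlassAlpha CLI) of how a validator can independently recompute the fingerprints listed in a manifest; the file paths and seed value are placeholders:

```python
import hashlib
import json
import platform
from importlib.metadata import version

def sha256_file(path: str) -> str:
    """SHA256 fingerprint of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

manifest_check = {
    "config_hash": sha256_file("audit_config.yaml"),  # placeholder path; compare to manifest value
    "data_hash": sha256_file("german_credit.csv"),    # placeholder path; tamper detection
    "random_seed": 42,                                 # must match the seed recorded in the manifest
    "python_version": platform.python_version(),
    "package_versions": {pkg: version(pkg) for pkg in ("numpy", "scikit-learn")},
}
print(json.dumps(manifest_check, indent=2))
```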
Common auditor questions by role¶
Compliance officer (regulatory risk)¶
- "Does this pass our fairness thresholds?" ā Check Fairness Section
- "Can we defend this to CFPB/EEOC/FDA?" ā Check Fairness + Explainability
- "Is there documentation for legal?" ā Check Manifest + Config Hash
- "What's our exposure if we deploy this?" ā Check Red Flags across all sections
Model validator (technical verification)¶
- "Is the accuracy acceptable?" ā Check Performance Section
- "Are the explanations correct?" ā Check Explainability + SHAP additivity
- "Can I reproduce this?" ā Check Manifest + Seeds + Hashes
- "What are the model limitations?" ā Check Sample Sizes + Confidence Intervals
Risk manager (business impact)¶
- "What error rate can we tolerate?" ā Check Confusion Matrix + Cost Analysis
- "Which demographic groups are affected?" ā Check Fairness Group Breakdown
- "What's the worst-case scenario?" ā Check Maximum Disparity + Regulatory Thresholds
- "Do we need to retrain or mitigate?" ā Check FAIL flags + Justifications
External auditor/regulator (independent verification)¶
- "Show me the evidence pack." ā Manifest + Config + Data Hashes
- "Can you reproduce this in front of me?" ā Seeds + Git Commit + Package Versions
- "Explain this 12% fairness violation." ā Fairness Section + Business Justification
- "Why should I trust these explanations?" ā Explainer Documentation + Validation
How to use this guide¶
Before submitting an audit:
- Go through each section with the auditor checklist
- Mark ✅ for items that pass, ❌ for items that fail
- Prepare justifications for any ❌ red flags
- Ensure Manifest section is complete (reproducibility is critical)
When reviewing someone else's audit:
- Start with Fairness Section (highest regulatory risk)
- Check Manifest (can you reproduce it?)
- Validate Performance (is accuracy acceptable?)
- Review Explainability (do reasons make sense?)
- Verify Calibration (can you trust probabilities?)
For regulatory submission:
- Generate evidence pack (PDF + Manifest + Config + Hashes)
- Prepare 1-page summary for compliance officer
- Document all red flags with justifications
- Include independent validator sign-off
- Archive for retention period (typically 7 years)
1. Model performance metrics¶
Every audit includes comprehensive performance evaluation:
- Classification metrics: Accuracy, precision, recall, F1 score, AUC-ROC
- Confusion matrices: Visual breakdown of true/false positives and negatives
- Performance curves: ROC curves and precision-recall curves
- Cross-validation results: Statistical validation of model stability
These metrics provide the foundation for understanding model behavior and are required by most regulatory frameworks.
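A minimal sketch (illustrative, not the GlassAlpha implementation) of the cross-validation stability check; a model whose per-fold AUC varies widely is a warning sign even if the mean looks acceptable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification task (illustrative only).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"AUC per fold: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f}  std = {scores.std():.3f} (high std suggests instability)")
```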
2. Model explanations¶
Understanding why models make specific predictions:
Feature importance¶
- For linear models: Coefficient-based explanations (zero dependencies)
- For tree models: SHAP (SHapley Additive exPlanations) values
- Visual rankings: Clear ordering of most impactful features
Individual predictions¶
- Per-prediction breakdown: Feature contributions to specific decisions
- Visual explanations: Force plots showing positive/negative influences
- Deterministic ranking: Consistent ordering across runs
See explainer selection guide →
3. Fairness analysis¶
Comprehensive bias detection across demographic groups:
Group fairness¶
- Demographic parity: Equal positive prediction rates across groups
- Equal opportunity: Equal true positive rates across groups
- Statistical confidence: Confidence intervals for all fairness metrics
Intersectional fairness¶
- Multi-attribute analysis: Combined effects of multiple protected attributes
- Subgroup detection: Identification of particularly affected intersections
Individual fairness¶
- Consistency testing: Similar individuals receive similar predictions
- Matched pairs analysis: Direct comparison of similar cases
- Disparate treatment detection: Identification of inconsistent decisions
See fairness metrics reference →
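A minimal sketch (deliberately biased toy model, not the GlassAlpha implementation) of the matched-pairs consistency check above: flip only the protected attribute, hold everything else fixed, and flag predictions that change materially:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model that uses the protected attribute directly so the effect is visible;
# in practice the same test also catches proxy-driven inconsistency.
rng = np.random.default_rng(3)
n = 1000
gender = rng.integers(0, 2, size=n)          # protected attribute (0/1), illustrative
income = rng.normal(50, 10, size=n)
X = np.column_stack([income, gender])
y = (income + 3 * gender + rng.normal(scale=5, size=n) > 52).astype(int)
model = LogisticRegression().fit(X, y)

X_flipped = X.copy()
X_flipped[:, 1] = 1 - X_flipped[:, 1]        # counterfactual twin of each row
delta = model.predict_proba(X)[:, 1] - model.predict_proba(X_flipped)[:, 1]

print(f"mean |probability change| when flipping the protected attribute: {np.abs(delta).mean():.3f}")
print(f"pairs with change > 0.05: {(np.abs(delta) > 0.05).sum()} of {n}")
```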
4. Calibration analysis¶
Model confidence accuracy evaluation:
- Calibration curves: Visual representation of prediction reliability
- Expected Calibration Error (ECE): Quantitative calibration quality
- Brier score: Comprehensive probability accuracy measure
- Confidence intervals: Statistical bounds on calibration metrics
Calibration is critical for high-stakes decisions where probability estimates matter.
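A minimal sketch (illustrative synthetic data) of the per-group calibration check: a model can appear calibrated overall while being systematically over- or under-confident within individual groups:

```python
import numpy as np

def ece(y_true, y_prob, n_bins=10):
    """Expected Calibration Error over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])
    err = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            err += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return err

# Group A is under-confident and group B over-confident by a similar amount,
# so the errors roughly cancel in the overall number (illustrative only).
rng = np.random.default_rng(2)
n = 6000
group = rng.choice(["A", "B"], size=n)
y_prob = rng.random(n)
true_p = np.where(group == "A", np.clip(y_prob - 0.12, 0, 1), np.clip(y_prob + 0.12, 0, 1))
y_true = (rng.random(n) < true_p).astype(int)

print(f"overall ECE: {ece(y_true, y_prob):.3f}")
for g in ("A", "B"):
    m = group == g
    print(f"group {g} ECE: {ece(y_true[m], y_prob[m]):.3f}")
```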
5. Robustness testing¶
Adversarial perturbation analysis:
- Epsilon sweeps: Model behavior under small input changes
- Feature perturbations: Individual feature stability testing
- Robustness score: Quantitative measure of model stability
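A minimal sketch (illustrative, not the GlassAlpha implementation) of an epsilon sweep: add bounded random noise to the inputs and measure how often the predicted label flips:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic task and model (illustrative only).
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
baseline = model.predict(X)

rng = np.random.default_rng(0)
for eps in (0.01, 0.05, 0.1, 0.2):
    perturbed = X + rng.uniform(-eps, eps, size=X.shape)   # bounded input perturbation
    flip_rate = (model.predict(perturbed) != baseline).mean()
    print(f"epsilon={eps:<5} prediction flip rate={flip_rate:.3f}")
```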
6. Reason codes (ECOA compliance)¶
For credit decisions, adverse action notice generation:
- Top-N negative contributions: Features that hurt the applicant's score
- ECOA-compliant formatting: Regulatory-ready adverse action notices
- Protected attribute exclusion: Automatic removal of prohibited factors
- Deterministic ranking: Consistent reason codes across runs
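A minimal sketch (assumed logic, with made-up feature names and contribution values) of reason-code generation: keep the largest negative contributions, drop protected attributes, and rank deterministically:

```python
import numpy as np

PROHIBITED = {"race", "gender", "age", "marital_status"}   # illustrative exclusion list
TOP_N = 4

# Hypothetical per-feature contributions to a denied applicant's score.
feature_names = ["debt_ratio", "recent_delinquency", "age", "credit_utilization", "tenure_months"]
contributions = np.array([-0.42, -0.31, -0.18, -0.27, 0.15])

negative = [
    (name, value)
    for name, value in zip(feature_names, contributions)
    if value < 0 and name not in PROHIBITED    # protected attributes never surface as reasons
]
# Deterministic ordering: sort by contribution, then by name to break ties.
reason_codes = sorted(negative, key=lambda t: (t[1], t[0]))[:TOP_N]

for rank, (name, value) in enumerate(reason_codes, start=1):
    print(f"Reason {rank}: {name} (contribution {value:+.2f})")
```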
7. Dataset bias detection¶
Pre-model bias identification:
- Proxy correlation analysis: Identification of protected attribute proxies
- Distribution drift: Changes in demographic composition
- Class imbalance: Detection of underrepresented groups
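A minimal sketch (illustrative synthetic data) of proxy correlation analysis: correlate each candidate feature with the protected attribute and flag strong correlations:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
protected = rng.integers(0, 2, size=n)   # protected group flag (illustrative)
features = {
    "zip_code_median_income": protected * 10 + rng.normal(scale=4, size=n),  # strong proxy
    "debt_ratio": rng.normal(scale=1, size=n),                               # unrelated
    "tenure_months": rng.normal(scale=1, size=n) + 0.3 * protected,          # weak proxy
}

for name, values in features.items():
    r = np.corrcoef(values, protected)[0, 1]
    flag = "  <-- potential proxy (|r| > 0.3, illustrative threshold)" if abs(r) > 0.3 else ""
    print(f"{name:25s} corr with protected attribute: {r:+.2f}{flag}")
```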
8. Preprocessing verification¶
Production artifact validation:
- File hash: SHA256 fingerprint of preprocessing pipeline
- Params hash: Canonical hash of learned parameters
- Version compatibility: Runtime environment verification
- Class allowlisting: Security validation against pickle exploits
9. Reproducibility manifest¶
Complete audit trail for regulatory submission:
Configuration hash¶
- Complete config fingerprint: SHA256 of entire configuration
- Policy version: Specific compliance rules applied
- Profile used: Audit profile and feature set
Dataset fingerprint¶
- Data hash: Cryptographic hash of input data
- Schema lock: Structure and column validation
- Sample size: Number of records processed
Runtime environment¶
- Git commit SHA: Exact code version used
- Timestamp: ISO 8601 formatted execution time
- Package versions: All dependencies with versions
- Random seeds: All seeds used for reproducibility
Model artifacts¶
- Model hash: Fingerprint of trained model
- Preprocessing hash: Hash of preprocessing artifacts
- Feature list: Exact features used
This manifest enables byte-identical reproduction of the audit on the same inputs.
See determinism guide →
See evidence pack guide → (packaging audits for regulatory submission)
Example audit¶
See a complete audit in action:
- German Credit Audit - Full walkthrough with credit scoring
- Healthcare Bias Detection - Medical AI compliance
- Fraud Detection Audit - Financial services compliance
Regulatory mapping¶
See how these components map to specific regulatory requirements:
- SR 11-7 Technical Mapping - Federal Reserve guidance for banking
- Trust & Deployment - Architecture and compliance overview