Data Scientist Workflow¶
Guide for data scientists and researchers using GlassAlpha for exploratory model analysis, fairness research, and notebook-based development.
Overview¶
This guide is for data scientists and researchers who need to:
- Explore model fairness in Jupyter notebooks
- Conduct interactive bias analysis
- Compare multiple models quickly
- Prototype audit configurations
- Research fairness metrics and trade-offs
- Generate publication-ready visualizations
Not a data scientist? For production workflows, see ML Engineer Workflow. For compliance workflows, see Compliance Officer Workflow.
Key Capabilities¶
Notebook-First Development¶
Interactive exploration with inline results:
- Jupyter/Colab integration with the from_model() API (see the sketch after this list)
- Inline HTML audit summaries
- Interactive plotting with result.plot() methods
- No configuration files needed for quick experiments
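A condensed sketch of this flow (Workflow 1 below walks through it step by step). Here gender_series is a placeholder name for whatever pandas Series holds the protected attribute for your test rows:
import glassalpha as ga
# One-call audit from an already-fitted model (model, X_test, y_test come from your training cell)
result = ga.audit.from_model(
    model=model,
    X_test=X_test,
    y_test=y_test,
    protected_attributes={"gender": gender_series},  # placeholder Series aligned to X_test
    random_seed=42
)
result  # Jupyter renders the inline HTML summary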
Rapid Model Comparison¶
Compare fairness across models:
- Audit multiple models in same notebook
- Side-by-side metric comparison
- Threshold sweep analysis
- Trade-off visualization (accuracy vs fairness)
Research-Friendly Features¶
Support for academic work:
- Statistical confidence intervals for all metrics
- Reproducible experiments (fixed seeds)
- Export metrics as JSON/CSV for papers (see the sketch after this list)
- Publication-quality plots
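For example, a minimal sketch of dumping headline metrics to JSON for a paper appendix, using the result attributes demonstrated in the workflows below (attribute names may differ slightly in your GlassAlpha version):
import json
# 'result' is an audit result from ga.audit.from_model(...), as shown in Workflow 1
metrics = {
    "accuracy": float(result.performance.accuracy),
    "auc_roc": float(result.performance.auc_roc),
    "demographic_parity": float(result.fairness.demographic_parity_difference),
}
with open("paper/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)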
Typical Workflows¶
Workflow 1: Quick Model Fairness Check¶
Scenario: You've trained a model and want to quickly check for bias before diving deeper.
Step 1: Train model in notebook¶
# Notebook cell 1: Train model
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Load your data
df = pd.read_csv("data/credit_applications.csv")
X = df.drop(columns=["approved", "gender", "race"])
y = df["approved"]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Train model
model = RandomForestClassifier(random_state=42, n_estimators=100)
model.fit(X_train, y_train)
print(f"Training accuracy: {model.score(X_train, y_train):.3f}")
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
Step 2: Audit inline (no config file)¶
# Notebook cell 2: Quick audit
import glassalpha as ga
# Create audit result directly from model
result = ga.audit.from_model(
model=model,
X_test=X_test,
y_test=y_test,
protected_attributes={
"gender": df.loc[X_test.index, "gender"],
"race": df.loc[X_test.index, "race"]
},
random_seed=42
)
# Display inline summary
result # Jupyter automatically displays HTML summary
What you see: Interactive HTML summary with:
- Performance metrics (accuracy, AUC, F1)
- Fairness metrics by group
- Feature importance
- Warnings if bias detected
Step 3: Explore metrics interactively¶
# Notebook cell 3: Dig into specific metrics
print(f"Overall accuracy: {result.performance['accuracy']:.3f}")
print(f"AUC-ROC: {result.performance.auc_roc:.3f}")
print("\nFairness by gender:")
print(f" Demographic parity: {result.fairness.demographic_parity_difference:.3f}")
print(f" Equal opportunity: {result.fairness.equal_opportunity_difference:.3f}")
print("\nFairness by race:")
for group in result.fairness.groups:
    print(f" {group}: TPR={result.fairness.tpr[group]:.3f}")
Step 4: Visualize if needed¶
# Notebook cell 4: Plot key metrics
result.fairness.plot_group_metrics()
result.calibration.plot()
result.performance.plot_confusion_matrix()
Step 5: Export PDF when satisfied¶
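When the inline results look acceptable, export a shareable PDF report. A minimal sketch using the same to_pdf() export that Workflow 2 uses for its final model (output path is illustrative):
# Notebook cell 5: Export the full audit as a PDF report
result.to_pdf("reports/credit_model_audit.pdf")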
Workflow 2: Model Comparison (Fairness vs Accuracy Trade-offs)¶
Scenario: Compare multiple models to find best fairness/accuracy balance.
Step 1: Train multiple models¶
# Notebook cell 1: Train 3 models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import glassalpha as ga
models = {
"Logistic Regression": LogisticRegression(random_state=42),
"Random Forest": RandomForestClassifier(random_state=42, n_estimators=50),
"XGBoost": XGBClassifier(random_state=42, n_estimators=50)
}
# Train all models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f} accuracy")
Step 2: Audit all models¶
# Notebook cell 2: Audit each model
results = {}
for name, model in models.items():
    results[name] = ga.audit.from_model(
        model=model,
        X_test=X_test,
        y_test=y_test,
        protected_attributes={"gender": df.loc[X_test.index, "gender"]},
        random_seed=42
    )
Step 3: Compare metrics¶
# Notebook cell 3: Build comparison table
import pandas as pd
comparison = []
for name, result in results.items():
    comparison.append({
        "Model": name,
        "Accuracy": result.performance['accuracy'],
        "AUC": result.performance.auc_roc,
        "Demographic Parity": result.fairness.demographic_parity_difference,
        "Equal Opportunity": result.fairness.equal_opportunity_difference,
        "Fairness Score": result.fairness.overall_fairness_score
    })
comparison_df = pd.DataFrame(comparison)
print(comparison_df.to_string(index=False))
Step 4: Visualize trade-offs¶
# Notebook cell 4: Plot accuracy vs fairness
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(
comparison_df["Accuracy"],
comparison_df["Fairness Score"],
s=100
)
for idx, row in comparison_df.iterrows():
    ax.annotate(row["Model"], (row["Accuracy"], row["Fairness Score"]))
ax.set_xlabel("Accuracy")
ax.set_ylabel("Fairness Score (higher is better)")
ax.set_title("Model Comparison: Accuracy vs Fairness")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Step 5: Select best model¶
# Notebook cell 5: Decision logic
# Find model with accuracy >0.75 and best fairness score
threshold_met = comparison_df[comparison_df["Accuracy"] > 0.75]
best_row = threshold_met.loc[threshold_met["Fairness Score"].idxmax()]
best_model_name = best_row["Model"]
print(f"Best model: {best_model_name}")
print(f" Accuracy: {best_row['Accuracy']:.3f}")
print(f" Fairness: {best_row['Fairness Score']:.3f}")
# Generate full audit for best model
best_result = results[best_model_name]
best_result.to_pdf(f"reports/{best_model_name.lower().replace(' ', '_')}_audit.pdf")
Workflow 3: Interactive Threshold Exploration¶
Scenario: Find optimal decision threshold balancing performance and fairness.
Step 1: Audit at default threshold¶
# Notebook cell 1
import glassalpha as ga
result_baseline = ga.audit.from_model(
model=model,
X_test=X_test,
y_test=y_test,
protected_attributes={"gender": df.loc[X_test.index, "gender"]},
threshold=0.5, # Default
random_seed=42
)
print(f"Baseline (threshold=0.5):")
print(f" Accuracy: {result_baseline.performance.accuracy:.3f}")
print(f" Demographic parity: {result_baseline.fairness.demographic_parity_difference:.3f}")
Step 2: Sweep thresholds¶
# Notebook cell 2: Test multiple thresholds
import numpy as np
thresholds = np.arange(0.3, 0.8, 0.05)
sweep_results = []
for threshold in thresholds:
    result = ga.audit.from_model(
        model=model,
        X_test=X_test,
        y_test=y_test,
        protected_attributes={"gender": df.loc[X_test.index, "gender"]},
        threshold=threshold,
        random_seed=42
    )
    sweep_results.append({
        "threshold": threshold,
        "accuracy": result.performance['accuracy'],
        "precision": result.performance.precision,
        "recall": result.performance.recall,
        "dem_parity": result.fairness.demographic_parity_difference,
        "eq_opp": result.fairness.equal_opportunity_difference
    })
sweep_df = pd.DataFrame(sweep_results)
Step 3: Visualize trade-offs¶
# Notebook cell 3: Plot threshold sweep
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Performance metrics
ax1.plot(sweep_df["threshold"], sweep_df["accuracy"], label="Accuracy", marker="o")
ax1.plot(sweep_df["threshold"], sweep_df["precision"], label="Precision", marker="s")
ax1.plot(sweep_df["threshold"], sweep_df["recall"], label="Recall", marker="^")
ax1.set_xlabel("Decision Threshold")
ax1.set_ylabel("Metric Value")
ax1.set_title("Performance vs Threshold")
ax1.legend()
ax1.grid(True, alpha=0.3)
# Fairness metrics
ax2.plot(sweep_df["threshold"], sweep_df["dem_parity"], label="Demographic Parity", marker="o")
ax2.plot(sweep_df["threshold"], sweep_df["eq_opp"], label="Equal Opportunity", marker="s")
ax2.axhline(y=0.05, color='r', linestyle='--', label="Tolerance (±0.05)")
ax2.axhline(y=-0.05, color='r', linestyle='--')
ax2.set_xlabel("Decision Threshold")
ax2.set_ylabel("Fairness Gap")
ax2.set_title("Fairness vs Threshold")
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Step 4: Find optimal threshold¶
# Notebook cell 4: Select optimal threshold
# Find threshold where fairness < 0.05 and accuracy maximized
fair_results = sweep_df[sweep_df["dem_parity"].abs() < 0.05]
optimal_idx = fair_results["accuracy"].idxmax()
optimal_threshold = fair_results.loc[optimal_idx, "threshold"]
print(f"Optimal threshold: {optimal_threshold:.2f}")
print(f" Accuracy: {fair_results.loc[optimal_idx, 'accuracy']:.3f}")
print(f" Demographic parity: {fair_results.loc[optimal_idx, 'dem_parity']:.3f}")
Workflow 4: Research Paper Figures¶
Scenario: Generate publication-quality figures for academic papers.
Step 1: Run comprehensive audit¶
# Notebook cell 1
import glassalpha as ga
result = ga.audit.from_model(
model=model,
X_test=X_test,
y_test=y_test,
protected_attributes={
"gender": df.loc[X_test.index, "gender"],
"race": df.loc[X_test.index, "race"]
},
random_seed=42
)
Step 2: Export metrics for tables¶
# Notebook cell 2: Export metrics as CSV
metrics_df = pd.DataFrame({
"Metric": [
"Accuracy",
"Precision",
"Recall",
"F1 Score",
"AUC-ROC",
"Demographic Parity (Gender)",
"Equal Opportunity (Gender)",
"Demographic Parity (Race)",
"Equal Opportunity (Race)"
],
"Value": [
result.performance['accuracy'],
result.performance.precision,
result.performance.recall,
result.performance.f1,
result.performance.auc_roc,
result.fairness.demographic_parity_difference_gender,
result.fairness.equal_opportunity_difference_gender,
result.fairness.demographic_parity_difference_race,
result.fairness.equal_opportunity_difference_race
]
})
metrics_df.to_csv("paper/tables/model_metrics.csv", index=False)
Step 3: Generate publication plots¶
# Notebook cell 3: High-quality plots for paper
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-paper') # Publication style
# Calibration plot
fig, ax = plt.subplots(figsize=(6, 5))
result.calibration.plot(ax=ax, show_confidence=True, style="paper")
plt.savefig("paper/figures/calibration.pdf", bbox_inches='tight', dpi=300)
# Fairness comparison
fig, ax = plt.subplots(figsize=(6, 5))
result.fairness.plot_group_comparison(ax=ax, style="paper")
plt.savefig("paper/figures/fairness_comparison.pdf", bbox_inches='tight', dpi=300)
# Feature importance
fig, ax = plt.subplots(figsize=(6, 5))
result.explanations.plot_importance(ax=ax, top_n=10, style="paper")
plt.savefig("paper/figures/feature_importance.pdf", bbox_inches='tight', dpi=300)
Step 4: Export for LaTeX tables¶
# Notebook cell 4: LaTeX-formatted tables
latex_table = metrics_df.to_latex(
index=False,
float_format="%.3f",
caption="Model Performance and Fairness Metrics",
label="tab:metrics"
)
with open("paper/tables/metrics_table.tex", "w") as f:
f.write(latex_table)
Best Practices¶
Reproducibility¶
- Always set seeds: Use a consistent random_seed parameter
- Record environment: Document package versions (pip freeze > requirements.txt)
- Save configs: Export audit configurations for later reuse
- Document experiments: Use markdown cells to explain each analysis step
Exploratory Analysis¶
- Start simple: Begin with default settings before customizing
- Iterate quickly: Use from_model() for fast prototyping
- Compare baselines: Always benchmark against simple models
- Visualize early: Use plot methods to catch issues quickly
Model Development¶
- Audit during development: Don't wait until final model
- Track fairness metrics: Monitor alongside accuracy during training
- Test multiple configurations: Try different thresholds, hyperparameters
- Document decisions: Record why you chose specific models/thresholds
Publication Preparation¶
- Use consistent seeds: Ensure figures are reproducible
- Export metrics as data: Don't manually transcribe numbers
- Version control configs: Git track audit configurations
- Archive artifacts: Save models, data, and audit results together
Common Analysis Patterns¶
Pattern 1: Quick Fairness Check¶
result = ga.audit.from_model(model, X_test, y_test, protected_attributes={"gender": gender})
if result.fairness.has_bias():
    print("⚠️ Bias detected!")
    result.fairness.plot_group_metrics()
Pattern 2: Metric Extraction¶
metrics = {
"accuracy": result.performance['accuracy'],
"fairness": result.fairness.demographic_parity_difference,
"calibration": result.calibration.expected_calibration_error
}
Pattern 3: Batch Audit¶
results = [
ga.audit.from_model(model, X, y, protected_attributes=attrs, random_seed=seed)
for seed in range(5) # Multiple random seeds for robustness
]
# Aggregate results
avg_accuracy = np.mean([r.performance.accuracy for r in results])
Pattern 4: Custom Threshold¶
result = ga.audit.from_model(
model, X_test, y_test,
protected_attributes={"race": race},
threshold=0.45, # Custom threshold
random_seed=42
)
Transitioning to Production¶
When your exploratory work is ready for production:
Step 1: Create config file¶
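Translate the notebook settings (data path, target column, protected attributes, seed) into a version-controlled YAML config. The sketch below writes one from Python; the field names are illustrative placeholders rather than the exact schema, so check the Configuration Guide for the authoritative reference:
# Sketch: persist notebook settings as a YAML config (requires PyYAML;
# field names are placeholders -- see the Configuration Guide for the real schema)
import yaml
config = {
    "data": {
        "path": "data/credit_applications.csv",
        "target": "approved",
        "protected_attributes": ["gender", "race"],
    },
    "model": {"path": "models/credit_model.pkl"},
    "reproducibility": {"random_seed": 42},
}
with open("production_audit.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)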
Step 2: Validate reproducibility¶
# Run from CLI to ensure consistency (use --fast for quick validation)
glassalpha audit --config production_audit.yaml --output prod_audit.pdf --fast
# For final production audit: omit --fast and add --strict
# glassalpha audit --config production_audit.yaml --output prod_audit.pdf --strict
Step 3: Hand off to ML Engineer¶
Share with ML engineering team:
- Audit configuration file
- Model artifact (.pkl file)
- Notebook with analysis
- Requirements file
See ML Engineer Workflow for production integration.
Troubleshooting¶
Issue: Inline display not working¶
Symptom: result in notebook doesn't show HTML summary
Solution:
# Explicitly display
from IPython.display import display
display(result)
# Or use direct method
result.display()
Issue: Plot methods not found¶
Symptom: AttributeError: 'AuditResult' has no attribute 'plot'
Solution:
# Update GlassAlpha to latest version
!pip install --upgrade glassalpha
# Or use component-specific plots
result.fairness.plot_group_metrics()
result.calibration.plot()
Issue: Slow audit in notebook¶
Symptom: Audit takes >30 seconds in notebook
Solution:
# Reduce explainer samples for faster iteration
result = ga.audit.from_model(
model, X_test, y_test,
protected_attributes=attrs,
explainer_samples=100, # Default 1000
random_seed=42
)
Issue: Memory error with large dataset¶
Symptom: Kernel dies during audit
Solution:
# Sample data for exploration
X_sample = X_test.sample(n=1000, random_state=42)
y_sample = y_test.loc[X_sample.index]
result = ga.audit.from_model(model, X_sample, y_sample, ...)
Related Resources¶
For Researchers¶
- Fairness Metrics Reference - Statistical definitions
- Calibration Analysis - Probability calibration
For Transition to Production¶
- ML Engineer Workflow - CI/CD integration
- Configuration Guide - Full config reference
- Compliance Officer Workflow - Regulatory submission
Examples¶
- German Credit Audit - Complete walkthrough
- Example Tutorials - Walkthrough guides
- Interactive Notebooks - Jupyter notebooks on GitHub
Support¶
For research-specific questions:
- GitHub Discussions: GlassAlpha/glassalpha/discussions
- Email: research@glassalpha.com
- Documentation: glassalpha.com