Using custom data¶
This guide shows you how to use your own datasets with GlassAlpha for ML auditing. Whether you have CSV files, database exports, or other tabular data, this tutorial will get you up and running quickly.
Quick start¶
The fastest way to use custom data:
Option 1: Use quickstart generator (recommended for first-time users)
The easiest way is to use the quickstart command and then modify the generated config:
# Generate a project with example config
glassalpha quickstart
# Edit the generated config with your data
cd my-audit-project
nano audit_config.yaml
Update the key fields in audit_config.yaml:
- `data.dataset: custom` → use "custom" instead of a built-in dataset name
- `data.path: /path/to/your/data.csv` → your dataset path
- `data.target_column: your_target` → your prediction target column
- `data.protected_attributes: [...]` → your sensitive features

Then run:

glassalpha audit --config audit_config.yaml --output audit_report.pdf
Option 2: Minimal configuration (if you know what you're doing)
# my_audit_config.yaml
# Direct configuration
reproducibility:
random_seed: 42
data:
dataset: custom # Important: Use "custom" for your own data
path: /path/to/your/data.csv
target_column: your_target_column
protected_attributes:
- gender
- age
model:
type: logistic_regression
params:
random_state: 42
Run the audit:

glassalpha audit --config my_audit_config.yaml --output audit_report.pdf
That's it! GlassAlpha will automatically load your data, train the model, and generate a comprehensive audit report.
Data requirements¶
Minimum requirements¶
Your dataset must have:
- Tabular format: Rows are observations, columns are features
- Target column: The outcome you're predicting
- Feature columns: Input variables for prediction
- Consistent data types: Each column has one type (numeric, categorical, etc.)
Supported file formats¶
| Format | Extension | Best For | Notes |
|---|---|---|---|
| CSV | `.csv` | Small-medium datasets | Most common, widely supported |
| Parquet | `.parquet` | Large datasets | Compressed, faster loading |
| Feather | `.feather` | Fast I/O | Efficient binary format |
| Pickle | `.pkl` | Python objects | Python-specific format |
Format is automatically detected from file extension.
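If you start from a large CSV, converting it once to Parquet can make repeated loads noticeably faster. A minimal pandas sketch (assumes a Parquet engine such as pyarrow is installed; file names are placeholders):

```python
import pandas as pd

# Load the original CSV once
df = pd.read_csv("my_data.csv")

# Write a compressed Parquet copy (requires pyarrow or fastparquet)
df.to_parquet("my_data.parquet", index=False)
```

Point `data.path` at the `.parquet` file afterwards; the format is picked up from the extension as described above.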
Recommended data characteristics¶
For best results (a quick self-check sketch follows this list):
- Sample size: 500+ rows minimum, 5,000+ recommended
- Features: 5-100 features (TreeSHAP slows with >1,000 features)
- Target: Binary classification (0/1 or True/False)
- Missing values: <30% per column
- Protected attributes: At least one (race, sex, age, etc.)
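Before writing a config, a quick pandas check against these recommendations can save a failed audit run. This is a generic sketch, not a GlassAlpha command; the file name and the `loan_approved`, `gender`, and `race` columns are placeholders for your own:

```python
import pandas as pd

df = pd.read_csv("my_data.csv")

# Sample size and feature count
print(f"Rows: {len(df)}, Columns: {df.shape[1]}")

# Missing values per column (flag anything above ~30%)
missing_pct = (df.isnull().mean() * 100).round(1)
print(missing_pct[missing_pct > 0])

# Target should be binary
print("Target values:", sorted(df["loan_approved"].dropna().unique()))

# Protected attributes should exist and have reasonably sized groups
for col in ["gender", "race"]:
    print(col, df[col].value_counts(dropna=False).to_dict())
```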
Step-by-step tutorial¶
Step 1: Prepare your data¶
Example dataset structure¶
age,income,education,credit_score,employment_years,loan_approved,gender,race
35,65000,bachelors,720,8,1,female,white
42,48000,high_school,650,15,0,male,black
28,85000,masters,780,3,1,female,asian
...
Key points:
- Target column (`loan_approved`): What you're predicting (0 or 1)
- Feature columns: All others except protected attributes (if desired)
- Protected attributes (`gender`, `race`): For fairness analysis
- Headers: First row should contain column names
- No index column: Remove any row number columns
Data cleaning checklist¶
Before using your data with GlassAlpha, work through this checklist (a validation sketch follows it):
- [ ] Remove any row index columns
- [ ] Ensure column names have no special characters
- [ ] Convert dates to numeric features if needed
- [ ] Verify target column is binary (0/1)
- [ ] Check for excessive missing values
- [ ] Encode categorical variables if needed (GlassAlpha handles this automatically)
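The checklist can be scripted as a set of rough pandas checks. A sketch with placeholder names (`data.csv`, `loan_approved`), independent of GlassAlpha itself:

```python
import re
import pandas as pd

df = pd.read_csv("data.csv")
target = "loan_approved"  # your target column

# Likely leftover index columns
print("Possible index columns:",
      [c for c in df.columns if c.lower() in ("unnamed: 0", "index", "id")])

# Column names containing special characters
print("Columns with special characters:",
      [c for c in df.columns if re.search(r"[^A-Za-z0-9_]", c)])

# Binary target check
print("Target is binary:",
      set(df[target].dropna().unique()) <= {0, 1})

# Columns with excessive missing values
print("Columns >30% missing:",
      df.columns[df.isnull().mean() > 0.3].tolist())
```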
Step 2: Create configuration file¶
Create a YAML configuration file specifying your data:
# loan_approval_audit.yaml
# Direct configuration
# Essential: Set random seed for reproducibility
reproducibility:
random_seed: 42
# Data configuration
data:
dataset: custom # Required for custom data
path: ~/data/loan_applications.csv # Absolute or home-relative path
target_column: loan_approved # Column containing 0/1 outcomes
# Optional: Specify which columns to use as features
feature_columns:
- age
- income
- education
- credit_score
- employment_years
# Note: protected_attributes are automatically included
# Required for fairness analysis
protected_attributes:
- gender
- race
# Model configuration
model:
type: xgboost # Or: logistic_regression, lightgbm
params:
objective: binary:logistic
n_estimators: 100
max_depth: 5
random_state: 42
# Explanation configuration
explainers:
strategy: first_compatible
priority:
- treeshap # Use TreeSHAP for XGBoost
- kernelshap # Fallback for any model
# Metrics to compute
metrics:
performance:
metrics:
- accuracy
- precision
- recall
- f1
- auc_roc
fairness:
metrics:
- demographic_parity
- equal_opportunity
- equalized_odds
config:
demographic_parity:
threshold: 0.05 # Maximum 5% difference between groups
Step 3: Validate configuration¶
Before running a full audit, validate your configuration:
# Check for configuration errors
glassalpha validate --config loan_approval_audit.yaml
# Dry run (checks data loading without generating report)
glassalpha audit --config loan_approval_audit.yaml --output test.pdf --dry-run
Step 4: Run the audit¶
Generate your audit report:

glassalpha audit --config loan_approval_audit.yaml --output loan_approval_audit.pdf --strict

Flags explained:

- `--config`: Path to your configuration file
- `--output`: Where to save the PDF report
- `--strict`: Enable regulatory compliance mode (recommended)
Expected output:
GlassAlpha Audit Generation
========================================
Loading configuration from: loan_approval_audit.yaml
Audit profile: tabular_compliance
Strict mode: ENABLED
Running audit pipeline...
✓ Data loaded: 5,234 samples, 12 features
✓ Model trained: XGBoost (100 estimators)
✓ Explanations generated: TreeSHAP
✓ Fairness metrics computed: 3 groups analyzed
📊 Audit Summary:
✅ Performance: 82.3% accuracy, 0.86 AUC
⚠️ Bias detected: gender.demographic_parity (7.2% difference)
Generating PDF report...
✓ Report saved: loan_approval_audit.pdf (1.4 MB)
⏱️ Total time: 6.3s
Configuration options¶
Data section¶
Basic options¶
data:
dataset: custom # Required for custom data
path: /path/to/data.csv # Absolute or ~/relative path
target_column: outcome # Column name for prediction target
Feature selection¶
Option 1: Use all columns (default)
data:
dataset: custom
path: data.csv
target_column: approved
# All columns except target and protected become features
Option 2: Explicitly specify features
data:
dataset: custom
path: data.csv
target_column: approved
feature_columns: # Only these columns used as features
- age
- income
- credit_score
Protected attributes¶
Protected attributes are used for fairness analysis:
data:
protected_attributes:
- gender # Binary or categorical
- race # Multiple categories
- age # Can be continuous or binned
Important: Protected attributes are automatically included in the feature set for model training unless explicitly excluded.
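Before running the audit, it is worth confirming that each protected attribute exists, has few missing values, and has groups large enough for stable comparisons. A small pandas sketch (the file and column names are placeholders):

```python
import pandas as pd

df = pd.read_csv("data.csv")
protected = ["gender", "race", "age"]

for col in protected:
    if col not in df.columns:
        print(f"MISSING: {col} is not in the dataset")
        continue
    print(f"{col}: dtype={df[col].dtype}, "
          f"missing={df[col].isnull().sum()}, "
          f"groups={df[col].nunique()}")
    # Very small groups make fairness estimates unstable
    print(df[col].value_counts().tail(3))
```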
File path options¶
# Absolute path (recommended)
data:
path: /Users/username/data/my_data.csv
# Home directory relative (also good)
data:
path: ~/data/my_data.csv
# Current directory relative (not recommended - can fail)
data:
path: data/my_data.csv
# Environment variable
data:
path: ${DATA_DIR}/my_data.csv
Common Mistake: Relative Paths
Using relative paths like data/file.csv can fail depending on where you run the command.
**Problem**: `FileNotFoundError: Data file not found at data/file.csv`
**Fix**: Always use absolute paths or home-relative paths:
```yaml
path: /Users/yourname/data/file.csv # Absolute
path: ~/data/file.csv # Home-relative
```
Model selection¶
Choose the model type based on your needs:
Pro Tip: Start with LogisticRegression
XGBoost and LightGBM are powerful but require additional installation (pip install 'glassalpha[explain]').
**Best practice**: Start with `type: logistic_regression` (always available) to verify your setup works, then upgrade to tree models if you need better performance.
Logistic Regression (baseline)¶
model:
type: logistic_regression
params:
random_state: 42
max_iter: 1000
C: 1.0 # Regularization strength
When to use:
- Quick baseline audit
- Linear relationships
- High interpretability needed
- XGBoost/LightGBM not installed
XGBoost (recommended)¶
model:
type: xgboost
params:
objective: binary:logistic
n_estimators: 100
max_depth: 5
learning_rate: 0.1
random_state: 42
When to use:
- Best predictive performance
- TreeSHAP explanations desired
- Non-linear relationships
- 1K-100K samples
LightGBM (fast alternative)¶
model:
type: lightgbm
params:
objective: binary
n_estimators: 100
num_leaves: 31
learning_rate: 0.1
random_state: 42
When to use:
- Large datasets (>100K samples)
- Need faster training
- Memory constraints
- Many features (>100)
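If you are unsure which model type to put in the config, a quick comparison outside GlassAlpha can guide the choice. This sketch uses scikit-learn directly (and XGBoost only if it is installed); the data path and target column are placeholders, categoricals are crudely one-hot encoded, and the numbers are only a rough signal of relative performance:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
y = df["loan_approved"]
X = pd.get_dummies(df.drop(columns=["loan_approved"]))  # simple categorical encoding

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {"logistic_regression": LogisticRegression(max_iter=1000, random_state=42)}
try:
    from xgboost import XGBClassifier
    models["xgboost"] = XGBClassifier(n_estimators=100, max_depth=5, random_state=42)
except ImportError:
    pass  # XGBoost not installed; the logistic baseline still runs

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```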
Preprocessing options¶
GlassAlpha handles preprocessing automatically, but you can customize:
preprocessing:
handle_missing: true # Automatically handle missing values
missing_strategy: median # For numeric: median, mean, mode
scale_features: false # Not needed for tree models
categorical_encoding: label # label, onehot, target
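For orientation, the options above correspond roughly to the following pandas operations if done by hand; GlassAlpha's internal preprocessing may differ in detail, so treat this only as an illustration of median imputation and label encoding:

```python
import pandas as pd

df = pd.read_csv("data.csv")

# handle_missing with missing_strategy: median (numeric columns)
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].median())

# categorical_encoding: label (each category mapped to an integer code)
for col in df.select_dtypes(include=["object", "category"]).columns:
    df[col] = df[col].astype("category").cat.codes
```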
Domain-specific examples¶
Financial services (credit scoring)¶
# Direct configuration
data:
dataset: custom
path: ~/data/loan_applications.csv
target_column: approved
protected_attributes:
- gender
- race
- age_group
model:
type: xgboost
params:
objective: binary:logistic
n_estimators: 150
max_depth: 6
metrics:
fairness:
config:
demographic_parity:
threshold: 0.05 # ECOA compliance
equal_opportunity:
threshold: 0.05
Healthcare (treatment outcomes)¶
# Direct configuration
data:
dataset: custom
path: ~/data/patient_outcomes.csv
target_column: treatment_success
protected_attributes:
- race
- gender
- age
- disability_status
model:
type: xgboost
params:
objective: binary:logistic
n_estimators: 100
metrics:
fairness:
config:
demographic_parity:
threshold: 0.03 # Stricter for healthcare
equal_opportunity:
threshold: 0.03
Hiring (candidate screening)¶
# Direct configuration
data:
dataset: custom
path: ~/data/candidate_screening.csv
target_column: hired
protected_attributes:
- gender
- race
- age
- veteran_status
model:
type: logistic_regression # More interpretable for HR
params:
random_state: 42
max_iter: 1000
metrics:
fairness:
metrics:
- demographic_parity
- equal_opportunity
- predictive_parity
config:
demographic_parity:
threshold: 0.02 # Very strict for hiring
Criminal justice (risk assessment)¶
# Direct configuration
data:
dataset: custom
path: ~/data/risk_assessments.csv
target_column: recidivism
protected_attributes:
- race
- sex
- age_category
model:
type: logistic_regression
params:
random_state: 42
metrics:
fairness:
metrics:
- demographic_parity
- equal_opportunity
- equalized_odds
- predictive_parity
config:
demographic_parity:
threshold: 0.05
equal_opportunity:
threshold: 0.05
Common issues and solutions¶
Issue: "Data file not found"¶
Problem: `FileNotFoundError: Data file not found at data/file.csv`
Solutions:
- Use absolute paths: `/Users/username/data/my_data.csv`
- Verify the file exists: `ls ~/data/my_data.csv`
- Check file permissions: `ls -l ~/data/my_data.csv`
- Use `dataset: custom` in the config
Issue: "Target column not found"¶
Problem: GlassAlpha reports that the configured `target_column` is not present in the dataset
Solutions:
- Check exact column name (case-sensitive)
- Print column names: `import pandas as pd; print(pd.read_csv('data.csv').columns)`
- Remove any spaces: use `outcome`, not `outcome ` (stray whitespace in headers causes mismatches)
- Verify the CSV has headers
Issue: "Protected attributes not detected"¶
Problem: Fairness metrics show errors or no groups
Common Mistake: Missing Protected Attributes
If you don't specify protected_attributes, fairness metrics will fail with errors.
**Problem**: `ValueError: No protected attributes specified for fairness analysis`
**Fix**: Always include at least one protected attribute:
```yaml
protected_attributes:
- gender
- race
- age
```
Solutions:
- Verify column names match config exactly (case-sensitive!)
- Check for missing values in protected columns
- Ensure protected columns are in dataset
- Review data types: `df['gender'].dtype`
Issue: "Model training failed"¶
Problem: Error during model fitting
Solutions:
- Check for NaN values: `df.isnull().sum()`
- Verify the target is binary (0/1)
- Ensure sufficient samples (>100)
- Try logistic_regression first
- Check feature data types
Issue: "SHAP computation too slow"¶
Problem: Audit takes too long on large dataset
Solutions:
- Reduce the number of samples used for SHAP in your explainer configuration
- Use a sample of your data for testing (see the sketch after this list)
- Enable parallel processing if your environment supports it
- Consider LightGBM (faster than XGBoost)
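To test a configuration quickly before committing to the full dataset, audit a random sample first. A sketch that writes a smaller CSV you can point `data.path` at (sample size and seed are arbitrary):

```python
import pandas as pd

df = pd.read_csv("full_data.csv")

# Reproducible 2,000-row sample for a quick trial audit
sample = df.sample(n=min(2000, len(df)), random_state=42)
sample.to_csv("sample_data.csv", index=False)
```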
Data preparation scripts¶
Convert Excel to CSV¶
import pandas as pd
# Read Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Save as CSV
df.to_csv('data.csv', index=False)
Clean column names¶
import pandas as pd
df = pd.read_csv('data.csv')
# Remove special characters and spaces
df.columns = df.columns.str.replace('[^a-zA-Z0-9_]', '_', regex=True)
df.columns = df.columns.str.lower()
df.to_csv('data_cleaned.csv', index=False)
Handle missing values¶
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
# Check missing values
print(df.isnull().sum())
# Option 1: Drop rows with missing target
df = df.dropna(subset=['target_column'])
# Option 2: Fill numeric with median
df['age'] = df['age'].fillna(df['age'].median())
# Option 3: Fill categorical with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])
df.to_csv('data_cleaned.csv', index=False)
Create protected attribute bins¶
import pandas as pd
df = pd.read_csv('data.csv')
# Bin continuous age into categories
df['age_group'] = pd.cut(
df['age'],
bins=[0, 25, 40, 60, 100],
labels=['young', 'middle', 'senior', 'elderly']
)
df.to_csv('data_with_bins.csv', index=False)
Best practices¶
Data privacy¶
- Remove PII: Strip names, addresses, SSNs before auditing (see the sketch after this list)
- Anonymize IDs: Hash or remove customer IDs
- Aggregate when possible: Use binned age instead of exact
- Document data handling: Record what was removed/changed
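A minimal sketch of the first two steps, dropping direct identifiers and hashing an ID column with Python's standard library; the column names are placeholders, so adapt the lists to your schema:

```python
import hashlib
import pandas as pd

df = pd.read_csv("data.csv")

# Remove direct identifiers before auditing
df = df.drop(columns=["name", "address", "ssn"], errors="ignore")

# Replace customer IDs with a one-way hash so rows remain distinguishable
df["customer_id"] = df["customer_id"].astype(str).map(
    lambda v: hashlib.sha256(v.encode()).hexdigest()[:16]
)

df.to_csv("data_deidentified.csv", index=False)
```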
Model selection¶
- Start simple: Use logistic_regression for baseline
- Compare models: Try XGBoost and LightGBM
- Balance performance vs interpretability: Logistic is more interpretable
- Consider domain: Healthcare may prefer simpler models
Fairness analysis¶
- Set appropriate thresholds: Stricter for high-stakes domains
- Use multiple metrics: Demographic parity + equal opportunity
- Check intersectionality: Race × gender interactions (see the sketch after this list)
- Document trade-offs: Performance vs fairness
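One way to examine intersectional groups is to derive a combined column before the audit and list it as an extra protected attribute in your config. Whether that slicing is appropriate is a domain judgment, and the column names here are placeholders:

```python
import pandas as pd

df = pd.read_csv("data.csv")

# Combine race and gender into a single intersectional group column
df["race_gender"] = df["race"].astype(str) + "_" + df["gender"].astype(str)

df.to_csv("data_with_intersections.csv", index=False)
# Then add race_gender to protected_attributes in the YAML config
```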
Reproducibility¶
- Always set a random seed: `random_seed: 42`
- Version control configs: Use git for YAML files
- Document data sources: Where data came from and when (a checksum sketch follows this list)
- Save preprocessing steps: Scripts for data cleaning
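Recording a checksum of the exact data file next to the versioned config is a simple, tool-agnostic way to document the data source; this is not a GlassAlpha feature, just a sketch:

```python
import hashlib
from pathlib import Path

data_path = Path("~/data/loan_applications.csv").expanduser()

# SHA-256 of the exact file contents used for the audit
digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
print(f"{data_path.name}: sha256={digest}")
```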
Next steps¶
You're now ready to audit your own models! Here's what to do:
- ✅ Try it: Run your first custom data audit using the quick start above
- 📊 Compare: Test with built-in datasets to benchmark your results
- ⚙️ Optimize: Learn about configuration options to customize your audits
- 🎯 Choose wisely: Pick the best model and explainer for your use case (coming soon)
Found this helpful? Star us on GitHub ⭐ to help others discover GlassAlpha!
Additional resources¶
- Configuration Guide - Full YAML reference
- Built-in Datasets - Automatic dataset fetching
- Model Selection - Choosing the right model
- Troubleshooting - Common issues