Using custom data¶

This guide shows you how to audit your own tabular datasets with GlassAlpha, whether they come from CSV files, database exports, or other sources.

Quick start¶

There are two ways to point GlassAlpha at your own data.

Option 1: Use the quickstart generator (recommended for first-time users)

Run the quickstart command, then edit the generated config:

# Generate a project with example config
glassalpha quickstart

# Edit the generated config with your data
cd my-audit-project
nano audit_config.yaml

Update the key fields in audit_config.yaml:

data.dataset: custom → Use "custom" instead of a built-in dataset name
data.path: /path/to/your/data.csv → Your dataset path
data.target_column: your_target → Your prediction target column
data.protected_attributes: [...] → Your sensitive features

Then run:

glassalpha audit

Option 2: Minimal configuration (if you are comfortable writing the YAML by hand)

# my_audit_config.yaml
# Direct configuration

reproducibility:
  random_seed: 42

data:
  dataset: custom # Important: Use "custom" for your own data
  path: /path/to/your/data.csv
  target_column: your_target_column
  protected_attributes:
    - gender
    - age

model:
  type: logistic_regression
  params:
    random_state: 42

Run the audit:

glassalpha audit --config my_audit_config.yaml --output audit.pdf

That's it! GlassAlpha will automatically load your data, train the model, and generate a comprehensive audit report.

Data requirements¶

Minimum requirements¶

Your dataset must have:

Tabular format: Rows are observations, columns are features
Target column: The outcome you're predicting
Feature columns: Input variables for prediction
Consistent data types: Each column has one type (numeric, categorical, etc.)

Supported file formats¶

Format	Extension	Best For	Notes
CSV	`.csv`	Small-medium datasets	Most common, widely supported
Parquet	`.parquet`	Large datasets	Compressed, faster loading
Feather	`.feather`	Fast I/O	Efficient binary format
Pickle	`.pkl`	Python objects	Python-specific format

Format is automatically detected from file extension.
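
Extension-based detection can be pictured as a simple suffix lookup. The sketch below is illustrative only: `detect_format` and `FORMATS` are hypothetical names, not GlassAlpha's actual loader, and the mapping simply mirrors the table above.

```python
from pathlib import Path

# Hypothetical mapping that mirrors the supported-formats table.
FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".feather": "feather",
    ".pkl": "pickle",
}

def detect_format(path: str) -> str:
    """Return the format name implied by the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMATS:
        raise ValueError(f"Unsupported file extension: {suffix or '(none)'}")
    return FORMATS[suffix]

print(detect_format("~/data/loan_applications.csv"))
```

A practical consequence of this kind of lookup is that a CSV file saved with a `.txt` or `.data` extension would likely need renaming first.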

Recommended data characteristics¶

For best results:

Sample size: 500+ rows minimum, 5,000+ recommended
Features: 5-100 features (TreeSHAP slows with >1,000 features)
Target: Binary classification (0/1 or True/False)
Missing values: <30% per column
Protected attributes: At least one (race, sex, age, etc.)
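
These recommendations can be checked before running an audit. The following is a minimal sketch using only the standard library; `check_recommendations` is a hypothetical helper (not part of GlassAlpha), and it assumes rows are dictionaries of strings as produced by `csv.DictReader`, with empty strings or `None` marking missing values.

```python
def check_recommendations(rows, target_column, protected_attributes):
    """Return warnings for a dataset given as a list of dicts (e.g. from csv.DictReader)."""
    if not rows:
        return ["Dataset is empty"]
    warnings = []
    columns = list(rows[0].keys())
    n_features = len(columns) - 1  # every column except the target

    if len(rows) < 500:
        warnings.append(f"Only {len(rows)} rows; at least 500 recommended")
    if not 5 <= n_features <= 100:
        warnings.append(f"{n_features} features; 5-100 recommended")

    def is_missing(value):
        return value is None or value == ""

    target_values = {row[target_column] for row in rows if not is_missing(row[target_column])}
    if len(target_values) != 2:
        warnings.append(f"Target has {len(target_values)} distinct values; binary expected")

    for column in columns:
        fraction = sum(is_missing(row[column]) for row in rows) / len(rows)
        if fraction >= 0.30:
            warnings.append(f"{column}: {fraction:.0%} missing")

    if not protected_attributes:
        warnings.append("No protected attributes given")
    return warnings

# Tiny made-up dataset: 10 rows, 40% of "income" values missing.
rows = [{"age": "30", "income": "" if i < 4 else "50000", "approved": str(i % 2)} for i in range(10)]
for warning in check_recommendations(rows, "approved", ["age"]):
    print(warning)
```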

Step-by-step tutorial¶

Step 1: Prepare your data¶

Example dataset structure¶

age,income,education,credit_score,employment_years,loan_approved,gender,race
35,65000,bachelors,720,8,1,female,white
42,48000,high_school,650,15,0,male,black
28,85000,masters,780,3,1,female,asian
...

Key points:

Target column (loan_approved): What you're predicting (0 or 1)
Feature columns: All remaining columns, optionally excluding protected attributes
Protected attributes (gender, race): For fairness analysis
Headers: First row should contain column names
No index column: Remove any row number columns
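
To make the column roles concrete, here is a small sketch that reads the example rows and separates target, protected attributes, and features. It uses only the standard library, and it leaves protected attributes out of the feature list, which the key points above describe as optional.

```python
import csv
import io

# The three example rows from above, inlined so the sketch is self-contained.
CSV_TEXT = """age,income,education,credit_score,employment_years,loan_approved,gender,race
35,65000,bachelors,720,8,1,female,white
42,48000,high_school,650,15,0,male,black
28,85000,masters,780,3,1,female,asian
"""

TARGET = "loan_approved"
PROTECTED = ["gender", "race"]

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
columns = list(rows[0].keys())

# Everything that is neither the target nor a protected attribute.
feature_columns = [c for c in columns if c != TARGET and c not in PROTECTED]
targets = [int(row[TARGET]) for row in rows]

print(feature_columns)
print(targets)
```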

Data cleaning checklist¶

Before using your data with GlassAlpha:

[ ] Remove any row index columns
[ ] Ensure column names have no special characters
[ ] Convert dates to numeric features if needed
[ ] Verify target column is binary (0/1)
[ ] Check for excessive missing values
[ ] Encode categorical variables only if you need a specific encoding (GlassAlpha encodes them automatically otherwise)
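
The first two checklist items are easy to automate. This is a rough sketch with a hypothetical helper, `column_name_issues`; it assumes that "no special characters" means letters, digits, and underscores only, and it flags the `Unnamed: 0` column that pandas produces when a CSV was saved with its index.

```python
import re

def column_name_issues(columns):
    """Map each problematic column name to a short description of the issue."""
    issues = {}
    for name in columns:
        if name == "" or name.startswith("Unnamed:"):
            # A CSV written with its row index has an empty header cell,
            # which pandas reads back as "Unnamed: 0".
            issues[name] = "looks like a saved row index"
        elif not re.fullmatch(r"[A-Za-z0-9_]+", name):
            issues[name] = "contains spaces or special characters"
    return issues

print(column_name_issues(["Unnamed: 0", "credit score", "age", "income($)"]))
```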

Step 2: Create configuration file¶

Create a YAML configuration file specifying your data:

# loan_approval_audit.yaml
# Direct configuration

# Essential: Set random seed for reproducibility
reproducibility:
  random_seed: 42

# Data configuration
data:
  dataset: custom # Required for custom data
  path: ~/data/loan_applications.csv # Absolute or home-relative path
  target_column: loan_approved # Column containing 0/1 outcomes

  # Optional: Specify which columns to use as features
  feature_columns:
    - age
    - income
    - education
    - credit_score
    - employment_years
    # Note: protected_attributes are automatically included

  # Required for fairness analysis
  protected_attributes:
    - gender
    - race

# Model configuration
model:
  type: xgboost # Or: logistic_regression, lightgbm
  params:
    objective: binary:logistic
    n_estimators: 100
    max_depth: 5
    random_state: 42

# Explanation configuration
explainers:
  strategy: first_compatible
  priority:
    - treeshap # Use TreeSHAP for XGBoost
    - kernelshap # Fallback for any model

# Metrics to compute
metrics:
  performance:
    metrics:
      - accuracy
      - precision
      - recall
      - f1
      - auc_roc

  fairness:
    metrics:
      - demographic_parity
      - equal_opportunity
      - equalized_odds
    config:
      demographic_parity:
        threshold: 0.05 # Flag gaps above 0.05 (5 percentage points) between groups
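
To see what a demographic parity threshold of 0.05 means in practice, consider the conventional definition: the gap in positive-prediction rate between groups. The sketch below assumes that definition; GlassAlpha's exact computation may differ, and the data is made up for illustration.

```python
from collections import defaultdict

def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates

# Toy data: 1 = approved, 0 = denied.
predictions = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
groups = ["female"] * 5 + ["male"] * 5

gap, rates = demographic_parity_difference(predictions, groups)
THRESHOLD = 0.05
print(rates, round(gap, 2), "exceeds threshold" if gap > THRESHOLD else "within threshold")
```

Here the approval rates are 0.8 and 0.2, so the gap of 0.6 is far above the 0.05 threshold.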

Step 3: Validate configuration¶

Before running a full audit, validate your configuration:

# Check for configuration errors
glassalpha validate --config loan_approval_audit.yaml

# Dry run (checks data loading without generating report)
glassalpha audit --config loan_approval_audit.yaml --output test.pdf --dry-run

Step 4: Run the audit¶

Generate your audit report:

glassalpha audit \
  --config loan_approval_audit.yaml \
  --output loan_approval_audit.pdf \
  --strict

Flags explained:

--config: Path to your configuration file
--output: Where to save the PDF report
--strict: Enable regulatory compliance mode (recommended)

Expected output:

GlassAlpha Audit Generation
========================================
Loading configuration from: loan_approval_audit.yaml
Audit profile: tabular_compliance
Strict mode: ENABLED

Running audit pipeline...
✓ Data loaded: 5,234 samples, 12 features
✓ Model trained: XGBoost (100 estimators)
✓ Explanations generated: TreeSHAP
✓ Fairness metrics computed: 3 groups analyzed

📊 Audit Summary:
  ✅ Performance: 82.3% accuracy, 0.86 AUC
  ⚠️ Bias detected: gender.demographic_parity (7.2% difference)

Generating PDF report...
✓ Report saved: loan_approval_audit.pdf (1.4 MB)

⏱️ Total time: 6.3s

Configuration options¶

Data section¶

Basic options¶

data:
  dataset: custom # Required for custom data
  path: /path/to/data.csv # Absolute or home-relative (~/) path
  target_column: outcome # Column name for prediction target

Feature selection¶

Option 1: Use all columns (default)

data:
  dataset: custom
  path: data.csv
  target_column: approved
  # All columns except the target become features (protected attributes included)

Option 2: Explicitly specify features

data:
  dataset: custom
  path: data.csv
  target_column: approved
  feature_columns: # Only these columns used as features
    - age
    - income
    - credit_score

Protected attributes¶

Protected attributes are used for fairness analysis:

data:
  protected_attributes:
    - gender # Binary or categorical
    - race # Multiple categories
    - age # Can be continuous or binned

Important: Protected attributes are automatically included in the feature set for model training unless explicitly excluded.

File path options¶

# Absolute path (recommended)
data:
  path: /Users/username/data/my_data.csv

# Home directory relative (also good)
data:
  path: ~/data/my_data.csv

# Current directory relative (not recommended - can fail)
data:
  path: data/my_data.csv

# Environment variable
data:
  path: ${DATA_DIR}/my_data.csv

Common Mistake: Relative Paths

Using relative paths like data/file.csv can fail depending on where you run the command.

**Problem**: `FileNotFoundError: Data file not found at data/file.csv`

**Fix**: Always use absolute paths or home-relative paths:
```yaml
path: /Users/yourname/data/file.csv  # Absolute
path: ~/data/file.csv                 # Home-relative
```
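
The reason relative paths are fragile is that they are resolved against the current working directory. The sketch below illustrates this with the standard library; `resolve_data_path` is a hypothetical helper, and whether GlassAlpha expands `~` and `${VAR}` in exactly this way is an assumption.

```python
import os
from pathlib import Path

def resolve_data_path(raw: str) -> Path:
    """Expand ${VARS} and ~, then anchor relative paths to the current directory."""
    expanded = os.path.expandvars(raw)       # ${DATA_DIR}/x.csv -> value of DATA_DIR + /x.csv
    expanded = os.path.expanduser(expanded)  # ~/data/x.csv -> <home>/data/x.csv
    return Path(expanded).resolve()          # data/x.csv -> <cwd>/data/x.csv

os.environ["DATA_DIR"] = "/srv/data"  # example value for this sketch only
print(resolve_data_path("${DATA_DIR}/my_data.csv"))
print(resolve_data_path("data/my_data.csv"))  # result depends on where you run it
```

Running the last line from two different directories yields two different absolute paths, which is why the same config can work in one place and fail in another.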

Model selection¶

Choose the model type based on your needs:

Pro Tip: Start with LogisticRegression

XGBoost and LightGBM are powerful but require additional installation (pip install 'glassalpha[explain]').

**Best practice**: Start with `type: logistic_regression` (always available) to verify your setup works, then upgrade to tree models if you need better performance.

Logistic Regression (baseline)¶

model:
  type: logistic_regression
  params:
    random_state: 42
    max_iter: 1000
    C: 1.0 # Regularization strength

When to use:

Quick baseline audit
Linear relationships
High interpretability needed
XGBoost/LightGBM not installed

XGBoost (recommended)¶

model:
  type: xgboost
  params:
    objective: binary:logistic
    n_estimators: 100
    max_depth: 5
    learning_rate: 0.1
    random_state: 42

When to use:

Best predictive performance
TreeSHAP explanations desired
Non-linear relationships
1K-100K samples

LightGBM (fast alternative)¶

model:
  type: lightgbm
  params:
    objective: binary
    n_estimators: 100
    num_leaves: 31
    learning_rate: 0.1
    random_state: 42

When to use:

Large datasets (>100K samples)
Need faster training
Memory constraints
Many features (>100)

Preprocessing options¶

GlassAlpha handles preprocessing automatically, but you can customize:

preprocessing:
  handle_missing: true # Automatically handle missing values
  missing_strategy: median # For numeric: median, mean, mode
  scale_features: false # Not needed for tree models
  categorical_encoding: label # label, onehot, target
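
As a rough illustration of what `missing_strategy: median` and `categorical_encoding: label` conventionally mean, here is a standard-library sketch. It is not GlassAlpha's implementation; the helper names are hypothetical, and the sorted ordering of label codes is an assumption.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

ages = impute_median([35, None, 28, 42])
education, mapping = label_encode(["bachelors", "high_school", "masters", "bachelors"])
print(ages)
print(education, mapping)
```

Label encoding imposes an arbitrary order on categories, which tree models tolerate well; for linear models such as logistic regression, `onehot` is often the safer choice.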

Domain-specific examples¶

Financial services (credit scoring)¶

# Direct configuration

data:
  dataset: custom
  path: ~/data/loan_applications.csv
  target_column: approved
  protected_attributes:
    - gender
    - race
    - age_group

model:
  type: xgboost
  params:
    objective: binary:logistic
    n_estimators: 150
    max_depth: 6

metrics:
  fairness:
    config:
      demographic_parity:
        threshold: 0.05 # Example threshold for an ECOA-focused review
      equal_opportunity:
        threshold: 0.05

Healthcare (treatment outcomes)¶

# Direct configuration

data:
  dataset: custom
  path: ~/data/patient_outcomes.csv
  target_column: treatment_success
  protected_attributes:
    - race
    - gender
    - age
    - disability_status

model:
  type: xgboost
  params:
    objective: binary:logistic
    n_estimators: 100

metrics:
  fairness:
    config:
      demographic_parity:
        threshold: 0.03 # Stricter for healthcare
      equal_opportunity:
        threshold: 0.03

Hiring (candidate screening)¶

# Direct configuration

data:
  dataset: custom
  path: ~/data/candidate_screening.csv
  target_column: hired
  protected_attributes:
    - gender
    - race
    - age
    - veteran_status

model:
  type: logistic_regression # More interpretable for HR
  params:
    random_state: 42
    max_iter: 1000

metrics:
  fairness:
    metrics:
      - demographic_parity
      - equal_opportunity
      - predictive_parity
    config:
      demographic_parity:
        threshold: 0.02 # Very strict for hiring

Criminal justice (risk assessment)¶

# Direct configuration

data:
  dataset: custom
  path: ~/data/risk_assessments.csv
  target_column: recidivism
  protected_attributes:
    - race
    - sex
    - age_category

model:
  type: logistic_regression
  params:
    random_state: 42

metrics:
  fairness:
    metrics:
      - demographic_parity
      - equal_opportunity
      - equalized_odds
      - predictive_parity
    config:
      demographic_parity:
        threshold: 0.05
      equal_opportunity:
        threshold: 0.05
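
The metrics listed in these examples compare different rates across groups. Under the conventional definitions, equal opportunity compares true positive rates, predictive parity compares precision, and equalized odds compares both true and false positive rates. The sketch below computes the first two on made-up data; GlassAlpha's exact definitions may differ, and `group_rates` is a hypothetical helper.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group true positive rate (TPR) and precision."""
    stats = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        fn = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 0)
        fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
        stats[group] = {
            "tpr": tp / (tp + fn) if tp + fn else None,
            "precision": tp / (tp + fp) if tp + fp else None,
        }
    return stats

# Toy labels and predictions for two groups of four people each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

stats = group_rates(y_true, y_pred, groups)
tpr_gap = abs(stats["a"]["tpr"] - stats["b"]["tpr"])
print(stats)
print("equal opportunity gap:", tpr_gap)
```

In this toy case the TPR gap is 0.5, so an `equal_opportunity` threshold of 0.05 would be exceeded even though both groups have the same base rate.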

Common issues and solutions¶

Issue: "Data file not found"¶

Problem:

FileNotFoundError: Data file not found at ~/data/my_data.csv

Solutions:

Use absolute paths: /Users/username/data/my_data.csv
Verify file exists: ls ~/data/my_data.csv
Check file permissions: ls -l ~/data/my_data.csv
Use dataset: custom in config

Issue: "Target column not found"¶

Problem:

ValueError: Target column 'outcome' not found in dataset

Solutions:

Check exact column name (case-sensitive)
Print column names: import pandas as pd; print(pd.read_csv('data.csv').columns)
Remove stray spaces around the name: "outcome", not " outcome" or "outcome "
Verify CSV has headers
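
Most "column not found" errors come down to case or whitespace differences. The following sketch reads only the header row and suggests the closest match for each missing name; `check_columns` is a hypothetical helper built on the standard library's `csv` and `difflib` modules.

```python
import csv
import difflib
import io

def check_columns(csv_file, wanted):
    """Report which wanted columns are missing and suggest close header matches."""
    header = next(csv.reader(csv_file))
    # Lookup that ignores case and surrounding whitespace.
    normalized = {h.strip().lower(): h for h in header}
    report = {}
    for name in wanted:
        if name in header:
            continue
        guess = normalized.get(name.strip().lower())
        if guess is None:
            close = difflib.get_close_matches(name, header, n=1)
            guess = close[0] if close else None
        report[name] = guess
    return report

# Inline sample with a padded, differently cased header cell.
sample = io.StringIO("age, Outcome ,gender\n35,1,female\n")
print(check_columns(sample, ["outcome", "gender"]))
```

For a real file, pass an open handle instead, for example `open("data.csv", newline="")`.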

Issue: "Protected attributes not detected"¶

Problem: Fairness metrics show errors or no groups

Common Mistake: Missing Protected Attributes

If you don't specify protected_attributes, fairness metrics will fail with errors.

**Problem**: `ValueError: No protected attributes specified for fairness analysis`

**Fix**: Always include at least one protected attribute:
```yaml
protected_attributes:
  - gender
  - race
  - age
```

Solutions:

Verify column names match config exactly (case-sensitive!)
Check for missing values in protected columns
Ensure protected columns are in dataset
Review data types: df['gender'].dtype
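
A quick group summary often reveals the cause. This sketch counts the groups and missing values for each protected attribute; `summarize_groups` is a hypothetical helper, and it assumes rows are dictionaries such as those returned by `csv.DictReader`.

```python
from collections import Counter

def summarize_groups(rows, protected_attributes):
    """Count group sizes per protected attribute; missing values are tallied separately."""
    summary = {}
    for attr in protected_attributes:
        if rows and attr not in rows[0]:
            summary[attr] = "column not found"
            continue
        values = [row[attr] for row in rows]
        missing = sum(1 for v in values if v in ("", None))
        counts = Counter(v for v in values if v not in ("", None))
        summary[attr] = {"groups": dict(counts), "missing": missing}
    return summary

rows = [{"gender": "female"}, {"gender": "male"}, {"gender": ""}, {"gender": "female"}]
print(summarize_groups(rows, ["gender", "Race"]))
```

A "column not found" entry points to a name mismatch, while a single group or a high missing count explains empty fairness results.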

Issue: "Model training failed"¶

Problem: Error during model fitting

Solutions:

Check for NaN values: df.isnull().sum()
Verify target is binary (0/1)
Ensure sufficient samples (>100)
Try logistic_regression first
Check feature data types

Issue: "SHAP computation too slow"¶

Problem: Audit takes too long on large dataset

Solutions:

Reduce SHAP samples in config:

explainers:
  config:
    treeshap:
      max_samples: 100 # Default: 1000

Use a sample of the data for testing
Enable parallel processing:

performance:
  n_jobs: -1

Consider LightGBM (faster than XGBoost)
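
For the "sample of the data" suggestion, a fixed seed keeps test runs comparable. This is a minimal standard-library sketch with a hypothetical helper, `sample_rows`; it works on any list of rows.

```python
import random

def sample_rows(rows, n, seed=42):
    """Return up to n rows chosen without replacement, reproducibly."""
    if n >= len(rows):
        return list(rows)
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(rows, n)

rows = list(range(1000))  # stand-in for real records
subset = sample_rows(rows, 100)
print(len(subset))
```

Note that random sampling can thin out small protected groups, so fairness results from a sample should be treated as a smoke test rather than a final result.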

Data preparation scripts¶

Convert Excel to CSV¶

import pandas as pd

# Read Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Save as CSV
df.to_csv('data.csv', index=False)

Clean column names¶

import pandas as pd

df = pd.read_csv('data.csv')

# Remove special characters and spaces
df.columns = df.columns.str.replace('[^a-zA-Z0-9_]', '_', regex=True)
df.columns = df.columns.str.lower()

df.to_csv('data_cleaned.csv', index=False)

Handle missing values¶

import pandas as pd

df = pd.read_csv('data.csv')

# Check missing values
print(df.isnull().sum())

# Option 1: Drop rows with missing target
df = df.dropna(subset=['target_column'])

# Option 2: Fill numeric with median
df['age'] = df['age'].fillna(df['age'].median())

# Option 3: Fill categorical with mode
df['category'] = df['category'].fillna(df['category'].mode()[0])

df.to_csv('data_cleaned.csv', index=False)

Create protected attribute bins¶

import pandas as pd

df = pd.read_csv('data.csv')

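# Bin continuous age into categories.
# Bins are right-inclusive: (0, 25], (25, 40], (40, 60], (60, 100].
# Ages outside 0-100 (and an age of exactly 0) become NaN.
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 25, 40, 60, 100],
    labels=['young', 'middle', 'senior', 'elderly']
)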

df.to_csv('data_with_bins.csv', index=False)

Best practices¶

Data privacy¶

Remove PII: Strip names, addresses, SSNs before auditing
Anonymize IDs: Hash or remove customer IDs
Aggregate when possible: Use binned age instead of exact
Document data handling: Record what was removed/changed
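
For the "hash or remove customer IDs" item, a keyed hash is generally safer than a plain one, because plain hashes of short or sequential IDs can be reversed by enumerating candidates. The sketch below uses the standard library's `hmac` module; `pseudonymize` and the key value are placeholders for illustration.

```python
import hashlib
import hmac

def pseudonymize(customer_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest that replaces the raw ID."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

key = b"replace-with-a-secret-kept-outside-the-dataset"  # placeholder key
print(pseudonymize("CUST-000123", key))
```

The same ID always maps to the same digest under a given key, so joins across files still work after pseudonymization.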

Model selection¶

Start simple: Use logistic_regression for baseline
Compare models: Try XGBoost and LightGBM
Balance performance vs interpretability: Logistic regression is more interpretable
Consider domain: Healthcare may prefer simpler models

Fairness analysis¶

Set appropriate thresholds: Stricter for high-stakes domains
Use multiple metrics: Demographic parity + equal opportunity
Check intersectionality: Race × gender interactions
Document trade-offs: Performance vs fairness
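
An intersectional check can be approximated by grouping on several attributes at once. The sketch below computes the positive-outcome rate for each race × gender combination on made-up rows; `intersectional_rates` is a hypothetical helper, not a GlassAlpha function.

```python
from collections import defaultdict

def intersectional_rates(rows, attributes, outcome):
    """Positive-outcome rate for every combination of the given attributes."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        key = tuple(row[a] for a in attributes)
        totals[key] += 1
        positives[key] += int(row[outcome])
    return {key: positives[key] / totals[key] for key in totals}

rows = [
    {"race": "black", "gender": "female", "approved": 1},
    {"race": "black", "gender": "female", "approved": 0},
    {"race": "white", "gender": "male", "approved": 1},
]
print(intersectional_rates(rows, ["race", "gender"], "approved"))
```

Intersections shrink group sizes quickly, so rates for small subgroups are noisy and should be read alongside their counts.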

Reproducibility¶

Always set a random seed: for example, random_seed: 42 under reproducibility
Version control configs: Use git for YAML files
Document data sources: Where data came from, when
Save preprocessing steps: Scripts for data cleaning

Next steps¶

You're now ready to audit your own models! Here's what to do:

✅ Try it: Run your first custom data audit using the quick start above
📊 Compare: Test with built-in datasets to benchmark your results
⚙️ Optimize: Learn about configuration options to customize your audits
🎯 Choose wisely: Pick the best model and explainer for your use case (coming soon)


Additional resources¶

Configuration Guide - Full YAML reference
Built-in Datasets - Automatic dataset fetching
Model Selection - Choosing the right model
Troubleshooting - Common issues

Questions? Open an issue on GitHub or check the FAQ.