Configuration guide¶
Complete guide to configuring GlassAlpha for different use cases, models, and compliance requirements.
Overview¶
GlassAlpha uses YAML configuration files to define every aspect of the audit process. Configuration files are policy-as-code, enabling version control, review processes, and reproducible audits.
Basic structure¶
```yaml
# Required: Audit profile determines component selection
audit_profile: tabular_compliance

# Required: Reproducibility settings
reproducibility:
  random_seed: 42

# Required: Data configuration
data:
  path: data/my_dataset.csv
  target_column: outcome

# Required: Model configuration
model:
  type: xgboost

# Optional: Additional sections
explainers: { ... }
metrics: { ... }
report: { ... }
```
Core configuration sections¶
Audit profile¶
The audit profile determines which components are available and what validations are enforced.
```yaml
# Determines the audit context and available components
audit_profile: tabular_compliance # Currently supported profile
```
Available Profiles:

- `tabular_compliance` - Standard tabular ML compliance audit
- `german_credit_default` - German Credit dataset specific profile
Reproducibility settings¶
Ensures deterministic, reproducible audit results.
```yaml
reproducibility:
  # Master random seed (required in strict mode)
  random_seed: 42

  # Optional: Advanced reproducibility settings
  deterministic: true        # Enforce deterministic behavior
  capture_environment: true  # Capture system information
  validate_determinism: true # Verify reproducibility
```
Best Practices:

- Always set `random_seed` for reproducible results
- Use the same seed for comparative audits
- Document seed values in audit reports
Data configuration¶
Defines the dataset and feature structure for the audit.
```yaml
data:
  # Required: Path to dataset
  path: data/my_dataset.csv

  # Required: Target column name
  target_column: outcome

  # Optional: Explicit feature columns
  feature_columns:
    - feature1
    - feature2
    - feature3

  # Optional: Protected attributes for fairness analysis
  protected_attributes:
    - gender
    - age_group
    - ethnicity
```
Supported Data Formats:

- CSV (`.csv`)
- Parquet (`.parquet`)
- Feather (`.feather`)
- Pickle (`.pkl`)

Feature Selection:

- If `feature_columns` is not specified, all columns except the target are used (see the sketch below)
- Protected attributes should be included in features for bias analysis
- Features are automatically preprocessed based on data type
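A minimal sketch of a data block that relies on these defaults: it points at a Parquet file and omits `feature_columns`, so every column except the target becomes a feature. The path and column names are illustrative placeholders.

```yaml
data:
  path: data/my_dataset.parquet # any supported format: .csv, .parquet, .feather, .pkl
  target_column: outcome        # all other columns are used as features
  protected_attributes:         # still listed so fairness metrics can group by them
    - gender
```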
Model configuration¶
Specifies the ML model to audit and its parameters.
```yaml
model:
  # Required: Model type (triggers appropriate wrapper)
  type: logistic_regression # Default baseline model (always available)

  # Optional: Allow fallback to baseline model if requested model unavailable
  allow_fallback: true # Default: true

  # Optional: Pre-trained model path
  path: models/my_model.pkl

  # Optional: Model parameters (for training)
  params:
    random_state: 42
    max_iter: 1000
```
Supported Model Types:

- `logistic_regression` - Scikit-learn LogisticRegression (baseline, always available)
- `xgboost` - XGBoost gradient boosting (optional, requires `pip install 'glassalpha[xgboost]'`)
- `lightgbm` - LightGBM gradient boosting (optional, requires `pip install 'glassalpha[lightgbm]'`)
- `sklearn_generic` - Generic scikit-learn models (baseline, always available)
Optional Dependencies:

GlassAlpha uses optional dependencies to keep the core installation lightweight. If you request a model that's not installed, GlassAlpha will:

- Automatically fall back to `logistic_regression` if `allow_fallback: true` (default)
- Show clear installation instructions for the requested model
- Fail gracefully if `allow_fallback: false` and the model is unavailable
Example fallback behavior:

```bash
glassalpha audit --config my_config.yaml --output report.pdf

# If XGBoost requested but not installed:
# "Model 'xgboost' not available. Falling back to 'logistic_regression'.
#  To enable 'xgboost', run: pip install 'glassalpha[xgboost]'"
```
Model Loading vs Training:

- If `path` exists: loads the pre-trained model
- If `path` is missing: trains a new model with `params` (see the sketches below)
- Parameters are passed to the underlying library
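Two hedged sketches of the same model block; the file path and parameter values are placeholders. The first variant audits an existing artifact, the second omits `path` so a model is trained with `params`.

```yaml
# Variant 1: audit a pre-trained artifact (path exists, so it is loaded as-is)
model:
  type: xgboost
  path: models/my_model.pkl
```

```yaml
# Variant 2: no path, so a new model is trained with params
model:
  type: xgboost
  params:
    n_estimators: 100
    random_state: 42
```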
Explainer configuration¶
Controls how model predictions are explained and interpreted.
```yaml
explainers:
  # Required: Selection strategy
  strategy: first_compatible

  # Required: Priority order (deterministic selection)
  priority:
    - treeshap   # First choice for tree models
    - kernelshap # Fallback for any model

  # Optional: Explainer-specific configuration
  config:
    treeshap:
      max_samples: 1000      # Samples for SHAP computation
      check_additivity: true # Verify SHAP properties
    kernelshap:
      n_samples: 500         # Model evaluations
      background_size: 100   # Background dataset size
```
Available Explainers:

- `treeshap` - Exact SHAP values for tree models (XGBoost, LightGBM)
- `kernelshap` - Model-agnostic SHAP approximation
- `noop` - No-op placeholder (for testing)

Selection Strategies:

- `first_compatible` - Use the first explainer compatible with the model
- `best_available` - Select the highest-priority compatible explainer (see the sketch below)
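A minimal sketch of switching strategies; the priority list is the same as above, and the strategy only changes how GlassAlpha picks among compatible entries.

```yaml
explainers:
  strategy: best_available # or first_compatible
  priority:
    - treeshap   # exact SHAP for tree models
    - kernelshap # model-agnostic fallback
```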
Metrics configuration¶
Defines which performance and fairness metrics to compute.
```yaml
metrics:
  # Performance evaluation metrics
  performance:
    metrics:
      - accuracy
      - precision
      - recall
      - f1
      - auc_roc
      - classification_report

  # Fairness and bias detection metrics
  fairness:
    metrics:
      - demographic_parity
      - equal_opportunity
      - equalized_odds
      - predictive_parity

    # Optional: Bias tolerance thresholds
    config:
      demographic_parity:
        threshold: 0.05 # 5% maximum group difference

  # Optional: Data drift detection metrics
  drift:
    metrics:
      - population_stability_index
      - kl_divergence
      - kolmogorov_smirnov
```
Performance Metrics:

- `accuracy` - Overall classification accuracy
- `precision` - Positive predictive value
- `recall` - True positive rate (sensitivity)
- `f1` - Harmonic mean of precision and recall
- `auc_roc` - Area under the ROC curve
- `classification_report` - Comprehensive per-class metrics

Fairness Metrics:

- `demographic_parity` - Equal positive prediction rates across groups
- `equal_opportunity` - Equal true positive rates across groups
- `equalized_odds` - Equal TPR and FPR across groups
- `predictive_parity` - Equal precision across groups
Report configuration¶
Controls the format and content of generated audit reports.
```yaml
report:
  # Report template (determines structure and styling)
  template: standard_audit

  # Output format
  output_format: pdf

  # Optional: Report sections to include
  include_sections:
    - executive_summary
    - data_overview
    - model_performance
    - global_explanations
    - local_explanations
    - fairness_analysis
    - audit_manifest
    - regulatory_compliance

  # Optional: Report styling
  styling:
    color_scheme: professional
    page_size: A4
    margins: standard
    compliance_statement: true
```
Available Templates:

- `standard_audit` - Comprehensive audit report with all sections

Styling Options:

- `color_scheme`: professional, minimal, colorful
- `page_size`: A4, Letter, Legal
- `margins`: standard, narrow, wide
Advanced configuration¶
Strict mode¶
Strict mode enforces additional regulatory compliance requirements. A configuration sketch follows the requirements list below.
Strict Mode Requirements:
- Explicit random seeds (no defaults)
- Complete data schema specification
- Full manifest generation
- Deterministic component selection
- All optional validations enabled
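A minimal sketch of enabling strict mode, mirroring the enterprise example later in this guide. The explicit seed and complete data schema reflect the requirements above; the paths and column names are placeholders.

```yaml
audit_profile: tabular_compliance
strict_mode: true # fail instead of falling back or guessing defaults

reproducibility:
  random_seed: 42 # explicit seed required in strict mode

data:
  path: data/production_dataset.csv
  target_column: decision
  feature_columns: # complete schema, no inference
    - income
    - credit_score
  protected_attributes:
    - gender
```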
Manifest configuration¶
Controls audit trail generation and completeness.
```yaml
manifest:
  enabled: true                   # Generate audit manifest
  include_git_sha: true           # Include Git commit information
  include_config_hash: true       # Include configuration integrity hash
  include_data_hash: true         # Include dataset integrity hash
  track_component_selection: true # Track selected components
  include_execution_info: true    # Include timing and environment
```
Preprocessing options¶
Controls data preprocessing before model training/evaluation.
```yaml
preprocessing:
  handle_missing: true          # Handle missing values
  missing_strategy: median      # median, mode, drop
  scale_features: false         # Feature scaling (not needed for trees)
  scaling_method: standard      # standard, minmax, robust
  categorical_encoding: label   # label, onehot, target
  feature_selection: false      # Enable feature selection
  selection_method: mutual_info # mutual_info, correlation
  max_features: 20              # Maximum features to select
```
Validation configuration¶
Controls model evaluation and statistical testing.
```yaml
validation:
  cv_folds: 5             # Cross-validation folds
  cv_scoring: roc_auc     # Scoring metric for CV
  test_size: 0.2          # Train/test split ratio
  stratify_split: true    # Stratify split by target
  bootstrap_samples: 1000 # Bootstrap samples for confidence intervals
  confidence_level: 0.95  # Statistical confidence level
```
Performance optimization¶
Controls computational performance and resource usage.
```yaml
performance:
  n_jobs: -1             # Parallel processing (-1 = all cores)
  low_memory_mode: false # Optimize for memory usage
  verbose: true          # Enable progress reporting
  progress_bar: true     # Show progress bars
```
Configuration examples¶
Basic German Credit audit¶
```yaml
audit_profile: german_credit_default

reproducibility:
  random_seed: 42

data:
  path: ~/.glassalpha/data/german_credit_processed.csv
  target_column: credit_risk
  protected_attributes:
    - gender
    - age_group

model:
  type: xgboost
  params:
    objective: binary:logistic
    n_estimators: 100

explainers:
  strategy: first_compatible
  priority: [treeshap, kernelshap]

metrics:
  performance:
    metrics: [accuracy, precision, recall, f1, auc_roc]
  fairness:
    metrics: [demographic_parity, equal_opportunity]
```
Enterprise compliance configuration¶
```yaml
audit_profile: tabular_compliance
strict_mode: true

reproducibility:
  random_seed: 42
  deterministic: true
  capture_environment: true

data:
  path: data/production_dataset.csv
  target_column: decision
  feature_columns:
    - income
    - employment_length
    - debt_to_income
    - credit_score
  protected_attributes:
    - race
    - gender
    - age_group

model:
  type: lightgbm
  params:
    objective: binary
    metric: auc
    num_leaves: 31
    feature_fraction: 0.9
    bagging_fraction: 0.8
    bagging_freq: 5

explainers:
  strategy: first_compatible
  priority: [treeshap, kernelshap]
  config:
    treeshap:
      max_samples: 10000
      check_additivity: true

metrics:
  performance:
    metrics: [accuracy, precision, recall, f1, auc_roc, classification_report]
  fairness:
    metrics: [demographic_parity, equal_opportunity, equalized_odds, predictive_parity]
    config:
      demographic_parity:
        threshold: 0.02 # Stricter threshold for production
      equal_opportunity:
        threshold: 0.02

manifest:
  enabled: true
  include_git_sha: true
  include_config_hash: true
  include_data_hash: true
  track_component_selection: true

report:
  template: standard_audit
  output_format: pdf
  styling:
    color_scheme: professional
    compliance_statement: true

compliance:
  frameworks: [gdpr, ecoa, fcra]
  fairness_thresholds:
    demographic_parity: 0.02
    equal_opportunity: 0.02
```
Custom model configuration¶
```yaml
audit_profile: tabular_compliance

reproducibility:
  random_seed: 123

data:
  path: data/custom_dataset.csv
  target_column: target
  feature_columns:
    - numerical_feature_1
    - numerical_feature_2
    - categorical_feature_1
    - categorical_feature_2
  protected_attributes:
    - protected_attribute_1

model:
  type: logistic_regression
  params:
    C: 1.0
    penalty: l2
    solver: lbfgs
    max_iter: 1000

explainers:
  strategy: first_compatible
  priority: [kernelshap] # Use KernelSHAP for linear models
  config:
    kernelshap:
      n_samples: 1000
      background_size: 500

metrics:
  performance:
    metrics: [accuracy, precision, recall, f1]
  fairness:
    metrics: [demographic_parity]

preprocessing:
  handle_missing: true
  missing_strategy: median
  scale_features: true # Important for linear models
  scaling_method: standard
  categorical_encoding: onehot

validation:
  cv_folds: 10
  test_size: 0.3
  stratify_split: true
```
Configuration best practices¶
Reproducibility¶
- Always set random seeds for deterministic results
- Use version control for configuration files
- Document configuration changes in commit messages
- Enable manifest generation for complete audit trails
Performance¶
- Use appropriate model types for your data size and complexity
- Adjust sample sizes for explainers based on dataset size
- Enable parallel processing (`n_jobs: -1`) for faster computation
- Use appropriate metrics - don't compute unnecessary evaluations
Compliance¶
- Enable strict mode for regulatory submissions
- Set appropriate bias thresholds for your use case and jurisdiction (see the sketch after this list)
- Include all relevant protected attributes in fairness analysis
- Document configuration rationale for audit review
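A hedged sketch of recording those thresholds in the configuration itself, reusing the `compliance` block from the enterprise example above; the frameworks and 2% values are illustrative, not recommendations.

```yaml
compliance:
  frameworks: [ecoa, fcra] # frameworks relevant to your jurisdiction
  fairness_thresholds:
    demographic_parity: 0.02 # maximum tolerated group difference
    equal_opportunity: 0.02
```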
Security¶
- Use relative paths or environment variables for file locations
- Don't embed sensitive data in configuration files
- Review configurations before committing to version control
- Use appropriate access controls for configuration repositories
Troubleshooting configuration issues¶
Common configuration errors¶
Missing Required Fields:
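A sketch of a configuration that would trigger this class of error, assuming the Pydantic validation messages shown at the end of this guide; the `data` block omits the required `target_column`.

```yaml
# Error: data.target_column field required
data:
  path: data/my_dataset.csv
  # target_column missing - validation fails before the audit runs
```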
Invalid Model Type:

```yaml
# Error: Model type 'invalid_model' not found in registry
model:
  type: invalid_model # Should be: xgboost, lightgbm, etc.
```
Incompatible Components:

```yaml
# Warning: No compatible explainers for model type
explainers:
  priority: [treeshap] # TreeSHAP only works with tree models

model:
  type: logistic_regression # Linear model - use kernelshap instead
```
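A corrected sketch for the case above: include `kernelshap` in the priority list so the linear model still gets a compatible, model-agnostic explainer.

```yaml
explainers:
  strategy: first_compatible
  priority: [kernelshap] # model-agnostic, works with logistic_regression

model:
  type: logistic_regression
```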
Validation commands¶
```bash
# Validate configuration before running audit
glassalpha validate --config my_config.yaml

# Check strict mode compliance
glassalpha validate --config my_config.yaml --strict

# List available components
glassalpha list
```
Configuration schema validation¶
GlassAlpha uses Pydantic for configuration validation with detailed error messages:
```text
ValidationError: 2 validation errors for AuditConfig
data.target_column
  field required (type=value_error.missing)
model.type
  ensure this value has at least 1 characters (type=value_error.any_str.min_length; limit_value=1)
```
Schema reference¶
For detailed technical information about GlassAlpha's architecture and interfaces, see the Trust & Deployment Guide.
This configuration guide provides the foundation for creating effective, compliant audit configurations. Start with the provided examples and customize based on your specific requirements, data characteristics, and regulatory context.