ML Pipeline Overview
Machine learning systems are fundamentally different from traditional software: instead of following explicitly programmed rules, ML models learn patterns from data. This creates unique testing challenges at every stage of the ML pipeline.
The ML Pipeline
Each stage requires different testing approaches:
- Data: Quality, completeness, bias, freshness
- Features: Correctness, consistency, leakage detection
- Model: Accuracy, fairness, robustness, interpretability
- Serving: Latency, throughput, versioning, rollback
Data Quality Testing
Data is the foundation of ML — bad data produces bad models:
| Test | What to Check |
|---|---|
| Completeness | Missing values, null rates by feature |
| Consistency | Same entity has same representation across sources |
| Freshness | Data is recent enough for the model’s use case |
| Distribution | Feature distributions match expected ranges |
| Duplicates | No unintended duplicate records |
| Labels | Training labels are accurate and consistent |
| Schema | Data matches expected schema (types, ranges) |
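Several of the checks in the table above can be automated directly. The sketch below runs completeness, duplicate, and schema checks over a batch of records; the field names (`id`, `age`, `income`) and the example records are illustrative assumptions, not a fixed format.

```python
# Minimal data-quality checks on a batch of records, using plain Python
# dicts. Field names and thresholds here are hypothetical examples.

def null_rate(records, field):
    """Fraction of records where the field is missing or None."""
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

def duplicate_count(records, key_fields):
    """Number of records whose key matches an earlier record."""
    seen, dupes = set(), 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes

def schema_violations(records, field, expected_type, lo=None, hi=None):
    """Records whose field has the wrong type or falls outside [lo, hi]."""
    bad = []
    for r in records:
        v = r.get(field)
        if not isinstance(v, expected_type):
            bad.append(r)
        elif (lo is not None and v < lo) or (hi is not None and v > hi):
            bad.append(r)
    return bad

records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},   # missing age
    {"id": 1, "age": 34, "income": 52000},     # duplicate id
    {"id": 3, "age": 250, "income": 48000},    # out-of-range age
]

print(null_rate(records, "age"))                          # 0.25
print(duplicate_count(records, ["id"]))                   # 1
print(len(schema_violations(records, "age", int, 0, 120)))  # 2
```

In a real pipeline these checks would run as gates before training, failing the run when a threshold (for example, a null rate above a few percent) is exceeded.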
Feature Engineering Testing
Features transform raw data into model inputs:
- Feature values are within expected ranges
- Feature computation is deterministic (same input → same output)
- No data leakage (features do not contain target information)
- Feature importance aligns with domain knowledge
Model Evaluation Testing
Standard Metrics
| Metric | Use Case | Formula |
|---|---|---|
| Accuracy | Balanced classes | (TP + TN) / Total |
| Precision | When false positives are costly | TP / (TP + FP) |
| Recall | When false negatives are costly | TP / (TP + FN) |
| F1 Score | Balanced precision-recall | 2 * P * R / (P + R) |
| AUC-ROC | Overall discriminative ability | Area under ROC curve |
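The formulas in the table (except AUC-ROC, which requires ranked scores) reduce to simple arithmetic over confusion-matrix counts. A minimal sketch, with illustrative counts:

```python
# The table's formulas computed from raw confusion-matrix counts.
# Libraries such as scikit-learn provide the same metrics; the arithmetic
# is small enough here to verify by hand.

def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example counts (hypothetical): 80 TP, 90 TN, 10 FP, 20 FN.
m = metrics(tp=80, tn=90, fp=10, fn=20)
print(m)  # accuracy 0.85, precision ~0.889, recall 0.8, f1 ~0.842
```

Note the zero-division guards: on a degenerate slice with no positive predictions, precision is undefined and must be handled explicitly rather than left to crash.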
Beyond Accuracy
- Slice-based evaluation: Model performance across data subgroups (by age, geography, device)
- Edge case testing: Adversarial inputs, out-of-distribution data, boundary conditions
- Regression testing: New model version is not worse than previous version on any metric
- Robustness testing: Small input perturbations should not drastically change outputs
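Slice-based evaluation, the first item above, can be sketched in a few lines: compute the metric per subgroup rather than globally. The `region` field and the example records are assumptions for illustration.

```python
# Slice-based evaluation sketch: accuracy per subgroup instead of one
# global number. The "region" slices and records are made up.
from collections import defaultdict

def accuracy_by_slice(examples, slice_key):
    """examples: dicts with 'pred', 'label', and a slice field."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for ex in examples:
        g = ex[slice_key]
        counts[g] += 1
        hits[g] += ex["pred"] == ex["label"]
    return {g: hits[g] / counts[g] for g in counts}

examples = [
    {"region": "EU", "pred": 1, "label": 1},
    {"region": "EU", "pred": 0, "label": 0},
    {"region": "US", "pred": 1, "label": 0},
    {"region": "US", "pred": 1, "label": 1},
]
per_slice = accuracy_by_slice(examples, "region")
print(per_slice)  # {'EU': 1.0, 'US': 0.5}
```

The global accuracy here is 0.75, which hides the much weaker US slice; that gap is exactly what slice-based evaluation is meant to surface.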
Bias and Fairness Testing
ML models can perpetuate or amplify societal biases:
- Demographic parity: Positive prediction rates should be similar across groups
- Equal opportunity: True positive rates should be similar across groups
- Calibration: Predicted probabilities should be accurate for all groups
- Disparate impact: Adverse decision rates should not disproportionately affect protected groups
Test for bias across: race, gender, age, disability status, geographic location, socioeconomic status.
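Demographic parity, the first criterion above, is straightforward to check: compare positive-prediction rates between groups. A sketch, where the group names, predictions, and the 5-percentage-point tolerance are all illustrative assumptions rather than legal standards:

```python
# Demographic parity check sketch: compare positive-prediction rates
# across groups. The groups, predictions, and threshold are hypothetical.

def positive_rate(preds):
    return sum(preds) / len(preds)

def demographic_parity_gap(preds_by_group):
    """Largest difference in positive rates between any two groups."""
    rates = [positive_rate(p) for p in preds_by_group.values()]
    return max(rates) - min(rates)

preds_by_group = {
    "group_a": [1, 1, 0, 1, 0, 1, 0, 1],  # positive rate 0.625
    "group_b": [1, 0, 0, 1, 0, 1, 0, 0],  # positive rate 0.375
}
gap = demographic_parity_gap(preds_by_group)
print(round(gap, 3))  # 0.25
print(gap <= 0.05)    # False: this model fails the parity check
```

Equal opportunity follows the same pattern but compares true positive rates, i.e. positive-prediction rates computed only over examples whose true label is positive.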
Advanced ML Testing
Data Drift Monitoring
Production data changes over time:
- Feature drift: Input feature distributions shift
- Concept drift: The relationship between features and target changes
- Label drift: The distribution of target values changes
Monitoring approach:
- Statistical tests (Kolmogorov-Smirnov, Population Stability Index)
- Distribution visualization dashboards
- Automated alerts when drift exceeds thresholds
- Triggered retraining pipelines
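The Population Stability Index mentioned above compares a feature's binned distribution at training time against production. A minimal sketch, where the bin fractions are illustrative and the 0.2 alert threshold is a common rule of thumb, not a universal standard:

```python
# Population Stability Index (PSI) sketch for feature-drift detection.
# Bin fractions below are hypothetical; 0.2 is a rule-of-thumb threshold.
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI over pre-binned distributions (fractions summing to ~1)."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)  # guard against log(0) and division by zero
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

reference  = [0.25, 0.25, 0.25, 0.25]   # training-time bin fractions
production = [0.10, 0.20, 0.30, 0.40]   # current bin fractions

score = psi(reference, production)
print(round(score, 4))
print("drift alert" if score > 0.2 else "stable")
```

In this example the PSI comes out just above 0.2, so an automated monitor would raise an alert and could trigger the retraining pipeline.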
Model Serving Testing
ML models in production face infrastructure challenges:
- Inference latency (P50, P95, P99) under load
- Throughput (predictions per second)
- Model versioning and gradual rollout (canary deployment)
- A/B testing between model versions
- Fallback to previous model on failure
- Batch vs. real-time inference pipelines
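Latency percentiles, the first item above, are computed from recorded per-request timings. The sketch below uses simulated timings and a simple nearest-rank percentile; the 50 ms SLO is an assumed example.

```python
# Latency percentile sketch: P50/P95/P99 from inference timings.
# Timings are simulated here; in practice, record real per-request
# latencies under load. The 50 ms SLO is a hypothetical target.
import random

def percentile(samples, p):
    """Nearest-rank percentile: simple, adequate for SLO checks."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

random.seed(0)
latencies_ms = [random.gauss(20, 5) for _ in range(1000)]  # simulated

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p50 <= p95 <= p99)  # True: percentiles are monotone
print(p99 < 50)           # SLO check: P99 under 50 ms
```

Tail percentiles (P95, P99) matter more than the mean for serving: a model that is fast on average but slow for 1% of requests can still break downstream timeouts.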
ML Security Testing
- Adversarial attacks: inputs crafted to fool the model
- Model extraction: preventing unauthorized copying of model behavior
- Data poisoning: detecting tampered training data
- Privacy: model does not memorize and leak training data (membership inference)
Hands-On Exercise
Design a test plan for a credit scoring ML model:
- Data quality: Verify training data completeness, check for historical bias
- Model accuracy: Evaluate precision, recall, and AUC on holdout test set
- Bias testing: Verify fair outcomes across age groups, genders, and zip codes
- Robustness: Test with edge cases (zero income, extremely high credit limit)
- Monitoring: Define drift detection metrics and retraining triggers
Solution Guide
Bias tests:
- Calculate approval rates by gender: the absolute difference should be under 5 percentage points
- Calculate approval rates by age group: no group's rejection rate should exceed twice that of any other group
- Verify model explanation (SHAP values) does not rely on protected attributes
Robustness tests:
- Income = $0: model should handle gracefully, not crash
- Credit utilization = 100%: should produce reasonable (likely low) score
- All features at boundary values: model should not produce extreme outlier scores
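The robustness tests above can be sketched against a stand-in model. `credit_score` below is a hypothetical toy function (not a real scoring model); the point is the shape of the tests, which would run the same way against a production model behind an API.

```python
# Robustness checks from the solution guide against a toy stand-in model.
# `credit_score` and its 300-850 range are hypothetical illustrations.

def credit_score(income, utilization):
    """Toy score in [300, 850]: higher income and lower utilization
    raise the score. Stands in for the real model under test."""
    base = 300 + min(income, 200_000) / 200_000 * 400
    penalty = utilization * 150
    return max(300, min(850, base - penalty))

def in_valid_range(score):
    return 300 <= score <= 850

# Income = $0: must not crash and must stay in range.
zero_income = credit_score(income=0, utilization=0.3)
print(in_valid_range(zero_income))  # True

# Credit utilization = 100%: valid score, lower than at low utilization.
maxed = credit_score(income=60_000, utilization=1.0)
print(in_valid_range(maxed))                          # True
print(maxed < credit_score(60_000, utilization=0.1))  # True

# Boundary values: extreme income must not produce an outlier score.
extreme = credit_score(income=10**9, utilization=0.0)
print(in_valid_range(extreme))  # True
```

Note the clipping in the toy model: a real test suite would assert the same invariants (output always in the documented range, monotone in the expected direction) without assuming anything about the model's internals.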
Pro Tips
- Test data before testing models — most ML bugs are actually data bugs
- Monitor production model performance continuously — accuracy degrades silently without monitoring
- Always test for bias with real demographic data — synthetic data may not reveal real-world biases
- Version everything — data, features, models, and configurations must be traceable and reproducible
- Compare new models against baselines — a simpler model that performs nearly as well may be preferable
Key Takeaways
- ML testing requires testing the entire pipeline: data, features, model, serving, and monitoring
- Model accuracy alone is insufficient — fairness, robustness, and interpretability matter equally
- Data drift is the silent killer of ML models — continuous monitoring is essential
- ML bias testing is not optional — it has legal, ethical, and business implications