[{"data":1,"prerenderedAt":202},["ShallowReactive",2],{"page-\u002Fwriting\u002Fdata-valuation-shapley":3},{"id":4,"title":5,"body":6,"date":188,"description":12,"extension":189,"meta":190,"navigation":191,"path":192,"seo":193,"stem":194,"subtitle":195,"tags":196,"__hash__":201},"content\u002Fwriting\u002Fdata-valuation-shapley.md","data valuation for debugging ml pipelines",{"type":7,"value":8,"toc":178},"minimark",[9,13,18,21,24,27,31,34,37,40,53,56,60,67,81,87,93,113,119,125,129,135,141,147,150,154,168,172,175],[10,11,12],"p",{},"Garbage in, garbage out is true. But not all garbage is equally guilty. Here's how I used Shapley-based data valuation to debug ML pipelines for accuracy and fairness.",[14,15,17],"h2",{"id":16},"motivation","Motivation",[10,19,20],{},"Most data debugging looks at the trained model. You poke at predictions after the fact with tools like SHAP or LIME, and try to explain why the model did what it did. That's useful, but it skips a step.",[10,22,23],{},"If a model is unfair or inaccurate, the cause is often upstream, in the training data. Mislabeled examples. Outliers. Missing values that cluster around one demographic group instead of being spread randomly. Debugging only at the model level also misses errors introduced earlier, during imputation, scaling, or encoding, before the classifier ever sees the data.",[10,25,26],{},"My core idea: don't treat the training set as fixed. Score every training point by how much it actually contributes to model accuracy or fairness, looking at the whole pipeline, not just the final model. Rank points by that score. Fix or remove the worst ones. Then check whether that actually beats fixing things at random.",[14,28,30],{"id":29},"why-shapley-values-and-why-the-naive-version-doesnt-scale","Why Shapley values, and why the naive version doesn't scale",[10,32,33],{},"The Shapley value comes from game theory. 
It's a way to split credit among players based on their average contribution across every possible team they could be on. Applied here, each training point is a \"player,\" and the \"game\" is model performance.",[10,35,36],{},"The problem: computing this exactly means testing every possible subset of the training data. That number explodes fast, so it's only feasible for tiny datasets.",[10,38,39],{},"DataScope, the framework I used, makes this practical for real pipelines. In short, it:",[41,42,43,47,50],"ul",{},[44,45,46],"li",{},"Approximates the model with a nearest-neighbor classifier, which makes the math tractable",[44,48,49],{},"Reuses global steps like scalers and imputers across calculations instead of refitting them constantly",[44,51,52],{},"Uses some clever data-structure tricks to avoid recomputing the same things over and over",[10,54,55],{},"The result is a method reported to run orders of magnitude faster than brute-force approaches, while still ranking data points well for cleaning. That speed is what makes it usable in practice instead of just a thought experiment.",[14,57,59],{"id":58},"how-i-set-things-up","How I set things up",[10,61,62,66],{},[63,64,65],"strong",{},"Pipelines."," I tested two different pipeline setups, since a method that only works on one model type isn't proven to generalize:",[41,68,69,75],{},[44,70,71,74],{},[63,72,73],{},"Pipeline A",": a standard scikit-learn-style flow, impute, transform, scale, then logistic regression.",[44,76,77,80],{},[63,78,79],{},"Pipeline B",": impute, scale, reduce dimensions with PCA, select the best features, then a random forest.",[10,82,83,86],{},[63,84,85],{},"Datasets."," Four classic benchmark datasets: Adult Income, German Credit, Titanic, and COMPAS. Each has a different protected attribute (sex or race) and a different size, from under a thousand rows to tens of thousands. 
Using more than one dataset matters, since a result that only shows up once is weak evidence.",[10,88,89,92],{},[63,90,91],{},"Noise types."," I corrupted the training data in three different ways, never touching the test set:",[41,94,95,101,107],{},[44,96,97,100],{},[63,98,99],{},"Biased label noise",": flipping labels, but only within one demographic group, to mimic a biased labeler rather than random error.",[44,102,103,106],{},[63,104,105],{},"Biased missing data",": blanking out feature values, again targeted at one group, to mimic data that's systematically harder to collect for some people.",[44,108,109,112],{},[63,110,111],{},"Outliers",": pushing a feature value to an extreme, implausible number for a random slice of rows.",[10,114,115,118],{},[63,116,117],{},"Comparison."," For each noise type, I ranked the suspicious points three ways: by importance score (DataScope), by model uncertainty (entropy), and randomly as a control. Then I cleaned the top-ranked points, retrained, and measured what happened.",[10,120,121,124],{},[63,122,123],{},"What I measured."," Accuracy, plus two fairness metrics. Demographic parity asks whether both groups receive positive predictions at the same rate. Equalised odds asks whether the model's error rates are the same across groups. Tracking fairness alongside accuracy was the whole point. A method that fixes accuracy but quietly leaves bias untouched isn't actually solving the problem.",[14,126,128],{"id":127},"what-i-found","What I found",[10,130,131,134],{},[63,132,133],{},"Breaking things confirmed the scores were measuring something real."," Before trying to clean anything, I tested the reverse: deliberately corrupting the highest-scoring points. If the scores meant anything, this should hurt the model far more than corrupting random points. It did, by a wide margin. Corrupting the same number of low-importance points barely moved accuracy. 
That gap is the strongest evidence that the scores track real influence on the model.",[10,136,137,140],{},[63,138,139],{},"Targeted cleaning recovered accuracy fast."," After injecting label noise, cleaning the highest-scoring points recovered most of the lost accuracy, consistently across multiple runs. Cleaning by uncertainty recovered some, but noticeably less. Random cleaning barely helped at all. If you can only review a limited number of points, which ones you pick matters a lot.",[10,142,143,146],{},[63,144,145],{},"But for outliers, the fancy method lost to a simple rule."," When the noise was extreme, out-of-range feature values, capping the importance-ranked points barely moved accuracy, and didn't improve steadily as I cleaned more. A basic statistical rule (flag anything beyond two standard deviations from the mean) did better, and even beat the original, uncorrupted baseline.",[10,148,149],{},"My read: importance scoring answers \"which points change the model's behavior the most.\" That's the right question for subtle label errors, where nothing about the data looks obviously wrong. It's the wrong question for outliers, where the data already looks wrong on its face, and a simple statistical check catches it just fine. The lesson isn't that one method is better. It's that the right tool depends on what kind of problem you're actually solving.",[14,151,153],{"id":152},"what-this-doesnt-show","What this doesn't show",[41,155,156,159,162,165],{},[44,157,158],{},"Everything here is binary classification. I haven't yet tested whether this holds for multi-class problems or regression.",[44,160,161],{},"DataScope's scores are themselves an approximation, not the true Shapley value. I compared methods against each other, not against a perfect ground truth.",[44,163,164],{},"I only checked two fairness metrics. 
Other notions of fairness, or cases involving more than one protected group at once, aren't covered.",[44,166,167],{},"The datasets are old and have known issues. I'd trust the relative pattern between methods more than I'd trust any of this generalizing directly to a modern, real-world dataset.",[14,169,171],{"id":170},"where-this-leaves-things","Where this leaves things",[10,173,174],{},"The takeaway isn't \"DataScope wins\" or \"simple rules win.\" It's that the right tool depends on the kind of noise you're dealing with. Importance scoring earns its cost on subtle errors that nothing simpler can catch. Cheap statistical rules are hard to beat when the problem is just a number sitting somewhere it shouldn't be. Treating data valuation as a universal fix is the assumption this idea pushes back on hardest.",[10,176,177],{},"Happy to share the code and results if you want to dig into the details or try this on a different dataset. A Python package with example scripts covering all of these methods is coming soon :)",{"title":179,"searchDepth":180,"depth":180,"links":181},"",2,[182,183,184,185,186,187],{"id":16,"depth":180,"text":17},{"id":29,"depth":180,"text":30},{"id":58,"depth":180,"text":59},{"id":127,"depth":180,"text":128},{"id":152,"depth":180,"text":153},{"id":170,"depth":180,"text":171},"2026-06-23","md",{},true,"\u002Fwriting\u002Fdata-valuation-shapley",{"title":5,"description":12},"writing\u002Fdata-valuation-shapley","what Shapley values can (and can't) tell you about bad data",[197,198,199,200],"data-valuation","shapley","ml-pipelines","fairness","yLR5zecpq2Hz1psrk5EXvV0xKy2t3nxH8b5enqssAEU",1782496776625]