Interesting post from the Stitch Fix team. The specific techniques described are intriguing, but even more valuable is their articulation of why understanding underlying causality matters, and their recognition that using model results to change your behavior is not as simple as it seems.
It is surprisingly common for data scientists to undervalue causal understanding and to overstate their models’ ability to guide changes in behavior.
“… challenges for sound decision-making. In some organizations, earnest efforts to be ‘data-driven’ devolve into a culture of false certainty in which the influence of metrics is dictated not by their reliability but rather by their abundance and the confidence with which they’re wielded … ability to slice and dice these data has given the impression of visibility into performance at any level of granularity. Unfortunately, it’s often only an impression. Looking at more data on more things doesn’t necessarily produce better decisions; in many cases, it actually leads to worse ones.”
“If there’s one thing data scientists are good at, it’s throwing a bunch of data into XGBoost and estimating 𝔼(Y∣X). Decisions, however, rely on a very special kind of estimate—one that tells us about the outcome we’d observe if we intervened on the world by enacting decision d, aka the potential outcome. … When we talk about potential outcomes under different decision rules, {Y_d : d ∈ D}, we’re talking about the causal relationships between our decisions and outcomes. The better our understanding of those causal relationships, the better decisions we’re likely to make. If our goal is to improve our understanding of causal relationships, more data is not necessarily better. In this context, data is only valuable if it improves our causal understanding.”
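To make that distinction concrete, here is a minimal sketch on simulated data (the variable names, the confounding structure, and the use of XGBRegressor are my own illustration, not the setup from the Stitch Fix post): the model estimates 𝔼(Y∣X) just fine, yet the contrast it implies badly overstates what would happen if we actually enacted the decision.

```python
# A gradient-boosted model is a fine estimator of E(Y | X), but that
# conditional contrast is not the interventional quantity E(Y_1) - E(Y_0)
# when a confounder drives both the decision and the outcome.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n = 50_000

z = rng.normal(size=n)                          # confounder (e.g., customer engagement)
d = (z + rng.normal(size=n) > 0).astype(float)  # decision taken more often when z is high
y = 1.0 * d + 2.0 * z + rng.normal(size=n)      # true effect of enacting d is exactly 1.0

# "Throw the data into XGBoost": estimate E(Y | d) from the logged decisions.
model = XGBRegressor(n_estimators=100, max_depth=2, learning_rate=0.3)
model.fit(d.reshape(-1, 1), y)

# The model's implied contrast matches the observed gap between groups...
implied = model.predict(np.array([[1.0]]))[0] - model.predict(np.array([[0.0]]))[0]
observed_gap = y[d == 1].mean() - y[d == 0].mean()

# ...but both land near 3.3 rather than the true 1.0, because high-z units
# were both more likely to receive d = 1 and more likely to have high y.
print(f"E(Y|d=1) - E(Y|d=0): {observed_gap:.2f} (model: {implied:.2f})")
print("E(Y_1) - E(Y_0):     1.00 (by construction)")
```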
“We’ve all heard the platitude that correlation doesn’t imply causation, but sometimes it does! The process of understanding when it does is often referred to as causal identification. If a relationship between two variables is causally identified, it means we can directly estimate useful summaries (e.g., the expectations) of its potential outcomes from the data we have (despite the missing potential outcomes we could only observe with a time machine).”
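And here is a sketch of what identification buys you, reusing the simulated setup above and assuming (true by construction in the simulation) that the confounder z closes the only backdoor path. Under that assumption 𝔼(Y_d) = 𝔼_z[𝔼(Y∣d, z)], so the causal contrast can be estimated by fitting the conditional expectation and standardizing over z. This is plain backdoor adjustment (the g-formula), chosen for illustration, not necessarily the technique used in the post.

```python
# Identification via backdoor adjustment on the same simulated data:
# if z is observed and closes the backdoor path, E(Y_d) = E_z[E(Y | d, z)],
# which can be estimated by standardization (the g-formula).
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)
d = (z + rng.normal(size=n) > 0).astype(float)
y = 1.0 * d + 2.0 * z + rng.normal(size=n)      # true effect is 1.0

# Fit E(Y | d, z), this time with the confounder in the feature set.
outcome_model = XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.1)
outcome_model.fit(np.column_stack([d, z]), y)

# Standardize: predict every unit's outcome under d = 1 and under d = 0,
# then average the difference over the empirical distribution of z.
pred_d1 = outcome_model.predict(np.column_stack([np.ones(n), z]))
pred_d0 = outcome_model.predict(np.column_stack([np.zeros(n), z]))
ate_hat = (pred_d1 - pred_d0).mean()

print(f"adjusted estimate of E(Y_1 - Y_0): {ate_hat:.2f}")  # should land near 1.0
```

If z were unobserved, no amount of additional rows would rescue this estimate, which is exactly the post’s point that more data is only valuable insofar as it improves causal understanding.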
“… data can only improve decisions insofar as it enables us to learn about the potential outcomes associated with the alternatives under consideration. Instead of being naively data driven, we should seek to be causal information driven. Causal inference provides a set of powerful tools for understanding the extent to which causal relationships can be learned from the data we have. Standard machinery will often produce poor causal effect estimates …”