Good to remember when thinking about generalization

This paper is a couple of years old but still relevant. The dangers of overfitting a neural network are substantial.

… we train several standard architectures on a copy of the data where the true labels were replaced by random labels. Our central finding can be summarized as: Deep neural networks easily fit random labels … The effective capacity of neural networks is sufficient for memorizing the entire data set …

Extending on this first set of experiments, we also replace the true images by completely random pixels and observe that convolutional neural networks continue to fit the data with zero training error. This shows that despite their structure, convolutional neural nets can fit random noise … Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.
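For concreteness, here is a minimal sketch of the random-label experiment in PyTorch. The small CNN, the hyperparameters, and the epoch count are my own stand-ins rather than the architectures used in the paper; the point is simply that training accuracy on the shuffled labels climbs toward 100%, while test accuracy necessarily stays at chance because the labels carry no signal.

```python
# Sketch of the random-label experiment: a plain CNN memorizes CIFAR-10
# with its labels replaced by uniformly random ones. (My own toy setup,
# not the paper's exact architectures or training schedule.)
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

torch.manual_seed(0)

train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
# Replace the true labels with uniformly random ones.
train.targets = torch.randint(0, 10, (len(train.targets),)).tolist()
loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(128 * 8 * 8, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):  # enough epochs for training accuracy to approach 1.0
    correct, total = 0, 0
    for x, y in loader:
        opt.zero_grad()
        out = model(x)
        loss_fn(out, y).backward()
        opt.step()
        correct += (out.argmax(1) == y).sum().item()
        total += y.numel()
    print(f"epoch {epoch}: train accuracy on random labels = {correct / total:.3f}")
```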

Manage impact of “inherent” AI errors

Saleema Amershi makes good points in a workshop paper on how to think about failures in AI systems.

1) Assume that even useful systems will have regular failures

AI models are our attempts to represent and operationalize key aspects of real world systems. By “attempt” I mean that it is difficult, if not impossible, for AI models to ever fully capture the complexity of the real world. Consequently, errors are essentially inherent to AI models by design. That is, many AI algorithms work to optimize some objective function, such as minimizing some notion of loss or cost. In doing so, and because AI models only partially represent reality, AI algorithms necessarily must trade off errors and sacrifice parts of the input space to produce functions that can generalize to new scenarios. Therefore, while efforts to avoid AI biases and harms are needed, ethical AI development must also recognize failures as inevitable and work towards systematically and proactively identifying, assessing, and mitigating harms that could be caused by such failures.
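A trivial sketch of that trade-off, using nothing beyond scikit-learn (my own toy example, not from the paper): when the model class only partially captures the data-generating process, some residual error remains no matter how well the optimizer does its job.

```python
# Illustration: a model that only partially represents reality carries
# inherent error even when its objective is optimized perfectly.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=1000)   # true relationship is nonlinear

model = LinearRegression().fit(x, y)                # linear model = partial representation
residual_mse = np.mean((model.predict(x) - y) ** 2)
print(f"best-fit linear model still leaves MSE = {residual_mse:.3f}")
# The optimizer did its job; the remaining error is inherent to the model class.
```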

2) Think about the entire “system,” not just the ML model in isolation. Accept that investments in transparency and explanations are worthwhile even if they mean a trade-off of somewhat lower model metrics. Focus on the outcome delivered by the system, not just optimization of the model metrics.

When thinking about AI failures, we need to think holistically about AI-based systems. AI-based systems include AI models (and the datasets and algorithms used to train them), the software and infrastructure supporting the operation of those models within an application, and the application interfaces between those models and the people directly interacting with or indirectly affected by them. AI-based failures therefore go beyond traditional notions of model errors (such as false positives and false negatives) and model accuracy measures (such as precision and recall) and include sociotechnical failures that arise when AI-based systems interact with people in the real world. For example, medical doctors or judges viewing AI-based recommendations to inform decision making may over- or under-estimate the capabilities of the AI components making those recommendations, to the potential detriment of patients and litigants. Acknowledging this type of sociotechnical failure has motivated an exciting and active area of research in algorithm transparency and explanations … types of sociotechnical AI failures may include expectation mismatches, careless model updates, and insufficient support for human oversight and control.

Understanding causality and the use of models to guide changes in behavior

Interesting post from the Stitch Fix team. The specific techniques described are intriguing, but even more valuable is their articulation of why understanding underlying causality matters, and their recognition that using model results to change your behavior is not as simple as it seems.

It is surprisingly common for data scientists to undervalue causal understanding and misrepresent their models’ ability to guide changes in behavior.

“… challenges for sound decision-making. In some organizations, earnest efforts to be ‘data-driven’ devolve into a culture of false certainty in which the influence of metrics is dictated not by their reliability but rather by their abundance and the confidence with which they’re wielded … ability to slice and dice these data has given the impression of visibility into performance at any level of granularity. Unfortunately, it’s often only an impression. Looking at more data on more things doesn’t necessarily produce better decisions; in many cases, it actually leads to worse ones.”

“If there’s one thing data scientists are good at, it’s throwing a bunch of data into XGBoost and estimating 𝔼(Y|X). Decisions, however, rely on a very special kind of estimate—one that tells us about the outcome we’d observe if we intervened on the world by enacting decision d, aka the potential outcome. … When we talk about potential outcomes under different decision rules, {Y_d : d ∈ D}, we’re talking about the causal relationships between our decisions and outcomes. The better our understanding of those causal relationships, the better decisions we’re likely to make. If our goal is to improve our understanding of causal relationships, more data is not necessarily better. In this context, data is only valuable if it improves our causal understanding.”

“We’ve all heard the platitude that correlation doesn’t imply causation, but sometimes it does! The process of understanding when it does is often referred to as causal identification. If a relationship between two variables is causally identified, it means we can directly estimate useful summaries (e.g., the expectations) of its potential outcomes from the data we have (despite the missing potential outcomes we could only observe with a time machine)”

“… data can only improve decisions insofar as it enables us to learn about the potential outcomes associated with the alternatives under consideration. Instead of being naively data driven, we should seek to be causal information driven. Causal inference provides a set of powerful tools for understanding the extent to which causal relationships can be learned from the data we have. Standard machinery will often produce poor causal effect estimates …”
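A toy simulation makes the distinction between 𝔼(Y|X) and a potential-outcome estimate concrete. This is my own illustrative example, not from the Stitch Fix post: a confounder drives both the decision and the outcome, so the naive regression estimate of the decision’s effect is badly biased, while adjusting for the confounder recovers the true effect.

```python
# Toy confounding example: association vs. (identified) causal effect.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                            # confounder
d = (u + rng.normal(size=n) > 0).astype(float)    # decision is correlated with U
y = 2.0 * d + 3.0 * u + rng.normal(size=n)        # true causal effect of D on Y is 2.0

# Naive "data-driven" estimate: regress Y on D alone (pure association).
naive = LinearRegression().fit(d.reshape(-1, 1), y).coef_[0]

# Causally identified estimate: adjust for the confounder U.
adjusted = LinearRegression().fit(np.column_stack([d, u]), y).coef_[0]

print(f"naive association:  {naive:.2f}")     # well above the true effect of 2.0
print(f"adjusted estimate:  {adjusted:.2f}")  # close to 2.0
```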

Another approach to dealing with out-of-distribution failure

Rather than put guardrails around a model, some researchers propose training the model itself to report low confidence for results generated from out-of-distribution inputs. This paper is a good example.

Deep neural networks “perform well only when evaluated on instances very similar to those from the training set. When evaluated on slightly different distributions, neural networks often provide incorrect predictions with strikingly high confidence … systems quickly degrade in performance as the distributions of training and testing data differ slightly from each other … This problem is one of the most central challenges in deep learning … vanilla neural networks spread the training data widely throughout the representation space, and assign high confidence predictions to almost the entire volume of representations. This leads to major drawbacks since the network will provide high-confidence predictions to examples off the data manifold, thus lacking enough incentives to learn discriminative representations about the training data. To address these issues, we … encourages the neural network to be uncertain across the volume of the representation space unseen during training. This leads to concentrating the representations of the real training examples in a low dimensional subspace, resulting in more discriminative features”
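The mechanics can be sketched roughly as follows. This is a simplification in the spirit of the quoted idea, not the paper’s actual method: add a term to the training loss that pushes the network toward uniform (maximally uncertain) predictions on inputs sampled away from the training data. The noise-sampling scheme and weighting here are assumptions of mine.

```python
# Sketch: keep the network uncertain on off-manifold inputs by adding a
# "be uniform on noise" term to the usual cross-entropy loss.
import torch
import torch.nn.functional as F

def uncertainty_regularised_loss(model, x_real, y_real, noise_weight=0.5):
    """Cross-entropy on the real batch plus a push toward uniform predictions on noise."""
    ce = F.cross_entropy(model(x_real), y_real)

    # "Off-manifold" samples: here simply uniform noise shaped like the real batch.
    # A real implementation would choose these samples far more carefully.
    x_noise = torch.rand_like(x_real)
    log_probs_noise = F.log_softmax(model(x_noise), dim=1)

    # Cross-entropy against a uniform target: minimised when the network is
    # maximally uncertain (lowest confidence) on the noise inputs.
    uniform_penalty = -log_probs_noise.mean(dim=1).mean()

    return ce + noise_weight * uniform_penalty

# Usage inside a training step (model, x_batch, y_batch, optimiser are assumed):
#   loss = uncertainty_regularised_loss(model, x_batch, y_batch)
#   loss.backward()
#   optimiser.step()
```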

“Out-of-distribution” failures and the need for guardrails

This paper by Papernot and McDaniel highlights a fundamental failure of many machine learning models: they fail to identify when an input is “out-of-distribution” and therefore make overconfident predictions when presented with such inputs.

Their proposed solution is interesting. However, there will certainly be other proposed solutions; we will see which proves most robust in practice.

Even more interesting is their clear summary of the problem:

More generally, as the ML community moves towards an end-to-end approach to learning, models are taking up roles that used to be fulfilled by pre-processing pipelines. Features are no longer manually engineered to extract a representation of the input. A good example of that is machine translation—significant progress was made by replacing systems engineered for several decades with a holistic sequence-to-sequence model. This lack of pre-processing pipeline generally means that the input domain is less constrained. Despite this, models are deployed with little input validation, which implicitly boils down to expecting the classifier to correctly classify any input that can be represented by their input layer. This goes against one of the fundamental assumptions of machine learning: models should be presented at test time with inputs that fall on their training manifold. Hence, if we deploy a model in an environment where inputs may fall outside of this data manifold, we need mechanisms for figuring out whether a specific input/output pair is acceptable for a given ML model. In security, we sometimes refer to this as admission control. This is what we’d like to achieve by estimating the training data support for a particular prediction.

This problem is most dramatically highlighted by adversarial attacks, such as those illustrated in this paper and this one. However, real-world variations in input, with no bad actors involved, can trigger similar failures.

In practice, most data scientists are not conscious of these risks and blithely recommend deploying models without proposing any validating guardrails to go with them.
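To make that last point concrete, here is a rough sketch of what a minimal admission-control guardrail could look like, in the spirit of the quoted idea of estimating the training data support for a prediction. The nearest-neighbor approach, the feature extractor, and the threshold choice are my own assumptions, not the authors’ method.

```python
# Sketch of an admission-control guardrail: flag inputs whose representations
# sit far from anything seen during training, and withhold predictions for them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

class AdmissionControl:
    """Accept inputs close to the training data; reject the rest for fallback handling."""

    def __init__(self, train_features: np.ndarray, k: int = 5, quantile: float = 0.99):
        self.nn = NearestNeighbors(n_neighbors=k).fit(train_features)
        # Calibrate a rejection threshold from the training set's own neighbor
        # distances (each point's nearest neighbor is itself, which slightly
        # understates typical distances; acceptable for a sketch).
        dists, _ = self.nn.kneighbors(train_features)
        self.threshold = np.quantile(dists.mean(axis=1), quantile)

    def admit(self, features: np.ndarray) -> np.ndarray:
        """True where an input is close enough to the training data to trust."""
        dists, _ = self.nn.kneighbors(features)
        return dists.mean(axis=1) <= self.threshold

# Usage: compute features for training and incoming data with the same extractor
# (e.g. the model's penultimate-layer activations), then only surface predictions
# where guard.admit(incoming_features) is True; route everything else to a fallback.
```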