By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it … The field of AI has a reputation for making huge promises and then failing to deliver on them. Most observers conclude that AI is hard; as indeed it is. But the embarrassment does not stem from the difficulty. It is difficult to build a star from hydrogen, but the field of stellar astronomy does not have a terrible reputation for promising to build stars and then failing. The critical inference is not that AI is hard, but that, for some reason, it is very easy for people to think they know far more about Artificial Intelligence than they actually do.
Recent work by Google shows progress in applying AI to breast-cancer diagnosis. This work also highlights some key points:
- need to integrate machine learning models into hybrid human/AI systems
- need “explainability” from machine learning models
- need to be able to seed machine learning models with existing knowledge to improve their performance
Here is the relevant observation:
While the AI system caught cancers that the radiologists missed, the radiologists … caught cancers that the AI system missed. Sometimes, all six U.S. readers caught a cancer that slipped past the AI, and vice versa … The cancers that the AI system caught were generally more invasive than those caught by the radiologists; the researchers didn’t have an explanation for the discrepancies.
AI for skin cancer diagnosis illustrates why it is important to provide explainability and to look for the connections between a model’s behavior and the actual causality operating in the system. By understanding why a prediction is being made, and what the model says about the underlying system, we can catch issues like the following.
When dermatologists are looking at a lesion that they think might be a tumor, they’ll break out a ruler—the type you might have used in grade school—to take an accurate measurement of its size. Dermatologists tend to do this only for lesions that are a cause for concern. So in the set of biopsy images, if an image had a ruler in it, the algorithm was more likely to call a tumor malignant, because the presence of a ruler correlated with an increased likelihood a lesion was cancerous. Unfortunately, as Novoa emphasizes, the algorithm doesn’t know why that correlation makes sense, so it could easily misinterpret a random ruler sighting as grounds to diagnose cancer.
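The ruler confound can be made concrete with a toy sketch. The data below is entirely made up, and the “classifier” is a deliberately degenerate stand-in for a model that latched onto the spurious feature: it achieves decent training accuracy without looking at the lesion at all.

```python
# Toy sketch (made-up data): a spurious feature can dominate a naive model.
# Each record is (has_ruler, malignant). Rulers appear mostly in malignant
# cases because dermatologists measure only lesions that worry them.
training_set = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, True),
]

def ruler_classifier(has_ruler):
    """Degenerate 'model' that keys on the ruler instead of the lesion."""
    return has_ruler  # predicts malignant iff a ruler is in the photo

accuracy = sum(
    ruler_classifier(ruler) == malignant for ruler, malignant in training_set
) / len(training_set)
print(f"training accuracy: {accuracy:.0%}")  # 75% without examining the lesion
```

The point of the sketch is that accuracy alone cannot distinguish this model from one that learned real lesion features; only inspecting *why* it predicts malignancy reveals the problem.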
When going from a proof of concept to a production application of this AI, the lesion photos might change. For example, you might standardize photo capture so that the ruler is always present. The production observations would then no longer be consistent with the training set, which is one source of “model rot”.
In another paper, a similar issue was found because doctors sometimes use purple markers to highlight potentially malignant skin lesions for easier examination. Some argue that the purple marks are a real signal that should be incorporated into the model, just as the visual appearance of the tumor itself is. However, if your goal is robust generalizability over time, it is probably best not to have your AI treat the human-applied purple marks as signal, since the standards for applying those marks may vary across teams and over time. In any case, you certainly want to be aware that those purple marks are part of what is driving the model’s predictions, so you can make a conscious decision about whether you want that to be the case. It is through a commitment to explainability, and to looking for underlying causation, that you will become aware of these sorts of impacts.
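One simple way to become aware of such effects is feature ablation: zero out a suspect input and measure how far the model’s output moves. The sketch below uses a hypothetical stand-in scoring function with made-up weights and feature names, not any real dermatology model.

```python
# Sketch: ablate a suspect feature and measure how much predictions move.
# The "model" is a hypothetical linear stand-in with made-up weights.
def model_score(lesion):
    return 0.6 * lesion["irregular_border"] + 0.3 * lesion["purple_mark"]

lesions = [
    {"irregular_border": 1.0, "purple_mark": 1.0},
    {"irregular_border": 0.2, "purple_mark": 1.0},
    {"irregular_border": 0.8, "purple_mark": 0.0},
]

def ablation_impact(feature):
    """Mean absolute change in score when `feature` is zeroed out."""
    deltas = []
    for lesion in lesions:
        ablated = {**lesion, feature: 0.0}
        deltas.append(abs(model_score(lesion) - model_score(ablated)))
    return sum(deltas) / len(deltas)

print("purple_mark impact:", ablation_impact("purple_mark"))
```

A nonzero impact does not tell you whether the feature is legitimate signal or a human-introduced artifact; it tells you the feature is driving predictions, so the team can make that call consciously.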
Models are simplified representations of some real-world system, which hopefully generalize to make useful predictions about new observations within that system.
Building a model that is both accurate (makes reasonably good predictions) and generalizable (can make those predictions about never-before-seen instances) is hard. It is especially hard if the system we are modeling is an aggregate of sub-systems, each of which has distinct causality driving the outcome to be predicted. Failing to recognize this heterogeneity is one of the fundamental AI failure types and might be referred to as a “heterogeneous awareness failure”.
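This failure type is easy to demonstrate: an aggregate metric can look acceptable while one sub-system is failing badly. A minimal sketch with made-up per-record results (the clinic names are hypothetical):

```python
# Sketch (made-up labels): an aggregate metric can hide a failing sub-system.
predictions = [
    # (subsystem, prediction_was_correct)
    ("clinic_A", True), ("clinic_A", True), ("clinic_A", True), ("clinic_A", True),
    ("clinic_B", False), ("clinic_B", False), ("clinic_B", True), ("clinic_B", False),
]

overall = sum(ok for _, ok in predictions) / len(predictions)

by_group = {}
for group, ok in predictions:
    by_group.setdefault(group, []).append(ok)

print(f"overall accuracy: {overall:.1%}")  # looks merely mediocre in aggregate
for group, oks in sorted(by_group.items()):
    # ...but the breakdown shows clinic_A is perfect and clinic_B is failing
    print(f"  {group}: {sum(oks) / len(oks):.1%}")
```

The aggregate number averages over two sub-systems with very different causal behavior; only the per-subset breakdown reveals that the model has effectively not learned clinic_B at all.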
The Uber team’s release of the Manifold tool is both a recognition that this type of failure exists and a helpful step towards addressing it.
Taking advantage of visual analytics techniques, Manifold allows ML practitioners to look beyond overall summary metrics to detect which subset of data a model is inaccurately predicting. Manifold also explains the potential cause of poor model performance by surfacing the feature distribution difference between better and worse-performing subsets of data. Moreover, it can display how several candidate models have different prediction accuracies for each subset of data, providing justification for advanced treatments such as model ensembling.
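The core Manifold move described above can be sketched in a few lines: partition the data by model error, then compare feature distributions between the better- and worse-performing subsets. The records, feature, and error threshold below are all made up for illustration.

```python
# Sketch of the Manifold idea with made-up records: split data by model error,
# then compare a feature's distribution between the good and bad subsets.
records = [
    # (lesion_size_mm, absolute_prediction_error)
    (2.0, 0.05), (3.0, 0.08), (2.5, 0.04),   # small lesions: low error
    (9.0, 0.40), (8.5, 0.35), (10.0, 0.50),  # large lesions: high error
]

ERROR_THRESHOLD = 0.2  # arbitrary split for illustration
good = [size for size, err in records if err <= ERROR_THRESHOLD]
bad = [size for size, err in records if err > ERROR_THRESHOLD]

def mean(xs):
    return sum(xs) / len(xs)

print(f"mean lesion size, low-error subset:  {mean(good):.1f} mm")
print(f"mean lesion size, high-error subset: {mean(bad):.1f} mm")
# A gap like this suggests the model struggles on large lesions specifically,
# which in turn justifies treatments like ensembling a large-lesion specialist.
```

Manifold does this visually and across many features and candidate models at once, but the underlying question is the same: what is different about the data the model gets wrong?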
Knowledge does not emerge from simply stacking up information. At the end of his story “Funes the Memorious”, Borges declared: “To think is to forget a difference, to generalize, to abstract.”