
The Neural Net Tank Urban Legend

source link: https://gwern.net/tank

Did It Happen?

Versions of the Story

Drawing on the usual suspects (Google/Google Books/Google Scholar/Libgen/LessWrong/Hacker News/Twitter) in investigating leprechauns, I have compiled a large number of variants of the story; they are presented below in reverse chronological order by decade, letting us trace the evolution of the story back towards its roots:

2010s

Heather Murphy, “Why Stanford Researchers Tried to Create a ‘Gaydar’ Machine”⁠ (NYT), 2017-10-09:

So What Did the Machines See? Dr. Kosinski and Mr. Wang [Wang & Kosinski2018; see also Leuner2019/Kosinski2021] say that the algorithm is responding to fixed facial features, like nose shape, along with “grooming choices,” such as eye makeup. But it’s also possible that the algorithm is seeing something totally unknown. “The more data it has, the better it is at picking up patterns,” said Sarah Jamie Lewis, an independent privacy researcher who Tweeted a critique of the study. “But the patterns aren’t necessarily the ones you think that they are.” Tomaso Poggio, the director of M.I.T.’s Center for Brains, Minds and Machines, offered a classic parable used to illustrate this disconnect. The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness. Dr. Cox has spotted a version of this in his own studies of dating profiles. Gay people, he has found, tend to post higher-quality photos. Dr. Kosinski said that they went to great lengths to guarantee that such confounders did not influence their results. Still, he agreed that it’s easier to teach a machine to see than to understand what it has seen.

[It is worth noting that Arcas et al’s criticisms, such as their ‘gay version’ photographs, do not appear to have been confirmed by an independent replication.]

Alexander Harrowell, ⁠“It was called a perceptron for a reason, damn it”⁠, 2017-09-30:

You might think that this is rather like one of the classic optical illusions, but it’s worse than that. If you notice that you look at something this way, and then that way, and it looks different, you’ll notice something is odd. This is not something our deep learner will do. Nor is it able to identify any bias that might exist in the corpus of data it was trained on…or maybe it is. If there is any property of the training data set that is strongly predictive of the training criterion, it will zero in on that property with the ferocious clarity of Darwinism. In the 1980s, an early backpropagating neural network was set to find Soviet tanks in a pile of reconnaissance photographs. It worked, until someone noticed that the Red Army usually trained when the weather was good, and in any case the satellite could only see them when the sky was clear. The medical school at St Thomas’ Hospital in London found theirs had learned that their successful students were usually white.

An interesting story with a distinct “family resemblance” is told about a NN classifying wolves/​dogs, by Evgeniy Nikolaychuk, “Dogs, Wolves, Data Science, and Why Machines Must Learn Like Humans Do”⁠⁠, 2017-06-09:

Neural networks are designed to learn like the human brain, but we have to be careful. This is not because I’m scared of machines taking over the planet. Rather, we must make sure machines learn correctly. One example that always pops into my head is how one neural network learned to differentiate between dogs and wolves. It didn’t learn the differences between dogs and wolves, but instead learned that wolves were on snow in their picture and dogs were on grass. It learned to differentiate the two animals by looking at snow and grass. Obviously, the network learned incorrectly. What if the dog was on snow and the wolf was on grass? Then, it would be wrong.

However, in his source, “‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier [LIME]”⁠, Ribeiro2016, they specify of their dog/​wolf snow-detector NN that they “trained this bad classifier intentionally, to evaluate whether subjects are able to detect it [the bad performance]” using LIME for insight into how the classifier was making its classification, concluding that “After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially.” So Nikolaychuk appears to have misremembered. (Perhaps in another 25 years students will be told in their classes of how a NN was once trained by ecologists to count wolves…)
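
For readers who have not used LIME, a minimal sketch of this kind of inspection might look like the following; it assumes the Python `lime` and `scikit-image` packages, and `model` and `img` are placeholders for a trained image classifier and a preprocessed photo, not part of Ribeiro et al’s actual experimental setup.

```python
# Hypothetical sketch of a LIME-style inspection, in the spirit of Ribeiro et al's
# dog/wolf experiment: ask which superpixels drive the prediction.
# `model` and `img` are placeholders, not a reproduction of the original study.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def classifier_fn(batch):
    # LIME passes a batch of perturbed images; return class probabilities.
    return model.predict(np.asarray(batch))

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    img,                # H x W x 3 array, e.g. the "wolf" photo
    classifier_fn,
    top_labels=2,
    num_samples=1000,   # number of perturbed copies used to fit the local surrogate
)

# Highlight the superpixels that most support the top predicted class; in the
# dog/wolf classifier these turned out to be the snowy background, not the animal.
label = explanation.top_labels[0]
image, mask = explanation.get_image_and_mask(label, positive_only=True, num_features=5)
overlay = mark_boundaries(image, mask)
```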

⁠Redditor mantrap2⁠ gives on 2015-06-20 this version of the story:

I remember this kind of thing from the 1980s: the US Army was testing image recognition seekers for missiles and was getting excellent results on Northern German tests with NATO tanks. Then they tested the same systems in other environment and there results were suddenly shockingly bad. Turns out the image recognition was keying off the trees with tank-like minor features rather than the tank itself. Putting other vehicles in the same forests got similar high hits but tanks by themselves (in desert test ranges) didn’t register. Luckily a sceptic somewhere decided to “do one more test to make sure”.

Dennis Polis, God, Science and Mind, 2012 (pg131, limited Google Books snippet, unclear what ref 44 is):

These facts refute a Neoplatonic argument for the essential immateriality of the soul, viz. that since the mind deals with universal representations, it operates in a specifically immaterial way…So, awareness is not explained by connectionism. The results of neural net training are not always as expected. One team intended to train neural nets to recognize battle tanks in aerial photos. The system was trained using photos with and without tanks. After the training, a different set of photos was used for evaluation, and the system failed miserably—being totally incapable of distinguishing those with tanks. The system actually discriminated cloudy from sunny days. It happened that all the training photos with tanks were taken on cloudy days, while those without were on clear days.44 What does this show? That neural net training is mindless. The system had no idea of the intent of the enterprise, and did what it was programmed to do without any concept of its purpose. As with Dawkins’ evolution simulation (p. 66), the goals of computer neural nets are imposed by human programmers.

Blay Whitby, Artificial Intelligence: A Beginner’s Guide 2012 (pg53):

It is not yet clear how an artificial neural net could be trained to deal with “the world” or any really open-ended sets of problems. Now some readers may feel that this unpredictability is not a problem. After all, we are talking about training not programming and we expect a neural net to behave rather more like a brain than a computer. Given the usefulness of nets in unsupervised learning, it might seem therefore that we do not really need to worry about the problem being of manageable size and the training process being predictable. This is not the case; we really do need a manageable and well-defined problem for the training process to work. A famous AI urban myth may help to make this clearer.

The story goes something like this. A research team was training a neural net to recognize pictures containing tanks. (I’ll leave you to guess why it was tanks and not tea-cups.) To do this they showed it two training sets of photographs. One set of pictures contained at least one tank somewhere in the scene, the other set contained no tanks. The net had to be trained to discriminate between the two sets of photographs. Eventually, after all that back-propagation stuff, it correctly gave the output “tank” when there was a tank in the picture and “no tank” when there wasn’t. Even if, say, only a little bit of the gun was peeping out from behind a sand dune it said “tank”. Then they presented a picture where no part of the tank was visible—it was actually completely hidden behind a sand dune—and the program said “tank”.

Now when this sort of thing happens research labs tend to split along age-based lines. The young hairs say “Great! We’re in line for the Nobel Prize!” and the old heads say “Something’s gone wrong”. Unfortunately, the old heads are usually right—as they were in this case. What had happened was that the photographs containing tanks had been taken in the morning while the army played tanks on the range. After lunch the photographer had gone back and taken pictures from the same angles of the empty range. So the net had identified the most reliable single feature which enabled it to classify the two sets of photos, namely the angle of the shadows. “AM = tank, PM = no tank”. This was an extremely effective way of classifying the two sets of photographs in the training set. What it most certainly was not was a program that recognizes tanks. The great advantage of neural nets is that they find their own classification criteria. The great problem is that it may not be the one you want!

⁠Thom Blake⁠ notes in 2011-09-20 that the story is:

Probably apocryphal. I haven’t been able to track this down, despite having heard the story both in computer ethics class and at academic conferences.

⁠“Embarrassing mistakes in perceptron research”⁠, Marvin Minsky, 2011-01-31:

Like I had a friend in Italy who had a perceptron that looked at a visual… it had visual inputs. So, he… he had scores of music written by Bach of chorales and he had scores of chorales written by music students at the local conservatory. And he had a perceptron—a big machine—that looked at these and those and tried to distinguish between them. And he was able to train it to distinguish between the masterpieces by Bach and the pretty good chorales by the conservatory students. Well, so, he showed us this data and I was looking through it and what I discovered was that in the lower left hand corner of each page, one of the sets of data had single whole notes. And I think the ones by the students usually had four quarter notes. So that, in fact, it was possible to distinguish between these two classes of… of pieces of music just by looking at the lower left… lower right hand corner of the page. So, I told this to the… to our scientist friend and he went through the data and he said: ‘You guessed right. That’s… that’s how it happened to make that distinction.’ We thought it was very funny.

A similar thing happened here in the United States at one of our research institutions. Where a perceptron had been trained to distinguish between—this was for military purposes—It could… it was looking at a scene of a forest in which there were camouflaged tanks in one picture and no camouflaged tanks in the other. And the perceptron—after a little training—got… made a 100% correct distinction between these two different sets of photographs. Then they were embarrassed a few hours later to discover that the two rolls of film had been developed differently. And so these pictures were just a little darker than all of these pictures and the perceptron was just measuring the total amount of light in the scene. But it was very clever of the perceptron to find some way of making the distinction.

2000s

⁠Eliezer Yudkowsky⁠⁠, ⁠2008-08-24⁠ (similarly quoted in “Artificial Intelligence as a Negative and Positive Factor in Global Risk”⁠⁠, “Artificial Intelligence in global risk” in Global Catastrophic Risks 2011, & “Friendly Artificial Intelligence” in Singularity Hypotheses 2013):

Once upon a time—I’ve seen this story in several versions and several places, sometimes cited as fact, but I’ve never tracked down an original source—once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks. The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest. Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that wouldn’t generalize to new problems. Not, “camouflaged tanks versus forest”, but just, “photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive…” But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed! The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos. It turned out that in the researchers’ data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest. This parable—which might or might not be fact—illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence…

Gordon Rugg, Using Statistics: A Gentle Introduction⁠, 2007-10-01 (pg114–115):

Neural nets and genetic algorithms (including the story of the Russian tanks): Neural nets (or artificial neural networks, to give them their full name) are pieces of software inspired by the way the human brain works. In brief, you can train a neural net to do tasks like classifying images by giving it lots of examples, and telling it which examples fit into which categories; the neural net works out for itself what the defining characteristics are for each category. Alternatively, you can give it a large set of data and leave it to work out connections by itself, without giving it any feedback. There’s a story, which is probably an urban legend, which illustrates how the approach works and what can go wrong with it. According to the story, some NATO researchers trained a neural net to distinguish between photos of NATO and Warsaw Pact tanks. After a while, the neural net could get it right every time, even with photos it had never seen before. The researchers had gleeful visions of installing neural nets with miniature cameras in missiles, which could then be fired at a battlefield and left to choose their own targets. To demonstrate the method, and secure funding for the next stage, they organised a viewing by the military. On the day, they set up the system and fed it a new batch of photos. The neural net responded with apparently random decisions, sometimes identifying NATO tanks correctly, sometimes identifying them mistakenly as Warsaw Pact tanks. This did not inspire the powers that be, and the whole scheme was abandoned on the spot. It was only afterwards that the researchers realised that all their training photos of NATO tanks had been taken on sunny days in Arizona, whereas the Warsaw Pact tanks had been photographed on grey, miserable winter days on the steppes, so the neural net had flawlessly learned the unintended lesson that if you saw a tank on a gloomy day, then you made its day even gloomier by marking it for destruction.

N. Katherine Hayles, “Computing the Human” (Inventive Life: Approaches to the New Vitalism, Fraser2006; pg424):

While humans have for millennia used what Cariani calls ‘active sensing’—‘poking, pushing, bending’—to extend their sensory range and for hundreds of years have used prostheses to create new sensory experiences (for example, microscopes and telescopes), only recently has it been possible to construct evolving sensors and what Cariani (1998: 718)⁠ calls ‘internalized sensing’, that is, “bringing the world into the device” by creating internal, analog representations of the world out of which internal sensors extract newly-relevant properties’.

…Another conclusion emerges from Cariani’s call (1998) for research in sensors that can adapt and evolve independently of the epistemic categories of the humans who create them. The well-known and perhaps apocryphal story of the neural net trained to recognize army tanks will illustrate the point. For obvious reasons, the army wanted to develop an intelligent machine that could discriminate between real and pretend tanks. A neural net was constructed and trained using two sets of data, one consisting of photographs showing plywood cutouts of tanks and the other actual tanks. After some training, the net was able to discriminate flawlessly between the situations. As is customary, the net was then tested against a third data set showing pretend and real tanks in the same landscape; it failed miserably. Further investigation revealed that the original two data sets had been filmed on different days. One of the days was overcast with lots of clouds, and the other day was clear. The net, it turned out, was discriminating between the presence and absence of clouds. The anecdote shows the ambiguous potential of epistemically autonomous devices for categorizing the world in entirely different ways from the humans with whom they interact. While this autonomy might be used to enrich the human perception of the world by revealing novel kinds of constructions, it also can create a breed of autonomous devices that parse the world in radically different ways from their human trainers.

A counter-narrative, also perhaps apocryphal, emerged from the 1991 Gulf War. US soldiers firing at tanks had been trained on simulators that imaged flames shooting out from the tank to indicate a kill. When army investigators examined Iraqi tanks that were defeated in battles, they found that for some tanks the soldiers had fired four to five times the amount of munitions necessary to disable the tanks. They hypothesized that the overuse of firepower happened because no flames shot out, so the soldiers continued firing. If the hypothesis is correct, human perceptions were altered in accord with the idiosyncrasies of intelligent machines, providing an example of what can happen when human-machine perceptions are caught in a feedback loop with one another.

Linda Null & Julie Lobur, The Essentials of Computer Organization and Architecture (third edition)⁠⁠, 2003/​2014 (pg439–440 in 1st edition, pg658 in 3rd edition):

Correct training requires thousands of steps. The training time itself depends on the size of the network. As the number of perceptrons increases, the number of possible “states” also increases.

Let’s consider a more sophisticated example, that of determining whether a tank is hiding in a photograph. A neural net can be configured so that each output value correlates to exactly one pixel. If the pixel is part of the image of a tank, the net should output a one; otherwise, the net should output a zero. The input information would most likely consist of the color of the pixel. The network would be trained by feeding it many pictures with and without tanks. The training would continue until the network correctly identified whether the photos included tanks. The U.S. military conducted a research project exactly like the one we just described. One hundred photographs were taken of tanks hiding behind trees and in bushes, and another 100 photographs were taken of ordinary landscape with no tanks. Fifty photos from each group were kept “secret,” and the rest were used to train the neural network. The network was initialized with random weights before being fed one picture at a time. When the network was incorrect, it adjusted its input weights until the correct output was reached. Following the training period, the 50 “secret” pictures from each group of photos were fed into the network. The neural network correctly identified the presence or absence of a tank in each photo. The real question at this point has to do with the training—had the neural net actually learned to recognize tanks? The Pentagon’s natural suspicion led to more testing. Additional photos were taken and fed into the network, and to the researchers’ dismay, the results were quite random. The neural net could not correctly identify tanks within photos. After some investigation, the researchers determined that in the original set of 200 photos, all photos with tanks had been taken on a cloudy day, whereas the photos with no tanks had been taken on a sunny day. The neural net had properly separated the two groups of pictures, but had done so using the color of the sky to do this rather than the existence of a hidden tank. The government was now the proud owner of a very expensive neural net that could accurately distinguish between sunny and cloudy days!

This is a great example of what many consider the biggest issue with neural networks. If there are more than 10 to 20 neurons, it is impossible to understand how the network is arriving at its results. One cannot tell if the net is making decisions based on correct information, or, as in the above example, something totally irrelevant. Neural networks have a remarkable ability to derive meaning and extract patterns from data that are too complex to be analyzed by human beings. However, some people trust neural networks to be experts in their area of training. Neural nets are used in such areas as sales forecasting, risk management, customer research, undersea mine detection, facial recognition, and data validation. Although neural networks are promising, and the progress made in the past several years has led to significant funding for neural net research, many people are hesitant to put confidence in something that no human being can completely understand.

David Gerhard, “Pitch Extraction and Fundamental Frequency: History and Current Techniques”⁠⁠, Technical Report TR-CS 2003–06, November 2003:

The choice of the dimensionality and domain of the input set is crucial to the success of any connectionist model. A common example of a poor choice of input set and test data is the Pentagon’s foray into the field of object recognition. This story is probably apocryphal and many different versions exist on-line, but the story describes a true difficulty with neural nets.

As the story goes, a network was set up with the input being the pixels in a picture, and the output was a single bit, yes or no, for the existence of an enemy tank hidden somewhere in the picture. When the training was complete, the network performed beautifully, but when applied to new data, it failed miserably. The problem was that in the test data, all of the pictures that had tanks in them were taken on cloudy days, and all of the pictures without tanks were taken on sunny days. The neural net was identifying the existence or non-existence of sunshine, not tanks.

Rice lecture #24, “COMP 200: Elements of Computer Science”⁠⁠, 2002-03-18:

  1. Tanks in Desert Storm

Sometimes you have to be careful what you train on . . .

The problem with neural nets is that you never know what features they’re actually training on. For example:

The US military tried to use neural nets in Desert Storm for tank recognition, so unmanned tanks could identify enemy tanks and destroy them. They trained the neural net on multiple images of “friendly” and enemy tanks, and eventually had a decent program that seemed to correctly identify friendly and enemy tanks.

Then, when they actually used the program in a real-world test phase with actual tanks, they found that the tanks would either shoot at nothing or shoot at everything. They certainly seemed to be incapable of distinguishing friendly or enemy tanks.

Why was this? It turns out that the images they were training on always had glamour-shot type photos of friendly tanks, with an immaculate blue sky, etc. The enemy tank photos, on the other hand, were all spy photos, not very clear, sometimes fuzzy, etc. And it was these characteristics that the neural net was training on, not the tanks at all. On a bright sunny day, the tanks would do nothing. On an overcast, hazy day, they’d start firing like crazy . . .

Andrew Ilachinski, Cellular Automata: A Discrete Universe, 2001 (pg547):

There is a telling story about how the Army recently went about teaching a backpropagating net to identify tanks set against a variety of environmental backdrops. The programmers correctly fed their multi-layer net photograph after photograph of tanks in grasslands, tanks in swamps, no tanks on concrete, and so on. After many trials and many thousands of iterations, their net finally learned all of the images in their database. The problem was that when the presumably “trained” net was tested with other images that were not part of the original training set, it failed to do any better than what would be expected by chance. What had happened was that the input/​training fact set was statistically corrupt. The database consisted mostly of images that showed a tank only if there were heavy clouds, the tank itself was immersed in shadow or there was no sun at all. The Army’s neural net had indeed identified a latent pattern, but it unfortunately had nothing to do with tanks: it had effectively learned to identify the time of day! The obvious lesson to be taken away from this amusing example is that how well a net “learns” the desired associations depends almost entirely on how well the database of facts is defined. Just as Monte Carlo simulations in statistical mechanics may fall short of intended results if they are forced to rely upon poorly coded random number generators, so do backpropagating nets typically fail to achieve expected results if the facts they are trained on are statistically corrupt.

Hugh M. Cartwright, Intelligent Data Analysis In Science, 2000 (pg126), writes (according to Google Books’s snippet view; Cartwright’s version appears to be a direct quote or close paraphrase of an earlier 1994 chemistry paper, Goodacre1994):

…television programme Horizon⁠; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to detect the tanks. After further investigation, it was found…

Daniel Robert Franklin & Philippe Crochat, ⁠libneural tutorial⁠⁠, 2000-03-23:

A neural network is useless if it only sees one example of a matching input/​output pair. It cannot infer the characteristics of the input data for which you are looking for from only one example; rather, many examples are required. This is analogous to a child learning the difference between (say) different types of animals—the child will need to see several examples of each to be able to classify an arbitrary animal… It is the same with neural networks. The best training procedure is to compile a wide range of examples (for more complex problems, more examples are required) which exhibit all the different characteristics you are interested in. It is important to select examples which do not have major dominant features which are of no interest to you, but are common to your input data anyway. One famous example is of the US Army “Artificial Intelligence” tank classifier. It was shown examples of Soviet tanks from many different distances and angles on a bright sunny day, and examples of US tanks on a cloudy day. Needless to say it was great at classifying weather, but not so good at picking out enemy tanks.

1990s

⁠“Neural Network Follies”⁠, Neil Fraser, September 1998:

In the 1980s, the Pentagon wanted to harness computer technology to make their tanks harder to attack…The research team went out and took 100 photographs of tanks hiding behind trees, and then took 100 photographs of trees—with no tanks. They took half the photos from each group and put them in a vault for safe-keeping, then scanned the other half into their mainframe computer. The huge neural network was fed each photo one at a time and asked if there was a tank hiding behind the trees. Of course at the beginning its answers were completely random since the network didn’t know what was going on or what it was supposed to do. But each time it was fed a photo and it generated an answer, the scientists told it if it was right or wrong. If it was wrong it would randomly change the weightings in its network until it gave the correct answer. Over time it got better and better until eventually it was getting each photo correct. It could correctly determine if there was a tank hiding behind the trees in any one of the photos…So the scientists took out the photos they had been keeping in the vault and fed them through the computer. The computer had never seen these photos before—this would be the big test. To their immense relief the neural net correctly identified each photo as either having a tank or not having one. Independent testing: The Pentagon was very pleased with this, but a little bit suspicious. They commissioned another set of photos (half with tanks and half without) and scanned them into the computer and through the neural network. The results were completely random. For a long time nobody could figure out why. After all nobody understood how the neural had trained itself. Eventually someone noticed that in the original set of 200 photos, all the images with tanks had been taken on a cloudy day while all the images without tanks had been taken on a sunny day. The neural network had been asked to separate the two groups of photos and it had chosen the most obvious way to do it—not by looking for a camouflaged tank hiding behind a tree, but merely by looking at the color of the sky…This story might be apocryphal, but it doesn’t really matter. It is a perfect illustration of the biggest problem behind neural networks. Any automatically trained net with more than a few dozen neurons is virtually impossible to analyze and understand.

Tom White⁠ attributes (in October 2017) to Marvin Minsky some version of the tank story being told in MIT classes 20 years before, ~1997 (but doesn’t specify the detailed story or version other than apparently the results were “classified”).

Vasant Dhar & Roger Stein, Intelligent Decision Support Methods⁠, 1997 (pg98, limited Google Books snippet):

…However, when a new set of photographs were used, the results were horrible. At first the team was puzzled. But after careful inspection of the first two sets of photographs, they discovered a very simple explanation. The photos with tanks in them were all taken on sunny days, and those without the tanks were taken on overcast days. The network had not learned to identify tank like images; instead, it had learned to identify photographs of sunny days and overcast days.

Royston Goodacre, Mark J. Neal, & Douglas B. Kell, “Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra”⁠⁠, 1994-04-29:

…As in all other data analysis techniques, these supervised learning methods are not immune from sensitivity to badly chosen initial data (113). [113: Zupan, J. and J. Gasteiger: Neural Networks for Chemists: An Introduction. VCH Verlagsgesellschaft, Weinheim (1993)] Therefore the exemplars for the training set must be carefully chosen; the golden rule is “garbage in—garbage out”. An excellent example of an unrepresentative training set was discussed some time ago on the BBC television programme Horizon; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate most successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to distinguish the tanks from the trees. After further investigation, it was found that the first set of pictures containing tanks had been taken on a sunny day whilst those containing no tanks were obtained when it was overcast. The neural network had therefore thus learned simply to recognise the weather! We can conclude from this that the training and tests sets should be carefully selected to contain representative exemplars encompassing the appropriate variance over all relevant properties for the problem at hand.

Fernando Pereira, “neural redlining”, RISKS 16(41), 1994-09-12:

Fred’s comments will hold not only of neural nets but of any decision model trained from data (eg. Bayesian models⁠, decision trees). It’s just an instance of the old “GIGO” phenomenon in statistical modeling…Overall, the whole issue of evaluation, let alone certification and legal standing, of complex statistical models is still very much open. (This reminds me of a possibly apocryphal story of problems with biased data in neural net training. Some US defense contractor had supposedly trained a neural net to find tanks in scenes. The reported performance was excellent, with even camouflaged tanks mostly hidden in vegetation being spotted. However, when the net was tested on yet a new set of images supplied by the client, the net did not do better than chance. After an embarrassing investigation, it turned out that all the tank images in the original training and test sets had very different average intensity than the non-tank images, and thus the net had just learned to discriminate between two image intensity levels. Does anyone know if this actually happened, or is it just in the neural net “urban folklore”?)

Erich Harth, The Creative Loop: How the Brain Makes a Mind⁠, 1993/​1995 (pg158, limited Google Books snippet):

…55. The net was trained to detect the presence of tanks in a landscape. The training consisted in showing the device many photographs of scene, some with tanks, some without. In some cases—as in the picture on page 143—the tank’s presence was not very obvious. The inputs to the neural net were digitized photographs;

Hubert L. Dreyfus & Stuart E. Dreyfus⁠, “What Artificial Experts Can and Cannot Do”⁠⁠, 1992:

All the “continue this sequence” questions found on intelligence tests, for example, really have more than one possible answer but most human beings share a sense of what is simple and reasonable and therefore acceptable. But when the net produces an unexpected association can one say it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of “type” and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider the legend of one of connectionism’s first applications. In the early days of the perceptron the army decided to train an artificial neural network to recognize tanks partly hidden behind trees in the woods. They took a number of pictures of a woods without tanks, and then pictures of the same woods with tanks clearly sticking out from behind trees. They then trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures from each set that had not been used in training the net. Just to make sure that the net had indeed learned to recognize partially hidden tanks, however, the researchers took some more pictures in the same woods and showed them to the trained net. They were shocked and depressed to find that with the new pictures the net totally failed to discriminate between pictures of trees with partially concealed tanks behind them and just plain trees. The mystery was finally solved when someone noticed that the training pictures of the woods without tanks were taken on a cloudy day, whereas those with tanks were taken on a sunny day. The net had learned to recognize and generalize the difference between a woods with and without shadows! Obviously, not what stood out for the researchers as the important difference. This example illustrates the general point that a net must share size, architecture, initial connections, configuration and socialization with the human brain if it is to share our sense of appropriate generalization

Hubert Dreyfus appears to have told this story earlier in 1990 or 1991, as a similar story appears in episode 4 (⁠German⁠) (starting 33m49s) of the BBC documentary series The Machine That Changed the World⁠, broadcast 1991-11-08. Hubert L. Dreyfus, What Computers Still Can’t Do: A Critique of Artificial Reason⁠, 1992, repeats the story in very similar but not quite identical wording (⁠Jeff Kaufman notes that Dreyfus drops the qualifying “legend of” description):

…But when the net produces an unexpected association, can one say that it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of “type” and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider one of connectionism’s first applications. In the early days of this work the army tried to train an artificial neural network to recognize tanks in a forest. They took a number of pictures of a forest without tanks and then, on a later day, with tanks clearly sticking out from behind trees, and they trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures that had not been part of the training set. Just to make sure that the net was indeed recognizing partially hidden tanks, however, the researchers took more pictures in the same forest and showed them to the trained net. They were depressed to find that the net failed to discriminate between the new pictures of trees with tanks behind them and the new pictures of just plain trees. After some agonizing, the mystery was finally solved when someone noticed that the original pictures of the forest without tanks were taken on a cloudy day and those with tanks were taken on a sunny day. The net had apparently learned to recognize and generalize the difference between a forest with and without shadows! This example illustrates the general point that a network must share our commonsense understanding of the world if it is to share our sense of appropriate generalization.

Dreyfus’s What Computers Still Can’t Do is listed as a revision of his 1972 book, ⁠What Computers Can’t Do: A Critique of Artificial Reason, but the tank story is not in the 1972 book, only the 1992 one. (Dreyfus’s version is also quoted in the 2017 NYT article and Hillis1996’s Geography, Identity, and Embodiment in Virtual Reality, pg346.)

Laveen N. Kanal, in his Foreword to Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections (1991), discusses some early NN/tank research (predating not just LeCun’s convolutions but backpropagation):

…[Frank] Rosenblatt had not limited himself to using just a single Threshold Logic Unit but used networks of such units. The problem was how to train multilayer perceptron networks. A paper on the topic written by Block, Knight and Rosenblatt was murky indeed, and did not demonstrate a convergent procedure to train such networks. In 1962–63 at Philco-Ford, seeking a systematic approach to designing layered classification nets, we decided to use a hierarchy of threshold logic units with a first layer of “feature logics” which were threshold logic units on overlapping receptive fields of the image, feeding two additional levels of weighted threshold logic decision units. The weights in each level of the hierarchy were estimated using statistical methods rather than iterative training procedures [L.N. Kanal & N.C. Randall, “Recognition System Design by Statistical Analysis”⁠⁠, Proc. 19th Conf. ACM, 1964]. We referred to the networks as two layer networks since we did not count the input as a layer. On a project to recognize tanks in aerial photography, the method worked well enough in practice that the U.S. Army agency sponsoring the project decided to classify the final reports, although previously the project had been unclassified. We were unable to publish the classified results! Then, enamored by the claimed promise of coherent optical filtering as a parallel implementation for automatic target recognition, the funding we had been promised was diverted away from our electro-optical implementation to a coherent optical filtering group. Some years later we presented the arguments favoring our approach, compared to optical implementations and trainable systems, in an article titled “Systems Considerations for Automatic Imagery Screening” by T.J. Harley, L.N. Kanal and N.C. Randall, which is included in the IEEE Press reprint volume titled Machine Recognition of Patterns edited by A. Agrawala1977⁠⁠1⁠⁠. In the years which followed multilevel statistically designed classifiers and AI search procedures applied to pattern recognition held my interest, although comments in my 1974 survey, “Patterns In Pattern Recognition: 1968–1974” [IEEE Trans. on IT, 1974], mention papers by Amari and others and show an awareness that neural networks and biologically motivated automata were making a comeback. In the last few years trainable multilayer neural networks have returned to dominate research in pattern recognition and this time there is potential for gaining much greater insight into their systematic design and performance analysis…

While Kanal & Randall1964 matches in some ways, including the image counts, there is no mention of failure either in the paper or Kanal’s1991 reminiscences (rather, Kanal implies it was highly promising), there is no mention of a field deployment or additional testing which could have revealed overfitting, and given their use of binarizing, it’s not clear to me that their 2-layer algorithm even could overfit to global brightness; the photos also appear to have been taken at low enough altitude for there to be no clouds, and to be taken under similar (possibly controlled) lighting conditions. The description in Kanal & Randall1964 is somewhat opaque to me, particularly of the ‘Laplacian’ they use to binarize or convert to edges, but there’s more background in their “Semi-Automatic Imagery Screening Research Study and Experimental Investigation, Volume 1”, Harley, Bryan, Kanal, Taylor & Grayum1962 (mirror⁠), which indicates that in their preliminary studies they were already interested in prenormalization/​preprocessing images to correct for altitude and brightness, and the Laplacian, along with silhouetting and “lineness editing”, noting that “The Laplacian operation eliminates absolute brightness scale as well as low-spatial frequencies which are of little consequence in screening operations.”⁠⁠2⁠
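
To make the brightness point concrete, here is a minimal sketch of my own (not Kanal & Randall’s code; it assumes only NumPy/SciPy) of why a Laplacian-filtered, binarized input cannot carry a global sunny-vs-cloudy brightness cue: a uniform brightness offset is annihilated by the Laplacian before any thresholding happens.

```python
# A sketch (not Kanal & Randall's pipeline) of the point about Laplacian preprocessing:
# a uniform brightness change is invisible after a Laplacian filter, so a classifier
# fed Laplacian-filtered/binarized images cannot key on sunny-vs-cloudy brightness.
import numpy as np
from scipy.ndimage import laplace

rng = np.random.default_rng(0)
scene = rng.random((64, 64))       # stand-in for an aerial photograph
sunny = scene + 0.3                # the same scene, uniformly brighter ("sunny day")

print(scene.mean(), sunny.mean())  # global brightness differs...

edges_cloudy = laplace(scene)
edges_sunny = laplace(sunny)
assert np.allclose(edges_cloudy, edges_sunny)   # ...but the Laplacian output does not

# Any downstream binarization therefore sees identical inputs for both "days".
binary_cloudy = np.abs(edges_cloudy) > 0.1
binary_sunny = np.abs(edges_sunny) > 0.1
assert np.array_equal(binary_cloudy, binary_sunny)
```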

An anonymous reader says he heard the story in 1990:

I was told about the tank recognition failure by a lecturer on my 1990 Intelligent Knowledge Based Systems MSc, almost certainly Libor Spacek⁠, in terms of being aware of context in data sets; that being from (the former) Czechoslovakia he expected to see tanks on a motorway whereas most British people didn’t. I also remember reading about a project with DARPA funding aimed at differentiating Russian, European and US tanks where what the image recognition learned was not to spot the differences between tanks but to find trees, because of the US tank photos being on open ground and the Russian ones being in forests; that was during the same MSc course—so very similar to predicting tumours by looking for the ruler used to measure them in the photo—but I don’t recall the source (it wasn’t one of the books you cite though, it was either a journal article or another text book).

1980s

Chris Brew⁠ states (2017-10-16) that he “Heard the story in 1984 with pigeons instead of neural nets”.

1960s

Edward Fredkin, in an email to Eliezer Yudkowsky on 2013-02-26, recounts an interesting anecdote about the 1960s, claiming it to be the grain of truth:

By the way, the story about the two pictures of a field, with and without army tanks in the picture, comes from me. I attended a meeting in Los Angeles [at RAND?], about half a century ago [~1963?] where someone gave a paper showing how a random net could be trained to detect the tanks in the picture. I was in the audience. At the end of the talk I stood up and made the comment that it was obvious that the picture with the tanks was made on a sunny day while the other picture (of the same field without the tanks) was made on a cloudy day. I suggested that the “neural net” had merely trained itself to recognize the difference between a bright picture and a dim picture.

Evaluation

Sourcing

The absence of any hard citations is striking: even when a citation is supplied, it is invariably to a relatively recent source like Dreyfus, and then the chain ends. Typically for a real story, one will find at least one or two hints of a penultimate citation and then a final definitive citation to some very difficult-to-obtain or obscure work (which then is often quite different from the popularized version but still recognizable as the original); for example, another popular cautionary AI urban legend is that the 1956 Dartmouth workshop claimed that a single graduate student working for a summer could solve computer vision (or perhaps AI in general), which is a highly distorted and misleading description of the original 1955 proposal’s realistic claim regarding “a 2 month, 10 man study of artificial intelligence” that “a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”3 Instead, everyone either disavows it as an urban legend or possibly apocryphal, or punts to someone else. (Minsky’s 2011 version initially seems concrete, but while he specifically attributes the musical score story to a friend & claims to have found the trick personally, he is then as vague as anyone else about the tank story, saying it just “happened” somewhere “in the United States at one of our research institutions”, at an unmentioned institute by unmentioned people at an unmentioned point in time for an unmentioned branch of the military.)

Variations

Question to Radio Yerevan: “Is it correct that Grigori Grigorievich Grigoriev won a luxury car at the All-Union Championship in Moscow?”

Radio Yerevan answered: “In principle, yes. But first of all it was not Grigori Grigorievich Grigoriev, but Vassili Vassilievich Vassiliev; second, it was not at the All-Union Championship in Moscow, but at a Collective Farm Sports Festival in Smolensk; third, it was not a car, but a bicycle; and fourth he didn’t win it, but rather it was stolen from him.”

⁠“Radio Yerevan Jokes”⁠ (collected by Allan Stevo)

It is also interesting that not all the stories imply quite the same problem with the hypothetical NN. Dataset bias/selection effects is not the same thing as overfitting or disparate impact, but some of the storytellers don’t realize that. For example, in some stories, the NN fails when it’s tested on additional heldout data (overfitting), not when it’s tested on data from an entirely different photographer or field exercise or data source (dataset bias/distributional shift). Or, Alexander Harrowell cites disparate impact in a medical school as if it were an example of the same problem, but it’s not—at least in the USA, a NN would be correct in inferring that white students are more likely to succeed, as that is a real predictor (this would be an example of how people play rather fast and loose with claims of “algorithmic bias”), and it would not necessarily be the case that, say, randomized admission of more non-white students would be certain to increase the number of successful graduates; such a scenario is, however, possible and illustrates the difference between predictive models & causal models for control & optimization, and the need for experiments/reinforcement learning.
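
The overfitting-versus-dataset-bias distinction can be made concrete with a toy simulation (a sketch of mine using synthetic data and scikit-learn, not any system from the stories): when brightness is confounded with the label in the collection the photos were drawn from, a classifier passes a held-out split of that same collection, so it is not overfitting in the usual sense, yet it drops to near-chance on a new collection where the confound is absent.

```python
# Toy illustration of dataset bias vs. overfitting, with synthetic "photos" reduced
# to two features: mean brightness and a crude shape score. Entirely made up; it
# does not correspond to any real tank dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_photos(n, confounded):
    has_tank = rng.random(n) < 0.5
    shape_score = 0.2 * has_tank + 0.5 * rng.random(n)          # weak genuine signal
    if confounded:
        # tank photos shot on cloudy (dark) days, empty scenes on sunny (bright) days
        brightness = np.where(has_tank, 0.3, 0.7) + 0.05 * rng.standard_normal(n)
    else:
        brightness = 0.5 + 0.2 * rng.standard_normal(n)          # brightness uninformative
    return np.column_stack([brightness, shape_score]), has_tank.astype(int)

X, y = make_photos(200, confounded=True)
X_train, X_heldout, y_train, y_heldout = train_test_split(X, y, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out split of the biased collection:", clf.score(X_heldout, y_heldout))  # ~1.0

X_new, y_new = make_photos(200, confounded=False)
print("new, unconfounded collection:", clf.score(X_new, y_new))  # near chance: only the
# weak shape signal remains, swamped by the learned brightness weight
```

In the variants where the failure only appears on photos from a different photographer or field exercise, it is the second number that collapses; a plain overfitting story would instead predict failure already on the held-out split.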

A read of all the variants together raises more questions than it answers:

  • Did this story happen in the 1960s, 1980s, 1990s, or during Desert Storm in the 1990s?

  • Was the research conducted by the US military, or researchers for another NATO country?

  • Were the photographs taken by satellite, from the air, on the ground, or by spy cameras?

  • Were the photographs of American tanks, plywood cutouts, Soviet tanks, or Warsaw Pact tanks?

  • Were the tanks out in the open, under cover, or fully camouflaged?

  • Were these photographs taken in forests, fields, deserts, swamps, or all of them?

  • Were the photographs taken in same place but different time of day, same place but different days, or different places entirely?

  • Were there 100, 200, or thousands of photographs; and how many were in the training vs validation set?

  • Was the input in black-and-white binary, grayscale, or color?

  • Was the tell-tale feature either field vs forest, bright vs dark, the presence vs absence of clouds, the presence vs absence of shadows, the length of shadows, or an accident in film development unrelated to weather entirely?

  • Was the NN to be used for image processing or in autonomous robotic tanks?

  • Was it even a NN?

  • Was the dataset bias caught quickly within “a few hours”, later by a suspicious team member, later still when applied to an additional set of tank photographs, during further testing producing a new dataset, much later during a live demo for military officers, or only after live deployment in the field?

Almost every aspect of the tank story which could vary does vary.

Urban Legends

We could also compare the tank story with many of the characteristics of urban legends (of the sort so familiar from Snopes): they typically have a clear dramatic arc, involve horror or humor while playing on common concerns (distrust of NNs has been a theme from the start of NN research4), make an important didactic or moral point, claim to be true while sourcing remains limited to social proof such as the usual “friend of a friend” attributions, often try to associate with a respected institution (such as the US military), are transmitted primarily orally through social mechanisms & appear spontaneously & independently in many sources without apparent origin (most people seem to hear the tank story in unspecified classes, conferences, personal discussions rather than in a book or paper), exist in many mutually-contradictory variants, often with overly-specific details5 spontaneously arising in the retelling, have been around for a long time (it appears almost fully formed in Dreyfus1992, suggesting incubation before then), sometimes have a grain of truth (dataset bias certainly is real), and the full tank story is “too good not to pass along” (even authors who are sure it’s an urban legend can’t resist retelling it yet again for didactic effect or entertainment). The tank story matches almost all the usual criteria for an urban legend.

Origin

So where does this urban legend come from? The key anecdote appears to be Edward Fredkin’s, as it precedes all other excerpts except perhaps the research Kanal describes; but Fredkin’s story, in which he merely speculates that brightness was driving the results, does not confirm the tank story itself, much less all the extraneous details about photographic film being accidentally overdeveloped, robot tanks going berserk, or a demo failing in front of Army brass.

But it’s easy to see how Fredkin’s reasonable question could have memetically evolved into the tank story as finally fixed into published form by Dreyfus’s article:

  1. Setting: Kanal & Randall set up their very small simple early perceptrons on some tiny binary aerial photos of tanks, in interesting early work, and Fredkin attends the talk sometime around 1960–1963

  2. The Question: Fredkin then asks in the Q&A whether the perceptron is not learning square-shapes but brightness

  3. Punting: of course neither Fredkin nor Kanal & Randall can know on the spot whether this critique is right or wrong (perhaps that question motivated the binarized results reported in Kanal & Randall1964?), and the question remains unanswered

  4. Anecdotizing: but someone in the audience considers that an excellent observation about methodological flaws in NN research, and perhaps they (or Fredkin) repeats the story to others, who find it useful too, and along the way, Fredkin’s question mark gets dropped and the possible flaw becomes an actual flaw, with the punchline: “…and it turned out their NN were just detecting average brightness!”

    One might expect Kanal & Randall to rebut these rumors, if only by publishing additional papers on their functioning system, but by a quirk of fate, as Kanal explains in his preface, after their 1964 paper, the Army liked it enough to make it classified and then they were reassigned to an entirely different task, killing progress entirely. (Something similar happened to the best early facial recognition systems⁠⁠.)

  5. Proliferation: In the absence of any counternarrative (silence is considered consent), the tank story continues spreading.

  6. Mutation: but now the story is incomplete, a joke missing most of the setup to its punchline—how did these Army researchers discover the NN had tricked them and what was the brightness difference from? The various versions propose different resolutions, and likewise, appropriate details about the tank data must be invented.

  7. Fixation: Eventually, after enough mutations, a version reaches Dreyfus, already a well-known critic of the AI establishment, who then uses it in his article/book, virally spreading it globally to pop up in random places thenceforth, and fixating it as a universally-known ur-text. (Further memetic mutations can and often will occur, but diligent writers & researchers will ‘correct’ variants by returning to the Dreyfus version.)

One might try to write Dreyfus off as a coincidence and argue that the US Army must have had so many neural net research programs going that one of the others is the real origin, but one would expect those programs to result in spinoffs, more reports, reports since declassified, etc. It’s been half a century, after all. And despite the close association of the US military with MIT and early AI work, tanks do not seem to have been a major focus of early NN research—for example, Schmidhuber’s history⁠ does not mention tanks at all, and most of my paper searches kept pulling up NN papers about ‘tanks’ as in vats, such as controlling stirring/​mixing tanks for chemistry. Nor is it a safe assumption that the military always has much more advanced technology than the public or private sectors; often, they can be quite behind or at the status quo.⁠⁠6⁠

