What makes a teapot a teapot? Is it the shape or size, its texture or function? As far as I know, there is no Theory of Teapots prescribing the criteria something must meet to be properly considered a teapot. And yet, I know a teapot when I see a teapot, even if it shares some characteristics with non-teapots, or is a teapot but also something else, like a teapot in the shape of R2D2. This ability I have to imagine teapots that are not precisely teapots is what makes me, or really any human, vastly better than even the best visual recognition AIs.
Computer vision is one of the most widely used applications of artificial intelligence today, powering everything from facial recognition software to autonomous vehicles to porn identification on Tumblr. But despite intense focus on the field by tech giants like Amazon and Google, computer vision isn’t flawless: Juggalos evade facial recognition, widespread adoption of autonomous vehicles is perpetually a few years away, and machines haven’t really figured out the whole art/porn distinction. With more and more resources being directed at this problem, it’s becoming clear that these challenges aren’t simply the result of a lack of processing power; perfecting computer vision may require us to understand perception from a machine’s point of view.
In a study published in PLOS Computational Biology last month, a group of cognitive psychologists from UCLA conducted a series of image recognition experiments to shed light on the differences between human sight and computer vision. They found that, despite incredible advances in computer vision over the past decade, certain image recognition tasks still confound deep convolutional neural networks (DCNNs, the architecture used for most computer vision systems).
To test the competency of such a system, the researchers fed surreal images of animals and objects to VGG-19, a top-of-the-line image recognition neural network. The images all had the basic outline of an object or animal but were filled with an incongruous texture: an otter made to look like a speedometer; a patchwork-quilt elephant; a teapot stamped like a golf ball. While a bit disorienting, the images are easily recognizable to us. But the neural network struggled with these categorizations, assigning only a 41 percent probability that the teapot was a teapot, and a 0 percent chance that the elephant was, in fact, an elephant.
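Where do figures like "41 percent teapot" come from? Classifiers like VGG-19 end in a softmax layer that converts raw class scores into a probability distribution over their categories. Here is a minimal numpy sketch of that last step; the class names and scores below are invented for illustration, not real model output (VGG-19 actually scores 1,000 ImageNet categories at once):

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    z = logits - np.max(logits)      # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

# Hypothetical raw scores for three of the network's categories
classes = ["teapot", "golf ball", "elephant"]
logits = np.array([2.0, 2.4, -5.0])  # made-up numbers, not VGG-19's output

probs = softmax(logits)
for name, p in zip(classes, probs):
    print(f"{name}: {p:.0%}")
```

With these invented scores, "golf ball" edges out "teapot" even though both get substantial probability, which is roughly the shape of the result the researchers reported.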
The DCNN did guess that the teapot was a golf ball, suggesting that texture was a key feature for its recognition system. For people, shape is usually the most important identifying factor, because we perceive objects globally: a flamingo, for example, is not a pink body, long neck, head, and wings perched upon a single leg, but one cohesive thing (a weird bird). AI systems, on the other hand, seem to categorize images by breaking them down into discrete parts. To a neural network, a flamingo is an assemblage of appendages; the network depends more on local features than on global ones.
The piecemeal approach the network uses to identify images became even clearer in a later experiment. The researchers picked out six images that the network had properly identified, cut each one up into a grid, and scrambled the pieces before feeding the altered images back to the network. The network correctly identified five of the six scrambled images, while the ten human study participants correctly identified only 36 percent of them.
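The study doesn't publish its scrambling code, but the general idea is easy to sketch: cut the image into a grid of tiles and shuffle the tiles. Local texture inside each tile survives; the global shape is destroyed. A minimal numpy version, assuming a square image whose side is divisible by the grid size:

```python
import numpy as np

def scramble(img, tiles_per_side, rng):
    """Cut a square image into an n-by-n grid of tiles and shuffle them."""
    n = tiles_per_side
    h, w = img.shape[0] // n, img.shape[1] // n
    # Slice the image into n*n tiles, row by row
    tiles = [img[r*h:(r+1)*h, c*w:(c+1)*w] for r in range(n) for c in range(n)]
    # Shuffle the tiles with a random permutation
    order = rng.permutation(len(tiles))
    tiles = [tiles[i] for i in order]
    # Reassemble the shuffled tiles into an image of the same size
    rows = [np.hstack(tiles[r*n:(r+1)*n]) for r in range(n)]
    return np.vstack(rows)

rng = np.random.default_rng(0)
img = np.arange(36).reshape(6, 6)  # stand-in for a real grayscale photo
out = scramble(img, 3, rng)
```

Every pixel of the original survives in the output; only their global arrangement changes, which is exactly why a texture-driven network shrugs the scrambling off while a shape-driven human is lost.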
“This study shows these systems get the right answer in the images they were trained on without considering shape,” study author Philip Kellman told UCLA Newsroom. “For humans, overall shape is primary for object recognition, and identifying images by overall shape doesn’t seem to be in these deep learning systems at all.”
That’s not to say artificial intelligence is just, well, dumber than we are. In certain arenas, like chess, AI is far superior to people. But AI dominance is restricted to highly structured, rule-based domains. Something as simple as pointing out stop signs in an image can stump a machine – that’s why these kinds of recognition tasks are used in CAPTCHAs to prove one’s humanity (and, perversely, to provide training data for a neural network). Your human brain can take everything it knows about teapots and golf balls and successfully identify a teapot shaped or patterned like a golf ball; so far, AIs don’t have that ability to reason.
When it comes to image recognition, there’s a level of abstraction that we haven’t figured out how to encode in machines. Neural networks, after all, are trained using a technique called supervised learning, where the network is fed gobs of labeled data and “learns” from the examples it is given. If you were going to train a network to recognize stop signs, you would feed it a ton of pictures of stop signs, and a bunch of non-stop-sign images, and tweak its parameters until it learned to recognize stop signs based on the patterns it detects in the images. What makes this training process so difficult is that you need the network to pick up enough signal from the data that it recognizes stop signs — the concept — but not just those particular stop signs in the dataset — the examples meant to illustrate the concept.
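That loop — show labeled examples, measure the error, nudge the parameters — can be sketched with a toy linear classifier. Everything here is synthetic and drastically simplified (a real DCNN has millions of parameters and many layers), but the skeleton is the same:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "images": 20-pixel vectors. Label is 1 if the average pixel
# value is positive, else 0 -- a stand-in for "stop sign vs. not".
X = rng.normal(size=(200, 20))
y = (X.mean(axis=1) > 0).astype(float)

w = np.zeros(20)   # one weight per "pixel"
b = 0.0
lr = 0.5

for _ in range(500):                    # gradient descent on logistic loss
    p = 1 / (1 + np.exp(-(X @ w + b)))  # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)     # how much each weight should move
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

preds = (1 / (1 + np.exp(-(X @ w + b)))) > 0.5
acc = (preds == y).mean()
print(f"training accuracy: {acc:.0%}")
```

The catch the paragraph describes shows up even here: high accuracy on the training set only proves the weights fit these 200 examples. Whether they capture the concept, rather than quirks of the dataset, can only be checked on images the model has never seen.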
People are able to pick out what makes a thing that thing – its essence — and abstract that to other objects that have the same essence, but different contingent properties. This is the ultimate goal of computer vision, a modest step that would represent a huge leap for computerkind: to be able to look at a teapot that kind of looks like a golf ball and still recognize it as a teapot.