Artificial intelligence still has severe limitations in recognizing what it's seeing

Artificial intelligence won’t take over the world any time soon, a new study suggests — it can’t even “see” properly. Yet.

Teapot with golf ball pattern used in the study.
Image credits: Nicholas Baker et al / PLOS Computational Biology.

Computer networks that draw on deep learning algorithms (often referred to as AI) have made huge strides in recent years. So much so that there is a lot of anxiety (or enthusiasm, depending on which side of the contract you find yourself on) that these networks will take over human jobs and other tasks that computers simply couldn’t perform up to now.

Recent work at the University of California, Los Angeles (UCLA), however, shows that such systems are still in their infancy. A team of UCLA cognitive psychologists showed that these networks identify objects in a fundamentally different manner from human brains, and that they are very easy to dupe.

Binary-tinted glasses

“The machines have severe limitations that we need to understand,” said Philip Kellman, a UCLA distinguished professor of psychology and a senior author of the study. “We’re saying, ‘Wait, not so fast.’”

The team explored how machine learning networks see the world in a series of five experiments. Keep in mind that the team wasn’t trying to fool the networks: they were working to understand how the networks identify objects, and whether that is similar to how the human brain does it.

For the first one, they worked with a deep learning network called VGG-19. It’s considered one of the (if not the) best networks currently developed for image analysis and recognition. The team showed VGG-19 altered color images of animals and objects. One image showed the surface of a golf ball displayed on the contour of a teapot, for example. Others showed a camel with zebra stripes or the pattern of a blue and red argyle sock on an elephant. The network was asked what it thought each picture most likely showed, in the form of a ranking (with the top choice being most likely, the second one less likely, and so on).
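To make the idea of a ranking concrete: image classifiers of this kind typically produce one raw score per category, and a softmax function turns those scores into probabilities that add up to one, which can then be sorted. The sketch below illustrates that step in plain Python. The labels and scores are invented for illustration; they are not taken from the study or from VGG-19 itself.

```python
import math

def rank_predictions(scores):
    """Turn raw per-label scores into (label, probability) pairs, most likely first."""
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    # Sort by probability, highest first.
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores for a teapot-shaped image covered in a golf ball texture.
example_scores = {"golf ball": 9.0, "teapot": 3.5, "baseball": 5.0, "elephant": 0.5}

ranking = rank_predictions(example_scores)
for label, prob in ranking:
    print(f"{label}: {prob:.2%}")
```

With these made-up numbers the texture-matching label comes out on top and the shape-matching label gets only a small share of the probability, which is the kind of outcome the researchers describe.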

Examples of the images used during this step.
Image credits Nicholas Baker et al., 2018, PLOS Computational Biology.

VGG-19, the team reports, listed the correct item as its first choice for only 5 out of the 40 images it was shown during this experiment (12.5% success rate). It was also interesting to see just how well the team managed to deceive the network. VGG-19 listed a 0% chance that the argyled elephant was an elephant, for example, and only a 0.41% chance that the teapot was a teapot. Its first choice for the teapot image was a golf ball, the team reports.

Kellman says he isn’t surprised that the network suggested a golf ball — calling it “absolutely reasonable” — but was surprised to see that the teapot didn’t even make the list. Overall, the results of this step hinted that such networks draw on the texture of an object much more than its shape, says lead author Nicholas Baker, a UCLA psychology graduate student. The team decided to explore this idea further.

Missing the forest for the trees

For the second experiment, the team showed images of glass figurines to VGG-19 and a second deep learning network called AlexNet. Both networks were trained to recognize objects using a database called ImageNet. While VGG-19 performed better than AlexNet, they were still both pretty terrible. Neither network could correctly identify the figurines as their first choice: an elephant figurine, for example, was ranked with almost a 0% chance of being an elephant by both networks. On average, AlexNet ranked the correct answer 328th out of 1,000 choices.
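A figure like “328th out of 1,000” is an average rank: for each image, find where the correct label sits in the network’s ranking, then average those positions. Here is a minimal sketch of that calculation, assuming each result is stored as a pair of the correct label and the ranked list of guesses. The rankings are hypothetical and heavily shortened (a real ImageNet ranking has 1,000 entries); only the first choices for the goose and the polar bear echo the ones reported in the article.

```python
def rank_of(correct_label, ranked_labels):
    """Return the 1-based position of the correct label in a ranked list."""
    return ranked_labels.index(correct_label) + 1

def mean_rank(results):
    """Average the rank of the correct label over (correct_label, ranked_labels) pairs."""
    ranks = [rank_of(correct, ranked) for correct, ranked in results]
    return sum(ranks) / len(ranks)

# Hypothetical, shortened rankings for three figurines.
results = [
    ("goose", ["website", "vase", "goose", "lamp"]),
    ("polar bear", ["can opener", "polar bear", "bottle", "vase"]),
    ("elephant", ["vase", "lamp", "bottle", "elephant"]),
]

average = mean_rank(results)
print(average)
```

A lower average rank is better; a network that always put the right answer first would score exactly 1.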

Well, they’re definitely glass figurines to you and me. Not so obvious to AI.
Image credits Nicholas Baker et al / PLOS Computational Biology.

In this experiment, too, the networks’ first choices were pretty puzzling: VGG-19, for example, chose “website” for a goose figure and “can opener” for a polar bear.

“The machines make very different errors from humans,” said co-author Hongjing Lu, a UCLA professor of psychology. “Their learning mechanisms are much less sophisticated than the human mind.”

“We can fool these artificial systems pretty easily.”

For the third and fourth experiments, the team focused on contours. First, they showed the networks 40 drawings outlined in black, with the images in white. Again, the networks did a pretty poor job of identifying common items (such as bananas or butterflies). In the fourth experiment, the researchers showed both networks 40 images, this time in solid black. Here, the networks did somewhat better: they listed the correct object among their top five choices around 50% of the time. They identified some items with good confidence (99.99% chance for an abacus and 61% chance for a cannon from VGG-19, for example) while they simply dropped the ball on others (both networks gave a white hammer outlined in black less than a 1% chance of being a hammer).

Still, it’s undeniable that both algorithms performed better during this step than in any of the earlier ones. Kellman says this is likely because the images here lacked “internal contours”, edges that confuse the programs.
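The “top five” measure used for the silhouettes counts an image as a hit if the correct label appears anywhere among the network’s first five guesses. The sketch below shows one way to compute it, again assuming results are stored as pairs of the correct label and a ranked list of guesses. All rankings here are invented for illustration and do not reproduce the study’s data.

```python
def top_k_accuracy(results, k):
    """Fraction of images whose correct label appears among the first k ranked guesses."""
    hits = sum(1 for correct, ranked in results if correct in ranked[:k])
    return hits / len(results)

# Hypothetical rankings for four silhouettes.
results = [
    ("abacus", ["abacus", "comb", "rake", "fence", "ladder", "grille"]),
    ("cannon", ["cannon", "telescope", "rifle", "barrel", "plow", "cart"]),
    ("hammer", ["hatchet", "nail", "hammer", "broom", "shovel", "paddle"]),
    ("banana", ["hook", "sickle", "bow", "slug", "horn", "crescent"]),
]

top1 = top_k_accuracy(results, k=1)
top5 = top_k_accuracy(results, k=5)
print(top1, top5)
```

With k set to 1, this is the same measure as the first-choice result from the first experiment, where 5 correct answers out of 40 images gives 5 / 40 = 12.5%.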

Throwing a wrench in

Now, in experiment five, the team actually tried to throw the networks off their game as much as possible. They worked with six images that VGG-19 identified correctly in the previous steps, scrambling them to make them harder to recognize while preserving some pieces of the objects shown. They also employed a group of ten UCLA undergrads as a control group.

The students were shown objects in black silhouettes — some scrambled to be difficult to recognize and some unscrambled, some objects for just one second, and some for as long as the students wanted to view them. Students correctly identified 92% of the unscrambled objects and 23% of the scrambled ones when allowed a single second to view them. When the students could see the silhouettes for as long as they wanted, they correctly identified 97% of the unscrambled objects and 37% of the scrambled objects.

Example of a silhouette (a) and scrambled image (b) of a bear.
Image credits Nicholas Baker et al / PLOS Computational Biology.

VGG-19 correctly identified five of these six images (and was quite close on the sixth, too, the team writes). The team says humans probably had more trouble identifying the images than the machine because we observe the entire object when trying to determine what we’re seeing. Artificial intelligence, in contrast, works by identifying fragments.

“This study shows these systems get the right answer in the images they were trained on without considering shape,” Kellman said. “For humans, overall shape is primary for object recognition, and identifying images by overall shape doesn’t seem to be in these deep learning systems at all.”

The results suggest that right now, AI (as we know and program it) is simply too immature to actually face the real world. It’s easily duped, and it works differently than we do, so it’s hard to intuit how it will behave. Still, understanding how such networks ‘see’ the world around them would be very helpful as we move forward with them, the team explains. If we know their weaknesses, we know where we need to put in the most work to make meaningful strides.

The paper “Deep convolutional networks do not classify based on global object shape” has been published in the journal PLOS Computational Biology.