Elebenty gazillion readers sent me this story (with diverse links) over the last ten days. Thanks to all of you, who are now too numerous to name. (See links at The New York Times, Wired Science, and the BBC, among others.)
The interesting results are encoded in scientific language in the abstract of a yet-unpublished manuscript by Le et al. (reference and link below). If you’re like me, the following won’t make a ton of sense:
We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200×200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.
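For readers curious about what a "sparse autoencoder" actually is: it's a network trained to reconstruct its own input while keeping its internal units mostly quiet, so the units are forced to discover compact features of the data on their own, with no labels. The toy sketch below shows the core idea with a single hidden layer on made-up data (the real model had nine layers, a billion connections, and ran on 16,000 cores; everything here, from the layer sizes to the stand-in data, is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 64, 16               # e.g. flattened 8x8 patches -> 16 features
rho, beta, lr = 0.05, 0.1, 0.1     # target activation, sparsity weight, step

W1 = rng.normal(0, 0.1, (n_hid, n_in)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_in, n_hid)); b2 = np.zeros(n_in)
X = rng.random((200, n_in))        # stand-in data; the paper used video frames

def loss_and_grads(X):
    m = len(X)
    H = sigmoid(X @ W1.T + b1)     # encode
    Xhat = H @ W2.T + b2           # linear decode (reconstruct the input)
    rho_hat = H.mean(axis=0)       # average activation of each hidden unit
    recon = 0.5 * np.mean(np.sum((Xhat - X) ** 2, axis=1))
    # KL penalty pushes each unit's average activation toward rho (sparsity)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    dXhat = (Xhat - X) / m
    dW2, db2 = dXhat.T @ H, dXhat.sum(0)
    dH = dXhat @ W2 + beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / m
    dZ = dH * H * (1 - H)          # backprop through the sigmoid
    dW1, db1 = dZ.T @ X, dZ.sum(0)
    return recon + beta * kl, (dW1, db1, dW2, db2)

loss0, _ = loss_and_grads(X)
for _ in range(300):               # plain gradient descent; paper used async SGD
    _, (dW1, db1, dW2, db2) = loss_and_grads(X)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
loss1, _ = loss_and_grads(X)
print(loss1 < loss0)               # cost drops: features learned without labels
```

Note that nothing in the training loop ever sees a label; the only signal is "reconstruct your input, sparsely," which is what lets the real network discover cats unprompted.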
At any rate, the tale is this: Google scientists interested in face-recognition software set up an immensely complicated artificial neural network, containing 16,000 computer processors, and fed it random images from YouTube. And what came out? Cat recognition! (Note: NOT squid recognition.) Cats, of course, are everywhere on YouTube.
The NYT notes:
Presented with 10 million digital images found in YouTube videos, what did Google’s brain do? What millions of humans do with YouTube: looked for cats.
The neural network taught itself to recognize cats [JAC: this took 3 days], which is actually no frivolous activity. This week the researchers will present the results of their work at a conference in Edinburgh, Scotland. The Google scientists and programmers will note that while it is hardly news that the Internet is full of cat videos, the simulation nevertheless surprised them. It performed far better than any previous effort by roughly doubling its accuracy in recognizing objects in a challenging list of 20,000 distinct items. . .
To find them, the Google research team, led by the Stanford University computer scientist Andrew Y. Ng and the Google fellow Jeff Dean, used an array of 16,000 processors to create a neural network with more than one billion connections. They then fed it random thumbnails of images, one each extracted from 10 million YouTube videos.
The videos were selected randomly and that in itself is an interesting comment on what interests humans in the Internet age. However, the research is also striking. That is because the software-based neural network created by the researchers appeared to closely mirror theories developed by biologists that suggest individual neurons are trained inside the brain to detect significant objects. . .
“We never told it during the training, ‘This is a cat,’ ” said Dr. Dean, who originally helped Google design the software that lets it easily break programs into many tasks that can be computed simultaneously. “It basically invented the concept of a cat. We probably have other ones that are side views of cats.”
The Google brain assembled a dreamlike digital image of a cat by employing a hierarchy of memory locations to successively cull out general features after being exposed to millions of images. The scientists said, however, that it appeared they had developed a cybernetic cousin to what takes place in the brain’s visual cortex.
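That "dreamlike digital image of a cat" is, roughly, the input that most strongly excites a particular unit in the trained network, found by nudging an image, pixel by pixel, in whatever direction raises that unit's response. Here is a toy sketch of the idea using a single made-up linear unit (the weights and sizes are stand-ins; the paper ran the same kind of numerical search through its full deep network):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=64)            # a stand-in "trained" neuron's weights

# Gradient ascent on the INPUT: start from noise and repeatedly push the
# "image" in the direction that raises the unit's response, while keeping
# it on the unit sphere so it can't just grow without bound.
x = rng.normal(size=64)
x /= np.linalg.norm(x)
for _ in range(100):
    x += 0.1 * w                   # gradient of the linear response w.x
    x /= np.linalg.norm(x)         # constrain the image to fixed norm

# For one linear unit the optimum is analytically w/||w||; for a deep
# network the same ascent is done numerically via backpropagation.
print(np.allclose(x, w / np.linalg.norm(w)))
```

Run on the network's top-level "cat" unit, this procedure is what yields the ghostly composite cat face: an average of everything catlike the unit learned to respond to.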
The BBC notes:
The work of the team stands at odds with many image-recognition techniques, which depend on telling a computer to look for specific features of a target object before any are presented to it.
By contrast, the Google machine knew nothing about the images it was to see. However, its 16,000 processing cores ran software that simulated the workings of a biological neural network with about one billion connections.
In a similar way nerves in brains are heavily interconnected and it is believed that “recognition” involves the triggering of a specific pathway through that thicket of connections.
Pathways for particular objects, people or other stimuli are thought to be built up as organisms learn about the world. Some neuroscientists speculate that parts of the human visual system become so specialised they recognise very specific subjects such as a person’s grandmother or their cat.
As millions of images were analysed by Google’s network of silicon nerves, some parts of it started to react to specific elements in those pictures.
After three days and 10 million images the network could spot a cat, even though it had never been told what one looked like.
Although the work at first seems useless, it isn’t. We learn to recognize objects by repeated exposure to them. And it’s always been a mystery to scientists how we’re able to form and remember images of people and friends whom we repeatedly see. (When I was younger, my father used to ask me the question, “Try to imagine a face that you’ve never seen before.” Try it—it’s not easy!) Some day face-recognition software will be everywhere: identifying you before letting you into secure facilities, helping police solve crimes, and so on.
And of course it will also give us a clue about how our brain works when recognizing faces. We can easily clue in on human faces, but not so easily on the faces of individuals from different species, which may be nearly as distinct from each other as one human is from others. And there’s an evolutionary reason for that: it was crucial for our group-living ancestors to recognize not only kin but other groupmates (who helped us before, and who was bad to us?), and to discriminate group-mates from potentially hostile members of outgroups.
Le, Q. V., et al. 2012. Building high-level features using large scale unsupervised learning. Manuscript at Cornell University Library. (free pdf download at link).