You’ve probably seen FaceApp, the viral photo filter app that makes people look old or young or like another gender. It also offered an option to make the photo “hot,” which in practice just made everyone look white by lightening skin, rounding eyes, and shrinking noses, and just generally enforcing racist beauty standards. When Motherboard asked its creator what happened, Wireless Lab CEO Yaroslav Goncharov blamed the data used to train the FaceApp algorithm. “It is an unfortunate side-effect of the underlying neural network caused by the training set bias,” he said, “not intended behavior.” Wireless Lab renamed the feature “spark” and said it is fixing the code.
Wait. So where did this training data come from, and why was it racist?
Computer scientists have found that the fastest way to teach machines to do higher-order tasks is often through mimicry. Instead of explicitly coding what a face looks like (there’s a nose area, a chin area, and so on), they just give computers tens of thousands or hundreds of thousands of faces to process. “Here, computer. These are all faces. You figure out what they have in common.” Then comes the test: They present the computer with an image that isn’t in the database. “Hey, computer. You think you know what a face is. Is this a face?”
This process is a type of deep learning, and if you swap in data about how humans drive or a history of human Go matches, the same basic idea can also be used to train self-driving cars or teach a computer to play Go. Its use is expanding rapidly, and researchers, watchdogs, and the public are starting to pay more attention to the data being used to train these algorithms. Because if your training data has a blind spot, your algorithm will, too.
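To make the train-then-test idea and the blind-spot problem concrete, here is a minimal sketch in Python. It is not deep learning and not anyone's real system: it uses a toy nearest-centroid classifier on invented two-number "features," and every name and value in it is an assumption chosen for illustration. The model learns from examples, is checked on held-out examples, and is then shown a group it never saw during training.

```python
import random

def centroid(points):
    """Average a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(examples):
    """examples: list of (features, label). Returns one centroid per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, features):
    """Pick the label whose centroid is closest to the features."""
    return min(model, key=lambda label: distance(model[label], features))

def accuracy(model, examples):
    hits = sum(predict(model, f) == y for f, y in examples)
    return hits / len(examples)

def make_points(center, label, n, rng):
    """Invented data: n noisy points scattered around a center."""
    return [([center[0] + rng.gauss(0, 0.5), center[1] + rng.gauss(0, 0.5)], label)
            for _ in range(n)]

rng = random.Random(0)
data = (make_points((0, 0), "not_face", 200, rng)
        + make_points((5, 5), "face", 200, rng))
rng.shuffle(data)

# "Here, computer. These are all faces." Then test on images it has not seen.
train_set, held_out = data[:300], data[300:]
model = train(train_set)
held_out_accuracy = accuracy(model, held_out)

# Hypothetical blind spot: real faces that look nothing like the training faces.
unseen_group = make_points((-4, -4), "face", 100, rng)
unseen_accuracy = accuracy(model, unseen_group)

print(held_out_accuracy, unseen_accuracy)
```

In this toy setup the held-out score looks excellent because the held-out examples resemble the training examples, while the unseen group is misclassified almost every time. A high test score only describes the kinds of data the test set contains.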
Adam Geitgey, a machine learning consultant and the former director of engineering at Groupon, experienced this when he built a facial recognition system using a data set of 13,000 photos called “Labeled Faces in the Wild,” which was published by researchers at the University of Massachusetts.
“When I built the system it worked really well and got a really high accuracy, but as soon as I tried it on pictures of my kids, it didn’t work at all,” he told The Outline. “It had never seen a child before, so it couldn’t tell children apart.”
Companies like Amazon, Facebook, and Google generate their own training data through the products they provide their users. Alexa is learning from your voice commands, just as Facebook is learning from the photos you tag. But researchers at universities, nonprofits, and startups like FaceApp don’t have this option.
FaceApp’s Goncharov declined to say exactly where the data came from. “We assembled our own data set for training, but not ready to share additional details about it at this point,” he told The Outline in an email. However, it’s likely that FaceApp got at least some of its training data the same way most deep learning researchers do: from the internet.
As machine learning research accelerates, scientists have started pooling their resources. ImageNet is a popular data set produced by researchers at Stanford and Princeton that contains 14 million images grouped by nouns in synonym sets such as “kid, child,” “woman, adult female,” “office, business office.” Its creators expressed a desire to push the field forward. “The ImageNet project is inspired by a growing sentiment in the image and vision research field — the need for more data,” the site reads. “This is the motivation for us to put together ImageNet. We hope it will become a useful resource to our research community, as well as anyone whose research and education would benefit from using a large image database.”
ImageNet is one of many publicly available data sets made by corporations and research groups and released for free online for others to use in training algorithms. Labeled Faces in the Wild, the one Geitgey used in his grown-ups-only facial recognition project, is a popular data set of celebrity images collected from the web and labeled with the celebrity’s name. Microsoft Common Objects in Context is a set of more than 300,000 images of everyday objects in natural settings, with the objects highlighted, labeled, and classified. The TV News Channel Commercial Detection data set has 129,685 videos that can be used to detect and block ads. The Berkeley Multimodal Human Action Database (MHAD) has video of 55 people each performing 12 different actions. The BioID Face Database is 1,521 images with the eyes marked, while the Yale Face Database has images of 15 people making 11 different facial expressions. These sets are also called corpuses, and there are many more.
If none of these data sets fulfill the needs of the project, researchers can create their own. For something like FaceApp, Geitgey speculated, the researchers might have set up a “Hot or Not”-style website to get data about attractiveness, and paid laborers to use it through a service like Amazon’s popular Mechanical Turk. It’s also possible to just scrape data that’s already on the web. Researchers working on language projects, for example, sometimes download huge portions of Wikipedia or Google News to figure out how to make a chatbot sound natural.
The problem with this approach is that any bias in the training data can quickly proliferate across applications, and it may not be as noticeable as it was in FaceApp. Researchers have found that algorithms trained on this data have significant biases. In a recent Science paper, researchers trained an algorithm using data from Common Crawl, a data set of 5 billion webpages produced by a nonprofit, and then gave it the Implicit Association Test to test for bias. The algorithm associated African-American names with negative concepts and European names with positive concepts, and also linked female descriptors to concepts around family, while linking male descriptors more with concepts related to career.
“Our findings suggest that if we build an intelligent system that learns enough about the properties of language to be able to understand and produce it, in the process it will also acquire historical cultural associations, some of which can be objectionable,” the researchers wrote.
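One rough way to picture how a language model can carry associations like these is to compare word vectors. The sketch below is an assumption-laden illustration, not the paper's procedure or data: the three-number vectors are invented by hand, and the `association` score is a simplified measure of whether a word sits closer to one set of attribute words than to another.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def association(word, attrs_a, attrs_b, vectors):
    """Positive if `word` is closer to attribute set A, negative if closer to B."""
    to_a = mean(cosine(vectors[word], vectors[a]) for a in attrs_a)
    to_b = mean(cosine(vectors[word], vectors[b]) for b in attrs_b)
    return to_a - to_b

# Invented vectors for illustration only. Real embeddings have hundreds of
# dimensions and are learned from text, not written by hand.
TOY_VECTORS = {
    "she":    [0.9, 0.1, 0.2],
    "he":     [0.1, 0.9, 0.2],
    "family": [0.8, 0.2, 0.1],
    "home":   [0.7, 0.3, 0.2],
    "career": [0.2, 0.8, 0.1],
    "salary": [0.3, 0.7, 0.2],
}

family_terms = ["family", "home"]
career_terms = ["career", "salary"]

she_score = association("she", family_terms, career_terms, TOY_VECTORS)
he_score = association("he", family_terms, career_terms, TOY_VECTORS)
print(she_score, he_score)
```

With these made-up vectors, "she" scores toward the family words and "he" toward the career words. The point is only that such a lean is measurable once words are represented as numbers learned from text.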
The researchers also found those biases in Google Translate, which uses machine learning to improve translation. “Google Translate converts these Turkish sentences with gender-neutral pronouns: ‘O bir doktor. O bir hemşire.’ to these English sentences: ‘He is a doctor. She is a nurse,’” the researchers wrote.
Unless the training data is explicitly corrected for bias, these machine learning algorithms will continue to perpetuate harmful stereotypes, yet another mechanism of systemic racism and sexism. In an interview with the Harvard Business Review, Cathy O’Neil, a former data scientist and author of the book Weapons of Math Destruction, explained how this could happen.
“I do a thought experiment often with people where I’m imagining that Fox News has a machine learning algorithm to find anchors. And they define success as stay at Fox News for five years and get promoted twice. Now historically speaking, which is how you’re going to train this algorithm, we happen to know now that women were systematically prevented from succeeding at Fox News. So what will happen when you train a machine learning algorithm using that old data, it’ll recognize this pattern. And when you give it a new set of applicants for a new job as an anchor, it will basically be asking the question, who among these new applicants looks like somebody who is successful in the past. And we have reason to suspect that it would filter out the women.”
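O’Neil's thought experiment can be sketched in a few lines. Everything here is hypothetical: the history is invented, the "model" is just the observed success rate per group, and a real hiring system would use many features (some acting as proxies for gender) rather than the group label itself.

```python
from collections import defaultdict

def fit_success_rates(history):
    """history: list of (group, succeeded). Returns observed success rate per group."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for group, succeeded in history:
        totals[group] += 1
        wins[group] += int(succeeded)
    return {group: wins[group] / totals[group] for group in totals}

def shortlist(applicants, rates, threshold=0.5):
    """Keep applicants whose group 'looks like' past success."""
    return [name for name, group in applicants
            if rates.get(group, 0.0) >= threshold]

# Invented history in which one group was systematically held back.
history = ([("man", True)] * 70 + [("man", False)] * 30
           + [("woman", True)] * 10 + [("woman", False)] * 90)

rates = fit_success_rates(history)

# Hypothetical new applicants, identical except for group.
applicants = [("A", "woman"), ("B", "man"), ("C", "woman"), ("D", "man")]
picked = shortlist(applicants, rates)
print(rates, picked)
```

The code contains no rule about gender. It only asks who resembles people who succeeded before, and with this history that question screens out every woman.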
All this data comes from people, who have spilled their biases onto the web. The first challenge for machine learning engineers is just to get enough training data, period — an anemic data set will lead to very obvious problems right away. But in the rush to grab as much data as possible, researchers keep turning to the internet.
It makes perfect sense. The human race has been feeding knowledge into this digital repository, almost as if we knew that one day we would need incomprehensible amounts of data to train machine learning algorithms. But bias appears immediately if you use the internet as your data set, because access to and control of the internet are still dominated by the richer, more educated, English-speaking parts of the world.
Geitgey estimates that FaceApp’s creators would have needed at least 10,000 images of faces to train its algorithm, but hundreds of thousands would be much better. “When you’re building an application like this, especially if you’re a startup, getting the data is the hardest part,” he said. “Basically, you’re grabbing whatever you can get. That means that whatever the distribution of faces is in that data, that’s what’s going to end up in your system.”
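Since the distribution in the data becomes the distribution in the system, one basic precaution is to count who is in the data before training on it. The sketch below assumes each image already carries a group label, which scraped data often does not. The labels, the group names, and the 10 percent floor are all invented for illustration.

```python
from collections import Counter

def group_shares(labels, expected_groups):
    """Share of the data set belonging to each expected group (0.0 if absent)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: counts.get(group, 0) / total for group in expected_groups}

def underrepresented(shares, floor=0.10):
    """Groups whose share falls below an arbitrary, assumed floor."""
    return sorted(group for group, share in shares.items() if share < floor)

# Hypothetical labels for a scraped face data set.
scraped_labels = ["adult"] * 950 + ["child"] * 50
expected = ["adult", "child", "senior"]

shares = group_shares(scraped_labels, expected)
flagged = underrepresented(shares)
print(shares, flagged)
```

Listing the expected groups up front matters, because a group with zero examples never shows up in a plain count. That is the situation Geitgey ran into with photos of his kids.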
As we know, the internet reproduces the hate, bigotry, and cruelty of the real world. But in the rush to satisfy hungry machine learning algorithms, researchers haven’t fully reckoned with the consequences of relying on it as a data source. “We’re kind of in this situation where there is a lot of benefit to having these systems, but there is a lot of the data tainting how the systems work, and we don’t really know how to solve that yet,” Geitgey said.
In fact, there is some debate in the machine learning community over how much obligation scientists have to correct this type of bias. “A lot of the researchers feel like, ‘I’m not doing anything subjective. I’m building a completely objective system that’s just math. It just runs through data and just replicates whatever data you feed into it,’” Geitgey said. “So they get offended when they’re accused of being partial to a certain group or pushing forth ideas they don’t believe in.”
The reality is also that no real alternative has emerged yet. “The most data that exists is on the web, so that’s where you get the data from, so then the systems you build are going to replicate the patterns in that data,” he said. “And I don’t think anybody’s quite figured out exactly how to solve that problem.”
There is no other data source as large and varied as the web. Even the proprietary data that Amazon, Facebook, and Google hold comes in through the internet, and brings its own biases. Scientists need to figure out how to account for those biases if the human race wants to reap the benefits of higher-order systems without perpetuating prejudice.