A Real Story about Bias in Machine Learning
Several widely reported cases have shown how bias can affect Artificial Intelligence (AI) and Machine Learning (ML). Trained models often merely replicate what they “see” happened in the past (because of bad data) or guess (because of missing data), with no awareness of the problems this can cause. Two types of bias commonly creep into ML:
- Bad Data: The information used to train your ML model is itself biased, so the model learns an unwanted behavior. For example, a model that approves or rejects loan applications may be trained on decisions from loan officers who did not approve loans to certain demographic groups.
- Missing Data: No training data set is available that represents the actual data/population the model will be used with, which leads to incorrect or random outcomes. For example, cameras outfitted with a blink detector feature incorrectly “thought” the eyes of smiling Asian faces were blinking.
CVP has been using ML for over a decade. This experience has led us to develop a multi-angle approach for controlling bias using techniques like:
- Removal of demographic data that could be used improperly, as well as proxies for that data. For example, if ZIP codes are in the dataset, the ML model may be taught to reinvent redlining, the practice of blocking a specific demographic group from receiving home loans by excluding certain geographic areas.
- Subject matter expert review of both our input attributes and predictions, to detect any potential bias or evidence of unequal outcomes across different dimensions.
- AI explainability and interpretability techniques to examine which factors drove the predictions and to see if any improper factors are being taken into account.
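As a rough illustration of the first two techniques (not the code used on any CVP project), the sketch below strips protected and proxy fields from a record before training, then reports the share of positive predictions per group so a reviewer can compare outcomes. The field names and the contents of `PROTECTED` and `PROXIES` are hypothetical; deciding which attributes belong in those sets is itself a job for subject matter experts.

```python
# Hypothetical field names, chosen only for illustration.
PROTECTED = {"race", "gender", "age"}
PROXIES = {"zip_code"}  # can stand in for a protected attribute


def strip_sensitive(record, blocked=PROTECTED | PROXIES):
    """Return a copy of the record without protected or proxy fields."""
    return {key: value for key, value in record.items() if key not in blocked}


def positive_rate_by_group(records, predictions, group_key):
    """Share of positive predictions per group, for a reviewer to compare."""
    totals, positives = {}, {}
    for record, prediction in zip(records, predictions):
        group = record[group_key]
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if prediction else 0)
    return {group: positives[group] / totals[group] for group in totals}
```

Note that the audit step still needs the demographic field, so in this sketch it is kept alongside the data for review even though it is removed from the model's inputs.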
Even with all of this, bias can still slip through. CVP recently did an AI proof-of-concept for a client where this problem manifested. Understanding what happened and how to control it was enlightening. The following is what we learned.
The goal of the project was related to something most people are familiar with: seat belts. Though seat belt compliance in 2020 is generally good, in some industries there are still workers who do not routinely buckle up or do not get the audible warnings (for not using a seat belt) that most passenger vehicles provide. Our client wanted to find out whether we could build a seat belt detector app that tells whether a person is wearing a seat belt and sounds an alarm if they are not. Essentially, the app takes photos using a smartphone’s camera and passes each image to an ML model that is trained to detect three classes:
- Person in the Frame: Belted
- Person in the Frame: No Belt
- No Person in the Frame
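The decision logic around those three classes can be sketched as follows. This is a minimal illustration, not the app's actual code: it assumes the model returns one score per class, and the `scores` dictionary and function names are hypothetical.

```python
from enum import Enum


class FrameClass(Enum):
    BELTED = "Person in the Frame: Belted"
    NO_BELT = "Person in the Frame: No Belt"
    NO_PERSON = "No Person in the Frame"


def pick_class(scores):
    """Return the class with the highest score.

    `scores` is a hypothetical dict mapping each FrameClass to a model score.
    """
    return max(scores, key=scores.get)


def should_sound_alarm(scores):
    """Alarm only when the top-scoring class is an unbelted person."""
    return pick_class(scores) is FrameClass.NO_BELT
```

An empty frame does not trigger the alarm in this sketch, which is why "No Person in the Frame" needs to be its own class rather than being folded into "No Belt".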
Much of the work in creating an ML model is actually in the collection of data. Because this was a proof-of-concept project only, we began (as most mad scientists do) by using ourselves as test subjects.
Above, there are two images from our training set: on the left is Class 1 (Person in the Frame: Belted) and on the right is Class 2 (Person in the Frame: No Belt). To help the model generalize, we made sure to take photos with different cameras, from different angles, in different cars, and with different color seat belts. Preliminary results were promising, with the model achieving about 80% accuracy on images from the test set. Excited, we tried it out with someone who wasn’t on the Development Team. Lo and behold, the accuracy dropped precipitously. We immediately realized the problem: our data set was not representative of the overall population, and the model was having trouble with people of different shapes. As you can see from the images above, our test subject was what a demographer would call a “skinny white guy,” and anyone who didn’t fit into that category wasn’t being handled accurately. We needed to broaden our data set!
Enlisting the help of family members, we took more photos of different people to add to the training set, forcing the model to generalize further. (Don’t worry, no keys were in the vehicle when the unlicensed driver you see to the right was photographed.) After adding a couple hundred more images, the performance of our model improved greatly across different ages and genders. Success!
But there was a problem. Family members (unsurprisingly) have a strong resemblance to one another, and in further testing, our model really didn’t work for the general population: it failed unpredictably for co-workers who tested it. We realized we still had a form of bias and needed to broaden the base further.
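One way to surface this kind of problem earlier is to hold out whole people, not individual photos, when building the test set, so that accuracy is always measured on faces the model has never seen. The sketch below is an illustration under that assumption and not the procedure used on this project; the `(subject_id, image_path)` layout is hypothetical. Libraries such as scikit-learn offer a comparable group-aware splitter (`GroupShuffleSplit`).

```python
import random


def split_by_subject(photos, test_fraction=0.25, seed=0):
    """Split photos so that no person appears in both train and test.

    `photos` is a list of (subject_id, image_path) pairs.
    """
    subjects = sorted({subject for subject, _ in photos})
    rng = random.Random(seed)
    rng.shuffle(subjects)

    # Reserve at least one whole subject for testing.
    n_test = max(1, round(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])

    train = [p for p in photos if p[0] not in test_subjects]
    test = [p for p in photos if p[0] in test_subjects]
    return train, test
```

A per-photo random split would let the same person land on both sides, which tends to overstate accuracy when the people in the data set look alike.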
CVP values diversity in our employees for the strength it brings to our projects. We called upon this diversity via a company-wide Slack channel to get even more photos:
This worked! We got several dozen more photos of people of all ages, genders, and ethnic groups and the model actually started predicting reliably across different ethnicities. Feeling proud of ourselves, we prepared for the first demonstration to the client, where we ran into another problem: clothing.
Two years ago, CVP adopted a casual dress code policy for our offices, which means if we don’t have a client meeting or event, we’re generally dressed pretty casually. This led to another form of bias: our model didn’t work well with people who didn’t dress like we did. At our client demo, two of the participants wearing collared shirts and ties tried out the app. The diagonal lines of a collar and vertical silk straps (ties) on their bodies confused the model, so back to the lab we went. We moved on to including a broader set of clothing and spent the weekend playing dress-up with things like collared shirts and ties of different colors and patterns:
After this, and some other tweaks, we were able to create a model that worked on 50 different men and women dressed casually and formally, with 98% accuracy on a given frame. Overall, it had an uncanny ability to flag people who forgot their seat belt, doing so within 50 milliseconds.
If you’ve read this far, you are probably starting to get the idea: ML models are not “smart” in any general sense, yet they readily uncover creative ways to discriminate. Along with the biases noted so far, we found several others that sent us back to add more data in later iterations:
- Facial Expressions: The model struggled with smiling faces, so we had to take both happy and sad photos.
- Accessories on the Face: We added photos with glasses, hats, etc.
- Body Position: The model struggled with people who drove with one arm up or out to the side, so we had to take photos that looked like we were waving at the camera.
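Gaps like these can sometimes be spotted before training by tagging each photo with a few attributes and counting how often each value appears. The sketch below assumes such tags exist; the attribute names, the expected values, and the `min_count` threshold are all hypothetical, and it only catches the biases someone has already thought to tag.

```python
from collections import Counter


def coverage_gaps(metadata, attribute, expected_values, min_count):
    """Return expected values of `attribute` that appear fewer than
    `min_count` times, mapped to how many photos actually have them.

    `metadata` is a list of dicts, one per photo, e.g.
    {"clothing": "formal", "expression": "smiling"}.
    """
    counts = Counter(photo[attribute] for photo in metadata)
    return {
        value: counts.get(value, 0)
        for value in expected_values
        if counts.get(value, 0) < min_count
    }
```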
Even a four-year-old child can tell if someone is wearing a seat belt or not with relatively little bias, so we thought this project would be straightforward. But the real-world implementation of this model demonstrated that this was not the case. Obvious (ethnic group) and non-obvious (clothing) biases snuck in. We had to ensure that we tested constantly in a variety of ways, remained open to tweaking our approach, and remembered to include different people in the process to check our conscious and unconscious biases.
AI bias is a real problem in the world, and the speed at which AI can inadvertently pick up a bias and amplify it is a big concern. What could have happened if we hadn’t caught the factors behind these biases before the app went into wide distribution?
- If the model didn’t catch the lack of seat belt use in certain groups (false negatives), there could have been a higher fatality rate in those groups.
- If the model mishandled certain groups by issuing erroneous warnings (false positives), drivers might decide the device is faulty and turn it off or ignore it, which could also lead to a higher fatality rate.
We learned that if the model had a higher error rate in either direction for a limited, unknown portion of the population, and use of the app were still mandated, we could make the problem worse!
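Both failure modes can be measured separately for each group, which makes an uneven error rate visible instead of leaving it hidden inside a single overall accuracy number. The sketch below is illustrative only; the tuple layout and group labels are hypothetical, and here a "positive" means the person is actually unbelted.

```python
def error_rates_by_group(samples):
    """Compute false negative and false positive rates per group.

    `samples` is an iterable of (group, actually_unbelted, predicted_unbelted).
    A false negative is a missed unbelted person; a false positive is an
    alarm for someone who was belted.
    """
    stats = {}
    for group, actual, predicted in samples:
        s = stats.setdefault(group, {"fn": 0, "fp": 0, "pos": 0, "neg": 0})
        if actual:
            s["pos"] += 1
            if not predicted:
                s["fn"] += 1
        else:
            s["neg"] += 1
            if predicted:
                s["fp"] += 1

    return {
        group: {
            # None when a group has no examples of that kind to measure.
            "fnr": s["fn"] / s["pos"] if s["pos"] else None,
            "fpr": s["fp"] / s["neg"] if s["neg"] else None,
        }
        for group, s in stats.items()
    }
```

A large gap between groups in either rate is the signal to go back and collect more data for the group that is being handled poorly.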
Now think of all the other situations in which AI is used and how unintended bias could cause huge problems:
- Detecting and responding to terrorists;
- Automating approval of life-changing financial transactions;
- Scoring high-stakes standardized tests, such as the GMAT, which uses an automated scorer.
At CVP, we think Artificial Intelligence has been and will continue to be a net positive for the world, but as with any new technology, it must be programmed and used properly in order to reduce the risk of harm. While organizations can learn how to put together a basic AI model from an online course, some steps required are not easy:
- Spotting different types of bias in your data;
- Creating proper validation and test data sets;
- Determining the right performance measure for your model;
- Integrating your trained model with existing systems and putting it into production.
If you are an organization looking to apply AI capabilities and want to avoid the pitfall of bias (and other problems), make sure you enlist an expert like CVP!