Using AI to Automate Detection of Fake News
Fake news is a growing problem across news outlets and social media. CVP’s team of over 40 data scientists wanted to investigate whether Artificial Intelligence (AI) could help. We started with a collection of 7,000 news articles, half from the mainstream media and half from known purveyors of fake news, and organized them into a dataset suitable for machine learning. After using Natural Language Processing (NLP) to clean up the data and perform tasks like excluding common words, we trained several open-source machine learning algorithms and settled on a model that was more than 90% accurate at telling articles from real sources apart from articles from fake sources.
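The pipeline described above (clean the text, drop common words, train a classifier) can be sketched in a few lines. This is only an illustration under stated assumptions, not the model CVP settled on: the post does not name the algorithm, so the sketch uses a multinomial Naive Bayes classifier written with the Python standard library, a deliberately short hypothetical stop-word list, and four made-up sentences in place of the real articles.

```python
import math
import re
from collections import Counter

# Hypothetical, abbreviated stop-word list; real NLP toolkits ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
              "is", "was", "this", "that", "it", "for"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop common (stop) words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def train(labeled_articles):
    """Count word frequencies per label for a multinomial Naive Bayes model."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training articles
    for text, label in labeled_articles:
        word_counts.setdefault(label, Counter()).update(tokenize(text))
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((counts[word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy sentences for illustration only; not taken from the real dataset.
TOY_DATA = [
    ("The president said the bill will pass, officials said", "real"),
    ("The senator said in a statement that talks continue", "real"),
    ("Share this article before they delete it", "fake"),
    ("Everyone knows the truth, share this shocking article now", "fake"),
]

model = train(TOY_DATA)
print(predict(model, "The minister said the vote is on Friday"))
print(predict(model, "Share this article with everyone"))
```

With only four training sentences the predictions mean very little; the point is the shape of the workflow. A real experiment would hold out part of the labeled articles and measure accuracy on that unseen portion.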
We were floored that the model was so accurate after just a couple of runs. This looked like a complex problem that a model would struggle to pick up on, but it turned out we were too pessimistic. We then went a step further with a technique called “explainable AI,” applying a library called SHAP to the machine learning model to explain why it made the decisions it did.
It turns out the biggest factor the model keyed in on was that fake news writers tend to state opinions as facts and don’t bother to quote people or attribute statements to them. The explanations told us that the absence of the word “said” was a strong indicator of fake news, because those authors rarely wrote about who said what. We found that other terms such as “president” were generally correlated with real news, while the words “share” and “article” were associated with fake news, likely because fake news authors have an important goal of ensuring their message is widely shared on social media.
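To show how a missing word can count as evidence, here is a minimal sketch of a per-word explanation. It does not call the SHAP library. It relies on a known special case: for a linear model whose features are treated as independent, each feature's Shapley value reduces to weight × (value − average value). The weights and background averages below are hypothetical. Their signs follow the findings above, but the magnitudes are invented for illustration.

```python
# Hypothetical weights of a linear "fake news" scorer: positive pushes toward "fake".
# Signs follow the findings in the text; the magnitudes are invented.
WEIGHTS = {"said": -1.5, "president": -0.8, "share": 1.2, "article": 0.9}

# Hypothetical average count of each word per article in a background dataset.
BACKGROUND_MEAN = {"said": 2.0, "president": 0.5, "share": 0.3, "article": 0.2}


def word_counts(text):
    """Count how often each tracked word appears in the text."""
    tokens = text.lower().split()
    return {word: tokens.count(word) for word in WEIGHTS}


def explain(text):
    """Per-word contribution to the score: weight * (count - background mean)."""
    counts = word_counts(text)
    return {w: WEIGHTS[w] * (counts[w] - BACKGROUND_MEAN[w]) for w in WEIGHTS}


contributions = explain("share this article everyone must share it")

# Print the words from most to least influential, with the sign of their push.
for word, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:>10}: {value:+.2f}")
```

In this toy example the largest push toward “fake” comes from “said”, a word that never appears in the sentence. Because the typical article in the assumed background uses it about twice, its absence is informative.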
All the code and visualizations used for this project are available on GitHub.
Although CVP is not in the business of monitoring the media, we are using a similar technical approach to help clients understand complex topics and make predictions in other areas. For example, one of our clients in the federal government runs a hotline where members of the public can log safety concerns about a location or business. Using a version of the approach outlined above, we train a machine learning model to recognize which concerns are tied to sites with reported injuries, and then use it to predict future accidents or injuries. Our early prototype is already running at over 75% accuracy, and we look forward to building on it to further improve public safety.