Meedan’s Research team recently published the paper "Claim Matching Beyond English to Scale Global Fact-Checking" at the 2021 conference of the Association for Computational Linguistics (ACL). This paper describes how we built equitable technology, using a public dataset of WhatsApp messages and fact-check headlines, to match claims in multiple languages. During my summer 2020 fellowship with Meedan, I worked on this project with Kiran Garimella (MIT/Rutgers), Devin Gaffney (Meedan) and Scott Hale (Meedan & Oxford).
Human fact-checking is high-quality but time-consuming. However, we can still scale it by finding all the content that a fact-check applies to. Simple approaches such as measuring word overlap between texts often fall short when similar claims are expressed with different words, for example, "the Earth is warming more quickly than ever before" and "climate change has never happened so fast." Conversely, high word overlap can be misleading when the key information in the claims is different: "Bolsonaro negotiates border reopening with Uruguay" and "Trump negotiates border reopening with Canada" share most of their words but are different claims.
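To make the shortcoming concrete, here is a minimal sketch that scores the two example pairs with Jaccard overlap on word sets. Jaccard is only one of many possible overlap measures and is used here as an assumption for illustration; the helper names `tokens` and `jaccard` are hypothetical and are not taken from our paper.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Same claim, different words: the two sentences share no tokens.
paraphrase_score = jaccard(
    "the Earth is warming more quickly than ever before",
    "climate change has never happened so fast",
)

# Different claims, similar words: four of eight distinct tokens are shared.
different_claim_score = jaccard(
    "Bolsonaro negotiates border reopening with Uruguay",
    "Trump negotiates border reopening with Canada",
)

print(paraphrase_score, different_claim_score)  # 0.0 0.5
```

The pair that should match scores lower than the pair that should not, which is why we turned to models that compare meaning rather than surface words.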
Prior natural language processing (NLP) research by Shaden Shaar and colleagues on claim matching focused on English, but what if we wanted to support fact-checking efforts in India or Brazil? The main obstacles to doing this are that data-hungry, state-of-the-art NLP algorithms are not available for lower-resourced languages, and creating new datasets in such languages is costly.
In English, we found we could build a solution with Sentence-BERT (SBERT). However, we could not train a similar model for other languages due to a lack of data in those languages. SBERT, like most modern NLP methods, represents text as a sequence of numbers (a dense vector). Since SBERT already performed well in English, we might be able to get it to perform well in other languages if the numeric representations of those languages were aligned with English. We did exactly that using a combination of knowledge distillation, parallel datasets and Facebook’s XLM-RoBERTa model.
The knowledge distillation technique created a new model where (i) the vectors for English sentences were very close to those of the original SBERT model and (ii) sentences in other languages also had vectors very similar to those of their English translations. This technique essentially dragged all the vector representations for the other languages into alignment with the high-performing English-language model. The resulting model can generate contextual embeddings for a group of high-resource (English, Hindi) and low-resource (Bengali, Malayalam, Marathi, Tamil and Telugu) languages. We publicly released our model, called "Indian XLM-R" (I-XLM-R), on Hugging Face, an open-source online community for sharing NLP models and technology. Our research paper describes the solution architecture in greater detail.
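The two conditions above can be written as a single training objective. The sketch below assumes a mean-squared-error loss, which is a common choice for this kind of distillation but is an assumption here; the paper describes the actual training setup. The function names and the three-dimensional toy vectors are hypothetical, and real sentence embeddings have hundreds of dimensions.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_en, student_en, student_translated):
    """Sum of the two alignment terms for one parallel sentence pair.

    teacher_en:         teacher (SBERT) vector for the English sentence
    student_en:         student vector for the same English sentence
    student_translated: student vector for the sentence's translation
    """
    english_term = mse(teacher_en, student_en)              # condition (i)
    translation_term = mse(teacher_en, student_translated)  # condition (ii)
    return english_term + translation_term


# Toy vectors, invented for illustration only.
teacher = [1.0, 0.0, 2.0]

# A fully aligned student reproduces the teacher vector for both inputs.
aligned_loss = distillation_loss(teacher, [1.0, 0.0, 2.0], [1.0, 0.0, 2.0])

# A student whose translation vector has drifted is penalized.
drifted_loss = distillation_loss(teacher, [1.0, 0.0, 2.0], [0.0, 1.0, 2.0])

print(aligned_loss, drifted_loss)
```

Minimizing a loss of this shape over many parallel sentence pairs is what pulls the student's vectors for every language toward the teacher's English vectors.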
Does it work?
To evaluate our solution, we created two multilingual, mainly Indian datasets: a claim detection set ("does this text contain check-worthy claims?") and a claim matching set ("can the claims in these two posts be served with one fact-check?"). This was made possible through collaboration with fact-checking organizations running misinformation tiplines using Meedan’s software, Check, and messages from public WhatsApp groups collected previously by Kiran Garimella. Details of our dataset curation process can be found in our paper, and the data is available. Using our newly constructed benchmark, we found that our I-XLM-R model outperformed existing state-of-the-art NLP models such as Google’s LaBSE and Facebook’s LASER.
Our approach is not limited to Indian languages and works for most languages. Although we mainly evaluated I-XLM-R on monolingual claim pairs (both claims in the same language), we have also observed good cross-lingual performance in practice, which suggests that our model can find similar claims across languages.
Meedan is already using the I-XLM-R based solution to support fact-checking partners. Claim matching allows WhatsApp users to get an immediate response to a request they send to a tipline if there is a close match. It also allows fact-checking partners to better understand the prevalence of a claim by grouping all of its repeats and variants together. Members of Meedan’s program team recently used our approach to study the top items submitted to misinformation tiplines run by partners in Brazil and India.
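As a rough illustration of how a tipline could return an immediate response, the sketch below compares the embedding of an incoming message with the embeddings of existing fact-checks and returns the closest one only if it clears a similarity threshold. This is a simplified sketch, not Meedan’s production code: the function names, the fact-check IDs, the 0.8 threshold and the two-dimensional toy vectors are all assumptions made for the example. In practice the vectors would come from a sentence-embedding model such as I-XLM-R.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(query_vec, fact_checks, threshold=0.8):
    """Return the ID of the most similar fact-check, or None if no
    fact-check reaches the (hypothetical) similarity threshold."""
    best_id, best_score = None, threshold
    for fc_id, vec in fact_checks.items():
        score = cosine(query_vec, vec)
        if score >= best_score:
            best_id, best_score = fc_id, score
    return best_id


# Toy embeddings, invented for illustration only.
fact_checks = {
    "fc-climate": [1.0, 0.0],
    "fc-border": [0.0, 1.0],
}

close_query = [0.9, 0.1]      # points almost the same way as "fc-climate"
ambiguous_query = [1.0, 1.0]  # equally far from both fact-checks

print(best_match(close_query, fact_checks))      # fc-climate
print(best_match(ambiguous_query, fact_checks))  # None
```

The same similarity scores can be used to group repeats and variants of a claim, for example by assigning every message to its best-matching fact-check and counting the messages in each group.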
Lastly, I want to acknowledge and thank my co-authors, my PhD advisor Professor Rada Mihalcea at the University of Michigan, the Meedan team and their fact-checking partners, our annotators, and the anonymous reviewers who provided us with thoughtful feedback: a big thank you to you all!
Privacy and data governance: The privacy of WhatsApp users is our top priority. This analysis was done on anonymous, aggregated data containing only the raw text of messages sent to the tiplines and large public groups. Our research did not use any information about users.
All tipline data is owned by the organizations running the tiplines. Our research only includes data from partners who opted in to this research project.