We were thrilled to hear that the focus of Tech & Check this year would be on projects matching speakers’ claims with previously published fact-checks and other related text and image matching efforts in service of fact-checking. Just as we didn’t anticipate that our main technical focus would be the subject of this year’s gathering, we didn’t suspect that our respective planned travel from the UK and Brazil for this year’s Tech & Check would be replaced with an opening night mixer via Zoom due to the restrictions imposed by the COVID-19 crisis around the world. Regardless of the global pandemic, on April 3rd, journalists, engineers, computer scientists and other professionals met virtually to watch a few demos and discuss claim matching in the context of fact-checking.

Last year at Tech & Check, we presented Check Message, an integration between Check (Meedan’s open source platform for collaborative fact-checking) and the WhatsApp Business API, as a tool to deal with misinformation in closed messaging applications at scale. Just a few days after our presentation, the technology went live with Checkpoint, an effort to monitor, verify, and conduct research on WhatsApp misinformation during the 2019 Indian election. Checkpoint was led by Delhi-based media research and training organization PROTO, and was commissioned and technically assisted by WhatsApp. Check Message allows media partners to operate tip lines to which people can forward possibly misleading, miscontextualised, or false information they see, so that the partner can fact-check it. During the Indian election, PROTO received over 100,000 submissions from the public.

Caio Almeida presenting at Tech & Check 2019 at Duke University. Image via the DeWitt Wallace Center for Media and Democracy, and photo by Colin Huth.

Many of those submissions were similar to one another, but at that time our technology could only match identical submissions (text or images). Submissions that were similar but not identical had to be marked as such by hand, which was needlessly costly. To optimize that process, we started to work on tools that could automatically suggest similar items to our users.

We were honored to be one of five organizations that demonstrated their work in this area. We demonstrated our current claim and image matching tools for fact-checkers, which integrate with our open-source software, Check. We developed an open-source architecture that uses Elasticsearch to efficiently search large volumes of text items by their vector representations in near real-time, and integrated it into Check. For Checkpoint, we used pre-trained word vectors, but the architecture is not specific to that choice and can easily be adapted to other vector-representation models and languages. We are now using the Multilingual Universal Sentence Encoder, and are working to develop additional approaches to support even more of the many languages our partners across the world use. For images, we are using perceptual hashing and evaluating newer approaches, including PDQ.
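To make the idea of searching text items by their vector representations concrete, here is a minimal sketch in Python. It is not Check's implementation: it replaces Elasticsearch with a brute-force, in-memory search, assumes cosine similarity as the measure, and uses made-up three-dimensional vectors in place of real sentence embeddings. The function names and item IDs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vector, indexed_items, top_k=3):
    """Return up to top_k (item_id, score) pairs, highest score first."""
    scored = [
        (item_id, cosine_similarity(query_vector, vector))
        for item_id, vector in indexed_items.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (assumption): a real sentence encoder outputs far more dimensions.
indexed_items = {
    "claim-1": [0.9, 0.1, 0.0],
    "claim-2": [0.0, 1.0, 0.0],
    "claim-3": [0.8, 0.2, 0.1],
}

results = most_similar([1.0, 0.0, 0.0], indexed_items, top_k=2)
for item_id, score in results:
    print(item_id, round(score, 3))
```

A brute-force scan like this compares the query with every stored vector, which is why a search engine is needed once the number of items grows large.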

Architectural diagram of Check’s open source claim similarity framework

For both images and text, Check users are able to define a set of similarity rules, which relate items to be verified based on how similar they are. The user defines a minimum similarity threshold, which is a number between 0% and 100%, and if our image and text matching tool returns a similarity value equal to or higher than that threshold, the two items are related automatically. The idea is that we can optimize the fact-checking process by grouping similar items together rather than working on them separately. This also helps our partners better understand what content is most often submitted to their tip lines. The video below demonstrates those features.

Video demonstration of Check similarity tools, presented at Tech & Check 2020
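The threshold rule described above can be sketched in a few lines of Python. This is an illustration only, assuming that the matching tool returns a similarity score between 0 and 1 and that the user's threshold is stored as a percentage. The function name, item IDs, and scores are hypothetical.

```python
def should_relate(similarity, threshold_percent):
    """Return True when a similarity in [0, 1] meets a threshold given as 0-100%."""
    if not 0 <= threshold_percent <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return similarity >= threshold_percent / 100


# Hypothetical scored pairs: (item A, item B, similarity returned by the matcher).
scored_pairs = [
    ("tip-1", "tip-2", 0.93),
    ("tip-1", "tip-3", 0.41),
    ("tip-2", "tip-4", 0.80),
]

# With an 80% threshold, only pairs scoring 0.80 or higher are related automatically.
related = [(a, b) for a, b, score in scored_pairs if should_relate(score, 80)]
print(related)
```

A lower threshold groups more items but risks relating items that make different claims. A higher one leaves more grouping to be done by hand.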

This work was part of the discussion that followed the demos. In the afternoon, people interested in and working on claim matching discussed the types of problems ahead of us, and came up with three main branches:

The first, unsurprisingly, was claim matching and semantic text similarity: matching a novel claim against previously checked claims to determine whether a substantially similar claim already exists. Proposed solutions include named entity overlap, vector representations, and comparison with knowledge bases; categorization of previously made claims and comparison to a novel claim; and combinations of multiple approaches.
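As a rough illustration of the named entity overlap idea, the Python sketch below scores two claims by the Jaccard overlap of their entities. It makes a loud simplifying assumption: capitalized words stand in for named entities, which a real system would extract with a proper named entity recognizer. The function names and example claims are hypothetical.

```python
import re


def crude_entities(text):
    """Very rough stand-in for entity recognition: capitalized words, lowercased."""
    return {word.lower() for word in re.findall(r"\b[A-Z][a-zA-Z]+\b", text)}


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


novel_claim = "Brazil and India banned WhatsApp"
checked_claim = "WhatsApp was banned in India"

overlap = jaccard(crude_entities(novel_claim), crude_entities(checked_claim))
print(round(overlap, 2))
```

The heuristic also picks up ordinary words that happen to start a sentence, which is one reason entity overlap is usually combined with other signals rather than used alone.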

The second problem type was human matching. Purely computational fact-checking is prone to mismatches, sometimes funny, sometimes truly detrimental. Where and how can human editors, crowdsourcing and other forms of human intervention be used to assist and mediate this system? One proposal is to have an editor as the final step in an otherwise computational pipeline. At Meedan, we’re trying to develop efficient human-in-the-loop approaches where we can fine-tune and improve our computational tools based on the standard actions users take while using Check. Every time users remove an automated match or add a manual match, we gain valuable insights into the types of things that should and shouldn’t match. Over time, we want to use this data to evaluate new approaches and improve the existing ones. In other words, we see the system more as a circle than a pipeline: our algorithms make suggestions and learn from users’ actions.
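One simple way such feedback could be used is to score a candidate similarity threshold against what users actually did. The Python sketch below is an assumption about how that evaluation might look, not a description of Check's pipeline. Each hypothetical feedback record pairs a similarity score with whether the user treated the two items as a match.

```python
def evaluate_threshold(feedback, threshold):
    """Return (precision, recall) of 'score >= threshold' against user decisions.

    feedback is a list of (similarity, user_says_match) tuples.
    """
    true_pos = sum(1 for score, is_match in feedback if score >= threshold and is_match)
    false_pos = sum(1 for score, is_match in feedback if score >= threshold and not is_match)
    false_neg = sum(1 for score, is_match in feedback if score < threshold and is_match)
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical feedback log built from user actions.
feedback = [
    (0.95, True),   # automated match the user kept
    (0.85, False),  # automated match the user removed
    (0.70, True),   # match the user added manually
    (0.40, False),  # pair the user left unrelated
]

for candidate in (0.6, 0.8, 0.9):
    print(candidate, evaluate_threshold(feedback, candidate))
```

Comparing precision and recall across candidate thresholds, or across different similarity models, gives a way to judge a new approach before it is shown to users.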

The third problem type was image matching, for which we discussed solutions such as different image hashing algorithms, object retrieval, object identification, and information extraction from images.
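To show how image hashing supports matching, here is a minimal Python sketch of average hashing, one of the simplest perceptual hashes. It is not PDQ and not necessarily the algorithm Check uses. It assumes each image has already been converted to grayscale and downscaled to a small grid, represented here as a tiny 2x2 list of pixel values for readability. The function names are hypothetical.

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, 1 if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Number of positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


# Toy 2x2 "images" (assumption): real hashes typically use grids such as 8x8.
original = [[10, 200], [220, 30]]
brightened = [[30, 220], [240, 50]]   # same picture, uniformly brighter
different = [[200, 10], [30, 220]]    # a different picture

print(hamming_distance(average_hash(original), average_hash(brightened)))
print(hamming_distance(average_hash(original), average_hash(different)))
```

A small Hamming distance suggests two images are near-duplicates even if their bytes differ, which is what makes perceptual hashes useful where exact matching fails.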

We are excited to keep working on those problems and to contribute to the Tech & Check community, and we can’t wait to share more insights and findings next year at Tech & Check.

Tags
Technology
Authors
Words by

Caio Almeida leads engineering at Meedan. A senior software engineer in Ruby on Rails and JavaScript with 15 years of experience, he’s ranked among the top five Ruby developers from Brazil. He holds bachelor’s and master’s degrees in computer science from the Federal University of Bahia, Brazil, and is currently a special student in the PhD program at the same university.

Dr Scott A. Hale leads Meedan’s research in human-in-the-loop machine learning and natural language processing to create equitable access to information. He is a professor and researcher at the University of Oxford, working on hate speech, misinformation, and broadening access to data and methods.

Published on
April 21, 2020
April 20, 2022