Leveraging generative AI for linguistically diverse classification

Generative artificial intelligence (GenAI) has quickly revolutionized how we interact with technology, and its impacts extend well beyond text generation and chatbots. While current challenges like hallucinations and biases do present real and significant obstacles to wider adoption, one of the technology’s more robust and promising applications lies in a less explored domain: multiclass and multilanguage classification tasks. In this area specifically, large language models (LLMs) excel at organizing and interpreting data based on predefined taxonomies.

These classification tasks, which involve tagging content in various languages based on user-specified categories, have traditionally been rather resource-intensive endeavors.

For instance, building a classifier to sort COVID-19 content into predefined categories (such as vaccination, natural remedies, or pandemic politics) typically required lengthy data annotation and rigorous review by a diverse group of annotators speaking many languages, who can often be difficult to recruit. With LLMs, this process can be streamlined so that the work done by humans focuses primarily on verifying the accuracy of the labels produced by AI, significantly reducing the cost and complexity of building classifiers for journalists.

SynDy: Synthesizing data for specialized models

While using LLMs directly for classification tasks can be effective, the costs of continuously querying large language models can add up quickly when done at scale. To address this challenge, we are training smaller, specialized language models on traces of the larger models, that is, on the outputs those models have already produced.

In one such initiative, we created a dataset generation framework called SynDy, a name that combines the key descriptors “synthetic” and “dynamic.” Our demo paper on SynDy was published in the proceedings of the 47th international conference of the Association for Computing Machinery’s Special Interest Group on Information Retrieval (SIGIR 2024).

Read our demo paper, “SynDy: Synthetic Dynamic Dataset Generation Framework for Misinformation Tasks.”

In this paper, we discuss synthesizing labeled datasets using LLMs. By generating synthetic data, we can train local classifiers without relying on human-labeled datasets, which reduces both the time and the resources required.

ClassyCat: SynDy in action

An exciting application of this approach is our internal tool ClassyCat. ClassyCat leverages LLMs to assign labels to a database of social media content based on user-defined taxonomies, all with minimal human supervision. The initial success of this tool illustrates how GenAI can simplify complex classification tasks, even for those who have limited technical expertise.
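To make the idea of labeling against a user-defined taxonomy concrete, here is a minimal sketch. It is not ClassyCat's actual implementation: the function names (build_prompt, parse_label, classify), the prompt wording, and the `llm` callable are all hypothetical, and the taxonomy is assumed to be a simple mapping from label to description.

```python
from typing import Callable, Dict, Optional


def build_prompt(taxonomy: Dict[str, str], text: str) -> str:
    """Render a taxonomy (label -> description) and one item as a prompt."""
    lines = ["Assign exactly one label from the list below to the text.", ""]
    for label, description in taxonomy.items():
        lines.append(f"- {label}: {description}")
    lines += ["", f"Text: {text}", "Answer with the label only."]
    return "\n".join(lines)


def parse_label(response: str, taxonomy: Dict[str, str]) -> Optional[str]:
    """Map a raw model reply onto a taxonomy label, or None if nothing matches."""
    cleaned = response.strip().strip(".\"'").lower()
    for label in taxonomy:
        if cleaned == label.lower():
            return label
    return None  # off-taxonomy or malformed reply: route to human review


def classify(
    text: str,
    taxonomy: Dict[str, str],
    llm: Callable[[str], str],  # hypothetical: any function from prompt to reply
) -> Optional[str]:
    """Ask the model for a label and validate it against the taxonomy."""
    return parse_label(llm(build_prompt(taxonomy, text)), taxonomy)
```

In this sketch, `llm` can be any function that sends a prompt to a model and returns its reply as a string. Returning None for replies that do not match a label keeps malformed answers out of the labeled records, so they can be reviewed by a person instead.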

As part of our development plan for further refining ClassyCat, we are collecting all classification records generated by the tool. These records are then repurposed as training data for smaller, specialized models. Eventually, for mature taxonomies, these small local models will replace direct classification by LLMs. However, the smaller models will still be able to query LLMs in novel and unfamiliar contexts.

This approach means that our LLM cost per taxonomy trends toward zero over time, which is essential for managing LLM costs at leaner organizations such as Meedan.

In particular, we are planning to use the k-nearest neighbors algorithm for local classification, which will have the added benefit of helping us identify when we actually do need to hand off a specific classification decision to more powerful LLMs. Put simply, the k-nearest neighbors algorithm will allow us to calculate how unusual a new query is compared to previously labeled queries. Such approaches can also identify which queries are near a classification boundary, a sign that a more capable model may be needed for an accurate label.
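The following is a rough sketch of how such a handoff rule could look, not a description of our production system. It assumes each query has already been turned into a numeric vector (for example, by an embedding model); the function name knn_decide and both thresholds are hypothetical and would need tuning on real data.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_decide(
    query: Vector,
    labeled: List[Tuple[Vector, str]],
    k: int = 3,
    max_mean_distance: float = 1.0,  # hypothetical novelty threshold
    min_agreement: float = 0.75,     # hypothetical boundary threshold
) -> Tuple[Optional[str], str]:
    """Return (label, "local"), or (None, reason) when an LLM should decide."""
    neighbors = sorted(
        ((euclidean(query, vec), label) for vec, label in labeled),
        key=lambda pair: pair[0],
    )[:k]
    if not neighbors:
        return None, "novel"
    # Far from everything labeled so far: the query is unusual.
    mean_distance = sum(dist for dist, _ in neighbors) / len(neighbors)
    if mean_distance > max_mean_distance:
        return None, "novel"
    # Neighbors disagree: the query sits near a classification boundary.
    top_label, votes = Counter(label for _, label in neighbors).most_common(1)[0]
    if votes / len(neighbors) < min_agreement:
        return None, "boundary"
    return top_label, "local"


labeled = [
    ([0.0], "vaccination"), ([0.1], "vaccination"), ([0.2], "vaccination"),
    ([1.0], "natural remedies"), ([1.1], "natural remedies"), ([1.2], "natural remedies"),
]
print(knn_decide([0.1], labeled))  # ('vaccination', 'local')
print(knn_decide([0.6], labeled))  # (None, 'boundary')
print(knn_decide([9.0], labeled))  # (None, 'novel')
```

The two checks correspond to the two signals described above: the mean distance to the nearest neighbors measures how unusual a query is, and the share of neighbors that agree on a label indicates how close the query is to a boundary. Only queries that pass both checks are labeled locally.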

Challenges of linguistic diversity

At Meedan, we support 66 organizations from 24 countries with our flagship software tool, Check. Most of the data processed by Check is not in English. While popular LLM services like ChatGPT and Claude have shown proficiency in English, their capabilities require thorough evaluation in lower-resource settings — for example, in languages for which there is less available training data, such as Arabic or Hindi.

In order to ensure quality service for our users, we have internally conducted evaluations in Arabic, Hindi, Mandarin, and English on what were, at the time of evaluation, state-of-the-art LLMs. The numbers in the table here are F1 scores measured across different evaluation sets and LLMs. The scores range from 0 to 1, with 1 being the highest. The color coding denotes relative performance within the same language benchmark.

LLM	Arabic	Mandarin	Hindi	English
GPT-3.5 Turbo	0.21	0.47	0.52	0.59
Command R+	0.34	0.63	0.62	0.66
GPT-4 Turbo	0.45	0.69	0.58	0.63
Claude 3 Sonnet	0.52	0.65	0.64	0.62
Mistral Large	–	0.65	0.60	0.66

As a quick note, the F1 score is a standard performance metric for evaluating text classification models. It is the harmonic mean of precision and recall, which yields a single score that balances both factors.
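For readers who want to see the arithmetic, here is a small illustrative sketch. The function names are hypothetical, and averaging the per-label scores equally (macro averaging) is an assumption made for this example; it is not a statement of how the scores in the table were aggregated.

```python
from typing import Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one label: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct predictions for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over the labels present in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


y_true = ["vaccination", "vaccination", "natural remedies", "pandemic politics"]
y_pred = ["vaccination", "natural remedies", "natural remedies", "pandemic politics"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.778
```

In this toy example, one vaccination post is mislabeled as natural remedies. That lowers recall for vaccination and precision for natural remedies, so both labels score about 0.667 while pandemic politics scores 1.0.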

We discovered several interesting trends. For instance, the Mistral Large model was functional in every language we evaluated except Arabic. It failed to follow instructions for almost all instances of our Arabic benchmark, so the table reports no Arabic score for it. GPT-4 Turbo, which was the most expensive model we tested, was not the best model across the board. Instead, we found that Claude 3 Sonnet delivered consistently strong results in all the languages we tested. As such, ClassyCat currently utilizes Claude Sonnet endpoints in production.

ClassyCat taxonomies: A maternal health use case

In collaboration with our academic partners at the Annenberg Public Policy Center of the University of Pennsylvania, Meedan’s ClassyCat was used to evaluate a variety of online maternal health content at scale based on a provided taxonomy for identifying false narratives and intervening with factual information.

Central to this collaboration was a proactive approach: instead of responding to misleading content after the fact with corrections, the Annenberg team developed enduring explanations of prevalent deceptive narratives related to maternal health topics. These common false narratives were converted into a ClassyCat taxonomy. Now, when given social media posts in English, ClassyCat can recognize which narrative the post aligns with and can then direct users to a corresponding explanation that would “prebunk” (preemptively debunk) the false narrative.

The path forward: Enhancing global communication

Our journey with GenAI is an ever-evolving one, and it’s filled with both challenges and opportunities. By focusing on the “safe hanging fruit” (use cases that allow us to enhance efficiency without incurring significant ethical risk), we aim to empower journalists and information professionals worldwide.

Importantly, we acknowledge that GenAI is already being used to harass and inflict harm upon vulnerable communities. This is something we’ve seen firsthand alongside our partners in our work on gendered disinformation and other forms of gender-based violence.

At Meedan, we are committed to developing AI systems that are culturally aware and ethically responsible, and we strive to place the needs of the communities with whom we collaborate and codesign initiatives at the forefront. While we are excited about the transformative potential of tools like ClassyCat to make information more accessible and actionable, we recognize the importance of prioritizing the safety and well-being of the people we serve. As we continue refining our GenAI offerings, we will remain vigilant in our approach and always center the communities we partner with.

We believe that collaboration and open dialogue are essential for the responsible development of new AI technologies.

We invite researchers, practitioners, and AI enthusiasts to collaborate, engage with our work, share insights, and contribute to this important field. Drop us a line today.
