Harnessing Sociolinguistics: A New Framework for Ethical AI Development

Understanding Bias and Discrimination in AI: Why Sociolinguistics Holds the Key to a Fairer World

Researchers have developed a novel framework for better understanding large language models (LLMs) by integrating principles from sociolinguistics.

13 January 2025

Sociolinguistics and AI

The language engines that power generative artificial intelligence (AI) are plagued by a wide range of issues that can hurt society, most notably through the spread of misinformation and discriminatory content, including racist and sexist stereotypes. In large part, these failings of popular AI systems, such as ChatGPT, are due to shortcomings in the language databases on which they are trained.

To address these issues, researchers from the University of Birmingham have developed a novel framework for better understanding large language models (LLMs) by integrating principles from sociolinguistics – the study of language variation and change. Publishing their research in Frontiers in AI, the experts argue that by accurately representing different varieties of language, the performance of AI systems could be significantly improved – addressing critical challenges in AI, including social bias, misinformation, domain adaptation, and alignment with societal values.

The intersection of language and technology: A new era for AI.

When prompted, generative AIs such as ChatGPT may be more likely to produce negative portrayals of certain ethnicities and genders, but the research offers solutions for how LLMs can be trained in a more principled manner to mitigate social biases. The researchers emphasize the importance of using sociolinguistic principles to train LLMs to better represent the diverse dialects, registers, and periods of which any language is composed – opening new avenues for developing AI systems that are more accurate and reliable, as well as more ethical and socially aware.

Lead author Professor Jack Grieve commented:

“These types of issues can generally be traced back to the data that the LLM was trained on. If the training corpus contains relatively frequent expression of harmful or inaccurate ideas about certain social groups, LLMs will inevitably reproduce those biases resulting in potentially racist or sexist content.”

The study suggests that fine-tuning LLMs on datasets designed to represent the target language in all its diversity, as decades of research in sociolinguistics have described in detail, can enhance the societal value of these AI systems. The researchers also believe that by balancing training data from different social groups and contexts, it is possible to address issues around the amount of data required to train these systems.
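
As a rough illustration of what balancing training data across social groups and contexts could look like in practice, the sketch below caps the number of documents drawn from each language variety so that no single variety dominates a fine-tuning sample. It is not the researchers' method: the function name `balance_by_variety`, the variety labels, and the toy corpus are all hypothetical, and a real corpus would need carefully designed sociolinguistic labels.

```python
import random
from collections import Counter, defaultdict


def balance_by_variety(docs, per_variety, seed=0):
    """Sample at most `per_variety` documents from each variety label.

    `docs` is a list of (variety_label, text) pairs. The labels are
    hypothetical stand-ins for dialects, registers, or time periods.
    """
    rng = random.Random(seed)

    # Group the texts by their variety label.
    by_variety = defaultdict(list)
    for label, text in docs:
        by_variety[label].append(text)

    # Draw the same maximum number of texts from every variety.
    balanced = []
    for label in sorted(by_variety):
        texts = by_variety[label]
        k = min(per_variety, len(texts))
        balanced.extend((label, text) for text in rng.sample(texts, k))
    return balanced


# Toy corpus (invented): one variety is heavily over-represented.
corpus = (
    [("news", f"news text {i}") for i in range(8)]
    + [("forum", f"forum text {i}") for i in range(3)]
    + [("dialect_blog", f"blog text {i}") for i in range(2)]
)

sample = balance_by_variety(corpus, per_variety=2)
counts = Counter(label for label, _ in sample)
print(counts)
```

In this toy case each of the three varieties contributes two documents, even though the raw corpus contains four times as many news texts as dialect blog texts. Capping is only one possible design choice; weighting or targeted collection of under-represented varieties would be alternatives.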

Rethinking the datasets: The role of diversity in AI training.

“We propose that increasing the sociolinguistic diversity of training data is far more important than merely expanding its scale,” added Professor Grieve. “For all these reasons, we therefore believe there is a clear and urgent need for sociolinguistic insight in LLM design and evaluation.”

Understanding the structure of society and how it is reflected in patterns of language use is critical to maximizing the benefits of LLMs for the societies in which they are increasingly embedded. More generally, incorporating insights from the humanities and the social sciences is crucial for developing AI systems that better serve humanity.

The Path Forward

Adopting sociolinguistics as a foundational aspect of AI development can lead to significant advancements in building systems that not only function effectively but do so while promoting fairness and inclusion. Researchers were also keen to stress that as LLMs continue to evolve, maintaining a commitment to diverse linguistic representation will be key to preventing the re-emergence of harmful biases.

AI technologies have the potential to revolutionize industries and everyday life, but their impact hinges on the integrity of the data and algorithms used in their creation. It is, therefore, incumbent upon researchers and developers to engage with sociolinguistics and other social sciences at every stage of design and implementation.

The future of AI: Balancing technology with human insight.

In conclusion, the integration of sociolinguistic insights can address some of the most pressing challenges in AI today. As we stand on the brink of a new age in technology, it is imperative that those who create and shape these systems prioritize a comprehensive and diverse representation of language and culture. Only then can we ensure that AI serves as a tool for progress rather than a source of discrimination.

Notes for Editors

For further information about these significant research developments, interested parties are encouraged to visit the University of Birmingham and explore the various programs available in linguistics and AI.

The University of Birmingham is ranked amongst the world’s top 100 institutions. Its work brings people from across the world to Birmingham, including researchers, teachers, and more than 8,000 international students from over 150 countries.