The Future of Language Models: Understanding Sarcasm in Livonian
In seven years, AI will understand sarcasm in Livonian, according to a linguist. This ambitious goal is part of a larger effort to support small languages with high-quality language models. One way to achieve this is by collecting more data on each small language. Another option is to automatically translate texts into small languages, giving the machine more learning resources.
“There are four billion words in the combined corpus of the Institute of the Estonian Language, but that is not enough. So we have translated 20 times more text from other languages into Estonian. It is not a substitute for human-written text, but it at least gives us a way to teach the models, even if only roughly,” said the professor.
The third and most exciting way is to change how language models are taught, taking a child's language acquisition as the example. In the first five years of life, a human being hears five million words. That is enough to develop an incomparably better understanding of language, and far more intelligence, than any artificial intelligence has.
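To put the quoted figures side by side, here is a rough back-of-the-envelope sketch. It uses only the numbers cited in the article and assumes that "20 times more" means 20 times the size of the four-billion-word corpus; the variable and function names are illustrative only.

```python
# Figures as quoted in the article; none are independently verified here.
corpus_words = 4_000_000_000      # combined Estonian corpus
translation_multiplier = 20       # assumption: "20 times more" = 20 x corpus size
child_words = 5_000_000           # words a child hears in the first five years

# Machine-translated text added on top of the human-written corpus.
translated_words = corpus_words * translation_multiplier
total_words = corpus_words + translated_words


def ratio_to_child(words: int) -> float:
    """How many times more words than a five-year-old has heard."""
    return words / child_words


print(f"Translated text: {translated_words:,} words")
print(f"Corpus vs. child: {ratio_to_child(corpus_words):,.0f}x")
print(f"Corpus plus translations vs. child: {ratio_to_child(total_words):,.0f}x")
```

On these numbers, the human-written corpus alone is already hundreds of times larger than what a child hears, which is why changing how the models learn looks more promising than simply adding data.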
According to the professor, this is where the University of Tartu’s neural speech synthesis and the Tallinn University of Technology’s automatic transcription could come together. “Let’s see if one can support the other. For example, can speech synthesis generate data to identify a language? Can we do this multilingually?”
Estonia’s Own Chatbot?
Estonia is striving to develop a robust freeware language model for Estonian, one that is suitable for both government and business use. Estonia would then have its own counterpart to Llama 2, Mistral, Claude, or ChatGPT.
Two Ph.D. students supervised by the professor have already made a first attempt to teach Estonian to Llama 2, Meta's language model, without the model forgetting English. “We called it Llammas, in the Estonian way,” the professor said.
The professor, language technologist Kairit Sirts, and automatic transcription developer Tanel Alumäe are currently seeking funding, together with their research groups, to create a strong freeware language model for Estonian.