Swecha’s Ambitious Project: Building a Telugu LLM Corpus and Culture Portal
Swecha, a non-profit organization dedicated to promoting the Free Software and Free Knowledge movements, has announced a massive internship program, dubbed the ‘Summer of AI’, to equip over a lakh (100,000) engineering students with AI skills this summer. The initiative, undertaken in collaboration with IIIT Hyderabad, Ozonetel, Meta, and TASK, aims to develop a Telugu language-centric Large Language Model (LLM) corpus and culture portal.
“India, with its rich culture and a population that constitutes one-sixth of the world, would greatly benefit from having its own LLMs.” - Y Kiran Chandra, Founder, Swecha
The lack of Indian language-centric LLMs is a significant gap in the AI landscape. Most Indian languages are considered low-resource languages, which makes it challenging to develop LLMs for them. Building such models first requires compiling and digitizing a large body of foundational knowledge in these languages.
Engineering students to gain AI skills
The ‘Summer of AI’ project aims to draw on India’s large pool of graduating engineering students, training them in AI and engaging them in large-scale data collection through interviews. This could create a large base of trained AI engineers, extending well beyond the small group of researchers and developers who specialize in deep learning models.
Collecting information on Telugu folk tales, songs, and local history
The project involves collecting speech, transcribing it, and creating a dataset that can serve both speech models and a base LLM. The team is also working with libraries and the Telugu academy to ingest a large number of books. The work will be carried out through 100,000 internships, with tools being built to support data collection.
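To make the collect, transcribe, and build-a-dataset flow concrete, here is a minimal Python sketch. It is purely illustrative: the project's actual tools and data formats are not described in this article, so the record fields, category labels, and function names below are all hypothetical assumptions. It shows how one set of transcribed interviews could feed both a speech dataset (audio paired with text) and a plain-text corpus for a language model.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class InterviewRecord:
    """One collected interview. Every field name here is a hypothetical choice."""
    record_id: str
    audio_path: str   # where the recorded speech is stored
    transcript: str   # Telugu text transcribed from the recording
    category: str     # assumed labels, e.g. "folk_tale", "song", "local_history"


def is_usable(record: InterviewRecord) -> bool:
    """Keep a record only if it has both an audio path and a non-empty transcript."""
    return bool(record.audio_path.strip()) and bool(record.transcript.strip())


def to_speech_manifest(records) -> str:
    """Serialize usable records as JSON Lines, pairing each audio file with its text."""
    return "\n".join(
        # ensure_ascii=False keeps Telugu script readable instead of \u escapes
        json.dumps(asdict(r), ensure_ascii=False)
        for r in records
        if is_usable(r)
    )


def to_text_corpus(records) -> str:
    """Keep only the transcripts, one per line, as raw text for language-model training."""
    return "\n".join(r.transcript.strip() for r in records if is_usable(r))
```

In this sketch the same records produce two outputs, which mirrors the idea of a single collection effort serving both speech and text models; a real pipeline would also need consent tracking, quality review, and speaker metadata.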
Developing a comprehensive corpus for Telugu Language Models
Once this project is successfully completed, a similar approach will be adopted to collect data for other languages and regions. The ‘Summer of AI’ project has the potential to greatly benefit the Telugu language by preserving its culture through the documentation of oral traditions, folk knowledge, and personal narratives.
A comprehensive corpus for Telugu Language Models
This initiative will develop a comprehensive corpus that serves as a foundational resource for training and refining Telugu Language Models, enabling more accurate and contextually appropriate language processing in digital environments. Ultimately, it aims to empower the Telugu community and support language revitalization.