The Vanderbilt University Data Science Institute made progress this semester on a research project that uses deep learning to bridge gaps in language acquisition for low-resource languages. The team is developing an AI assistant for language professors, along with tools that keep curricula up to date with changes in the Hindi language.
The research project is led by Dr. Elliott McCarter, Senior Lecturer in Asian Studies, and Umang Chaudhry, Data Scientist. Xiaodi Ruan, a Computer Science student at Vanderbilt, and Sean Onamade, an intern at Vanderbilt University Medical Center, also worked on the project. The project's goal is to address challenges faced by less commonly taught languages such as Hindi, whose curricula often struggle to keep pace with changes in the living language. Existing teaching materials for Hindi are rigid and overly formal, teaching a “pure” version of the language rooted in Sanskrit, and they do not account for colloquialisms or loanwords from other languages.
To address these issues, the project team is developing an AI assistant for language professors, along with tools that allow curricula to stay current as the language changes. The team is also developing a process that others can replicate, with a proof of concept that uses AI to generate level-appropriate, up-to-date vocabulary lists.
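As a rough illustration of what such a proof of concept could look like, the sketch below prompts a general-purpose large language model to produce a level-appropriate Hindi vocabulary list from a short news excerpt. The prompt wording, model choice, and helper function are illustrative assumptions, not the team's actual implementation.

```python
# Illustrative sketch only: the prompt wording, model choice, and function
# name are assumptions, not the project's actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def vocab_list_from_article(article_text: str, level: str = "intermediate") -> str:
    """Ask a chat model for a level-appropriate Hindi vocabulary list."""
    prompt = (
        f"From the following Hindi news excerpt, extract 10 {level}-level "
        "vocabulary items. For each item give the Hindi word, a Roman "
        "transliteration, and a short English gloss.\n\n"
        f"{article_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # keep the output consistent across runs
    )
    return response.choices[0].message.content


# Example usage with a placeholder excerpt:
# print(vocab_list_from_article("... एक हिंदी समाचार अंश ...", level="beginner"))
```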
The team has faced obstacles, including the lack of free, machine-readable modern Hindi texts. To work around this, they turned to newspapers, which capture how the language is evolving, though in a somewhat more formal register. Access to digitized articles can be expensive, however, costing around $7,000 a year.
The team has tested several pre-trained transformer models and large language models, including 50 Text2Text Generation models on HuggingFace whose training data included Hindi-language text, Poe (Sage, Claude, Dragonfly), ChatGPT, GPT-4, BingGPT, and Google Bard.
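For readers curious how such a HuggingFace model is queried, the sketch below loads a multilingual Text2Text Generation model through the transformers pipeline and sends it a Hindi prompt. The model ID shown (google/mt5-small) is just one example of a checkpoint whose pretraining data includes Hindi; it is not necessarily one of the 50 models the team evaluated, and an untuned checkpoint like this would need fine-tuning before producing useful output.

```python
# Minimal sketch of querying a Hindi-capable Text2Text Generation model.
# The model ID is illustrative; it is not necessarily one the team tested.
from transformers import pipeline

generator = pipeline(
    "text2text-generation",
    model="google/mt5-small",  # mT5's pretraining corpus includes Hindi
)

# Ask the model to rewrite a Hindi sentence in simpler Hindi.
prompt = "इस वाक्य को सरल हिंदी में लिखिए: मौसम विभाग ने भारी वर्षा की चेतावनी जारी की है।"
outputs = generator(prompt, max_new_tokens=64)

# The pipeline returns a list of dicts with a "generated_text" field.
print(outputs[0]["generated_text"])
```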
The next steps for the project include continuing to iterate on prompts, developing vocabulary sets drawn from Hindi newspaper articles on particular topics, using LangChain to build a document store that supports semantic search for selecting articles on a specific topic, and training a Hindi/Sanskrit model from scratch. With these advancements, the team hopes to develop a product that serves the entire Hindi-language pedagogy community and researchers in Hindi linguistics, and that serves as a model for providing data-supported instructional materials for other less commonly taught languages.
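A minimal sketch of the LangChain piece, assuming a FAISS vector store and a multilingual sentence-transformer for embeddings (both common choices, not necessarily the team's): articles are embedded once, and a topic query in Hindi then retrieves the closest matches for vocabulary work.

```python
# Sketch of a semantic-search document store over Hindi news articles.
# Library choices (FAISS, a multilingual sentence-transformer) are assumptions,
# and import paths may differ slightly across LangChain versions.
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document

# Placeholder articles standing in for scraped newspaper text.
articles = [
    Document(page_content="चुनाव आयोग ने मतदान की तारीखों की घोषणा की।",
             metadata={"source": "placeholder-1"}),
    Document(page_content="क्रिकेट विश्व कप में भारत ने शानदार जीत दर्ज की।",
             metadata={"source": "placeholder-2"}),
]

# A multilingual embedding model so Hindi queries and articles share a space.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
store = FAISS.from_documents(articles, embeddings)

# Retrieve the articles most semantically similar to a topic query.
hits = store.similarity_search("खेल समाचार", k=1)
for doc in hits:
    print(doc.metadata["source"], doc.page_content)
```

In a setup like this, new articles can be added to the store as they are collected, so topic-based retrieval stays current without retraining anything.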