DevConf.IN 2025

SHISHIR TRIPATHI


Company or affiliation

MIT World Peace University

Job title

Student


Session

03-01
13:30
15min
Need for High Quality Datasets for Indic-NLP
SHISHIR TRIPATHI

The purpose of this proposal is to discuss the importance of high-quality datasets and corpora in natural language processing and how they can accelerate advances in LLMs, and in AI more generally, in India. The performance of many natural language processing applications relies more on the occurrence and frequency of tokens than on their lexical arrangement, based on the intuition that similar words naturally appear together. This requires us to represent the language broadly when preparing datasets or corpora: a method that performs differently on Indian languages than on English can be contrasted and disentangled only once the domain distribution, structure, and generality of an Indian corpus match those of a standard Western one. Until then, every research comparison leaves open the question of whether the method would have behaved differently had a comparable corpus been available. In fact, even refuting a theory about how a particular language behaves, based on statistical occurrence, requires a good dataset. The sheer number of NLP applications directly involved in the development of LLMs and conversational systems similar to OpenAI's ChatGPT obliges us to plan and prepare datasets through collaboration and contribution on a much larger scale.

AI, Data Science, and Emerging Tech
Swami Vivekananda Auditorium