Saviour or saboteur? Strengthening Bangla in the age of AI
Bengali is one of the world’s most spoken languages but remains largely absent online. With AI shaping the future of communication, this digital gap threatens Bangla’s relevance

According to the 2024 edition of Ethnologue, the standard reference for data on languages worldwide, Bengali is the seventh most spoken language globally, with an estimated 278.2 million speakers using it as their first or second language.
In terms of native speakers, Bengali ranks even higher at number five, as it is the mother tongue of 237 million people. By all accounts, Bengali, or Bangla, is one of the most widely spoken languages in the world.
Strangely enough, Bangla is one of the least-used languages on the internet. For example, Bengali Wikipedia ranks 63rd among all language editions by number of articles. Overall, Bangla accounts for less than 0.1% of content on the internet, which is a significant problem in the age of generative AI.
Generative AI, which started to gain mainstream attention after the release of ChatGPT in 2022, is playing an increasing role in content generation. AI models can now write at a level comparable to or better than that of humans, even in creative and intellectual work such as poetry and research abstracts. Whether we like it or not, AI is already writing news articles, movie scripts, and advertisements.
Generative AI learns to write by training on existing content. A large part of this training data comes from the internet. With nearly half of the online content in English and 90% concentrated in just ten languages (English, Russian, Spanish, German, French, Japanese, Turkish, Portuguese, Italian, and Persian), Bangla is seriously lagging here. The problem is not just limited to Bengali; thousands of languages are being ignored or sidelined in the AI revolution because they are not represented well enough on the internet.
Bangla has historically shown resilience due to its deep cultural and national roots. However, AI poses unique challenges that must be taken seriously. One of the biggest challenges for Bangla in the AI era is the lack of high-quality digital content on the internet for training purposes. Much of the available Bangla text online suffers from non-standard spelling, poor grammar, and structural inconsistencies. AI trained on such data produces equally poor content, creating a vicious cycle.
Already, AI-generated Bangla content is appearing on the internet, and as AI models continue to learn from these low-quality outputs, the situation will worsen. This phenomenon is not unique to Bangla. Researchers have found that low-resource languages, including many indigenous languages, are facing similar challenges.
The problem extends beyond generative AI. Even widely used translation tools like Google Translate, which increasingly depend on AI, perform significantly worse for Bangla than for languages like French or German. This dual degradation of text and translation is bound to impact the language in the long run, especially as AI becomes more embedded in education, media, and daily communication.
It would be a mistake to assume that AI's performance in Bangla will naturally improve over time. Popular large language models have over 90% of their training data in English. For instance, only about 5% of the training data for Llama 3, the AI model from Facebook's parent company Meta, is non-English.
AI researchers have also discovered that attempts to enhance an AI model's non-English performance can sometimes negatively impact the accuracy of its English outputs. This has further reduced the incentive for major AI companies to invest in Bengali and other similar languages.
Several countries, including France, China, and others in East Asia, are developing AI models that are specifically optimised for their native languages. For example, China has heavily invested in AI projects that are trained primarily on Mandarin, ensuring linguistic self-reliance. Without similar efforts, Bangla's presence in AI-driven technology will remain minimal.
Bangladesh must take strategic steps to safeguard Bangla's digital presence. One of the most critical initiatives would be developing a large-scale, high-quality digital corpus of Bengali literature, news, academic papers, and everyday communication. While vast amounts of Bangla text exist in print, much of it has not been digitised.
Projects similar to Google Books could be undertaken at a national level to digitise and standardise Bangla text. Even if not all content is made publicly available, such a database could be invaluable for training AI models that prioritise Bangla.
Moreover, Bangladesh should actively fund AI research focused on Bangla, collaborating with universities, research institutions, and private tech companies. Government policies should support the development of AI tools in Bangla and ensure that generative AI applications used in the country adhere to linguistic standards rather than amplifying poor-quality content.
As AI becomes essential for tasks like writing, translation, and information retrieval, people who cannot use AI effectively in their language may gradually switch to English in many practical use cases. Already, our younger generation prefers to communicate in English in professional and academic settings due to the limitations of Bangla in digital tools. Without intervention, this trend will only deepen with the rise of AI.
We must recognise that AI is not just a technological advancement; it is a force that will shape culture, communication, and identity. The challenges posed by AI to Bangla are not merely hypothetical; they are already unfolding. The government, academia, and private sector must collaborate to ensure that Bangla is not sidelined in the AI revolution.
Investments in digital archives, AI training datasets, and localised AI models are necessary to maintain the relevance of Bangla as AI becomes an integral part of our lives. Public awareness campaigns should also emphasise the importance of using correct and standard Bangla in digital communication.
While some may still be sceptical of AI, the reality is that large language models are evolving so quickly that there will soon be no clear distinction between AI-generated and human-generated content. If we do not take control of how Bengali appears online and focus on developing models based on Bangla now, we risk letting AI models trained primarily on English dictate how future generations will read and write in Bangla on the internet.
Amio Galib Chowdhury is a graduate research student at Texas State University.
Disclaimer: The views and opinions expressed in this article are those of the authors and do not necessarily reflect the opinions and views of The Business Standard.