AI's global village opens wider to more voices
Developers look to break from yoke of English language, cater to all groups of people
By Oasis Hu in Hong Kong | China Daily | Updated: 2024-12-06 07:14
Artificial intelligence engineer Jacky Chan Ho-kit has conflicting feelings about his industry.
While he looks forward to a future where AI reaches its pinnacle — possessing humanlike cognitive capabilities — he is deeply concerned that it will only understand English.
"Given the language status quo, this is highly likely to be a reality rather than just alarmism," he said.
Chan is the chief technology officer at Votee, a Hong Kong-based AI company. He is also a language enthusiast who in his free time follows language bloggers on social media, absorbing their linguistic insights. Through his research, he has learned that many languages are disappearing.
Even though there are around 7,000 languages still in use globally, according to UNESCO's World Atlas of Languages, only 10 boast more than 200 million speakers. UNESCO has said that a language vanishes every two weeks, or about 25 a year.
In the online realm, the disparity in language usage rates is even more pronounced.
Over the last decade, English content has dominated the internet, accounting for 49.4 percent of web content as of Nov 26, more than eight times the share of Spanish, the second most prevalent online language at 6 percent, according to a report by W3Techs, a company that conducts global web surveys.
Meanwhile, the proportion of web pages that use Chinese, the world's second most spoken language with more than 1.1 billion speakers, has plummeted from 4.3 percent in 2013 to 1.2 percent in 2024.
In the realm of AI, prominent large language models, or LLMs, such as OpenAI's GPT-4, Google's Gemini, and Anthropic's Claude all use English as their main language.
Mainstream AI language models, particularly those originating in the West, are made for English-speaking audiences, with translations for other languages serving as only a support function, said Cao Jiannong, chair professor in the Department of Computing at Hong Kong Polytechnic University.
Artificial intelligence is a field devoted to developing technologies that can replicate or even surpass human intelligence. Until this vision is realized, large AI companies will continue to prioritize enhancing AI's intelligence over expanding their services to encompass more languages, Cao added.
Chan, CTO at Votee, agreed that the endgame of AI is humanlike intelligence, but questioned the consequences if such intelligence can only speak English.
"Wouldn't it be even more unfair to non-English speakers? Wouldn't global cultural diversity be greatly eroded? Wouldn't the gap between the world's rich and poor be wider?" Chan said.
Since last year, Votee, which previously concentrated on automated data collection and analysis, has shifted its focus to developing AI services for lesser-used languages.
This year, it unveiled a Cantonese LLM and is actively pursuing clients in Southeast Asia, Africa, and the Chinese mainland. Future initiatives include the launch of LLMs and other AI services for Javanese in Indonesia, Okinawan in the southern region of Japan, and various Chinese dialects including Shanghainese and Hakka.
"In an increasingly polarized world, we aim to utilize technology to bridge this gap," Chan said.
Data scarcity
The cornerstone of training AI lies in data. A significant hurdle in advancing AI's linguistic prowess is the scarcity of data available in numerous languages, Chan said.
Of about 7,000 languages spoken worldwide, nearly 99 percent are considered low-resource languages, as the data available for computational processing and analysis is limited.
The fact that mainstream AI tools predominantly rely on English corpora, or collections of written text, leads to significant problems when they handle other languages, said Ting Paksun, CEO of Votee.
These AI tools often produce inaccurate and biased content, cultural misunderstandings, business errors, and even legal violations, rendering them unsuitable for use in both casual and formal contexts, Ting said.
On the positive side, AI tools hold the potential to streamline operations, boost productivity, and have a direct impact on local economies.
At an investment summit in mid-November in Hong Kong, Daniel Pinto, president of JPMorgan Chase, said that AI contributed approximately $1.3 billion to the group's finances last year, through cost reductions or revenue increases, with projections indicating a rise to $2 billion this year.
Chan warned that regions unable to leverage AI tools due to language limitations are likely to experience decreased productivity in the future.
To avoid lagging behind European and United States tech giants, governments and major tech firms in some regions have initiated the development of LLMs customized to their linguistic needs, Cao from the Hong Kong Polytechnic University said.
The UAE, for instance, introduced Jais, billed as the highest-quality Arabic LLM, in 2023. This year, South Korea's LG Group unveiled Exaone 3, the country's first open-source Korean AI model.
Smaller, nimbler
Many smaller companies around the world are also venturing into the creation of small language models, Cao said.
Asiabots Ltd, a Hong Kong-based artificial intelligence company established in 2017, is one such company.
Chris Shum Chiu-fai, co-founder and CEO of Asiabots, said that the company initially prioritized AI capabilities in Cantonese due to its Hong Kong location. However, over time an increasing number of clients have approached it for AI solutions in various languages.
Its clients include government bodies and private enterprises worldwide, including in Southeast Asia and Europe. Instead of opting for large language models, they prefer small language models tailored to specific scenarios, such as AI-driven customer service, AI speech recognition technology, and AI text-to-speech tools.
Asiabots' clients include the Hong Kong Special Administrative Region government, which asked it to develop AI tools for translation services between Cantonese and Middle Eastern languages. The request followed this year's Policy Address, which called for attracting more Muslim tourists, and encouraged the city's taxi services to offer information in Arabic for visitors from the Middle East.
In July, a tourism company in Kunigami, Okinawa, Japan, engaged Asiabots to develop an AI tool capable of translating multiple languages, including less widely used ones such as Vietnamese.
"Japan is preparing to host the World Expo next year. With the anticipated increase in global tourism, many Japanese companies are seeking AI tools, leading to a surge in requests from Japan recently," Shum said.
Specialized needs
Many mainstream AI tools excel at translating between widely spoken languages such as English and Chinese. However, when faced with less common languages, these tools may falter in recognizing speech and converting it into text, resulting in numerous errors.
The primary issue lies in inadequate data for the specific language, Shum said.
In some instances, countries with limited technological infrastructure may find that their online information is predominantly available in English, rather than their native language, as seen in the Philippines and Mongolia.
Some languages have a variety of pronunciations without standardized characters, such as Minnan, a dialect spoken in southern parts of China.
Other languages are fragmented into numerous dialects. In Indonesia, for example, there are more than 300 dialects, which increase the complexity and diversity of the language.
These challenges can be overcome as long as clients have the financial resources to collect the necessary data, Shum said.
Asiabots accumulates data from extensive research and non-infringing open-source repositories, he said. Clients also provide data to the company or fund it to conduct on-site data collection.
After collecting the data, Asiabots collaborates with local universities and recruits native language speakers to refine and localize AI solutions, aligning them with regional cultures and legal frameworks to overcome cultural barriers.
In the seven years since its inception, Asiabots has expanded its AI's linguistic repertoire to 22 languages, including Indonesian, Filipino, Portuguese and Hindi, as well as less common dialects.
After establishing language capabilities, the company tailors AI software and hardware to meet specific customer requirements.
For instance, for the Okinawa tourism company, Asiabots developed an AI translator capable of translating among five languages: Japanese, Chinese, English, Korean and Vietnamese. These languages can also be swapped for any of the 22 languages in the company's library when required, Shum said.
Endangered languages
While commercial demand ensures the survival of languages with large numbers of speakers, those with few speakers, limited commercial interest, and insufficient technological research are at risk of becoming endangered both online and offline, Chan warned.
UNESCO has a classification system for endangered languages. Ones spoken across all age groups and contexts are considered safe, while languages that children no longer learn as their mother tongue are considered endangered. Those spoken solely by grandparents are in extreme peril, and those lacking speakers face extinction.
Based on this definition, even dialects that are spoken by substantial populations, like Minnan and Hakka, which are primarily used in southern China, face a fight for survival as fewer young people are learning them.
Shum said not preserving an endangered language could lead to a deep sense of regret.
"There are various research directions in AI and we opted to delve into language study from the start, because behind each language lies a unique mode of thought and a profound reservoir of human wisdom," Shum said.
For instance, the Minnan term describing tears as "falling water" reflects a beautiful perspective. Losing such ways of thinking and expression is a loss of culture, and possibly even civilization, Shum said.
Chan said that language is a crucial vessel of intangible cultural heritage, showcasing the history, customs, habits and social relationships of a region, while forming a part of people's individual and collective identity.
"Protecting the cultural value of a language is much more urgent than its commercial worth, yet it often receives inadequate attention," he said.
By preserving the voice and text of a language through a language model, even if the original speakers disappear, people can access its nuances and written form and learn it whenever they want, Chan said.
Money talks
With hundreds of indigenous languages in Africa at risk of extinction, Votee has worked with clients on the continent to assist in language preservation efforts. However, significant challenges stem from Africa's political instability, limited technological proficiency and insufficient technology infrastructure.
In recent years, many clients have asked Asiabots to develop language models for the preservation of endangered languages.
However, all these projects faltered due to a lack of funding for data collection, which involves sending researchers into remote mountainous regions to record voices and then processing and digitizing the recordings, and which might cost millions of dollars.
Francis Fong Po-kiu, honorary president of the Hong Kong Information Technology Federation, said that the governments of smaller language communities should recognize the cultural value inherent in these languages.
Chan proposed that global tech firms, language-focused NGOs, linguists and language enthusiasts collaborate to form communities for mutual support and to encourage the contribution of open-source language data.
When developing its Cantonese LLM, Votee collaborated with Cantonese linguists and enthusiasts to establish a Cantonese-centered community. Subsequently, it open-sourced all the data and models within the LLM.
"Cantonese belongs to everyone, not just a select few — it already lacks resources, so why create additional boundaries?" Chan said.
In July this year, SenseTime, an AI software company in Hong Kong, launched a Thai-language LLM.
Lu Lewei, director of the SenseTime Research Institute, said that the company pays attention to minor languages because equipping AI with multilingual capabilities also helps improve the AI itself.
More importantly, AI was designed to assist humanity, and its future should prioritize broader accessibility and use rather than neglecting some groups, Lu said.
"I believe this is the original intent, also the ultimate goal of humanity's pursuit of technological advancement," Lu said.
oasishu@chinadailyhk.com