China’s censorship regime requires Baidu and other internet companies to block access to certain websites and avoid politically sensitive subjects. The words or phrases that should be blocked can be updated rapidly in response to protests or during special events.
But Jeffrey Ding, an assistant professor at Georgetown University who studies China’s tech industry, says that concerns about censorship do not seem to have slowed the development of large language models in China. He notes that Baidu has made the Ernie language model that underpins its new bot available via an API for some time and that other companies have offered similar models.
Baidu has not given details of Ernie Bot’s training data, but it was most likely scraped from the Chinese internet. That means the bot’s feedstock has largely already been filtered by China’s censorship rules, which, for example, aim to limit criticism of the government.
Censorship might also affect Chinese chatbots in more subtle ways. An academic research project from 2021 that trained algorithms on the Chinese-language version of Wikipedia, which is blocked in China, and Baidu’s Baike, a crowdsourced encyclopedia subject to government censorship, found that using censored training data significantly changed the meaning that AI software assigned to different words.
The algorithm trained on Chinese-language Wikipedia placed the word “democracy” closer to positive words such as “stability.” The algorithm trained on the censored Baike material represented “democracy” as closer to “chaos,” more in line with the policy of China’s government. But because chatbots like ChatGPT can be extremely flexible and remix material in their training data, Baidu has likely had to introduce additional safeguards.
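To make the idea of one word sitting “closer” to another concrete, here is a minimal sketch of how such a comparison is commonly done with word vectors and cosine similarity. The three-dimensional vectors and the names `corpus_a`, `corpus_b`, and `closer_word` are invented purely for illustration; they are not data or code from the study, whose actual method may differ.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, standing in for vectors learned from two
# different corpora. Real embeddings have hundreds of dimensions.
corpus_a = {
    "democracy": [0.9, 0.1, 0.2],
    "stability": [0.8, 0.2, 0.1],
    "chaos": [0.1, 0.9, 0.3],
}
corpus_b = {
    "democracy": [0.2, 0.8, 0.3],
    "stability": [0.8, 0.2, 0.1],
    "chaos": [0.1, 0.9, 0.3],
}

def closer_word(embeddings, target, candidates):
    """Return the candidate whose vector is most similar to the target's."""
    return max(
        candidates,
        key=lambda word: cosine_similarity(embeddings[target], embeddings[word]),
    )

for name, embeddings in (("corpus_a", corpus_a), ("corpus_b", corpus_b)):
    nearest = closer_word(embeddings, "democracy", ["stability", "chaos"])
    print(f"{name}: 'democracy' is closer to '{nearest}'")
```

With these made-up numbers, the same word ends up nearest to a different neighbor depending on which corpus produced the vectors, which is the kind of shift the researchers reported.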
Despite its mixed reception, Ernie Bot appears to be a capable competitor to ChatGPT. The bot is currently available only to a limited number of users, some of whom say they are impressed. ChatGPT is not available in China, although it is capable of conversing in Chinese.
Lei Li, a professor at UC Santa Barbara who specializes in AI and previously worked on some of the machine learning technology behind Ernie Bot, points out that Baidu has been working on the underlying technology for around a decade. Microsoft, by contrast, licensed the core technology for Bing’s new chatbot and some forthcoming text-generation features for Office from OpenAI, in which it has invested billions of dollars in return for exclusive rights to its creations.
Li also says he is impressed with some of what Ernie Bot can do, including its ability to generate stories and business reports. He adds that the hallucination problem is a challenge for all such language models. “This is where researchers still have work to do,” he says.
One WeChat poster compared the Chinese bot’s demoed capabilities to those of ChatGPT and found it better at handling Chinese idioms and more accurate in some instances. For example, ChatGPT incorrectly claimed that the ancestral home of science fiction author Liu Cixin, who wrote The Three-Body Problem, is Hubei, while Ernie Bot correctly answered Henan. ChatGPT is blocked in China, but many people have found ways of accessing it.