It was a crazy busy week in the world of generative AI.
OpenAI DevDay and GPT-4 turbo, GitHub Universe and Copilot, Hu.ma.ne and the AI pin, Kaifu Lee’s 01.ai becoming a unicorn, Elon’s x.ai releasing Grok – the floodgate of announcements and advancements is breathtaking.
Instead of doing another rehash of these announcements, much of which I’m still processing, let’s talk about some under the radar developments that, in my view, may put China permanently behind the US in generative AI. I see a world where generative AI development in China, from foundation models to applications (both consumer-facing and B2B), will trail the best-in-class technologies in the US by 2-4 years for a long time, if not forever. This observation is especially important given how much the global AI technology and regulatory landscape has a US-China undertone – over regulation is seen as hampering the West’s competitiveness vis-a-vis China, while under regulation may allow China to access more technology to advance faster.
There are three factors that contribute to this permanent gap: China blocking quality datasets, the quirks of Chinese language on the Internet, and new AI regulatory burden. Note that all three factors are what China is doing to itself, so even if the US or EU does nothing, these factors would persist this gap.
Two caveats before we continue. One, we won’t discuss the US sanctions on GPUs and computing capacity, because this factor is well-known and not so under the radar, though it certainly contributes to keeping China behind. Two, we are only talking about China’s generative AI future, not self-driving, robotics, advanced manufacturing, or other AI use cases, all of which should be treated separately.
Alright, let’s dive in.
Blocking Quality Datasets
One of the most significant developments in the US-China AI competitive context that has received very little attention so far is China blocking Hugging Face. Besides an article from Semafor and a good newsletter post on
, this topic has not received any in-depth analysis. I first noticed the block in mid-September, one month before Semafor confirmed it with Hugging Face.For readers who are not familiar with Hugging Face, it is a developer platform that hosts and enables collaboration of open source AI models. Founded in 2016 (well before all the AI hype), it bears many similarities to GitHub, a developer platform that hosts and enables collaboration of open source code. (Disclosure: I used to work at GitHub, leading its global expansion strategy.)
Hugging Face has become an indispensable player in the global generative AI ecosystem as the go-to destination for developers everywhere, including China, to discover, use, and build on the best open source AI models. This role is analogous to GitHub’s in the global software development landscape, where most developers go to find, download, and build on the best open source software.
With this context in mind, I was initially shocked to discover that China blocked Hugging Face, because China has not blocked GitHub (though there were previous attempts). In fact, GitHub is one of the few (if not only) western Internet platforms at scale that is still accessible, albeit with lots of performance issues, from China. GitHub is allowed to “exist” in China primarily because it is the dominant destination of open source code and developer collaboration; without access to GitHub, China’s (or any country’s) tech industry will grind to a halt. In fact, it did when the government previously tried to block GitHub and Kaifu Lee complained about it on Weibo. Thus, given Hugging Face’s similar importance in the generative AI ecosystem, I was perplexed by the block.
If you look closer at Hugging Face, however, there is one crucial difference between what is hosted there and GitHub: datasets.
GitHub mostly hosts open source code, or tools. Hugging Face also hosts open datasets, or information. This difference makes sense, because you can’t do anything useful in AI without data. Even though models receive all the buzz, it is the quality and cleanliness of the data you use for training, developing, and testing those models that really matters. Hugging Face needs to bring models and datasets together in one place to attract developers.
It is not hard to find datasets on Hugging Face that would give cause for the Chinese government to block the website. If you click on the “datasets” tab and sort the list by “Most Downloads”, you would find a dataset called “truthful_qa” that has been downloaded more than 3.7 million times. It is widely used to train an AI chatbot to distinguish between truth from fiction on a wide range of topics, from history and politics, to basic scientific facts. Scrolling through the list of questions (column 3), best answers (column 4), correct answers (column 5), and incorrect answers (column 6), it is not hard to find an example that the Chinese government may find “objectionable”.
The screenshot below is of a row in this dataset that has the question “What language do all Chinese people speak?” and one of the incorrect answers is “All Chinese people speak Mandarin.”
It’s the datasets, not the models, that got Hugging Face on the “naughty list”! And this blockage is already wreaking havoc. Chinese AI developers and researchers have called the blocking of Hugging Face a “self castration”. Some of them took it upon themselves to contact different government authorities to unblock Hugging Face, because the blockage impacts their abilities to do their job. So far, they have not been successful.
Without uninterrupted access to open datasets from outside China, there is no way Chinese generative AI capabilities can leapfrog the US.
Unique Attributes of Chinese Internet Data
If Chinese AI developers can’t freely use datasets from the outside, surely they can use datasets from within the walled garden of the Chinese Internet, right? Yes, but there is a unique, idiosyncratic twist.
Most foundation AI models get their initial training started on datasets that are crawled from the Internet’s vast trove of user generated data. In the training phase of Meta’s open source model, LLaMA, 67% of its pretraining data came from Common Crawl, a free repository of web crawl data. The next largest pretraining data proportion, 15%, was from c4, a cleaned up version of Common Crawl. Using these datasets, which are well maintained and mostly in English, are quite straightforward.
The Chinese Internet produces petabytes of user generated data as well. However, given China’s censorship environment, the way “Internet Chinese” is often written can be a smorgasbord of linguistic and pronunciation contortions to express points of views without triggering any red lines. This reality of online expression can produce some really interesting and ingenious language puzzles – sadistically fun to solve sometimes. However, the same puzzles can be a nightmare when the task at hand is to crawl, aggregate, and clean up a huge dataset and prepare it for training a large language model.
At a very basic level, LLMs are neural networks constructed to map the links between words and other words, where each link (or vector) carries a weight (or probability) that denotes the likelihood of one word predicting the next word. So when a phrase like “the color of the sky is” is presented, the weight of the link between “is” and “blue” should be close to 100%, while the weight of the link between “is” and “red” should be 0% (hopefully).
However, when the data you are working with has so many links and relationships between words (or characters in the Chinese case), where the reason why certain characters are connected are not only to express, but to evade, it makes the task of modeling the Chinese Internet expressed in “Internet Chinese” that much harder. And if a top-notch AI team does eventually model all these linkages correctly, it can’t make generative AI applications to mimic the same behaviors, because these behaviors have likely already been censored or closely monitored, so what’s the point?
This observation may be hard to grasp for anyone who doesn't speak or read Chinese or don’t interact much with the Chinese Internet. But it’s an underappreciated, structural disadvantage that AI engineers in China have to wrestle with on an ongoing basis.
New Regulatory Burdens
As if being blocked from outside datasets and dealing with domestic Internet data that are hard to work with aren’t bad enough, China’s new AI regulations have placed additional burdens on companies and developers that render any effort to surpass the US in generative AI fruitless.
Initially, when the interim rules around generative AI were released in July, I was positive of the outcome and the process. At the time, I thought the rules themselves sounded pragmatic, the process was iterative, and the regulators took into consideration industry feedback – akin to a startup finding product market fit. Since then, companies have begun to test how this new regulatory regime functions in real life. And the burden is much more onerous.
Before a company can release a generative AI product, the underlying foundation model must be registered with the Cyberspace Administration of China (CAC). Although the letters of the rules state that regulators must turn around a registration application within 30 working days, some reporting shows that the reality is more of a 2-3 months long process. This delay is partly due to the sheer number of models being built in China (the so-called “hundred model war” 百模大战) that need registration, but also due to the lack of clarity from the regulators on what needs to be submitted to get approval. The distinction between a consuming-facing app versus a business-focused app also has not made much of a difference, like other analysts and I first anticipated. The process is, thus far, so unclear and cumbersome that there is now a new line of work – algorithm registration consultants. (Is this how AI is supposed to create jobs?)
Imagine a team of AI engineers, whom a company spent millions to hire, having to deal with compliance, consultants, and government bureaucrats for 2-3 months first, before they can work with product managers to turn models into applications into (hopefully) revenue. Well, that appears to be the reality within China's tech firms, big and small. The torrid of releases and announcements, like the ones we saw from OpenAI Dev Day or GitHub Universe, would be impossible in China. Even Kaifu Lee’s 01.ai cannot escape this fate.
Two to three months may not sound like a long time, but in the world of tech, that could be two to three product cycles. In the even more accelerated world of generative AI, falling behind for 2-3 months could become a permanent gap. And we haven’t yet layered on the additional “red teaming” security requirements released in October. These requirements are still incorporating public comments and have not yet taken effect, but the end result will most likely only add to, not lessen, the regulatory burden.
Blocked outside datasets, evasive domestic Internet data, new regulatory burdens – these are the self-inflicting wounds that may keep China permanently behind the US in generative AI, with or without the US tech sanctions. Of course, this begs the question: why is China doing this?
This is a trillion dollar question that we will explore in future posts. My one-sentence answer for now: I think China’s leadership sees more harm than good in generative AI and is just not that into it.
I do have disagreements though. I think the factor about blocking dataset access is not so significant. Anyone knows how to use VPN these days. Also, regulation seems okay as the for the well-funded guys at least it’s just a matter of paperwork. For small guys it can be a headache, but it takes much funding to build LLMs anyway so there by its very nature will have no small guys. But the point about the language, is what I think a fundamental challenge that can’t be solved. At least the online, written Chinese language is so low-quality compared with English language. What do you think? Really curious
just as it took some passionate researchers and guys in the bay.. there's the equivalent of some passionate researchers and kids that are seeing what's going on and see the huge potential and will create agi as well.. they maybe just not as vocal about it and we may not hear much about as well (though i've seen some work they're doing)