It’s estimated that around 63% of living humans now have access to the internet, which would put the number of digitally connected people at just over 5 billion. Those people come from all across the planet and potentially speak thousands of different languages. If you search, you can find hundreds of languages online, from Urdu to Catalan.

But some languages are harder to find than others. To get hard numbers on the discrepancy, Rest of World turned to W3Techs, a web-scanning firm based in Austria, to count all of the publicly accessible web addresses on the internet. Our data shows that a little more than half the sites on the web use English as their primary language. That’s a lot more than one might expect, given that native English speakers make up just under 5% of the global population. Meanwhile, Chinese and Hindi are the second and third most-spoken languages in the world, but the same scan found they account for just 1.4% and 0.07% of domains, respectively.

A line chart showing the relation of languages on the internet to those spoken in real life.

Because the internet is so vast, the data comes with caveats and blind spots (detailed below), but the scan still reveals massive imbalances in language use. Languages like Bengali and Urdu, each spoken by hundreds of millions of people, are nearly impossible to find online. 

W3Techs primarily tracks programming languages used online. It regularly scans publicly available domains and categorizes them by language, providing real-time reports for interested clients. We compared the W3Techs data with spoken-language figures from a survey by Ethnologue, a nonprofit widely considered the world authority on language use. 

Combined, the two data sets suggest significant over- and under-representation. English, German, and Japanese command a much larger portion of the internet than they do among native speakers. By contrast, many non-European languages hardly exist on the internet at all.
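One way to make the comparison concrete is to divide a language's share of websites by its share of native speakers. The snippet below is a minimal sketch of that arithmetic, not the method Rest of World or W3Techs actually used; the function name `representation_ratio` is hypothetical, and the English inputs are rounded from the figures quoted above ("a little more than half" of sites, "just under 5%" of the population).

```python
def representation_ratio(web_share: float, speaker_share: float) -> float:
    """Return a language's share of websites divided by its share of native speakers.

    A value above 1 means the language is over-represented online;
    a value below 1 means it is under-represented.
    """
    if speaker_share <= 0:
        raise ValueError("speaker_share must be positive")
    return web_share / speaker_share


# Assumption: 0.50 and 0.05 are rounded stand-ins for the article's
# "a little more than half" and "just under 5%".
english = representation_ratio(web_share=0.50, speaker_share=0.05)
print(f"English appears online at roughly {english:.0f} times its share of native speakers")
```

Under those rounded inputs, English shows up online at roughly ten times its share of native speakers. The same calculation applied to an under-represented language would yield a value well below 1.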

For some international groups, these discrepancies are an ominous sign for the future. As early as 2003, UNESCO was urging the public and private sectors to maintain online content in the full range of human languages. But as the web has grown, the gap between the languages people speak and the languages used on the internet has only widened.

Bhanu Neupane, a program manager at UNESCO who works on language inequity, told Rest of World we might be moving toward a world where only a handful of languages are meaningfully present online. “The world is converging,” Neupane said. “And after 15 years, there could be just five or 10 languages that are prominently spoken and used in business and online. So we’re very concerned about this.”

Surveys of the problem vary, but UNESCO’s own assessment is consistent with the W3Techs results, showing only 14 languages present on more than 1% of domains.

There are a few caveats you should keep in mind regarding this data set. The data comes from scans of publicly available websites, so anything behind a login, including apps and social networks, is probably going uncounted. (This quirk suggests the scans may be undercounting the Chinese internet, in particular, although it’s hard to know by how much.) Even within web-accessible social networks like Reddit, the scans aren’t designed to go through every page of a domain, which means they may be undercounting non-English communities on English-language sites. There are more details here, but the data should be read as a broad survey of websites, not a precise measurement.

Having said that, the bigger picture is hard to miss. Millions of non-native English speakers and non-English speakers are stuck using the web in a language other than the one they were born into. And because publicly available text on the internet is now often used to train large language models like Bard and GPT-4, the scan suggests we’re already building the same imbalance into technology’s next frontier: artificial intelligence.