The following is a guest post by Catalin Rosu, who along with some colleagues, dug up a ton of data about the HTML content of web sites. This is the most recent study of its kind and wildly fascinating to see the results. I find it especially fun to compare the top results to what I would have guessed would have won.
We've all been there. We try to improve our HTML code making it clean, beautiful, and readable. We do this in pursuit of better semantics and better accessibility, so that everyone can use it. It's our top priority. And we always have questions:
- What is the best way to structure the markup?
- How are others doing it?
Questions like these were running through my mind. I wondered about how people write markup these days, as new web technologies emerge. So, I teamed up with a few of my colleagues at AWRCloud and we came up with a data set of over 8 million pages from Google top twenty results.
The studies that came before this one
Back in 2005, Ian Hickson, the editor of HTML5 specification, made an analysis of a sample of slightly over a billion documents, looking to see what the web is made of. A billion is an enormous number, but to Google, nothing is impossible. With this huge amount of documents, he extracted valuable information about popular class names, elements, attributes, and related metadata. The outstanding results were later published as Web Authoring Statistics, which is still the most powerful web authoring study ever made.
One of the analyses from Web Authoring Statistics that later proved vital for the work in progress HTML5 development, was a list of the most popular class names in those HTML documents. The Opera MAMA crawler also searched for the most common class names and in addition to Google's results and they've published relevant results on the popular ID attribute values given to elements as well.
What does this study add to the conversation?
The data for this study comes from 8,021,323 index pages gathered from the top twenty Google results for about 30 million keywords, chosen by keyword volume. Meaning: we had 30 million keywords. We ran a Google search for each of them and took the URLs for the top 20 results and added them to the list and removed the duplicates.
We can only assume that the relevance of these web pages to the general web population is very high. That is based on the likelihood these are popular and high-trafficked websites commensurate to their search result positions.
How fresh is this data?
The latest data set is from May 20th, 2016.
This new study will never surpass the former study Google made back in 2005. It’s not about overcoming Opera’s great study either. It’s about finding new and relevant insights on the actual markup used by the most popular and successful web pages on the internet.
So, how does the average HTML page look like nowadays? Take a look at the screenshots below and check out the study for the full statistics.
Following our study, we find that the average website index page uses twenty six different different element types.
The twenty six elements used on the most pages, ordered by frequency:
Among the document type declarations that specify which version of (X)HTML a page is using, the latest HTML5 doctype is clearly leading the way.
If we look at all the elements that are specifically about telling browser or search engines about the site and how to style it, we found about 175 million elements, and here's how they broke down:
The breakdown of the 105 million elements for content sectioning looks like this:
Of the billion text content elements:
What's the future of web?
Us web developers and web content creators are curious and interested in usage, statistics, and browser support. These are the things that led to the class names findings back in 2005, names known today as the most popular HTML5 tags.
The web is evolving fast. This isn't new, but it can feel overwhelming. The trends are changing from year to year and as a web content creator, it requires motivation and effort to stay up to date. Think about how the markup and the average web page looked like ten years ago and how a modern web page looks like today.
We also used the study to look at emerging technologies like Web Components. While Web Components allows authors to create arbitrarily named elements, we can look for standards elements used in the creation of Web Components.
Nobody can predict the future. We can only guess how the average web page will look like ten years from now on. Next time we run this study (we're considering quarterly), will we see things like Web Components rise?
And again, the complete data set is here.