{"id":247716,"date":"2016-11-25T07:28:35","date_gmt":"2016-11-25T14:28:35","guid":{"rendered":"http:\/\/css-tricks.com\/?p=247716"},"modified":"2016-11-25T07:28:35","modified_gmt":"2016-11-25T14:28:35","slug":"random-interesting-facts-htmlsvg-usage","status":"publish","type":"post","link":"https:\/\/css-tricks.com\/random-interesting-facts-htmlsvg-usage\/","title":{"rendered":"Random Interesting Facts on HTML\/SVG usage"},"content":{"rendered":"

Last time, we saw what the average web page<\/a> looks like, using data from about 8 million websites<\/a>. That’s a lot of data, and we’ve continued to sift through it. This time, we’re back to showcase some random and hopefully interesting facts on markup usage.<\/p>\n

<\/p>\n

Hiding DOM elements<\/h3>\n

There are various ways of hiding DOM elements<\/a>: completely, semantically, or visually.<\/p>\n

Considering the current practices and recommendations, check out the findings on the most used methods to hide things via HTML or CSS:<\/p>\n\n\n\n\n\n\n\n\n\n\n\n\n
Selector<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
[aria-hidden]<\/code><\/td>\n2,609,973<\/strong><\/td>\n<\/tr>\n
.hidden<\/code><\/td>\n1,556,017<\/strong><\/td>\n<\/tr>\n
.hide<\/code><\/td>\n1,389,540<\/strong><\/td>\n<\/tr>\n
.sr-only<\/code><\/td>\n583,126<\/strong><\/td>\n<\/tr>\n
.visually-hidden<\/code><\/td>\n136,635<\/strong><\/td>\n<\/tr>\n
.visuallyhidden<\/code><\/td>\n116,269<\/strong><\/td>\n<\/tr>\n
.invisible<\/code><\/td>\n113,473<\/strong><\/td>\n<\/tr>\n
[hidden]<\/code><\/td>\n31,290<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

no-js HTML class<\/h3>\n

When JavaScript libraries like Modernizr run, the no-js<\/code> class is removed and it’s replaced with js<\/code>. This way you can apply different CSS rules depending on whether JavaScript is enabled or not in your browser.<\/p>\n
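A minimal sketch of that class swap, assuming a plain string helper. The name swapNoJsClass is hypothetical and is not Modernizr's own code; it only illustrates the idea of replacing one class token with another.

```javascript
// Hypothetical helper: replace the no-js token with js in a class string.
function swapNoJsClass(className) {
  return className
    .split(/\s+/)
    .filter(Boolean)
    .map((token) => (token === 'no-js' ? 'js' : token))
    .join(' ');
}

// In a browser, this would run as early as possible in the head.
// The guard keeps the snippet harmless outside a browser.
if (typeof document !== 'undefined') {
  document.documentElement.className = swapNoJsClass(
    document.documentElement.className
  );
}
```

Only the exact no-js token is replaced, so a class such as no-js-banner would be left alone.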

We found a total number of 844,242 elements whose HTML class list contains the no-js<\/code> string. More than 92% of them are html<\/code> elements.<\/p>\n

If you’re wondering about the remaining 8%, check out the top 10:<\/p>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
html<\/code><\/td>\n782,613<\/strong><\/td>\n<\/tr>\n
body<\/code><\/td>\n31,256<\/strong><\/td>\n<\/tr>\n
a<\/code><\/td>\n17,833<\/strong><\/td>\n<\/tr>\n
div<\/code><\/td>\n7,364<\/strong><\/td>\n<\/tr>\n
meta<\/code><\/td>\n1,104<\/strong><\/td>\n<\/tr>\n
ul<\/code><\/td>\n905<\/strong><\/td>\n<\/tr>\n
li<\/code><\/td>\n789<\/strong><\/td>\n<\/tr>\n
nav<\/code><\/td>\n768<\/strong><\/td>\n<\/tr>\n
span<\/code><\/td>\n431<\/strong><\/td>\n<\/tr>\n
article<\/code><\/td>\n263<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

noscript<\/h3>\n

The HTML noscript<\/code> element defines a section of markup that acts as alternate content for users who have client-side scripting disabled, or whose browser lacks support for it. The client-side scripting language is usually JavaScript.<\/p>\n

We found 3,536,247 noscript<\/code> elements within the 8 million top twenty Google results.<\/p>\n

AMP<\/h3>\n

Accelerated Mobile Pages (AMP) is a Google initiative which aims to speed up the mobile web. Most publishers are making their content available in parallel in the AMP format.<\/p>\n

To let Google and other platforms know about it, you need to link the AMP and non-AMP pages<\/a> together.<\/p>\n

Within the 8 million pages we looked at, we found only 1,944 non-AMP pages referencing their AMP version using rel=amphtml<\/code>.<\/p>\n

Links attributes & values<\/h3>\n

href=\"javascript:void(0)\"<\/h4>\n

We found 2,002,716 a<\/code> elements with href=\"javascript:void(0)\"<\/code>. Whether you’re coding a button or coding a link, you’re doing it wrong<\/a>.<\/p>\n

\nhref=”javascript:void(0)”
\n(a) You’re coding a button with the wrong element
\n(b) You’re coding a link with the wrong technology
\n–
Heydon Pickering<\/a><\/em>\n<\/p><\/blockquote>\n

target=_blank w\/ or w\/o rel=noopener<\/h4>\n

43,924,869 of the anchors we analyzed use target=\"_blank\"<\/code>, but only 35,604 of them also set rel=\"noopener\"<\/code>. If rel=\"noopener\"<\/code> is missing, the opened page can reach your page through window.opener, which leaves your users open to a phishing attack and is considered a security vulnerability.<\/p>\n\n\n\n\n\n\n\n
Anchor\/Link<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
[target=_blank]<\/code><\/td>\n43,924,869<\/strong><\/td>\n<\/tr>\n
[rel=noopener]<\/code><\/td>\n40,756<\/strong><\/td>\n<\/tr>\n
[target=_blank][rel=noopener]<\/code><\/td>\n35,604<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

MDN<\/a>:<\/p>\n

When using target you should consider adding rel=”noopener noreferrer” to avoid exploitation of the window.opener API.<\/p><\/blockquote>\n

Ben Halpern<\/a> and Mathias Bynens<\/a> also wrote some good articles on this matter and the common advice is: don\u2019t use target=_blank<\/code>, unless you have good reasons.<\/p>\n
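If you do keep target=_blank, one possible way to follow the MDN advice is to make sure the rel value always carries both tokens. This is only a sketch; withSafeRel is a hypothetical helper name, not something from the articles mentioned here.

```javascript
// Hypothetical helper: return a rel value that always includes
// noopener and noreferrer, keeping any tokens already present.
function withSafeRel(rel) {
  const tokens = new Set((rel || '').split(/\s+/).filter(Boolean));
  tokens.add('noopener');
  tokens.add('noreferrer');
  return Array.from(tokens).join(' ');
}

// In a browser, patch every link that opens a new window.
if (typeof document !== 'undefined') {
  document.querySelectorAll('a[target="_blank"]').forEach((link) => {
    link.rel = withSafeRel(link.getAttribute('rel'));
  });
}
```

Fixing the markup at the source is still preferable, since a script like this only helps once it has run.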

href=#top<\/h4>\n

It seems to be common practice to use #top<\/code> as an href<\/code> value to send the user back to the top of the current page. We found 377,486 a<\/code> elements with an href=#top<\/code> value.<\/p>\n

lang<\/h4>\n

L\u00e9onie Watson<\/a>:<\/p>\n

The HTML lang attribute is used to identify the language of text content on the web. This information helps search engines return language specific results, and it is also used by screen readers that switch language profiles to provide the correct accent and pronunciation.<\/p><\/blockquote>\n

Of the 8,021,323 pages that we were able to look into, 5,368,133 use the lang<\/code> attribute on the html element. That\u2019s about 67%! <\/p>\n

div<\/h3>\n

The average web page has around 72 div<\/code>s. This number comes from dividing the count of all the div<\/code> elements (576,067,185) by the 8,021,323 pages they were found in, which gives roughly 71.8.<\/p>\n
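The arithmetic behind that average is a single division. Here it is as a quick check, using only the two totals quoted in this section:

```javascript
// Totals quoted in the article.
const totalDivs = 576067185;
const totalPages = 8021323;

// Average number of div elements per page.
const divsPerPage = totalDivs / totalPages;

console.log(divsPerPage.toFixed(1)); // prints 71.8
```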

header vs footer<\/h3>\n

2,358,071 pages use the header<\/code> element, while footer<\/code> is used by 2,363,665 pages. Only 2,117,448 pages use both header<\/code> and footer<\/code>.<\/p>\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
footer<\/code><\/td>\n2,363,665<\/strong><\/td>\n<\/tr>\n
header<\/code><\/td>\n2,358,071<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

Links are not buttons<\/h3>\n

Neither are div<\/code>s and span<\/code>s.<\/p>\n\n\n\n\n\n\n\n\n\n\n
Element<\/th>\nAttribute & Value<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
a<\/code><\/td>\nclass=btn<\/td>\n3,251,114<\/td>\n<\/tr>\n
a<\/code><\/td>\nclass=button<\/td>\n2,776,660<\/td>\n<\/tr>\n
span<\/code><\/td>\nclass=button<\/td>\n292,168<\/td>\n<\/tr>\n
div<\/code><\/td>\nclass=button<\/td>\n278,996<\/td>\n<\/tr>\n
span<\/code><\/td>\nclass=btn<\/td>\n202,054<\/td>\n<\/tr>\n
div<\/code><\/td>\nclass=btn<\/td>\n131,950<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

For comparison, here are the statistics for native buttons:<\/p>\n\n\n\n\n\n\n\n
Selector<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
button<\/code><\/td>\n4,237,743<\/td>\n<\/tr>\n
input[type=image]<\/code><\/td>\n1,030,802<\/td>\n<\/tr>\n
input[type=button]<\/code><\/td>\n916,268<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

Buttons without a specified type<\/h3>\n

Speaking of buttons, the button<\/code> element has a default type<\/a> of submit<\/code>. Always specify the button type: we found around 1,336,990 button<\/code> elements with a missing type<\/code> attribute. That’s around 31.5% of all the button<\/code>s found in the wild.<\/p>\n

BEM syntax<\/h3>\n

If you’re a CSS addict, you may have heard about BEM, which is a popular naming convention for HTML classes.<\/p>\n

BEM class names contain double underscores and\/or double dashes. Matching on those strings, we estimate that only 20,463 elements use the BEM naming style.<\/p>\n
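A rough idea of what such a guess could look like in code. This is an assumption for illustration, not the detection logic actually used for the numbers here; looksLikeBem and hasBemClass are hypothetical names.

```javascript
// Hypothetical check: a class token looks BEM-like when it contains
// a double underscore (element) or a double dash (modifier).
function looksLikeBem(classToken) {
  return classToken.includes('__') || classToken.includes('--');
}

// True when at least one token of a class attribute looks BEM-like.
function hasBemClass(classAttribute) {
  return classAttribute.split(/\s+/).filter(Boolean).some(looksLikeBem);
}
```

A heuristic like this will also match class names that merely happen to contain those character pairs, which is why the result is best read as an estimate.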

Bootstrap & Font Awesome<\/h3>\n

Apparently, we found only 1,711 pages that link to CSS or JavaScript resources that contain the bootstrap[.min.].js|.css<\/code>. Also, it looks like 379 pages link to CSS resources that contain the font-awesome[.min.].css<\/code>.<\/p>\n

I would have expected more.<\/p>\n

WordPress<\/h3>\n

1,866,241 pages, from the total that we analyzed, contain <meta name=\"generator\" content=\"*WordPress*\"><\/code>. We can only assume there are more that use WordPress, but some chose to remove this meta info from their sources.<\/p>\n

.clearfix VS .clear VS .cf<\/h3>\n

There are many names for this well-known CSS utility that helps clear floats. Here’s the breakdown:<\/p>\n\n\n\n\n\n\n\n
Selector<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
.clearfix<\/code><\/td>\n19,573,521<\/strong><\/td>\n<\/tr>\n
.clear<\/code><\/td>\n10,925,887<\/strong><\/td>\n<\/tr>\n
.cf<\/code><\/td>\n1,102,698<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

Favicon<\/h3>\n

Modern browsers fetch \/favicon.ico<\/code> automatically and asynchronously. So don’t manually specify its root location; just place the file there. Link to it explicitly only if, for some reason, you prefer a different location.<\/p>\n

It looks like 354,024 publishers still link the \/favicon.ico<\/code> in the head<\/code>.<\/p>\n

Void elements<\/h3>\n

To close or not to close the void elements<\/a>, that is the question. Although fine with HTML either way, it is recommended to not close the void elements. At least for the sake of brevity.<\/p>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
<img\/><\/code><\/td>\n121,463,561<\/strong><\/td>\n<\/tr>\n
<br\/><\/code><\/td>\n67,595,285<\/strong><\/td>\n<\/tr>\n
<link\/><\/code><\/td>\n61,746,984<\/strong><\/td>\n<\/tr>\n
<meta\/><\/code><\/td>\n46,688,572<\/strong><\/td>\n<\/tr>\n
<br><\/code><\/td>\n34,492,680<\/strong><\/td>\n<\/tr>\n
<input\/><\/code><\/td>\n27,845,389<\/strong><\/td>\n<\/tr>\n
<img><\/code><\/td>\n17,844,667<\/strong><\/td>\n<\/tr>\n
<meta><\/code><\/td>\n15,133,457<\/strong><\/td>\n<\/tr>\n
<link><\/code><\/td>\n11,740,839<\/strong><\/td>\n<\/tr>\n
<input><\/code><\/td>\n7,231,827<\/strong><\/td>\n<\/tr>\n
<hr\/><\/code><\/td>\n2,610,890<\/strong><\/td>\n<\/tr>\n
<hr><\/code><\/td>\n1,690,891<\/strong><\/td>\n<\/tr>\n
<param\/><\/code><\/td>\n1,390,339<\/strong><\/td>\n<\/tr>\n
<area\/><\/code><\/td>\n1,336,974<\/strong><\/td>\n<\/tr>\n
<area><\/code><\/td>\n1,025,183<\/strong><\/td>\n<\/tr>\n
<param><\/code><\/td>\n698,611<\/strong><\/td>\n<\/tr>\n
<source\/><\/code><\/td>\n435,877<\/strong><\/td>\n<\/tr>\n
<base\/><\/code><\/td>\n389,717<\/strong><\/td>\n<\/tr>\n
<embed\/><\/code><\/td>\n304,954<\/strong><\/td>\n<\/tr>\n
<source><\/code><\/td>\n286,380<\/strong><\/td>\n<\/tr>\n
<wbr><\/code><\/td>\n237,606<\/strong><\/td>\n<\/tr>\n
<col\/><\/code><\/td>\n151,757<\/strong><\/td>\n<\/tr>\n
<col><\/code><\/td>\n145,434<\/strong><\/td>\n<\/tr>\n
<base><\/code><\/td>\n105,688<\/strong><\/td>\n<\/tr>\n
<wbr\/><\/code><\/td>\n77,922<\/strong><\/td>\n<\/tr>\n
<embed><\/code><\/td>\n56,610<\/strong><\/td>\n<\/tr>\n
<track\/><\/code><\/td>\n376<\/strong><\/td>\n<\/tr>\n
<track><\/code><\/td>\n310<\/strong><\/td>\n<\/tr>\n
<keygen\/><\/code><\/td>\n1<\/strong><\/td>\n<\/tr>\n
<keygen><\/code><\/td>\n–<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

tabindex<\/h3>\n

Hijacking the tab order with tabindex<\/code> to fix some disconnected UI elements usually only pushes the issue up to the document level<\/a>.<\/p>\n

The common advice is to use it with caution. We did notice that 552,109 HTML elements are using the tabindex<\/code> attribute to override the defaults when navigating with a keyboard.<\/p>\n

Missing alt<\/code> for images<\/h3>\n

Judging by this set of data, this eternal SEO and accessibility issue is still pretty common. Of the 139,308,228 images in total, almost half (about 47%) are missing the alt<\/code> attribute or use it with a blank value.<\/p>\n\n\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
img<\/code><\/td>\n139,308,228<\/strong><\/td>\n<\/tr>\n
img alt=\"*\"<\/code><\/td>\n73,430,818<\/strong><\/td>\n<\/tr>\n
img alt=\"\"<\/code><\/td>\n32,603,650<\/strong><\/td>\n<\/tr>\n
img w\/ missing alt<\/code><\/td>\n33,273,760<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

Custom elements<\/h3>\n

Excluding the Web Component tags, here is a list of made up tags or custom elements, different to MDN’s HTML element reference<\/a>.<\/p>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
<o:p><\/code><\/td>\n808,253<\/strong><\/td>\n<\/tr>\n
<g:plusone><\/code><\/td>\n273,166<\/strong><\/td>\n<\/tr>\n
<fb:like><\/code><\/td>\n111,806<\/strong><\/td>\n<\/tr>\n
<asp:label><\/code><\/td>\n76,501<\/strong><\/td>\n<\/tr>\n
<inline><\/code><\/td>\n53,026<\/strong><\/td>\n<\/tr>\n
<noindex><\/code><\/td>\n51,604<\/strong><\/td>\n<\/tr>\n
<icon><\/code><\/td>\n42,703<\/strong><\/td>\n<\/tr>\n
<block><\/code><\/td>\n34,167<\/strong><\/td>\n<\/tr>\n
<red><\/code><\/td>\n33,424<\/strong><\/td>\n<\/tr>\n
<ss><\/code><\/td>\n27,451<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

We did find 21,403 h7<\/code> elements too.<\/p>\n

A11Y<\/h3>\n

First rule of ARIA use<\/a>:<\/p>\n

If you can use a native HTML element [HTML51] or attribute with the semantics and behaviour you require already built in, instead of re-purposing an element and adding an ARIA role, state or property to make it accessible, then do so.<\/p><\/blockquote>\n

Landmark roles<\/h4>\n

ARIA Landmark Roles<\/a> help users of assistive technology navigate your site.<\/p>\n

You might have seen this warning message when validating<\/a> a document: “The banner role is unnecessary for element header”. The validator warns because header already maps to the banner role implicitly. However, browsers like iOS Safari do not currently support these implicit mappings, so for now it’s good practice to keep adding the landmark roles and ignore the HTML validation warnings.<\/p>\n

Regarding the HTML5 implicit mappings, here are the stats:<\/p>\n\n\n\n\n\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
<nav role=navigation><\/code><\/td>\n1,144,750<\/strong><\/td>\n<\/tr>\n
<header role=banner><\/code><\/td>\n675,970<\/strong><\/td>\n<\/tr>\n
<footer role=contentinfo><\/code><\/td>\n613,454<\/strong><\/td>\n<\/tr>\n
<main role=main><\/code><\/td>\n236,484<\/strong><\/td>\n<\/tr>\n
<article role=article><\/code><\/td>\n129,845<\/strong><\/td>\n<\/tr>\n
<aside role=complementary><\/code><\/td>\n105,627<\/strong><\/td>\n<\/tr>\n
<section role=region><\/code><\/td>\n4,326<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

autoplay<\/h4>\n

Video and audio autoplay<\/code> is considered a bad practice, not only for accessibility, but also for usability.<\/p>\n

So, don\u2019t auto-play<\/a> and it will please all of your users.<\/p>\n

Check out below the findings from the total of 108,321 video<\/code> and 64,212 audio<\/code> elements.<\/p>\n\n\n\n\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
<video autoplay><\/code><\/td>\n31,653<\/strong><\/td>\n<\/tr>\n
<video autoplay=true><\/code><\/td>\n5,601<\/strong><\/td>\n<\/tr>\n
<audio autoplay><\/code><\/td>\n2,595<\/strong><\/td>\n<\/tr>\n
<audio autoplay=true><\/code><\/td>\n339<\/strong><\/td>\n<\/tr>\n
<video autoplay=false><\/code><\/td>\n79<\/strong><\/td>\n<\/tr>\n
<audio autoplay=false><\/code><\/td>\n22<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

maximum-scale<\/h4>\n

maximum-scale defines the maximum zoom level, and when set to maximum-scale=1<\/code> it prevents users from zooming your page. You shouldn’t do that: zooming is an important accessibility feature that a lot of people rely on.<\/p>\n

Warning from HTML 5.2 Editor\u2019s Draft, 4 October 2016<\/a>:<\/p>\n

Authors should not suppress or limit the ability of users to resize a document, as this causes accessibility and usability issues.<\/p><\/blockquote>\n

However, we did find 1,047,294 websites using maximum-scale=1<\/code> and 87,169 websites with a user-scalable=no<\/code> value set. At the same time, 326,658 pages are using both maximum-scale=1<\/code> and user-scalable=no<\/code>.<\/p>\n
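To spot these settings in your own pages, you could parse the content value of the viewport meta tag. The sketch below assumes comma-separated key=value pairs; parseViewport and blocksZoom are hypothetical helper names.

```javascript
// Hypothetical parser for the content value of a viewport meta tag.
function parseViewport(content) {
  const settings = {};
  content.split(',').forEach((pair) => {
    const [key, value] = pair
      .split('=')
      .map((part) => part.trim().toLowerCase());
    if (key) settings[key] = value;
  });
  return settings;
}

// True when the viewport settings stop users from zooming in.
function blocksZoom(content) {
  const settings = parseViewport(content);
  const maxScale = parseFloat(settings['maximum-scale']);
  // A missing maximum-scale parses to NaN, and NaN <= 1 is false.
  return settings['user-scalable'] === 'no' || maxScale <= 1;
}
```

For example, blocksZoom('width=device-width, initial-scale=1') returns false, while adding maximum-scale=1 or user-scalable=no makes it return true.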

role=button<\/h4>\n

Setting role=button<\/code> for a button<\/code> is allowed but not recommended as the button<\/code> already has role=button<\/code> as default implicit ARIA semantic<\/a>. Still, we did find 26,360 button<\/code> elements having set a role=button<\/code>. <\/p>\n

Here’s a breakdown on other notable elements, whose behavior was overridden by role=button<\/code>:<\/p>\n\n\n\n\n\n\n\n\n
Element<\/th>\nCount<\/th>\n<\/tr>\n<\/thead>\n
<a role=button><\/code><\/td>\n577,905<\/strong><\/td>\n<\/tr>\n
<div role=button><\/code><\/td>\n85,565<\/strong><\/td>\n<\/tr>\n
<span role=button><\/code><\/td>\n21,466<\/strong><\/td>\n<\/tr>\n
<input role=button><\/code><\/td>\n8,286<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

On making clickable things correctly, MDN sums it up<\/a>:<\/p>\n

Be careful when marking up links with the button role. Buttons are expected to be triggered using the Space key, while links are expected to be triggered using the Enter key. In other words, when links are used to behave like buttons, adding role=”button” alone is not sufficient. It will also be necessary to add a key event handler that listens for the Space key in order to be consistent with native buttons.<\/p><\/blockquote>\n
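A sketch of the key handler the quote describes, for a link that carries role=button. The helper names are hypothetical. It listens only for Space, because a link already reacts to Enter on its own.

```javascript
// Hypothetical helper: true for the Space key.
// 'Spacebar' is the value reported by some older browsers.
function isSpaceKey(event) {
  return event.key === ' ' || event.key === 'Spacebar';
}

// Make a link with role=button react to Space like a native button.
function addSpaceActivation(link) {
  link.addEventListener('keydown', (event) => {
    if (isSpaceKey(event)) {
      event.preventDefault(); // keep Space from scrolling the page
      link.click();           // reuse the existing click behavior
    }
  });
}
```

Using a real button element avoids all of this, which is the point of the first rule of ARIA.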

SVG<\/h3>\n

There are several ways of including SVG in HTML. Adding them all up, we found a total of 5,610,764 SVG references<\/a>.<\/p>\n\n\n\n\n\n\n\n\n\n
How to use SVG<\/th>\n%<\/th>\n<\/tr>\n<\/thead>\n
Inline SVG code within HTML<\/td>\n97.05%<\/strong><\/td>\n<\/tr>\n
Using SVG as an <img><\/code><\/td>\n2.88%<\/strong><\/td>\n<\/tr>\n
Using SVG as an <object><\/code><\/td>\n0.05%<\/strong><\/td>\n<\/tr>\n
Using SVG as an <embed><\/code><\/td>\n0.02%<\/strong><\/td>\n<\/tr>\n
Using SVG as an <iframe><\/code><\/td>\n–<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n

Combined, the object, iframe and embed methods account for well under 1% of usage.<\/p>\n

data-*=svg<\/h4>\n

There are 17,920 elements whose data-*<\/code> attribute value contains the string svg<\/code>. Most of the elements are <svg><\/code> or <img><\/code>.<\/p>\n

Top 5 data-* values:<\/p>\n

    \n
  1. http:\/\/www.w3.org\/2000\/svg<\/code> – 471<\/li>\n
  2. hg-svg<\/code> – 127<\/li>\n
  3. svg-siteline-facebook<\/code> – 114<\/li>\n
  4. icon-facebook.svg<\/code> – 95<\/li>\n
  5. twitter.svg<\/code> – 95<\/li>\n<\/ol>\n

    id*=svg<\/h4>\n

    There are 141,813 elements whose id<\/code> attribute value contains the string “svg”. Most of the elements are <svg><\/code> or its inner elements<\/a>. <\/p>\n

    Top 5 id values:<\/p>\n

      \n
    1. emotion-header-title-svg<\/code> – 16,281<\/li>\n
    2. cicLeftBtnSVG<\/code> – 5,793<\/li>\n
    3. cicPauseBtnSVG<\/code> – 5,793<\/li>\n
    4. cicPlayBtnSVG<\/code> – 5,793<\/li>\n
    5. cicRightBtnSVG<\/code> – 5,793<\/li>\n<\/ol>\n

      class*=svg<\/h4>\n

      There are 329,004 elements whose class attribute value contains the string “svg”. Most of the elements are <svg><\/code>, <i><\/code>, <img><\/code> or inner elements<\/a>. <\/p>\n

      Top 10 class values:<\/p>\n

        \n
      1. sqs-svg-icon--social<\/code> – 58,501<\/li>\n
      2. nav_svg<\/code> – 29,826<\/li>\n
      3. svg<\/code> – 28,604<\/li>\n
      4. mk-svg-icon<\/code> – 24,193 <\/li>\n
      5. svg-icon<\/code> – 12,255<\/li>\n
      6. icon_svg<\/code> – 7,956<\/li>\n
      7. ico ico-svg ico-40-svg<\/code> – 3,980<\/li>\n
      8. svg temp_no_id temp_no_img_src<\/code> – 3,794<\/li>\n
      9. svgIcon-use<\/code> – 3,127<\/li>\n
      10. svg temp_no_img_src<\/code> – 3,017<\/li>\n<\/ol>\n

        Regarding the list above, it may be worth mentioning that sqs-svg-icon--social<\/code> is a (BEM-like) naming convention used by Squarespace website templates.<\/p>\n

        currentColor<\/h4>\n

        There are 868,194 SVG inner elements that contain the value currentColor<\/code>, mainly for the fill<\/code> or stroke<\/code> attributes. <\/p>\n

        Top 10 SVG elements<\/h4>\n
          \n
        1. <symbol><\/code> – 845,626<\/li>\n
        2. <path><\/code> – 12,834<\/li>\n
        3. <g><\/code> – 6,073<\/li>\n
        4. <path><\/code> – 3,207<\/li>\n
        5. <circle><\/code> – 1,885<\/li>\n
        6. <svg><\/code> – 1,061<\/li>\n
        7. <polygon><\/code> – 575<\/li>\n
        8. <rect><\/code> – 480<\/li>\n
        9. <line><\/code> – 412<\/li>\n
        10. <use><\/code> – 206<\/li>\n<\/ol>\n

          SVG as background-image (The journey!)<\/h3>\n

          To figure out if an element used SVG for a background-image, things were more complicated. Most of our data only used the HTML documents, but we worked out a solution to get the active stylesheets. <\/p>\n

          Of the 6,359,031 domains we were able to gather data from, 84.5% (5,371,806) use HTML elements with CSS background images, whilst only about 1.3% (82,008) use at least one SVG background image.<\/p>\n

          Also, of the 92,388,615 HTML elements with CSS background images, 0.5% (439,447) use an SVG background image.<\/p>\n

          The process<\/h4>\n

          We went through all of the HTML files and transformed local\/relative CSS file references into absolute ones, e.g. <link rel=\"stylesheet\" href=\"style.css\"><\/code> became <link rel=\"stylesheet\" href=\"http:\/\/www.domain.com\/style.css\"><\/code>. <\/p>\n
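The same kind of rewrite can be illustrated with the standard URL constructor, which resolves a relative reference against a base URL. This is only an illustration in JavaScript, not the tooling used for the actual processing; toAbsoluteUrl is a hypothetical name and the second and third URLs below are made up.

```javascript
// Resolve a stylesheet href against the URL of the page that references it.
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

console.log(toAbsoluteUrl('style.css', 'http://www.domain.com/'));
// prints http://www.domain.com/style.css

console.log(toAbsoluteUrl('/css/main.css', 'http://www.domain.com/blog/post.html'));
// prints http://www.domain.com/css/main.css
```

References that are already absolute pass through unchanged, so the helper can be applied to every link element without checking first.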

          This took some time, since we sampled a couple of the results from our first runs, found inconsistencies and had to restart the process. With a zipped file size of 65GB (323GB unzipped), it was no surprise that processing needed a couple of days to produce the above set of results. <\/p>\n

          Trying and aborting PhantomJS<\/h4>\n

          Since background images can be applied via CSS, we needed something to render the DOM and apply styles to it. We thought of a tool we were very familiar with: PhantomJS<\/a>. We ran a couple of tests with actual pages and saw that everything seemed to work properly. We then built our Java client to interface with the PhantomJS webserver: starting, opening pages, extracting output, handling responses, saving results and then cleaning up, but ran into disastrous performance results when trying to use and scale the rendering process on even one machine. <\/p>\n

          Rendering one HTML file would take anything from a couple of seconds to a couple of minutes and we had no way of knowing what PhantomJS was doing. This, coupled with the fact that the resources usage goes up exponentially the larger the DOM is, caused us to ditch it and look for alternatives.<\/p>\n

          Better luck with Selenium<\/h4>\n

          As luck would have it, a colleague was experimenting with Selenium<\/a> on top of headless Chrome<\/a>. Since he had encouraging results in all areas where PhantomJS was lacking, we thought about leaving the Java-do-it-all comfort zone and delegating stuff to other tools if needed. The test results were very promising – headless Chrome looked like it suited our needs marvelously: super fast startup time, great rendering time, and full control over stopping a process. <\/p>\n

          The Selenium web driver would actually close the binary, as opposed to us sending an exit<\/code> command to PhantomJS and hoping it wasn’t in 100% load so it would actually process it. This allowed us to control each process individually, without having to use killall<\/code> every couple of minutes and stopping all processes in case just one of them went rogue and throttled the CPU.<\/p>\n

          The only problem with this approach was that the JavaScript could no longer be contained in a single, standalone JS file we’d pass onto the PhantomJS executable, but had to be included inline in the actual HTML files. Here’s a simplified version of the script we used, relying on the Window.getComputedStyle()<\/a> method:<\/p>\n

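          \/\/ Collect the background-image URL of every element in the document.\r\nlet backgroundImages = [],\r\n    allElem = document.querySelectorAll(\"*\"),\r\n    allElemLength = allElem.length;\r\n\r\nfor (let i = 0; i < allElemLength; i++) {\r\n  let style = window.getComputedStyle(allElem[i]),\r\n      backgroundImage = style.backgroundImage;\r\n  \/\/ Skip elements without a background image.\r\n  if (backgroundImage === \"none\") continue;\r\n  \/\/ Strip the leading url( and the trailing ).\r\n  backgroundImages.push(backgroundImage.slice(4, -1));\r\n}<\/code><\/pre>\n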
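Once the computed values are collected, each one still has to be classified. The sketch below shows one way to do that, assuming a single background per element and treating a .svg file extension or an SVG data URI as the signal. Both helper names are hypothetical, and this is not necessarily how the published numbers were derived.

```javascript
// Hypothetical helper: pull the URL out of a computed background-image
// value such as url("icons/logo.svg"). Returns null for values that
// are not a single url(), for example none or a gradient.
function extractBackgroundUrl(backgroundImage) {
  const match = /^url\((['"]?)(.*?)\1\)$/.exec(backgroundImage.trim());
  return match ? match[2] : null;
}

// True when the background image points at an SVG resource.
function isSvgBackground(backgroundImage) {
  const url = extractBackgroundUrl(backgroundImage);
  if (url === null) return false;
  return /\.svg([?#]|$)/i.test(url) || url.startsWith('data:image/svg+xml');
}
```

Elements with several comma-separated backgrounds would need the value split first, which is left out here to keep the example short.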

          Saving data would be done by calling a simple PHP script. We ran a couple of larger-scale tests to validate our choice and everything performed flawlessly, so we went on with setting up a scalable environment.<\/p>\n

          We processed all HTML files (again) and injected the above JavaScript snippet. The next challenge was uploading everything to Amazon. S3Browser<\/a>, which we use for “casual” listing and downloading\/uploading, didn’t seem fast enough for this job (not the free version, at least). So, we looked for an alternative and came across s3-parallel-put<\/a>. <\/p>\n

          We set it up on a local Linux machine, moved over the SSD and had 65GB worth of zipped text data uploaded in no time. It crippled our machine and the local Jenkins server that was running on it – until we upgraded the old Q9550 CPU :). <\/p>\n

          The problems showed up when starting to scale up. We saw that our single web server would become overwhelmed and stop saving results, even though the Selenium driver was reporting the page had rendered successfully. This also meant many of our queue messages would be wasted (consumed and deleted from the queue), without producing any results.<\/p>\n

          We thus decided to keep closer track of processed\/unprocessed files by using Redis<\/a>: each time we started processing a file, we inserted the domain name into a Redis set. Each time a file finished processing (that is, our PHP script was called), we inserted the domain name into another Redis set. The point was to keep the difference between the two to a minimum (anything over a certain value would mean something wasn’t working properly) and to make retrying easy if it was ever needed. <\/p>\n
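The bookkeeping can be pictured with two plain JavaScript sets standing in for the two Redis sets. This is an in-memory illustration only. With Redis itself, the equivalent operations would be SADD for the inserts and SDIFF for the difference. All function names and domains here are hypothetical.

```javascript
// In-memory stand-ins for the two Redis sets.
const startedDomains = new Set();
const finishedDomains = new Set();

// Called when a worker picks up a file.
function markStarted(domain) {
  startedDomains.add(domain);
}

// Called when the result for a file has been saved.
function markFinished(domain) {
  finishedDomains.add(domain);
}

// Domains that were picked up but never produced a result.
function pendingDomains() {
  return Array.from(startedDomains).filter(
    (domain) => !finishedDomains.has(domain)
  );
}
```

A growing list from pendingDomains() is the warning sign described in the text, and the same list doubles as the retry queue.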

          Hardware setup<\/h5>\n

          For our hardware setup, we started by running 10 threads * 1 Chrome instance each on 10 Amazon c4.large<\/a> machines, served by one Apache webserver running on a m3.medium<\/a> initially doing a very lousy job. After toying with Apache’s settings, we scaled everything up gradually and got to 40 c4.large machines being served by Apache webservers running on 4 m3.medium machines behind a load balancer. Our Redis instance was serving all 10 threads * 40 machines * 3-4 requests per 5-20 seconds off a r3.large machine. So, that’s about 60-320 requests per second<\/strong>.<\/p>\n
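The request-rate estimate follows directly from those figures. Here it is spelled out as a quick calculation; requestsPerSecond is a hypothetical helper name.

```javascript
// Requests per second hitting Redis for a given load profile.
function requestsPerSecond(workers, requestsPerFile, secondsPerFile) {
  return (workers * requestsPerFile) / secondsPerFile;
}

const workerCount = 10 * 40; // 10 threads on each of 40 machines

// Slowest case: 3 requests per file, 20 seconds per file.
console.log(requestsPerSecond(workerCount, 3, 20)); // prints 60

// Fastest case: 4 requests per file, 5 seconds per file.
console.log(requestsPerSecond(workerCount, 4, 5)); // prints 320
```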

          On costs, it’s pretty hard to give a total amount of money spent or CPU-time, since we ran into many issues before having a fully functional and stable ecosystem. Ideally, a single machine would need about 45 seconds for processing 100 files: downloading, unzipping, rendering and cleaning up.<\/p>\n

          Q&A \/ Follow up<\/h3>\n

          Why so many tbody elements?<\/h4>\n

          For the above new data, we did perform another full scan for the 8 million documents and also fixed a parsing sanitization issue where the jsoup<\/a> parser was adding the tbody<\/code> element automatically for all the tables. This is the answer to the question asked by some of you in the comments: “Why so many tbody elements?”.<\/p>\n

          As a consequence, the number of elements used on the most pages is now 25<\/a>, tbody<\/code> stats being now lessened.<\/p>\n

          body<\/code> at 99%?<\/h4>\n

          A little refresher: according to the specifications<\/a>, omitting the body<\/code> is fine: Start tag: optional, End tag: optional.<\/code><\/p>\n

          So, based on your comments, one of the most surprising numbers was the missing 1% of body<\/code> elements. I guess I owe you an answer, so I went a bit further and ran the parser again to get some insights:<\/p>\n