Let’s Talk About Speech CSS

Eric Bailey

Boston, like many large cities, has a subway system. Commuters on it are accustomed to hearing regular public address announcements.

Riders simply tune out some announcements, such as the pre-recorded station stop names repeated over and over. Or public service announcements from local politicians and celebrities—again, kind of repetitive and not worth paying attention to after the first time. Most important are service alerts, which typically deliver direct and immediate information riders need to take action on.

An informal priority

A regular rider’s ear gets trained to listen for important announcements, passively, while fiddling around on a phone or zoning out after a hard day of work. It’s not a perfect system—occasionally I’ll find myself trapped on a train that’s been pressed into express service.

But we shouldn’t remove lower-priority announcements. It’s impossible to know in advance what kind of information will be important to whom: tourists, new residents, or visiting friends and family, to name a few.

A little thought experiment: Could this priority be more formalized via sound design? The idea would be to use different voices consistently or to prefix certain announcements with specific tones. I’ve noticed an emergent behavior from the train operators that kind of mimics this: Sometimes they’ll use a short blast of radio static to get riders’ attention before an announcement.

Opportunities

I’ve been wondering if this kind of thinking can be extended to craft better web experiences for everyone. After all, sound is enjoying a renaissance on the web: the Web Audio API has great support, and most major operating systems now ship with built-in narration tools. Digital assistants such as Siri are near-ubiquitous, and podcasts and audiobooks are a normal part of people’s media diet.

Deep in CSS’s (ahem) labyrinthine documentation are references to two Media Types that speak to the problem: aural and speech. The core idea is pretty simple: audio-oriented CSS tells a digitized voice how it should read content, the same way regular CSS tells the browser how to visually display content. Of the two, aural has been deprecated. Detecting the speech Media Type is also tricky, as a screen reader may not communicate its presence to the browser.
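In theory, targeting audio output looks like any other media-specific CSS. Here’s a minimal sketch; the selector and rule are purely illustrative, and because screen readers rarely identify themselves to the browser, this query may never actually match:

@media speech {
  /* Hypothetical rule: hide purely decorative flourishes from audio output */
  .decorative-flourish {
    display: none;
  }
}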

The CSS 3 Speech Module, the evolved version of the aural Media Type, looks the most promising. Like display: none;, it is part of the small subset of CSS that has an impact on screen reader behavior. It uses traditional CSS property/value pairings alongside existing declarations to create an audio experience that has parity with the visual box model.

code {
  background-color: #292a2b;
  color: #e6e6e6;
  font-family: monospace;
  speak: literal-punctuation; /* Reads all punctuation out loud in iOS VoiceOver */
}

Just because you can, doesn’t mean you should

In his book Building Accessible Websites, published in 2003, author/journalist/accessibility researcher Joe Clark outlines some solid reasons for never altering the way spoken audio is generated. Of note:

Support

Many browsers don’t honor the technology, so writing the code would be a waste of effort. Simple and direct.

Proficiency

Clark argues that developers shouldn’t mess with the way spoken content is read because they lack training to “craft computer voices, position them in three-dimensional space, and specify background music and tones for special components.”

This may be the case for some, but ours is an industry of polymaths. I’ve known plenty of engineers who develop enterprise-scale technical architecture by day and compose music by night. There’s also the fact that we’ve kind of already done it.

The point he’s driving at—crafting an all-consuming audio experience is an impossible task—is true. But the situation has changed. An entire audio universe doesn’t need to be created from whole cloth any more. Digitized voices are now supplied by most operating systems, and the number of inexpensive/free stock sound and sound editing resources is near-overwhelming.

Appropriateness

For users of screen readers, the reader’s voice is their interface for the web. As a result, users can be very passionate about their screen reader’s voice. In this light, Clark argues for not changing how a screen reader sounds out content that is not in its dictionary.

Screen readers have highly considered defaults for handling digital interfaces, and probably tackle content types many developers would not even think to consider. For example, certain screen readers use specialized sound cues to signal behaviors; NVDA uses a series of tones to communicate an active progress bar.

Altering screen reader behavior effectively alters the user’s expected experience. Sudden, unannounced changes can be highly disorienting and can be met with fear, anger, and confusion.

A good parallel would be if developers were to change how a mouse scrolls and clicks on a per-website basis. This type of unpredictability is not a case of annoying someone, it’s a case of inadvertently rendering content more difficult to understand or changing default operation to something unfamiliar.

My voice is not your voice

A screen reader’s voice is typically tied to the region and language preference set in the operating system.

For example, iOS contains a setting not just for English, but for variations that include United Kingdom, Ireland, Singapore, New Zealand, and five others. A user picking UK English will, among other things, find their Invert Colors feature renamed to “Invert Colours.”

However, a user’s language preference setting may not be their primary language, the language of their country of origin, or the language of the country they’re currently living in. My favorite example is my American friend who set the voice on his iPhone to UK English to make Siri sound more like a butler.

UK English is also an excellent reminder that regional differences are a point of consideration, y’all.

Another consideration is biological and environmental hearing loss. It can manifest across a spectrum of severity, so the voice-balance property has the potential to “move” the voice outside of someone’s audible range.

Also, the speed at which the voice reads out content may be too fast for some or too slow for others. Experienced screen reader operators often speed up the rate of speech, much as some users quickly scroll a page to locate the information they need. A user new to screen readers, or one reading about an unfamiliar topic, may prefer a slower speaking rate to keep from getting overwhelmed.
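For reference, this is roughly what those properties look like in the CSS 3 Speech Module draft. Consider it a sketch only: the values are illustrative, no mainstream screen reader honors them at the time of writing, and that is part of the problem, since a developer-chosen balance or rate can’t know a listener’s hearing or preferred speed.

blockquote {
  voice-balance: left; /* Pans the voice toward the left channel, which could move it outside someone's audible range */
  voice-rate: x-slow; /* Forces a reading speed an experienced screen reader user may find frustratingly slow */
}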

And yet

Clark admits that some of his objections exist only in the realm of the academic. He cites the case of a technologically savvy blind user who uses the power of CSS’ interoperability to make his reading experience pleasant.

According to my (passable) research skills, not much work has been done in asking screen reader users their preferences for this sort of technology in the fourteen years since the book was published. It’s also important to remember that screen reader users aren’t necessarily blind, nor are they necessarily technologically illiterate.

The idea here would be to treat CSS audio manipulation as something a user can opt into, either globally or on a per-site basis. Think web extensions like Greasemonkey/Tampermonkey, or when websites ask permission to send notifications. It could be as simple as the kinds of preference toggles users are already used to interacting with:

A fake screenshot simulating a preference in NVDA that would allow the user to enable or disable CSS audio manipulation.

There is already a precedent for this. Accessibility Engineer Léonie Watson notes that JAWS—another popular screen reader—“has a built in sound scheme that enables different voices for different parts of web pages. This suggests that perhaps there is some interest in enlivening the audio experience for screen reader users.”

An opt-in model also suggests features such as whitelists to prevent potential abuses of CSS-manipulated speech. For example, a user could allow only certain sites to read CSS-manipulated content, or block things like unsavory ad networks that use less-than-scrupulous practices to get attention.

Opinions: I have a few

In certain situations a screen reader can’t know the context of content but can accept a human-authored suggestion on how to correctly parse it. For example, James Craig’s 2011 WWDC video outlines using speak-as values to make street names and code read accurately (starts at the 15:36 mark, requires Safari to view).

In programming, every symbol counts, and being able to confidently state the relationship between things in code is foundational. The case of thisOne != thisOtherOne being read as “this one is equal to this other one” when the intent was “this one is not equal to this other one” is an especially compelling concern.

Off the top of my head, other examples where this kind of audio manipulation would be desirable are:

  • Ensuring names are pronounced properly.
  • Muting pronunciation of icons (especially icons made with web fonts) in situations where the developer can’t edit the HTML.
  • Using sound effect cues for interactive components that the screen reader lacks built-in behaviors for (see the sketch after this list).
  • Creating a cloud-synced service that stores a user’s personal collection of voice preferences and pronunciation substitutions.
  • Setting a companion voice to read specialized content such as transcribed interviews or code examples.
  • Emoting. Until we get something approaching EmotionML support, this could be a good way to approximate the emotive intent of the author (No, emoji don’t count).
  • Spicing things up. If a user can’t view a website’s art direction, their experience relies on the skill of the writer or sound editor—on the internet this can sometimes leave a lot to be desired.
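For the sound effect cue idea mentioned above, the CSS 3 Speech Module defines cue-before and cue-after properties for attaching auditory icons to elements. Here is a hedged sketch, assuming a custom disclosure widget and a hypothetical sound file; screen readers don’t currently support these properties:

.accordion-trigger[aria-expanded="true"] {
  cue-before: url("sounds/expand.wav"); /* Hypothetical auditory icon played before the content is read */
}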

The reality of the situation

The CSS Speech Module document was last modified in March 2012. VoiceOver on iOS implements support using the following speak-as values for the speak property, as shown in this demo by accessibility consultant Paul J. Adam:

  • normal
  • digits
  • literal-punctuation
  • spell-out

Apparently, the iOS accessibility features Speak Selection and Speak Screen currently do not honor these properties.

Despite the fact that the CSS 3 Speech Module has yet to be ratified (and is therefore still subject to change), VoiceOver support signals that a de facto standard has been established. The popularity of iOS (millions of devices, 76% of which run the latest version) makes implementation worth considering. For those who would benefit from the clarity provided by these declarations, it could potentially make a big difference.

Be inclusive, be explicit

Play to CSS’ strengths and make small, surgical tweaks to website content to enhance the overall user experience, regardless of device or usage context. Start with semantic markup and a Progressive Enhancement mindset. Don’t override pre-existing audio cues for the sake of vanity. Use iOS-supported speak-as values to provide clarity where VoiceOver’s defaults need an informed suggestion.

Writing small utility classes and applying them to semantically neutral span tags wrapped around troublesome content would be a good approach. Here’s a recording of VoiceOver reading this CodePen to demonstrate.
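A minimal sketch of what such utility classes might look like, using the speak values VoiceOver on iOS honors (the class names are illustrative, not taken from the CodePen):

.speak-digits {
  speak: digits; /* "1004" is read as "one zero zero four" instead of "one thousand four" */
}

.speak-punctuation {
  speak: literal-punctuation; /* Announces each punctuation character, so operators like != aren't glossed over */
}

.speak-spell-out {
  speak: spell-out; /* Spells content out letter by letter */
}

Wrapping the troublesome content is then as simple as <span class="speak-digits">1004</span>.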

Take care to test extensively to make sure these tweaks don’t impair other screen reading software. If you’re not already testing with screen readers, there’s no time like the present to get started!

Unfortunately, current support for CSS speech is limited. But learning what it can and can’t do, and the situations in which it could be used, is still vital for developers. Thoughtful and well-considered application of CSS is a key part of creating robust interfaces for all users, regardless of their ability or circumstance.