Many sites with lots of written content employ specially crafted print style sheets. That way, a user can print out the relevant content without wasting paper on navigation, ads, or anything else not germane.
Articulate.js, a jQuery plugin, is what I consider the narrative equivalent. With as little as one line of code, it enables developers to create links that allow users to click, sit back, and listen to the browser read aloud the important content of a web page. In some ways, it can turn a thoughtful essay or article into a mini podcast. And because it uses built-in JavaScript functionality, no browser extensions or other system software is needed.
I thought I would share how I went about creating Articulate.js in the hope that it gives readers ideas on how to apply this technology in different ways.
The Speech Synthesis Interface
Articulate.js uses the Speech Synthesis interface of the Web Speech API. It is currently supported in all the major browsers, including the latest versions of Edge, Safari, Chrome, Opera, Firefox, iOS Safari, and Chrome for Android.
There are two window objects of the Speech Synthesis interface that are used to enable the browser to speak: SpeechSynthesis and SpeechSynthesisUtterance. The first step is to create an instance of the SpeechSynthesisUtterance object and designate the text you would like spoken. If desired, you can set additional properties such as the rate, pitch, volume, and voice.
To begin speaking, this object is passed as a parameter to the SpeechSynthesis.speak() method. Other playback functionality, such as pausing, resuming, and cancelling, are all methods of the SpeechSynthesis object. A useful demo from Microsoft lets you play around with these features.
At the end of this article, there are many resources listed that walk you through the intricacies of this functionality and provide additional examples.
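As a rough illustration of the two objects described above, here is a minimal sketch that uses the Web Speech API directly, without Articulate.js. The helper name `speakText` and its `options` object are hypothetical; the speech engine and the utterance constructor are passed in as parameters (defaulting to the browser globals) only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper, not part of Articulate.js: speaks a string with the
// browser's Speech Synthesis interface.
function speakText(
  text,
  options = {},
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  // Step 1: create the utterance and designate the text to be spoken.
  const utterance = new Utterance(text);

  // Optional properties; anything left unset keeps the browser default.
  if (options.rate !== undefined) utterance.rate = options.rate;
  if (options.pitch !== undefined) utterance.pitch = options.pitch;
  if (options.volume !== undefined) utterance.volume = options.volume;
  if (options.voice !== undefined) utterance.voice = options.voice;

  // Step 2: hand the utterance to the speech engine.
  synth.speak(utterance);
  return utterance;
}

// In a browser:
//   speakText("Hello there.", { rate: 1.1 });
//   speechSynthesis.pause();   // pause playback
//   speechSynthesis.resume();  // continue after a pause
//   speechSynthesis.cancel();  // stop and clear the queue
```

Note that pausing, resuming, and cancelling are called on the global `speechSynthesis` object rather than on the utterance itself.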
Getting Started
The lightweight Articulate.js plugin allows you to leverage the powerful selector options of jQuery to specify which parts of the website to speak. For example, depending on how the page is organized, a single line of code, like the following, can direct the browser to speak the entire contents of an article or blog post:
$('article').articulate('speak');
Here’s an example that targets only the primary headers and paragraphs:
$('h1, h2, p').articulate('speak');
Internally, Articulate.js clones the matched set of elements and all their descendant elements and text nodes. It then parses this clone using a default set of rules, deciding what should be spoken and ignored, then adding the appropriate pauses to make everything sound more like a narrative.
These are the basic methods along with a CodePen example:
Function | Description |
---|---|
$(selector).articulate('speak'); | Speaks aloud the specified DOM element(s) and their descendants |
$().articulate('pause'); | Pauses the speaking |
$().articulate('resume'); | Resumes the speaking after it has been paused |
$().articulate('stop'); | Stops the speaking permanently |
See the Pen Articulate: Basic Functions by Adam Coti (@adamcoti) on CodePen.
You can only have one SpeechSynthesisUtterance instance active at a time, which is why a jQuery selector is not needed for pausing, resuming, or stopping. As mentioned before, these methods act upon the SpeechSynthesis object.
Also, the browser will only stop speaking when there’s no more text to be read or when a “stop” call is executed. If the speaking is paused, it must be resumed or stopped before anything else can be spoken.
The rate, pitch, and volume can be adjusted as well, and wiring them to an input slider gives the user some added control. While the system default rate is 1, after much testing, I bumped it up slightly to 1.1 as that seemed to provide a more natural speaking speed. That’s subjective, of course, and can be overridden.
Function | Description |
---|---|
$().articulate('rate',num); | Sets the rate of the speaking voice. Default = 1.1, Range = [0.1 – 10] |
$().articulate('pitch',num); | Sets the pitch of the speaking voice. Default = 1.0, Range = [0 – 2] |
$().articulate('volume',num); | Sets the volume of the speaking voice. Default = 1.0, Range = [0 – 1] |
See the Pen Articulate : Voice Parameters by Adam Coti (@adamcoti) on CodePen.
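When these values come from a slider, they arrive as strings and may fall outside the ranges in the table. The sketch below shows one way to normalize them before passing them on. The names `VOICE_SETTINGS` and `normalizeSetting` are hypothetical, and clamping out-of-range input is an assumption of this example; the article does not describe how the plugin itself treats such values.

```javascript
// Ranges and defaults copied from the table above.
const VOICE_SETTINGS = {
  rate: { min: 0.1, max: 10, fallback: 1.1 },
  pitch: { min: 0, max: 2, fallback: 1.0 },
  volume: { min: 0, max: 1, fallback: 1.0 },
};

// Hypothetical helper: converts slider input to a number inside the allowed
// range, falling back to the default when the input is not numeric.
function normalizeSetting(name, value) {
  const setting = VOICE_SETTINGS[name];
  if (!setting) throw new RangeError(`Unknown voice setting: ${name}`);

  const num = Number(value);
  if (!Number.isFinite(num)) return setting.fallback;

  return Math.min(setting.max, Math.max(setting.min, num));
}

// In a page that loads jQuery and Articulate.js, a slider handler might do:
//   $().articulate('rate', normalizeSetting('rate', slider.value));
```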
There are many more options available to the developer, but let’s talk for a moment about what happens under the hood when Articulate.js is asked to speak something on the page.
The Articulate.js Algorithm
The Speech Synthesis interface that Articulate.js leverages will read aloud, in a most literal fashion, any string of text it’s provided. Some symbols it will enunciate (e.g. it will say “percent” when it encounters “%”); others, it will ignore (e.g. the quote symbol is left unspoken). Its cadence is dictated primarily by commas, which elicit a small pause, and periods, whose pause is slightly longer.
With that in mind, quite a bit of manipulation is needed to prepare a web page for speaking. Unfortunately, one simply can’t concatenate all the selected text nodes in the DOM as that would result in a lot of run-on text (e.g. lists), include content that isn’t appropriate for reading aloud in a coherent fashion (e.g. tables), and ignore items that should be described (e.g. images).
Articulate.js handles this by applying, among others, the following rules to the DOM elements specified in the jQuery selector:
- Delete HTML tags that may contain text nodes, but shouldn’t be spoken, such as <form> and <s>. A list of 21 tags is designated to be ignored by default.
- Find instances of <h1> through <h6>, <li>, and <br> tags and append each with either a period or comma. This is to ensure that a pause occurs when spoken since these elements are often visually represented without punctuation.
- Insert descriptive text gathered from the alt attributes of images, <caption> tags from tables, and <figcaption> tags from figures.
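To make these rules concrete, here is a simplified sketch that applies similar transformations to a string of HTML. It is not the plugin's code: Articulate.js works on a jQuery clone of the DOM, while this example uses regular expressions so that it stays self-contained. The function name `prepareMarkup` is hypothetical, only the two ignored tags named above are included, and the choice of a period after headings and a comma after list items and line breaks is an assumption.

```javascript
// Only the two tags named in the article; the plugin's default list has 21.
const IGNORED_TAGS = ["form", "s"];

// Hypothetical helper: a regex approximation of the DOM-level rules.
// It does not handle nested elements of the same ignored tag.
function prepareMarkup(html, ignoredTags = IGNORED_TAGS) {
  let out = html;

  // Rule 1: drop ignored elements together with their contents.
  for (const tag of ignoredTags) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, "gi");
    out = out.replace(pattern, "");
  }

  // Rule 2: add punctuation so that the voice pauses.
  out = out.replace(/<\/h[1-6]>/gi, ".$&"); // period at the end of a heading
  out = out.replace(/<\/li>/gi, ",$&"); // comma at the end of a list item
  out = out.replace(/<br\s*\/?>/gi, ",$&"); // comma before a line break

  // Rule 3: replace images with their alt text (double-quoted attributes only).
  out = out.replace(/<img\b[^>]*\balt="([^"]*)"[^>]*>/gi, " $1. ");

  return out;
}
```

Text inside <caption> and <figcaption> needs no extra step in this simplified version because it survives as ordinary text once the remaining tags are stripped.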
After this is completed, what’s left is converted to a long text string that now requires further manipulation, including:
- Find <q> tags and pairs of smart quotes and insert the text “quote” at the start and “unquote” at the end to distinguish them when spoken.
- Add starting and ending text to designate lists and block quotes.
- Find em dashes and replace each with a comma to elicit a short pause.
- Remove remaining HTML tags and comments.
- Remove remaining line breaks and carriage returns as well as lingering HTML special characters.
At this point, the string of text is ready to be sent to the Speech Synthesis interface to be spoken by the browser. If you were to look at this string, you would see instances of multiple periods, commas, and spaces — that’s OK — as it won’t affect how it sounds. That is, one or more commas or periods grouped together won’t create even longer pauses.
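The string-level steps can be sketched in the same spirit. The function name `prepareSpeechText` is hypothetical and this is an approximation rather than the plugin's implementation: opening and closing smart quotes are replaced individually instead of being matched as pairs, HTML entities are simply dropped, and the start and end text for lists and block quotes is omitted.

```javascript
// Hypothetical helper: turns prepared markup into a plain string for speech.
function prepareSpeechText(html) {
  return html
    // <q> tags and smart quotes become spoken markers.
    .replace(/<q\b[^>]*>/gi, " quote ")
    .replace(/<\/q>/gi, " unquote ")
    .replace(/\u201C/g, " quote ")
    .replace(/\u201D/g, " unquote ")
    // Em dashes become commas to elicit a short pause.
    .replace(/\u2014/g, ", ")
    // Remove comments first, then any remaining tags.
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<[^>]+>/g, " ")
    // Remove line breaks, carriage returns, and leftover entities.
    .replace(/[\r\n]+/g, " ")
    .replace(/&[a-z]+;|&#\d+;/gi, " ")
    // Tidy whitespace. As noted above, this is cosmetic and does not
    // change how the text sounds.
    .replace(/\s+/g, " ")
    .trim();
}
```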
Customization
By using JavaScript and HTML data attributes, Articulate.js can be customized to optimize the user experience. As the following CodePens demonstrate, you can:
- Specify HTML tags to be spoken that would otherwise be ignored, and vice versa.
- Perform a search and replace within the text, which is helpful for abbreviations. For example, you can specify that all instances of “i.e.” be spoken as “that is”.
- Specify blocks of text to be ignored. For example, a sentence that reads “click here for more information” does not need to be spoken.
- Specify words to be spelled out.
- Specify copy in specially crafted comment tags to be spoken that is otherwise hidden on the screen.
See the Pen Articulate: Text Manipulation by Adam Coti (@adamcoti) on CodePen.
See the Pen Articulate: HTML Data Attributes by Adam Coti (@adamcoti) on CodePen.
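The search-and-replace and spell-out ideas are easy to try on a plain string before anything is spoken. The sketch below is independent of Articulate.js; the function names and argument shapes are hypothetical and do not reflect the plugin's own API, which is covered in its documentation.

```javascript
// Escapes characters that have a special meaning in a regular expression,
// so that a search term such as "i.e." is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replaces each search term with its spoken form.
// `replacements` is an array of [search, spoken] pairs.
function applyReplacements(text, replacements) {
  let out = text;
  for (const [search, spoken] of replacements) {
    out = out.replace(new RegExp(escapeRegExp(search), "gi"), () => spoken);
  }
  return out;
}

// Hypothetical helper: separates the letters of the given words so that
// the voice spells them out instead of pronouncing them as a word.
function spellOut(text, words) {
  let out = text;
  for (const word of words) {
    const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, "g");
    out = out.replace(pattern, () => word.split("").join(" "));
  }
  return out;
}

// Example:
//   applyReplacements("Vowels, i.e. a and e.", [["i.e.", "that is"]]);
//   spellOut("Read the HTML spec", ["HTML"]);
```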
Browser Consistency
You will notice that the Speech Synthesis interface is subtly different across browsers and operating systems. For example, the default rate of speech will sound somewhat faster on an iPhone as opposed to its desktop implementations. Developers can provide input sliders or radio buttons for users to fine-tune their experience.
In addition, depending on the operating system and device, browsers expose different voices to the Speech Synthesis interface. As seen in the demo from Microsoft mentioned earlier, these voices can be selected to override the default “native” voice. But, for simplicity’s sake, Articulate.js only uses the default voice; later versions will allow that parameter to be modified as well.
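For readers who want to experiment with voices through the Web Speech API directly, the sketch below picks a voice for a given language from the list the browser exposes. The helper name `pickVoice` and its fallback order (exact language tag, then same primary language, then the default voice) are assumptions of this example, not something Articulate.js does.

```javascript
// Normalizes a language tag such as "en_US" or "en-US" to "en-us".
function normalizeLang(tag) {
  return tag.toLowerCase().replace(/_/g, "-");
}

// Hypothetical helper: chooses a voice from a list of objects shaped like
// SpeechSynthesisVoice (with `lang` and `default` properties).
function pickVoice(voices, lang) {
  const wanted = normalizeLang(lang);
  const primary = wanted.split("-")[0];

  return (
    voices.find((v) => normalizeLang(v.lang) === wanted) ||
    voices.find((v) => normalizeLang(v.lang).split("-")[0] === primary) ||
    voices.find((v) => v.default) ||
    null
  );
}

// In a browser (the list may be empty until the "voiceschanged" event fires):
//   const utterance = new SpeechSynthesisUtterance("Guten Tag");
//   const voice = pickVoice(speechSynthesis.getVoices(), "de-DE");
//   if (voice) utterance.voice = voice;
//   speechSynthesis.speak(utterance);
```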
Some Final Thoughts
The inspiration for Articulate.js came from the idea that with a simple click, I can enjoy having articles read to me when it’s not convenient or desirable to be staring at a screen — particularly when using my phone. Maybe when lying in the park with my eyes closed or while I’m preoccupied with preparing dinner. The goal was to allow developers to make the appropriate customizations so that it sounds less like a screen reader and more like a friend is reading the web page to you.
Articulate.js can be used as a voice option for anything on the web page, from enunciating a single word to conveying content not displayed on the screen. If you’re interested, download the source code and experiment. And, most importantly, have fun with it!
Download and Documentation
The commented source code and minified versions of Articulate.js can be downloaded at its GitHub home. Full documentation can be found there as well.
This is super rad; I’ve wanted something like it for a long time!
I’m curious about ignoring <code>, though; it would make this very article hard to understand, wouldn’t it?
Glad you like it!
With <code>, it all depends on the context. Remember, you can override the defaults on a case-by-case basis by placing data-articulate-recognize in the <code> tag.
We have been looking into this API to use on our blog. We have found that the speech, though ‘OK’, can be a little hard to continue listening to due to the digital sound that the browsers / OS manufacturers haven’t been able to get around.
I was wondering if anyone else has had this experience?
I’ve noticed that the digital sound varies depending on the browser and/or OS. For example, on the Mac desktop with the latest OS, I can hear the digital voice “inhale” before a sentence. That certainly added to it sounding more human.
Podcasts can do a few things on my phone:
1) Jump back or forward 10 or 15 seconds
2) Pause when taking out the headphones
3) Keep playing after clicking the phone off
When I play a video (YouTube for example), the sound stops when I click the phone off. When I listen to music (not a “podcast”) I can’t “jump” around. These small touches add up and make it harder than it should be to easily listen to webpages. I’m not sure it would be possible to solve these issues via JS, so I suppose it’s a mobile browser issue?
Wow, this is such a cool idea!
I can think of so many reasons why this could be useful on many sites. And to have it unobtrusively as an option on each page could potentially be a mechanism for increasing user engagement or even conversion.
Hi there,
excellent job, I certainly can use this, thank you so much for sharing.
A few questions:
1. Can the speech API automatically detect the language?
2. For a given language, can I select from different voices?
3. What happens when, e.g., foreign location (or any other) names are present? For example, how would “Ludwig Wittgenstein” be pronounced in English?
Thanks again for this great work!
-Franco
Thanks for the kind words. To address your questions the best I can:
1. It doesn’t detect languages, but from my tests, the default voice will match the default language of the OS. When I changed the language on my MacBook to German, the SpeechSynthesizer default voice changed to the one that spoke German.
2. That will depend on the browser and OS. I would look at some of the resources listed at the end of the article for more info. Unfortunately, there’s little consistency among browsers, particularly with Windows.
3. From what I can tell, it would likely pronounce it using the English grammar “rules” that are default to that language. For example, the German word “die” would be pronounced “dye” with the English voice; it would be properly pronounced “dee” with the German voice.
Thank you, I’ll do some testing and let you know.
-Franco
It seems you are right, but that’s horrible. I wanted to try the first demo and my system started to read the text with a strong Italian accent.
After a couple of sentences you can’t stand it anymore. :D
I hope there will be a way to specify the language so that users get a better experience if the OS language doesn’t match the page language.
How difficult would it be to modify this to highlight each word on screen as it is being spoken by the system?
Thinking of using this for elearning with struggling readers.