We have all written essays during our school days. There was always a minimum word count – 250, 500 words – that we needed to meet before submitting. Before I had a computer at home I used to count words manually. Once my parents got me one, discovering Microsoft Word was like magic. I didn’t know programming then, so I was amazed at everything that MS Word did.
One of the features that I really liked was called “Word Count”. You could select any piece of text and click on Tools > Word Count to see some interesting statistics such as number of characters, words, and paragraphs about your selected piece of text. I loved it then and I’m even using it right now as I am drafting this article inside Google Sheets. I figured I’d take a crack at writing one myself.
The app that we are making today will calculate:
- Number of characters, words, and sentences in a piece of text
- Top keywords used
- Readability score – how difficult is it to comprehend the passage.
Before we begin, you can see what we are making here:
See the Pen Word Counter by Vikas Lalwani (@lalwanivikas) on CodePen.
Now, let’s get started!
All the counting in the app relies heavily on regular expressions (referred to as “regex” in the rest of the article). If you are not familiar with regex, check out this beginner article on regular expressions in JavaScript.
Page Setup
First and foremost we need something to take user input. What better way to handle this than a textarea HTML element?
<textarea placeholder="Enter your text here..."></textarea>
We can select the above textarea using this piece of JavaScript:
var input = document.querySelectorAll('textarea')[0];
We can access the input string through input.value. Since we want to display the stats as the user types, we need to perform our logic on keyup. This is what the skeleton of our core logic looks like:
input.addEventListener('keyup', function() {
  // word counter logic
  // sentence count logic
  // reading time calculation
  // keyword finding logic
});
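A possible variation on this skeleton, which is not part of the original app: keyup does not fire when text is pasted with the mouse, and running every calculation on each keystroke can get expensive for long passages. The sketch below listens for the input event instead and delays the work until typing pauses. The debounce helper, the field variable, and the 300 ms delay are all assumptions made for illustration.

```javascript
// Hypothetical helper (not in the original app): postpones calls to `fn`
// until `wait` milliseconds have passed without another call.
function debounce(fn, wait) {
  var timer = null;
  return function () {
    var context = this;
    var args = arguments;
    clearTimeout(timer);
    timer = setTimeout(function () {
      fn.apply(context, args);
    }, wait);
  };
}

// Browser wiring, guarded so the snippet also loads outside a browser.
if (typeof document !== 'undefined') {
  var field = document.querySelectorAll('textarea')[0];
  if (field) {
    // 'input' fires for typing, pasting, and cutting alike
    field.addEventListener('input', debounce(function () {
      // word, sentence, and keyword logic would go here
    }, 300));
  }
}
```

The trade-off is a short delay before the stats refresh, in exchange for fewer recalculations.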
The output will be stored in simple HTML div elements, so nothing fancy here:
<div class="output row">
  <div>Characters: <span id="characterCount">0</span></div>
  <div>Words: <span id="wordCount">0</span></div>
</div>
<!-- more similar divs for other stats -->
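To make the connection between the textarea and these output elements concrete, here is a minimal sketch of how the two counts might be computed and written into the spans. The function names countBasics and renderBasics are hypothetical. The ids come from the markup above, and the word pattern is the one discussed in Part 1.

```javascript
// Hypothetical helper: computes the two numbers shown in the first output row.
function countBasics(text) {
  // match() returns null when nothing matches, so fall back to an empty array
  var words = text.match(/\b[-?(\w+)?]+\b/gi) || [];
  return { characters: text.length, words: words.length };
}

// Hypothetical helper: writes the numbers into the spans from the markup above.
function renderBasics(stats) {
  if (typeof document === 'undefined') {
    return; // not running in a browser
  }
  var characterCount = document.getElementById('characterCount');
  var wordCount = document.getElementById('wordCount');
  if (characterCount) { characterCount.textContent = stats.characters; }
  if (wordCount) { wordCount.textContent = stats.words; }
}

renderBasics(countBasics("Hello front-end world"));
```

The null fallback matters for an empty textarea, where calling .length on the result of match() would otherwise throw.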
Part 1: Counting Characters and Words
With the basic setup out of the way, let’s explore how to count words and sentences. One of the best ways to do it is to use regex.
I am going to walk you through the regex patterns for word and sentence counting only. Once you understand those, you can figure out the rest on your own by looking at the source code.
We need to look for two things to find words in our input string:
- word boundaries
- valid word characters
If we are able to locate these, then we will have our list of words. One more thing we can do to increase accuracy is to look for hyphens (-). That way words with hyphens (e.g., CSS-Tricks) will be counted as one word instead of two or more.
var words = input.value.match(/\b[-?(\w+)?]+\b/gi);
In the above pattern:
- \b matches word boundaries i.e. starting or ending of a word
- \w+ will match word characters. + makes it match one or more characters
- -? will match hyphens, ? at the end makes it optional. This is a special case for counting words with hyphens as one word. For example, ‘long-term’ and ‘front-end’ will be counted as one word instead of two
- + at the end of the pattern matches one or more occurrences of the whole pattern
- Finally, i makes it case insensitive, and g makes it do a global search instead of stopping at first match
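One detail worth checking by hand: inside square brackets, characters such as ?, (, ) and + lose their special meaning and are matched literally. The sketch below compares the pattern above with a shorter character class that allows only word characters and hyphens. The name simplerPattern is made up for this example, and the shorter pattern is an alternative to try, not the code the app ships with.

```javascript
var authorPattern = /\b[-?(\w+)?]+\b/gi;
// Alternative sketch: a class holding only hyphens and word characters
var simplerPattern = /\b[-\w]+\b/g;

function findWords(text, pattern) {
  // match() returns null when there is no match at all
  return text.match(pattern) || [];
}

// Both patterns keep hyphenated words together
var hyphenated = findWords("a long-term front-end plan", simplerPattern);

// They differ once punctuation from the class shows up inside a "word"
var odd = "hello+(-world?)";
var fromAuthor = findWords(odd, authorPattern);
var fromSimpler = findWords(odd, simplerPattern);
```

With the odd input, the original pattern treats the punctuation as part of a single word, while the shorter one finds hello and world separately.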
Next, let’s explore how to count sentences.
Sentences are relatively easy to handle because we just need to detect sentence separators and split at those. This is what the code for doing it looks like:
var sentences = input.value.split(/[.|!|?]/g);
In the above pattern, we are looking for three characters – ., !, and ? – in the input text since these three are used as sentence separators.
After the above operation, sentences will contain an array of all the sentences. But there is one interesting case that we need to take care of before we count: what if someone enters “come back soon...”?
Based on the above logic, sentences will contain four entries – one correct sentence and three empty strings. But this is not what we want. One way to solve it is to change the above regex pattern to this:
var sentences = input.value.split(/[.|!|?]+/g);
Notice the + at the end of the pattern. It is there to take care of any instance where a user inputs consecutive sentence separators. With this modification, “come back soon...” will be counted as one sentence and not four.
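One more edge case shows up when you try this: split() still returns a trailing empty string when the text ends with a separator, so the array needs filtering before its length is used as the count. Here is a minimal sketch, with countSentences as a hypothetical name. It also drops the | characters from the class, since inside square brackets they would be matched literally.

```javascript
// Hypothetical helper: counts non-empty chunks between sentence separators.
function countSentences(text) {
  return text
    .split(/[.!?]+/)
    .filter(function (part) {
      // ignore empty or whitespace-only leftovers
      return part.trim().length > 0;
    })
    .length;
}

var one = countSentences("come back soon...");
var three = countSentences("Hi! How are you? Fine.");
```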
If you followed the explanation above for counting words and sentences, you can understand the rest of the logic yourself by looking at the source code. The important thing to keep in mind is not to forget edge cases, like words with hyphens and empty sentences.
Part 2: Finding Top Keywords
Once you start typing, you will notice that a new container will appear at the bottom of the page that displays top keywords from the text. This tells you what keywords you used most often which can help you prevent overusing certain words in your writing. That’s cool, but how do we calculate it?
To make it easier to digest, I have divided the process of calculating top keywords into the following four steps:
- Remove all the stop words
- Form an object with keywords and their count
- Sort the object by first converting it to a 2D array
- Display top 4 keywords and their count
Step 1) Remove All the Stop Words
Stop words refer to the most common words in a language and we need to filter them out before doing any kind of analysis on our piece of text. Unfortunately there is no universal list of stop words, but a simple search will give you many options. Just pick one and move on.
This is the code for filtering out stop words (explanation below):
// Step 1) removing all the stop words
var nonStopWords = [];
var stopWords = ["a", "able", "about", "above", ...];
for (var i = 0; i < words.length; i++) {
  // filtering out stop words and numbers
  if (stopWords.indexOf(words[i].toLowerCase()) === -1 && isNaN(words[i])) {
    nonStopWords.push(words[i].toLowerCase());
  }
}
The stopWords array contains all the words that we need to check against. We go over our words array and check whether each item exists in the stopWords array. If it does, we ignore it. If not, we add it to the nonStopWords array. We also ignore all the numbers (hence the isNaN condition).
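The same filtering can also be written with array methods. This is a sketch of an alternative style, not the app's code: removeStopWords and stopWordSet are hypothetical names, and the four-word list is shortened for illustration. A Set makes each lookup independent of the list's length.

```javascript
// Shortened, illustrative stop word list; a real list is much longer
var stopWordSet = new Set(["a", "able", "about", "above"]);

// Hypothetical helper: lowercases words, then drops stop words and numbers.
function removeStopWords(words) {
  return words
    .map(function (word) { return word.toLowerCase(); })
    .filter(function (word) {
      // isNaN("42") is false, so numeric strings are filtered out
      return !stopWordSet.has(word) && isNaN(word);
    });
}

var kept = removeStopWords(["About", "CSS", "42", "a", "tricks"]);
```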
Step 2) Form an Object with Keywords and Their Count
In this step, we will form an object where the keys are words and the values are their counts in the array. The logic for this is pretty straightforward. We create an empty object keywords and check if the word already exists in it. If it does, we increment the value by one; if not, we create a new key-value pair. Here is how it looks:
// Step 2) forming an object with keywords and their count
var keywords = {};
for (var i = 0; i < nonStopWords.length; i++) {
  // checking if the word (property) already exists on the object itself;
  // hasOwnProperty ignores inherited names such as "constructor"
  // if it does increment the count otherwise set it to one
  if (Object.prototype.hasOwnProperty.call(keywords, nonStopWords[i])) {
    keywords[nonStopWords[i]] += 1;
  } else {
    keywords[nonStopWords[i]] = 1;
  }
}
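Counting into a plain object needs some care with words that collide with inherited property names, such as “constructor”. As an alternative sketch (not what the app uses), a Map sidesteps that entirely because it has no inherited keys. The name countKeywords is hypothetical.

```javascript
// Hypothetical helper: tallies how often each word occurs, using a Map.
function countKeywords(wordList) {
  var counts = new Map();
  wordList.forEach(function (word) {
    // get() returns undefined for unseen words, so default to 0
    counts.set(word, (counts.get(word) || 0) + 1);
  });
  return counts;
}

var tally = countKeywords(["css", "tricks", "css", "constructor"]);
```

If the rest of the code expects a plain object, the Map can stay as it is and be converted to an array of pairs with Array.from(tally) when sorting.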
Step 3) Sort the Object by Converting It to a 2D Array
In this step, we will first convert the above object to a 2D array so that we can use JavaScript’s native sort method to sort it:
// Step 3) sorting the object by first converting it to a 2D array
var sortedKeywords = [];
for (var keyword in keywords) {
  sortedKeywords.push([keyword, keywords[keyword]]);
}
sortedKeywords.sort(function(a, b) {
  return b[1] - a[1];
});
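In newer JavaScript engines the conversion and the sort can be chained. The sketch below assumes Object.entries is available (it was added in ES2017), and the function name topKeywords and its limit parameter are hypothetical.

```javascript
// Hypothetical helper: returns the `limit` most frequent [word, count] pairs.
function topKeywords(keywords, limit) {
  return Object.entries(keywords)
    .sort(function (a, b) {
      return b[1] - a[1]; // highest count first
    })
    .slice(0, limit);
}

var top = topKeywords({ css: 3, regex: 5, app: 1, word: 2, text: 4 }, 4);
```

Because slice() never reads past the end of the array, the same call works when there are fewer keywords than the limit.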
Step 4) Display Top 4 Keywords and Their Count
The output of the above step was a 2D array named sortedKeywords. In this step we will display the first four elements of that array (or fewer if there are fewer than four keywords). For each item, the word will be at position 0 and its count will be at position 1.
We create a new list item for each entry and append it to the ul represented by topKeywords:
// Step 4) displaying top 4 keywords and their count
for (var i = 0; i < sortedKeywords.length && i < 4; i++) {
  var li = document.createElement('li');
  li.innerHTML = "" + sortedKeywords[i][0] + ": " + sortedKeywords[i][1];
  topKeywords.appendChild(li);
}
Now let’s move on to finding the readability score.
Part 3: Fetching Readability Score
To get your message across in a piece of writing, you want to use words that your audience can comprehend. Otherwise there is no point in writing. But how can you measure how difficult it is to understand a piece of text?
Enter readability score.
Folks more intelligent than us have developed many scales to measure the difficulty level of a reading passage. For our app, we are going to use the Flesch reading-ease test, which is one of the most commonly used tests for this purpose. This is the one that Microsoft Word relies on!
In the Flesch reading-ease test, scores range from zero to one hundred. Higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read. It is based on a slightly complex mathematical formula which we don’t need to worry about, because on the web there is an API for (almost) everything.
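For the curious, the published Flesch reading-ease formula is 206.835 − 1.015 × (words ÷ sentences) − 84.6 × (syllables ÷ words). The sketch below only does that arithmetic. The function name fleschReadingEase is hypothetical, the counts are passed in by the caller, and counting syllables reliably is the hard part that this sketch leaves out, so its results will not necessarily match what an API returns.

```javascript
// Hypothetical helper: applies the Flesch reading-ease formula to three counts.
// Returns null when the score is undefined (no words or no sentences).
function fleschReadingEase(wordCount, sentenceCount, syllableCount) {
  if (wordCount <= 0 || sentenceCount <= 0) {
    return null;
  }
  var wordsPerSentence = wordCount / sentenceCount;
  var syllablesPerWord = syllableCount / wordCount;
  return 206.835 - 1.015 * wordsPerSentence - 84.6 * syllablesPerWord;
}

// 100 words, 5 sentences, 150 syllables
var score = fleschReadingEase(100, 5, 150);
```

Longer sentences and longer words both pull the score down, which is why dense academic prose lands near the bottom of the scale.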
While there are many paid options for calculating readability scores, we are going to use the Readability Metrics API from Mashape, which is absolutely free to use. It gives some other readability scores as well if you are interested.
To use the API, we need to make a POST request to a designated URL with the text that we want to evaluate. The good thing about Mashape is that it allows you to consume its APIs directly from the browser.
I am going to use plain JavaScript to make this Ajax call. If you are planning to use jQuery, you will find this documentation page useful, or if you want to use a server-side language, please refer to the API home page for code samples.
Here is what the code for making the request looks like:
// readability level using readability-metrics API from Mashape
readability.addEventListener('click', function() {
  // placeholder until the API returns the score
  readability.innerHTML = "Fetching score...";
  var requestUrl = "https://ipeirotis-readability-metrics.p.mashape.com/getReadabilityMetrics?text=";
  var data = input.value;
  var request = new XMLHttpRequest();
  // encodeURIComponent also escapes characters such as & and #,
  // which would otherwise cut the text short in the query string
  request.open('POST', requestUrl + encodeURIComponent(data), true);
  request.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded; charset=UTF-8');
  request.setRequestHeader("X-Mashape-Authorization", "Your API key | Don't use mine!");
  request.onload = function() {
    if (this.status >= 200 && this.status < 400) {
      // Success!
      readability.innerHTML = readingEase(JSON.parse(this.response).FLESCH_READING);
    } else {
      // We reached our target server, but it returned an error
      readability.innerHTML = "Not available.";
    }
  };
  request.onerror = function() {
    // There was a connection error of some sort
    readability.innerHTML = "Not available.";
  };
  // send only after the handlers are attached
  request.send();
});
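Because the text travels in the query string, characters such as & and # in the user's input need escaping, or everything after them is read as something other than the text. The sketch below shows how the two built-in encoders differ. The helper name buildRequestUrl is hypothetical, and the URL is the one used in the request above.

```javascript
var baseUrl = "https://ipeirotis-readability-metrics.p.mashape.com/getReadabilityMetrics?text=";

// Hypothetical helper: appends the user's text as a safely escaped query value.
function buildRequestUrl(text) {
  return baseUrl + encodeURIComponent(text);
}

var sample = "Tom & Jerry: who wins? #cartoons";
// encodeURI keeps URL syntax characters (&, ?, #, :) untouched
var loose = encodeURI(sample);
// encodeURIComponent escapes them, so the whole string stays one value
var strict = encodeURIComponent(sample);
```

With encodeURI, the sample text would be cut off at the & when the server parses the query string, and the part after # would never leave the browser.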
Most of the above code consists of a standard Ajax call, but here are three important things to note:
- For using this API you need a free Mashape account. Once you have that, you will get a key which you can insert in the authorization header.
- The API returns many different scores and the one we need lies in the FLESCH_READING property of the response.
- The response will contain a number between 0 and 100 which by itself does not convey anything to a user. So we need to convert it into a format that makes sense, and this is what the readingEase function does. It takes that number as input and outputs a string based on this Wikipedia table.
function readingEase(num) {
  switch (true) {
    case (num <= 30):
      return "Readability: College graduate.";
    case (num > 30 && num <= 50):
      return "Readability: College level.";
    // more cases
    default:
      return "Not available.";
  }
}
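The same mapping can also be expressed as a lookup table, which keeps the thresholds in one place. This is only a sketch: readingEaseLookup and READING_BANDS are hypothetical names, the first two labels are the ones shown above, and the remaining bands and their wording are my assumption based on the commonly cited Flesch score ranges, not the strings the app actually uses.

```javascript
// Each band covers scores up to and including `max`.
var READING_BANDS = [
  { max: 30, label: "Readability: College graduate." },   // from the function above
  { max: 50, label: "Readability: College level." },      // from the function above
  { max: 60, label: "Readability: 10th to 12th grade." }, // assumed wording
  { max: 70, label: "Readability: 8th and 9th grade." },  // assumed wording
  { max: 80, label: "Readability: 7th grade." },          // assumed wording
  { max: 90, label: "Readability: 6th grade." },          // assumed wording
  { max: 100, label: "Readability: 5th grade." }          // assumed wording
];

// Hypothetical helper: returns the label of the first band the score fits in.
function readingEaseLookup(num) {
  if (typeof num !== 'number' || isNaN(num)) {
    return "Not available.";
  }
  for (var i = 0; i < READING_BANDS.length; i++) {
    if (num <= READING_BANDS[i].max) {
      return READING_BANDS[i].label;
    }
  }
  return "Not available."; // scores above 100
}
```

Adding or renaming a band then means editing one line of data instead of another case clause.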
That’s it! That was all the logic involved in making a word counter app. I did not explain every small detail because you can figure those out on your own by looking at the code. If you get stuck anywhere, feel free to check out the full source code on GitHub.
Further Development
Although the app looks nice and works great, there are a few ways in which you can improve it, like:
- Handling full stops: right now if you enter an email address, our app will count it as two different sentences. The same happens if you enter a domain name or abbreviations like ‘vs.’, ‘eg.’ etc. One way to deal with it is to filter these out using regex and a list of known abbreviations. But since there is no such list readily available, we will have to curate it manually. Please let me know in the comments if you know a better way.
- Handling more languages: Currently our app handles only English well. But what if you want to use it for German? Well, you will have to include German stop words. That’s it. The rest will remain the same. Best would be to have an option to choose a language from a dropdown, with the app working accordingly.
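As a starting point for the first idea, here is a rough sketch of sentence counting that only treats a separator as a sentence end when it closes a whitespace-delimited token, and skips tokens found in an abbreviation list. Everything here is an assumption made for illustration: the function name countSentencesCarefully, the tiny ABBREVIATIONS list, and the choice to count trailing text without a separator as a sentence.

```javascript
// Tiny, hand-curated list for illustration only
var ABBREVIATIONS = ["mr", "mrs", "dr", "vs", "eg", "etc"];

function countSentencesCarefully(text) {
  var tokens = text.split(/\s+/);
  var count = 0;
  var endedSentence = true; // did the most recent token close a sentence?
  for (var i = 0; i < tokens.length; i++) {
    if (tokens[i] === "") { continue; }
    // a token closes a sentence when it ends in one or more separators
    var match = tokens[i].match(/^(.*?)[.!?]+$/);
    var closes = match !== null &&
      match[1] !== "" &&
      ABBREVIATIONS.indexOf(match[1].toLowerCase()) === -1;
    if (closes) { count++; }
    endedSentence = closes;
  }
  // trailing text without a separator still counts as a sentence
  if (!endedSentence) { count++; }
  return count;
}
```

Dots inside email addresses and domain names are ignored because they sit in the middle of a token. A sentence that genuinely ends with an abbreviation, such as “etc.”, is still miscounted, so this narrows the problem without solving it.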
I hope you had fun following along with this tutorial. Feel free to leave a comment below if you have any questions. Or to just say hi! Always happy to help.
Out of curiosity, how would you prevent “Mr. Brooks is out of the office.” from counting as two sentences?
Sorry. I read the rest. You explain it at the end! Great post by the way!
Word counting failed. If the string is “foo ä bar”, there’s 3 words, but due to word boundary used, app shows only 2 words. This can be fixed by using a u flag in the regex or switching to different word counting logic.
Oh. There’s a paragraph about that… I’m sorry.
Nice article!
One edge case:
Pasting large content using mouse right-click command is not handled here. Shouldn’t be hard to add though.
Performance enhancement:
Doing such large chunk of work on keyup event handling may not be a good idea. Maybe we can use debouncing here. https://css-tricks.com/debouncing-throttling-explained-examples/
I would use the input event and debouncing.
To reach college graduate reading level in one sentence.
‘I find that being ambidextrous greatly facilitates negotiations with parliament.’
I think the examples could be improved by using array methods rather than loops.
This sort of code is clearer and more self-documenting – it’s more obviously a series of transformations/operations on the text. It also avoids the visual noise of for loops.
Agreed on this. Much more idiomatic JS.
Just like a dictionary of stop words, you should use a dictionary of abbreviations to distinguish them from dots used at the end of a sentence. Dots in domain names, usernames, etc can be easily caught, since there is no space after them.
There is a small mistake in the regex to find words /\b[-?(\w+)?]+\b/gi. It should be ( instead of [.
The current regex matches words containing any characters ([ ]) among these characters -?(\w+)? one or more times + (\w meaning any word characters, like you said). This means that hello+(-world?) will be recognized as one word.
Used RegExps in this article are a bit “weird”.
For instance, using /[.|!|?]/g to split means counting | char too.
When you use grouping squared brackets, you don’t need to use | because any char in them will be counted as such, including parentheses, dots or |. Using g to split a string is also wrong because split already matches all occurrences of that RegExp in the string.
Now, let’s talk about value.match(/\b[-?(\w+)?]+\b/gi);.
Once again the author doesn’t seem to understand groups. ‘non-?sense’ will be considered, but it’s not the author intent. /\b[-\w]+\b/g is basically all the author needs to obtain meant results.
The i (ignoreCase flag) is also useless because \w includes all chars, not only upper or lower case.
There are other gotchas that are not so great, specially when it comes to non latin alphabets, but it’s a decent starting point for an English counter so … I appreciated the effort but I also think you should ask for Technical Reviews before publishing.
This increases chances the content quality will be very high, like it is already for most of the content already present in here.
Best Regards
I remember there was a program in the early 1990s that count all words and how many times each word has occurred in a text. But now I don’t know anything about this program or app. It was called WORD Count or WORD those days. If you know of this program / app please let me know.
My problem is how to count the presence of different words in a document / article.