A Beginner’s Guide to Automatic Speech Recognition

Though Automatic Speech Recognition (ASR) has been around for a considerable amount of time – the first ASR device was created in 1961 – it is only recently that the technology has made its way into our homes. Voice assistants like Apple’s Siri have meant that most people have now had some kind of personal interaction with ASR.

The technology is also being utilised in a remarkably diverse array of commercial contexts. In the modern contact centre, its potential has been recognised and it’s now used in many customer service solutions, including Interactive Voice Response (IVR) and some Chatbots.

But what is ASR and how does it work?

Here, we take an in-depth look at what the technology is, how it works and what the future holds for it.

  1. Getting a grip on the basics – what is Automatic Speech Recognition?

Automatic Speech Recognition is a technology whose primary purpose is to transform spoken audio into text. Essentially, it takes the human voice and attempts to translate it into a written form that can be understood by both readers and machines.

Currently, one of the most widespread uses of this type of technology is in virtual assistants like Alexa and Siri. If you’ve ever activated your mobile device or home hub with a “Hey, Siri,” you’ve engaged with an ASR system. It’s the technology that allows your devices to respond to voice and audio inputs.

While basic forms of ASR technology may provide a simple text transcript of an audio recording, more complex variations combine ASR with other powerful technologies, such as Natural Language Processing (NLP) and Sentiment Analysis. When used in conjunction with AI technologies like NLP, ASR becomes one of the key components in conversational AI – machines and systems that can communicate in a human-like manner.

While we may not yet be at the point at which ASR systems are so well-designed that we can’t tell whether we’re engaging with a machine or a human, the rapid development of AI technologies suggests that we’re not far off.

  1. How is ASR utilised in modern technologies?

One of the key developments that makes ASR both possible and desirable is the mobile revolution. Today’s most powerful technology is handheld, able to travel, always within reach and integrated into almost everything. We have smart fridges, smart vehicles and smart lighting and heating.

All of these devices are capable of receiving data and transmitting it. As this powerful technology is now ubiquitous and doesn’t rely on a single control interface (it wasn’t that long ago that all of our digital activity was restricted to a single device – a large and cumbersome desktop computer hooked up to a phone line), it makes sense to develop a new kind of interface. An interface that doesn’t depend on you being within an arm’s length of the device – a voice-based control interface.

This interface is being deployed in many different ways. For instance:

  • Messaging apps – ASR transcribes voice recordings to text to be sent as messages
  • Search engines – ASR facilitates voice-based searches
  • In-car systems – ASR allows for hands-free control of navigation and entertainment systems, allowing drivers to focus on the road and improving safety
  • Virtual assistants – virtual assistants act as a voice-activated central control system that engages with a variety of apps to help you find information, schedule appointments and execute basic commands

The technology is also being deployed in a customer service context. Currently, it’s employed in three distinct ways:

  1. IVR – ASR has been used to provide callers with an alternative to the traditional keypad input method. Rather than responding to the IVR prompt by pushing the appropriate number on the keypad, users can vocalise their response.
  2. Chatbots – While the majority of Chatbots currently engage in text-based conversations with users, many incorporate some aspects of speech. In the future, we can expect ASR to play a big role in enabling more voice-based interactions between customers and Chatbots.
  3. Speech analysis – Some organisations may record customer calls and analyse the speech to improve the performance of their AI technologies. We’ll go into this in greater detail later on.
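
As a rough illustration of the IVR case, the sketch below accepts either a keypad digit or a transcribed spoken response and maps it to a menu option. The menu, the keywords and the select_option function are all hypothetical, and a real IVR would use far more robust language understanding than keyword matching.

```python
import re

# Hypothetical IVR menu: keypad digit -> keywords a caller might say instead.
MENU = {
    "1": {"billing", "bill", "payment"},
    "2": {"support", "technical", "fault"},
    "3": {"agent", "operator", "human"},
}


def select_option(response):
    """Return the menu digit for a keypad press or a transcribed utterance."""
    response = response.strip().lower()
    if response in MENU:
        # The caller pressed a key, so the response is already a digit.
        return response
    # Otherwise treat the response as ASR output and look for a known keyword.
    words = set(re.findall(r"[a-z']+", response))
    for option, keywords in MENU.items():
        if words & keywords:
            return option
    return None  # nothing matched, so the IVR would re-prompt the caller


print(select_option("2"))
print(select_option("I have a question about my bill"))
```

The point of the design is that the keypad and the voice route end up at the same menu option, so ASR can be added to an existing IVR flow without changing what happens after the choice is made.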

  1. Breaking down ASR to better understand the process

When looking at ASR and how it works, it’s necessary to examine the key hurdles it must overcome if it’s to generate accurate transcripts.

These can be summarised with five distinct questions.

  1. Direct transcription – what was said?
  2. Speaker identification – which speaker spoke when?
  3. Speaker recognition – who spoke?
  4. Spoken language understanding – what does it mean?
  5. Sentiment analysis – what emotions is the speaker feeling or trying to convey?

It’s important to note that not all of these questions need to be answered for an ASR system to work. For instance, the most basic ASR tools restrict themselves to answering the first, while advanced ASR systems can interpret emotion and intention in speech. Generally, the more of these questions an ASR system can answer, the more complex and capable it is.
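
One way to picture the five questions is as optional fields on a piece of transcript. The sketch below is purely illustrative: the TranscriptSegment class, its field names and the example values are assumptions made for this article, not the output format of any particular ASR product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptSegment:
    """One stretch of speech, annotated with whatever the system could work out."""
    text: str                             # 1. direct transcription: what was said
    speaker_label: Optional[str] = None   # 2. speaker identification, e.g. "speaker_1"
    speaker_name: Optional[str] = None    # 3. speaker recognition: who the speaker is
    intent: Optional[str] = None          # 4. spoken language understanding
    sentiment: Optional[str] = None       # 5. sentiment analysis


def questions_answered(segment):
    """Count how many of the five questions this segment answers."""
    fields = (segment.text, segment.speaker_label, segment.speaker_name,
              segment.intent, segment.sentiment)
    return sum(1 for value in fields if value)


# A basic tool only fills in the text; a more capable system fills in more fields.
basic = TranscriptSegment(text="I want to cancel my order")
advanced = TranscriptSegment(
    text="I want to cancel my order",
    speaker_label="speaker_1",
    intent="cancel_order",
    sentiment="frustrated",
)
print(questions_answered(basic), questions_answered(advanced))
```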

  1. Learning how machines process vocalisations

There are several different ways for a computer to interpret a word. Each differs according to what it uses as the building block of its language interpretation. For instance, a machine could interpret words as being made up of any of the following building blocks.

  • Phonemes – the basic units of sound in a language. English is usually described as having around 44 phonemes, each of which is a distinct sound, though the exact count varies with accent.
  • Morphemes – a part of a word that has meaning and that cannot be broken down into smaller parts without it becoming meaningless (e.g. unhealthiness is “un” + “health” + “ness”).
  • Parts of speech – words can be interpreted in terms of grammatical category and inflectional information, that is, whether the word is a noun or a verb, singular or plural, etc.
  • Meaning – machines can interpret words in terms of their meaning. However, this is both difficult and impractical due to the way many words have multiple meanings and how meaning can change with context.

The majority of ASR systems use phonemes as the basic unit of language and attempt to break down and understand speech in terms of combinations of these units. The simplified process is as follows.

  1. The user speaks into a device and the audio is captured by recording software
  2. A wave file of the audio recording is created. This wave file is then cleaned by removing any unwanted or unnecessary background noise
  3. The wave file is cut up according to its constituent phonemes
  4. The ASR software analyses the chain of phonemes. Using statistical analysis, it runs through the phonemes in order, using the likelihood of certain phoneme combinations to determine whole words.
  5. The software continues to use statistical analysis to transcribe whole sentences, paragraphs and texts.
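
Steps 4 and 5 above can be sketched with a toy decoder. The example assumes the audio has already been cut into phonemes, and it uses a tiny hand-written pronunciation lexicon and made-up word-pair probabilities (LEXICON, BIGRAMS and UNSEEN are all hypothetical values chosen for illustration). It then searches for the most likely sequence of words that accounts for every phoneme. Production systems are far more sophisticated, but the idea of letting statistics decide between “ice cream” and “I scream” is the same.

```python
import math

# Hypothetical pronunciation lexicon using ARPAbet-style phoneme symbols.
LEXICON = {
    "i": ("AY",),
    "ice": ("AY", "S"),
    "scream": ("S", "K", "R", "IY", "M"),
    "cream": ("K", "R", "IY", "M"),
    "like": ("L", "AY", "K"),
}

# Made-up probabilities P(word | previous word); "<s>" marks the sentence start.
BIGRAMS = {
    ("<s>", "i"): 0.6,
    ("<s>", "ice"): 0.1,
    ("i", "like"): 0.3,
    ("i", "scream"): 0.01,
    ("like", "ice"): 0.05,
    ("like", "i"): 0.01,
    ("ice", "cream"): 0.5,
}
UNSEEN = 1e-6  # small probability for word pairs missing from the table


def decode(phonemes):
    """Return the most probable word sequence covering all phonemes, or None."""
    phonemes = tuple(phonemes)
    n = len(phonemes)
    # states[pos] maps the last word used -> (log probability, words so far)
    states = [dict() for _ in range(n + 1)]
    states[0]["<s>"] = (0.0, [])
    for pos in range(n):
        for prev, (logp, words) in states[pos].items():
            for word, pron in LEXICON.items():
                end = pos + len(pron)
                if phonemes[pos:end] != pron:
                    continue  # this word does not match the next phonemes
                score = logp + math.log(BIGRAMS.get((prev, word), UNSEEN))
                if word not in states[end] or score > states[end][word][0]:
                    states[end][word] = (score, words + [word])
    if not states[n]:
        return None  # no combination of known words fits the phonemes
    return max(states[n].values(), key=lambda state: state[0])[1]


print(decode(["AY", "L", "AY", "K", "AY", "S", "K", "R", "IY", "M"]))
```

Both “I like ice cream” and “I like I scream” fit the phonemes exactly. The decoder prefers the first only because the assumed probabilities make it more likely.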

  1. The challenges involved in ASR

While that may all sound simple enough, countless complications can derail the process. By and large, these complications can be divided into two general ‘problem areas.’

The speaker

  • Language – there are many different languages spoken across the planet – what if a caller speaks a different language to your ASR system?
  • Articulation, slang and accent – even within the same language, two speakers can sound remarkably different due to variations in accent, dialect and vocabulary
  • Linguistic variation – it’s often the case that one language has multiple words for the same thing
  • Acoustics – if there’s excessive background noise, a poor audio recording or problems transferring the audio file, the ASR can struggle to transcribe accurately.

The complexity of the machine processes

  • Messy data – unlike other inputs, audio data can be noisy and messy. This means that there’s often a lot of superfluous information within the data that needs to be filtered and removed. However, determining what is superfluous and what is necessary is often challenging
  • Complex analysis – human use of language is endlessly inventive. As well as new words, there are always fresh ways of saying things. On top of this, we often speak in grammatically ‘incorrect’ ways, making accurate analysis even more difficult.
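
To give a feel for the “messy data” problem, here is a minimal sketch of one very simple filtering idea: split the audio into short frames and keep only those whose average energy is above a threshold. The frame size, the threshold and the sample values are arbitrary assumptions, and real systems use much more capable noise-reduction and voice-detection methods.

```python
def frame_energies(samples, frame_size):
    """Average energy (mean squared amplitude) of each consecutive frame."""
    energies = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(s * s for s in frame) / len(frame))
    return energies


def speech_frames(samples, frame_size=4, threshold=0.01):
    """Flag each frame as likely speech (True) or background (False)."""
    return [energy > threshold for energy in frame_energies(samples, frame_size)]


# Quiet hiss, then a louder burst (the "speech"), then quiet again.
audio = [0.01, -0.01, 0.02, -0.02,
         0.5, -0.4, 0.6, -0.5,
         0.0, 0.01, -0.01, 0.0]
print(speech_frames(audio))
```

The difficulty described above shows up immediately. Set the threshold too high and quiet speech is discarded; set it too low and background noise is passed on to the recogniser.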

  1. The importance of data – ensuring ASR can learn

Like most AI applications, the accuracy of ASR is almost entirely dependent on the availability of large amounts of high-quality data that can be used to train it.

Keeping it relatively simple for the sake of this article, ASR systems require access to large amounts of audio data in order to improve performance. Data is labelled and fed through the ASR system. Each time this occurs, the system gets better at distinguishing voices from other noises, one speaker from another and individual phonemes. Over time, it fine-tunes its performance and it grows more and more accurate.
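
The idea of learning from labelled data can be shown with a deliberately tiny stand-in for real training. Modern ASR systems typically train large neural networks. The sketch below instead just counts which phoneme follows which in a handful of labelled examples and turns the counts into probabilities. The train_bigrams function and the example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sequences):
    """Estimate P(next phoneme | current phoneme) from labelled sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    model = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        model[current] = {p: c / total for p, c in followers.items()}
    return model


# Three labelled examples: "cat", "cab" and "car" as phoneme sequences.
labelled = [["K", "AE", "T"], ["K", "AE", "B"], ["K", "AA", "R"]]
model = train_bigrams(labelled)
print(model["K"])
```

With only three examples the estimates are crude. Every additional labelled recording refines them, which is the same reason real systems need so much data.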

However, ASR systems need a remarkable amount of data if they’re to learn efficiently. For instance, Facebook’s new ASR software is built on more than 16,000 hours of voice recordings (Facebook AI Research). This is good news for customer service centres, which may find themselves playing an increasingly important role in future ASR data-gathering due to the high volumes of calls most centres receive.

  1. Measuring precision in ASR systems

Another significant aspect of ASR development is measuring precision and establishing how accurate the system is. Doing so allows us to compare ASR systems and to work out which is best able to recognise and transcribe speech.

While there are many ways to measure the precision of ASR systems, one of the most popular and widely used is Word Error Rate (WER).

WER is a formula that takes into consideration a range of potential mistakes. These errors include:

  • Substitutions – when a word is replaced with an incorrect substitution
  • Insertions – when a word that was not vocalised is inserted into the text
  • Deletions – when a word that was vocalised is missed out of the text

WER is calculated in the following way.

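WER = (substitutions + insertions + deletions) divided by the total number of words that were actually spoken.

The closer the result is to zero, the more accurate the system is. For example, two errors in a ten-word utterance give a WER of 0.2, or 20%.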
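
As a minimal sketch of how WER might be computed in practice, the function below compares what was actually said (the reference) with what the system produced (the hypothesis). It finds the smallest number of substitutions, insertions and deletions needed to turn one into the other, using a standard word-level edit distance. The function name and the example sentences are illustrative, and the sketch assumes both texts are already lower-cased with punctuation removed.

```python
def word_error_rate(reference, hypothesis):
    """Minimum word-level edits divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion (the second "the").
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the result can be greater than 1 when the system outputs many more words than were spoken.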

  1. Speech analysis – the power of ASR in a big data context

As well as being a supremely useful tool for customer service departments looking to automate responses in certain channels, ASR is key to exploiting new data streams.

Today, valuable information about customer behaviour and habits is largely gleaned from written sources. Our social media posts, emails, internet history, choice of restaurant or fast food order are all recorded and used to improve companies’ understanding of how consumers behave. In turn, this allows organisations to more effectively target potential customers and to influence the way they act.

In this respect, ASR is important because of the way it facilitates speech analysis and provides businesses with another source of data that can be mined for valuable insight. Even more excitingly, the nuance of human speech and the power of ASR systems to interpret intent and emotion (via technologies like NLP and Sentiment Analysis) mean that speech analysis may provide far more valuable detail than existing data sources.

For contact centres, this is a remarkable opportunity. Though ASR is likely to be restricted to automation tasks for the immediate future, in the long term it will play an increasingly important role in determining how to respond to specific customers and their complaints. It could also help human agents get to the heart of complex customer problems as quickly as possible.

  1. Looking forward – the future of ASR viewed through three challenges

Despite the rapid development of ASR technologies, there are still some obstacles to overcome if it’s to achieve its full potential. The future of the technology is best viewed through the perspective of these challenges, as they highlight the ways individuals and organisations hope to utilise the technology over the coming years.

  • Background noise – while ASR technologies have improved drastically with regard to filtering out background noise, there is still some way to go. Currently, background noise limits ASR deployment. However, should the technology develop to the point where it can filter out background noise in busy public spaces, it will radically increase the extent to which ASR can be used, as well as the number of applications it can be used for.
  • Far-field speech – not all speakers are right next to the recording device. Currently, ASR systems find it difficult to deal with voices travelling different distances. If ASR is to prove itself suited to environments such as business meetings, it will need to improve in this respect.
  • Less-resourced languages – as we’ve mentioned, ASR systems depend on a vast amount of audio data for training. This is a problem when it comes to developing systems for languages in which that data is difficult to acquire or simply not available. If ASR is to become a truly global technology, this will need to be tackled.

What Next?

Thanks to the ongoing AI revolution, ASR is developing at a remarkable pace. The ability of the technology to ‘teach itself’ with large quantities of data has been a game changer and has allowed inventive entrepreneurs to dream up an endless array of ways the technology can be utilised commercially.

One of the areas that stands to benefit most from ASR is customer service. Technologies that automate customer service provision without negatively affecting the quality of that service are in enormous demand because they allow you to cut costs without frustrating customers. This makes ASR a valuable tool for any contact centre looking to provide improved customer service on a tighter budget.

To find out how ASR can help you provide a smoother customer experience, call our Demo Line on 07723 547670 or find out more via the link below.

Our expert team have been providing customer self-service solutions for over 25 years. Call us on 01344 595800 or drop us a line to find out more.