It was only six years ago that Apple introduced the beta version of Siri to the world. Yet it’s already easy to take for granted the convenience that speech recognition brings to our daily lives. Virtual personal assistants like Siri and Alexa have recently offered us ways to use speech recognition every day, but long before we met those chatty ladies, speech recognition had been growing its vocabulary for years.

So, how far have we come, and where are we headed? Let’s take a walk down speech recognition lane, starting in 1952.

Speech Recognition Software Family Tree
  • 1952 — Yes, that’s right, speech recognition software has its roots in the days of the sock hop and soda fountain. Bell Laboratories designed the “Audrey” system, a computer capable of comprehending a series of digits read out loud to it by a single voice.
  • 1962 — The 1962 World Fair was the debut of IBM’s Shoebox machine, which understood a whopping sixteen words spoken in English plus the digits 0-9.
  • 1976 — Carnegie-Mellon’s “Harpy” was ominously named for an evil mythical creature to establish it as a “different beast entirely,” and had a vocabulary of 1011 words.
  • 1987 — Thanks to the “hidden Markov model,” speech recognition became a commodity, even in the toy industry. In fact, check out this rather disturbing commercial for a terrifying doll named “Julie,” which utilized the predictive speech model in an inventive and ultimately unsettling way.
  • 1990s — The first consumer-facing speech recognition software, Dragon Dictate, launched with a jaw-dropping price tag of $9,000. In 1997, a new $695 version was released, but it still required a 45-minute training session before it would work. Today, Dragon’s products still lead the consumer market.
  • 2000s — As the first decade of the 21st century took wing, speech recognition technology reached 80% accuracy by 2001.
  • 2011 — Apple released Siri as part of iOS 5, making speech recognition software something consumers could hold in their hands.
Judging the Accuracy of Speech Recognition Software

The accuracy of speech recognition software has traditionally been measured by comparing the software’s output against a manual transcription produced by a person. Analysts compare the two transcripts to calculate the Word Error Rate (WER): the proportion of words in the automatic transcript that are substituted, inserted, or deleted relative to the manual one. More recently, subjective measures have also been used to account for more complex issues, such as grammatical structure.
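For the curious, here is a minimal sketch of how WER is typically computed: the word-level edit distance (substitutions, insertions, and deletions) between the two transcripts, divided by the length of the manual reference. The function name and the sample sentences are illustrative, not from any particular toolkit.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for word-level edit (Levenshtein) distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Two wrong words out of six gives a WER of about 0.33, i.e. ~67% accuracy.
print(wer("the cat sat on the mat", "the cat sat in the hat"))
```

By this measure, IBM’s 5.5% error rate corresponds to roughly one word wrong in every eighteen.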

Speech Recognition Accuracy Today

Today’s industry-leading voice recognition software, like Siri and Alexa, is expected to be at least 95% accurate. There has been some debate as to whether that bar should be higher or lower, but in March 2017, IBM announced that it had reached an error rate of 5.5%, compared to Microsoft’s 2016 rate of 6.3%. Then, only a few months later at its I/O conference in May, Google announced that it had reached a once-unfathomable 4.9%. Google attributed this advance to deep learning methods previously used in computer vision for image recognition.

Automatic services like Temi can transcribe an audio recording into text after the recording is played only once, meaning transcripts are available within minutes. Even voicemails and video captions are supported by speech recognition and analysis software, making screening your calls and secretly watching cat videos at work easier than ever.

Speech Recognition Accuracy in the Future

The future of speech recognition technology is looking brighter by the day. Comprehension levels in AI are reaching near-perfect levels, and we as a society are in a fascinating position to watch and benefit.

It’s commonplace for smartphones to have a Siri or OK Google search function, but the industry’s smartest minds are finding ways to bring speech tech into new and emerging fields. Smart assistants like Microsoft’s Cortana and Amazon’s Alexa are applying speech recognition to smart homes, making everything from grocery shopping to changing your music as easy as speaking out loud. On the cutting edge, Waverly Labs has just rolled out the Pilot Translation Kit, a set of small earbuds that detect the language being spoken and automatically translate it into your native tongue.

Voice recognition has certainly come a long way from the World’s Fair and creepy talking dolls. Next time you need help and don’t know what to do next, you might just find yourself asking: Hey, computer? (Captain Kirk, eat your heart out.)

Get a Free Transcript