Siri showed us how much we want speech recognition to work – and how bad it can be
Ten years ago Apple launched Siri. It was described as a “humble virtual assistant”, using natural language processing to interpret and act on voice commands. It was seen as a major step forward for speech and voice recognition, with one analyst describing it as “a powerful harbinger of the future use of mobile devices – not just the power of voice but, more importantly, the ability to contextualise a statement or request”. The announcement was a statement of intent from Apple: not only would they be first to market with the new technology (Amazon and Google were at least a year behind), they would also drive it forward in the years to come.
Fast forward a decade and the company’s lofty ambitions for Siri and its ilk have fallen short of the mark. While many think of Siri when they hear the phrase “speech recognition” they will also likely be aware of the problems it has encountered, whether it be the questions around user privacy or even a glitch that saw it using expletives. However, the most prevalent issue with Siri is depressingly simple: it fails to understand everyone.
As a result, consumer speech recognition has mostly failed to live up to the hype, and many are left wondering whether it will ever deliver for everyone. We laugh at a parrot ordering items on Amazon through a virtual assistant, but in reality this is an egregious failure, and it can have devastating consequences – as it did for the Irish woman who failed an English exam and was denied an Australian visa because an automated oral test was unable to recognise her accent. In the years to come, speech recognition will play an increasingly important role in our personal and professional lives, but we need to ensure that its benefits are truly felt by all.
Obviously, understanding is everything; achieving it, however, is no mean feat. Major breakthroughs have been made in the last decade, but an inevitable bias remains, given who trains Siri: white, English-speaking men. Simply put, we need a wider representation of voices in our datasets.
Recent developments have aimed to address this. Most machine learning algorithms need a human to “label” data in order to train them to identify patterns – when the algorithm later encounters something that resembles a labelled example, it can assume the two are related. Self-supervised learning follows the same process, but without the need for human intervention. This means the sources of data available to self-supervised engines are massive compared with those of their supervised counterparts: because nothing needs to be categorised beforehand, they can process far larger quantities of data and far more diverse datasets.
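To make the distinction concrete, here is a minimal, illustrative sketch in Python – not any vendor’s actual pipeline – in which the acoustic feature array, the transcript and the masking function are all invented for the example. A supervised example needs a human-written transcript as its target; a self-supervised example derives its target from the recording itself, so any speech, in any accent, can become training data.

```python
# Illustrative sketch only: contrasts where supervised and self-supervised
# speech models get their training targets. All data here is made up.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a clip's acoustic features (e.g. 100 frames x 40 mel bins).
audio_features = rng.normal(size=(100, 40))

# --- Supervised: the target is a label a human must provide ---------------
human_transcript = "turn on the kitchen lights"   # hypothetical annotation
supervised_example = (audio_features, human_transcript)

# --- Self-supervised: the target is derived from the signal itself --------
# Mask a contiguous span of frames; the model's task is to predict the
# hidden frames from their context. No annotator is needed.
def make_masked_prediction_example(features, span=10):
    start = rng.integers(0, len(features) - span)
    target = features[start:start + span].copy()   # what the model must predict
    inputs = features.copy()
    inputs[start:start + span] = 0.0               # hide the span from the model
    return inputs, target, start

inputs, target, start = make_masked_prediction_example(audio_features)
print(f"masked frames {start}-{start + len(target) - 1}; "
      f"target shape {target.shape}; no human label required")
```

Because the second recipe manufactures its own targets, it scales to whatever unlabelled speech is available – which is exactly why it opens the door to training on a much broader range of voices.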
All of the big tech players are acutely aware of the value of speech recognition innovation. Cambridge, the UK’s challenger to Silicon Valley, has proved a valuable source for them: since the launch of Siri, Apple has snapped up VocalIQ, Google has bought Phonetic Arts and Amazon has acquired Evi Technologies. These moves are a clear indication of the huge commercial opportunity.
While “good enough” may have been acceptable when Siri was first introduced, we should be light years ahead by now. As things stand, it is likely that others will win the race.