News:

And we're back!

Main Menu

Voice Recognition: Has it come of age?

Started by jimmy olsen, November 20, 2011, 11:27:10 PM

Previous topic - Next topic

jimmy olsen


The singularity is upon us!11

http://www.guardian.co.uk/technology/2011/nov/20/voice-recognition-apple-siri
QuoteVoice recognition: has it come of age?

Apple's voice-controlled 'personal assistant' Siri has got everyone talking (in more ways than one). Is this just the start of a computer revolution?

        Charles Arthur
        guardian.co.uk, Sunday 20 November 2011 15.00 EST
        Article history

The man sits down in front of the computer and says, affably: "Computer!"

Nothing happens. In a now-hear-this tone, the man repeats: "Computer?" Still nothing happens. Puzzled, he picks up the mouse and speaks into it: "Hello, computer?"

Beside him, the impatient owner says: "Just use the keyboard." The first man replies, "A keyboard?" Then, slightly annoyed: "How quaint."

The scene comes from the 1986 film Star Trek IV, where Scotty, the engineer, and the rest of the crew have flown back in time from the 23rd century; Scotty needs to get some work done on the computer, and, of course, in the 23rd century they all work by voice command, unlike those 1980s throwbacks. Ha, ha.

Yet if the crew were to land 35 years later, in the present day, Scotty would still be just as puzzled at the computer's lack of responsiveness – unless, that is, he picked up one of the latest breed of smartphones, where being able to respond to the human voice has become the new frontier in interaction.

Since October, people have been buying and using Apple's new iPhone 4S, which comes with a function called Siri – a "voice-driven assistant" which can take dictation, fix or cancel appointments, send emails, start phone calls, search the web and generally do all those things for which you might once have employed a secretary.

Siri isn't just a "voice recognition" tool, though it can do that (so you speak some words and it turns them into text, and sends them as an email or text message). You can also ask it things such as: "How's the weather looking tomorrow in London?" and it will come back with the forecast for London ("England"). It'll do currency conversions or give stock prices. Or try asking it: "Why is the sky blue?" and, after a little thinking, the screen will show an explanation: "The sky's blue colour is a result of the effect of Rayleigh scattering." (There is more, but we all know about Lord Rayleigh's work on molecular refraction in the troposphere, don't we?)

Lots of people think Siri isn't anything new. We've had voice-dialling on our phones for ages (where you say a name to the phone, link that to a phone number, and when you say it again it dials it). Google already offers a voice-driven web search app, which it backed with a big poster campaign around London earlier this year, with phonetic spellings of searches you might want to do - "pih-ka-di-lee sur-khus" or "tak-see num-buhz".

But experts say that Siri – and what it represents – might be as subtly revolutionary as the iPhone's multi-touch screen was when unveiled in January 2007. That's because Siri isn't just "voice dialling" or "voice recognition" (which tries to turn speech into its text equivalent); it's "natural language understanding" – NLU, in the lingo.

I've been testing an iPhone 4S, and found lots of uses for Siri: for doing currency and other conversions (sq ft in sq m, anyone?) or finding companies' stock values and market valuations (useful in stories) when a web search would distract from writing, and – in dodgier parts of town – to play songs or playlists (instructed via the headset microphone) without taking the phone out of my pocket. Better to be thought a bit strange than have the phone nicked.

Siri grew out of a huge project inside the Pentagon's Defense Advanced Research Projects Agency (Darpa), those people who previously gave you the internet and, more recently, a scheme to encourage people to develop driverless cars. Siri's parent project, called Calo (Cognitive Assistant that Learns and Organizes) had $200m of funding and was the US's largest-ever artificial intelligence project. In 2007 it was spun out into a separate business; Apple quietly acquired it in 2010, and incorporated it into its new phone.

When you ask or instruct Siri to do something, it first sends a little audio file of what you said over the air to some Apple servers, which use a voice recognition system from a company called Nuance to turn the speech – in a number of languages and dialects – into text. A huge set of Siri servers then processes that to try to work out what your words actually mean. That's the crucial NLU part, which nobody else yet does on a phone.

Then an instruction goes back to the phone, telling it to play a song, or do a search (using the data search engine Wolfram Alpha, rather than Google), or compose an email, or a text, or set a reminder (possibly linked to geography – the instruction, "Remind me to call mum when I get home," will work), or – boring! – call a number.

NLU has been one of the big unsolved computing problems (along with image recognition and "intelligent" machines) for years now, but we're finally reaching a point where machines are powerful enough to understand what we're telling them. The challenge about NLU is that, first, speech-to-text transcription can be tricky (did he just say, "This computer can wreck a nice beach," or "This computer can recognise speech"?); and second, acting on what has been said demands understanding both of the context and the wider meaning.

The demonstration that computers have cracked this – just as their relentless improvement cracked draughts and then chess – came in February this year, when IBM's Watson system competed in the American game show Jeopardy!, where the quizmaster provides a sort-of answer, and you have to come up with the question. (For example: "Originally called What's the Question, it got its name when a producer at testing said it needed more jeopardies.")

Except Watson didn't compete against just anyone. It was ranged against two humans who between them had scores of wins, and won millions of dollars in prize money from the game. They battled it out over "answers" such as "William Wilkinson's An Account of the Principalities of Wallachia and Moldavia inspired this author's most famous novel." Yes, of course – Bram Stoker. Watson, and the humans, answered correctly, but Watson had more points, and won. Watson wasn't competing on absolutely level terms; it was fed the questions as text at the same time as they were read to the other two players. Still, its victory led to plenty of tweeting from people welcoming our new robotic overlords. And it wasn't even connected to the internet; it just relied on a huge database of information stored on its system.

"While competing at Jeopardy! is not the end-goal," its IBM engineers noted drily, "it is a milestone in demonstrating a capability that no computer today exhibits – the ability to interact with humans in human terms over broad domains of knowledge." And, of course, in understanding what people mean when they say something.

Having gained that glory, Watson has been shunted off to work on healthcare problems – with the addition of Nuance's speech-to-text technology. You can imagine it popping up on some future episode of House, solving some gnarly medical conundrum (and perhaps banjaxing the series' reason for existing).

But much more interaction like Watson's is likely because of the growing availability of "cloud computing", where huge amounts of processing power is available ad hoc over the internet. Amazon, for instance, now adds almost as much computing power per day to its cloud computing systems as it used to need for all its site in 2000. Other companies are doing the same.

That means NLU is starting to seep into our daily lives: we've grown used to computers getting better at understanding us in limited contexts – where you phone an automated system and can pay bills just by reading out your payment card number, or use booking systems that don't have humans. Now, it's spreading wider, being used for "semantic analysis" of Facebook and Twitter postings by companies eager to figure out whether people are saying positive or negative things about them online. Feed in the text, and the computer figures out whether people are happy or annoyed with you. It could even work on your car: no more fiddling with the dashboard.

"The key thing is NLU – understanding what you mean and what you want," says Neil Grant of Nuance. "Historically, cars have voice elements but historically you had to learn a huge long list of commands. As NLU progresses, you can say what you want in a way that's natural to you."

Norman Winarsky, who coordinated the funding for the Siri company, says it is "real AI [artificial intelligence] with real market use", adding that "Apple will enable millions of people to interact with machines with natural language. [Siri] will get things done and this is only the tip of the iceberg. We're talking another technology revolution. A new computing paradigm shift."

Francisco Jeronimo, smartphones analyst for the research company IDC, is less sure: "We've had voice recognition on phones for a few years, but it never really took off. There are two reasons for that: there was no brand or company that really pushed it the way that Apple is doing. And there's no other company with a brand like Apple. It was only when Apple launched the new iPhone with Siri that I tried voice search on my Android phone – I'd never tried it before." Sounding a little surprised, he adds: "It worked really well."

In fact, what's happening now with NLU – which is spreading far beyond just the iPhone – could just be the beginning of a revolution in how we use computers (particularly smartphones) as big as that which came with the iPhone in 2007, when suddenly multi-touch screens (where you can do more than just prod the screen) were de rigueur. It's taken almost five years, but now multi-touch screens are everywhere: hundreds of millions of touch screen phones and tablets have been sold, and even the next version of Microsoft's Windows – due in about 12 months' time – will offer swooshy touch-screen operation.

Horace Dediu, who runs the consultancy Asymco, having formerly worked for the mobile phone maker Nokia, thinks the time is ripe for a new way of interacting with our computers. He points to how Apple drove changes in interaction before: the mouse and windows in 1985, the iPod's click wheel in 2001, multi-touch in 2007. And he also points out that Siri has some interesting similarities with what people thought about the original iPhone touch screen: "It's not good enough; there are many smart people who are disappointed by it; competitors are dismissive; it does not need a traditional, expensive smartphone to run, but it uses a combination of local and cloud computing to solve the user's problem."

Certainly there's been no shortage of people who say that Siri (which Apple meticulously points out is a "beta", or unfinished product) "isn't good enough" because it can't yet deal with every accent, or every possible query. They also say that the whole idea of asking a disembodied computer questions is ridiculous: "You won't catch me saying things into a phone" is typical of the sort of reaction. (An online poll at the website Ars Technica found 36% saying "you won't catch me talking at a machine in public".) Which is odd when you think about it, since phones are designed for talking into, and nobody seems to get embarrassed about announcing that they're on the train to a carriageful of people, or reading their credit card details out loudly and slowly for all to note. Perhaps there's an element of self-consciousness: we're more concerned with our own feelings about asking questions of a machine than anyone else's about hearing them.

The thing about NLU is people have been expecting it to happen for ages. In 1996 I watched Bill Gates announce that by 2011 we would have computers that could recognise their owners' faces and voices. That's this year! And he's right – if you count smartphones as computers which they are, really – about as powerful as the laptop computers of 2001. The newest Android phones can be unlocked by showing them your face. And the voice thing – well, we're working on it.

Not that things are perfect now either: Siri's servers have already had a number of outages. At Nuance, Grant thinks that will be solved: "Time will sort out the connection problems," he says.

Even so, it's not clear whether in our offices we'll talk to our computers, Scotty-style. Apart from anything, the noise would be maddening. Wouldn't it? Then again, that's probably what people thought in the 19th century about introducing telephones to offices. And we already have offices where people spend the whole time on the phone. We know them as call centres. Just hold on for a few years, Scotty. The computer will hear you soon.
It is far better for the truth to tear my flesh to pieces, then for my soul to wander through darkness in eternal damnation.

Jet: So what kind of woman is she? What's Julia like?
Faye: Ordinary. The kind of beautiful, dangerous ordinary that you just can't leave alone.
Jet: I see.
Faye: Like an angel from the underworld. Or a devil from Paradise.
--------------------------------------------
1 Karma Chameleon point

derspiess

Google voice search is surprisingly good at recognizing my voice.  Google Voice's visual voicemail tends to do a less stellar job at transcribing my voicemails, though.  If it's a voicemail from my wife, it usually just gives up :lol:
"If you can play a guitar and harmonica at the same time, like Bob Dylan or Neil Young, you're a genius. But make that extra bit of effort and strap some cymbals to your knees, suddenly people want to get the hell away from you."  --Rich Hall

HVC

My voicemail-to-text feature calls me larry. it annoys me.
Being lazy is bad; unless you still get what you want, then it's called "patience".
Hubris must be punished. Severely.

PDH

Quote from: HVC on November 21, 2011, 11:55:09 AM
My voicemail-to-text feature calls me larry. it annoys me.

:) It's trying to make you a 1st worlder.
I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth.
-Umberto Eco

-------
"I'm pretty sure my level of depression has nothing to do with how much of a fucking asshole you are."

-CdM

MadImmortalMan

A generation or two need to die off before the rest of us can have nice toys.
"Stability is destabilizing." --Hyman Minsky

"Complacency can be a self-denying prophecy."
"We have nothing to fear but lack of fear itself." --Larry Summers

crazy canuck

QuoteYet if the crew were to land 35 years later, in the present day...

If the crew were to land 35 years later they would not be in the present day.  Maybe voice recognition works like Star Trek in 2021 and the person writing the article time travelled to let us know about it?

Warspite

Ah, another Apple press release masquerading as a Guardian article.
" SIR – I must commend you on some of your recent obituaries. I was delighted to read of the deaths of Foday Sankoh (August 9th), and Uday and Qusay Hussein (July 26th). Do you take requests? "

OVO JE SRBIJA
BUDALO, OVO JE POSTA

DontSayBanana

No.  Just no.  Hell, Siri doesn't do anything Android's speech recognition or even WSR can't be easily coaxed into doing.

I also find it interesting how the author complains about image recognition being a fantasy, despite Google Goggles having been mature since 2009 (which works with computer webcams as well as with phone cameras).  Methinks a little more homework would not have gone amiss before publishing this piece.
Experience bij!

HVC

Quote from: PDH on November 21, 2011, 01:15:14 PM
Quote from: HVC on November 21, 2011, 11:55:09 AM
My voicemail-to-text feature calls me larry. it annoys me.

:) It's trying to make you a 1st worlder.
hey i am a first worlder! i was born here... it's just my name that's fucked up :lol:
Being lazy is bad; unless you still get what you want, then it's called "patience".
Hubris must be punished. Severely.