
Interesting CHRIS video

An important European robotics project called CHRIS (Cooperative Human Robot Interaction Systems, FP7 215805) received its final review last April. The project has also created a very nice video that summarizes its work:

As the video shows, the work includes the recognition of speech, gesture (pointing), actions, and objects, all within a context of cooperation and safety. But I will not try to summarize their work here. Just watch the video.

Gesture Recognition in the Operating Room

In the news recently: Future Surgeons May Use Robotic Nurse, ‘Gesture Recognition’.

ScienceDaily (Feb. 3, 2011) — Surgeons of the future might use a system that recognizes hand gestures as commands to control a robotic scrub nurse or tell a computer to display medical images of the patient during an operation.

Purdue industrial engineering graduate student Mithun Jacob uses a prototype robotic scrub nurse with graduate student Yu-Ting Li. Researchers are developing a system that recognizes hand gestures to control the robot or tell a computer to display medical images of the patient during an operation. (Credit: Purdue University photo/Mark Simons)

I have noticed similar projects before, where surgeons in the OR were the target users of gesture recognition. The basic idea behind this niche application area is fairly simple: a surgeon wants to control a growing battery of technological systems but does not want to touch them, because that would increase the risk of infection. So he can either gesture or talk to the machines (or let other people control them).

In this case the surgeon is supposed to control a robotic nurse with gestures (see more about the robotic nurse here). You can also view a nice video about this story here; it is one of the main stories in the latest Communications of the ACM.

Well, I have to say I doubt whether this is a viable niche for gesture recognition. So far, speech recognition has been used with some success to dictate operative reports during the procedure. I don’t know if it has been used to control computers in the OR. Frankly, it sounds a bit scary and also a bit slow. Gesture and speech recognition are known for their lack of reliability and speed. Compared to pressing a button, for example, they give more errors and longer delays. My opinion would therefore be that anything mission-critical during the operation should not depend on gesture or speech control.

However, the real question is what the alternatives to gesture or speech control are, and how reliable and fast those alternatives are. If the surgeon has to tell another human what to do with the computer, such as displaying a certain image, then this can also be unreliable (because of misinterpretations) and slow.

The article mentions several challenges: “… providing computers with the ability to understand the context in which gestures are made and to discriminate between intended gestures versus unintended gestures”. That sounds like they also ran into problems with fidgeting or similar things that surgeons do.
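For illustration, a common way to attack the intended-versus-unintended problem is to reject low-confidence classifications and to require that a gesture persist briefly before acting on it. Below is a minimal sketch of that idea; the classifier interface and all names are my own assumptions, not the Purdue system’s actual code:

```python
# A minimal sketch, not the Purdue system's actual code: reject
# low-confidence gestures and require a short dwell before acting,
# so fidgeting does not trigger commands. The classifier interface
# (predict -> (label, confidence)) is an assumption.

from collections import deque

CONFIDENCE_THRESHOLD = 0.85  # below this, treat the motion as noise
DWELL_FRAMES = 10            # gesture must persist for 10 frames (~1/3 s at 30 fps)

recent = deque(maxlen=DWELL_FRAMES)

def interpret(frame, classifier):
    """Return a command label, or None for unintended motion."""
    label, confidence = classifier.predict(frame)
    if confidence < CONFIDENCE_THRESHOLD:
        recent.clear()           # low confidence: reset the dwell window
        return None
    recent.append(label)
    # Act only when the same gesture has filled the whole dwell window.
    if len(recent) == DWELL_FRAMES and len(set(recent)) == 1:
        recent.clear()           # avoid re-firing the same command
        return label
    return None
```

Note that a dwell requirement like this filters out fidgeting at the cost of making every command a bit slower, which is exactly the reliability-versus-speed trade-off discussed above.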

In sum, it will be interesting to see if surgeons will be using gesture recognition in the future, but I wouldn’t bet on it.

Mama Appelsap, Perception and Phonetics


What you might hear in an English song if you are a Dutch native speaker.

In my own research, ambiguity in signs is a recurrent issue. Context can change the meaning of signs. And if you are unfamiliar with a sign, you may project anything that comes to mind onto the incoming signal. These songs are great examples of such projections: Dutch listeners who have trouble deciphering the lyrics supplant them with their own ‘Dutch phonetic interpretations’. DJ Timur collects such cases as ‘mama appelsap’ songs.

In a way this is quite similar to this ‘silly’ translation of the song ‘Torn’ (here) into makeshift ‘sign language’. Or perhaps that is only a vague association in my mind and not an actual similarity…

No wait, it wasn’t a translation from song to sign, but the other way around: from a signed news item to a silly voice-over…

And even this does not really show a lot of similarity to the ‘mama appelsap’ phenomenon, because the ‘translator’ does not supplant the correct phonology (BSL) with the phonology of another language (e.g. English); he just interprets the signs in the only way he can: through iconic strategies. In a way you could call that the ‘universal language of gesture’, but that would be a bit lame, for there wouldn’t really be anything like a proper phonology at work, I think (not being a real linguist, I am unsure). It does show the inherent ambiguity of gestural signs quite nicely, doesn’t it? And how it can be quite funny to choose to ignore contextual clues or even supplant an improper context. Ambiguity and context. A lovely pair.

My apologies to the Deaf readers who cannot bear to see these videos: I think my audience knows enough about signed languages to know that it is not really signed language nor a proper translation.

iCommunicator solves nothing at $6499?

“Well Jim, good to see you, and what have you got for us today?” “Same here, John, and I can tell you I have something really amazing. Just watch this!”

It listens, it types, it signs, it speaks, “iCommunicator is the finest software ever developed for people who are deaf or hard of hearing”

Here is some ‘honest advice’ from EnableMart:

Training to Ensure Positive Outcomes. … Systematic professional training is strongly encouraged to maximize use of the unique features… The end user must be completely trained … to achieve positive outcomes. Managers of the system should … provide training for both end users and speakers … Additional time may be required to customize … Contact EnableMart for information about professional training opportunities. 

At first glance this seems a fair bit of warning before you spend $6499 on an iCommunicator 5.0 kit. However, EnableMart sells the advised training for an additional $125 an hour; it is not free. I think this entire thing is a bit suspicious. I have worked with speech recognition, including Dragon NaturallySpeaking, and it makes recognition errors (period). I have also fooled around with or seen most sign synthesis technology available today, and it is far from natural. And the same is true for speech synthesis.

These technologies have yet to make good on their promises. If you ignore actual user experiences, you can imagine they will solve many communication problems. But in practice, little errors cause big frustrations. Using speech recognition can be very tiring and irritating. It only works if the entire interaction is designed well and the benefits outweigh the costs.

Just imagine you are a deaf person using this iCommunicator with some teacher and a simple speech recognition error occurs: how is that error handled? Usually, when a speaker dictates to Dragon NaturallySpeaking he will spot the error and correct it. In this case your teacher will not spot the error (assuming he doesn’t monitor your screen) and the dialogue will continue with the error in place (unless there is enough context for you to spot the error and understand what the speaker actually said).

A second problem is that you have to persuade people to wear your microphone to enter into a conversation with you. In a weird and cynical way you are asking them to suffer the same techno-torture as you. Not something you want to do more than twice a day, I imagine. And only with people whose affection you can afford to lose.

The sign synthesis is fairly straightforward sign concatenation. A dictionary of 30,000 signs is accessed to get a video for every word, and the videos are then played one by one, without any further sentence prosody. That means it looks terrible, like a gun firing signs at you. It also means it does not sign ASL, but signed English at best. Good enough, you might say, but I think the benefit of artificial signed English over typed text is small. So the signing is pretty much worthless.

Jim the tell-sell guy further claims you can use it to improve your speaking. I do not believe speech recognition technology can give the proper feedback to improve articulation difficulties. It may be able to judge whether you pronounced something correctly (or at least similarly to what it knows), but that’s about it. Although there is something in the specs about pronunciation keys, the video doesn’t show details. I simply do not think a computer can reliably tell you what sort of error you made.

So what does that leave? You can type text and your iCommunicator reads it out loud with text-to-speech. You can get that sort of software for the price of a cheap dinner from any of these sites.
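To make concrete why such output looks so stilted, here is a minimal sketch of word-by-word sign concatenation as I understand it; the dictionary, playback, and fingerspelling functions below are hypothetical placeholders, not iCommunicator’s actual code:

```python
# Hypothetical sketch of naive sign concatenation: one pre-recorded
# video per word, played back-to-back with no transitions and no
# sentence prosody. This yields signed English at best, not ASL.

sign_videos = {}  # word -> path to a sign video clip (~30,000 entries)

def sign_sentence(text, play_video, fingerspell):
    """Render recognized speech as a sequence of isolated sign clips."""
    for word in text.lower().split():
        clip = sign_videos.get(word)
        if clip is not None:
            play_video(clip)      # each clip plays in isolation
        else:
            fingerspell(word)     # out-of-vocabulary fallback
```

Nothing in this loop knows about the previous or next sign, which is exactly why the result looks like a gun firing signs at you rather than fluent signing.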

Finally, the iCommunicator v5.0 lets you search for a word on Google with a single click. That’s pretty neat, I admit. If you also think that that is worth a couple of thousand dollars, please contact me. I can supply an iBrowser v6.1 for only $2999, and will supply the necessary training for free. What the hell, I’ll even throw in a professional designer microphone v7.2 🙂

Unfortunately, the business case of the iCommunicator may actually rest on sales to hearing people who wish to reduce or entirely avoid the cost of interpreters:

HighBeam Encyclopedia: …The iCommunicator also enables government workers to provide equal access to information and services to the hearing impaired in compliance with the Americans with Disabilities Act and Section 508… 

Sometimes, you can only hope the future will prove you wrong.

Gesture and Speech Recognition RSI

Gesture and speech recognition often promise the world a better, more natural way of interacting with computers. Speech recognition, for example, is frequently sold as a solution for RSI-stricken computer users. And Prof. Duane Varana, of the Australasian CRC for Interaction Design (ACID), believes his “gesture recognition device [unlike a mouse] will accommodate natural gestures by the user without risking RSI or strain injury”.

Gesturing: A more natural interaction style? (source)

So it is a fairly tragic side effect of these technologies that they create new risks of physical injury. Using speech recognition may give you voice strain, which some describe as a serious medical condition affecting the vocal cords, caused by improper or excessive use of the voice.

Software coders who suffer from RSI and switch to speech recognition to code are mentioned as a risk group for voice strain. Using gesture recognition, or specifically the Nintendo Wii, may cause aching backs, sore shoulders, and even a ‘Wii elbow’. It comes from prolonged playing of virtual tennis or bowling, when gamers apparently use neglected muscles for extended periods of time…

In comparison, gamers have previously been known to develop a ‘Nintendo thumb’ from thumbing a controller’s buttons. I can only say: the Wii concept is working out. It is a workout for users, and it works out commercially as well. I even saw an ad on Dutch national TV just the other day.

The Wii is going mainstream. As far as injuries are concerned: If you bowl or play tennis in reality for 8 hours in a row, do you think you will stay free of injury? Just warm up, play sensibly and not the whole night. Nonsense advice for gamers, I know, but do not complain afterward.

A collection of Wii injuries (some real, some imaginary):
– www.wiihaveaproblem.com, devoted to Wii trouble.
– What injuries do Wii risk?
– Bloated Black Eye, Broken TVs, and a hand cut on a broken lamp (YouTube, possibly faked).

For more background see also: The Boomer effect: accommodating both aging-related disabilities and computer-related injuries.
