Some people are actively interested in the stuff I am doing in my PhD studies, or at least they ask me questions about it. I usually tell them about my first experiment. That experiment was entirely about the difference between meaningless movements I call fidgeting and meaningful gestures, in this case sign language signs.

“Press the spacebar as soon as you see a sign”

It struck me then, and it still strikes me, how appropriately people in conversation respond to each other. Many, many times I saw people reacting to gestures of all sorts. Maybe just a little head nod or a palm-up gesture, or a raising of the eyebrows. And how often do you see anyone accidentally responding to a movement that was not intended to communicate after all?

Imagine the following chitchat:
You: “Nice weather huh?”
Her: “Yeah” (and makes some sort of movement)
You: “What do you mean, you think I am crazy?” (misinterpreting the movement)
Her: “I didn’t do anything, what are you talking about?” (now starts thinking you are crazy)

Sounds unlikely? It just doesn’t happen. No matter how much we talk and interact, it hardly ever goes wrong. I will take the rare exceptions as proving the rule.

So, I set out to see if I could test this in a lab. How fast can people make judgements about the status of a movement? I used sign language signs and fidgeting, and told people to press a button as soon as they saw a sign. And I found people could do that very well and very fast. Even non-signers could do it. (In case you want to read more: the journal Gesture recently accepted my paper reporting these results, hooray!)

If you want you can repeat the experiment in real life whenever you (and a friend) watch a conversation. Just put up your finger as soon as you see the talking people make a gesture. I bet you will both skip the fidgeting and spot the gestures.

Now, imagine a gesture-recognizing computer trying to do the same trick and ignore fidgeting. Currently, computers that are programmed to recognize gestures simply assume that any movement is a gesture candidate, and will try to classify it against their vocabulary.
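To make the problem concrete, here is a toy sketch of such a naive recognizer. The feature vectors and gesture names are entirely made up for illustration (a real system would use richer motion features and statistical models); the point is only that every movement, fidgeting included, is forced into the nearest vocabulary entry.

```python
import math

# Hypothetical gesture vocabulary: each gesture is reduced to a toy
# feature vector (say: average hand speed, path length, hand height).
VOCABULARY = {
    "wave":      [0.8, 1.2, 1.5],
    "point":     [0.3, 0.5, 1.4],
    "thumbs_up": [0.1, 0.2, 1.1],
}

def distance(a, b):
    # Plain Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(movement):
    # The naive recognizer: EVERY movement gets the label of the
    # closest vocabulary entry, with no way to answer "none of these".
    return min(VOCABULARY, key=lambda g: distance(VOCABULARY[g], movement))

# A scratch of the nose (fidgeting) still comes out as a gesture:
print(classify([0.15, 0.25, 1.0]))  # → "thumbs_up"
```

Whatever the movement, the recognizer always answers with some gesture from its vocabulary; it has no notion of "that was just fidgeting".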

In speech recognition one might see a similar problem. People say things like “ehm” or “ehr..” during an utterance. They may also cough, sneeze or clear their throat. But is that really comparable to fidgeting? I am tempted to think that they are quite different. Coughing or sneezing is a bodily function, whereas fidgeting is usually just a ritualized, watered-down version of some bodily function, if any. The reason behind it is quite different. Saying “ehm” is mostly a way to fill the gap, or keep the floor, in a poorly planned utterance. It is in a way as much a deliberate part of the communication as the words used.

Nevertheless, the computer’s task is more or less the same: it must withstand the disruptions and continue recognizing the words (or gestures) as if nothing happened. Both “ehm” and fidgeting should be ignored without damaging other processes. And that is quite a challenge as it is.

In speech recognition several techniques have been invented to cope with “ehm” and out-of-vocabulary (OOV) words. The most important are ‘word spotting’ and ‘filler and garbage models’. Perhaps gesture recognition would do well to have a closer look at those techniques to start safely ignoring fidgeting?
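The garbage-model idea could be sketched in a few lines. Assume, purely for illustration, that gestures are represented as toy feature vectors and scored by similarity; the “garbage model” is then a competing score that any real gesture must beat (in speech recognition this would be a model trained on fillers and OOV material, not a constant).

```python
import math

# Toy gesture templates (hypothetical features: average speed,
# path length, hand height). Real systems use statistical models
# over video or motion-capture data; this is only an illustration.
TEMPLATES = {
    "wave":      [0.8, 1.2, 1.5],
    "point":     [0.3, 0.5, 1.4],
    "thumbs_up": [0.1, 0.2, 1.1],
}

def score(template, movement):
    # Higher is better: a crude similarity (negative Euclidean distance).
    d = math.sqrt(sum((t - m) ** 2 for t, m in zip(template, movement)))
    return -d

# The "garbage model": a fixed competing score. Any movement whose best
# gesture score does not beat this is rejected as non-gesture. The value
# is made up here; in practice it would be a trained model or tuned threshold.
GARBAGE_SCORE = -0.3

def spot_gesture(movement):
    best = max(TEMPLATES, key=lambda g: score(TEMPLATES[g], movement))
    if score(TEMPLATES[best], movement) <= GARBAGE_SCORE:
        return None  # the garbage model wins: treat as fidgeting
    return best

print(spot_gesture([0.8, 1.2, 1.5]))  # → "wave"
print(spot_gesture([0.5, 0.4, 0.6]))  # → None (rejected as fidgeting)
```

The key difference from the naive approach is that “none of the above” is now a possible answer, which is exactly what ignoring fidgeting requires.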