
UX thinking for hand gestures in XR experiences

source link: https://uxdesign.cc/ux-thinking-for-hand-gestures-in-xr-experiences-7363ce4e2cd3

Timothy Leary, John Anderton, and Tony Stark.

In 2002 I was enrolled in the MS program in Human Computer Interaction at DePaul University in Chicago. I was a developer back then, but my eyes were opening to UX design and research: how we could map the things people did naturally and make their digital experiences better.

And then I saw the movie Minority Report. One scene, in particular, rocked my UX universe:

I’d been introduced to Extended Reality (XR) and manipulating digital objects in physical space as an undergrad waaaaay back in the early 90s by none other than Timothy Leary. He spoke about new input devices, moving well beyond the screen, and exploring new information spaces. My mouth fell open; my heartbeat leapt. Later we got to don very clunky, counter-weighted head-mounted displays and slip into haptic gloves. We tossed a virtual baseball to one another, seeing it and feeling it, although it didn’t exist in the real world.

This is going to change the world, Undergrad Pete thought.

It didn’t, at least not in the near term. But the scene from Minority Report reactivated impressions I’d had a decade earlier with Dr. Leary, and after seeing Tom Cruise’s character John Anderton move through dense information spaces like a maestro conducting an orchestra (complete with the classical music in the background; kudos, Mr. Spielberg), I began dreaming about how the UX principles I was learning for the flat-rectangle world of monitors would evolve.

This stuff is going to change the world, Grad Student Pete thought.

Of course, the technology still wasn’t where it needed to be, at least not out in the world I lived in, but there were signs that it was coming of age. A decade later Oculus came to market and we saw Tony Stark in the Iron Man movies dramatically interact with AR objects as he solved problems with Jarvis. Now in 2021 it seems we’re not all that far from seeing these objects in our world, and using our hand gestures to manipulate them. Check out the hand gestures:

Right now, it’s still the Wild West out there, from a UX point of view. Different organizations — mostly in games — are designing their own gesture vocabularies to work with specific XR devices in isolated contexts: head-mounted displays (HMDs), or just a phone or tablet you hold in front of your face as you point to objects and tap.

Today users actively interface with and control objects in an XR environment differently, depending on context or technical solution. Options include handheld controllers, possibly in combination with spoken audio cues and eye-tracking. Additional physical bodily movements such as jumping, moving, and posing can also be used to interface with an XR environment.

Sometimes, like Tony Stark, you can just use your hands.

These on-screen examples are not that far away from reality. For his role in 2002, Tom Cruise learned an existing gesture vocabulary developed at MIT called G-Speak for navigating complex information spaces. And as of today, we don’t need gloves; modern sensing and processing technology can track bare hands well enough for most gestures. Today you can use gestures to control the radio in your BMW 7-series vehicle.

As it turns out, the user’s own hands and a vocabulary of gestures can be the best choice for a primary XR interface. In contexts such as viewing AR through a device like a phone, tablet, or a pair of glasses, controllers aren’t present or practical. And to put it plainly: most people already come equipped with hands. They’re designed for interacting with an environment. They don’t need charging, and people for the most part natively understand how they operate, decreasing many potential points of complexity and failure.

While there are existing gesture vocabularies (like G-Speak), it still feels like uncharted territory wherever XR applications are being developed. As of this writing, nothing approaching a standard gesture vocabulary exists, and many creators of XR solutions that allow gestures simply create their vocabulary from scratch. Today, one device (or one single game played on one device) can have a very different set of mapped gestures than the next.

This is not optimal, certainly. Tony Stark would disapprove.

But while we don’t have a standard vocabulary for hand gestures in an XR environment, practitioners and researchers are developing best practices that XR interaction designers can use to create easy, intuitive, memorable gestures. Even if the gestures themselves aren’t standardized.

In exploring different gesture guidelines and the UX of XR, I want to summarize what I’ve absorbed and suggest a few heuristics of my own that follow logically from my experience and insights working as a UX designer and researcher for the past 20 years.

If you’re creating a gesture vocabulary, you could use the methodology that Piumsomboon et al. described in their research, “User-defined gestures for augmented reality.” This would involve working with representative users as participants, documenting your context’s needed interactions, and gathering consensus gestures for these interactions from your participants. Then you could isolate consensus gestures and develop a usable vocabulary, increasing usability by applying the following best practices.
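To make the consensus step a bit more concrete, here’s a minimal Python sketch, my own illustration rather than code from the paper, of the agreement score commonly used in gesture-elicitation studies: for each referent (an action that needs a gesture), group the identical proposals and sum the squared proportion of each group. The referents and gesture labels here are invented.

```python
from collections import Counter

def agreement_score(proposals):
    """Agreement score for one referent, given each participant's
    proposed gesture (as a label). 1.0 means everyone agreed;
    values near 1/len(proposals) mean no consensus."""
    total = len(proposals)
    groups = Counter(proposals)          # identical proposals form a group
    return sum((size / total) ** 2 for size in groups.values())

# Hypothetical elicitation data: referent -> gestures proposed by participants
elicited = {
    "select object":  ["air tap", "air tap", "point + pinch", "air tap"],
    "delete object":  ["swipe away", "crumple", "swipe away", "throw over shoulder"],
    "enlarge object": ["two-hand stretch", "two-hand stretch", "two-hand stretch", "spread fingers"],
}

for referent, gestures in elicited.items():
    score = agreement_score(gestures)
    winner, _ = Counter(gestures).most_common(1)[0]
    print(f"{referent:15s}  agreement={score:.2f}  consensus gesture: {winner}")
```

Referents with low agreement are the ones that deserve extra design attention and testing.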

The UX ante for any complex design effort

UX in the XR space (heh) is still murky, largely uncharted water, and always complex, so it’s even more important to make sure you’re proceeding from a strong foundation when you define the gestures in your vocabulary. These guidelines apply in all situations, regardless of context, devices, or location on the XR spectrum.

Do user research and testing. Context matters, and relying on your own experience to inform UX strategy is risky. Only using convention and similar UX in your own context — your own game, simulation, or the place your gestures will be active — is also risky. It’s easy to introduce your own biases and also to miss pain points. I’ve run hundreds of usability tests and I’m not sure I’ve had one where someone untrained and unbiased didn’t say something that made me think “That’s genius; why are they even paying me when I didn’t think of that?” Get your device(s) and scenarios in front of people. Even if you have to mock everything up with paper. Is your XR experience dependent on a mobile device as the lens? If you don’t have a working prototype, use a block of wood the size of a phone and run people through your common tasks.

Use affordances and signifiers. These are just as important in an XR context, if not more important. What can be done? What can be pressed? Providing a real-world analog (“It looks like a button, right?”) might not be enough. Make sure people know they can push it.

System feedback is imperative. When they push that button, pull that lever, or select that object, give the user feedback that it’s happened. The best situation is that the user knows their gesture is being tracked, that it’s committing the action, and that the system has received the action. The cues can be visual, audio, haptic, or a combination. In even a partially immersed environment these feedback cues are especially important, as users will be very attuned to any signal that their action is correct or working.
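As a rough illustration of that three-stage feedback, here’s a small hypothetical Python sketch; the state names and cue hooks are mine, not any particular SDK’s API. The point is simply that the user gets a distinct, consistent cue when their hand is tracked, when a gesture is in progress, and when the action has been committed.

```python
from enum import Enum, auto

class GestureState(Enum):
    IDLE = auto()         # hand not tracked
    TRACKED = auto()      # hand visible to the sensors
    IN_PROGRESS = auto()  # gesture started but not yet committed
    COMMITTED = auto()    # gesture recognized, action fired

# Illustrative cue hooks; in a real app these would call your
# rendering, audio, and haptics layers.
CUES = {
    GestureState.TRACKED:     lambda: print("visual: highlight hand cursor"),
    GestureState.IN_PROGRESS: lambda: print("visual: fill progress ring"),
    GestureState.COMMITTED:   lambda: print("audio: click  |  haptic: short pulse"),
}

def on_state_change(old: GestureState, new: GestureState) -> None:
    """Fire a consistent cue every time the gesture state advances."""
    if new != old and new in CUES:
        CUES[new]()

# Example: a press gesture moving through its states
states = [GestureState.IDLE, GestureState.TRACKED,
          GestureState.IN_PROGRESS, GestureState.COMMITTED]
for prev, cur in zip(states, states[1:]):
    on_state_change(prev, cur)
```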

All cues in your context should be consistent. Every time a user presses a button or does a thing, how to do it and what happens when they do it should be consistent across your experience. Pressing a button, selecting, and undoing should look and feel the same throughout.

Have a sandbox. Gesture vocabularies will vary from one context/platform to another. Some gestures might carry over, some might be different, and others totally new. Give the user a space to learn and practice your context’s gestures to gain proficiency and comfort.

Onboarding is necessary. You can’t assume a user knows anything about your experience or your gesture vocabulary, and in an XR context disorientation is an order of magnitude worse than in a non-immersive experience. Present the onboarding experience by default; you can let the user choose to skip it or never see it again, but if they do, make it easily accessible at any point. Of course, onboarding and your sandbox should be completely accurate and up to date, and you should notify users in-system of any changes to these fundamental pieces.

Keep the user (physically) safe. You should do everything possible to keep the user safe, and not put them in a position to be injured or feel threatened in your context or by your gestures. Don’t ever enable striking the body as a gesture, don’t put an open palm in front of the eyes and obscure the view. Don’t spawn important elements behind them or behind hands or objects without great care, as they’re difficult to track. Don’t have an animated object sneak up on the user from behind. You may feel the intensity of such an experience will add to the enjoyment, but really a user will just mistrust all aspects of your experience. In specific isolated contexts it’s acceptable to create an environment of dread (as in a horror game), but at no time should this put a user in physical jeopardy.

Don’t be unduly exclusive. Don’t assume your users have five fingers, don’t wear glasses, don’t have arthritis, have a “normally” shaped body, are white, or are male. If your context or gestures (or avatars) do assume these things, allow accommodation insofar as it’s possible. This is at least as much art as it is science; no interface can be entirely inclusive, but take care and do the work to make sure yours isn’t unduly exclusive.

Good UX thinking for experiences using gestures

Keep these in mind when designing contexts where gestures are enabled.

Easy is better than complex. Make your gestures as simple as possible, and avoid complexity or finger-knitting in all but the most necessary cases. User testing (as above) will help you determine what “easy” means in your context. Keep gestures short and simple: one hand motion instead of a series of connected motions. Long and complicated gestures will almost always lead to a frustrating experience, especially for new users.

Repetitive, common gestures should be the easiest. If a gesture is important or used often in your context, work to make it one of the most effortless gestures to execute and easiest to be correctly read by your tech stack.

Provide an easy “undo” gesture. The easier and more natural your undo gesture is, the better users will feel about your experience.
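One lightweight way to back an undo gesture is a plain command history. This is a generic Python sketch, not tied to any engine; the scene dictionary and the “move cube” action are invented, and the undo gesture simply pops the most recent reversible action.

```python
class ActionHistory:
    """Minimal undo stack: every committed gesture action pushes a
    closure that knows how to reverse itself."""
    def __init__(self):
        self._undo_stack = []

    def commit(self, description, undo_fn):
        print(f"do: {description}")
        self._undo_stack.append((description, undo_fn))

    def undo(self):
        if not self._undo_stack:
            print("nothing to undo")
            return
        description, undo_fn = self._undo_stack.pop()
        print(f"undo: {description}")
        undo_fn()

history = ActionHistory()
scene = {"cube_position": (0, 0, 0)}

# A "move" gesture commits the change and how to reverse it.
old_pos, new_pos = scene["cube_position"], (1, 0, 0)
scene["cube_position"] = new_pos
history.commit("move cube", lambda: scene.update(cube_position=old_pos))

# The undo gesture (however it is mapped) just calls:
history.undo()
print(scene)  # cube is back at (0, 0, 0)
```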

Today’s tech doesn’t do well with fine interactions. Compared to a few years ago, today’s (late 2021) technology that captures and discerns hand gestures is amazing, but it’s still not great at consistently determining fine gestures or small movements, so don’t include such gestures in your vocabulary. No threading needles unless you’re confident your technical stack can handle it.
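One way to respect that limitation is to gate gestures on the recognizer’s confidence and on a minimum amount of movement. The sketch below is hypothetical Python that assumes your tracking layer reports a per-frame confidence value; the threshold numbers are placeholders you’d tune per device.

```python
CONFIDENCE_THRESHOLD = 0.85   # tune per device; fine gestures need higher values
MIN_TRAVEL_METERS = 0.03      # ignore movements the sensor can't reliably resolve

def accept_gesture(confidence: float, travel_m: float) -> bool:
    """Only act on gestures the hardware can report reliably."""
    return confidence >= CONFIDENCE_THRESHOLD and travel_m >= MIN_TRAVEL_METERS

print(accept_gesture(confidence=0.92, travel_m=0.10))  # True: big, clear motion
print(accept_gesture(confidence=0.92, travel_m=0.01))  # False: too fine to trust
print(accept_gesture(confidence=0.60, travel_m=0.10))  # False: recognizer unsure
```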

Leverage the real world. If a gesture in your context has a real-world analog, use this analog if at all possible. If your context includes using a hammer to drive nails or swinging a sword, it’s much better to copy the gesture and motion for someone doing these things in real life (within “easy” tolerances, see below) such as gripping and swinging. These more intuitive choices are better than something dissonant like snapping fingers or tapping palm-down. Be semantically relevant; being close in this situation is better than being original.

Make your gestures discrete and unique. If a gesture in your vocabulary is likely to be performed accidentally (it involves the natural closing of fingers, or hands coming to rest at the sides, for example), your experience will be jarring as people try to teach themselves to avoid natural gestures or poses. Don’t work against users’ natural tendencies. Also, gestures should be distinct from one another, not easily mistaken for each other.
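A common safeguard against accidental activation is to require a pose to be held for a short dwell time before it commits. Here’s a small illustrative Python sketch; the pose labels and timings are made up.

```python
DWELL_SECONDS = 0.4  # pose must be held this long before it counts

def detect_committed_gestures(frames, dwell=DWELL_SECONDS):
    """frames: list of (timestamp_seconds, pose_label) samples.
    Returns poses held long enough to commit, filtering out poses the
    hand merely passed through (e.g. fingers closing as the arm drops)."""
    committed = []
    hold_start, current = None, None
    for t, pose in frames:
        if pose != current:
            current, hold_start = pose, t
        elif pose is not None and t - hold_start >= dwell and pose not in committed:
            committed.append(pose)
    return committed

frames = [(0.00, None), (0.10, "fist"), (0.20, "open_palm"),
          (0.30, "open_palm"), (0.50, "open_palm"), (0.80, "open_palm")]
print(detect_committed_gestures(frames))  # ['open_palm']; the brief fist never commits
```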

Minimize your vocabulary. Less is more; the fewer gestures a user needs to master in order to enjoy your context, the better. This might not be true for “expert use” situations, but for most users the effort of becoming fluent in a vast gesture vocabulary will drain the fun from your experience.

Don’t map the middle finger. Avoid using obscene gestures in your vocabulary, unless you’re very sure that’s what you’re going for. This sounds obvious, and most vocabularies won’t contain a discrete action for giving a middle-finger gesture, but the point here is to become aware of gestures that are obscene in other cultures and stay away from them. The irony is that many obscene gestures are simple and easy to remember.

Allow gesture mapping. Just as games commonly allow keymapping so that players used to other key vocabularies can quickly pick up a new game, allowing gesture mapping in your context can go a long way toward helping veteran XR users feel at ease.
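In code, gesture mapping can be as simple as keeping a user-editable table between gesture names and actions instead of hard-wiring gestures to behavior. A hypothetical Python sketch, with invented gesture and action names:

```python
# Default bindings shipped with the experience
DEFAULT_BINDINGS = {
    "air_tap": "select",
    "open_palm_push": "back",
    "two_hand_stretch": "scale",
}

class GestureMap:
    """User-remappable gesture -> action table, analogous to keymapping."""
    def __init__(self, defaults):
        self.bindings = dict(defaults)

    def remap(self, gesture, action):
        self.bindings[gesture] = action   # persist per user in a real app

    def action_for(self, gesture):
        return self.bindings.get(gesture)

gestures = GestureMap(DEFAULT_BINDINGS)
gestures.remap("air_tap", "open_context_menu")  # a veteran user's preference
print(gestures.action_for("air_tap"))           # open_context_menu
print(gestures.action_for("open_palm_push"))    # back
```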

Gesture-specific UX guiding principles

These guidelines describe how to better define usable, intuitive gestures.

Be mindful of user comfort. Gestures that are closer to the body and flow easily with the design of wrists, elbows, and shoulders work best. The Iron Man 2 clip above uses easy gestures very well: no large or strenuous movements for repetitive tasks, no holding for extended times, no full extension of the arms, and raising the arms above the shoulder is kept to a minimum. The “victory” or touchdown gesture he makes towards the end of the clip violates most of these ideas, but certainly adds to the scene’s dramatic effect. It might be okay to sidestep this guideline in certain contexts like an exercise game, but you should always do it purposefully.

Enforce easy manipulation of objects in a natural way. When possible, use the easiest gesture to manipulate an object along a natural axis or the plane it’s sitting on. If an object has a natural path of movement, that should be the default way it moves or the axis it travels. A ball should roll on the floor, not bounce, when it’s just touched. It should also move in the direction it was nudged.
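One way to express “moves along its natural axis” in code is to project the user’s hand motion onto the object’s allowed plane or axis before applying it. Here’s a small vector-math sketch in Python, with no particular engine assumed.

```python
def project_onto_plane(motion, plane_normal):
    """Project a 3D hand-motion vector onto a plane (e.g. the floor),
    so a nudged ball rolls along the floor instead of lifting off it."""
    dot = sum(m * n for m, n in zip(motion, plane_normal))
    norm_sq = sum(n * n for n in plane_normal)
    return tuple(m - dot / norm_sq * n for m, n in zip(motion, plane_normal))

floor_normal = (0.0, 1.0, 0.0)     # y is "up"
hand_motion = (0.20, 0.05, 0.10)   # user's nudge, slightly upward
print(project_onto_plane(hand_motion, floor_normal))  # (0.2, 0.0, 0.1): rolls, doesn't bounce
```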

Define gestures where they can be “seen.” Make sure your gestures occur well within the tracking volume/interaction zone of your platform’s technical capabilities. This zone might include areas beyond the user’s peripheral vision. This one also seems obvious, but you can save a lot of re-work time by making sure your zone is well defined in your context and understood by users.
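A simple way to keep gestures inside the tracking volume, and to tell the user when they drift out of it, is an explicit bounds check on the tracked hand position. The zone dimensions below are made-up placeholders for whatever your platform actually supports.

```python
# Illustrative interaction zone in meters, relative to the headset/camera.
ZONE_MIN = (-0.40, -0.30, 0.20)   # left, down, near
ZONE_MAX = ( 0.40,  0.30, 0.70)   # right, up, far

def hand_in_zone(hand_pos):
    """True if the tracked hand position lies inside the interaction zone."""
    return all(lo <= p <= hi for p, lo, hi in zip(hand_pos, ZONE_MIN, ZONE_MAX))

def check_hand(hand_pos):
    inside = hand_in_zone(hand_pos)
    if not inside:
        # Surface a gentle cue instead of silently dropping the gesture.
        print("cue: 'bring your hand back in front of you'")
    return inside

print(check_hand((0.10, 0.00, 0.45)))  # True: well inside the zone
print(check_hand((0.10, 0.00, 0.90)))  # False: too far away to track
```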

Conventional gestures are best. A gesture that’s already in your library, or even better one that’s in a bunch of other libraries and in common usage, is almost always better than an obscure or new gesture. The more convention you can leverage when defining your vocabulary, the easier it will be for users to transition to your context. When deviating from convention, your gestures should at least be simple and intuitive.

Single-handed gestures are best. Unless there is a very specific reason to use a two-handed gesture, one-handed is better. It’s simpler, easier to execute repetitively, and probably easier to map and discern. That being said, there are interactions where two-handed gestures are more intuitive or, all things considered, the better choice; enlarging or shrinking an object is one example.

Leverage tactile feedback when you can. When we’re talking about hand gestures, those that provide some sort of (self) tactile feedback can be easier to remember and provide a built-in mechanism that helps the user understand the gesture has been correctly performed. Examples include snapping fingers or tapping a wrist. Audio cues are also very helpful.

Intuitive gestures are best. Intuitive gestures leverage real-world interactions. Along with the hammer or sword discussed above, other easy examples are opening a door by turning a knob or holding a palm out to stop something approaching. This helps confer presence (how immersive your context feels) and embodiment (how much presence and control your user feels).

Encourage practical, direct manipulation of objects. When possible, direct manipulation of objects is best; picking up a hammer or a baseball is natural and easy to remember. Picking up an airplane, or using a hand-waving gesture to grasp a hammer, is not. This principle pairs well with the idea that gestures should be intuitive, as above.

Place and size objects appropriately. Each kind of object — an item to use, a character to interact with, or text to read, has a distance/size that will feel correct to the user. Placing objects outside their “comfort zone” should be a very intentional design decision.

Palm-open or -away gestures are easy to track. So these gestures (single hand or double hand) would be solid choices to include in your vocabulary, or better still mapped to an existing convention.

Be mindful of real-world gestures. If your user will be moving around in the real, public world while in your context, avoid gestures that have meaning or call for action in the real world. Examples would include waving someone over or shaking a fist; a user in an AR or MR context might attract unintended attention or consequences when using such gestures… a little like referring to your friend “Alexa” and unintentionally invoking Amazon’s voice device.

I’m a UX designer and researcher with over 20 years of total experience in the areas of e-commerce, healthcare, insurance, big data, retail, online identity, and community. I’m speaking for myself, not as a representative of any organization.

All of the clips I’ve used in this article are the property of their respective studios and are used for educational and descriptive purposes. Any imagery used is in the public domain, is correctly attributed, or is something I own a license to display.

I have never met John Anderton or Tony Stark. And I still feel that XR experiences are going to change the world.

