Sunday, October 2, 2022

Coding computers with sign language

I am one of those people who searches slightly outside the parameters of the near-term actual with an eye toward the long-term feasible, out of innovation and curiosity. I'm not a futurist, but a probable-ist, looking for ways to leverage the technologies and tools at our fingertips today to reach the adjacent opportunities they make possible. At any given moment, millions of people are thinking about how to apply a specific technology in novel ways and push its capabilities toward exciting new utilities. We often invent the same things using different techniques, the way eyes and wings evolved along separate paths in nature, a phenomenon called convergent evolution. I remember going to a Google developer event in 2010 and hearing the company announce a product that matched my company's initiative down to every granular detail. At the time I wondered if someone in my company had jumped the fence. But I then realized that our problems and challenges are common. It's only our approaches to them and the resources we have that are unique.

When I embarked on app development around the launch of the iPhone, I knew we were in a massive paradigm shift. I became captivated by the potential of using camera interfaces as inputs to control the actions of computers. We use web cameras to send messages person to person over the web. But we could also communicate commands directly to software if we added an interpretive layer that translates what the camera sees into something the computer can act upon.

This fascination with the potential future started when I was working with the first release of the iPad. My developer friends were toying around with what we could do to extend the utility of the new device beyond the bundled apps. At the time, I used a Bluetooth keyboard to type, because speech APIs were crude and did not yet interface well with the new device, and because the on-screen keyboard was difficult to use. One pesky thing I realized was that there was no mouse to communicate with the device. Apple permitted keyboards to pair, but it didn't support pairing a Bluetooth mouse. Every time I had to place the cursor, I had to touch the iPad, and it would flop over unless I took it in my hands.

I wanted to use it as an abstracted interface, and I didn't like the idea that the screen I was meant to read through would collect fingerprints unless I bought a stylus to touch it with. I was acting in an old-school way, wanting to port my past computer interaction model to a new device, while Apple wanted the iPad to be a tactile device and was seeking to shift user expectations. I wanted my device to adapt to me rather than having me adapt to it. "Why can't I just gesture to the camera instead of touching the screen?" I wondered.

People say necessity is the mother of invention. I often think that impatience has sired as many inventions as necessity. In 2010 I started going to developer events to scope out use cases of real-time camera input. This kind of thing is now referred to as "augmented reality," where a computer overlays some aspect of our interaction with the world outside the computer itself. At one of these events, I met an inspirational computer vision engineer named Nicola Rohrseitz. I told him my thought that we should have a touchless mouse input for devices that had a camera. He was thinking along the same lines. His wife played stringed instruments. Viola and cello players have trouble turning pages of sheet music or touching the screen of an iPad because their hands are full as they play! So gesturing with a foot or a wave was easier. A gesture could be captured by tracking the motion rendered as light and color shifts across pixel locations on the camera chip. He was able to track those shifts in pixel color locally on the device and render them as input to an action on the iPad. He wasn't tracking the hand or foot directly; he was analyzing the images after they were written into random access memory (RAM). By doing this on the device, without sending the camera data to a web server, you avoid the privacy risks of a remote connection. With the iPad interpreting what it was seeing, it could treat the input as a command and turn the page of sheet music on his wife's iPad. He built an app to achieve this for his wife's use. And she was happy. But it had much broader implications for other actions.
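
To make the idea concrete, here is a minimal sketch of that kind of on-device frame differencing, written in Swift. It is not Nicola's implementation; the buffer format, the thresholds, and the `PageTurnDetector` name are assumptions for illustration. The point is that a gross motion can be reduced to a command using nothing but consecutive frames held in memory.

```swift
import Foundation

/// Toy on-device motion detector: compares the brightness (luma) of consecutive
/// grayscale frames and fires a page-turn command when enough pixels change on
/// one side of the frame. Names and thresholds are illustrative, not production values.
struct PageTurnDetector {
    let width: Int
    let height: Int
    var previousLuma: [UInt8]? = nil
    let pixelThreshold = 25          // per-pixel brightness change that counts as motion
    let coverageThreshold = 0.15     // fraction of a half-frame that must change

    enum Gesture { case previousPage, nextPage }

    /// Feed one grayscale frame (width * height bytes). Frames stay in RAM; nothing is uploaded.
    mutating func process(frame luma: [UInt8]) -> Gesture? {
        defer { previousLuma = luma }
        guard let previous = previousLuma, previous.count == luma.count else { return nil }

        var changedLeft = 0, changedRight = 0
        for y in 0..<height {
            for x in 0..<width {
                let i = y * width + x
                if abs(Int(luma[i]) - Int(previous[i])) > pixelThreshold {
                    if x < width / 2 { changedLeft += 1 } else { changedRight += 1 }
                }
            }
        }

        let halfPixels = Double(width * height) / 2.0
        if Double(changedRight) / halfPixels > coverageThreshold { return .nextPage }
        if Double(changedLeft) / halfPixels > coverageThreshold { return .previousPage }
        return nil
    }
}
```

In the real app the detector would, of course, be fed by the live camera stream; the sketch only shows the shape of the logic that turns raw pixel change into a command.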

What else, we wondered, could be done with signals "interpreted" from the camera beyond hand waves? Sign language was the obvious one. We realized that the challenge was too complex at the time because sign language isn't static shape capture. Though the ASL alphabet consists largely of static hand shapes, most linguistic concept signs involve a hand position shifting in a certain direction over a period of time. We couldn't have achieved this without first achieving figure/ground isolation, and the iPad camera at that time had no means of depth perception. Now, a decade later, Apple has introduced HEIC image capture (a more advanced image format than JPEG) that can store LiDAR depth information as additional layers of the image, much like the multiple layers of a Photoshop file.
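
To show why a moving sign is harder than a static shape, consider representing a sign as a trajectory of hand positions over time and scoring it against stored reference trajectories, for example with dynamic time warping. This is a toy sketch under assumed data structures, not a description of any shipping recognizer.

```swift
import Foundation

/// A hand sample at one moment: a normalized 2D position (0...1 within the camera frame).
struct HandSample { let x: Double; let y: Double }

/// Distance between two samples.
func distance(_ a: HandSample, _ b: HandSample) -> Double {
    let dx = a.x - b.x, dy = a.y - b.y
    return (dx * dx + dy * dy).squareRoot()
}

/// Classic dynamic time warping: aligns two trajectories of possibly different lengths
/// and returns the total cost of the best alignment. A lower cost means the observed
/// motion more closely matches the reference sign.
func dtwCost(_ observed: [HandSample], _ reference: [HandSample]) -> Double {
    let n = observed.count, m = reference.count
    guard n > 0, m > 0 else { return .infinity }
    var cost = Array(repeating: Array(repeating: Double.infinity, count: m + 1), count: n + 1)
    cost[0][0] = 0
    for i in 1...n {
        for j in 1...m {
            let d = distance(observed[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
        }
    }
    return cost[n][m]
}

// Usage sketch: pick whichever stored reference trajectory yields the lowest cost.
// let best = references.min { dtwCost(observed, $0.trajectory) < dtwCost(observed, $1.trajectory) }
```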

Because we didn't have figure/ground isolation, Nicola created a generic gesture motion detection utility, which we applied to let users play video games on a paired device using hand motions and tilting rather than pushing buttons on a screen. We decided it would be fun to adapt the tools for distribution with game developers. Alas, we were too early with this particular initiative. I pitched the concept to one of the game studios in the San Francisco Bay Area. While they said the gameplay concept looked fun, they politely noted that there would have to be a lot more mobile gamers before there was demand among those gamers to play in a further augmented way with gesture capture. The iPad had only recently come out. There just wasn't a significant market for our companion app yet.

Early attempts to infer machine models of human or vehicle motion would overlay an assumed body shape on a perceived entity in the camera's view. In a video feed from a driving car, the model might infer that every moving object in the field of view is another car. (So being a pedestrian or cyclist near self-driving cars became risky, as the seeing system's assumptions about objects and behavior predicted different behavior than pedestrians and bicyclists actually exhibited.) On a conference expo floor, most of what the camera sees is likely to be people, not cars. So the algorithm can be set to infer body position, represented by assumed skeletal overlays of legs attached to bodies and presumed eyes atop them. The program pictured below was intended for shop windows, to notice when someone was captivated by the items on display. For people near the camera, eyes and arm positions were projected accurately; for people farther away, less so. (The Consumer Electronics Show demo did not capture any photographs of the people moving in front of the camera. I captured that separately with my camera.)

Over the ensuing years, other exciting advancements brought the capture of hand gestures to the mainstream. With the emergence of VR developer platforms, the need for alternate input methods became even more critical than in the early tablet days. Once users were wearing head-mounted displays (HMDs) and glasses, it became quite obvious that conventional input methods like keyboard and mouse were going to be too cumbersome to render in the display view. So rather than trying to simulate a mouse and keyboard in the display, a team of developers at LeapMotion took the approach of using an infrared camera to detect hand position and then infer the knuckle and joint positions of the hands. Those inferred positions could in turn be rendered as input to any operating system, letting the OS figure out what the hands were signaling it to do at the same time as they were projected into the head-mounted display. (Captured gestures could be mapped to commands for grabbing objects, calling up menu options, and so on.)

The views above are my hands detected by infrared from a camera sitting below the computer screen in front of me, then passed into the OS view on the screen, or into a VR HMD. The joint and knuckle positions are inferences based on a model inside the OS-hosted software. The disadvantage of LeapMotion was that it required setting up a dedicated infrared camera and working through some additional interfacing challenges between the OS and the program leveraging the input. But the good news was that OS and hardware developers noticed, and could pick up where LeapMotion left off to bring these app-specific benefits to all users of next-generation devices. Another five years of progress, and the application of the same technology in the Quest replaces the x-ray-style view of the earlier approach with something you can almost accept as the realistic presence of your own hands.

[Image: hand tracking rendered in the Quest headset, shown in the Spatial app]
HoloLens and Quest thereafter merged the formerly external hardware camera into the HMD itself, facing forward. The device could then send gesture commands from the camera inputs to all native applications, obviating the need for app developers to toil with the interpretive layer of joint detection inside their own programs. On the Quest platform, app developer adoption of those inputs is slow at present. But for those apps that do support it, you can use the "Hands API" to navigate main menu options and high-level app selection. A few apps like Spatial.io (pictured above) take the input method of the Hands API and allow the inferred hand position to replace the role formerly filled by hardware controllers for Spatial content and movement actions. Because Spatial is a hosted virtual world platform, the Hands API offers the user a way to navigate within the 3D space through more direct hand signals, as sketched below. This lets the user operate in the environment with their hands in a way resembling digital semaphore. Like Spider-Man's web-casting wrist gesture, a certain motion will teleport the user to a different coordinate in the virtually depicted 3D space. Pinching the fingers brings up command menus; hovering over an option and letting go of the pinch selects the desired command. The entire menu of the Spatial app can be navigated with hand signals, much like the futuristic computer interfaces in Spielberg's film Minority Report. It takes a bit of confused experimentation before the user's neuroplasticity rewires their understanding of the new input method. (In the same way, learning the abstract motions of a mouse cursor or gamepad controls requires a short acclimatization period.)
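
As an illustration of how a pinch-to-select interaction can be derived from tracked fingertips, here is a small state machine in Swift. It is not Meta's Hands API; the `TrackedHand` type, the distance threshold, and the hover handling are assumptions made for the sketch.

```swift
import simd

/// One frame of tracked hand joints (3D positions in meters), standing in for
/// whatever joint data a hand-tracking runtime provides. Not Meta's Hands API.
struct TrackedHand {
    var thumbTip: SIMD3<Float>
    var indexTip: SIMD3<Float>
}

/// Simple pinch state machine: a pinch begins when the thumb and index fingertips
/// come within a small threshold, and a "select" fires when the pinch is released
/// while the user is hovering over a menu option. Threshold values are illustrative.
final class PinchSelector {
    private var isPinching = false
    private let pinchDistance: Float = 0.02   // ~2 cm, assumed units

    /// Returns the hovered option exactly once, at the moment the pinch is released.
    func update(hand: TrackedHand, hoveredOption: String?) -> String? {
        let d = simd_distance(hand.thumbTip, hand.indexTip)
        if !isPinching && d < pinchDistance {
            isPinching = true              // pinch started: show the command menu
        } else if isPinching && d > pinchDistance * 1.5 {
            isPinching = false             // pinch released: commit the hovered option
            return hoveredOption
        }
        return nil
    }
}
```

Releasing at 1.5x the pinch distance is a hysteresis trick to keep the gesture from flickering on and off right at the threshold.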
 
This is a great advancement for the minority of people reported to be putting HMDs on their heads to use their computers. But what about the rest of us who don't want visors on our noggins? For those users too, we can anticipate computer input from our motions in front of the machine of our choice very soon. Already, the user-facing camera in iOS devices detects the full facial structure of the user. The depth vision of that camera mirrors the shape of our facial features precisely enough to be used in much the same way that old skeleton keys precisely matched the internal workings of bolt locks. Matching the precise shape of your face, plus detecting that your pupils are looking at the screen, is a trustworthy indication that you are awake and presently expecting your phone to awaken as well. Pointing my camera at a photo of me doesn't unlock the phone, nor would someone pointing my phone at me while I'm not looking at it. As a fun demonstration of this capability, "Memoji" allow you to enliven a cartoon avatar of your selection with CGI animation that mirrors your facial gestures. Filmmakers have previously used body and facial tracking to enable such animation for films including Lord of the Rings and Planet of the Apes. Now everybody can do the same thing with position-mirroring models hosted on their phones.

The next great leap of utility for cross-computer communication, as well as computer programming, will be enabling the understanding of human communication beyond what our faces and mouths express. People on video conferences already use body language and gesture through the digital pipelines of our web cameras. Might gestural interactions be brought to all computers, allowing us to convey intent and meaning to the OS as command inputs?

At a recent worldwide developer convention, Apple engineers demonstrated a concept that uses machine pattern recognition to deliver gestural input commands to the operating system, extending and expanding the approach pioneered with the infrared camera technique. Apple's approach uses a set of training images stored locally on the device to infer input meaning. The method of barcode and symbol recognition with the Vision API pairs a camera-matched input with a reference database. The matching database can of course be a web query against a large external database. But for a relatively small set of linguistic pattern symbols, such as American Sign Language, a collection of reference gestures can be hosted in device memory and paired with the meaning the user intends to convey, allowing immediate local interpretation without a call to an external web server. (This is beneficial for security and privacy reasons.)
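
Here is a hedged sketch of what that local matching could look like with Apple's Vision framework: detect a hand pose, reduce it to fingertip positions relative to the wrist, and pick the nearest match from a small gesture list held in memory. The `ReferencePose` structure, the confidence cutoffs, and the nearest-neighbor matching are my own illustrative choices, not Apple's demonstrated implementation.

```swift
import Vision
import CoreGraphics

/// A stored reference pose: normalized fingertip positions relative to the wrist.
/// The gesture names and this tiny in-memory "database" are hypothetical placeholders.
struct ReferencePose {
    let name: String
    let fingertipOffsets: [CGPoint]   // thumb, index, middle, ring, little
}

/// Extract fingertip positions (relative to the wrist) from one hand observation.
func fingertipOffsets(from hand: VNHumanHandPoseObservation) throws -> [CGPoint]? {
    let points = try hand.recognizedPoints(.all)
    guard let wrist = points[.wrist], wrist.confidence > 0.3 else { return nil }
    let tips: [VNHumanHandPoseObservation.JointName] = [.thumbTip, .indexTip, .middleTip, .ringTip, .littleTip]
    var offsets: [CGPoint] = []
    for tip in tips {
        guard let p = points[tip], p.confidence > 0.3 else { return nil }
        offsets.append(CGPoint(x: p.location.x - wrist.location.x,
                               y: p.location.y - wrist.location.y))
    }
    return offsets
}

/// Nearest-neighbor match against locally stored reference poses.
/// Everything happens in memory; no camera data leaves the device.
func classify(offsets: [CGPoint], against references: [ReferencePose]) -> String? {
    func cost(_ a: [CGPoint], _ b: [CGPoint]) -> CGFloat {
        zip(a, b).reduce(CGFloat(0)) { sum, pair in
            let dx = pair.0.x - pair.1.x
            let dy = pair.0.y - pair.1.y
            return sum + (dx * dx + dy * dy).squareRoot()
        }
    }
    return references.min { cost(offsets, $0.fingertipOffsets) < cost(offsets, $1.fingertipOffsets) }?.name
}

/// Run the Vision hand-pose request on one camera frame and name the closest gesture.
func detectGesture(in pixelBuffer: CVPixelBuffer, references: [ReferencePose]) throws -> String? {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 1
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up, options: [:])
    try handler.perform([request])
    guard let hand = request.results?.first,
          let offsets = try fingertipOffsets(from: hand) else { return nil }
    return classify(offsets: offsets, against: references)
}
```

Because the reference poses live in the app bundle or in memory, no frame ever needs to leave the device, which is the privacy property described above.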

In Apple's demonstration below, Geppy Parziale uses the embedded computer vision capability of the operating system to isolate the motion of two hands separately from the face and body. In this example he tracked the gesture of his right hand separately from his left hand, which is making the sign for "2." Now that mobile phones have figure/ground isolation and the ability to segment portions of the input image, enormously complex gestural sign-language semiotics can be handled in the ways Nicola and I envisioned a decade prior. The rudiments of interpretation via camera input can now represent the shift of meaning over time that forms the semiotics of complex human gestural expression.

[Image: Apple's demonstration of detecting two hands independently, one signing "2"]
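
Tracking the two hands independently, as in the demonstration, comes down to asking the same Vision request for more than one hand and, on recent OS releases, reading each observation's chirality. A brief sketch; the printed output is just placeholder handling.

```swift
import Vision
import CoreGraphics

/// Detect up to two hands in a frame and report which is the left and which is the right.
/// Assumes an OS recent enough to expose `chirality` on the hand-pose observation.
func detectBothHands(in pixelBuffer: CVPixelBuffer) throws {
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 2   // track both hands independently

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up, options: [:])
    try handler.perform([request])

    for hand in request.results ?? [] {
        let wrist = try hand.recognizedPoints(.all)[.wrist]
        switch hand.chirality {
        case .left:
            print("Left hand, wrist at \(wrist?.location ?? .zero)")
        case .right:
            print("Right hand, wrist at \(wrist?.location ?? .zero)")
        default:
            print("Hand of unknown chirality")
        }
    }
}
```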

I remember in high school going to my public library and plugging myself into a computer, via a QWERTY keyboard, to try to learn the language that computers expect us to comprehend. But with these fascinating new transitions in our technology, future generations may be able to "speak human" and "gesture human" to computers instead of spending years of their lives adapting to them!

My gratitude, kudos and hats off to all the diligent engineers and investors who are contributing to this new capability in our technical platforms.

 
