Teaching Computers to Recognize Human Actions in Videos

source link: https://towardsdatascience.com/teaching-computers-to-recognize-human-actions-in-videos-81b2e2d62768?gi=aebd7514f3d8

PREDICT and CLUSTER: Unsupervised Skeleton Based Action Recognition

By Eli Shlizerman

Photo by bruce mars on Unsplash

Identifying the various actions that people make with their bodies just from watching a video is a natural, simple task for humans. For example, most people would easily identify a subject as, say, “jumping back and forth” or “hitting a ball with their foot”. This remains easy even if the subject shown in the video changes or the footage is recorded from different views. What if we would like a computer system, or a gaming console like an Xbox or PlayStation, to be able to do the same? Would that be possible?

For an artificial system, this seemingly basic task is not as natural as it is for humans. It requires several layers of Artificial Intelligence capabilities, such as (i) knowing which specific features to track when making decisions, along with (ii) the ability to name, or label, a particular action.

With regard to (i), research in visual perception and computer vision has shown that, at least for the human body, 3D coordinates of the joints, i.e. skeleton features, are sufficient for identifying different actions. Additionally, current robust algorithms, e.g. OpenPose [1], are able to track these features in real time from nearly any video source.
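To make this concrete, here is a minimal sketch of what skeleton features look like as data once a tracker has extracted them: a clip becomes a time series of per-frame joint coordinates. The numbers below are random placeholders, not real pose estimates, and the joint count is just one common convention.

```python
import numpy as np

# Toy representation of skeleton features: a clip of T frames, each
# frame holding 3D coordinates for J body joints (as a pose tracker
# such as OpenPose would produce). Values are random placeholders.
T, J = 100, 25                        # frames, joints (25 is a common body model)
skeleton = np.random.rand(T, J, 3)    # (time, joint, xyz)

# Flatten each frame into one feature vector, so the clip becomes a
# multivariate time series of shape (T, 3*J) that a network can consume.
sequence = skeleton.reshape(T, J * 3)
print(sequence.shape)                 # prints (100, 75)
```

In this form, action recognition becomes a question about time series: which sequences of these low-dimensional vectors belong to the same action?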

Photo by Sam Sabourin on Unsplash with skeleton features marked

Teaching a computer system to make predictive associations between collections of points and actions using these features turns out to be a much more challenging task than selecting the features themselves. This is because the system is expected to group sequences of features into “classes” and subsequently associate these with the names of the corresponding actions.

Skeleton-based action recognition: predictive association between collections of points (time series) and actions

Existing deep learning systems try to learn this type of association through a process called ‘supervised learning’, where the system learns from many given examples, each annotated with the action it represents. This technique also requires camera and depth inputs (RGB+D) at each step. While supervised action recognition has shown promising advances, it relies on annotating a large number of sequences, and the annotation needs to be redone each time another subject, viewpoint, or new action is considered.
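The following sketch illustrates why supervision is expensive: a supervised recognizer simply cannot be built without a label attached to every training clip. The fabricated data, the per-clip mean feature, and the nearest-centroid classifier are all toy stand-ins, not the deep networks used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised setup: each labelled clip is reduced to a fixed-length
# feature vector (here, the mean over frames of the flattened joints).
def clip_feature(clip):                  # clip: (T, D) time series
    return clip.mean(axis=0)

# Fabricated training data: two "actions" with different joint statistics.
# Note that every training example must carry a human-provided label.
train = [(rng.normal(0.0, 1, (50, 75)), "kick") for _ in range(10)] + \
        [(rng.normal(3.0, 1, (50, 75)), "jump") for _ in range(10)]

# Supervised learning step: labels are required to build the model
# (a nearest-class-centroid classifier standing in for a deep net).
centroids = {}
for label in {"kick", "jump"}:
    feats = [clip_feature(c) for c, l in train if l == label]
    centroids[label] = np.mean(feats, axis=0)

def predict(clip):
    f = clip_feature(clip)
    return min(centroids, key=lambda l: np.linalg.norm(f - centroids[l]))

print(predict(rng.normal(3.0, 1, (50, 75))))   # prints "jump"
```

Adding a new action, subject, or viewpoint means collecting and labelling a fresh batch of examples before the model can be rebuilt, which is exactly the cost an unsupervised approach avoids.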

Photo by Raymond Rasmusson on Unsplash

It is of particular interest, then, to instead create systems that imitate the perceptual ability of humans by learning to make these associations in an unsupervised way.

In our recent research, entitled “Predict & Cluster: Unsupervised skeleton based action recognition” [2], we developed such an unsupervised system. We proposed that, rather than teaching the computer to catalog the sequences with their actions, the system should instead learn to predict the sequences through ‘encoder-decoder’ learning. Such a system is fully unsupervised: it operates on the input sequences alone and does not require labelling of actions at any stage.

Predict & Cluster: Unsupervised Skeleton Based Action Recognition

In particular, the encoder-decoder neural network learns to encode each sequence into a code from which the decoder regenerates exactly the same sequence. It turns out that in the process of learning to encode and then to decode, the Seq2Seq deep neural network self-organizes the sequences into distinct clusters. We developed a way to ensure that learning creates such an organization (by fixing the weights or the states of the decoder), and we developed tools to read this organization and associate each cluster with an action.
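The mechanism can be sketched in a few lines of NumPy. The toy below keeps the spirit of fixed-weight training — only the encoder is updated, while a frozen random decoder must reconstruct each sequence — but it swaps the paper's recurrent Seq2Seq networks for a linear model on fabricated data. It is an illustration of the idea, not the actual method: with the decoder frozen, the encoder is forced to organize the codes, and similar sequences end up with similar codes that can be grouped without any labels.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "fixed decoder" autoencoder: the encoder maps a whole sequence to
# a code, a decoder with FROZEN weights reconstructs the sequence, and
# only the encoder is trained.
T, D, K = 40, 12, 4                        # frames, feature dim, code dim

def make_clip(offset):                     # fabricated clips of two "actions"
    return rng.normal(offset, 0.3, (T, D))

clips = [make_clip(0.0) for _ in range(8)] + [make_clip(2.0) for _ in range(8)]

# Decoder weights: random, orthonormalized for stability, never updated.
W_dec = np.linalg.qr(rng.normal(size=(D, K)))[0]
W_enc = rng.normal(size=(K, D)) * 0.01     # encoder: the only trained weights

def encode(x):                             # code = time-averaged projection
    return (x @ W_enc.T).mean(axis=0)

lr = 0.01
for _ in range(300):                       # SGD on the reconstruction error
    for x in clips:
        z = encode(x)
        err = np.tile(z @ W_dec.T, (T, 1)) - x   # reconstruction residual
        grad_z = err.mean(axis=0) @ W_dec        # dL/dz for mean-squared loss
        W_enc -= lr * np.outer(grad_z, x.mean(axis=0))  # chain rule to W_enc

# The learned codes self-organize by action; read them out with a
# label-free nearest-seed assignment (a stand-in for k-means / KNN).
codes = np.array([encode(x) for x in clips])
c0, c1 = codes[0], codes[-1]
assign = [int(np.linalg.norm(z - c0) > np.linalg.norm(z - c1)) for z in codes]
print(assign)   # the first 8 clips land in one cluster, the last 8 in the other
```

Note that no action name appears anywhere in training; only after clustering would one attach a label such as “kick” or “jump” to each discovered group, which is the role of the read-out tools described above.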

Schematics of Predict & Cluster

We are able to obtain action recognition results that outperform both previous unsupervised and supervised approaches. Our findings pave the way to a novel type of learning of arbitrary actions from arbitrary feature inputs. This might include anything from recognizing flight patterns of flying insects to identifying malicious actions in internet activity.

For more info, see the overview video below and the paper [2].

With Kun Su and Xiulong Liu.

References

[1] OpenPose: https://github.com/CMU-Perceptual-Computing-Lab/openpose

[2] Su, Kun, Xiulong Liu, and Eli Shlizerman. “Predict & cluster: Unsupervised skeleton based action recognition.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 2020.

