Teaching Computers to Recognize Human Actions in Videos

source link: https://towardsdatascience.com/teaching-computers-to-recognize-human-actions-in-videos-81b2e2d62768?gi=aebd7514f3d8

PREDICT and CLUSTER: Unsupervised Skeleton Based Action Recognition

By Eli Shlizerman

Photo by bruce mars on Unsplash

Identifying the various actions that people make with their bodies just from watching a video is a natural, simple task for humans. For example, most people would easily identify a subject as, say, “jumping back and forth” or “hitting a ball with their foot”. This remains easy even if the subject shown in the video changes or the footage is recorded from different views. What if we would like a computer system, or a gaming console like an Xbox or PlayStation, to be able to do the same? Would that be possible?

For an artificial system, this seemingly basic task is not as natural as it is for humans. It requires several layers of Artificial Intelligence capabilities, such as (i) knowing which specific features to track when making decisions, along with (ii) the ability to name, or label, a particular action.

With regard to (i), research in visual perception and computer vision has shown that, at least for the human body, 3D coordinates of the joints, i.e. skeleton features, are sufficient for identifying different actions. Additionally, current robust algorithms, e.g. OpenPose [1], are able to track these features in real time from nearly any video source.
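To make this concrete, here is a minimal sketch of what skeleton features look like as data once a tracker has extracted them: a clip becomes a time series of per-frame joint coordinates. The numbers below are random placeholders, not real pose estimates, and the joint count is just one common convention.

```python
import numpy as np

# Toy representation of skeleton features: a clip of T frames, each
# frame holding 3D coordinates for J body joints (as a pose tracker
# such as OpenPose would produce). Values are random placeholders.
T, J = 100, 25                        # frames, joints (25 is a common body model)
skeleton = np.random.rand(T, J, 3)    # (time, joint, xyz)

# Flatten each frame into one feature vector, so the clip becomes a
# multivariate time series of shape (T, 3*J) that a network can consume.
sequence = skeleton.reshape(T, J * 3)
print(sequence.shape)                 # prints (100, 75)
```

In this form, action recognition becomes a question about time series: which sequences of these low-dimensional vectors belong to the same action?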

Photo by Sam Sabourin on Unsplash with skeleton features marked

Teaching a computer system to make predictive associations between collections of points and actions using these features turns out to be a much more challenging task than selecting the features themselves. This is because the system is expected to group sequences of features into “classes” and subsequently associate these with the names of the corresponding actions.

Skeleton-based action recognition: predictive association between collections of points (time series) and actions

Existing deep learning systems try to learn this type of association through a process called ‘supervised learning’, where the system learns from many given examples, each annotated with the action it represents. This technique also requires camera and depth inputs (RGB+D) at each step. While supervised action recognition has shown promising advances, it relies on annotating a large number of sequences, and the annotation needs to be redone each time another subject, viewpoint, or new action is considered.
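The following sketch illustrates why supervision is expensive: a supervised recognizer simply cannot be built without a label attached to every training clip. The fabricated data, the per-clip mean feature, and the nearest-centroid classifier are all toy stand-ins, not the deep networks used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised setup: each labelled clip is reduced to a fixed-length
# feature vector (here, the mean over frames of the flattened joints).
def clip_feature(clip):                  # clip: (T, D) time series
    return clip.mean(axis=0)

# Fabricated training data: two "actions" with different joint statistics.
# Note that every training example must carry a human-provided label.
train = [(rng.normal(0.0, 1, (50, 75)), "kick") for _ in range(10)] + \
        [(rng.normal(3.0, 1, (50, 75)), "jump") for _ in range(10)]

# Supervised learning step: labels are required to build the model
# (a nearest-class-centroid classifier standing in for a deep net).
centroids = {}
for label in {"kick", "jump"}:
    feats = [clip_feature(c) for c, l in train if l == label]
    centroids[label] = np.mean(feats, axis=0)

def predict(clip):
    f = clip_feature(clip)
    return min(centroids, key=lambda l: np.linalg.norm(f - centroids[l]))

print(predict(rng.normal(3.0, 1, (50, 75))))   # prints "jump"
```

Adding a new action, subject, or viewpoint means collecting and labelling a fresh batch of examples before the model can be rebuilt, which is exactly the cost an unsupervised approach avoids.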

Photo by Raymond Rasmusson on Unsplash

It is of particular interest, then, to instead create systems that imitate the perceptual ability of humans by learning to make these associations in an unsupervised way.

In our recent research, entitled “Predict & Cluster: Unsupervised skeleton based action recognition” [2], we developed such an unsupervised system. We proposed that, rather than teaching the computer to catalog the sequences with their actions, the system should instead learn to predict the sequences through ‘encoder-decoder’ learning. Such a system is fully unsupervised: it operates on the input sequences alone and does not require labelling of actions at any stage.

Predict & Cluster: Unsupervised Skeleton Based Action Recognition

In particular, the encoder-decoder neural network learns to encode each sequence into a code from which the decoder regenerates exactly the same sequence. It turns out that in the process of learning to encode and then to decode, the Seq2Seq deep neural network self-organizes the sequences into distinct clusters. We developed a way to ensure that learning creates such an organization (by fixing the weights or the states of the decoder), and we developed tools to read this organization and associate each cluster with an action.
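The mechanism can be sketched in a few lines of NumPy. The toy below keeps the spirit of fixed-weight training — only the encoder is updated, while a frozen random decoder must reconstruct each sequence — but it swaps the paper's recurrent Seq2Seq networks for a linear model on fabricated data. It is an illustration of the idea, not the actual method: with the decoder frozen, the encoder is forced to organize the codes, and similar sequences end up with similar codes that can be grouped without any labels.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "fixed decoder" autoencoder: the encoder maps a whole sequence to
# a code, a decoder with FROZEN weights reconstructs the sequence, and
# only the encoder is trained.
T, D, K = 40, 12, 4                        # frames, feature dim, code dim

def make_clip(offset):                     # fabricated clips of two "actions"
    return rng.normal(offset, 0.3, (T, D))

clips = [make_clip(0.0) for _ in range(8)] + [make_clip(2.0) for _ in range(8)]

# Decoder weights: random, orthonormalized for stability, never updated.
W_dec = np.linalg.qr(rng.normal(size=(D, K)))[0]
W_enc = rng.normal(size=(K, D)) * 0.01     # encoder: the only trained weights

def encode(x):                             # code = time-averaged projection
    return (x @ W_enc.T).mean(axis=0)

lr = 0.01
for _ in range(300):                       # SGD on the reconstruction error
    for x in clips:
        z = encode(x)
        err = np.tile(z @ W_dec.T, (T, 1)) - x   # reconstruction residual
        grad_z = err.mean(axis=0) @ W_dec        # dL/dz for mean-squared loss
        W_enc -= lr * np.outer(grad_z, x.mean(axis=0))  # chain rule to W_enc

# The learned codes self-organize by action; read them out with a
# label-free nearest-seed assignment (a stand-in for k-means / KNN).
codes = np.array([encode(x) for x in clips])
c0, c1 = codes[0], codes[-1]
assign = [int(np.linalg.norm(z - c0) > np.linalg.norm(z - c1)) for z in codes]
print(assign)   # the first 8 clips land in one cluster, the last 8 in the other
```

Note that no action name appears anywhere in training; only after clustering would one attach a label such as “kick” or “jump” to each discovered group, which is the role of the read-out tools described above.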

Schematics of Predict & Cluster

We are able to obtain action recognition results that outperform both previous unsupervised and supervised approaches. Our findings pave the way to a novel type of learning of arbitrary actions from arbitrary feature inputs. This might include anything from recognizing flight patterns of flying insects to identifying malicious actions in internet activity.

For more info, see the overview video below and the paper [2].

With Kun Su and Xiulong Liu.

References

[1] OpenPose: https://github.com/CMU-Perceptual-Computing-Lab/openpose

[2] Su, Kun, Xiulong Liu, and Eli Shlizerman. “Predict & cluster: Unsupervised skeleton based action recognition.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 2020.

