Source: https://towardsdatascience.com/using-hourglass-networks-to-understand-human-poses-1e40e349fa15?gi=1a56fbfc5d9b

Using Hourglass Networks To Understand Human Poses

A simple and digestible deep dive into the theory behind Hourglass Networks for human pose estimation

Human Pose Estimation (source)

A man is running at you with a knife. What do you do? Most people will have only one thought in mind: RUN. Why would you run? Because after observing the man’s aggressive posture, you conclude that he wants to harm you. And since you want to live to see tomorrow, you decide to run as fast as you possibly can.

How are you able to do all this complex analysis in mere seconds? Your brain just did something called human pose estimation. Fortunately, since human pose estimation is done by a combination of the eyes and the brain, it is something we can replicate in computer vision.

To perform human pose estimation, we use a special type of Fully Convolutional Network called an Hourglass Network. The network’s encoder-decoder structure makes it look like an hourglass, hence the name “hourglass network”.

Hourglass Network Diagram (source).

But before we dive deeper into the nitty-gritty components of the network, let’s take a look at some other deep neural nets which this network is based on.

Taking A Step Back

Here are some other network architectures which you should be familiar with before looking into hourglass networks:

Convolutional Neural Networks (CNN’s)

  • Significance: Automatically learns the features which correspond best to a specific object, leading to higher classification accuracy.


Residual Networks (ResNets)

  • Significance: Allows for much deeper networks by adding skip connections, which keep the network’s gradient from vanishing during backpropagation.


Fully Convolutional Networks (FCN’s)

  • Significance: Replaces dense layers with 1x1 convolutions, allowing the network to accept inputs with various dimensions.


Encoder-Decoder Networks

  • Significance: Allows us to manipulate an input by extracting its features and attempting to recreate it (e.g. image segmentation, text translation)

We’ll talk about encoder-decoders in more depth, since that’s basically what hourglass networks are.

The Network At A High Level

So, hope you had some fun learning about all those network architectures, but now it’s time to combine them all.

Hourglass network architecture (source)

Hourglass networks are a type of convolutional encoder-decoder network (meaning they use convolutional layers to break down and reconstruct inputs). They take an input (in our case, an image) and extract features from it by deconstructing the image into a feature matrix.

The network then takes this feature matrix and combines it with earlier layers, which have a higher spatial understanding than the feature matrix (i.e. a better sense of where objects are in the image).

  • NOTE: The feature matrix has low spatial understanding, meaning it doesn’t really know where objects are in the image. This is because, to be able to extract the object’s features, we have to discard all pixels which are not features of the object. That means discarding all the background pixels, which removes most of the knowledge of the object’s location in the image.
  • By combining the feature matrix with early layers in the network, which have a higher spatial understanding, we get to understand a lot about the input (what it is + where it is in the image).


Quick Diagram I made in Canva. Hope it helps :)

Doesn’t transporting early layers in the network into later layers ring a bell? ResNets. Residuals are used heavily throughout the network. They are used to combine the spatial info with the feature info, and not only that, each green block represents something we call a bottleneck block.

Bottlenecks are a new type of residual block. Instead of having two 3x3 convolutions, we have one 1x1 convolution, one 3x3 convolution and another 1x1 convolution. This makes the calculations a lot easier on the computer (3x3 convolutions are much more expensive at scale than 1x1 convolutions), which means we get to save lots of memory.

Left: Residual Layer. Right: Bottleneck Block (source)
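
To make this concrete, here’s a minimal sketch of a bottleneck block in PyTorch (the article doesn’t prescribe a framework, so PyTorch and the 256-in/128-mid channel split are my assumptions, loosely following the original Stacked Hourglass paper):

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 convolutions plus a skip connection."""
    def __init__(self, in_ch=256, out_ch=256):
        super().__init__()
        mid = out_ch // 2  # squeeze channels so the 3x3 conv is cheap
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 conv on the skip path only when channel counts differ
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x) + self.skip(x))
```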

So in summary,

  • Input: Image of a person
  • Encoding: Extract features by breaking down the input into a feature matrix
  • Decoding: Combine feature info + spatial info to understand the image in depth
  • Output: Depends on the application; in our case, a heatmap of where the joints are

Understanding the Process Step-By-Step

If we actually want to be able to code this, we need to understand what is happening in every single layer, and why. So here, we’re going to break down the whole process and walk through it step-by-step so that we have a deep understanding of the network (we’re just going to be reviewing the hourglass network’s architecture, not the whole training process).

In this network, we’ll be using:

  • Convolutional Layers: Extract features from the image
  • MaxPooling Layers: Downsample the image, discarding the parts which aren’t necessary for feature extraction
  • Residual Layers: Carry earlier layers forward so they can be reattached deeper in the network
  • Bottleneck Layers: Free up memory by replacing expensive convolutions with less intensive ones
  • Upsampling Layers: Increase the size of the input (in our case, using nearest-neighbour interpolation)
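
To make the vocabulary concrete, here’s how each of these might be declared in PyTorch (the channel sizes are illustrative assumptions, not values from the article):

```python
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # extracts features
pool = nn.MaxPool2d(kernel_size=2, stride=2)                 # halves the spatial resolution
up   = nn.Upsample(scale_factor=2, mode='nearest')           # doubles it back up
# residual and bottleneck layers combine convolutions with a
# skip connection, as in the Bottleneck sketch earlier
```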

Okay, so before diving in, let’s look at yet another diagram of the hourglass network.

Hourglass Network Diagram (source)

So here we can see a couple of things:

  • There are 2 sections: encoding and decoding
  • Each section has 4 cubes.
  • The cubes from the left get passed to the right side to form the cubes on the right

So if we expand each cube, it looks like this:

A bottleneck layer (source)

So in the diagram of the whole network, each cube is a bottleneck layer (like the one shown above). After each pooling layer, we’d add one of these bottleneck layers.

However, the first layer is a bit different, since it has a 7x7 convolution (it’s the only convolution larger than 3x3 in the architecture). Here’s how the first layer would look:


Visualization of First Layer

This is how the first cube looks. First of all, the input is passed into a 7x7 convolution followed by BatchNormalization and ReLU layers. Next, it’s passed into a bottleneck layer, and the output duplicates: one copy goes through the MaxPool and continues with feature extraction, and the other only attaches back to the network later on, in the upsampling (decoding) part.
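
In code, this first cube might look like the sketch below (PyTorch again; the 64-to-128 channel progression is an assumption, and Bottleneck is the block sketched earlier):

```python
class Stem(nn.Module):
    """First cube: 7x7 conv + BatchNorm + ReLU, then a bottleneck.
    Returns the pooled branch (which continues down the encoder) and
    the skip branch (which reattaches during decoding)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            Bottleneck(64, 128),
        )
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.stem(x)
        return self.pool(x), x  # (deeper path, skip connection)
```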

The next cubes (cubes 2, 3 and 4) have a similar structure to each other, but a different one from cube 1. Here’s how the other cubes (in the encoding section) look:


Visualization of Second, Third and Fourth Layers

These layers are much simpler. The previous layer’s output gets passed into a bottleneck layer, then duplicates into a residual (skip) layer and a layer that continues on to feature extraction.

We’re going to repeat this process 3 times (in cubes 2, 3 and 4), and then we’re going to produce the feature maps.
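
A sketch of one of these repeated encoding cubes (same assumptions as before):

```python
class EncoderStep(nn.Module):
    """Cubes 2-4: a bottleneck, then a split into the pooled branch
    that goes deeper and the skip branch saved for decoding."""
    def __init__(self, ch=256):
        super().__init__()
        self.block = Bottleneck(ch, ch)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = self.block(x)
        return self.pool(x), x  # (deeper path, skip connection)
```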

Here are the layers involved in creating the feature maps (this section is the three really small cubes you’d see in the diagram of the whole network):


Visualization of the Bottom Layers

This is the deepest level in the network. It is also the part with the highest feature info and the lowest spatial info. Here, our image is condensed into a matrix (actually, a tensor) of values which represent the features of our image.

To get here, the image passed through all 4 encoding cubes and the 3 bottleneck layers on the bottom. We’re now ready to upsample. Here’s how the upsampling layers look:


Visualization of Upscaling Layers

So here, the incoming residual layer passes through a bottleneck layer, and then we perform an element-wise addition between it (the residual branch) and the upsampled feature layer (from the main network).

We’re going to repeat this process 4 times, and then pass the final layer (4th cube in the decoding part) into the final section where we determine how accurate each prediction is.
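
As a sketch, one decoding cube could look like this (again assuming PyTorch and the Bottleneck block from earlier; nearest-neighbour upsampling, as noted above):

```python
class DecoderStep(nn.Module):
    """One decoding cube: bottleneck on the skip branch, nearest-neighbour
    upsampling on the deeper features, then element-wise addition."""
    def __init__(self, ch=256):
        super().__init__()
        self.skip_block = Bottleneck(ch, ch)
        self.up = nn.Upsample(scale_factor=2, mode='nearest')

    def forward(self, deeper, skip):
        return self.up(deeper) + self.skip_block(skip)
```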

  • NOTE: This is called intermediate supervision. It is when you calculate the loss at the end of each stage instead of only at the end of the whole network. In our case, we calculate the loss at the end of each hourglass network instead of at the end of all the networks combined (since for human pose estimation, we use multiple hourglass networks stacked together).
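
Assuming each stage returns its own heatmap predictions and that we score them against ground-truth heatmaps with mean squared error (the loss used in the original Stacked Hourglass paper), intermediate supervision is just a sum over stages:

```python
import torch.nn.functional as F

def stacked_hourglass_loss(stage_predictions, target_heatmaps):
    # Sum the loss over every hourglass stage, not just the final one
    return sum(F.mse_loss(pred, target_heatmaps) for pred in stage_predictions)
```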

Here’s how the final layer will look:


Visualization of the final network predictions

So here’s the end of the network. We pass the hourglass’s output through a convolutional layer, then branch off to produce a set of heatmaps (this stage’s predictions). Finally, we perform an element-wise addition between the input of this stage, the remapped heatmaps and the intermediate features; the sum becomes the input to the next hourglass in the stack.
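
A sketch of that final section (16 joints is an assumption, matching the MPII pose dataset; the 1x1 remapping convolutions follow the original paper’s description):

```python
class StageHead(nn.Module):
    """Produces this stage's joint heatmaps and the input for the next stage."""
    def __init__(self, ch=256, n_joints=16):
        super().__init__()
        self.feat = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.to_heatmaps = nn.Conv2d(ch, n_joints, kernel_size=1)  # predictions
        self.remap_feat = nn.Conv2d(ch, ch, kernel_size=1)
        self.remap_heat = nn.Conv2d(n_joints, ch, kernel_size=1)

    def forward(self, x, stage_input):
        f = self.feat(x)
        heatmaps = self.to_heatmaps(f)
        # element-wise sum of stage input, features and remapped heatmaps
        next_input = stage_input + self.remap_feat(f) + self.remap_heat(heatmaps)
        return heatmaps, next_input
```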

And then, Repeat!

Yep, that’s it. You just walked through the whole hourglass network. In practice, we’re going to use many of these networks stacked together, which is why the title was “and repeat”. Hopefully, this seemingly intimidating topic is now digestible. In my next article, we’ll code the network.

Like I mentioned before, we’re going to apply this to human pose estimation. However, hourglass networks can be used for many things, like semantic segmentation, 3D reconstruction and more. I was reading some really cool papers on 3D reconstruction with hourglass nets, and I’ll link them below so you can read them too.

Overall, I hope you enjoyed reading this article, and if you’re having any trouble understanding this concept, feel free to reach out to me on email, LinkedIn or even Instagram (insta: @nushaine), and I will do my best to help you understand. Other than that, have a great day and happy coding :)

Resources

Really Cool Papers

Awesome GitHub Repos

