Geometric Deep Learning for Pose Estimation


One of the central problems in Computer Vision and Robotics is understanding how objects are positioned with respect to the robot or the environment. In this post, I will explain the theory behind, and walk through a PyTorch implementation of, the paper "6-DoF Object Pose from Semantic Keypoints" by Pavlakos et al. [1].

In this approach, there are two steps. The first step is to predict “semantic keypoints” on the 2D image. In the second step, we estimate the pose of the object by maximizing the geometric consistency between the predicted set of semantic keypoints and a 3D model of the object using a perspective camera model.

[Figure: Overlay of the projected 3D model on the monocular image]

Keypoint Localization

First, we put a bounding box around the object of interest using a standard off-the-shelf detector such as Faster R-CNN, and consider only that region for keypoint localization. The paper uses the "stacked hourglass" network architecture for this purpose. The network takes in an RGB image and outputs a set of heatmaps, one heatmap per keypoint. The heatmaps allow the network to express its confidence over a region rather than regressing a single (x, y) position for each keypoint.
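To make the heatmap representation concrete, here is a minimal sketch of how a ground-truth heatmap can be rendered as a Gaussian centered on a keypoint. The repository's LocsToHeatmaps transform does something along these lines; the function name and the sigma value here are illustrative, not the repository's API.

import numpy as np

def keypoint_to_heatmap(x, y, out_size=(64, 64), sigma=1.0):
    """Render a single keypoint as a 2D Gaussian 'blob' heatmap.

    (x, y) are keypoint coordinates already scaled to the heatmap
    resolution; sigma controls how spread-out the confidence is.
    """
    h, w = out_size
    xs = np.arange(w)[None, :]   # shape (1, w)
    ys = np.arange(h)[:, None]   # shape (h, 1)
    heatmap = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap               # peak value 1.0 at the keypoint

# One heatmap per keypoint: for 10 keypoints the target tensor would be (10, 64, 64).
target = np.stack([keypoint_to_heatmap(20.5, 33.0), keypoint_to_heatmap(40.0, 12.0)])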

[Figure: the stacked hourglass architecture]

As can be seen in the above image, the network has two hourglasses, and each hourglass has a downsampling part and an upsampling part. The purpose of the second hourglass is to refine the output of the first. The downsampling parts consist of alternating convolution and max-pooling layers; when the feature map reaches a resolution of 4×4, the upsampling begins. The upsampling parts consist of convolution and upsampling (deconvolution) layers. The network is trained using an L2 loss on the outputs of both the first and second hourglasses.
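At test time, each predicted heatmap has to be reduced back to a keypoint location and a confidence score. Here is a minimal sketch of one way to do this with a simple argmax; the exact post-processing in the repository may differ, and the names are assumptions.

import torch

def heatmaps_to_keypoints(heatmaps):
    """Reduce predicted heatmaps (num_keypoints, H, W) to pixel
    coordinates and confidences via a simple argmax."""
    num_kp, h, w = heatmaps.shape
    flat = heatmaps.view(num_kp, -1)
    conf, idx = flat.max(dim=1)                  # peak value doubles as confidence
    ys = (idx // w).float()
    xs = (idx % w).float()
    keypoints_2d = torch.stack([xs, ys], dim=1)  # (num_keypoints, 2)
    return keypoints_2d, conf

The peak values are exactly the kind of per-keypoint confidences that the pose optimization stage below uses to weight each keypoint.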


Pose Optimization

Now that we have the keypoints, we can use them to find the pose. All we need is a model of the object we are interested in. The authors define a deformable model S composed of a mean shape B_0 plus a weighted sum of basis shapes B_i computed using PCA. The shape is a 3×P matrix, where P is the number of keypoints.

S = B_0 + Σ_i c_i B_i
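To make this concrete, here is a small sketch of composing the shape from the mean and the PCA basis. The tensor shapes and the helper name are assumptions for illustration, not the repository's API.

import torch

def compose_shape(B0, B, c):
    """Deformable shape model: S = B0 + sum_i c_i * B_i.

    B0: (3, P) mean shape, B: (k, 3, P) PCA basis shapes,
    c:  (k,) deformation coefficients.
    """
    return B0 + torch.einsum('i,ijk->jk', c, B)

# e.g. 10 keypoints and 4 PCA components
S = compose_shape(torch.zeros(3, 10), torch.zeros(4, 3, 10), torch.zeros(4))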

The main optimization problem is that of reducing the residual below. Here W is the set of normalized 2D keypoints in homogeneous coordinates, Z is a diagonal matrix of keypoint depths, and R and T are the rotation matrix and translation vector respectively. The unknowns we optimize over are Z, R, T, and c.

W Z − R S − T 1^T

Below is the actual loss that we want to minimize. Here, D is a diagonal matrix of keypoint confidences: weighting the residual by it penalizes errors more heavily on keypoints the network is more certain about. The second term is a regularization term meant to penalize large deviations from the mean shape of the model of interest.

|| (W Z − R S − T 1^T) D^(1/2) ||_F² + λ ||c||²


That’s it! The resulting R and T define the pose of the object. Now let’s look at the PyTorch implementation.

PyTorch Implementation

For the implementation, we will closely follow code provided in CIS 580 at the University of Pennsylvania. I have simplified the code by merging some files and removing some data augmentation steps.

Let's dive right into the code. First, we need to clone the repository:

git clone https://github.com/vaishak2future/posefromkeypoints.git

We need to unzip data.zip so that the top-level directory contains three folders: data, output, and utils. Now, let's run the Jupyter notebook. The first block of code we will inspect is the Trainer class. It loads the train and test datasets and applies a few transformations to get the data into the format we want: each image is cropped and padded so that we only look at the bounding box around the object of interest, the ground-truth keypoint locations are rendered as heatmaps, and everything is converted into appropriately normalized tensors. Finally, the trainer also builds the hourglass model. Once we call the train method, we are done with keypoint localization.

import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from time import time

# Dataset, CropAndPad, LocsToHeatmaps, ToTensor, Normalize and the hourglass
# builder hg are provided by the repository's utils package.


class Trainer(object):

    def __init__(self):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Crop to the bounding box, render ground-truth heatmaps, convert to tensors
        train_transform_list = [CropAndPad(out_size=(256, 256)), LocsToHeatmaps(out_size=(64, 64)), ToTensor(), Normalize()]
        test_transform_list = [CropAndPad(out_size=(256, 256)), LocsToHeatmaps(out_size=(64, 64)), ToTensor(), Normalize()]
        self.train_ds = Dataset(is_train=True, transform=transforms.Compose(train_transform_list))
        self.test_ds = Dataset(is_train=False, transform=transforms.Compose(test_transform_list))

        # Single-stack hourglass that outputs one heatmap per keypoint
        self.model = hg(num_stacks=1, num_blocks=1, num_classes=10).to(self.device)
        # define loss function and optimizer
        self.heatmap_loss = torch.nn.MSELoss().to(self.device)  # L2 loss on the heatmaps
        self.optimizer = torch.optim.RMSprop(self.model.parameters(), lr=2.5e-4)
        self.train_data_loader = DataLoader(self.train_ds, batch_size=8,
                                            num_workers=8,
                                            pin_memory=True,
                                            shuffle=True)
        self.test_data_loader = DataLoader(self.test_ds, batch_size=32,
                                           num_workers=8,
                                           pin_memory=True,
                                           shuffle=True)

        self.summary_iters = []
        self.losses = []
        self.pcks = []

    def train(self):
        self.total_step_count = 0
        start_time = time()
        for epoch in range(1, 400 + 1):

            print("Epoch %d/%d" % (epoch, 400))

            for step, batch in enumerate(self.train_data_loader):
                self.model.train()
                batch = {k: v.to(self.device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}
                self.optimizer.zero_grad()
                pred_heatmap_list = self.model(batch['image'])
                loss = self.heatmap_loss(pred_heatmap_list[-1], batch['keypoint_heatmaps'])
                loss.backward()
                self.optimizer.step()

                self.total_step_count += 1

        # Save the trained hourglass weights for the pose optimization stage
        checkpoint = {'model': self.model.state_dict()}
        torch.save(checkpoint, './output/model_checkpoint.pt')
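In the notebook, keypoint localization then boils down to roughly the following (assuming the data folders are in place):

trainer = Trainer()
trainer.train()  # trains the hourglass and saves ./output/model_checkpoint.pt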

Before we begin pose optimization, we need to define a custom function that we will use repeatedly. The Rodrigues function converts axis-angle vectors into 3×3 rotation matrices. We define it as a torch.autograd.Function so that we can use PyTorch's autograd functionality.

import cv2
import numpy as np
import torch


class Rodrigues(torch.autograd.Function):
    """Differentiable axis-angle to rotation-matrix conversion.

    The forward pass uses cv2.Rodrigues; the backward pass reuses the
    Jacobian that OpenCV returns alongside the rotation matrix.
    """

    @staticmethod
    def forward(ctx, inp):
        pose = inp.detach().cpu().numpy()
        rotm, part_jacob = cv2.Rodrigues(pose)
        ctx.jacob = torch.Tensor(np.transpose(part_jacob)).contiguous()
        rotation_matrix = torch.Tensor(rotm.ravel())
        return rotation_matrix.view(3, 3)

    @staticmethod
    def backward(ctx, grad_output):
        grad_output = grad_output.view(1, -1)
        grad_input = torch.mm(grad_output, ctx.jacob)
        grad_input = grad_input.view(-1)
        return grad_input

rodrigues = Rodrigues.apply
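Here is a quick, illustrative sanity check of the custom function (the values are arbitrary):

# 90-degree rotation about the z-axis in axis-angle form
r = torch.tensor([0.0, 0.0, np.pi / 2], requires_grad=True)
R = rodrigues(r)      # 3x3 rotation matrix, differentiable w.r.t. r
R.sum().backward()    # gradients flow back to the axis-angle vector
print(R, r.grad)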

Finally, we write the pose optimization function, where we convert our detected 2D keypoints into normalized homogeneous coordinates using the camera model and compare them with the rotated and translated ground-truth 3D keypoints. We stop the optimization when the relative change in the loss drops below a threshold.

import matplotlib.pyplot as plt

# plot_mesh is provided by the repository's utils package.

def pose_optimization(img, vertices, faces, keypoints_2d, conf, keypoints_3d, K):
    # Send variables to GPU
    device = keypoints_2d.device
    keypoints_3d = keypoints_3d.to(device)
    K = K.to(device)
    # Rotation in axis-angle representation and translation, both optimized
    r = torch.rand(3, requires_grad=True, device=device)
    t = torch.rand(3, requires_grad=True, device=device)
    # Per-keypoint weights from the detector confidences
    d = conf.sqrt()[:, None]
    # 2D keypoints in normalized coordinates
    norm_keypoints_2d = torch.matmul(K.inverse(), torch.cat((keypoints_2d, torch.ones(keypoints_2d.shape[0], 1, device=device)), dim=-1).t()).t()[:, :-1]
    # set up optimizer
    optimizer = torch.optim.Adam([r, t], lr=1e-2)
    # convergence check
    converged = False
    rel_tol = 1e-7
    loss_old = 100
    while not converged:
        optimizer.zero_grad()
        # convert axis-angle to rotation matrix
        R = rodrigues(r)
        # 1) Compute projected keypoints based on current estimate of R and t
        k3d = torch.matmul(R, keypoints_3d.transpose(1, 0)) + t[:, None]
        proj_keypoints = (k3d / k3d[2])[0:2, :].transpose(1, 0)
        # 2) Compute error (confidence-weighted distance between projected and detected keypoints)
        err = torch.norm(((norm_keypoints_2d - proj_keypoints) * d) ** 2, 'fro')
        # 3) Update based on error
        err.backward()
        optimizer.step()
        # 4) Check for convergence
        if abs(err.detach() - loss_old) / loss_old < rel_tol:
            break
        else:
            loss_old = err.detach()

    # Visualize the fitted pose by projecting the 3D model onto the image
    R = rodrigues(r)
    plt.figure()
    plot_mesh(img, vertices, faces, R.detach().cpu().numpy(), t.detach().cpu().numpy()[:, None], K.detach().cpu().numpy())
    plt.show()
    return R.detach(), t.detach()
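Here is a hedged sketch of how the function is called. The notebook loads these inputs from the dataset; the shapes below are assumptions based on how the function uses them.

# img:          HxWx3 image
# vertices:     (V, 3) mesh vertices; faces: (F, 3) triangle indices
# keypoints_2d: (P, 2) pixel coordinates from the hourglass heatmaps
# conf:         (P,) per-keypoint confidences (heatmap peak values)
# keypoints_3d: (P, 3) corresponding 3D model keypoints
# K:            (3, 3) camera intrinsics
R, t = pose_optimization(img, vertices, faces, keypoints_2d, conf, keypoints_3d, K)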

When you run the above function, you should get results like these:

[Result images: projected 3D models overlaid on the input images]

That’s all for the important implementation details! Make sure you run it and play around with the different hyperparameters and transformations to understand how they affect the results. I hope this has been helpful. Please send me any comments or corrections!

References

[1] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis. 6-DoF Object Pose from Semantic Keypoints. ICRA, 2017.
