Geometric Deep Learning for Pose Estimation


One of the central problems in Computer Vision and Robotics is understanding how objects are positioned with respect to the robot or the environment. In this post, I will explain the theory behind, and walk through a PyTorch implementation of, the paper "6-DoF Object Pose from Semantic Keypoints" by Pavlakos et al. [1].

In this approach, there are two steps. The first step is to predict “semantic keypoints” on the 2D image. In the second step, we estimate the pose of the object by maximizing the geometric consistency between the predicted set of semantic keypoints and a 3D model of the object using a perspective camera model.

[Figure: Overlay of the projected 3D model on the monocular image]

Keypoint Localization

First, we put a bounding box around the object of interest using a standard off-the-shelf detector such as Faster R-CNN, and consider only that region for keypoint localization. The paper uses the "stacked hourglass" network architecture for this purpose. The network takes in an RGB image and outputs a set of heatmaps, one heatmap per keypoint. The heatmaps allow the network to express its confidence over a region rather than regressing a single (x, y) position for each keypoint.
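To make the heatmap representation concrete, here is a minimal sketch of how a ground-truth heatmap can be rendered as a Gaussian centered on a keypoint. The repository's LocsToHeatmaps transform does something along these lines; the function name and the sigma value here are illustrative, not the repository's API.

import numpy as np

def keypoint_to_heatmap(x, y, out_size=(64, 64), sigma=1.0):
    """Render a single keypoint as a 2D Gaussian 'blob' heatmap.

    (x, y) are keypoint coordinates already scaled to the heatmap
    resolution; sigma controls how spread-out the confidence is.
    """
    h, w = out_size
    xs = np.arange(w)[None, :]   # shape (1, w)
    ys = np.arange(h)[:, None]   # shape (h, 1)
    heatmap = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap               # peak value 1.0 at the keypoint

# One heatmap per keypoint: for 10 keypoints the target tensor would be (10, 64, 64).
target = np.stack([keypoint_to_heatmap(20.5, 33.0), keypoint_to_heatmap(40.0, 12.0)])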

[Figure: the stacked hourglass architecture]

As can be seen in the above image, the network has two hourglasses, and each hourglass has a downsampling part and an upsampling part. The purpose of the second hourglass is to refine the output of the first. The downsampling parts consist of alternating convolution and max-pooling layers; when the feature map reaches a resolution of 4×4, the upsampling begins. The upsampling parts consist of convolution and upsampling (deconvolution) layers. The network is trained using an L2 loss on the outputs of both the first and second hourglasses.
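At test time, each predicted heatmap has to be reduced back to a keypoint location and a confidence score. Here is a minimal sketch of one way to do this with a simple argmax; the exact post-processing in the repository may differ, and the names are assumptions.

import torch

def heatmaps_to_keypoints(heatmaps):
    """Reduce predicted heatmaps (num_keypoints, H, W) to pixel
    coordinates and confidences via a simple argmax."""
    num_kp, h, w = heatmaps.shape
    flat = heatmaps.view(num_kp, -1)
    conf, idx = flat.max(dim=1)                  # peak value doubles as confidence
    ys = (idx // w).float()
    xs = (idx % w).float()
    keypoints_2d = torch.stack([xs, ys], dim=1)  # (num_keypoints, 2)
    return keypoints_2d, conf

The peak values are exactly the kind of per-keypoint confidences that the pose optimization stage below uses to weight each keypoint.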


Pose Optimization

Now that we have the keypoints, we can use them to find the pose. All we need is a model of the object we are interested in. The authors define a deformable model S composed of a mean shape B_0 plus a weighted sum of basis shapes B_i computed using PCA. The shape is a 3×P matrix, where P is the number of keypoints.

S = B_0 + Σ_i c_i B_i
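To make this concrete, here is a small sketch of composing the shape from the mean and the PCA basis. The tensor shapes and the helper name are assumptions for illustration, not the repository's API.

import torch

def compose_shape(B0, B, c):
    """Deformable shape model: S = B0 + sum_i c_i * B_i.

    B0: (3, P) mean shape, B: (k, 3, P) PCA basis shapes,
    c:  (k,) deformation coefficients.
    """
    return B0 + torch.einsum('i,ijk->jk', c, B)

# e.g. 10 keypoints and 4 PCA components
S = compose_shape(torch.zeros(3, 10), torch.zeros(4, 3, 10), torch.zeros(4))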

The main optimization problem is that of reducing the residual below. Here W is the set of normalized 2D keypoints in homogeneous coordinates, Z is a diagonal matrix of keypoint depths, and R and T are the rotation matrix and translation vector respectively. The unknowns we optimize over are Z, R, T, and c.

W Z − R S − T 1^T

Below is the actual loss that we want to minimize. Here, D is a diagonal matrix of keypoint confidences: weighting the residual by it penalizes errors more heavily on keypoints the network is more certain about. The second term is a regularization term meant to penalize large deviations from the mean shape of the model of interest.

|| (W Z − R S − T 1^T) D^(1/2) ||_F² + λ ||c||²


That’s it! The resulting R and T define the pose of the object. Now let’s look at the PyTorch implementation.

PyTorch Implementation

For the implementation, we will closely follow code provided in CIS 580 at the University of Pennsylvania. I have simplified the code by merging some files and removing some data augmentation steps.

Let's dive right into the code. First, we need to clone the repository:

git clone https://github.com/vaishak2future/posefromkeypoints.git

We need to unzip data.zip so that the top-level directory contains three folders: data, output, and utils. Now, let's run the Jupyter notebook. The first block of code we will inspect is the Trainer class. It loads the train and test datasets and applies a few transformations to get the data into the format we want: each image is cropped and padded so that we only look at the bounding box around the object of interest, the ground-truth keypoint locations are rendered as heatmaps, and everything is converted into appropriately normalized tensors. Finally, the trainer also builds the hourglass model. Once we call the train method, we are done with keypoint localization.

import torch
from torch.utils.data import DataLoader
from torchvision import transforms
from time import time

# Dataset, CropAndPad, LocsToHeatmaps, ToTensor, Normalize and the hourglass
# builder hg are provided by the repository's utils package.


class Trainer(object):

    def __init__(self):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Crop to the bounding box, render ground-truth heatmaps, convert to tensors
        train_transform_list = [CropAndPad(out_size=(256, 256)), LocsToHeatmaps(out_size=(64, 64)), ToTensor(), Normalize()]
        test_transform_list = [CropAndPad(out_size=(256, 256)), LocsToHeatmaps(out_size=(64, 64)), ToTensor(), Normalize()]
        self.train_ds = Dataset(is_train=True, transform=transforms.Compose(train_transform_list))
        self.test_ds = Dataset(is_train=False, transform=transforms.Compose(test_transform_list))

        # Single-stack hourglass that outputs one heatmap per keypoint
        self.model = hg(num_stacks=1, num_blocks=1, num_classes=10).to(self.device)
        # define loss function and optimizer
        self.heatmap_loss = torch.nn.MSELoss().to(self.device)  # L2 loss on the heatmaps
        self.optimizer = torch.optim.RMSprop(self.model.parameters(), lr=2.5e-4)
        self.train_data_loader = DataLoader(self.train_ds, batch_size=8,
                                            num_workers=8,
                                            pin_memory=True,
                                            shuffle=True)
        self.test_data_loader = DataLoader(self.test_ds, batch_size=32,
                                           num_workers=8,
                                           pin_memory=True,
                                           shuffle=True)

        self.summary_iters = []
        self.losses = []
        self.pcks = []

    def train(self):
        self.total_step_count = 0
        start_time = time()
        for epoch in range(1, 400 + 1):

            print("Epoch %d/%d" % (epoch, 400))

            for step, batch in enumerate(self.train_data_loader):
                self.model.train()
                batch = {k: v.to(self.device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}
                self.optimizer.zero_grad()
                pred_heatmap_list = self.model(batch['image'])
                loss = self.heatmap_loss(pred_heatmap_list[-1], batch['keypoint_heatmaps'])
                loss.backward()
                self.optimizer.step()

                self.total_step_count += 1

        # Save the trained hourglass weights for the pose optimization stage
        checkpoint = {'model': self.model.state_dict()}
        torch.save(checkpoint, './output/model_checkpoint.pt')
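In the notebook, keypoint localization then boils down to roughly the following (assuming the data folders are in place):

trainer = Trainer()
trainer.train()  # trains the hourglass and saves ./output/model_checkpoint.pt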

Before we begin pose optimization, we need to define a custom function that we will use repeatedly. The Rodrigues function converts axis-angle vectors into 3×3 rotation matrices. We define it as a torch.autograd.Function so that we can use PyTorch's autograd functionality.

import cv2
import numpy as np
import torch


class Rodrigues(torch.autograd.Function):
    """Differentiable axis-angle to rotation-matrix conversion.

    The forward pass uses cv2.Rodrigues; the backward pass reuses the
    Jacobian that OpenCV returns alongside the rotation matrix.
    """

    @staticmethod
    def forward(ctx, inp):
        pose = inp.detach().cpu().numpy()
        rotm, part_jacob = cv2.Rodrigues(pose)
        ctx.jacob = torch.Tensor(np.transpose(part_jacob)).contiguous()
        rotation_matrix = torch.Tensor(rotm.ravel())
        return rotation_matrix.view(3, 3)

    @staticmethod
    def backward(ctx, grad_output):
        grad_output = grad_output.view(1, -1)
        grad_input = torch.mm(grad_output, ctx.jacob)
        grad_input = grad_input.view(-1)
        return grad_input

rodrigues = Rodrigues.apply
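Here is a quick, illustrative sanity check of the custom function (the values are arbitrary):

# 90-degree rotation about the z-axis in axis-angle form
r = torch.tensor([0.0, 0.0, np.pi / 2], requires_grad=True)
R = rodrigues(r)      # 3x3 rotation matrix, differentiable w.r.t. r
R.sum().backward()    # gradients flow back to the axis-angle vector
print(R, r.grad)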

Finally, we write the pose optimization function, where we convert our detected 2D keypoints into normalized homogeneous coordinates using the camera model and compare them with the rotated and translated ground-truth 3D keypoints. We stop the optimization when the relative change in the loss drops below a threshold.

import matplotlib.pyplot as plt

# plot_mesh is provided by the repository's utils package.

def pose_optimization(img, vertices, faces, keypoints_2d, conf, keypoints_3d, K):
    # Send variables to GPU
    device = keypoints_2d.device
    keypoints_3d = keypoints_3d.to(device)
    K = K.to(device)
    # Rotation in axis-angle representation and translation, both optimized
    r = torch.rand(3, requires_grad=True, device=device)
    t = torch.rand(3, requires_grad=True, device=device)
    # Per-keypoint weights from the detector confidences
    d = conf.sqrt()[:, None]
    # 2D keypoints in normalized coordinates
    norm_keypoints_2d = torch.matmul(K.inverse(), torch.cat((keypoints_2d, torch.ones(keypoints_2d.shape[0], 1, device=device)), dim=-1).t()).t()[:, :-1]
    # set up optimizer
    optimizer = torch.optim.Adam([r, t], lr=1e-2)
    # convergence check
    converged = False
    rel_tol = 1e-7
    loss_old = 100
    while not converged:
        optimizer.zero_grad()
        # convert axis-angle to rotation matrix
        R = rodrigues(r)
        # 1) Compute projected keypoints based on current estimate of R and t
        k3d = torch.matmul(R, keypoints_3d.transpose(1, 0)) + t[:, None]
        proj_keypoints = (k3d / k3d[2])[0:2, :].transpose(1, 0)
        # 2) Compute error (confidence-weighted distance between projected and detected keypoints)
        err = torch.norm(((norm_keypoints_2d - proj_keypoints) * d) ** 2, 'fro')
        # 3) Update based on error
        err.backward()
        optimizer.step()
        # 4) Check for convergence
        if abs(err.detach() - loss_old) / loss_old < rel_tol:
            break
        else:
            loss_old = err.detach()

    # Visualize the fitted pose by projecting the 3D model onto the image
    R = rodrigues(r)
    plt.figure()
    plot_mesh(img, vertices, faces, R.detach().cpu().numpy(), t.detach().cpu().numpy()[:, None], K.detach().cpu().numpy())
    plt.show()
    return R.detach(), t.detach()
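Here is a hedged sketch of how the function is called. The notebook loads these inputs from the dataset; the shapes below are assumptions based on how the function uses them.

# img:          HxWx3 image
# vertices:     (V, 3) mesh vertices; faces: (F, 3) triangle indices
# keypoints_2d: (P, 2) pixel coordinates from the hourglass heatmaps
# conf:         (P,) per-keypoint confidences (heatmap peak values)
# keypoints_3d: (P, 3) corresponding 3D model keypoints
# K:            (3, 3) camera intrinsics
R, t = pose_optimization(img, vertices, faces, keypoints_2d, conf, keypoints_3d, K)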

When you run the above function, you should get results like these:

[Result images: projected 3D models overlaid on the input images]

That’s all for the important implementation details! Make sure you run it and play around with the different hyperparameters and transformations to understand how they affect the results. I hope this has been helpful. Please send me any comments or corrections!

References

[1] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis. 6-DoF Object Pose from Semantic Keypoints. ICRA, 2017.
