Create YOLOv3 using PyTorch from scratch (Part-6)
source link: https://numbersmithy.com/create-yolov3-using-pytorch-from-scratch-part-6/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
3. PyTorch implementation
3.1. Quick recap
Let’s do a quick recap on the format of model outputs and target labels.
We have the following snippet in the training iteration:
for ii, (imgii, labelii) in enumerate(dataloader): model.train() imgii = imgii.to(device) labelii = labelii.to(device) yhatii = model(imgii) lossii = compute_loss(yhatii, labelii, model) lossii.backward()
Where:
imgii
is a 4D tensor, with shape[batch_size, 3, 416, 416]
.labelii
is a 2D tensor, with shape[n_labels, 6]
. The 6 columns are:[batch_idx, x_center, y_center, w, h, cls]
yhatii
is the model output in training mode. It is alist
of 3 tensors, corresponding to predictions at 3 size scales. Each tensor has a shape of[batch_size, n_anchors, h, w, 5 + n_classes]
.
More details about these are given in Part-2, Build the model backbone, and Part-5, Training data preparation.
3.2. The compute_loss()
function
Code first:
def compute_loss(yhat, label, model, bbox_loss='iou', obj_label='1'): '''Compute multi-task losses Args: yhat (list of tensors): YOLO model output at 3 scales in a list. Each tensor has shape [B, na, h, w, 5 + n_classes]. Where: B: batch_size. na: number of anchors. h: number of rows. w: number of columns. Columns of last dimension: [x_center, y_center, w, h, obj, c1, ..., ck]. label (tensor): ground truth label, in shape (n, 6). n: number of labeled objects in the batch. Columns: [batch_idx, x_center, y_center, w, h, cls]. model (nn.Module): YOLO model. Keyword Args: bbox_loss (str): 'mse': use MSE loss for the x,y centers and w,h sizes. 'iou': use IoU with label bbox as loss. obj_label (str): '1': use 1 as the target objectness score in label locations. 'iou': use IoU between prediction and ground truth as target objectness score in label locations. Returns: loss_box (nn.Variable): loss term from bounding box prediction. loss_obj (nn.Variable): loss term from objectness score prediction. loss_cls (nn.Variable): loss term from classification prediction. ''' n_class = model.n_classes device = label.device # compute a factor to counter unbalanced object labels n_labels = len(label) # num of objects in label n_preds = 0 # total num of predictions for yhatii in yhat: b, na, h, w, _ = yhatii.shape n_preds += na * h * w obj_weights = torch.tensor([(n_preds - n_labels)/n_labels*0.5]).to(device) # prepare loss terms loss_box = torch.zeros(1, device=device) loss_obj = torch.zeros(1, device=device) loss_cls = torch.zeros(1, device=device) if bbox_loss == 'mse': loss_xy = torch.zeros(1, device=device) loss_wh = torch.zeros(1, device=device) # BCE loss func for objectness score and classification obj_bce = nn.BCEWithLogitsLoss(pos_weight=obj_weights) cls_bce = nn.BCEWithLogitsLoss() if bbox_loss == 'mse': # MSE loss func for x,y,w,h xy_mse = nn.MSELoss() wh_mse = nn.MSELoss() # loop through 3 scales for yhatii, yoloii in zip(yhat, model.yolo_layers): b, na, h, w, _ = yhatii.shape stride = float(yoloii.stride) anchors = (yoloii.anchors / stride).float().to(device) # [n_anchors, 2] grid_size = torch.tensor([w, h]).float().to(device) # w, h from labels, convert to feature map scale wh_lb = label[:, 3:5] * grid_size # [n_label, 2] # find matches between labels and anchors ratio = wh_lb[:, None, :] / anchors[None, :, :] # [n_label, n_anchors, 2] ratio = torch.abs(ratio - 1).sum(2) # [n_label, n_anchors] # select anchor boxes with closest ratios ratio = torch.min(ratio, dim=1) # labeled object index label_idx = ratio[0] < 2 # 2 is empirical # anchor index anchor_idx = ratio[1][label_idx] # get batch indices of labeled objects batch_idx = label[label_idx, 0].long() # get cell indices of labeled objects xy_lb = label[label_idx, 1:3] * grid_size xy_idx = torch.floor(xy_lb).long() x_idx = xy_idx[:,0].clamp(0, int(grid_size[0])-1) y_idx = xy_idx[:,1].clamp(0, int(grid_size[1])-1) # get target objectness scores obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device) if obj_label == '1': obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = 1 # predicted objectness scores obj_pd = yhatii[..., 4] # if there are target objects in this scale: if len(batch_idx) > 0: # relative offsets wrt to cells of labels relxy_lb = xy_lb - xy_idx # x,y offests of predictions xy_pd = torch.sigmoid(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 0:2]) # w,h sizes of labels wh_lb = label[label_idx, 3:5] * grid_size # w,h sizes of predictions, in feature map coordinate wh_pd = torch.exp(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 2:4]) * anchors[anchor_idx, :] wh_pd = wh_pd.clamp(0, grid_size.max()) if bbox_loss == 'mse': # x,y mse loss loss_xy += xy_mse(xy_pd, relxy_lb) # w,h mse loss loss_wh += wh_mse(wh_pd, wh_lb) / 10 # scale size loss down loss_box += (loss_xy + loss_wh) if bbox_loss == 'iou' or obj_label == 'iou': # compute IoUs pbox = torch.cat([xy_pd, wh_pd], dim=1) box_lb = torch.cat([relxy_lb, wh_lb], dim=1) iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False) loss_box += (1.0 - iou).mean() if obj_label == 'iou': # Use cells with iou > 0 as object targets obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = iou.detach().clamp(0).type(obj_lb.dtype) # classification predictions cls_pd = yhatii[batch_idx, anchor_idx, y_idx, x_idx, 5:] # one-hot encode classes cls_one_hot_lb = F.one_hot(label[label_idx, -1].long(), n_class).float().to(device) # classification loss loss_cls += cls_bce(cls_pd, cls_one_hot_lb) # objectness score loss loss_obj += obj_bce(obj_pd, obj_lb) loss = loss_box + loss_obj + loss_cls return loss, loss_box , loss_obj , loss_cls
More explanations:
The
bbox_loss
input argument has 2 choices:'mse'
or'iou'
. These are the 2 ways of formulating the localization loss mentioned above. I added this only for experiment purposes.If this is set to
'mse'
, anMSELoss()
loss function is created for the x, y coordinates, and another one for the w, h sizes:if bbox_loss == 'mse': # MSE loss func for x,y,w,h xy_mse = nn.MSELoss() wh_mse = nn.MSELoss()
Later, the
loss_box
term is computed as the sum of the 2:if bbox_loss == 'mse': # x,y mse loss loss_xy += xy_mse(xy_pd, relxy_lb) # w,h mse loss loss_wh += wh_mse(wh_pd, wh_lb) / 10 # scale size loss down loss_box += (loss_xy + loss_wh)
If
bbox_loss
is set to'iou'
,loss_box
is computed using:if bbox_loss == 'iou' or obj_label == 'iou': # compute IoUs pbox = torch.cat([xy_pd, wh_pd], dim=1) box_lb = torch.cat([relxy_lb, wh_lb], dim=1) iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False) loss_box += (1.0 - iou).mean()
The
obj_label
input argument has 2 choices'1'
or'iou'
. These are the 2 ways of setting the objectness target values mentioned above. Again, for test purposes only.If
obj_label == '1'
, set objectness target values to 1:# get target objectness scores obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device) if obj_label == '1': obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = 1
where
batch_idx
,anchor_idx
,y_idx
,x_idx
correspond to the (b,a,i,j) coordinates talked about earlier. More on this later.If
obj_label == 'iou'
, set objectness target values to IoU:if bbox_loss == 'iou' or obj_label == 'iou': # compute IoUs pbox = torch.cat([xy_pd, wh_pd], dim=1) box_lb = torch.cat([relxy_lb, wh_lb], dim=1) iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False) loss_box += (1.0 - iou).mean() if obj_label == 'iou': # Use cells with iou > 0 as object targets obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = iou.detach().clamp(0).type(obj_lb.dtype)
I counted the number of objects/labels in the batch:
n_labels = len(label)
. And the total number of predictions made by the model:n_preds = 0 # total num of predictions for yhatii in yhat: b, na, h, w, _ = yhatii.shape n_preds += na * h * w
Recall that using standard settings, this is
(52 * 52 + 26 * 26 + 13 * 13) * 3 = 10647
. So there would be a big imbalance between positive and negative samples. To counter this, YOLOv1 used the λnoobj scaling factor mentioned above. I think a more specific number could actually be worked out from data:obj_weights = torch.tensor([(n_preds - n_labels)/n_labels*0.5]).to(device)
The
*0.5
scaling is to tune down the ratio a bit, and it’s purely empirical.This
obj_weights
variable works with PyTorch’sBCEWithLogitsLoss
, to give weights to positive samples:# BCE loss func for objectness score and classification obj_bce = nn.BCEWithLogitsLoss(pos_weight=obj_weights)
We then loop through the 3 size scales of YOLO prediction, and get some size information first:
# loop through 3 scales for yhatii, yoloii in zip(yhat, model.yolo_layers): b, na, h, w, _ = yhatii.shape stride = float(yoloii.stride) anchors = (yoloii.anchors / stride).float().to(device) # [n_anchors, 2] grid_size = torch.tensor([w, h]).float().to(device)
Note that
anchors
are convert to feature map coordinate (see Part-1), andgrid_size
is by definition feature map sizes.The association between ground truth labels and “responsible” anchor boxes in this scale is achieved by comparing their width/height ratios:
# w, h from labels, convert to feature map scale wh_lb = label[:, 3:5] * grid_size # [n_label, 2] # find matches between labels and anchors ratio = wh_lb[:, None, :] / anchors[None, :, :] # [n_label, n_anchors, 2]
The best matches are those with smallest absolute width+height ratios from
1.0
:ratio = torch.abs(ratio - 1).sum(2) # [n_label, n_anchors] # select anchor boxes with closest ratios ratio = torch.min(ratio, dim=1) # labeled object index label_idx = ratio[0] < 2 # 2 is empirical # anchor index anchor_idx = ratio[1][label_idx]
label_idx
is a tensor of indices, denoting those ground truth labels that found associations with any anchor box in this scale. This is the n coordinate mentioned in The 1 term sub-section.anchor_idx
is a tensor of indices, denoting the anchor boxes in this scale that were associated with any ground truth labels. This is the a coordinate mentioned in The 1 term sub-section.These 2 index arrays will be used to select the relevant predictions.
To select the relevant predictions, we need these coordinates:
(s,n,b,a,i,j)
The iteration through scales implicitly gives s.
We just got n and a as shown above.
b is obtained by:
# get batch indices of labeled objects batch_idx = label[label_idx, 0].long()
And i,j:
# get cell indices of labeled objects xy_lb = label[label_idx, 1:3] * grid_size xy_idx = torch.floor(xy_lb).long() x_idx = xy_idx[:,0].clamp(0, int(grid_size[0])-1) y_idx = xy_idx[:,1].clamp(0, int(grid_size[1])-1)
These define the ground truth – anchor box associations.
Objectness prediction is not restricted to those “responsible” anchor boxes. Rather, all predictions are included (this is why we needed a
obj_weights
to balance the positive/negative samples):# predicted objectness scores obj_pd = yhatii[..., 4]
Labels may not be associated with anchors of a specific scale. To check this:
# if there are target objects in this scale: if len(batch_idx) > 0: ...
The x, y location predictions are obtained using:
# x,y offests of predictions xy_pd = torch.sigmoid(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 0:2])
And w, h predictions:
# w,h sizes of predictions, in feature map coordinate wh_pd = torch.exp(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 2:4]) * \ anchors[anchor_idx, :] wh_pd = wh_pd.clamp(0, grid_size.max())
Note that we use the (b,a,i,j) coordinates (
[batch_idx, anchor_idx, y_idx, x_idx]
) to take the correct values.Classification loss is computed by first selecting the correct values as before, and constructing one-hot encoded target values:
# classification predictions cls_pd = yhatii[batch_idx, anchor_idx, y_idx, x_idx, 5:] # one-hot encode classes cls_one_hot_lb = F.one_hot(label[label_idx, -1].long(), n_class).float().to(device) # classification loss loss_cls += cls_bce(cls_pd, cls_one_hot_lb)
Note that
torch.nn.BCEWithLogitsLoss
expects values without taking the sigmoid, so don’t passcls_pd
to the sigmoid function, similar forobj_pd
.Having gone through all 3 scales, we sum up all the loss terms and return them:
loss = loss_box + loss_obj + loss_cls return loss, loss_box , loss_obj , loss_cls
3.3. An alternative compute_loss2()
function
Spoiler alert: this function doesn’t work properly. So feel free to skip this.
The above compute_loss()
function selects the “responsible” anchor
boxes by comparing ground truth labels with anchor box priors.
Note that in this workflow, it is possible for more than 1 anchors in different scales to be associated with a same label. I’m not sure whether this would cause any practical harm.
The compute_loss2()
shown here is a closer-to-the-paper version: it
computes the IoU scores between labels and all 9 anchor boxes, and
selects the anchor with the highest IoU. This way, it is ensured that
only 1 anchor box is associated with any label.
3.3.1. select_anchor()
function
To do this association, first create a select_anchor()
function:
def select_anchor(yhat, label, model): n_labels = len(label) n_scales = len(yhat) n_anchors = yhat[0].shape[1] device = label.device batch_idx = label[:, 0].long() best_iou_scales = torch.zeros([n_scales, n_labels], device=device) best_iou_xy_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device) best_iou_wh_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device) best_iou_xy_lb = torch.zeros([n_scales, n_labels, 2], device=device) best_iou_wh_lb = torch.zeros([n_scales, n_labels, 2], device=device) best_iou_ancidx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device) label_x_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device) label_y_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device) # loop through 3 scales for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)): b, na, h, w, _ = yhatii.shape stride = float(yoloii.stride) anchors = (yoloii.anchors / stride).float().to(device) # [n_anchors, 2] grid_size = torch.tensor([w, h]).float().to(device) # get cell indices of labeled objects xy_lb = label[:, 1:3] * grid_size.unsqueeze(0) xyidx = torch.floor(xy_lb).long() xidx = xyidx[:,0].clamp(0, int(grid_size[0])-1) yidx = xyidx[:,1].clamp(0, int(grid_size[1])-1) # relative offsets wrt to cells of labels relxy_lb = xy_lb - xyidx # w, h from labels, convert to feature map scale wh_lb = label[:, 3:5] * grid_size # [n_label, 2] # x,y offests of predictions relxy_pd = torch.sigmoid(yhatii[batch_idx, :, yidx, xidx, 0:2]) # w,h sizes of predictions, in feature map coordinate wh_pd = torch.exp(yhatii[batch_idx, :, yidx, xidx, 2:4]) * anchors.unsqueeze(0) wh_pd = wh_pd.clamp(0, grid_size.max()) # compute IoUs pbox = torch.cat([relxy_pd, wh_pd], dim=2).view([-1, 4]) box_lb = torch.cat([relxy_lb, wh_lb], dim=1).view([-1, 1, 4]) box_lb = box_lb.repeat(1, na, 1).view(-1, 4) iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False) iou = iou.view([-1, na]) best_iou, best_ancidx = torch.max(iou, dim=1) best_iou_scales[ii] = best_iou best_iou_ancidx[ii] = best_ancidx best_iou_xy_pd[ii] = relxy_pd best_iou_wh_pd[ii] = wh_pd best_iou_xy_lb[ii] = relxy_lb best_iou_wh_lb[ii] = wh_lb label_x_idx[ii] = xidx label_y_idx[ii] = yidx # stack along the new scale dimension label_scale_idx = torch.max(best_iou_scales, dim=0)[1] label_anc_idx = best_iou_ancidx[label_scale_idx, torch.arange(n_labels)] label_y_idx = label_y_idx[label_scale_idx, torch.arange(n_labels)] label_x_idx = label_x_idx[label_scale_idx, torch.arange(n_labels)] return label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou_scales,\ best_iou_xy_pd, best_iou_wh_pd, \ best_iou_xy_lb, best_iou_wh_lb
Some explanations:
We initialize some placeholder tensors, then go into the iteration through scales:
best_iou_scales = torch.zeros([n_scales, n_labels], device=device) best_iou_xy_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device) best_iou_wh_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device) best_iou_xy_lb = torch.zeros([n_scales, n_labels, 2], device=device) best_iou_wh_lb = torch.zeros([n_scales, n_labels, 2], device=device) best_iou_ancidx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device) label_x_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device) label_y_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device) # loop through 3 scales for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)): ...
We construct bounding boxes from labels and predictions in a similar manner as in
compute_loss()
, except that we broadcast the bbox shapes to[n_labels, n_anchors, 4]
, then reshape to[n_labels * n_anchors, 4]
. This way, we vectorize the IoU computations between all pairs of labels and anchors in this scale.The computed
iou
term has a shape of[n_labels, n_anchors]
.# compute IoUs pbox = torch.cat([relxy_pd, wh_pd], dim=2).view([-1, 4]) box_lb = torch.cat([relxy_lb, wh_lb], dim=1).view([-1, 1, 4]) box_lb = box_lb.repeat(1, na, 1).view(-1, 4) iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False) iou = iou.view([-1, na])
We then select the best IoU score across anchors in this scale, and store the winner anchor indices and the IoU scores into the placeholder tensors:
best_iou, best_ancidx = torch.max(iou, dim=1) best_iou_scales[ii] = best_iou best_iou_ancidx[ii] = best_ancidx best_iou_xy_pd[ii] = relxy_pd best_iou_wh_pd[ii] = wh_pd best_iou_xy_lb[ii] = relxy_lb best_iou_wh_lb[ii] = wh_lb label_x_idx[ii] = xidx label_y_idx[ii] = yidx
After going through all 3 scales,
best_iou_scales
is a tensor of[3, n_labels]
. Findng its maximum across scales gives us these indices, for each label in the batch:label_scale_idx = torch.max(best_iou_scales, dim=0)[1] label_anc_idx = best_iou_ancidx[label_scale_idx, torch.arange(n_labels)] label_y_idx = label_y_idx[label_scale_idx, torch.arange(n_labels)] label_x_idx = label_x_idx[label_scale_idx, torch.arange(n_labels)]
where:
label_scale_idx
: an array of 0–2 indices for each label in the batch. This is the s coordinate.label_anc_idx
: an array of 0–2 indices for each label in the batch. This is the a coordinate.label_y_idx
: an array of 0–I indices for each label in the batch. This is the i coordinate.label_x_idx
: an array of 0–J indices for each label in the batch. This is the j coordinate.
The b coordinate can be easy obtained from
batch_idx = label[:, 0].long()
. So we don’t have to worry about it now.
3.3.2. compute_loss2()
function
Now the compute_loss2()
function:
def compute_loss2(yhat, label, model, bbox_loss='iou', obj_label='1'): '''Compute multi-task losses Args: yhat (list of tensors): YOLO model output at 3 scales in a list. Each tensor has shape [B, na, h, w, 5 + n_classes]. Where: B: batch_size. na: number of anchors. h: number of rows. w: number of columns. Columns of last dimension: [x_center, y_center, w, h, obj, c1, ..., ck]. label (tensor): ground truth label, in shape (n, 6). n: number of labeled objects in the batch. Columns: [batch_idx, x_center, y_center, w, h, cls]. model (nn.Module): YOLO model. Keyword Args: bbox_loss (str): 'mse': use MSE loss for the x,y centers and w,h sizes. 'iou': use IoU with label bbox as loss. obj_label (str): '1': use 1 as the target objectness score in label locations. 'iou': use IoU between prediction and ground truth as target objectness score in label locations. Returns: loss_box (nn.Variable): loss term from bounding box prediction. loss_obj (nn.Variable): loss term from objectness score prediction. loss_cls (nn.Variable): loss term from classification prediction. ''' n_class = model.n_classes device = label.device # compute a factor to counter unbalanced object labels n_labels = len(label) # num of objects in label n_preds = 0 # total num of predictions for yhatii in yhat: b, na, h, w, _ = yhatii.shape n_preds += na * h * w obj_weights = torch.tensor([(n_preds - n_labels)/n_labels*0.5]).to(device) # prepare loss terms loss_box = torch.zeros(1, device=device) loss_obj = torch.zeros(1, device=device) loss_cls = torch.zeros(1, device=device) if bbox_loss == 'mse': loss_xy = torch.zeros(1, device=device) loss_wh = torch.zeros(1, device=device) # BCE loss func for objectness score and classification obj_bce = nn.BCEWithLogitsLoss(pos_weight=obj_weights) cls_bce = nn.BCEWithLogitsLoss() if bbox_loss == 'mse': # MSE loss func for x,y,w,h xy_mse = nn.MSELoss() wh_mse = nn.MSELoss() batch_idx = label[:,0].long() label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou,\ best_iou_xy_pd, best_iou_wh_pd, \ best_iou_xy_lb, best_iou_wh_lb = select_anchor( yhat, label, model) # loop through scales for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)): s_idxii = torch.where(label_scale_idx==ii)[0] b_idxii = batch_idx[s_idxii] anc_idxii = label_anc_idx[s_idxii] y_idxii = label_y_idx[s_idxii] x_idxii = label_x_idx[s_idxii] iouii = best_iou[ii, s_idxii] obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device) # get target objectness scores if obj_label == '1': obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = 1 else: obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = iouii.detach().clamp(0).type(obj_lb.dtype) # predicted objectness scores obj_pd = yhatii[..., 4] # objectness score loss loss_obj += obj_bce(obj_pd, obj_lb) if len(s_idxii) == 0: continue if bbox_loss == 'mse': relxy_pd = best_iou_xy_pd[ii, s_idxii, anc_idxii] relxy_lb = best_iou_xy_lb[ii, s_idxii] wh_pd = best_iou_wh_pd[ii, s_idxii, anc_idxii] wh_lb = best_iou_wh_lb[ii, s_idxii] # x,y mse loss loss_xy += xy_mse(relxy_pd, relxy_lb) # w,h mse loss loss_wh += wh_mse(wh_pd, wh_lb) / 10 # scale size loss down loss_box += (loss_xy + loss_wh) elif bbox_loss == 'iou': loss_box += (1.0 - iouii).mean() # classification predictions cls_pd = yhatii[b_idxii, anc_idxii, y_idxii, x_idxii, 5:] # one-hot encode classes cls_one_hot_lb = F.one_hot(label[s_idxii, -1].long(), n_class).float().to(device) # classification loss loss_cls += cls_bce(cls_pd, cls_one_hot_lb) loss = loss_box + loss_obj + loss_cls return loss, loss_box , loss_obj , loss_cls
Some more explanations:
Having set up some preparations, we call the above
select_anchor()
function, and go into the scales loop as before:batch_idx = label[:,0].long() label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou,\ best_iou_xy_pd, best_iou_wh_pd, \ best_iou_xy_lb, best_iou_wh_lb = select_anchor( yhat, label, model) # loop through scales for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)): ...
Now get the label coordinates in this scale:
s_idxii = torch.where(label_scale_idx==ii)[0] b_idxii = batch_idx[s_idxii] anc_idxii = label_anc_idx[s_idxii] y_idxii = label_y_idx[s_idxii] x_idxii = label_x_idx[s_idxii] iouii = best_iou[ii, s_idxii]
Here:
s_idxii
is the s coordinate, and denotes labels associated with some anchors in this scale.b_idxii
is the b coordinate, and denotes which images in the batch the selected labels are in.anc_idxii
is the a coordinate, and denotes the matched anchor boxes in this scale.y_idxii
andx_idxii
are the i,j coordinates, and denote the feature map cell locations of the selected labels.iouii
is the best IoU scores of the selected labels.
With these coordinates ready, we then compute the objectness loss. Again, depending on
obj_label
, I tried different target values:obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device) # get target objectness scores if obj_label == '1': obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = 1 else: obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = iouii.detach().clamp(0).type(obj_lb.dtype) # predicted objectness scores obj_pd = yhatii[..., 4] # objectness score loss loss_obj += obj_bce(obj_pd, obj_lb)
Depending on
bbox_loss
argument, theloss_box
term:if bbox_loss == 'mse': relxy_pd = best_iou_xy_pd[ii, s_idxii, anc_idxii] relxy_lb = best_iou_xy_lb[ii, s_idxii] wh_pd = best_iou_wh_pd[ii, s_idxii, anc_idxii] wh_lb = best_iou_wh_lb[ii, s_idxii] # x,y mse loss loss_xy += xy_mse(relxy_pd, relxy_lb) # w,h mse loss loss_wh += wh_mse(wh_pd, wh_lb) / 10 # scale size loss down loss_box += (loss_xy + loss_wh) elif bbox_loss == 'iou': loss_box += (1.0 - iouii).mean()
Note that we don’t need to re-compute everything from ground up again. We can query the returned values of
select_anchor()
to save some efforts.Classification loss is computed much as in
compute_loss()
:# classification predictions cls_pd = yhatii[b_idxii, anc_idxii, y_idxii, x_idxii, 5:] # one-hot encode classes cls_one_hot_lb = F.one_hot(label[s_idxii, -1].long(), n_class).float().to(device) # classification loss loss_cls += cls_bce(cls_pd, cls_one_hot_lb)
3.4. The train.py
script
Now create a train.py
script and put it into the YOLOv3_pytorch
project folder. Fill it with the following content:
from __future__ import print_function import os import numpy as np import torch import torch.nn as nn import torch.nn.functional as F from config import load_config from model import Darknet53 from utils import compute_IOU, batch_NMS, compute_mAP from loader import create_loader try: from torch.utils.tensorboard import SummaryWriter HAS_TENSORBOARD = True except: HAS_TENSORBOARD = False def lr_schedular(optimizer, iteration, warmup_iter, initial_lr, peak_lr, power=1): ... return lr def compute_loss(yhat, label, model, bbox_loss='iou', obj_label='1'): ... return loss, loss_box , loss_obj , loss_cls def compute_loss2(yhat, label, model, bbox_loss='iou', obj_label='1'): ... return loss, loss_box , loss_obj , loss_cls def select_anchor(yhat, label, model): ... return label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou_scales,\ best_iou_xy_pd, best_iou_wh_pd, \ best_iou_xy_lb, best_iou_wh_lb #-------------Main--------------------------------- if __name__=='__main__': #--------------------Load model config-------------------- CONFIG_FILE = './config/yolov3.cfg' net_config, module_list = load_config.parse_config(CONFIG_FILE) config = {'net': net_config} config['module_list'] = module_list config['width'] = 416 config['height'] = 416 config['n_classes'] = 80 config['max_data_size'] = 100 config['batch_size'] = 4 config['is_train'] = True config['conf_thres'] = 0.3 config['nms_iou_thres'] = 0.5 config['map_iou_thres'] = 0.5 # experiment parameters BBOX_LOSS = 'iou' # 'iou' or 'mse' OBJ_LABEL = 'iou' # 'iou' or '1' EXP = '%s-%s' %(BBOX_LOSS, OBJ_LABEL) # training parameters LR0 = 1.0*1e-4 PEAK_LR = 8.*1e-4 WEIGHT_DECAY = 1e-4 EPOCHS = 150 WARMUP_ITER = 4e3 EVAL_INTEVAL = 10 # folders DATA_FOLDER = './data/coco' CKPT_FOLDER = './ckpt/' + EXP LOG_DIR = './runs/' + EXP #-------------------Create model------------------- model = Darknet53(config) #--------------Get dataset and dataloader-------------- dataset, dataloader = create_loader(DATA_FOLDER, config, shuffle=False) id2class = dataset.id2class #--------------------Load model-------------------- model = Darknet53(config) device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') print('######### Using device:', device, '############\n') model.train().to(device=device) #--------------------Optimizer-------------------- opt = torch.optim.Adam(model.parameters(), lr=LR0, weight_decay=WEIGHT_DECAY) #------------------Output folder------------------ os.makedirs(CKPT_FOLDER, exist_ok=True) ckpt_file = os.path.join(CKPT_FOLDER, 'ckpt.pt') # load check point if exists if os.path.exists(ckpt_file): print('####### Load ckpt #########') print('ckpt file:', ckpt_file) ckpt = torch.load(ckpt_file) model.load_state_dict(ckpt['model_state_dict']) opt.load_state_dict(ckpt['optimizer_state_dict']) epoch0 = ckpt['epoch'] else: epoch0 = 0 if HAS_TENSORBOARD: writer = SummaryWriter(log_dir = LOG_DIR) #------------------Start training------------------ total_iters = 0 # total number of iterations for ee in range(epoch0, epoch0+EPOCHS): print('\n#### Entering epoch: %d ########' %ee) # keep track of training loss train_loss_bbox = [] train_loss_obj = [] train_loss_cls = [] train_loss_total = [] # store for mAP computation pred_epoch = [] label_epoch = [] for ii, (imgii, labelii) in enumerate(dataloader): total_iters += ii total_seen = len(dataset) * ee + len(imgii) model.train() # run model imgii = imgii.to(device) labelii = labelii.to(device) yhatii = model(imgii) # compute loss, back-prop lossii, loss_bboxii, loss_objii, loss_clsii = compute_loss( yhatii, labelii, model, bbox_loss=BBOX_LOSS, obj_label=OBJ_LABEL) lossii.backward() opt.step() opt.zero_grad() # update learning rate lr = lr_schedular(opt, total_iters, WARMUP_ITER, LR0, PEAK_LR) train_loss_bbox.append(loss_bboxii.item()) train_loss_obj.append(loss_objii.item()) train_loss_cls.append(loss_clsii.item()) train_loss_total.append(lossii.item()) # evaluate if ii % EVAL_INTEVAL == 0: model.eval() with torch.no_grad(): yhatii = model(imgii) # compute NMS labelii = labelii.cpu().numpy() yhatii = yhatii.detach().cpu().numpy() yhatii = batch_NMS(yhatii, config['conf_thres'], config['nms_iou_thres']) if len(yhatii): # convert to fractional coordinates and sizes, to match with labels yhatii[:, [1,3]] /= config['width'] yhatii[:, [2,4]] /= config['height'] pred_epoch.append(yhatii) label_epoch.append(labelii) # print loss print('ii = %d, Total loss = %.1f, box loss = %.2f, Obj loss = %.2f, Cls loss = %.2f'\ %(ii, lossii.item(), loss_bboxii.item(), loss_objii.item(), loss_clsii.item())) if HAS_TENSORBOARD: # training loss every iteration writer.add_scalar('iLoss/train_bbox', train_loss_bbox[-1], total_iters) writer.add_scalar('iLoss/train_obj', train_loss_obj[-1], total_iters) writer.add_scalar('iLoss/train_cls', train_loss_cls[-1], total_iters) writer.add_scalar('iLoss/train_loss', train_loss_total[-1], total_iters) writer.add_scalar('iLearning_rate/lr', lr, total_iters) # compute mAP of epoch if len(pred_epoch): pred_epoch = np.vstack(pred_epoch) label_epoch = np.vstack(label_epoch) mAP_ee = compute_mAP(pred_epoch, label_epoch, 0, config['map_iou_thres']) else: mAP_ee = 0 if HAS_TENSORBOARD: # training loss every epoch writer.add_scalar('eLoss/train_bbox', np.mean(train_loss_bbox), ee+1) writer.add_scalar('eLoss/train_obj', np.mean(train_loss_obj), ee+1) writer.add_scalar('eLoss/train_cls', np.mean(train_loss_cls), ee+1) writer.add_scalar('eLoss/train_loss', np.mean(train_loss_total), ee+1) writer.add_scalar('eLearning_rate/lr', lr, ee+1) writer.add_scalar('emAP/map', mAP_ee, ee+1) print('\n############ Save model #############') print('Save to', ckpt_file) torch.save({'epoch': ee, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': opt.state_dict(), 'loss': lossii.item(),}, ckpt_file)
I’ve omitted the compute_loss()
, compute_loss2()
and
select_anchor()
definitions.
There is a lr_schedular()
function that updates the learning rate:
def lr_schedular(optimizer, iteration, warmup_iter, initial_lr, peak_lr, power=1): if iteration == 0: iteration += 1 lr = min(1 / iteration**power, iteration / warmup_iter**(power + 1)) *\ warmup_iter**power * (peak_lr - initial_lr) + initial_lr lr = max(lr, 1e-7) for param_group in optimizer.param_groups: param_group['lr'] = lr return lr
It increases the learning rate from the initial value of initial_lr
to peak_lr
during a warm-up phase of warmup_iter
iterations, and
slowly decreases exponentially afterwards.
The script uses tensorboard
to record the training losses across
iterations and epochs, and uses our previously developed
compute_mAP()
to measure the performance every epoch.
Note that for test purposes, I’m using only a tiny fraction of
the entire dataset (config['max_data_size'] = 100
), and computing
the mAP on the training data itself.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK