Create YOLOv3 using PyTorch from scratch (Part-6)

 1 year ago
source link: https://numbersmithy.com/create-yolov3-using-pytorch-from-scratch-part-6/

3. PyTorch implementation

3.1. Quick recap

Let’s do a quick recap on the format of model outputs and target labels.

We have the following snippet in the training iteration:

for ii, (imgii, labelii) in enumerate(dataloader):
    imgii = imgii.to(device)
    labelii = labelii.to(device)
    yhatii = model(imgii)
    lossii = compute_loss(yhatii, labelii, model)


  • imgii is a 4D tensor, with shape [batch_size, 3, 416, 416].
  • labelii is a 2D tensor, with shape [n_labels, 6]. The 6 columns are:

      [batch_idx, x_center, y_center, w, h, cls]
  • yhatii is the model output in training mode. It is a list of 3 tensors, corresponding to predictions at 3 size scales. Each tensor has a shape of [batch_size, n_anchors, h, w, 5 + n_classes].

More details about these are given in Part-2, Build the model backbone, and Part-5, Training data preparation.
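
As a quick sanity check of the shapes listed above, the following sketch works out the three output shapes for a 416 x 416 input. The strides (32, 16, 8), 3 anchors per scale and 80 classes are the standard YOLOv3/COCO settings and are assumed here, as is the order of the scales. output_shapes() is a hypothetical helper, not part of the project code.

```python
def output_shapes(batch_size, img_size=416, strides=(32, 16, 8),
                  n_anchors=3, n_classes=80):
    '''Return the expected shape of the model output at each scale.'''
    shapes = []
    for stride in strides:
        h = w = img_size // stride          # feature map size at this scale
        shapes.append((batch_size, n_anchors, h, w, 5 + n_classes))
    return shapes

shapes = output_shapes(batch_size=4)
for shape in shapes:
    print(shape)
```

The last dimension holds the 4 box values, the objectness score, and one score per class.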

3.2. The compute_loss() function

Code first:

def compute_loss(yhat, label, model, bbox_loss='iou', obj_label='1'):
    '''Compute multi-task losses

    Args:
        yhat (list of tensors): YOLO model output at 3 scales in a list. Each
            tensor has shape [B, na, h, w, 5 + n_classes]. Where:
            B: batch_size. na: number of anchors.
            h: number of rows. w: number of columns.
            Columns of last dimension: [x_center, y_center, w, h, obj, c1, ..., ck].
        label (tensor): ground truth label, in shape (n, 6). n: number of labeled
            objects in the batch. Columns: [batch_idx, x_center, y_center, w, h, cls].
        model (nn.Module): YOLO model.
    Keyword Args:
        bbox_loss (str): 'mse': use MSE loss for the x,y centers and w,h sizes.
            'iou': use IoU with label bbox as loss.
        obj_label (str): '1': use 1 as the target objectness score in label
            locations. 'iou': use IoU between prediction and ground truth as
            target objectness score in label locations.
    Returns:
        loss (tensor): total loss, the sum of the 3 terms below.
        loss_box (tensor): loss term from bounding box prediction.
        loss_obj (tensor): loss term from objectness score prediction.
        loss_cls (tensor): loss term from classification prediction.
    '''

    n_class = model.n_classes
    device = label.device

    # compute a factor to counter unbalanced object labels
    n_labels = len(label)   # num of objects in label
    n_preds = 0             # total num of predictions
    for yhatii in yhat:
        b, na, h, w, _ = yhatii.shape
        n_preds += na * h * w

    obj_weights = torch.tensor([(n_preds - n_labels)/n_labels*0.5]).to(device)

    # prepare loss terms
    loss_box = torch.zeros(1, device=device)
    loss_obj = torch.zeros(1, device=device)
    loss_cls = torch.zeros(1, device=device)
    if bbox_loss == 'mse':
        loss_xy = torch.zeros(1, device=device)
        loss_wh = torch.zeros(1, device=device)

    # BCE loss func for objectness score and classification
    obj_bce = nn.BCEWithLogitsLoss(pos_weight=obj_weights)
    cls_bce = nn.BCEWithLogitsLoss()

    if bbox_loss == 'mse':
        # MSE loss func for x,y,w,h
        xy_mse = nn.MSELoss()
        wh_mse = nn.MSELoss()

    # loop through 3 scales
    for yhatii, yoloii in zip(yhat, model.yolo_layers):

        b, na, h, w, _ = yhatii.shape
        stride = float(yoloii.stride)
        anchors = (yoloii.anchors / stride).float().to(device) # [n_anchors, 2]
        grid_size = torch.tensor([w, h]).float().to(device)

        # w, h from labels, convert to feature map scale
        wh_lb = label[:, 3:5] * grid_size  # [n_label, 2]

        # find matches between labels and anchors
        ratio = wh_lb[:, None, :] / anchors[None, :, :]  # [n_label, n_anchors, 2]
        ratio = torch.abs(ratio - 1).sum(2)  # [n_label, n_anchors]
        # select anchor boxes with closest ratios
        ratio = torch.min(ratio, dim=1)
        # labeled object index
        label_idx = ratio[0] < 2     # 2 is empirical
        # anchor index
        anchor_idx = ratio[1][label_idx]

        # get batch indices of labeled objects
        batch_idx = label[label_idx, 0].long()

        # get cell indices of labeled objects
        xy_lb = label[label_idx, 1:3] * grid_size
        xy_idx = torch.floor(xy_lb).long()
        x_idx = xy_idx[:,0].clamp(0, int(grid_size[0])-1)
        y_idx = xy_idx[:,1].clamp(0, int(grid_size[1])-1)

        # get target objectness scores
        obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device)
        if obj_label == '1':
            obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = 1

        # predicted objectness scores
        obj_pd = yhatii[..., 4]

        # if there are target objects in this scale:
        if len(batch_idx) > 0:
            # relative offsets of labels w.r.t. their cells
            relxy_lb = xy_lb - xy_idx

            # x,y offsets of predictions
            xy_pd = torch.sigmoid(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 0:2])

            # w,h sizes of labels
            wh_lb = label[label_idx, 3:5] * grid_size

            # w,h sizes of predictions, in feature map coordinate
            wh_pd = torch.exp(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 2:4]) * anchors[anchor_idx, :]
            wh_pd = wh_pd.clamp(0, grid_size.max())

            if bbox_loss == 'mse':
                # x,y mse loss
                loss_xy += xy_mse(xy_pd, relxy_lb)
                # w,h mse loss
                loss_wh += wh_mse(wh_pd, wh_lb) / 10   # scale size loss down
                loss_box += (loss_xy + loss_wh)

            if bbox_loss == 'iou' or obj_label == 'iou':
                # compute IoUs
                pbox = torch.cat([xy_pd, wh_pd], dim=1)
                box_lb = torch.cat([relxy_lb, wh_lb], dim=1)
                iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False)
                loss_box += (1.0 - iou).mean()

                if obj_label == 'iou':
                    # Use cells with iou > 0 as object targets
                    obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = iou.detach().clamp(0).type(obj_lb.dtype)
            # classification predictions
            cls_pd = yhatii[batch_idx, anchor_idx, y_idx, x_idx, 5:]

            # one-hot encode classes
            cls_one_hot_lb = F.one_hot(label[label_idx, -1].long(), n_class).float().to(device)
            # classification loss
            loss_cls += cls_bce(cls_pd, cls_one_hot_lb)

        # objectness score loss
        loss_obj += obj_bce(obj_pd, obj_lb)

    loss = loss_box + loss_obj + loss_cls

    return loss, loss_box , loss_obj , loss_cls
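
To make the anchor matching step in the function above concrete, here is a small sketch that runs the same ratio test on made-up numbers. The anchor sizes and label sizes are invented for illustration (both already in feature map units); only the matching logic mirrors compute_loss().

```python
import torch

# 3 hypothetical anchors at one scale, in feature map units: [n_anchors, 2]
anchors = torch.tensor([[1.0, 1.5], [2.0, 4.0], [4.0, 3.0]])
# 3 hypothetical label sizes (w, h), in feature map units: [n_label, 2]
wh_lb = torch.tensor([[2.2, 3.8],     # close to anchor 1
                      [12.0, 12.0],   # much larger than every anchor
                      [1.0, 1.0]])    # close to anchor 0

ratio = wh_lb[:, None, :] / anchors[None, :, :]   # [n_label, n_anchors, 2]
dist = torch.abs(ratio - 1).sum(2)                # [n_label, n_anchors]
min_dist, min_anchor = torch.min(dist, dim=1)

label_idx = min_dist < 2             # boolean mask over labels
anchor_idx = min_anchor[label_idx]   # matched anchor for each kept label

print(label_idx.tolist())
print(anchor_idx.tolist())
```

The second label is dropped at this scale because even its closest anchor is too far off. In the real function it can still be matched by the anchors of another scale.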

More explanations:

  • The bbox_loss input argument has 2 choices: 'mse' or 'iou'. These are the 2 ways of formulating the localization loss mentioned above. I added this only for experiment purposes.

    If this is set to 'mse', an MSELoss() loss function is created for the x, y coordinates, and another one for the w, h sizes:

      if bbox_loss == 'mse':
          # MSE loss func for x,y,w,h
          xy_mse = nn.MSELoss()
          wh_mse = nn.MSELoss()

    Later, the loss_box term is computed as the sum of the 2:

                if bbox_loss == 'mse':
                    # x,y mse loss
                    loss_xy += xy_mse(xy_pd, relxy_lb)
                    # w,h mse loss
                    loss_wh += wh_mse(wh_pd, wh_lb) / 10   # scale size loss down
                    loss_box += (loss_xy + loss_wh)

    If bbox_loss is set to 'iou', loss_box is computed using:

                if bbox_loss == 'iou' or obj_label == 'iou':
                    # compute IoUs
                    pbox = torch.cat([xy_pd, wh_pd], dim=1)
                    box_lb = torch.cat([relxy_lb, wh_lb], dim=1)
                    iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False)
                    loss_box += (1.0 - iou).mean()
  • The obj_label input argument has 2 choices: '1' or 'iou'. These are the 2 ways of setting the objectness target values mentioned above. Again, for test purposes only.

    If obj_label == '1', set objectness target values to 1:

            # get target objectness scores
            obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device)
            if obj_label == '1':
                obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = 1

    where batch_idx, anchor_idx, y_idx, x_idx correspond to the (b,a,i,j) coordinates talked about earlier. More on this later.

    If obj_label == 'iou', set objectness target values to IoU:

                if bbox_loss == 'iou' or obj_label == 'iou':
                    # compute IoUs
                    pbox = torch.cat([xy_pd, wh_pd], dim=1)
                    box_lb = torch.cat([relxy_lb, wh_lb], dim=1)
                    iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False)
                    loss_box += (1.0 - iou).mean()
                    if obj_label == 'iou':
                        # Use cells with iou > 0 as object targets
                        obj_lb[batch_idx, anchor_idx, y_idx, x_idx] = iou.detach().clamp(0).type(obj_lb.dtype)
  • I counted the number of objects/labels in the batch: n_labels = len(label). And the total number of predictions made by the model:

        n_preds = 0             # total num of predictions
        for yhatii in yhat:
            b, na, h, w, _ = yhatii.shape
            n_preds += na * h * w

    Recall that using standard settings, this is (52 * 52 + 26 * 26 + 13 * 13) * 3 = 10647. So there would be a big imbalance between positive and negative samples. To counter this, YOLOv1 used the λnoobj scaling factor mentioned above. I think a more specific number could actually be worked out from data:

      obj_weights = torch.tensor([(n_preds - n_labels)/n_labels*0.5]).to(device)

    The *0.5 scaling is to tune down the ratio a bit, and it’s purely empirical.

    This obj_weights variable works with PyTorch’s BCEWithLogitsLoss, to give weights to positive samples:

      # BCE loss func for objectness score and classification
      obj_bce = nn.BCEWithLogitsLoss(pos_weight=obj_weights)
  • We then loop through the 3 size scales of YOLO prediction, and get some size information first:

        # loop through 3 scales
        for yhatii, yoloii in zip(yhat, model.yolo_layers):
            b, na, h, w, _ = yhatii.shape
            stride = float(yoloii.stride)
            anchors = (yoloii.anchors / stride).float().to(device) # [n_anchors, 2]
            grid_size = torch.tensor([w, h]).float().to(device)

    Note that the anchors are converted to feature map coordinates (see Part-1), and grid_size is by definition the feature map size.

  • The association between ground truth labels and “responsible” anchor boxes in this scale is achieved by comparing their width/height ratios:

            # w, h from labels, convert to feature map scale
            wh_lb = label[:, 3:5] * grid_size  # [n_label, 2]
            # find matches between labels and anchors
            ratio = wh_lb[:, None, :] / anchors[None, :, :]  # [n_label, n_anchors, 2]

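    The best match for each label is the anchor whose width and height ratios deviate least from 1.0, measured as the sum of the 2 absolute deviations. Labels whose smallest deviation is 2 or more are left unmatched in this scale: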

            ratio = torch.abs(ratio - 1).sum(2)  # [n_label, n_anchors]
            # select anchor boxes with closest ratios
            ratio = torch.min(ratio, dim=1)
            # labeled object index
            label_idx = ratio[0] < 2     # 2 is empirical
            # anchor index
            anchor_idx = ratio[1][label_idx]

    label_idx is a boolean mask over the labels, marking those ground truth labels that found a matching anchor box in this scale. This is the n coordinate mentioned in The 1 term sub-section.

    anchor_idx is a tensor of indices, giving the matched anchor box in this scale for each selected label. This is the a coordinate mentioned in The 1 term sub-section.

    These 2 index arrays will be used to select the relevant predictions.

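  • To select the relevant predictions, we need these coordinates: s (scale), n (label), b (image in the batch), a (anchor box), and i, j (row and column of the feature map cell).
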
    The iteration through scales implicitly gives s.

    We just got n and a as shown above.

    b is obtained by:

            # get batch indices of labeled objects
            batch_idx = label[label_idx, 0].long()

    And i,j:

            # get cell indices of labeled objects
            xy_lb = label[label_idx, 1:3] * grid_size
            xy_idx = torch.floor(xy_lb).long()
            x_idx = xy_idx[:,0].clamp(0, int(grid_size[0])-1)
            y_idx = xy_idx[:,1].clamp(0, int(grid_size[1])-1)

    These define the ground truth – anchor box associations.

  • Objectness prediction is not restricted to those “responsible” anchor boxes. Rather, all predictions are included (this is why we needed an obj_weights factor to balance the positive/negative samples):

            # predicted objectness scores
            obj_pd = yhatii[..., 4]
  • Labels may not be associated with anchors of a specific scale. To check this:

            # if there are target objects in this scale:
            if len(batch_idx) > 0:
  • The x, y location predictions are obtained using:

                # x,y offsets of predictions
                xy_pd = torch.sigmoid(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 0:2])

    And w, h predictions:

                # w,h sizes of predictions, in feature map coordinate
                wh_pd = torch.exp(yhatii[batch_idx, anchor_idx, y_idx, x_idx, 2:4]) * \
                    anchors[anchor_idx, :]
                wh_pd = wh_pd.clamp(0, grid_size.max())

    Note that we use the (b,a,i,j) coordinates ([batch_idx, anchor_idx, y_idx, x_idx]) to take the correct values.

  • Classification loss is computed by first selecting the correct values as before, and constructing one-hot encoded target values:

                # classification predictions
                cls_pd = yhatii[batch_idx, anchor_idx, y_idx, x_idx, 5:]
                # one-hot encode classes
                cls_one_hot_lb = F.one_hot(label[label_idx, -1].long(), n_class).float().to(device)
                # classification loss
                loss_cls += cls_bce(cls_pd, cls_one_hot_lb)

    Note that torch.nn.BCEWithLogitsLoss applies the sigmoid internally and expects raw logits, so don’t pass cls_pd through the sigmoid function first. The same goes for obj_pd.

  • Having gone through all 3 scales, we sum up all the loss terms and return them:

        loss = loss_box + loss_obj + loss_cls
        return loss, loss_box , loss_obj , loss_cls
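
To get a feel for the size of obj_weights, the arithmetic above can be traced by hand. The sketch below assumes the standard 10647 predictions and a made-up batch containing 20 labeled objects. weighted_bce() is a hypothetical helper that spells out, for a single logit, the formula PyTorch documents for BCEWithLogitsLoss with pos_weight.

```python
import math

n_preds = (52 * 52 + 26 * 26 + 13 * 13) * 3   # 10647
n_labels = 20                                  # made-up number of objects
obj_weight = (n_preds - n_labels) / n_labels * 0.5

def weighted_bce(logit, target, pos_weight):
    '''Single-element BCE with logits; the positive term is scaled by pos_weight.'''
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * target * math.log(p) + (1 - target) * math.log(1 - p))

print(obj_weight)
print(weighted_bce(0.0, 1.0, obj_weight))   # loss of an undecided positive cell
print(weighted_bce(0.0, 0.0, obj_weight))   # loss of an undecided negative cell
```

With these numbers a missed positive cell costs a few hundred times more than a negative one, which roughly offsets the fact that negatives outnumber positives by a similar factor.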

3.3. An alternative compute_loss2() function

Spoiler alert: this function doesn’t work properly. So feel free to skip this.

The above compute_loss() function selects the “responsible” anchor boxes by comparing ground truth labels with anchor box priors.

Note that in this workflow, more than 1 anchor, in different scales, can be associated with the same label. I’m not sure whether this would cause any practical harm.

The compute_loss2() shown here is a closer-to-the-paper version: it computes the IoU scores between labels and all 9 anchor boxes, and selects the anchor with the highest IoU. This way, it is ensured that only 1 anchor box is associated with any label.
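
The IoU scores mentioned above come from compute_IOU(), which is imported from utils and is not listed in this part. As a rough stand-in for readers who want to follow along, here is a minimal plain-IoU sketch for paired boxes given as [x_center, y_center, w, h]. The name iou_xywh and its behaviour are assumptions for illustration. The real compute_IOU() takes extra arguments (x1y1x2y2, change_enclose) and may differ.

```python
import torch

def iou_xywh(box1, box2, eps=1e-9):
    '''Plain IoU between paired boxes, both [n, 4] as [xc, yc, w, h].'''
    # convert centers and sizes to corner coordinates
    b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
    b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
    b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
    b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2

    # intersection width/height, clamped at 0 for non-overlapping boxes
    inter_w = (torch.min(b1_x2, b2_x2) - torch.max(b1_x1, b2_x1)).clamp(min=0)
    inter_h = (torch.min(b1_y2, b2_y2) - torch.max(b1_y1, b2_y1)).clamp(min=0)
    inter = inter_w * inter_h

    union = box1[:, 2] * box1[:, 3] + box2[:, 2] * box2[:, 3] - inter
    return inter / (union + eps)

# identical boxes, half-shifted boxes, and disjoint boxes (made-up values)
pbox = torch.tensor([[0.5, 0.5, 2.0, 2.0], [0.5, 0.5, 2.0, 2.0], [0.5, 0.5, 1.0, 1.0]])
box_lb = torch.tensor([[0.5, 0.5, 2.0, 2.0], [1.5, 0.5, 2.0, 2.0], [5.0, 5.0, 1.0, 1.0]])
iou = iou_xywh(pbox, box_lb)
print(iou)
```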

3.3.1. select_anchor() function

To do this association, first create a select_anchor() function:

def select_anchor(yhat, label, model):

    n_labels = len(label)
    n_scales = len(yhat)
    n_anchors = yhat[0].shape[1]
    device = label.device
    batch_idx = label[:, 0].long()

    best_iou_scales = torch.zeros([n_scales, n_labels], device=device)
    best_iou_xy_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device)
    best_iou_wh_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device)
    best_iou_xy_lb = torch.zeros([n_scales, n_labels, 2], device=device)
    best_iou_wh_lb = torch.zeros([n_scales, n_labels, 2], device=device)

    best_iou_ancidx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device)
    label_x_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device)
    label_y_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device)

    # loop through 3 scales
    for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)):

        b, na, h, w, _ = yhatii.shape
        stride = float(yoloii.stride)
        anchors = (yoloii.anchors / stride).float().to(device) # [n_anchors, 2]
        grid_size = torch.tensor([w, h]).float().to(device)

        # get cell indices of labeled objects
        xy_lb = label[:, 1:3] * grid_size.unsqueeze(0)
        xyidx = torch.floor(xy_lb).long()
        xidx = xyidx[:,0].clamp(0, int(grid_size[0])-1)
        yidx = xyidx[:,1].clamp(0, int(grid_size[1])-1)

        # relative offsets of labels w.r.t. their cells
        relxy_lb = xy_lb - xyidx

        # w, h from labels, convert to feature map scale
        wh_lb = label[:, 3:5] * grid_size  # [n_label, 2]

        # x,y offsets of predictions
        relxy_pd = torch.sigmoid(yhatii[batch_idx, :, yidx, xidx, 0:2])

        # w,h sizes of predictions, in feature map coordinate
        wh_pd = torch.exp(yhatii[batch_idx, :, yidx, xidx, 2:4]) * anchors.unsqueeze(0)
        wh_pd = wh_pd.clamp(0, grid_size.max())

        # compute IoUs
        pbox = torch.cat([relxy_pd, wh_pd], dim=2).view([-1, 4])
        box_lb = torch.cat([relxy_lb, wh_lb], dim=1).view([-1, 1, 4])
        box_lb = box_lb.repeat(1, na, 1).view(-1, 4)
        iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False)
        iou = iou.view([-1, na])

        best_iou, best_ancidx = torch.max(iou, dim=1)

        best_iou_scales[ii] = best_iou
        best_iou_ancidx[ii] = best_ancidx

        best_iou_xy_pd[ii] = relxy_pd
        best_iou_wh_pd[ii] = wh_pd

        best_iou_xy_lb[ii] = relxy_lb
        best_iou_wh_lb[ii] = wh_lb

        label_x_idx[ii] = xidx
        label_y_idx[ii] = yidx

    # stack along the new scale dimension
    label_scale_idx = torch.max(best_iou_scales, dim=0)[1]
    label_anc_idx = best_iou_ancidx[label_scale_idx, torch.arange(n_labels)]
    label_y_idx = label_y_idx[label_scale_idx, torch.arange(n_labels)]
    label_x_idx = label_x_idx[label_scale_idx, torch.arange(n_labels)]

    return label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou_scales,\
            best_iou_xy_pd, best_iou_wh_pd, \
            best_iou_xy_lb, best_iou_wh_lb
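
The repeat/view step that pairs every label with every anchor in the function above is easy to get wrong, so here is a toy check of the layout. The numbers are made up. The point is only that row k of the flattened tensors corresponds to label k // na and anchor k % na.

```python
import torch

n_labels, na = 2, 3
# fake per-anchor predictions: value = 10 * label + anchor, shape [n_labels, na, 4]
pbox = (10 * torch.arange(n_labels)[:, None] + torch.arange(na)[None, :]).float()
pbox = pbox[:, :, None].repeat(1, 1, 4)
# fake labels: value = label index, shape [n_labels, 4]
box_lb = torch.arange(n_labels).float()[:, None].repeat(1, 4)

# same flattening as in select_anchor()
pbox_flat = pbox.view(-1, 4)                                       # [n_labels * na, 4]
box_lb_flat = box_lb.view(-1, 1, 4).repeat(1, na, 1).view(-1, 4)   # [n_labels * na, 4]

print(pbox_flat[:, 0].tolist())
print(box_lb_flat[:, 0].tolist())
```

Each label row is repeated na times in a row, so it lines up with the na predictions made in its cell, and a final view([-1, na]) on the IoU restores the [n_labels, n_anchors] layout.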

Some explanations:

  • We initialize some placeholder tensors, then go into the iteration through scales:

        best_iou_scales = torch.zeros([n_scales, n_labels], device=device)
        best_iou_xy_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device)
        best_iou_wh_pd = torch.zeros([n_scales, n_labels, n_anchors, 2], device=device)
        best_iou_xy_lb = torch.zeros([n_scales, n_labels, 2], device=device)
        best_iou_wh_lb = torch.zeros([n_scales, n_labels, 2], device=device)
        best_iou_ancidx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device)
        label_x_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device)
        label_y_idx = torch.zeros([n_scales, n_labels], dtype=torch.long, device=device)
        # loop through 3 scales
        for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)):
  • We construct bounding boxes from labels and predictions in a similar manner as in compute_loss(), except that we broadcast the bbox shapes to [n_labels, n_anchors, 4], then reshape to [n_labels * n_anchors, 4]. This way, we vectorize the IoU computations between all pairs of labels and anchors in this scale.

    The computed iou term has a shape of [n_labels, n_anchors].

            # compute IoUs
            pbox = torch.cat([relxy_pd, wh_pd], dim=2).view([-1, 4])
            box_lb = torch.cat([relxy_lb, wh_lb], dim=1).view([-1, 1, 4])
            box_lb = box_lb.repeat(1, na, 1).view(-1, 4)
            iou = compute_IOU(pbox, box_lb, x1y1x2y2=False, change_enclose=False)
            iou = iou.view([-1, na])
  • We then select the best IoU score across anchors in this scale, and store the winner anchor indices and the IoU scores into the placeholder tensors:

            best_iou, best_ancidx = torch.max(iou, dim=1)
            best_iou_scales[ii] = best_iou
            best_iou_ancidx[ii] = best_ancidx
            best_iou_xy_pd[ii] = relxy_pd
            best_iou_wh_pd[ii] = wh_pd
            best_iou_xy_lb[ii] = relxy_lb
            best_iou_wh_lb[ii] = wh_lb
            label_x_idx[ii] = xidx
            label_y_idx[ii] = yidx
  • After going through all 3 scales, best_iou_scales is a tensor of shape [3, n_labels]. Finding its maximum across scales gives us these indices, for each label in the batch:

        label_scale_idx = torch.max(best_iou_scales, dim=0)[1]
        label_anc_idx = best_iou_ancidx[label_scale_idx, torch.arange(n_labels)]
        label_y_idx = label_y_idx[label_scale_idx, torch.arange(n_labels)]
        label_x_idx = label_x_idx[label_scale_idx, torch.arange(n_labels)]


    • label_scale_idx: an array of 0–2 indices for each label in the batch. This is the s coordinate.
    • label_anc_idx: an array of 0–2 indices for each label in the batch. This is the a coordinate.
    • label_y_idx: an array of 0–I indices for each label in the batch. This is the i coordinate.
    • label_x_idx: an array of 0–J indices for each label in the batch. This is the j coordinate.

    The b coordinate can be easily obtained from batch_idx = label[:, 0].long(), so we don’t have to worry about it now.
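
The across-scale selection can be checked on a toy example. The IoU values and anchor indices below are invented. The indexing pattern is the one used in select_anchor().

```python
import torch

n_labels = 4
# best IoU per scale and label, shape [n_scales, n_labels] (made-up values)
best_iou_scales = torch.tensor([[0.1, 0.7, 0.2, 0.0],
                                [0.5, 0.3, 0.2, 0.1],
                                [0.2, 0.6, 0.9, 0.4]])
# winning anchor index per scale and label (made-up values)
best_iou_ancidx = torch.tensor([[0, 1, 2, 0],
                                [2, 2, 1, 1],
                                [1, 0, 0, 2]])

# for each label: which scale holds the highest IoU
label_scale_idx = torch.max(best_iou_scales, dim=0)[1]
# for each label: the anchor index stored at its winning scale
label_anc_idx = best_iou_ancidx[label_scale_idx, torch.arange(n_labels)]

print(label_scale_idx.tolist())
print(label_anc_idx.tolist())
```

Indexing with the pair (label_scale_idx, torch.arange(n_labels)) picks exactly one entry per column, which is how each label ends up with a single (s, a) pair.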

3.3.2. compute_loss2() function

Now the compute_loss2() function:

def compute_loss2(yhat, label, model, bbox_loss='iou', obj_label='1'):
    '''Compute multi-task losses

    Args:
        yhat (list of tensors): YOLO model output at 3 scales in a list. Each
            tensor has shape [B, na, h, w, 5 + n_classes]. Where:
            B: batch_size. na: number of anchors.
            h: number of rows. w: number of columns.
            Columns of last dimension: [x_center, y_center, w, h, obj, c1, ..., ck].
        label (tensor): ground truth label, in shape (n, 6). n: number of labeled
            objects in the batch. Columns: [batch_idx, x_center, y_center, w, h, cls].
        model (nn.Module): YOLO model.
    Keyword Args:
        bbox_loss (str): 'mse': use MSE loss for the x,y centers and w,h sizes.
            'iou': use IoU with label bbox as loss.
        obj_label (str): '1': use 1 as the target objectness score in label
            locations. 'iou': use IoU between prediction and ground truth as
            target objectness score in label locations.
    Returns:
        loss (tensor): total loss, the sum of the 3 terms below.
        loss_box (tensor): loss term from bounding box prediction.
        loss_obj (tensor): loss term from objectness score prediction.
        loss_cls (tensor): loss term from classification prediction.
    '''

    n_class = model.n_classes
    device = label.device

    # compute a factor to counter unbalanced object labels
    n_labels = len(label)   # num of objects in label
    n_preds = 0             # total num of predictions
    for yhatii in yhat:
        b, na, h, w, _ = yhatii.shape
        n_preds += na * h * w

    obj_weights = torch.tensor([(n_preds - n_labels)/n_labels*0.5]).to(device)

    # prepare loss terms
    loss_box = torch.zeros(1, device=device)
    loss_obj = torch.zeros(1, device=device)
    loss_cls = torch.zeros(1, device=device)
    if bbox_loss == 'mse':
        loss_xy = torch.zeros(1, device=device)
        loss_wh = torch.zeros(1, device=device)

    # BCE loss func for objectness score and classification
    obj_bce = nn.BCEWithLogitsLoss(pos_weight=obj_weights)
    cls_bce = nn.BCEWithLogitsLoss()

    if bbox_loss == 'mse':
        # MSE loss func for x,y,w,h
        xy_mse = nn.MSELoss()
        wh_mse = nn.MSELoss()

    batch_idx = label[:,0].long()
    label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou,\
            best_iou_xy_pd, best_iou_wh_pd, \
            best_iou_xy_lb, best_iou_wh_lb = select_anchor(
            yhat, label, model)

    # loop through scales
    for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)):

        s_idxii = torch.where(label_scale_idx==ii)[0]
        b_idxii = batch_idx[s_idxii]
        anc_idxii = label_anc_idx[s_idxii]
        y_idxii = label_y_idx[s_idxii]
        x_idxii = label_x_idx[s_idxii]
        iouii = best_iou[ii, s_idxii]

        obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device)

        # get target objectness scores
        if obj_label == '1':
            obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = 1
        elif obj_label == 'iou':
            obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = iouii.detach().clamp(0).type(obj_lb.dtype)

        # predicted objectness scores
        obj_pd = yhatii[..., 4]
        # objectness score loss
        loss_obj += obj_bce(obj_pd, obj_lb)

        # no labels matched in this scale: skip the box and class terms
        if len(s_idxii) == 0:
            continue

        if bbox_loss == 'mse':
            relxy_pd = best_iou_xy_pd[ii, s_idxii, anc_idxii]
            relxy_lb = best_iou_xy_lb[ii, s_idxii]
            wh_pd = best_iou_wh_pd[ii, s_idxii, anc_idxii]
            wh_lb = best_iou_wh_lb[ii, s_idxii]
            # x,y mse loss
            loss_xy += xy_mse(relxy_pd, relxy_lb)
            # w,h mse loss
            loss_wh += wh_mse(wh_pd, wh_lb) / 10   # scale size loss down
            loss_box += (loss_xy + loss_wh)

        elif bbox_loss == 'iou':
            loss_box += (1.0 - iouii).mean()

        # classification predictions
        cls_pd = yhatii[b_idxii, anc_idxii, y_idxii, x_idxii, 5:]

        # one-hot encode classes
        cls_one_hot_lb = F.one_hot(label[s_idxii, -1].long(), n_class).float().to(device)
        # classification loss
        loss_cls += cls_bce(cls_pd, cls_one_hot_lb)

    loss = loss_box + loss_obj + loss_cls

    return loss, loss_box , loss_obj , loss_cls
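
The per-scale bookkeeping in the loop above boils down to splitting the labels by their winning scale. A minimal sketch with made-up index tensors (the variable names follow the function, the values are invented):

```python
import torch

label_scale_idx = torch.tensor([1, 0, 2, 2])   # made-up winning scale per label
batch_idx = torch.tensor([0, 0, 1, 3])         # made-up image index per label

per_scale = []
for ii in range(3):
    s_idxii = torch.where(label_scale_idx == ii)[0]   # labels handled at scale ii
    b_idxii = batch_idx[s_idxii]                      # their image indices
    per_scale.append((s_idxii.tolist(), b_idxii.tolist()))
    print(ii, s_idxii.tolist(), b_idxii.tolist())
```

Every label shows up in exactly one scale, which is the property that distinguishes this version from compute_loss().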

Some more explanations:

  • Having set up some preparations, we call the above select_anchor() function, and go into the scales loop as before:

        batch_idx = label[:,0].long()
        label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou,\
                best_iou_xy_pd, best_iou_wh_pd, \
                best_iou_xy_lb, best_iou_wh_lb = select_anchor(
                yhat, label, model)
        # loop through scales
        for ii, (yhatii, yoloii) in enumerate(zip(yhat, model.yolo_layers)):
  • Now get the label coordinates in this scale:

            s_idxii = torch.where(label_scale_idx==ii)[0]
            b_idxii = batch_idx[s_idxii]
            anc_idxii = label_anc_idx[s_idxii]
            y_idxii = label_y_idx[s_idxii]
            x_idxii = label_x_idx[s_idxii]
            iouii = best_iou[ii, s_idxii]


    • s_idxii holds the indices of the labels whose best-matching anchor is in this scale, i.e. the labels whose s coordinate equals ii.
    • b_idxii is the b coordinate, and denotes which images in the batch the selected labels are in.
    • anc_idxii is the a coordinate, and denotes the matched anchor boxes in this scale.
    • y_idxii and x_idxii are the i,j coordinates, and denote the feature map cell locations of the selected labels.
    • iouii holds the best IoU scores of the selected labels.
  • With these coordinates ready, we then compute the objectness loss. Again, depending on obj_label, I tried different target values:

            obj_lb = torch.zeros(yhatii.shape[:-1]).float().to(device=device)
            # get target objectness scores
            if obj_label == '1':
                obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = 1
            elif obj_label == 'iou':
                obj_lb[b_idxii, anc_idxii, y_idxii, x_idxii] = iouii.detach().clamp(0).type(obj_lb.dtype)
            # predicted objectness scores
            obj_pd = yhatii[..., 4]
            # objectness score loss
            loss_obj += obj_bce(obj_pd, obj_lb)
  • The loss_box term depends on the bbox_loss argument:

            if bbox_loss == 'mse':
                relxy_pd = best_iou_xy_pd[ii, s_idxii, anc_idxii]
                relxy_lb = best_iou_xy_lb[ii, s_idxii]
                wh_pd = best_iou_wh_pd[ii, s_idxii, anc_idxii]
                wh_lb = best_iou_wh_lb[ii, s_idxii]
                # x,y mse loss
                loss_xy += xy_mse(relxy_pd, relxy_lb)
                # w,h mse loss
                loss_wh += wh_mse(wh_pd, wh_lb) / 10   # scale size loss down
                loss_box += (loss_xy + loss_wh)
            elif bbox_loss == 'iou':
                loss_box += (1.0 - iouii).mean()

    Note that we don’t need to re-compute everything from the ground up. We can query the returned values of select_anchor() to save some effort.

  • Classification loss is computed much as in compute_loss():

            # classification predictions
            cls_pd = yhatii[b_idxii, anc_idxii, y_idxii, x_idxii, 5:]
            # one-hot encode classes
            cls_one_hot_lb = F.one_hot(label[s_idxii, -1].long(), n_class).float().to(device)
            # classification loss
            loss_cls += cls_bce(cls_pd, cls_one_hot_lb)
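
As a small check of the one-hot/logits convention used for the classification loss, the sketch below uses made-up logits for 2 objects and 3 classes, and confirms that BCEWithLogitsLoss on raw logits matches BCELoss applied after an explicit sigmoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_class = 3
cls_pd = torch.tensor([[2.0, -1.0, 0.5],     # raw logits, made-up values
                       [-0.5, 0.0, 3.0]])
cls_lb = torch.tensor([0, 2])                # class ids of the 2 objects

cls_one_hot_lb = F.one_hot(cls_lb, n_class).float()   # [[1,0,0],[0,0,1]]

loss_logits = nn.BCEWithLogitsLoss()(cls_pd, cls_one_hot_lb)
loss_probs = nn.BCELoss()(torch.sigmoid(cls_pd), cls_one_hot_lb)
print(loss_logits.item(), loss_probs.item())
```

The two values agree up to floating point error. The logits version is the numerically safer choice, which is a good reason to keep the sigmoid out of the model output during training.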

3.4. The train.py script

Now create a train.py script and put it into the YOLOv3_pytorch project folder. Fill it with the following content:

from __future__ import print_function
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

from config import load_config
from model import Darknet53
from utils import compute_IOU, batch_NMS, compute_mAP
from loader import create_loader

from torch.utils.tensorboard import SummaryWriter

def lr_schedular(optimizer, iteration, warmup_iter, initial_lr, peak_lr, power=1):
    ...
    return lr

def compute_loss(yhat, label, model, bbox_loss='iou', obj_label='1'):
    ...  # function body as listed in section 3.2
    return loss, loss_box , loss_obj , loss_cls

def compute_loss2(yhat, label, model, bbox_loss='iou', obj_label='1'):
    ...  # function body as listed in section 3.3.2
    return loss, loss_box , loss_obj , loss_cls

def select_anchor(yhat, label, model):
    ...  # function body as listed in section 3.3.1
    return label_scale_idx, label_anc_idx, label_y_idx, label_x_idx, best_iou_scales,\
            best_iou_xy_pd, best_iou_wh_pd, \
            best_iou_xy_lb, best_iou_wh_lb

if __name__=='__main__':

    #--------------------Load model config--------------------
    CONFIG_FILE = './config/yolov3.cfg'
    net_config, module_list = load_config.parse_config(CONFIG_FILE)

    config = {'net': net_config}
    config['module_list'] = module_list
    config['width'] = 416
    config['height'] = 416
    config['n_classes'] = 80
    config['max_data_size'] = 100
    config['batch_size'] = 4
    config['is_train'] = True
    config['conf_thres'] = 0.3
    config['nms_iou_thres'] = 0.5
    config['map_iou_thres'] = 0.5

    # experiment parameters
    BBOX_LOSS = 'iou'  # 'iou' or 'mse'
    OBJ_LABEL = 'iou'  # 'iou' or '1'
    EXP = '%s-%s' %(BBOX_LOSS, OBJ_LABEL)

    # training parameters
    LR0 = 1.0*1e-4
    PEAK_LR = 8.*1e-4
    WEIGHT_DECAY = 1e-4
    EPOCHS = 150
    WARMUP_ITER = 4e3

    # folders
    DATA_FOLDER = './data/coco'
    CKPT_FOLDER = './ckpt/' + EXP
    LOG_DIR = './runs/' + EXP

    #-------------------Create model-------------------
    model = Darknet53(config)

    #--------------Get dataset and dataloader--------------
    dataset, dataloader = create_loader(DATA_FOLDER, config, shuffle=False)
    id2class = dataset.id2class

    #--------------------Load model--------------------
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print('######### Using device:', device, '############\n')
    # move the model created above to the same device as the data
    model.to(device)

    opt = torch.optim.Adam(model.parameters(), lr=LR0, weight_decay=WEIGHT_DECAY)

    #------------------Output folder------------------
    os.makedirs(CKPT_FOLDER, exist_ok=True)
    ckpt_file = os.path.join(CKPT_FOLDER, 'ckpt.pt')

    # load check point if exists
    if os.path.exists(ckpt_file):
        print('####### Load ckpt #########')
        print('ckpt file:', ckpt_file)
        ckpt = torch.load(ckpt_file)
        epoch0 = ckpt['epoch']
    else:
        epoch0 = 0

    writer = SummaryWriter(log_dir = LOG_DIR)

    #------------------Start training------------------
    total_iters = 0   # total number of iterations

    for ee in range(epoch0, epoch0+EPOCHS):
        print('\n#### Entering epoch: %d ########' %ee)

        # keep track of training loss
        train_loss_bbox = []
        train_loss_obj = []
        train_loss_cls = []
        train_loss_total = []
        # store for mAP computation
        pred_epoch = []
        label_epoch = []

        for ii, (imgii, labelii) in enumerate(dataloader):

            total_iters += 1
            total_seen = len(dataset) * ee + len(imgii)

            # run model
            imgii = imgii.to(device)
            labelii = labelii.to(device)
            yhatii = model(imgii)

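            # compute loss
            lossii, loss_bboxii, loss_objii, loss_clsii = compute_loss(
                yhatii, labelii, model, bbox_loss=BBOX_LOSS, obj_label=OBJ_LABEL)

            # update learning rate
            lr = lr_schedular(opt, total_iters, WARMUP_ITER, LR0, PEAK_LR)

            # back-prop and update weights
            opt.zero_grad()
            lossii.backward()
            opt.step()

            # record the losses of this iteration
            train_loss_bbox.append(loss_bboxii.item())
            train_loss_obj.append(loss_objii.item())
            train_loss_cls.append(loss_clsii.item())
            train_loss_total.append(lossii.item())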


            # evaluate
            if ii % EVAL_INTEVAL == 0:

                # predict in eval mode (training mode returns a list of 3 tensors)
                model.eval()
                with torch.no_grad():
                    yhatii = model(imgii)
                model.train()

                # compute NMS
                labelii = labelii.cpu().numpy()
                yhatii = yhatii.detach().cpu().numpy()
                yhatii = batch_NMS(yhatii, config['conf_thres'], config['nms_iou_thres'])
                if len(yhatii):
                    # convert to fractional coordinates and sizes, to match with labels
                    yhatii[:, [1,3]] /= config['width']
                    yhatii[:, [2,4]] /= config['height']

                # print loss
                print('ii = %d, Total loss = %.1f, box loss = %.2f, Obj loss = %.2f, Cls loss = %.2f'\
                        %(ii, lossii.item(), loss_bboxii.item(), loss_objii.item(), loss_clsii.item()))

                if HAS_TENSORBOARD:
                    # training loss every iteration
                    writer.add_scalar('iLoss/train_bbox', train_loss_bbox[-1], total_iters)
                    writer.add_scalar('iLoss/train_obj', train_loss_obj[-1], total_iters)
                    writer.add_scalar('iLoss/train_cls', train_loss_cls[-1], total_iters)
                    writer.add_scalar('iLoss/train_loss', train_loss_total[-1], total_iters)
                    writer.add_scalar('iLearning_rate/lr', lr, total_iters)

        # compute mAP of epoch
        if len(pred_epoch):
            pred_epoch = np.vstack(pred_epoch)
            label_epoch = np.vstack(label_epoch)
            mAP_ee = compute_mAP(pred_epoch, label_epoch, 0, config['map_iou_thres'])
        else:
            mAP_ee = 0

        if HAS_TENSORBOARD:
            # training loss every epoch
            writer.add_scalar('eLoss/train_bbox', np.mean(train_loss_bbox), ee+1)
            writer.add_scalar('eLoss/train_obj', np.mean(train_loss_obj), ee+1)
            writer.add_scalar('eLoss/train_cls', np.mean(train_loss_cls), ee+1)
            writer.add_scalar('eLoss/train_loss', np.mean(train_loss_total), ee+1)
            writer.add_scalar('eLearning_rate/lr', lr, ee+1)
            writer.add_scalar('emAP/map', mAP_ee, ee+1)

        print('\n############ Save model #############')
        print('Save to', ckpt_file)
        torch.save({'epoch': ee,
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': opt.state_dict(),
                    'loss': lossii.item()},
                   ckpt_file)

I’ve omitted the compute_loss(), compute_loss2() and select_anchor() definitions.

There is a lr_schedular() function that updates the learning rate:

def lr_schedular(optimizer, iteration, warmup_iter, initial_lr, peak_lr, power=1):

    if iteration == 0:
        iteration += 1

    lr = min(1 / iteration**power, iteration / warmup_iter**(power + 1)) *\
            warmup_iter**power * (peak_lr - initial_lr) + initial_lr

    lr = max(lr, 1e-7)

    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    return lr

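It increases the learning rate linearly from initial_lr to peak_lr during a warm-up phase of warmup_iter iterations. Afterwards the rate decays in proportion to (warmup_iter / iteration)**power, so with the default power=1 it falls off as an inverse of the iteration count (not exponentially) and approaches initial_lr.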
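To make the shape of the schedule concrete, here is a minimal sketch that evaluates the same formula at a few iterations without needing an optimizer. The helper name lr_at() is my own and is not part of the training script; the parameter values are the ones set in the script above.

```python
def lr_at(iteration, warmup_iter, initial_lr, peak_lr, power=1):
    """Same formula as lr_schedular(), but without touching an optimizer.

    lr_at() is a hypothetical helper for illustration only.
    """
    iteration = max(iteration, 1)  # avoid division by zero at iteration 0
    scale = min(1 / iteration**power, iteration / warmup_iter**(power + 1))
    lr = scale * warmup_iter**power * (peak_lr - initial_lr) + initial_lr
    return max(lr, 1e-7)


# values used in the training script
LR0, PEAK_LR, WARMUP_ITER = 1e-4, 8e-4, 4000

for it in (1, 2000, 4000, 8000, 16000):
    print('iteration %5d: lr = %.3e' % (it, lr_at(it, WARMUP_ITER, LR0, PEAK_LR)))
```

With these settings the rate is halfway up the ramp at iteration 2000, reaches peak_lr at iteration 4000, and is back down to the halfway value by iteration 8000.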

The script uses tensorboard to record the training losses across iterations and epochs, and uses our previously developed compute_mAP() to measure the performance every epoch.
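The HAS_TENSORBOARD flag used in the script is defined outside the part shown here. One common way to set such a flag, given as a sketch and not necessarily how the original script does it, is a guarded import, so that training still runs when TensorBoard is not installed:

```python
# Optional dependency: fall back gracefully when TensorBoard is missing.
# This is an assumed definition of HAS_TENSORBOARD, for illustration only.
try:
    from torch.utils.tensorboard import SummaryWriter
    HAS_TENSORBOARD = True
except ImportError:
    SummaryWriter = None
    HAS_TENSORBOARD = False

print('TensorBoard logging enabled:', HAS_TENSORBOARD)
```

The recorded curves can then be viewed by running tensorboard --logdir ./runs and opening the address it prints.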

Note that for test purposes, I’m using only a tiny fraction of the entire dataset (config['max_data_size'] = 100), and computing the mAP on the training data itself.
