
Two Tasks, Two Datasets, One Network: Multi-task Learning with DnD


This post is motivated by problems I have been trying to solve in my work at Shoprunner. For context, I am responsible for modeling an ever-growing number of categories, and we will likely never have a complete list of everything we want to model. So it is not particularly feasible to build one fully labeled "dataset to rule them all" that covers every category we would ever need. While it could physically be done, there are real risks: if a new category is underrepresented in the unified dataset, we would have to label a completely new dataset with all of the previous categories. It could also just cost a bunch of extra money.

There are a number of approaches I would consider partial solutions, like building a bunch of category-specific models and deploying those, or continuing to find ways to pursue the one-dataset-to-rule-them-all method.

Building and deploying single models seems easy enough, until you have to maintain dozens of models a year from now when a single model could do the same job. So it would be nicer to solve all of these tasks with a single multi-task network.

To keep pursuing the "one dataset to rule them all" path, I have been building smaller models trained on specific categories and using those to label a larger pool of images, creating large fully labeled datasets. However, this process requires manual inspection of categories to check for cleanness, and there are many issues around keeping the fully labeled datasets clean. For me this is a stopgap solution at best, and one that may be difficult to maintain.

What I really need is a way to train multi-task models with the flexibility to use multiple datasets for the individual tasks.

Dragonborn (.96), Paladin/Warrior (.91). When I looked at this image in my Jupyter notebook I was confused about what it was… then I saw the race and class were written on it. Gogo model I guess? The only label this image had in the dataset was "dragonborn," so the model getting paladin right is cool.

Background: Roll a History Check

Being able to train my multi-task networks using multiple datasets, rather than relying on a unified fully labeled dataset, would give me the flexibility to add and remove categories with relative ease. Over the past few months I have done a good bit of experimentation with network architectures and pipelines, but I didn't get close to cracking it until recently.

A week or so ago I was in a rut. I had listened to Andrew Ng's multi-task learning lecture again earlier in the day, but the details of the video did not really click with me until I was sparring with some folks at Wing Chun practice. Partway through the sparring session I realized Andrew Ng had briefly mentioned doing multi-task learning with non-unified datasets (~5:30 in the video). That broke my concentration and I got hit a few times, but it was super worth it.

Multi-task learning lecture from Andrew Ng's deep learning course.

While he is super vague about it, the basic idea Andrew Ng goes through is that you can do multi-task learning on partially labeled datasets by calculating only the relevant losses and using those for backpropagation.
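Written out (my notation, not Ng's exact slides), the loss is just a masked sum over tasks, where each example only contributes to the tasks it actually has a label for:

$$\mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{T} \mathbb{1}\{y_j^{(i)} \text{ is labeled}\}\; \ell\left(\hat{y}_j^{(i)},\, y_j^{(i)}\right)$$

The indicator simply skips any task $j$ that example $i$ has no label for, so missing labels contribute nothing to the gradient.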

This high-level idea seems intuitive enough, and since Andrew Ng mentions it in a lecture I figured it was probably possible. So I went through a few different iterations of pipelines where I tried to do this. All failures. Great, back to square one… I was still missing something.

The second bit of information that helped me came from researching more recent applications of Google's BERT NLP model. While I don't like the Sesame Street naming trend (ELMo, BERT, Big Bird), the models are quite interesting if you haven't checked them out. I came across a post by folks at Stanford DAWN (here). The key line for me was the following.

"Our training schedule is simple as well: break up all participating tasks into batches, shuffle them randomly, and feed them through the network one at a time such that each training example from each task is seen exactly once per epoch."
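As a quick sketch of what that schedule looks like in code (this is my reading of it; `batches_by_task`, `model`, and `train_step` are hypothetical placeholders, not anything from the Stanford DAWN post):

```python
import random

# assumption: batches_by_task maps a task name -> list of batches for that task
schedule = [(task, batch)
            for task, batches in batches_by_task.items()
            for batch in batches]
random.shuffle(schedule)  # tasks end up interleaved in random order

for task, batch in schedule:
    # each batch updates only the loss for its own task head
    train_step(model, task, batch)
```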

I have a fairly compulsive need to do data science things and am a self-described "data science addict." So as soon as I read that post I basically headed home from work and tinkered until 2 AM to work out the details I am using for this post.

Datasets: Roll for Initiative

humanoid (.57), warrior/paladin (.57). Swords and plate armor for warrior/paladin. The dataset label is just warrior/paladin.

As with all projects, the first step was to find or build a dataset. For this post I was able to use two small datasets I had lying around, where I had mined images/artwork related to DnD races and classes.

Races include humanoid (elf/human), gnome, and dragonborn; classes include paladin/warrior, rogue, and wizard/sorcerer.

Are these all-inclusive categories, able to generalize to everything in DnD? Absolutely not, but they serve my purposes here, and for each dataset a random guess scores around 35–40%.

Another detail is that I built these datasets quickly using a method mentioned in the fastai lectures, which references another blog post from pyimagesearch.

The general gist is that you can use Google Images searches to mine relatively coherent categories for a dataset and then manually prune out unrelated images to fully clean it. As someone who used to build image datasets by hand and label them… this was a huge quality-of-life improvement and I HIGHLY recommend doing what you can with it. At the moment I am still trying to figure out how to do multi-label problems with this method, but that can be a topic for another time.
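For anyone curious, the download half of that method boils down to something like the sketch below, assuming you have already collected image URLs into a `urls.txt` file via the browser-console trick from the pyimagesearch post (the file and directory names here are made up):

```python
import requests
from pathlib import Path

out_dir = Path("data/dragonborn")
out_dir.mkdir(parents=True, exist_ok=True)

urls = Path("urls.txt").read_text().splitlines()
for i, url in enumerate(urls):
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        (out_dir / f"{i:05d}.jpg").write_bytes(resp.content)
    except Exception:
        # dead links get skipped; unrelated images get pruned by hand later
        continue
```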

Version 1 Pipeline: Roll a Tinkering Check?

My first pass at building this pipeline mostly involved modifying my existing training pipelines. Here are a number of the implementation details; a sketch of the resulting training loop follows the list.

  • Made two dataset loaders instead of one. I separated the two so that I could iterate over them separately to generate batches and calculate the loss for only the relevant task head of the network.
  • Right now I use the smaller of the two datasets to determine the number of batches to generate in one epoch. This is likely not necessary and may only be sensible in this use case, where the two datasets have a similar number of samples.
  • I am only doing backpropagation once. I run one dataset's batch through and calculate the relevant loss for that task (completely ignoring the loss calculation for the other task, as suggested by Andrew Ng). Then I run the second dataset's batch through and calculate the loss for that second task. Once those two are done, I sum their losses and perform backpropagation.
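Put together, the core of that loop looks roughly like this in PyTorch (a minimal sketch: `race_loader`, `class_loader`, and a two-headed `model` returning `(race_logits, class_logits)` are stand-ins for my actual pipeline, and the learning rate is just a placeholder):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
# slightly lower-than-normal learning rate helped convergence here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# zip() stops at the shorter loader, which matches using the smaller
# dataset to set the number of batches per epoch
for (race_x, race_y), (class_x, class_y) in zip(race_loader, class_loader):
    optimizer.zero_grad()

    # race batch: keep only the race head's loss
    race_logits, _ = model(race_x)
    race_loss = criterion(race_logits, race_y)

    # class batch: keep only the class head's loss
    _, class_logits = model(class_x)
    class_loss = criterion(class_logits, class_y)

    # sum the two task losses and backpropagate once
    (race_loss + class_loss).backward()
    optimizer.step()
```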

Feel free to check out the notebooks I did my tinkering in. The initial notebook where I was working out my training pipeline is here.

I had issues getting this model to converge, but am finding that slightly lower-than-normal learning rates work well. This might have something to do with keeping the network from over-adjusting to a particular task and just jumping back and forth between the two. Highly speculative, and it bears further testing on my end.

The end result of this pipeline was that I was able to get both tasks to around 70% accuracy for predicting the race and class of an input image.


On one hand this was a win because I was doing better than a random guess… However, it is still a bit cringeworthy, since I FEEL like it should be able to do better… So now that the pipeline works, how should I aim to improve it?

Since I was using lower-than-normal learning rates, I figured that the backbone Resnet model was likely not really getting fine-tuned to this DnD-specific domain. To test that hypothesis, all I needed to do was get a DnD fine-tuned model.

Version 2 with a Fine Tuned Backbone: Roll an Insight Check

Since I had my datasets in place, I trained a Resnet50 model to predict the race category. I did this with minor modifications to my standard training pipelines (mostly just converting them back to single-task, since most of the pipelines I have lying around are multi-task learners these days). The goal was to get more domain specialization into the backbone model, which does most of the heavy lifting.
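The fine-tuning step itself is standard transfer learning; a minimal sketch (the class count of 3 matches the race categories above, everything else is illustrative):

```python
import torch.nn as nn
import torchvision.models as models

# start from an ImageNet-pretrained ResNet50 and swap the final
# fully connected layer to output the three race categories
race_model = models.resnet50(pretrained=True)
race_model.fc = nn.Linear(race_model.fc.in_features, 3)
# ...then train on the race dataset with an ordinary single-task loop
```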

While I likely could have squeezed higher performance out of this model, it reached 80% accuracy on the task. This beats the random guess threshold of 40% for the race task and the version 1 multi-task model's accuracy of 70%.

Fine-tuning of the backbone Resnet50 model takes place here.

I replaced the vanilla Resnet50 model with this new fine-tuned Resnet50 and ran the training pipeline a few times. I found that cycling the learning rates, and actually letting it use slightly higher ones (lower than normal, but higher than I was using before), let the model get to 83% accuracy for race and 75% for class.
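A sketch of how the swap might look, assuming the fine-tuned weights were saved to a hypothetical `race_resnet50.pth` (the two-headed wrapper here is illustrative, not my exact architecture):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoHeadedNet(nn.Module):
    """Shared ResNet50 trunk with separate race and class heads."""
    def __init__(self, trunk, n_races=3, n_classes=3):
        super().__init__()
        n_features = trunk.fc.in_features
        trunk.fc = nn.Identity()  # drop the single-task head
        self.trunk = trunk
        self.race_head = nn.Linear(n_features, n_races)
        self.class_head = nn.Linear(n_features, n_classes)

    def forward(self, x):
        features = self.trunk(x)
        return self.race_head(features), self.class_head(features)

# load the race-fine-tuned weights, then use that network as the trunk
trunk = models.resnet50()
trunk.fc = nn.Linear(trunk.fc.in_features, 3)
trunk.load_state_dict(torch.load("race_resnet50.pth"))
model = TwoHeadedNet(trunk)
```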

The version 2 pipeline with a fine-tuned backbone is here. I also make note of the different runs while I was essentially just cycling the learning rates.

Is this as high as I would like? Still no, but I think it demonstrates a fairly effective way to train these models and shows that a fine-tuned backbone improves the performance of this setup.

humanoid (.998) and wizard/sorcerer (.834). Staff and robes seem to equal wizard to the model.

Results: Roll a Perception Check

Swapping in a model fine-tuned to this domain showed significant improvements over version 1 of this pipeline, which used a non-fine-tuned backbone. The version 1 model had its highest accuracies of 70% and 71.25% (race and class) at its lowest loss of 1.681. In comparison, the version 2 model improved a good bit, scoring 83.75% and 75% with a best loss of 1.5751.

The only difference was replacing the backbone of the network with a fine-tuned model, which is helpful because it suggests a good way to work through the problems I am facing at Shoprunner: take current well-tuned models and add additional categories using this multi-dataset training method.

The other piece of this that is interesting to me is that the multi-task version of this network is able to outperform a single-task network on the same problem.

For the DnD race category, the tuned Resnet50 scored 80%, but the multi-task network using that network as a backbone scored 85% on the same task. One of the reasons multi-task learning is used is that training a network on multiple tasks acts as a form of regularization. Sebastian Ruder goes through multi-task learning and talks about some of that here.

Training a network on a variety of tasks can improve generalization over training on each task separately, because the network is penalized for overfitting to any single task.

This performance increase is also interesting because, as I add more categories into my models for work, it may offer improvements over previous iterations as more tasks are learned and the network gains more subject matter expertise than it previously had.

gnome (.778), rogue (.9112). Because of the way I constructed these classes, paladin/warrior is mostly people wearing plate armor, rogue is characters dressed fairly normally, and wizard/sorcerer is people with long robes who are using magic or have staffs. My fault, not really the model's.

Conclusions: Roll a Wisdom Check

humanoid (.995), wizard/sorcerer (.93). Likely says wizard due to the robe-y feel?

Since this is mostly me experimenting, there are a lot of areas for improvement, and I hope I will look back at this work and get a laugh out of the way I attacked the problem. For the time being, though, this method seems relatively promising and was not terrible to implement.

I think that randomly selecting batches, rather than cycling through the batches for the different tasks in order like I currently do, might improve performance and add more variation into the model's training cycle.

There is also experimentation to be done on the best way to calculate the losses and backpropagate them through the network. I tried doing two optimizer steps and was getting some errors, which is why I defaulted back to adding the losses together like other multi-task models.
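For what it's worth, one alternative I may try is backpropagating each loss separately and taking a single optimizer step; since the two losses come from separate forward passes, the gradients simply accumulate, which is mathematically equivalent to summing the losses first (a sketch, reusing the placeholder names from the earlier loop):

```python
optimizer.zero_grad()
race_loss.backward()   # gradients from the race head accumulate in .grad
class_loss.backward()  # gradients from the class head add on top
optimizer.step()       # one update using the combined gradients
```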

Also, as with other multi-task networks, I need to figure out good ways to improve specific tasks over others when a task is underperforming.

Even with the things I think could be improved, this pipeline is a big step forward for my current work on a problem I have been wrestling with for a few months.

Building large unified datasets is time consuming and potentially expensive. Maybe some previous work was done, so labels for one category exist in one place; another category is brand new, so you have to crowdsource its labeling; and a third comes from a dataset that was purchased. In an environment where you need to unify the datasets, you would have to do something like what I have done for work: train smaller models on each problem and use them to label a unified dataset, or pay to crowdsource the labels. In both cases you have to spend a lot of effort making sure the initial datasets and the resulting labels are clean and usable, or risk cross-contamination and underperforming final models.

Rather than finding some way to unify the labels from these three datasets in order to train a model… you can use each one separately and train a multi-task network on each task, using each dataset simultaneously. This cuts down the number of models that need to be trained and the amount of manual inspection that needs to be done.

Being able to train a multi-task network using datasets labeled for a specific task means you can use whatever labeling method makes the most sense at any given point.

Here is the general link to the repo. It also includes a notebook where I run the model on the different images used in this blog.

humanoid (.995), wizard/sorcerer (.964)
