
AdapterHub: A Framework for Adapting Transformers



No more slow fine-tuning: efficient transfer learning with HuggingFace transformers in 2 extra lines of code

Jul 18 · 6 min read

This blog post is an introduction to AdapterHub, a new framework released by Pfeiffer et al. (2020b) that enables you to perform transfer learning of general-purpose pre-trained transformers such as BERT, RoBERTa, and XLM-R to downstream tasks such as question answering, classification, etc. using adapters instead of fine-tuning.

AdapterHub was built on top of the popular transformers package supplied by HuggingFace. You can find the HuggingFace transformers package here, and AdapterHub's modification here.

To properly understand this article, I recommend that you first read up on transformers and how they are usually fine-tuned!

Why use adapters instead of fine-tuning?

I go into the details of this in the section ‘The Benefits of Adapters’, but as a sneak peek:

Houlsby et al. (2019) introduced something called an adapter. Adapters serve the same purpose as fine-tuning, but do it by stitching new layers into the main pre-trained model and updating the weights Φ of these new layers, whilst freezing the weights θ of the pre-trained model.

In contrast, you will recall that in fine-tuning, we are required to update the pre-trained weights as well.

As you might imagine, this makes adapters much more efficient, both in terms of time and storage, compared to fine-tuning. Adapters have also been shown to match the performance of state-of-the-art fine-tuning methods!
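To make this concrete, here is a simplified, illustrative PyTorch sketch of a bottleneck adapter layer. The class name, default sizes, and choice of activation are my own simplifications for illustration; the architectures proposed in the papers differ in their exact placement and details.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # A simplified bottleneck adapter in the spirit of Houlsby et al. (2019):
    # down-project, apply a non-linearity, up-project, then add a residual connection.
    def __init__(self, hidden_size=768, bottleneck_size=48):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # new weights Φ (trained)
        self.up = nn.Linear(bottleneck_size, hidden_size)    # new weights Φ (trained)
        self.activation = nn.ReLU()

    def forward(self, hidden_states):
        # The residual connection keeps the frozen transformer's representation intact.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

# The pre-trained weights θ stay frozen; only the adapter weights Φ receive gradients:
# for p in pretrained_transformer.parameters():
#     p.requires_grad = False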

[Figure omitted; source: Pfeiffer et al. (2020b).]

What you will learn from this post

I will simplify the contents of the AdapterHub paper, which accompanied the release of the framework, to make it easier for you to start developing.

The AdapterHub framework is important because prior to this framework, stitching in adapters or sharing adapters that one has trained was difficult and involved manually modifying the transformer architecture. This framework enables a

dynamic “stitching-in” of pre-trained adapters for different tasks and languages

In short, it makes using adapters for transfer learning much, much easier.

In this post, we will review the benefits of adapters as discussed in the AdapterHub paper and explain the key features of the new AdapterHub framework.

The features will be accompanied by some sample code from the paper to get you started!

If you are already familiar with adapters & their various benefits, you can skip straight to the section ‘Key Features of AdapterHub.’

The benefits of adapters

The benefits listed here correspond to a simplified version of those listed in Section 2.2 of the AdapterHub paper.

Task-specific Layer-wise Representation Learning

As alluded to earlier, when fine-tuning and adapters were compared on the GLUE benchmark (popular in natural language processing circles), there was no large difference in performance.

This means that adapters can achieve state-of-the-art results on par with fine-tuning, whilst preserving the time and space efficiency listed as the next benefit!

Small, Scalable, Shareable

To fully fine-tune a model, we need to store a copy of the model for each task. This also “impedes iterating and parallelizing training.”

In contrast, adapters require MUCH LESS storage. To illustrate, Pfeiffer et al. (2020b) provide the following example:

for the popular Bert-Base model with a size of 440Mb, storing 2 fully fine-tuned models amounts to the same storage space required by 125 models with adapters, when using a bottleneck size of 48 and adapters of Pfeiffer et al. (2020a)
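As a quick back-of-the-envelope check, the snippet below simply restates the arithmetic implied by that quote (all numbers come from the quote itself, not from new measurements):

full_model_mb = 440        # approximate size of Bert-Base on disk, per the quote
num_full_models = 2
num_adapter_models = 125   # number of adapter-equipped models fitting in the same space

total_mb = full_model_mb * num_full_models      # 880 MB for two fully fine-tuned copies
per_adapter_mb = total_mb / num_adapter_models  # roughly 7 MB per adapter checkpoint
print(f"~{per_adapter_mb:.0f} MB per adapter vs ~{full_model_mb} MB per fully fine-tuned model")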

A sub-benefit of this is that we can add many more tasks to an application by simply adding small adapters instead of hefty fine-tuning.

Reproducibility across researchers is another wonderful result of the reduced storage requirements.

Modularity of Representations

When we stitch in adapters, we fix the representations of the rest of the transformer, which means these adapters are encapsulated and can be stacked, moved, or combined with other adapters.

This modularity allows us to combine adapters from various tasks — something super important as NLP tasks get more complex.

Non-Interfering Composition of Information

Natural language processing often involves sharing information across tasks. We often use something called Multi-Task Learning (MTL), but MTL suffers from two issues:

  • catastrophic forgetting: where ‘information learned during earlier stages of training is “overwritten”’ (Pfeiffer et al, 2020b).
  • catastrophic interference: where ‘the performance of a set of tasks deteriorates when adding new tasks’ (Pfeiffer et al, 2020b).

With adapters, we train the adapter for each task separately, meaning that we overcome both issues above.

Key features of AdapterHub

Great, now let’s look at the key features of this framework!

Adapters in Transformer Layers + How to Train an Adapter

To add the adapters, the authors used something called ‘Mix-Ins’, which are inherited by the HuggingFace transformer classes, so as to keep the two codebases reasonably separate.

In practice, here is how you add an adapter layer:

from adapter_transformers import AutoModelForSequenceClassification, AdapterType
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model.add_adapter("sst-2", AdapterType.text_task, config="pfeiffer")
model.train_adapters(["sst-2"])
# Train model ...
model.save_adapter("adapters/text-task/sst-2/", "sst")
# Push link to zip file to AdapterHub ...

You’ll notice that the code mostly corresponds to regular HuggingFace transformers, and we just add two lines to add & train the adapter.
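To make the “# Train model ...” placeholder concrete, here is a minimal, illustrative sketch of a single training step on a toy SST-2-style batch, continuing from the model defined above. The tokenizer import, the toy sentences, the labels, and the learning rate are my own illustrative choices rather than the paper’s, so treat this as a sketch, not the framework’s official training recipe.

import torch
from adapter_transformers import AutoTokenizer  # assuming the fork re-exports the usual tokenizer classes

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# A toy SST-2-style batch: two sentences with sentiment labels (1 = positive, 0 = negative).
batch = tokenizer(["a gorgeous film", "a dull, lifeless mess"],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([1, 0])

# train_adapters(...) leaves only the adapter weights trainable,
# so we optimize whatever parameters still require gradients.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

model.train()
outputs = model(**batch, labels=labels)
loss = outputs[0]  # the first output element is the loss when labels are supplied
loss.backward()
optimizer.step()
optimizer.zero_grad()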

Something special about this AdapterHub framework is that you can dynamically configure the adapters and change their architectures. Whilst you can use adapter architectures directly from the literature, for example from Pfeiffer et al. (2020a) or Houlsby et al. (2019), you can also modify these architectures quite easily using a configuration file. In the code above, we use the default Pfeiffer et al. (2020a) configuration.
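For example, switching to the Houlsby et al. (2019) architecture should only require changing the configuration string when adding the adapter. This is a sketch based on the paper’s description; the adapter name and configuration string below are my assumptions, so check the AdapterHub documentation for the exact names and options.

# Assumed configuration name for the Houlsby et al. (2019) architecture
model.add_adapter("sst-2-houlsby", AdapterType.text_task, config="houlsby")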

[Figure omitted; source: Pfeiffer et al. (2020b). The dotted lines indicate configurable components.]

Extracting and open-sourcing adapters

You can push the adapters that you train to AdapterHub.ml, as well as benefit from adapters other people have pre-trained. Unlike fine-tuning, where the entire large model must be shared, these adapters are lightweight and easily shareable!

Finding pre-trained adapters

The search function of AdapterHub.ml works hierarchically:

  • 1st level: view by task/language
  • 2nd level: separate into datasets of higher-level NLP tasks, or into the language of the training data (if we are adapting to a new language)
  • 3rd level: separate into individual datasets or domains (e.g. Wikipedia)

The website also helps you to identify compatible adapters depending on the pre-trained transformer that you specify.

Stitching in pre-trained adapters

Use the following code to stitch in a pre-trained adapter, as opposed to training your own (as detailed earlier):

from adapter_transformers import AutoModelForSequenceClassification, AdapterType
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
model.load_adapter("sst", config="pfeiffer")

To be perfectly clear, the pre-trained adapter is loaded in the third line of code, whilst the pre-trained transformer is loaded in the second line of code.

Inference with Adapters

Just use regular HuggingFace code for inference! You may optionally load prediction heads when loading adapter weights.
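As an illustration, here is a minimal inference sketch continuing from the loading snippet above. It assumes a matching classification head was loaded alongside the adapter; the tokenizer import and example sentence are my own choices.

import torch
from adapter_transformers import AutoTokenizer  # assuming the fork re-exports the usual tokenizer classes

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model.eval()

# Depending on the library version, the loaded adapter may need to be activated explicitly;
# see the AdapterHub documentation for the exact call.
inputs = tokenizer("an utterly charming movie", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]  # without labels, the first output element is the logits
prediction = logits.argmax(dim=-1).item()  # predicted class index (label mapping depends on the loaded head)
print(prediction)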

Recall that combining adapters across tasks is very much possible with this awesome framework!

Conclusion

I’d encourage you to try out the AdapterHub framework for yourself at AdapterHub.ml.

This framework is so exciting and I hope that this post helped you to start your journey of adapting transformers!

Do let me know if you detect any errors in the post, or if you have any comments/critiques!

References

Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., and Gurevych, I. (2020a). AdapterFusion: Non-destructive task composition for transfer learning. arXiv preprint.

Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., Cho, K., and Gurevych, I. (2020b). AdapterHub: A Framework for Adapting Transformers. arXiv preprint.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. (2019). Parameter-efficient transfer learning for NLP. In Proceedings of ICML 2019.

