This is nvidia-oriented research on using less memory with more accuracy and speed when making new equally powerful transformer models for new tasks off of existing large pretrained ones, on lower end systems and with less data.
We propose Ladder Side-Tuning (LST), a new parameter-efficient transfer learning (PETL) technique that reduces training memory requirements by more substantial amounts. Unlike existing PETL methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations as input via shortcut connections (ladders) from backbone networks and makes predictions.
On both GLUE and VL tasks, LST saves 2.7x more memory than other PETL methods. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods.