This is research on transfer learning that uses less GPU memory, less data, and lower-end hardware to build new, equally capable transformer models for new tasks from existing large pretrained ones, without giving up accuracy or speed. [1][2]

From the abstract: We propose Ladder Side-Tuning (LST), a new parameter-efficient transfer learning (PETL) technique that reduces training memory requirements by more substantial amounts. Unlike existing PETL methods that insert additional parameters inside backbone networks, we train a ladder side network, a small and separate network that takes intermediate activations as input via shortcut connections (ladders) from backbone networks and makes predictions. On both GLUE and VL tasks, LST saves 2.7x more memory than other PETL methods. To further show the advantage of this better memory efficiency, we also apply LST to larger T5 models (T5-large, T5-3B), attaining better GLUE performance than full fine-tuning and other PETL methods.

References
1. https://arxiv.org/abs/2206.06522
2. https://github.com/ylsung/Ladder-Side-Tuning
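
To make the mechanism concrete, here is a minimal PyTorch sketch of the idea described above. This is one reading of the architecture, not the authors' implementation: the SideBlock and LadderSideTuner names, the sigmoid gating, and the down-projection sizes are illustrative assumptions. What it demonstrates is that the pretrained backbone runs frozen under torch.no_grad(), its intermediate activations are down-projected and fed into a small side network through shortcut ("ladder") connections, and only the side network and task head receive gradients, so no activations have to be stored for backpropagation through the backbone.

import torch
import torch.nn as nn

class SideBlock(nn.Module):
    # One rung of the ladder: fuse a backbone activation into the side state.
    def __init__(self, backbone_dim, side_dim):
        super().__init__()
        self.down = nn.Linear(backbone_dim, side_dim)   # shrink the backbone activation
        self.gate = nn.Parameter(torch.zeros(1))        # learned mixing gate (assumption)
        self.ffn = nn.Sequential(                       # tiny feed-forward block
            nn.LayerNorm(side_dim),
            nn.Linear(side_dim, side_dim * 2),
            nn.GELU(),
            nn.Linear(side_dim * 2, side_dim),
        )

    def forward(self, side_state, backbone_act):
        g = torch.sigmoid(self.gate)
        fused = g * side_state + (1 - g) * self.down(backbone_act)
        return fused + self.ffn(fused)

class LadderSideTuner(nn.Module):
    # Frozen backbone + small trainable side network fed by shortcut connections.
    def __init__(self, backbone_layers, backbone_dim, side_dim, num_classes):
        super().__init__()
        self.backbone_layers = backbone_layers          # e.g. the encoder blocks of T5/BERT
        for p in self.backbone_layers.parameters():
            p.requires_grad = False                     # backbone is never updated
        self.input_down = nn.Linear(backbone_dim, side_dim)
        self.side_blocks = nn.ModuleList(
            [SideBlock(backbone_dim, side_dim) for _ in range(len(backbone_layers))]
        )
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, hidden):                          # hidden: (batch, seq, backbone_dim)
        acts = []
        with torch.no_grad():                           # forward-only pass, no backprop graph
            h = hidden
            for layer in self.backbone_layers:
                h = layer(h)
                acts.append(h)
        side = self.input_down(hidden)
        for block, act in zip(self.side_blocks, acts):
            side = block(side, act)                     # ladder shortcut from the backbone
        return self.head(side.mean(dim=1))              # pool over the sequence and classify

# Toy usage: plain linear layers stand in for a real pretrained encoder.
backbone = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
model = LadderSideTuner(backbone, backbone_dim=768, side_dim=96, num_classes=2)
logits = model(torch.randn(4, 16, 768))                # -> shape (4, 2)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total}")

Only the input down-projection, the side blocks, and the head show up as trainable, which is what keeps both the optimizer state and the stored activations small relative to full fine-tuning or adapter-style PETL methods that still backpropagate through the backbone.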