14 Nov 2022
5:40 a.m.
In their paper (https://arxiv.org/pdf/2208.00635.pdf) they say the highest scores on CommonsenseQA were obtained with what they call "DictRoBERTa + LWA(K+V)", where LWA stands for "Layer-wise Extra-hop Attention"... well, I've misplaced that. I think I'll try to adapt bloom-560m to do something similar. My plan: build a small dataset that I add to by hand, have a script split it into train/test, and train an adapter for as long as the loss on the test split keeps dropping (a rough sketch of that loop is below). I suspect there is something wrong with that plan, but it's a start.
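Roughly what I have in mind, as a minimal sketch and nothing more: this assumes a LoRA adapter via the peft library (which is not the paper's LWA mechanism, just the adapter style I know how to train), and a hypothetical `dataset.txt` with one hand-written example per line. The split and the "stop when test loss stops dropping" rule are exactly the plan above.

```python
# Sketch: fit a LoRA adapter on bloom-560m against a tiny hand-made dataset,
# holding out a test split and stopping when the test loss stops dropping.
# "dataset.txt" (one example per line) is a placeholder file name.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Wrap the frozen base model with a small LoRA adapter on the attention blocks.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["query_key_value"], task_type="CAUSAL_LM"))

# Load the hand-curated examples and split roughly 90/10 into train/test.
lines = [l.strip() for l in open("dataset.txt") if l.strip()]
random.shuffle(lines)
cut = max(1, len(lines) // 10)
test_texts, train_texts = lines[:cut], lines[cut:]

def batches(texts, batch_size=4):
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], return_tensors="pt",
                        padding=True, truncation=True, max_length=256)
        labels = enc["input_ids"].clone()
        labels[enc["attention_mask"] == 0] = -100  # ignore pad positions in the loss
        yield enc["input_ids"], enc["attention_mask"], labels

def eval_loss(texts):
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for ids, mask, labels in batches(texts):
            total += model(input_ids=ids, attention_mask=mask, labels=labels).loss.item()
            n += 1
    return total / max(n, 1)

# Only the adapter weights require grad, so only they get optimized.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

best = float("inf")
for epoch in range(20):
    model.train()
    for ids, mask, labels in batches(train_texts):
        loss = model(input_ids=ids, attention_mask=mask, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    test = eval_loss(test_texts)
    print(f"epoch {epoch}: test loss {test:.4f}")
    if test >= best:  # stop as soon as the held-out loss stops dropping
        break
    best = test
    model.save_pretrained("bloom560m-adapter")  # keep the best adapter so far
```

With a dataset this small the test split will be tiny and noisy, which is probably part of what's wrong with the plan.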