BERT is described in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
RoBERTa is described in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach".
Three years have now passed since these papers were published. Are there any pretrained language models that surpass them on most tasks, given the same or similar computational resources?
A model that achieves a speedup without any loss in accuracy would also count as an improvement.