I have used BERT in a few papers and plan to use it in an upcoming project, so I need to understand its drawbacks in a real, ongoing project.
I have read that its biggest disadvantage is that it is very compute-intensive at inference time, so serving it in production at scale can become costly.
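To put a rough number on "compute-intensive", here is a back-of-envelope sketch in Python. It assumes the standard BERT-base configuration (12 layers, hidden size 768, feed-forward size 3072, vocabulary of 30522). The helper names `count_params` and `approx_forward_flops` are my own, not from any library. The FLOP figure uses the common approximation of 2 FLOPs per weight per token for the linear layers plus the attention score and weighted-sum matmuls. It ignores softmax, LayerNorm, and activations, so treat it as an order-of-magnitude estimate rather than a measurement.

```python
# Back-of-envelope sizing for a BERT-base-like encoder.
# Helper names are illustrative; config values are the usual BERT-base ones.
VOCAB, MAX_POS, TYPE_VOCAB = 30522, 512, 2
HIDDEN, FFN, LAYERS = 768, 3072, 12


def count_params():
    """Parameters of the encoder plus pooler (no task-specific head)."""
    layer_norm = 2 * HIDDEN  # scale and bias
    embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
    # Q, K, V and output projections, each with a bias, plus one LayerNorm.
    attention = 4 * (HIDDEN * HIDDEN + HIDDEN) + layer_norm
    # Two feed-forward projections with biases, plus one LayerNorm.
    ffn = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN) + layer_norm
    pooler = HIDDEN * HIDDEN + HIDDEN
    return embeddings + LAYERS * (attention + ffn) + pooler


def approx_forward_flops(seq_len):
    """Rough matmul FLOPs for one forward pass over one sequence."""
    linear_weights = 4 * HIDDEN * HIDDEN + 2 * HIDDEN * FFN
    # 2 FLOPs per weight per token, plus QK^T and attention-times-V,
    # which each cost about 2 * seq_len * HIDDEN per token.
    per_token_per_layer = 2 * linear_weights + 4 * seq_len * HIDDEN
    return LAYERS * per_token_per_layer * seq_len


if __name__ == "__main__":
    print(f"{count_params():,} parameters")  # 109,482,240 parameters
    print(f"{approx_forward_flops(128) / 1e9:.1f} GFLOPs per 128-token input")
```

Under these assumptions a single 128-token input costs on the order of tens of GFLOPs, so the serving bill scales directly with request volume and sequence length.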