I have trained a vision transformer (ViT) several times and noticed that training on images takes a long time, especially as the number of transformer blocks increases. Is there a way to reduce the transformer's computational cost while still extracting the most useful information from the image? In other words, is there any approach or idea that helps make the transformer more efficient?