I'm trying to work out what the typical pipeline stages of an L1 cache are. The attached file describes a 3-cycle one, like those found in Silvermont, Jaguar, and Cortex-A9. The notation conventions are:
However, high-end CPUs such as Haswell, Bulldozer, and Cortex-A15 have a 4-cycle L1 cache access latency. Where does the fourth cycle come from? Could someone explain in detail what each of the four stages does?