This is a question without a single answer. Many of the variables and dependencies are impossible to pin down, and they change with design parameters, technology generation/features, logic/circuit design methodologies and capabilities, and (very importantly) the design of the memory hierarchy and the workload(s) under consideration. Even the basic metrics (power and performance) are subject to much discussion: how should one trade off power vs. performance to make a "fair" comparison? Nonetheless, it would be interesting to hear folks' opinions and analysis on this topic.

To start things off, let's assume that you need to meet a certain ("relatively high") performance level, measured as performance per thread in a multi-core/multi-threaded microprocessor, but you are also subject to a per-thread power constraint (typical of today's high-performance processors). Obviously, the deeper the pipeline, the higher the achievable operating frequency. Alternatively, that frequency headroom can be traded for power by lowering the operating voltage. However, lowering the design FO4 (the logic depth per cycle, in fanout-of-4 inverter delays) and deepening the pipeline will impact the power-performance of the design in a variety of ways:
1) The number of flops in the design increases, driving up power, especially the power used for clocking. Also, the delay overhead of the flop and any clock uncertainty takes a relatively larger bite out of the cycle time.
2) "Design difficulty" increases, since logic has to be divided more finely to achieve the shorter cycle time. Increased timing/device modeling accuracy is also called for. Tighter signal slews will be required, and the number of repeaters/buffers will increase.
3) CPI will increase, since miss penalties (measured in numbers of cycles) will increase. Also, in light of #2, the design may be pushed towards a simpler microarchitecture.
4) There may be many other costs, depending on the design, including soft error rate (SER) impacts, power/current density issues, and cache design issues.
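To make points 1 and 3 a bit more concrete, here is a toy first-order model in Python. It is only a sketch: every parameter value (total logic depth, latch overhead, hazard and miss rates, latch power fraction) is a made-up placeholder rather than data from any real design, power is taken as proportional to frequency at fixed voltage, and the energy-delay-squared metric is just one assumed way of folding power into the comparison.

```python
from dataclasses import dataclass


@dataclass
class PipelineModel:
    # All defaults below are hypothetical placeholders, not measured data.
    total_logic_fo4: float = 180.0       # total logic depth to be pipelined
    overhead_fo4: float = 3.0            # flop delay + clock uncertainty per stage
    base_cpi: float = 1.0                # CPI with no hazards or misses
    hazards_per_instr: float = 0.05      # pipeline-depth-dependent stall events
    hazard_fraction: float = 0.5         # fraction of the pipe lost per hazard
    misses_per_instr: float = 0.005      # memory misses per instruction
    miss_latency_fo4: float = 4000.0     # miss latency, fixed in absolute time
    latch_power_per_stage: float = 0.05  # latch/clock power relative to logic

    def stages(self, cycle_fo4: float) -> float:
        """Pipeline depth needed to fit the logic into the given cycle time."""
        logic_fo4 = cycle_fo4 - self.overhead_fo4
        if logic_fo4 <= 0:
            raise ValueError("cycle time must exceed the per-stage overhead")
        return self.total_logic_fo4 / logic_fo4

    def cpi(self, cycle_fo4: float) -> float:
        """CPI grows as the cycle shrinks: penalties cost more cycles."""
        hazard = self.hazards_per_instr * self.hazard_fraction * self.stages(cycle_fo4)
        miss = self.misses_per_instr * self.miss_latency_fo4 / cycle_fo4
        return self.base_cpi + hazard + miss

    def time_per_instr_fo4(self, cycle_fo4: float) -> float:
        """Average time per instruction in FO4 delays (lower is faster)."""
        return self.cpi(cycle_fo4) * cycle_fo4

    def relative_power(self, cycle_fo4: float) -> float:
        """Relative power: frequency times (logic + per-stage latch) switching."""
        switched = 1.0 + self.latch_power_per_stage * self.stages(cycle_fo4)
        return switched / cycle_fo4

    def energy_delay_squared(self, cycle_fo4: float) -> float:
        """ED^2 per instruction = power * time^3 (assumed power-aware metric)."""
        return self.relative_power(cycle_fo4) * self.time_per_instr_fo4(cycle_fo4) ** 3


def best_cycle(metric, candidates):
    """Return the candidate cycle time (in FO4) that minimizes the metric."""
    return min(candidates, key=metric)


if __name__ == "__main__":
    model = PipelineModel()
    candidates = range(5, 51)  # cycle times from 5 to 50 FO4
    print("performance-only optimum:", best_cycle(model.time_per_instr_fo4, candidates))
    print("ED^2 optimum:", best_cycle(model.energy_delay_squared, candidates))
```

With these placeholder numbers the performance-only optimum lands at a much shorter cycle time than the power-aware one, which is the qualitative effect described above. The exact values depend entirely on the assumed parameters, so treat the script as a way to explore sensitivities, not as a prediction.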
I'll start by suggesting that, given where most commercial designs seem to be landing, 10FO4 is probably well below the ideal cycle time (despite considerable literature which might suggest otherwise). At the other extreme, something like 50FO4 is probably too high. Comments, anyone?