I've come across several optimization techniques that take care of performance, energy, and leakage, but they look inefficient when we move to 32 nm or so?
You might want to investigate the IBM Blue Gene family of supercomputers. They use a Massively Parallel Processing model for their computing. Each node/core/chip might not be clocked at the fastest possible level, but by throwing *lots* of nodes/cores at the problem, they could simultaneously be on the top500.org list AND the Green500 list. http://en.wikipedia.org/wiki/Blue_Gene
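To see why many slower cores can beat one fast core on energy, here is a rough sketch using the textbook dynamic power relation P = C * V^2 * f. Everything in it is an assumption for illustration, not Blue Gene data: the voltage model in `scaled_voltage`, the numbers, perfectly parallel work, and the fact that leakage is ignored entirely.

```python
def scaled_voltage(freq_ghz, f_max=3.0, v_min=0.6, v_max=1.2):
    """Assumed (hypothetical) model: supply voltage falls linearly
    from v_max at f_max towards v_min as the clock is lowered."""
    return v_min + (v_max - v_min) * freq_ghz / f_max


def dynamic_power(cores, freq_ghz, cap=1.0):
    """Relative dynamic power of identical cores: N * C * V^2 * f.
    Leakage power is deliberately left out of this sketch."""
    volt = scaled_voltage(freq_ghz)
    return cores * cap * volt ** 2 * freq_ghz


def throughput(cores, freq_ghz):
    """Idealized throughput: the work is assumed to parallelize perfectly."""
    return cores * freq_ghz


# One fast core versus four cores at a quarter of the clock rate.
fast = {"cores": 1, "freq_ghz": 3.0}
slow = {"cores": 4, "freq_ghz": 0.75}

for name, cfg in (("1 fast core", fast), ("4 slow cores", slow)):
    print(name,
          "throughput:", throughput(**cfg),
          "relative power:", round(dynamic_power(**cfg), 3))
```

Under these assumptions both configurations deliver the same throughput, but the four slow cores draw noticeably less dynamic power because voltage enters the formula squared. In practice the gain is smaller: real workloads do not parallelize perfectly, and every extra core adds leakage, which matters more at small process nodes.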
There are two basic approaches to getting high performance and low energy consumption simultaneously in a processor:
1. Find an architectural solution that gives the best performance/equipment volume tradeoff. This normally involves some combination of parallelism, pipelining, and iteration, with the general goal of matching the processor's structure to the features of the algorithm it is intended to execute, and of exploiting that algorithm's temporal and spatial characteristics while the equipment volume remains acceptable. As a result, it also yields the highest possible resource usage coefficient.
2. Find a low energy consumption platform for the processor's implementation. As of today, that could be an FPGA. Of course, you can make a full-custom VLSI design and get even lower energy consumption, but that would cost much more.
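The performance/equipment volume tradeoff in point 1 can be illustrated with a small sketch. It assumes a hypothetical algorithm that needs a fixed number of identical, independent operations per sample, and it treats the resource usage coefficient as the fraction of unit-cycles that do useful work. The function name `schedule` and the numbers are made up for the example.

```python
import math


def schedule(ops_per_sample, units):
    """Return (cycles per sample, resource usage coefficient) when
    `ops_per_sample` independent operations are spread over `units`
    identical processing units, one operation per unit per cycle."""
    cycles = math.ceil(ops_per_sample / units)
    utilization = ops_per_sample / (units * cycles)
    return cycles, utilization


# Hypothetical algorithm: 8 operations per sample.
for units in (1, 2, 3, 4, 8):
    cycles, utilization = schedule(ops_per_sample=8, units=units)
    print(f"{units} units: {cycles} cycles/sample, "
          f"utilization {utilization:.2f}")
```

In this toy model, adding units shortens the time per sample, but a unit count that does not divide the operation count (3 units for 8 operations) leaves hardware idle in the last cycle. Matching the structure to the algorithm means picking a point where the extra equipment is actually kept busy.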