I am looking into some 2D flow-simulation algorithms that switch back and forth between real space and wavenumber space using FFTs. How do the available FFT implementations compare in performance, and how well do they scale with the number of cores on a multicore system? (The FFTW benchmark overview at http://www.fftw.org/speed/ doesn't include any runs beyond dual-core systems.)
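To make the access pattern concrete, here is a minimal sketch of the kind of real-to-wavenumber round trip meant above, using NumPy's FFT purely as a stand-in for whichever library ends up being benchmarked. The function names `spectral_derivative_x` and `time_round_trip`, the periodic domain of length 2π, and the grid sizes are all assumptions made for illustration; this is not a benchmark of any particular implementation and it is single-threaded as written.

```python
import time
import numpy as np

def spectral_derivative_x(field, length=2.0 * np.pi):
    """d/dx of a periodic 2D field: forward FFT, multiply by i*kx, inverse FFT.

    Assumes x runs along axis 1 and the domain is periodic with the given length.
    """
    ny, nx = field.shape
    # Angular wavenumbers along x for a grid spacing of length / nx.
    kx = 2.0 * np.pi * np.fft.fftfreq(nx, d=length / nx)
    field_hat = np.fft.fft2(field)                    # real space -> wavenumber space
    dfield_hat = 1j * kx[np.newaxis, :] * field_hat   # differentiation is a multiply
    return np.fft.ifft2(dfield_hat).real              # wavenumber space -> real space

def time_round_trip(n, repeats=5):
    """Best wall-clock time (seconds) of one forward + inverse 2D FFT on an n x n grid."""
    rng = np.random.default_rng(0)
    field = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.fft.ifft2(np.fft.fft2(field))
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    # Hypothetical grid sizes; a real flow solver would pick its own.
    for n in (256, 512, 1024):
        print(f"{n} x {n}: {time_round_trip(n):.4f} s per round trip")
```

Each time step of such a solver performs several of these round trips, so the per-transform cost and how it changes with thread count is what the question is really about.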