I'm asking those who have experience running MPI in various environments.
MPI can be used on simple Beowulf clusters, for example four PCs connected to a single switch (or even a hub), where all data can be broadcast (or multicast) to every host. It can also be used on large supercomputers with complicated topologies and large network diameters.
It seems these different cases call for different program optimizations, don't they?
In particular, are the collective operations efficient on supercomputers like BlueGene? AFAIK, modern algorithms (like SUMMA) often use MPI_Reduce, at least...
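To make the question concrete: my understanding (an assumption on my part, not something I've checked against any particular MPI implementation) is that a reduction can be organized as a binomial tree, so that P ranks need about log2(P) communication rounds instead of P-1. Here is a minimal pure-Python sketch of that pattern, with no MPI involved; `tree_reduce` is a made-up name for illustration, and the sketch ignores network topology entirely, which is exactly the part I'm unsure about.

```python
def tree_reduce(values):
    """Sum one value per simulated rank using a binomial-tree pattern.

    Hypothetical helper for illustration only; it is not an MPI call.
    Assumes at least one value. Returns (total, rounds), where total
    ends up on simulated rank 0, like the root of a reduce.
    """
    partial = list(values)  # partial[r] is the running sum held by rank r
    n = len(partial)
    rounds = 0
    step = 1
    while step < n:
        # In this round, rank r + step "sends" its partial sum to rank r.
        # All of these transfers could happen in parallel on a real machine.
        for r in range(0, n - step, 2 * step):
            partial[r] += partial[r + step]
        step *= 2
        rounds += 1
    return partial[0], rounds


total, rounds = tree_reduce(range(1024))
print(total, rounds)
```

With 1024 simulated ranks this takes 10 rounds, but each round assumes every pair of ranks can talk at the same cost, which is surely not true on a machine with a large diameter.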
What are your experiences? How should programs be designed for MPI?