If so, the memory space is shared, so you can directly access the results of the different worker threads and copy them into their final location.
You need to implement a locking mechanism (a mutex) to make sure you don't get partial results or memory garbage (a minimal thread/mutex sketch follows after case 3).
2) Are you running separate processes on the same node?
You need to implement a shared-memory communication framework, or (local) sockets. Other than that, it's similar to case 1 (use signals/semaphores instead of mutexes).
3) Are you distributing over the network, between different nodes?
You need a network protocol, in addition to everything in cases 1 and 2.
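Here is a minimal sketch of case 1, assuming a made-up compute_chunk() function and thread count; the point is only that the shared result container is guarded by a mutex so no reader ever sees a partial write:

```cpp
#include <mutex>
#include <thread>
#include <vector>

static std::mutex results_mutex;
static std::vector<double> results;              // shared final location

double compute_chunk(int id) { return id * 1.0; } // stand-in for the real work

int main() {
    const int NUM_THREADS = 4;                    // hypothetical worker count
    std::vector<std::thread> workers;
    for (int i = 0; i < NUM_THREADS; ++i) {
        workers.emplace_back([i] {
            double partial = compute_chunk(i);    // this thread's chunk of work
            std::lock_guard<std::mutex> lock(results_mutex);
            results.push_back(partial);           // copy into the shared result
        });
    }
    for (auto& t : workers) t.join();             // wait for all workers
    return 0;
}
```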
Also, what is the structure of the resulting data?
Is it just an array that you split up into chunks between the different CPUs?
2D/3D etc. arrays?
Something more complex?
Are you doing homogeneous multi-processing (all the CPUs run the same code, on different chunks of the data)?
Or is it heterogeneous (different CPUs run different code and solve different parts of a larger problem)?
Without looking at your code: if you're just joining 1-D arrays of data into one long array on the root processor, possibly with a variable-length source array per worker processor, you would use MPI_Gatherv.
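A minimal sketch of that pattern (the per-rank lengths here are invented for illustration): each rank first tells the root how much it will send, the root builds the displacements, and MPI_Gatherv joins everything into one long array on rank 0.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = rank + 1;                      // made-up per-rank length
    std::vector<double> local(local_n, rank);    // this rank's partial result

    // Root needs to know how much each rank will send.
    std::vector<int> counts(size), displs(size);
    MPI_Gather(&local_n, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

    std::vector<double> merged;
    if (rank == 0) {
        int total = 0;
        for (int i = 0; i < size; ++i) { displs[i] = total; total += counts[i]; }
        merged.resize(total);
    }

    MPI_Gatherv(local.data(), local_n, MPI_DOUBLE,
                merged.data(), counts.data(), displs.data(), MPI_DOUBLE,
                0, MPI_COMM_WORLD);              // merged is valid only on rank 0

    MPI_Finalize();
    return 0;
}
```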
I would say that regardless of the name of the method, the basic idea is the same: you gradually fold the results produced by the different CPUs into a single one, halving the number of participating CPUs in each step.
Let's say you have N CPUs, each of which gives you a value, but all of these values are only fractions of the final result. The algorithm in that case would be, roughly:
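A rough sketch of that halving scheme with MPI point-to-point calls, assuming a power-of-two number of ranks and a simple sum as the fold operation; in real code you would normally just call MPI_Reduce and let the library pick the tree:

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double value = rank + 1.0;                   // each CPU's partial result

    // In each step the upper half of the active ranks hands its value to a
    // partner in the lower half; after log2(N) steps rank 0 holds the result.
    for (int active = size; active > 1; active /= 2) {
        int half = active / 2;
        if (rank >= half && rank < active) {
            MPI_Send(&value, 1, MPI_DOUBLE, rank - half, 0, MPI_COMM_WORLD);
        } else if (rank < half) {
            double other;
            MPI_Recv(&other, 1, MPI_DOUBLE, rank + half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            value += other;                      // fold the partner's value in
        }
    }

    MPI_Finalize();
    return 0;
}
```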
Although you mention Geant4, you haven't really described which results you're trying to merge. Such merging may well depend on the physics, rather than on the trivial advice to use gather/reduce. Mostly, it's not really clear what you have done and what you want to do. My impression is that Geant4 has reasonable MPI support built in.
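For completeness, here is a hedged illustration of the "gather/reduce" advice: summing a per-rank array of counters (for example, histogram-like bins accumulated by each worker) into a single array on rank 0 with MPI_Reduce. The bin count and contents are invented.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int NBINS = 100;                       // hypothetical histogram size
    std::vector<long> local(NBINS, rank);        // each rank's partial counts
    std::vector<long> total(NBINS, 0);           // meaningful only on rank 0

    // Element-wise sum of all ranks' bins onto rank 0.
    MPI_Reduce(local.data(), total.data(), NBINS, MPI_LONG, MPI_SUM,
               0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```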