For software architecture reasons I can't currently control, we acquire data through a singleton library that is not thread-safe, and we cannot run multiple instances of it for the same experiment. Only one valid instance of the source data structure exists at any one time. Our current design therefore has a master thread copy that data into newly allocated internal data structures in our post-processing system and send them off to worker threads.

What I would like is a parallel version of the standard memcpy call that hands off the copying task to an internally managed thread pool, ideally in both synchronous and asynchronous versions. Since one serious bottleneck right now is everything that bogs down the master thread, distributing the copying work would make sense. Naturally, we would need to synchronize before calling the "next event" method of our library API again. The task is not trivial to parallelize, though, since different architectures have different memory controller layouts, and you could even run into cache starvation issues if multiple threads access nearby ranges. An auto-tuning, or at least architecture-aware, library would be ideal.
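To make the request concrete, here is a minimal sketch of what I have in mind, using only the C++ standard library. The names parallel_memcpy and parallel_memcpy_async, the 1 MiB minimum chunk size and the 64-byte cache line are my own assumptions for illustration, not an existing API and not tuned values. For brevity it spawns threads on every call instead of reusing an internally managed thread pool, and it does no architecture-aware tuning at all.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper: copies n bytes from src to dst by splitting the range
// into contiguous chunks, one per thread. Returns once every chunk is copied.
inline void parallel_memcpy(void* dst, const void* src, std::size_t n,
                            unsigned num_threads = std::thread::hardware_concurrency()) {
    assert(n == 0 || (dst != nullptr && src != nullptr));
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency() may return 0

    // Assumed threshold: do not give a thread less than 1 MiB of work.
    constexpr std::size_t kMinChunk = std::size_t{1} << 20;
    // Assumed cache line size, used to keep chunk boundaries from splitting a line.
    constexpr std::size_t kCacheLine = 64;

    const std::size_t max_useful = std::max<std::size_t>(1, n / kMinChunk);
    const std::size_t tasks = std::min<std::size_t>(num_threads, max_useful);
    if (tasks <= 1) {
        if (n != 0) std::memcpy(dst, src, n);  // small copy: plain memcpy
        return;
    }

    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);

    // Even split, rounded up to a multiple of the assumed cache line size.
    std::size_t chunk = (n + tasks - 1) / tasks;
    chunk = (chunk + kCacheLine - 1) / kCacheLine * kCacheLine;

    std::vector<std::thread> workers;
    workers.reserve(tasks - 1);
    for (std::size_t i = 1; i < tasks; ++i) {
        const std::size_t off = i * chunk;
        if (off >= n) break;  // rounding can leave trailing tasks with no work
        const std::size_t len = std::min(chunk, n - off);
        workers.emplace_back([=] { std::memcpy(d + off, s + off, len); });
    }

    // The calling thread copies the first chunk instead of idling.
    std::memcpy(d, s, std::min(chunk, n));

    for (auto& w : workers) w.join();
}

// Hypothetical asynchronous variant: returns immediately. The caller must
// wait on the future before the source buffer is invalidated.
inline std::future<void> parallel_memcpy_async(
        void* dst, const void* src, std::size_t n,
        unsigned num_threads = std::thread::hardware_concurrency()) {
    return std::async(std::launch::async,
                      [=] { parallel_memcpy(dst, src, n, num_threads); });
}
```

In our setting the master thread would call parallel_memcpy_async(dst, src, n), continue with other work, and call get() on the returned future before invoking the "next event" method, since the library may overwrite the source buffer at that point. Whether this actually beats a single memcpy depends on how much memory bandwidth one core can already saturate on a given machine, which is exactly the part I would like a library to tune for me.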

Suggestions for where to look would be appreciated. The code currently uses Boost but no BLAS or LAPACK implementation, so even linking against those would increase the maintenance and portability burden.
