When solving PDEs on unstructured grids, the element loop accesses the nodal values through the element connectivity table, which results in an arbitrary, scattered memory access pattern. The grid can of course be reordered using various algorithms, such as Cuthill-McKee, Hilbert curves and friends, but this does not really help: the accesses remain indirect, and since the ordering is not known at compile time, neither the compiler nor the CPU can predict them well. Vectorization is therefore an additional problem, and the CPU pipelines will always be underutilized because of the fine-grained, element-by-element processing.
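To make the access pattern concrete, here is a minimal sketch of the kind of element loop I mean. It assumes linear triangles (3 nodes per element) and uses a placeholder element kernel (deviation of each nodal value from the element mean) instead of a real PDE operator. The names `element_loop`, `conn`, `u` and `r` are only illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define NODES_PER_ELEM 3 /* assumption: linear triangles */

/*
 * conn[e * NODES_PER_ELEM + k] holds the global node index of local node k
 * of element e. u holds nodal values, r accumulates a nodal result.
 */
void element_loop(size_t n_elem, const int *conn, const double *u, double *r)
{
    assert(conn != NULL && u != NULL && r != NULL);

    for (size_t e = 0; e < n_elem; ++e) {
        const int *c = &conn[e * NODES_PER_ELEM];

        /* Gather: three indirect loads, addresses known only at run time. */
        double u0 = u[c[0]];
        double u1 = u[c[1]];
        double u2 = u[c[2]];

        /* Placeholder element kernel (not a real discretization). */
        double mean = (u0 + u1 + u2) / 3.0;

        /* Scatter-add: indirect read-modify-write on shared nodes. */
        r[c[0]] += mean - u0;
        r[c[1]] += mean - u1;
        r[c[2]] += mean - u2;
    }
}
```

The gather and the scatter-add both go through `conn`, so the addresses depend on the mesh and not on the loop index, and neighbouring elements may write to the same entry of `r`.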
I have in mind to group the elements somehow and to use derived data types, so that I can vectorize over each subset ... but honestly, I am a pessimistic person and I do not really think that this will solve the issue. I know many codes where nothing special is done. I have searched the literature in various ways, but I have not found anything that gives me new ideas; maybe I was using the wrong keywords.
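The grouping idea could look roughly like the sketch below: process the elements in fixed-size batches, gather the nodal values of a batch into small contiguous local arrays, run the element kernel over the batch with unit stride, and scatter the results back afterwards. This is only a sketch under assumptions: linear triangles, a placeholder kernel (deviation from the element mean), and a hypothetical batch size `BATCH` of 8. Whether the compiler really vectorizes the middle loop depends on the compiler and its flags, and I make no claim about the speedup.

```c
#include <assert.h>
#include <stddef.h>

#define NODES_PER_ELEM 3 /* assumption: linear triangles */
#define BATCH 8          /* hypothetical batch size, e.g. one SIMD register */

void element_loop_batched(size_t n_elem, const int *conn,
                          const double *u, double *r)
{
    assert(conn != NULL && u != NULL && r != NULL);

    for (size_t e0 = 0; e0 < n_elem; e0 += BATCH) {
        /* Number of elements in this batch (the last one may be short). */
        size_t nb = (n_elem - e0 < BATCH) ? (n_elem - e0) : BATCH;

        /* Local storage, one contiguous row per local node. */
        double ul[NODES_PER_ELEM][BATCH];
        double rl[NODES_PER_ELEM][BATCH];

        /* 1. Gather: the indirect accesses are isolated in this loop. */
        for (int k = 0; k < NODES_PER_ELEM; ++k)
            for (size_t l = 0; l < nb; ++l)
                ul[k][l] = u[conn[(e0 + l) * NODES_PER_ELEM + k]];

        /* 2. Compute: unit-stride loop over the batch, no indirection.
         *    Placeholder kernel, not a real discretization. */
        for (size_t l = 0; l < nb; ++l) {
            double mean = (ul[0][l] + ul[1][l] + ul[2][l]) / 3.0;
            for (int k = 0; k < NODES_PER_ELEM; ++k)
                rl[k][l] = mean - ul[k][l];
        }

        /* 3. Scatter-add: kept sequential, because elements of the same
         *    batch may share nodes and would otherwise conflict. */
        for (size_t l = 0; l < nb; ++l)
            for (int k = 0; k < NODES_PER_ELEM; ++k)
                r[conn[(e0 + l) * NODES_PER_ELEM + k]] += rl[k][l];
    }
}
```

Only step 2 is a candidate for vectorization here; steps 1 and 3 still pay for the indirect addressing, which is exactly why I am not sure the approach pays off for cheap element kernels.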
Anyway, maybe somebody is willing to share their thoughts.
Thanks in advance ...
Aron