I'm studying locality optimizations on ccNUMA architectures. I've started reading the book Introduction to High Performance Computing for Scientists and Engineers, but there is something that I don't understand. Using the vector triad (A[i]=B[i]+C[i]*D[i]) as an example, it says that if we initialize the arrays serially and the working set fits into the aggregate cache, the scalability is good. What does it mean by that, and how does this work? (Imagine 4 locality domains with 2 cores each and 2 threads per processor.)
I'm talking about the area for N
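To make the setup concrete, here is a minimal sketch of the scenario as I understand it, assuming OpenMP and a first-touch page placement policy (the default on Linux). The function names `init_serial`, `triad_sweeps`, and `run_triad` are my own and not from the book. Because one thread initializes the arrays, all pages presumably end up in that thread's locality domain. With static scheduling, each thread works on the same chunk in every sweep, so if the four arrays (4 * N * 8 bytes) fit into the combined caches of all cores in use, the repeated sweeps could run from cache and the memory placement would stop mattering after the first sweep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Serial initialization: a single thread touches every element first.
 * Under a first-touch policy (an assumption about the OS), all pages
 * are then mapped into the locality domain of that one thread. */
static void init_serial(double *a, double *b, double *c, double *d, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
        d[i] = 3.0;
    }
}

/* Parallel vector triad, repeated `sweeps` times. schedule(static) gives
 * each thread the same index range in every sweep, so a chunk loaded
 * into a core's cache in sweep 1 can be reused in later sweeps. */
static void triad_sweeps(double *a, const double *b, const double *c,
                         const double *d, size_t n, int sweeps)
{
    for (int s = 0; s < sweeps; ++s) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < (long)n; ++i)
            a[i] = b[i] + c[i] * d[i];
    }
}

/* Hypothetical driver: allocate, initialize serially, run the triad,
 * and return the last element of A as a simple correctness check.
 * Returns -1.0 if allocation fails. */
double run_triad(size_t n, int sweeps)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    double *d = malloc(n * sizeof *d);
    double result = -1.0;

    if (a && b && c && d) {
        init_serial(a, b, c, d, n);
        triad_sweeps(a, b, c, d, n, sweeps);
        result = a[n - 1];
    }
    free(a);
    free(b);
    free(c);
    free(d);
    return result;
}
```

Compiled with an OpenMP flag such as `-fopenmp` (without it the pragma is ignored and the loop runs serially), calling `run_triad(N, sweeps)` for small and large N should let one compare the two regimes: for large N every thread would have to fetch its data from the single locality domain that holds the pages.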