Maybe this isn't helpful at all... I think that to get maximum performance in a threaded (parallel processing) environment, you are given the flexibility to group your data and follow your own hierarchy (based on the APIs provided)!
Well, yes, strictly speaking none of the answers are helpful, except that they confirm it is hard to find any tutorials on how to build something from the NPP. The primitives themselves are adequately to well documented, but I find it strange that there is this void on how to tie them together into something sensible. It would be fun to make something myself in C, because I know that language, instead of using packages such as Torch and Caffe that use the NPP.
I guess they are pretty much in C (me too, I am a fan of C), as you said, for low-level flexibility, and (un)fortunately that is a way in which hardware dependency is ascertained....
The flip side of using the packages is that low-level data handling is less complicated....