I'm trying to add kernel profiling to my source code, and it needs to collect multiple metrics for CUDA kernels. I've tried PAPI with the CUDA component, but it didn't work out. I also looked at the NVIDIA CUPTI API, but the sample only profiles one metric per kernel, and I'm a bit confused about how to change the sample source code so that it collects multiple events.
PS: the nvprof tool doesn't help, because the profiling must happen from within my source code.