In my own experience, most applications show some runtime improvement as more cores are added. Eventually the improvement reaches an asymptote, and performance can even decline as still more cores are brought into the picture.
Beyond the number of cores, one must also consider the volume of memory available, the speed of that memory, the communication time between cores and between cores and memory, and what other applications are running at the same time. Communication time will also increase as the number of cores is expanded across networked nodes.
Try experimenting with simple trial runs, then use those results as a baseline for your main experiment.
Long answer: This process is called benchmarking, and it is very important in the field of high-performance computing. Depending on your hardware, combining the two parallelization layers of distributed memory (processes, MPI) and shared memory (OpenMP threads) in certain ratios can work around given hardware limitations (bottlenecks). However, you need to test what is fastest: e.g. 1 process with 32 threads, 2 processes with 16 threads, and so on. Also, for this kind of calculation, I highly recommend omitting any logical cores (Hyper-Threading and similar), since memory bandwidth is normally the limiting factor here. Hence, the best performance is often achieved by binding the processes to only the physical cores, using e.g. taskset.
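As a rough sketch of such a sweep, the script below enumerates process/thread combinations for a fixed core budget; the solver name `my_app` and the total of 32 physical cores are placeholders for your own setup, and the actual `mpirun` invocation is left as a comment to adapt (flags like `--bind-to core` are Open MPI syntax):

```shell
#!/bin/sh
# Sweep over MPI-process / OpenMP-thread ratios for a fixed core budget.
# TOTAL is the number of PHYSICAL cores (exclude Hyper-Threading siblings).
TOTAL=32

for PROCS in 1 2 4 8 16 32; do
    THREADS=$((TOTAL / PROCS))
    echo "candidate: $PROCS MPI processes x $THREADS OpenMP threads"
    # Example invocation to time each combination (adapt to your scheduler/app):
    # OMP_NUM_THREADS=$THREADS mpirun -np $PROCS --bind-to core ./my_app
done
```

Each combination uses all 32 cores; the point is to time every run and keep the fastest ratio. Pinning with `--bind-to core` (or `taskset -c 0-31`) keeps the processes off the logical cores.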