It is generally better to use 128 or 256 threads per block. There is some calculation involved in finding the most suitable number of threads per block; the most important factors are:
Maximum number of active threads (depends on the GPU)
Number of warp schedulers of the GPU
Number of active blocks per streaming multiprocessor, etc.
However, according to the CUDA manuals, 128 or 256 threads per block is a good choice if you do not want to worry about the deeper details of GPGPUs.
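For reference, the limits in that list can be read straight from the CUDA runtime. A minimal sketch (assuming device 0 and the standard runtime API):

```cuda
// Minimal sketch: print the per-device limits mentioned above.
// Compile with e.g. `nvcc query_limits.cu -o query_limits`.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed here

    printf("Device:                         %s\n", prop.name);
    printf("Warp size:                      %d\n", prop.warpSize);
    printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Multiprocessor count:           %d\n", prop.multiProcessorCount);
    return 0;
}
```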
Conference paper: Meta-programming and auto-tuning in the search for high perf...
We performed experiments varying these numbers and generated heat maps for a couple of simple kernels. One surprise was how large the hot areas were. Another surprise was that the occupancy calculator from NVIDIA seemed to be off target for some of these kernels.
I think that may be hard to answer. But maybe you are approaching this from the wrong direction. Think about the algorithm you are implementing instead: what is the optimal decomposition of that computation onto the resources offered by the GPU? I really don't think there is one single answer to these questions that is applicable in all situations. Some experimentation is probably always needed.
Thank you again for your reply. Actually, I found by experiment that 256 threads per block is the optimal option for my program, but a reviewer asked me to give an explanation or justification for this, not just prove it by experiments.
I am fond of the experimental + tuning approach. If you want to show this by pointing at architectural specifics of the individual GPU you use, it will be very particular to that one setup. If someone else tries your code on a different GPU, they in turn may need to tweak it for their specifics. Experiments I did at one point showed that the "good"-performing setting can differ between GPUs (of course). I would like to run more such experiments over a larger selection of GPUs.
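For what it is worth, the kind of sweep I have in mind looks roughly like the sketch below; the `scale` kernel is just a placeholder for whatever kernel is actually being tuned, and the problem size and repetition count are arbitrary:

```cuda
// Rough sketch of a block-size sweep timed with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;               // arbitrary problem size
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int block = 32; block <= 1024; block *= 2) {
        int grid = (n + block - 1) / block;        // round up to cover all elements

        cudaEventRecord(start);
        for (int rep = 0; rep < 100; ++rep)        // repeat for a more stable timing
            scale<<<grid, block>>>(d_x, 1.0001f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block = %4d  ->  %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

Running something like this on different GPUs is exactly where the "good" setting starts to move around.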
Did you run your parameters through the NVIDIA occupancy calculator as a starting point? I think, though, that even when using the occupancy calculator you cannot be sure to get the best-performing setting. But if the occupancy calculator points at 256 being great, then you have the answer from the horse's mouth (the GPU manufacturer).
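As a side note, the CUDA runtime also exposes the same occupancy heuristic programmatically via cudaOccupancyMaxPotentialBlockSize; a minimal sketch, where `myKernel` is just a placeholder:

```cuda
// Ask the runtime which block size maximizes occupancy for a given kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // 0 bytes of dynamic shared memory, no upper limit on the block size.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    printf("Occupancy-suggested block size: %d (min grid size %d)\n",
           blockSize, minGridSize);
    return 0;
}
```

As with the calculator spreadsheet, the suggestion maximizes occupancy, which is not guaranteed to be the fastest setting for every kernel.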
Thank you very much, dear Bo Joel Svensson. I now understand the point. I will use the occupancy calculator to be sure of what I get. I agree with you: performance depends on each machine, and I cannot always generalize what I got on my machine to others.
You want to use a multiple of your warp size and your number of cores per MP, e.g. 32, 64, etc. Then you want to use as large a number of threads as possible, since the architecture is optimized for hiding memory latency by swapping out threads that are waiting on memory reads.
In my experience, though I have not tested on the latest architectures, I would start with the largest allowed number of threads, e.g. 512. I would get a good speedup with this, but I found that the optimum was less than the maximum, i.e. 256 threads.
The last architecture I tested for this was Pascal, and the best speedup was with 256 threads. I imagine this might increase in the future as the architecture scales up. This could also change for an application doing more calculation relative to communication at the kernel level; however, if your calculations are that intensive per core, you are probably not utilizing your memory to its full potential.
So, I would go with 256 threads for most applications.
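As a concrete illustration of that default, a launch sketch with 256 threads per block; the kernel name and problem size are placeholders:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; the real work would go here.
__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 2.0f * data[i];
}

void launch(float *d_data, int n) {
    const int threadsPerBlock = 256;  // a multiple of the 32-thread warp size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    process<<<blocks, threadsPerBlock>>>(d_data, n);
}
```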
We recently published an article testing different configurations of the number of threads per block on the NPB benchmarks. You can find it here: NAS Parallel Benchmarks with CUDA and beyond