Is this CUDA implementation of separable convolution optimal?

Hugh Lachlan Kennedy

OK. This is what I have found so far ...

The attached design doc makes sense overall (although I would disagree that "convolution measures the amount of overlap between two functions"). However, I found the sample code difficult to follow, so I wrote my own ...

I used the sample code as a performance benchmark and eventually I was able to match it. I found the profiler to be more reliable than timer calls for quantifying execution speed.
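When a profiler is not at hand, CUDA events are a common alternative to host-side timer calls, because kernel launches are asynchronous and a host timer can stop before the GPU has finished. The sketch below is only an illustration of that approach: `scaleKernel` and `averageKernelTimeMs` are hypothetical names, and the post does not say which timer calls were actually used.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel being measured.
__global__ void scaleKernel(float* data, int n, float factor)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the average time per launch in milliseconds, measured on the GPU
// timeline with CUDA events rather than with a host clock.
float averageKernelTimeMs(int n, int repeats)
{
    assert(n > 0 && repeats > 0);

    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    scaleKernel<<<grid, block>>>(d_data, n, 1.0f);  // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r)
        scaleKernel<<<grid, block>>>(d_data, n, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return elapsedMs / repeats;
}
```

Averaging over several launches after a warm-up run tends to give steadier numbers than timing a single call.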

Yes, using shared memory improves performance, despite the extra code and (wasted) threads required to populate it.

My code was much more compact than the sample code. Why are loops used to populate the shared memory tile in the sample code? I just used one thread per element to populate the tile, then synced the threads in the block before proceeding to the convolution.
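The one-thread-per-element loading scheme described above can be sketched as follows. This is not the code from the post (which was never shared in this thread) but a minimal row-pass illustration under stated assumptions: the names `convRowsShared`, `convolveRows`, `RADIUS`, `TILE_W` and `d_taps` are hypothetical, the kernel radius is fixed at compile time, and pixels outside the image are treated as zero.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int RADIUS = 4;    // assumed filter radius (9 taps)
constexpr int TILE_W = 120;  // outputs per block; 120 + 2*4 = 128 threads

__constant__ float d_taps[2 * RADIUS + 1];

__global__ void convRowsShared(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_W + 2 * RADIUS];

    const int row = blockIdx.y;
    // Column of the pixel this thread loads; the first and last RADIUS
    // threads of the block load the halo.
    const int col = blockIdx.x * TILE_W + threadIdx.x - RADIUS;

    // Each thread populates exactly one tile element (no loops).
    tile[threadIdx.x] = (col >= 0 && col < width) ? in[row * width + col] : 0.0f;
    __syncthreads();

    // Halo threads have done their job; only interior threads write output.
    const bool interior = threadIdx.x >= RADIUS && threadIdx.x < TILE_W + RADIUS;
    if (!interior || col >= width) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += d_taps[k + RADIUS] * tile[threadIdx.x - k];  // out[n] = sum_k h[k] x[n-k]
    out[row * width + col] = sum;
}

// Host wrapper: copies the image and taps to the device, runs the row pass,
// and copies the result back. Error checking is omitted for brevity.
std::vector<float> convolveRows(const std::vector<float>& image, int width, int height,
                                const std::vector<float>& taps)
{
    assert(static_cast<int>(taps.size()) == 2 * RADIUS + 1);
    assert(static_cast<int>(image.size()) == width * height);

    const size_t bytes = image.size() * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, image.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_taps, taps.data(), taps.size() * sizeof(float));

    const dim3 block(TILE_W + 2 * RADIUS);
    const dim3 grid((width + TILE_W - 1) / TILE_W, height);  // one block row per image row
    convRowsShared<<<grid, block>>>(d_in, d_out, width, height);

    std::vector<float> result(image.size());
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

In this layout 8 of every 128 threads only load halo pixels and then exit, which is the "wasted" thread cost mentioned above. A column pass would follow the same pattern with the tile running along `y`.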

My code runs at around 100% GPU occupancy compared with 50% for the sample code, yet the execution time is roughly the same (?).
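Occupancy only describes how many warps are resident on each multiprocessor relative to the hardware limit, so a kernel at 50% can run as fast as one at 100% if it already hides its memory latency. For anyone wanting to check the theoretical figure without a profiler, the runtime's occupancy API can be used roughly as below. `dummyKernel` and `theoreticalOccupancy` are hypothetical names for illustration, and the result is the theoretical value, not the achieved occupancy a profiler reports.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel whose occupancy is queried.
__global__ void dummyKernel(float* data, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns the theoretical occupancy (between 0 and 1) of dummyKernel on the
// current device for the given block size and dynamic shared memory usage.
float theoreticalOccupancy(int blockSize, size_t dynamicSmemBytes = 0)
{
    assert(blockSize > 0);

    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);

    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, dummyKernel,
                                                  blockSize, dynamicSmemBytes);

    const int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    const int activeWarps = blocksPerSm * warpsPerBlock;
    const int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    return static_cast<float>(activeWarps) / static_cast<float>(maxWarps);
}
```

Calling `theoreticalOccupancy(128)` and `theoreticalOccupancy(256)` for the same kernel shows how the block size alone changes the figure.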

My row and column calls have approximately equal run times, whereas the sample code's run times differ greatly between the two dimensions, so its total run time will depend on the image aspect ratio!

In my code, I zero the output around the perimeter of the image, where edge effects occur.
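One simple way to zero the perimeter is a separate pass that clears every output pixel lying within one kernel radius of the image border. The sketch below assumes that approach; the post does not say whether the zeroing happens inside the convolution kernels or afterwards, and `zeroBorderKernel` and `zeroBorder` are hypothetical names.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Sets to zero every pixel closer than `radius` to any image edge.
__global__ void zeroBorderKernel(float* img, int width, int height, int radius)
{
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;

    const bool onBorder = col < radius || col >= width - radius ||
                          row < radius || row >= height - radius;
    if (onBorder) img[row * width + col] = 0.0f;
}

// Host wrapper operating on a row-major image. Error checking is omitted.
std::vector<float> zeroBorder(const std::vector<float>& image, int width, int height,
                              int radius)
{
    assert(static_cast<int>(image.size()) == width * height);
    assert(radius >= 0);

    const size_t bytes = image.size() * sizeof(float);
    float* d_img = nullptr;
    cudaMalloc(&d_img, bytes);
    cudaMemcpy(d_img, image.data(), bytes, cudaMemcpyHostToDevice);

    const dim3 block(16, 16);
    const dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    zeroBorderKernel<<<grid, block>>>(d_img, width, height, radius);

    std::vector<float> result(image.size());
    cudaMemcpy(result.data(), d_img, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    return result;
}
```

The same border test could instead be folded into the convolution kernels so that no extra pass over the image is needed.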

I'll post my code later, if there is any interest.

Munesh Chauhan

Hi Hugh,

It is great to see that you have developed a much better optimized version of the convolution code (CUDA samples).

Can you please share the code along with a short description of how it works? It would benefit all of us who are looking into this particular CUDA sample.

I am working on optimizing this code and wish to look into the already optimized versions.

Sinay Ury

Hello Hugh,

Could you please share the optimized convolution code? It is important for me to study it.

Thank you in advance!