I would like to know whether it is possible to use the concepts of queuing theory for performance analysis of an FPGA implementation. By FPGA implementation I mean a standalone FPGA implementation, not the FPGA used as a co-processor.
Thank you, Mr. Hussein. I would like to bring to your notice that both SDx and SDSoC are based on the principle of hardware/software co-design. They offload the CPU's workload to the FPGA; in other words, the FPGA acts as a coprocessor there. My requirement, however, is to design a performance model for an FPGA-only implementation, for example a complete CNN application realized only in the FPGA.
I would say that it's difficult to answer "yes" or "no" directly without knowing a bit about the architecture you want to implement. Basically, an FPGA is a very large-scale "raw" matrix of binary operators, and any operator can be connected to any other. It's massively parallel; that is even the core concept of an FPGA. A sequential operation, while possible, is not the best way to use an FPGA.
An algorithm implemented in an FPGA is "synchronized by the data": at each clock cycle, the operators produce results. So most of the time there are not even queues in an FPGA, only pipelines. I don't really know queuing theory; maybe it can be applied to pipelines as well.
I would say that if you want to use queuing theory for a CNN in an FPGA, you can. But you have to implement the algorithm in a way that lets you use it!
FPGAs are a kind of "raw" digital electronic material: you can do nearly everything you want in an FPGA, from a softcore processor to an MPSoC to a massively parallel signal processing algorithm, or a combination of these! So it's up to you to implement your CNN with mechanisms that allow performance analysis using queuing theory.
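To make that last point a bit more concrete, here is a minimal sketch of one way a FIFO-fed processing stage could be treated as a queue. Everything in it is an assumption for illustration, not something taken from a real design: the stage is modeled as an M/D/1 queue (random Poisson arrivals, a fixed service time of a known number of clock cycles), and the function name `md1_stage` and the example numbers are hypothetical. Real input traffic to an FPGA is often far more regular than Poisson, in which case the queueing delay would be smaller than this estimate.

```python
def md1_stage(arrival_rate, service_cycles, clock_hz):
    """Mean-value metrics for one FIFO-fed stage modeled as an M/D/1 queue.

    arrival_rate   -- average items per second entering the FIFO (assumed Poisson)
    service_cycles -- fixed number of clock cycles the stage needs per item
    clock_hz       -- clock frequency of the stage
    """
    if arrival_rate <= 0 or service_cycles <= 0 or clock_hz <= 0:
        raise ValueError("all parameters must be positive")

    service_rate = clock_hz / service_cycles   # items per second the stage can absorb
    rho = arrival_rate / service_rate          # utilization of the stage
    if rho >= 1.0:
        raise ValueError("stage is overloaded (utilization >= 1), FIFO grows without bound")

    # Pollaczek-Khinchine results specialized to deterministic service (M/D/1).
    mean_queue_length = rho ** 2 / (2.0 * (1.0 - rho))    # items waiting in the FIFO
    mean_wait_s = rho / (2.0 * service_rate * (1.0 - rho))  # time spent waiting in the FIFO
    mean_sojourn_s = mean_wait_s + 1.0 / service_rate       # waiting plus processing time

    return {
        "utilization": rho,
        "mean_queue_length": mean_queue_length,
        "mean_wait_s": mean_wait_s,
        "mean_sojourn_s": mean_sojourn_s,
    }


if __name__ == "__main__":
    # Hypothetical numbers: 50 M items/s into a stage that takes 1 cycle at 100 MHz.
    for name, value in md1_stage(50e6, 1, 100e6).items():
        print(f"{name}: {value:.3g}")
```

The mean queue length gives a first idea of how deep the FIFO in front of the stage needs to be on average; it says nothing about worst-case bursts, which would need a separate analysis.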
If you can draw the data flow graph and the control flow graph, you can analyze the performance of the implementation. I recommend the following book, which shows how to associate the data flow and control flow models with implementations made only in hardware or as a coprocessor. The book covers topics ranging from finite-state machines with datapath to system-on-chip architectures.
Schaumont, P. R. (2012). A practical introduction to hardware/software codesign. Springer Science & Business Media.
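As a rough illustration of what analyzing a data flow graph can look like, the sketch below computes two simple figures from an annotated graph: the end-to-end latency along the longest path, and a throughput bound set by the slowest stage. It is only a sketch under stated assumptions: the stage names, cycle counts, initiation intervals and the function `analyze_dataflow` are all hypothetical, every stage is assumed to run on the same clock, and stalls or back-pressure between stages are ignored.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def analyze_dataflow(latency, ii, preds, clock_hz):
    """Estimate latency and throughput of a data flow graph.

    latency  -- dict: node -> cycles from first input to first output of that node
    ii       -- dict: node -> initiation interval (cycles between accepted inputs)
    preds    -- dict: node -> iterable of predecessor nodes
    clock_hz -- common clock frequency (assumption: single clock domain)
    """
    # Make sure every node appears in the graph, even those without predecessors.
    graph = {node: tuple(preds.get(node, ())) for node in latency}

    finish = {}
    for node in TopologicalSorter(graph).static_order():
        # A node can start once its slowest predecessor has produced data.
        start = max((finish[p] for p in graph[node]), default=0)
        finish[node] = start + latency[node]

    total_latency_cycles = max(finish.values())   # longest (critical) path
    bottleneck = max(ii, key=ii.get)               # stage with the largest II
    throughput = clock_hz / ii[bottleneck]         # items per second, upper bound
    return total_latency_cycles, bottleneck, throughput


if __name__ == "__main__":
    # Hypothetical four-stage CNN pipeline; all numbers are made up.
    latency = {"conv1": 120, "pool1": 10, "conv2": 200, "fc": 50}
    ii = {"conv1": 4, "pool1": 1, "conv2": 8, "fc": 2}
    preds = {"pool1": ["conv1"], "conv2": ["pool1"], "fc": ["conv2"]}

    cycles, slowest, rate = analyze_dataflow(latency, ii, preds, clock_hz=100e6)
    print(f"latency: {cycles} cycles, bottleneck: {slowest}, throughput <= {rate:.3g} items/s")
```

The same per-stage numbers (service cycles, initiation interval) are what a queueing model of the design would need as inputs, so drawing the graph first is a reasonable starting point either way.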
On the other hand, a high-level synthesis tool such as Vivado HLS can be used to describe the algorithm in C/C++ and generate the RTL as an IP core. You can use parallelization directives to optimize the implementation. The tool reports latencies and hardware resource usage. It also includes other tools, such as the "Schedule Viewer", to analyze the execution of the algorithm.
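For a feel of why pipelining directives matter for the reported latency, a commonly used back-of-the-envelope estimate for a pipelined loop is iteration latency plus initiation interval times (trip count minus one), compared with trip count times iteration latency when iterations run one after another. The sketch below just evaluates those two expressions; the function names and numbers are hypothetical, and the figures in an actual synthesis report may differ by a few cycles for loop entry and exit.

```python
def sequential_loop_cycles(trip_count, iteration_latency):
    """Cycles when each iteration starts only after the previous one has finished."""
    if trip_count < 1 or iteration_latency < 1:
        raise ValueError("trip_count and iteration_latency must be >= 1")
    return trip_count * iteration_latency


def pipelined_loop_cycles(trip_count, iteration_latency, ii=1):
    """Cycles when a new iteration starts every `ii` cycles (pipelined loop)."""
    if trip_count < 1 or iteration_latency < 1 or ii < 1:
        raise ValueError("trip_count, iteration_latency and ii must be >= 1")
    # The first iteration takes the full latency; each further one adds `ii` cycles.
    return iteration_latency + ii * (trip_count - 1)


if __name__ == "__main__":
    # Hypothetical loop: 1024 iterations, 5 cycles per iteration.
    n, depth = 1024, 5
    print("sequential:", sequential_loop_cycles(n, depth), "cycles")
    print("pipelined, II=1:", pipelined_loop_cycles(n, depth, ii=1), "cycles")
    print("pipelined, II=2:", pipelined_loop_cycles(n, depth, ii=2), "cycles")
```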