I suppose that for an architecture's interface efficiency you can use throughput in (T/G/M/k) bits per second, and, what looks especially useful here, arithmetic efficiency can be measured in FLOPS (floating-point operations per second) [ https://en.wikipedia.org/wiki/FLOPS ]
Basically, the idea of such an analysis is to determine how many bits of data you can send (interface) or how many basic operations you can perform (arithmetic unit) in a single clock cycle, as in the sketch below.
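A minimal sketch of that per-clock-cycle view, assuming a hypothetical 200 MHz design with a 64-bit interface and 4 parallel floating-point units (all example figures, not from any particular design):

```python
def interface_throughput_bps(bits_per_cycle: int, clock_hz: float) -> float:
    """Interface efficiency: bits transferred per second."""
    return bits_per_cycle * clock_hz

def arithmetic_throughput_flops(ops_per_cycle: int, clock_hz: float) -> float:
    """Arithmetic efficiency: floating-point operations per second (FLOPS)."""
    return ops_per_cycle * clock_hz

clock_hz = 200e6          # assumed 200 MHz design clock
bus_width_bits = 64       # assumed 64-bit interface, one word per cycle
fp_ops_per_cycle = 4      # assumed 4 floating-point operations per cycle

print(f"Interface:  {interface_throughput_bps(bus_width_bits, clock_hz) / 1e9:.1f} Gbit/s")
print(f"Arithmetic: {arithmetic_throughput_flops(fp_ops_per_cycle, clock_hz) / 1e9:.1f} GFLOPS")
```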
The most important analytical assessments in FPGA design are: the required hardware area for the design, the power consumption of the implemented model, latency, processing speed, and throughput (number of bits per second). All of the above can be obtained directly from Xilinx software such as ISE Design Suite and Vivado, using the synthesis tools built into them.
You can easily see the required resources in the "Design Summary" section.
You have to look at the Post-PAR (Place and Route) report: this is where ISE explains whether the timing constraints of your design are met, what the critical path of your design is, and up to what clock frequency your design can run. It will also tell you how many slices, FFs, IOBs, etc. have been used.
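As a small illustration of the relationship the timing report is built on, the maximum clock frequency is the reciprocal of the critical-path delay; the 4.2 ns delay below is a made-up example value:

```python
critical_path_ns = 4.2                 # example worst-case path delay after place & route
f_max_mhz = 1e3 / critical_path_ns     # f_max = 1 / delay, converted from ns to MHz

print(f"Critical path: {critical_path_ns} ns  ->  max clock ~ {f_max_mhz:.1f} MHz")
```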
That is how I understood the question, which is why I did not propose engineering reports such as resource utilization, power, or propagation timing.
Theoretical analysis of floating-point architecture performance is based on FLOPS. Of course, it is not carried out in a purely mathematical domain, since performance itself is not a mathematical abstraction.
To perform an analytical comparison against a PC, GPU, or other standardized architecture, you should break your processing path down into elementary floating-point operations (+, -, *, / or floating-point data assignment) and determine how many such operations your architecture can perform in a single clock cycle. Alternatively, you can take a whole algorithm processed by your architecture, say an N-point FFT, and determine: this algorithm needs M floating-point operations and my architecture processes it in x milliseconds, which yields Q = M/x FLOPS of performance, whereas a standard CPU has P FLOPS and/or computes the algorithm in y milliseconds.
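A short sketch of that whole-algorithm approach for an N-point FFT; the flop count of roughly 5*N*log2(N) is the usual estimate for a complex radix-2 FFT, and the execution times x and y below are made-up example numbers:

```python
import math

def fft_flops(n: int) -> float:
    """Approximate floating-point operation count M of an N-point complex FFT."""
    return 5.0 * n * math.log2(n)

def performance_flops(total_ops: float, time_ms: float) -> float:
    """Q = M / x : operation count divided by execution time (converted to seconds)."""
    return total_ops / (time_ms * 1e-3)

n = 1024
m = fft_flops(n)          # M: operations required by the algorithm
x_ms = 0.05               # assumed execution time on the FPGA architecture (ms)
y_ms = 0.02               # assumed execution time on a reference CPU (ms)

q_arch = performance_flops(m, x_ms)
q_cpu = performance_flops(m, y_ms)
print(f"M = {m:.0f} flops")
print(f"Architecture: {q_arch / 1e9:.2f} GFLOPS   CPU: {q_cpu / 1e9:.2f} GFLOPS")
```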