I am trying to provide some answers to a long-disputed question: "Which programming language is fastest when handling large matrices and arrays, Fortran or C++?" Personally I use Fortran, but after reading some very interesting suggestions and experiences from users who have programmed in both Fortran and C++ (without providing any evidence, just discussing, but interesting stuff nonetheless), I thought of trying C++ myself and performing a small experiment in an attempt to find some answers.
I developed a very simple algorithm, given below in C++. The code performs some calculations on a square 10000x10000 matrix named jimmy, repeating the procedure 100 times, while a second part handles an array named jimmy2 (100 million elements) and also repeats its calculations 100 times. I must note that the C++ code for the loops was found on the net, due to my limited skills in C++ (I had to start from somewhere).
Given that I do not have much experience with C++ compilers, I would like someone who does to take this algorithm and first optimize it, if he/she thinks that I programmed any part of it in a non-optimal manner (most probably I did), and then compile it on a 32-bit machine using the latest C++ compiler and post the exe file in this conversation. For clarity, please repost the C++ code if any reprogramming was done.
I compiled the C++ code with the default compiler in VS2013 using Full Optimization (/Ox) in the Release configuration and got the following average times in ms (I run an Intel(R) Core(TM) i5 CPU M520 @ 2.4GHz on a personal laptop):
ms1: 11900 (matrix 10000x10000)
ms2: 11400 (array 100,000,000)
Those interested in participating should know that the corresponding average times, when using Visual Intel Fortran Compiler XE14, were:
ms1: 5300 (matrix 10000x10000)
ms2: 4900 (array 100,000,000)
The algorithm developed in F90 is also given and can be found at the end of this post.
Waiting to hear from you C++ experts, with anticipation.
P.S. If any C or C# or Python or other language users would like to give it a go the same things apply. Post your code and the .exe file so as to derive the corresponding times. No GPU allowed given that this has as a general goal to compare CPU compilers.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
C++ Code
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
#include <iostream>
#include <chrono>
#define WIDTH 10000
#define HEIGHT 10000
int jimmy[HEIGHT][WIDTH];
int jimmy2[HEIGHT * WIDTH];
int n, m, ii, DumDum1, DumDum2;
using namespace std;
int MS2(int Dum)
{
int Dum2;
// Check for array use
auto begin2 = chrono::high_resolution_clock::now();
for (ii = 0; ii < 100; ii++)
for (n = 0; n < HEIGHT; n++)
for (m = 0; m < WIDTH; m++)
{
jimmy2[n*WIDTH + m] = (n + 1)*(m + 1);
}
auto end2 = chrono::high_resolution_clock::now();
auto dur2 = end2 - begin2;
auto ms2 = std::chrono::duration_cast<std::chrono::milliseconds>(dur2).count();
Dum2 = ms2;
return Dum2;
}
int MS1(int Dum)
{
int Dum1;
//Check for matrix use
auto begin = chrono::high_resolution_clock::now();
for (ii = 0; ii < 100; ii++)
for (n = 0; n < HEIGHT; n++)
for (m = 0; m < WIDTH; m++)
{
jimmy[n][m] = (n + 1)*(m + 1);
}
auto end = chrono::high_resolution_clock::now();
auto dur = end - begin;
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(dur).count();
Dum1 = ms;
return Dum1;
}
int main() {
DumDum1 = MS1 (1);
DumDum2 = MS2 (2);
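// The original listing is cut off after "cout"; printing the two timings is an assumption.
cout << "ms1: " << DumDum1 << endl;
cout << "ms2: " << DumDum2 << endl;
return 0;
}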
I did a first set of experiments on my own. I used the Intel compiler (both Fortran and C++) on a Linux machine. I ran the test on a cluster node that I had reserved exclusively (which means almost no interference from other running programs). I did tests with compiler versions 11.1 (08/27/2009), 12.1SP1-7 (10/11/2011), 13.1 (03/13/2013), and 14.0 (07/28/2013) for Fortran and versions 13.1 and 14.0 for C++ (since the older compilers don't support the C++11 features used in the example code). No significant difference in speed could be found between versions 12-14. So, I'll just provide the timings (in ms, given as matrix/array) for version 11.1 and 14.1:
flags | Fortran v11 | Fortran v14 | C++ v14
======================================================
-O0 | 181030/21582 | 179117/30328 | 27521/27664
-O1 | 31964/6676 | 31998/6501 | 6371/6391
-O2 | 6177/6349 | 3624/3914 | 4245/4182
-O3 | 604/3218 | 926/3944 | 4246/4188
-O3 -msse3 | 863/4047 | 920/3922 | 4254/4181
-O3 -xhost | 554/4156 | 555/1567 | 4162/4169
-fast | -/- | 1644/1643 | 4175/4163
Interestingly, for C++ with the '-fast' option I had to include some read access to the variables (just simple output of the first and last element). Otherwise I would get timings of 0/0 for C++ whereas Fortran did not optimize away the loops.
Acting on a little hunch I rewrote the C++ in a less Fortran-like style, making the loop variables ii, n, and m local variables, e.g.
for (int ii = 0; ii < 100; ++ii) ...
I would get timings of 1653/1645 for C++ with the '-fast' option (I didn't test the other flags again). With this flag C++ and Fortran are equally fast. ('-fast' equals '-xHOST -O3 -ipo -no-prec-div -static' according to the command line help.) Making all variables global looks too much like Fortran and no C++ programmer would have this idea. Making the array/matrix local did not work because they are too large to be allocated on the stack (using new/delete for the array worked, though => same timing results).
[EDIT]
I suspect that most people upvoted this post because of the numbers I provided. Unfortunately, this does not give the full picture of the discussion. On March 21, 2014 I posted new numbers with the modification described above. Here is the table from that post:
flags | Fortran v14 | C++ v14
======================================================
-O0 | 179117/30328 | 27233/27560
-O1 | 31998/6501 | 6460/6443
-O2 | 3624/3914 | 3920/3939
-O3 | 926/3944 | 3921/3939
-O3 -msse3 | 920/3922 | 3920/3938
-O3 -xhost | 555/1567 | 1640/1639
-fast | 1644/1643 | 1644/1640
Once the other post gets promoted to "popular answers" I would remove this edit. (I hope nobody is offended by my edit.)
[/EDIT]
I did a few quick additional tests (using C++ with '-fast'). Using std::vector instead of plain arrays I got 7241/1591. Using QVector (I have seen it to be faster than std::vector many times with g++ compilers) I got 17795/12749. Using EigenLib with MatrixXd and VectorXd I got 39660/9811. These are results I cannot fully understand yet. This left me with the idea, though, that instead of using integers we should do a simple DAXPY test on double precision floating point (DAXPY is scalar times vector plus vector, y = a*x + y). This should be closer to scientific computations than integer arrays. I'm gonna get back to this another time.
Dear George
Just two side remarks regarding all your codes.
1) In the inner loop (the m-loop) of both your FORTRAN and C++ codes, the variable n never changes. Thus, operations such as n*WIDTH or (n-1)*iWIDTH should be done outside this m-loop and the result stored in a new variable. Otherwise, you will be wasting CPU time by computing exactly the same thing m times.
2) Before comparing FORTRAN and C++, you should be aware that the former is column-major while the latter is row-major. Thus, when handling a matrix, the order of the n-loop and the m-loop should not be the same in the two languages. Otherwise, the comparison will not be fair.
Hope this helps.
Dear Lehtihet, thank you for the response and the constructive comments!
1) This was done intentionally, to increase the computational cost without caring about the actual outcome. What you are suggesting is 100% correct, but given that we do not care about the result it makes no difference. The CPU time rises to a few seconds, so it is measurable and comparable.
2) Excellent observation. I interchanged n and m in both cases, so the new average CPU times for the F90 code are:
ms1: 4800 (matrix 10000x10000)
ms2: 4600 (array 100,000,000)
Regards,
George
P.S. I updated the question's F90 code so as to make sure that everything complies.
Note that the Cray compiler and the PathScale compiler would optimize out the loop:
do i = 1, iRepeat
since nothing changes after the first iteration.
Also, a good optimizing compiler would hoist the n+1 out of the inner calculation, so that it would look like:
n1 = n + 1
do m = 1, iWidth
iaA (m, n) = n1 * (m + 1)
enddo
I don't know if any compiler will change this to the equivalent
n1 = n + 1
ivalue = n1 * 2 ! value for m = 1, i.e. n1 * (1 + 1)
do m = 1, iWidth
iaA (m, n) = ivalue
ivalue = ivalue + n1 ! Register-to-register arithmetic
enddo
but some might be that sophisticated.
Dear George,
In your new FORTRAN code, you should perhaps interchange the names iWidth and iHeight to be consistent with the C++ code and to avoid any confusion.
Thank you James for your input.
I am just wondering if the C++ code can be optimized further and if there is a compiler that can give better timings than the 11.9 and 11.4 seconds?
Regarding Fortran, I used a programming strategy of someone that just implements Fortran for the very first time. Even though the results do not require any further explanations or justifications, if C++ cannot do better than that, definitely there is no question about the capabilities of the two languages when it comes to handling matrices and arrays.
Nevertheless, I do not assume that the issue is closed given that we should give some time to those that have a better knowhow on C++ to let us know if there is something that can be done in terms of programming or compiling.
P.S. I tried the counter technique for the array case and, surprisingly, it was 20% slower: it adds an extra calculation and so does not improve performance in this case. Finally, I tried to program the two codes so that both perform the same number of calculations (actually the Fortran code performs more).
iaA2 ((m-1)*iWidth + n) = (n + 1) * (m + 1)
jimmy2[n*WIDTH + m] = (n + 1)*(m + 1)
The calculations here being very simple, you are in fact comparing the access time to the structures. I have a question regarding your measurement of time. How reliable is it?
The codes are there. Compile them and run them. You can get the Fortran compiler XE14 for free from the Intel web site.
Reliable measurements? I do not get exactly what is the problem in getting reliable measurements but usually what I do is to deactivate all running software and then run the exe files 5 to 10 times each and check the time.
By the way, to be on the safe side you have to run the exe files at different times, so as to avoid any interference from the Windows OS and from procedures running in the background, which might affect the performance of the exe under study.
Well, I was just asking that because of multi-tasking. If the system is not fully dedicated to the task, the running time will not be correct so repeated checking (as you have already done) is necessary. Please see my previous comment regarding the names of your constants in your FORTRAN code. The variable m currently runs through the 'height' of your matrix while in fact it should be running through its "width".
Regards
Already done that. Read the previous postings.
Thank you for your input.
Regards.
Vasili I'll give it a try tomorrow. I am done for tonight . I believe though it will be easier for you to compile and run them both. I can send you my f90 exe if you like.
Have a good night.
Vasili, good morning. I downloaded the PI software and executed it for 1M. I attached the jpeg file with the results below. I hope this is what you wanted.
Fortran is typically a bit faster in basic computations but tends to lose when more advanced data structures are useful. See the link for a large collection of benchmarks. (Fortran performs well but is certainly not always at the top.)
http://benchmarksgame.alioth.debian.org/
Also, as a side note, be very careful with simple calculations such as these. The compiler might optimize out some calculations if the results are not used.
Thank you Toni for your input.
To be honest with you, it amazes me that for such a simple problem and such simple calculations Fortran is more than 2 times faster. Initially I thought that there was something wrong with my C++ code or that there were faster compilers I was not aware of, but as it turns out this is not the case.
Regarding advanced data structures, I have extensive personal experience of using them in Fortran, such as very complicated and long Types, double precision Arrays within integer Arrays etc., and never had speed issues with Fortran. I agree with what you are saying about comparing performance, and it is definitely not a black or white case. Nevertheless, when you are dealing with large scale FEM problems, handling this type of matrices/arrays and simple calculations is what you basically have to deal with when programming.
Regards
P.S. Two times faster is not a bit ... ;-)
Thank you Vassili for the effort. No it does not make any difference especially given that m = n.
Regards
Thank you Alejandro for the interesting post.
I will disappoint you a little bit here, but personally I have never used libraries for programming matrix multiplications or matrix and array operations in general. This way you can optimize each and every part of your code. Check my paper https://www.researchgate.net/publication/258354289_Numerical_Investigation_of_a_3D_Detailed_Limit-State_Simulation_of_a_Full-Scale_RC_Bridge
and the corresponding bridge model that I solve using a standard CPU system (check the times given there).
As soon as you start exchanging data with libraries, performance issues and delays related to the data exchange begin. Even in the case where you go parallel, memory handling is very significant. The question here is which language performs faster (if one manages to program his/her code in such a way that it develops memory hierarchy problems, that is another issue). Do not forget that programming techniques and strategies significantly affect the performance of a code, so we would then be comparing programming skills and not compilers' performance.
As for your paper, if you are kind enough, upload a copy here on the ResearchGate platform so that I can download it, or send it to me at [email protected].
Regards
No objections to your statement. If I understand correctly, what you are saying is that C++ is not structured by default to perform fast matrix operations and thus requires additional "handling", which translates into additional code and thus additional complexity.
I did not know about this, so thank you for the additional information.
Again, the goal here (from the FEM application point of view, i.e. mainly matrix operations) is to "run the 100m" without adding any programmer-induced obstacles, which, as it turns out, C++ has built in.
Someone updated the title of my question/discussion. Thank you for making the title more precise...
@Vahid. Hello my dear. The "I think" will not do. I've read many "I thinks" on this platform. Just post the code and I'll give it a try.
On second thought, if you can compile it and post the exe file it will be more convenient. Otherwise I will have to install the C compiler, which is something that I would rather not do.
Regards
Each language has its own specific advantages. As it turns out from all the comments posted so far, a code that performs matrix operations runs faster when implemented in FORTRAN than in C, and no special library is required to achieve this performance; this points to the simplicity and efficiency of FORTRAN in matrix operations. Another indication of this is the fact that full interoperability with C is supported in the FORTRAN 2003 standard, which gives the programmer the ability to take full advantage of the benefits of both FORTRAN and C/C++.
Very interesting, thanks for sharing this. You might also find this thread of some interest: http://scicomp.stackexchange.com/questions/1194/how-much-better-are-fortran-compilers-really
Thank you Robert for your comments. By the way Fortran Compiler 11 and 12 have a 30% difference in performance in favor of 12. In the above I used XE14.
Regards
I agree with Alejandro. C++ is a "general purpose" language and is not specialised in mathematics, as is Fortran. So in order to get fast matrix operations in C++, you'd need a dedicated, highly optimised library to do this. While I'm sure there are such libraries out there somewhere, there will be a certain level of overhead when calling their functions.
BTW: You can see a similar effect when looking at fixed-point operations, where COBOL is very strong and C++ is not.
So there is no "best language for everything" - it really depends what you want to do. Fortran remains an excellent choice for "number crunching".
FWIW, and this is not a formal benchmark by any means, recently I re-wrote a moderately complex piece of simulation code (Bose-Einstein condensates, ~1000 lines of code) in C++, and achieved an immediate roughly 1.5-2-fold increase in performance on the same hardware and operating system. The code structure and organization remained unchanged. In both cases, the GCC 4.5.2 compiler suite was used, running on Linux.
@Robert. Yes indeed. After the XE series release the performance increased significantly. They managed to take advantage of the multithreading technology of the core's structure.
@Stefan. Totally agree with you. I always say the best programming language is the one you know.
Regarding the nature of the discussion, the overall goal here is to address whether C++ is a better language when it comes to Finite Element Analysis implementations. So again I agree with what you said.
Regards
@Toth. Thank you for the comments and the update on your work. Can you play with the provided code and repost an updated version of it that incorporates the features you just described as increasing performance? That was my initial request but it seems that nobody was willing to do so.
Regards
Here is my work done.
System: Windows 7, 32 bit, Intel Core i5 CPU with 4 GB RAM memory
Compiler: GNU gfortran
I do not work in C++. If somebody does, then he/she can run the code with the GNU C++ compiler, for comparison.
Results:
=================
(secs)
Run = 0.293500E+03
user = 0.293313E+03
sys=0.187201E+00
==================
i.e. about 5 minutes for the whole program.
Code:
===========================
Program LoopTest
Implicit None
Integer, Parameter :: iWidth = 10000, iHeight = 10000, iRepeat = 100
Integer :: i, m, n, iaA (iWidth,iHeight), iaA2 (iWidth * iHeight)
Integer :: Count, iTime0, iTime1, iTimeT
real, dimension(2) :: elapsed
real :: result
OPEN(6,file='out.txt')
do i = 1, iRepeat
do n = 1, iHeight
do m = 1, iWidth
iaA (n,m) = (n + 1) * (m + 1)
enddo
enddo
enddo
do i = 1, iRepeat
do m = 1, iHeight
do n = 1, iWidth
iaA2 ((m-1)*iWidth + n) = (n + 1) * (m + 1)
enddo
enddo
enddo
call ETIME(elapsed, result)
WRITE(6,95) result,elapsed(1),elapsed(2)
95 FORMAT(1X,"(secs)Run = ",E12.6," ,user = ",E12.6,"sys=",E12.6)
CLOSE(6)
End Program LoopTest
=========================
I have not been convinced to learn any other formal language so far...
No problem with advertising your paper, but do not repost the whole thing. By the way, posting the full paper will help people, including myself, read it and learn about what you did.
Appreciate your honesty.
Regards
A proper C++ wrapper for BLAS is a neat thing. I wonder how this compares to Fortran in George's example.
George: I did not do anything fancy to make the C++ code faster. Indeed, if I run your code examples through GCC 4.5.2, the C++ version is again noticeably faster. But different compilers can produce vastly different results, so it is quite possible that another FORTRAN compiler on the same hardware and operating system produces code that is significantly more efficient than the GCC C++ code.
That said, generally it is my experience that C++ (especially if it is more C than C++) is the best language (other than machine language) for high-performance code but a lot depends on the compiler and on the programmer's skill and experience. And sometimes it is just more efficient to, say, use a high-performance FORTRAN compiler than to find the expertise to produce hand-optimized C/C++ code, even if the latter might offer somewhat improved performance.
I looked up the configuration of the cluster node I used. It has two octacore Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz.
@Vahid
As long as you try to write readable code C is not necessarily faster than C++. I have seen comparisons for this (e.g. using plain arrays vs std::vector). In any case you can write code that is as fast as what the compiler produces, but it will be very ugly and unreadable. The trick is that the more abstract your description of the problem is the more the compiler can optimize. Otherwise there is only the optimization by the human programmer and this is usually not very good. Let the compiler do most of the work since it is a lot smarter than you (especially for huge amounts of code).
When looking at C/C++ performance, you need to keep in mind that C (and to a certain degree C++) lacks some features in the core language which are useful in many scenarios, e.g., hashes (OK, C++ has these in the standard library, but that's not the same thing, especially from an optimiser viewpoint) or proper garbage collection.
So while it's easy to write highly performant small code chunks, writing expressive, high-level code is tricky.
C++ is a good step forward, but one permanently struggles with C legacy elements (preprocessor, pointers, implementation-dependent variable sizes, etc.) and the horrible template syntax. In addition, neither C nor C++ has proper NaN handling in the core language, leading to all sorts of side effects.
D is an interesting approach to keep a C-style language, while addressing C++'s main weaknesses. Have a look at it at http://dlang.org/index.html .
Hi everyone. I would like to thank each and every one of you for contributing to this discussion. I will not address all the new comments, given that most of you say again that C++ is a very fast general-purpose programming language (I also agree with this statement) but that when it comes to handling large matrices and arrays it has some performance issues due to its structure (of this I am now 100% convinced by the evidence provided and your constructive comments).
On the other hand, Fortran is not a general-purpose programming language (I totally agree), but when it comes to large computations it performs optimally, increasing computational performance significantly. I was expecting someone to build the code in C, but as I read in the above comments, once more I got some "I thinks" and what might happen in the future, without any actual numerical evidence (this goes for you, Vahid). If you post the exe file, then also post your comments and be sure about them. Follow the examples of Vassilis, Demitris and Simon. They provided numbers. Nothing personal, but when it comes to comparing numbers we have to use numbers. I am still going to wait for your executable.
Regarding the excellent work done by Simon Schröder, whom I would like to thank for his contribution, I would like to add that you seem to have confirmed the significant increase in performance between versions 11 and 12. The corresponding increase between 12 and 14 is there, but it is definitely not the milestone set by 12.
@Simon. Finally, regarding your posting, in which you used a lot of technical description related to C++ (I could not follow all of it, given that with Fortran things are simpler when it comes to matrices and arrays), it follows, first of all, that with 0/0 timings (given that you measure your time in milliseconds) your compilation resulted in a code that could not be executed, for various "bad configuration" build reasons. In simple words, your code did not even run. Let us take for instance the array case. We have 100 million elements and for each one we perform 2 additions and 1 multiplication. Therefore 300 million calcs multiplied by 100 repeat loops result in 30 billion calcs, or 30 Giga-calcs. What you are saying is that it took you 0 time to perform 30 billion calcs? I don't think this is the case. What is the case, you will wonder? What actually happened is that when you built the exe file in the release configuration, the build procedure generated an executable that incorporates a bug, sending the flow of your calculations to the very end of your code. To verify that this is what happened to your exe file (your code exits without performing a single calculation), add a write command at the end of each loop group to display the value of the last element of your array. It should give you the value 100020001 (provided the code executes correctly). You will get one of two possible outcomes: 1. The bug will be alleviated but your time will not be equal to zero, or 2. You will get a 0 time but the value of your array's last element will also be 0 (or equal to the initialization value when allocating the array).
To conclude, the overall experience gained from reading Simon's postings is that C++ requires additional handling (confirming previous posts) when the language is applied to scientific problems that deal with large matrices and arrays. And regarding the simplicity and functionality of F90: if I wanted to tell you what I did to get the 4-5 seconds of performance, I would say the following:
A. Wrote the algorithm (as can be seen above).
B. Compiled and checked for errors (simple problem thus a few runs did the job).
C. Built and ran the release configuration.
No rewrites or restructuring or any other programming tricks to increase the performance. I did not even try to do that. This means an inexperienced programmer will be able to take advantage of the efficiency provided by the compiler and not wonder why his/her code is slow.
Just a final comment on the performance of the two compilers. For this problem, according to my results, the difference is significant, but it does not reflect reality, given that the results cannot be considered a general rule (we dealt only with integers and not with double precision variables, if statements, calls to subroutines and functions, and other calculations). Furthermore, if you consider Fortran's simplicity and efficiency in matrix and array handling, the ability to develop OOP apps and the robustness of the resulting exe file, then the answer to the question "Which is the best tool to use when developing FEM solvers?", as derived from the above, is Fortran and not C++. When the question is "What language should I use to develop a FEM solver?" then the answer should be "The one you know best". This is what I did when I was a Ph.D. student and it worked fantastically. If you are about to develop a complicated solver you need to do it optimally. If you do not have deep knowledge of your programming language, then there is almost no chance of developing a fast application, even if you use the fastest Fortran or C++ compiler.
As for what might happen in the long run or even in the near future, few people know, and of course there is always a chance that the C++ compiler teams will eventually develop a compiler that outperforms Fortran or any other compiler out there. In any case, when dealing with FEM solvers many parameters affect the overall performance, especially the programming technique used, which derives from each programmer's skills.
My Regards
You really believe that you can solve a 30 billion calc problem serially in 0 time? Check how many calcs/sec a core can perform and then get back to me. I got a 0/0 time in one of my initial attempts with Fortran, looked into it, and this is what a deeper investigation into the reason revealed.
The comparison of FORTRAN and C++ given by Simon above as a barplot.
I think that FORTRAN is faster.
For image processing, where large-scale matrix operations are the focus, C, C++ and Fortran have all been tried. Apart from efficient programming, multithreading, etc., the use of dedicated libraries would give better performance.
In particular, you may try NumPy (Python-based), using simple exes and configuring options on the command line. SciPy is equally good, and you may use C++ interface calls for the same. Multidimensional matrix operations are handled most efficiently in these.
Dear Sarraju, when it comes to image processing the best way to go is the GPU. The current problem with GPUs is that they are hardware-dependent. No one disputes the GPU's superiority, but not everyone can afford an expensive graphics card, and you cannot expect all laptops and desktops to have an NVIDIA graphics card for the executable to run. This is the reason why I included the remark "no GPU" in the initial statement of this discussion.
George, the 0/0 timing is a correct result. The reason for this is that the observable behavior (other than the time spent for the computation) is the same. As long as results are not used anymore the compiler is allowed to optimize these away. And this is what happened with the C++ compiler.
This is why I included some output as you suggested. Then, the compiler is not allowed to optimize away the loops anymore and you get the timings from the table.
In order to get better timings for C++ (1653/1645 vs. Fortran's 1644/1643) I rewrote the program in a way that any C++ programmer would write it in the first place (in C++ global variables are considered bad style). For completeness here is the code I used:
#include <chrono>
#include <iostream>
#define WIDTH 10000
#define HEIGHT 10000
int jimmy2[HEIGHT*WIDTH];
int jimmy[HEIGHT][WIDTH];
int DumDum1, DumDum2;
using namespace std;
int MS2(int Dum)
{
    int Dum2;
    // Check for array use
    auto begin2 = chrono::high_resolution_clock::now();
    for (int ii = 0; ii < 100; ii++)
        for (int n = 0; n < HEIGHT; n++)
            for (int m = 0; m < WIDTH; m++)
            {
                jimmy2[n*WIDTH + m] = (n + 1)*(m + 1);
            }
    auto end2 = chrono::high_resolution_clock::now();
    auto dur2 = end2 - begin2;
    // Elapsed time in milliseconds
    auto ms2 = std::chrono::duration_cast<std::chrono::milliseconds>(dur2).count();
    Dum2 = ms2;
std::cout
It's amazing that a compiler optimizes code by removing parts of it that it deems extraneous. Who's writing the program, anyway? Me or the computer?
Alejandro, I know that ++i is better than i++. Usually I use this. And in this case it doesn't make a difference (I tested it). The reason why there is i++ in the code is because I just copied the code from this question and only changed the most significant parts to get good results.
@Demetris
Nice work with the diagram. But, since I am the advocate for C++ I cannot leave this uncorrected ;)
The timing results I provided in the table were taken from the first set of runs with the loop variables being global. Here's a new table of different runs where the loop variables are local:
flags | Fortran v14 | C++ v14
======================================================
-O0 | 179117/30328 | 27233/27560
-O1 | 31998/6501 | 6460/6443
-O2 | 3624/3914 | 3920/3939
-O3 | 926/3944 | 3921/3939
-O3 -msse3 | 920/3922 | 3920/3938
-O3 -xhost | 555/1567 | 1640/1639
-fast | 1644/1643 | 1644/1640
In most cases there is no significant difference in the numbers. I am still lacking an explanation why '-fast' makes things slower for Fortran compared to '-O3'. But, everyone who knows about -O3 will also know about -fast. I always expected -fast to be faster since it includes -O3 -xhost. So, a lot of people will fall into this trap. Looking at the numbers for -fast I still will say that Fortran and C++ are equally fast. For a final judgment I am still waiting for numbers for DAXPY. I don't have the time right now to do it myself, but maybe later.
@Simon, thank you for your excellent job!
The new barplot is here and I think that the winner is ... FORTRAN!
(First Problem and -O3 -xhost < -O3 -msse3 < ... all)
Five point summaries:
=================
           f14_1    f14_2   cpp14_1   cpp14_2
Min.         555     1567      1640      1639
1st Qu.      923     2778      2782      2789
Median      1644     3922      3920      3939
Mean       31255     7403      6963      7014
3rd Qu.    17811     5222      5190      5191
Max.      179117    30328     27233     27560
@Demetris
I don't really understand your statistics. I think that it does not make sense to combine different optimization settings. And I could come up with even more settings until I get the results I like. If you pick the minimum I pick the maximum. If you pick the median I pick the mean. One time Fortran wins, one time C++ wins. But, including -O0 or -O1 in the discussion does not prove anybody's point. They are just an interesting fact how the optimization setting influences performance for a single language. The default setting is -O2 anyway.
@all
Since I was curious about double precision performance and more complex operations I created a test program. Actually, I looked up DAXPY and it is scalar times vector plus vector. Anyway, I implemented scalar times matrix times vector plus vector, storing the result in the last vector again (=> no need for any temporaries). Here are the timings for the Intel 14 compilers:
flags | Fortran | C++
====================================
-O0 | 566624 | 369896
-O1 | 125047 | 91942
-O2 | 100375 | 31035
-O3 | 9380 | 31167
-O3 -msse3 | 9388 | 31033
-O3 -xhost | 9390 | 7744
-fast | 8108 | 7714
I'd like to mention that for these statistics (including earlier ones) I did only a single test run.
So, now I have the problem that -fast is faster than -O3 -xhost for Fortran when dealing with reals, but slower when dealing with integers. In scientific applications I would expect more real than integer arrays. Hence, I would compile with -fast. In that case C++ would have a slight edge.
I am going to post the source code used for these timings in separate posts. Note, that I have chosen much smaller arrays. I challenge the experts to improve the Fortran code and become faster than C++. Only using BLAS does not count because I can also call the BLAS routines from C++.
program daxpy
  implicit none
  integer, parameter :: width = 1000, height = 1000, nRuns = 100
  real*8 :: A(height, width), x(width), y(height), alpha
  integer :: i, j, k, l
  integer :: start, end, duration
  ! initialize variables
  do j = 1,width
    do i = 1,height
      A(i,j) = i*j
    end do
  end do
  do i = 1,height
    x(i) = i
    y(i) = i
  end do
  alpha = 2.0
  call system_clock(start)
  do k = 1,nRuns
    do j = 1,width
      do i = 1,height
        do l = 1, width
          y(i) = alpha*A(i,l)*x(l) + y(i)
        end do
      end do
    end do
  end do
  call system_clock(end)
  duration = floatj(end-start)/10
  write(*,*) duration, "ms"
  stop
end program
#include <chrono>
#include <iostream>
const unsigned int width=1000;
const unsigned int height=1000;
const unsigned int nRuns = 100;
int main()
{
    double A[height][width], x[height], y[height], alpha;
    // initialize variables
    for(unsigned int i = 0; i < height; ++i)
        for(unsigned int j = 0; j < width; ++j)
            A[i][j] = i*j;
    for(unsigned int i = 0; i < height; ++i)
        x[i] = y[i] = i;
    alpha = 2.0;
    auto start = std::chrono::high_resolution_clock::now();
    for(unsigned int count = 0; count < nRuns; ++count)
        for(unsigned int i = 0; i < height; ++i)
            for(unsigned int j = 0; j < width; ++j)
                for(unsigned int k = 0; k < width; ++k)
                    y[i] += alpha*A[i][k]*x[k];
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = end - start;
std::cout
I have some questions about the examples that are not answered in the dialog so far. I apologize if these are naive... I don't have some of the benchmark experience in this area that you guys have.
Observation#1
Isn't Fortran column major order, unlike C++? If so, that affects when pages are loaded (and potentially replaced/reloaded) in a very different way, since each row (C++) or column (Fortran) is occupied by several pages. If so, that would affect timing. And if so, a related question I have is whether the different time gathering methods used in your programs are identical with respect to inclusion, or not, of system overhead due to paging. I realize that if there is enough memory to hold the entire arrays then it may make no difference, since the same number of pages will have been loaded after the first of the 100 passes.
Observation #2
I wondered if there might be side effects in the timings by the different optimizers in the handling of the innermost evaluation of the expression in the assignment statement and the amount of loop unrolling. Perhaps a simpler assignment can lessen such side effects (eg. assign a constant or call to random function).
Dear Simon, the statistics were just something extra. The crucial point is always the same: which language, which platform, and which option gives the smallest runtime. I think that you agree that, based on your choices, working with FORTRAN can give you the smallest time:
f14 cpp14
-O2 7538 7859
-O3 4870 7860
-O3 -msse3 4842 7858
-O3 -xhost 2122 3279
-fast 3287 3284
See also my last barplot.
Cheers everybody!
@Simon I tried compiling and running your code in C++ but the exe does not work. Maybe a Windows error or something else. When trying to debug I got a break message and the figure attached is what I get.
I used O2 and Ox but got the same result: it crashes.
Fortran gave me 31secs (31000ms).
@Simon. Still waiting for your input related to the 0/0. More interesting than the rest.
@Simon, I am sure that you smile when you see what I did here and got 26secs (26000ms).
;-)
program daxpy
  implicit none
  integer, parameter :: width = 1000, height = 1000, nRuns = 100
  real*8 :: A(height, width), x(width), y(height), alpha
  integer :: i, j, k, l
  integer :: start, endt, duration
  call system_clock (start)
  ! initialize variables
  do j = 1,width
    do i = 1,height
      A(i,j) = 2*i*j
    end do
  end do
  do i = 1, height
    x(i) = i
    y(i) = i
  end do
  call system_clock (endt)
  duration = floatj(endt-start)/10
  write(*,*) duration, "initialize process (ms)"
  call system_clock (start)
  do k = 1,nRuns
    do j = 1,width
      do i = 1,height
        do l = 1, width
          y(i) = A(i,l) * x(l) + y(i)
        end do
      end do
    end do
  end do
  call system_clock (endt)
  duration = floatj(endt-start)/10
  write(*,*) duration, "ms"
  write(*,*) y(width)
  pause
end program
And this is for an extra smile (190 ms):
program daxpy
  implicit none
  integer, parameter :: width = 1000, height = 1000, nRuns = 100
  real*8 :: A(height, width), x(width), y(height), alpha
  integer :: i, j, k, l
  integer :: start, endt, duration
  call system_clock (start)
  ! initialize variables
  do j = 1,width
    do i = 1,height
      A(i,j) = 2*i*j
    end do
  end do
  do i = 1, height
    x(i) = i
    y(i) = i
  end do
  call system_clock (endt)
  duration = floatj(endt-start)/10
  write(*,*) duration, "initialize process (ms)"
  call system_clock (start)
  do j = 1,width
    do i = 1,height
      do l = 1, width
        y(i) = (A(i,l) * x(l) + y(i))
      end do
    end do
  end do
  y = nRuns * y
  call system_clock (endt)
  duration = floatj(endt-start)/10
  write(*,*) duration, "ms"
  write(*,*) y(width)
  pause
end program
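For comparison, here is a minimal C++ sketch of the same kind of shortcut applied to the four-loop kernel from the challenge. It rests on the observation that the row sum over l does not change between passes, so it can be computed once and added nRuns*width times in one step. The function names brute_force and collapsed, the flat row-major std::vector storage and the square size n are assumptions made for this illustration and do not come from the programs above. Note that the initial y enters the result only once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: the four nested loops of the benchmark.
// A is an n x n matrix stored row-major as A[i*n + l]; y is updated in place.
void brute_force(const std::vector<double>& A, const std::vector<double>& x,
                 std::vector<double>& y, double alpha, std::size_t n,
                 std::size_t nRuns)
{
    assert(A.size() == n * n && x.size() == n && y.size() == n);
    for (std::size_t k = 0; k < nRuns; ++k)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t l = 0; l < n; ++l)
                    y[i] += alpha * A[i * n + l] * x[l];
}

// Collapsed version: s = sum over l of A(i,l)*x(l) is the same in every pass,
// so it is computed once per row and added nRuns*n times in a single step.
void collapsed(const std::vector<double>& A, const std::vector<double>& x,
               std::vector<double>& y, double alpha, std::size_t n,
               std::size_t nRuns)
{
    assert(A.size() == n * n && x.size() == n && y.size() == n);
    const double passes = static_cast<double>(nRuns * n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (std::size_t l = 0; l < n; ++l)
            s += A[i * n + l] * x[l];
        y[i] += passes * alpha * s;
    }
}
```

With small integer-valued inputs both versions give identical results. With general floating-point data the two may differ in the last digits, because the additions are grouped differently.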
@James. BLAS takes care of that issue (rearrangement of elements).
By the way C++ is row major and the code is written accordingly.
Try the following and let us know about the results if you like...
program test
  real :: a(1024,1024), b(1024,1024)
  integer :: i,j,r
  do r=1,100
    do j=1,1024
      do i=1,1024
        ! a(j,i) = b(j,i) ! version of row major
        a(i,j) = b(i,j) ! version of column major
      end do
    end do
  end do
end program test
Try the same in C++ and if there is a significant difference post the results here.
Regards
If I may speak: from a comparison I conducted a long time ago, I can say that Fortran (even the earliest versions of Fortran) is the fastest language for large computations and matrices, because that is what it was essentially made for. The computation cores of major commercial FE programs, e.g. ANSYS, ABAQUS, COMSOL, etc., are written in FORTRAN. C languages or Java are simply used for interfaces and visual effects, generally because of their superior capabilities in data manipulation.
I think there is not much difference between them; you need to use clever programming tricks, like avoiding inner loops, avoiding jumps from one line to another, and so on, to make the algorithm faster. Just use the language you are good at and go ahead :P
@James
I have written both the Fortran and C++ code with storage order in mind. If you take a close look you will see that the loops over i and j are interchanged between the two languages. The problem is not with paging in this case. For this problem all the pages will be in memory. The problem is with the CPU's caches. When a variable is read from memory, a few neighboring addresses are also transferred to the cache. These neighboring addresses are then readily available when accessed. Accessing in the wrong storage order means that you will always have a cache miss and will have to load another cache line. By the time you could reuse an earlier cache entry, it will already have been evicted, since the cache is not very large.
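The storage-order effect described here can be tried with a small C++ sketch like the one below. It sums a row-major matrix twice, once with the inner loop over adjacent elements and once with a stride of one full row. The names sum_row_order, sum_column_order and time_ms are hypothetical helpers written for this illustration; no timings are claimed, since they depend on the machine, the compiler and the matrix size.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Sum of a rows x cols matrix stored row-major in one flat vector.
// The inner loop walks j, so consecutive iterations touch adjacent addresses.
double sum_row_order(const std::vector<double>& a, std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            s += a[i * cols + j];
    return s;
}

// Same sum, but the inner loop walks i: each step jumps cols elements ahead,
// which for large matrices lands on a different cache line every time.
double sum_column_order(const std::vector<double>& a, std::size_t rows, std::size_t cols)
{
    assert(a.size() == rows * cols);
    double s = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            s += a[i * cols + j];
    return s;
}

// Wall-clock time of one call to f in milliseconds. The value returned by f
// is stored in result so that the traversal has an observable effect.
template <typename F>
double time_ms(F f, double& result)
{
    const auto start = std::chrono::steady_clock::now();
    result = f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `time_ms([&] { return sum_column_order(a, n, n); }, s)` with n in the thousands, compared against the same call with sum_row_order, should make the difference visible. In Fortran the roles of the two loop orders are swapped because arrays are stored column-major.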
Concerning your second objection, I have chosen a more complicated example because this best describes scientific applications. Simple benchmarks will give you nice results, but they will never occur in reality. In that case the benchmark results don't tell you anything about which language to use.
@Demetris
Where did you get your numbers? What kind of setup did you use?
@George
Nice responses to my challenge!!! But I guess that you know I can apply the same optimizations to C++;) Anyways, in a pure scientific setup these are probably not applicable optimizations. But, it is good to know that we can take this whole discussion with some humor.
Concerning the 0/0 timing, from a computer scientist's point of view removing the loops is a valid compiler optimization. Whenever results of a computation are not used, i.e. they are thrown away anyway, it is okay to throw away the computation and not just the results. As soon as I added the output of some variables the results and thus also the computation could not be thrown away anymore. Higher level programming languages tell the compiler only which results you want to have and a guide (algorithm) how to get there. But, they are not explicit instructions to the computer anymore. So, you cannot say "do this amount of computation", but only "use the following algorithmic structure to give me the results". I think that it is a good thing that the compiler optimizes away all unnecessary source code in order to speed things up.
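As a minimal sketch of this point in C++: the function below times the same kind of fill loop as the array test, but returns a checksum of the array so that the stored values are actually used. The names TimedResult and timed_fill are assumptions for this illustration. Even so, the compiler is only obliged to produce the final array contents; it may still simplify the repeated passes, so this is a sketch of the idea, not a guaranteed benchmark harness.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct TimedResult {
    long long checksum;  // depends on every array element
    double ms;           // wall-clock time of the fill loops
};

// Fill a height x width array `repeats` times, then consume the result.
// Returning the checksum makes the stored values observable, so the fill
// cannot be removed as dead code.
TimedResult timed_fill(std::size_t height, std::size_t width, int repeats)
{
    assert(repeats > 0);
    std::vector<int> data(height * width);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t n = 0; n < height; ++n)
            for (std::size_t m = 0; m < width; ++m)
                data[n * width + m] = static_cast<int>((n + 1) * (m + 1));
    const auto stop = std::chrono::steady_clock::now();

    long long checksum = 0;
    for (int v : data)
        checksum += v;

    return {checksum, std::chrono::duration<double, std::milli>(stop - start).count()};
}
```

Printing both fields of the returned struct, for example with std::cout, plays the same role as the extra output that was added to the test programs.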
As for your problem with compiling the C++ code I cannot exactly see the error from your screenshots. You could try, though, to move the matrix A and the vectors x and y to global scope like before. Previous experiments have shown that this does not impact performance. It might be that for your compiler and platform the stack is not large enough to hold the entire data. In that case the application will just crash without a warning.
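Another way around a too-small stack, sketched here under the assumption that heap allocation is acceptable for the test, is to hold the data in std::vector instead of local arrays. A 1000 x 1000 matrix of doubles needs about 8 MB, which is more than the 1 MB default stack that Windows linkers commonly use. The function name run_kernel and the flat row-major indexing are choices made for this illustration; the loops follow the C++ test program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same kernel as the C++ test program, with all data on the heap.
// A is stored row-major in one flat vector: A(i,k) is A[i*width + k].
std::vector<double> run_kernel(std::size_t height, std::size_t width,
                               std::size_t nRuns, double alpha)
{
    assert(height > 0 && width > 0);
    std::vector<double> A(height * width), x(width), y(height);

    // initialize variables
    for (std::size_t i = 0; i < height; ++i)
        for (std::size_t j = 0; j < width; ++j)
            A[i * width + j] = static_cast<double>(i) * static_cast<double>(j);
    for (std::size_t k = 0; k < width; ++k)
        x[k] = static_cast<double>(k);
    for (std::size_t i = 0; i < height; ++i)
        y[i] = static_cast<double>(i);

    for (std::size_t count = 0; count < nRuns; ++count)
        for (std::size_t i = 0; i < height; ++i)
            for (std::size_t j = 0; j < width; ++j)
                for (std::size_t k = 0; k < width; ++k)
                    y[i] += alpha * A[i * width + k] * x[k];

    return y;  // returning y keeps the computation observable
}
```

Calling `run_kernel(1000, 1000, 100, 2.0)` between two clock readings and printing one element of the result reproduces the setup of the test without depending on the stack size.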
Overall I agree with you that it is best to use the language you know best. In both languages you can make performance mistakes. And so, in the end for you the language will be faster where you make fewer mistakes, i.e. the language you know best. The language timings are too close and too dependent on the optimization settings and platform to pick a clear winner.
@George
I have reread one of your posts where you mentioned that if I had the same knowledge of Fortran that I have of C++, I could find better ways to manage the data structures. First of all, the problem with my code is that it is many years old and I have only been working on it for one year. Our code consists of a couple of thousand files, some with several thousand lines of code. It is impossible to change data structures in there.
The problem I have with acquiring the same know-how in Fortran that I have in C++ is that there are too few good resources. It seems that nobody runs into the same problems that we have. Most people seem to be using Fortran for some comparatively tiny simulation codes. We have a really big beast here. During the last year we turned on warnings and interface checking for our software. It took us a couple of months to get the code to compile with these settings. But, it helped us find some serious bugs. In order to get interface checking to work correctly (and efficiently) we needed a couple of really ugly hacks. With C++ interface checking is always turned on and no ugly hacks are needed. In my opinion interface checking is a good thing because it helps you to catch some bugs at compile time instead of at runtime.
Looking at the learning resources for modern Fortran and C++, I would always advise learning C++ over Fortran. Once you get to the really hard programming problems you will find more material for C++ than for Fortran. And you can transfer your programming skills to other domains since C++ is a general purpose programming language.
If you know already one of these languages, stick to it!
@Simon. I should have rephrased it then as: "If the code is written in the old formatting, then you should rewrite it using the new Fortran standards and alleviate the issues". From our discussion I understood that you have invested a lot in C++, so your knowledge of this language is more advanced than your knowledge of Fortran. Anyhow, as I see it, we both agree on this issue: if you take any code and write it in a proper manner, then either C++ or Fortran is capable of handling memory issues, variable types and other OOP issues. It is true that many apps are written in old Fortran standards, and this may be one of the main reasons people conclude that Fortran is an old programming language, which as we know is not true.
Once more thank you for the constructive discussion through which I personally managed to answer many of the questions that I had.
Regards
PS People with no humor are people that will never be able to enjoy life. Seeing everything in a competitive manner is something that I do not endorse. That is the reason when I saw your comments related to C++ I got more interested in learning about what you had to say and did not try to fight you. Learning more is the objective here. Cheers.
@Mahmud
It is true that Fortran has been developed with matrices in mind. That by itself does not make it the fastest language for this purpose. Because it was targeted at this it was easy to write optimizing compilers for this kind of operations. But, as some of the results have shown, compilers for other languages have at least caught up. Some of the same optimization techniques are now also applied to other languages, letting them take advantage of years of experience. The reason why many commercial FE applications are written in Fortran is because it once was the fastest language for this purpose. And nobody wants to throw away years of hard development and experience. This post is about re-evaluating the hypothesis that Fortran is the fastest language for matrix operations still today. My conclusion to this is that it mostly depends on the compiler (and which optimization techniques it has built in) and not so much on the language.
@Simon, they are the numbers presented by you and plotted by me at my 2nd barplot, except that I added the two columns of run times, those for problem 1 and 2, so f14=f14_1+f14_2 and cpp14=cpp14_1+cpp14_2.
@George, I am working mainly with nonparametric methods for solving numerical analysis and statistical problems. An example is the identification of an inflection point: you can either adopt a model and do a linear (if it can be linearized) or nonlinear curve fitting, or you can focus on the geometrical properties of such a point and work with the Extremum Surface Estimator (ESE) or the Extremum Distance Estimator (EDE) to find it. This work has been implemented in the R package inflection:
http://cran.r-project.org/web/packages/inflection/index.html
(This work is an example.)
I have been working with FORTRAN since 1985 and I am using it for all 'hard computations'. R is an excellent suite for working, but it is too slow for big data computations and you have to combine it with a fast language like FORTRAN. Besides those works I am working also on applied regression techniques, not curve fitting, but a different work.
You can see samples of my work here:
https://www.researchgate.net/profile/Demetris_Christopoulos/contributions/?ev=prf_act
[The climate work is not my main field, it was rather a big data exercise]
Anyway, you can send me also a pm for more details.
I hope everything is OK with you.
PS Here in Greece we live inside a sea of 'jokes'...
@Demetris. Your last comment unfortunately is true and this is the reason that I myself had to leave. Hopefully things will change.
@Demetris
Thank you for your explanation which data you used. In your post with the last plot you wrote: "I think that you agree that, based on your choices, working with FORTRAN can give you the smallest time." But , I don't agree. Based on the timings I provided the answer is "It depends." The timings for my simple double precision test speak clearly for C++. So, if I have source code that has lots of double precision arrays but no integer arrays I would have to prefer C++ based on the results (although I don't like to make the case for C++ based on these slight differences in timings). The problem I see with your analysis is that you have to choose which compiler setting to take. I would always go for the -O3 or -fast option. In this case, if we sum up my numbers for the integer calculations and the double precision calculations we get the following numbers:
flags | Fortran | C++
====================================
-O2 | 107913 | 38894
-O3 | 14250 | 39027
-O3 -msse3 | 14230 | 38891
-O3 -xhost | 11512 | 11023
-fast | 11395 | 10998
Please, don't add up the numbers again, because you have to choose one of the options. If you go for -O3 Fortran wins. If you include -xhost or use -fast C++ wins. If you have a dumb user who uses the default setting -O2 C++ wins. But, this is all bullshit. Nobody actually wins: it depends on which of the three things is more prominent in your application's code: integer arrays, integer matrices, or double precision matrices and arrays. The problem is that you don't have the option to test C++ vs. Fortran for your full application. It might even depend on the input data which of the two is faster. I hope that nobody will ever say again "Fortran is generally faster than C++" or "C++ is generally faster than Fortran" because "It depends".
At least let's agree that I don't agree with you ;)
I think it depends on the compiler, and on whether you instruct it to keep the main data and operations in fast RAM (which also depends on the size of the fast RAM).
I also recommend using the right version (32 or 64 bits),
and it also depends on the ALU width in bits.
I once wrote a benchmark for computers that ran separate tests for integer, short, long, double... values and operations.
@Javier. What you are saying is correct, and this is what Simon also noted in this discussion. In the end, though, the algorithm plays one of the most important roles. If the algorithm is written in an optimised manner, then it can outperform others even if it is written in a "slower" language.
George, I am a computer scientist with a PhD in computer graphics, more specifically in scientific visualization. Currently, I am working at a research institute for applied mathematics. The software that I help develop is an FPM (Finite Point-Set Method) code. The FPM is a meshless solver for fluid and continuum mechanics which can be thought of as a generalized finite difference scheme. This is where I got into contact with Fortran about a year ago. My task, though, is not so much the physical or mathematical modeling, but helping to optimize the code. First, I wrote a new algorithm (in C++) for a really slow part of the code. Currently, I am optimizing the application for parallelization in general and hybrid parallelism (OpenMP and MPI combined).
I have to admit that the first thing I learned was that it is hard to write slow code in Fortran. Rewriting a small part of the Fortran code in C++ I learned an important lesson: Although the Fortran code seemed completely inefficient it was not. But, I overengineered the C++ solution at first (with things you cannot do in Fortran) so that at first it was slower. It needed some tweaking and a good understanding of the language to finally outperform the Fortran code with the help of a better algorithm.
@Simon. Very interesting. Solid proof that in most cases it is not the car but the driver that makes the difference. ;-)
@Simon Schröder: Can you please give us some examples of those things you could not do in FORTRAN? Also, which standard of FORTRAN did you use?
@Simon, my apologies for my late observation on row/column major ordering. I'm new to this site and overlooked the need to click a link to see the entire discussion, so I missed Lehtihet's early observation and the changes it caused. I was still looking at the code in the original post. Also, thanks for the bit about the side effect of misses/hits at the cache level versus the virtual memory layer.
For those that are interested these are the comments related to this issue from the Intel Forum:
Comment 1 (by Jim Dempsey):
If your application is manipulating large matrices using relatively standard functions, then the best route is to find a well written library such as MKL or BLAS. And in which case it will not matter as to what language you use. Assuming you can organize the data the same.
However, many simulation programs, which fall into a class of Finite Element Solvers, have requirements that fall outside the realm of functionality best served by a library such as MKL or BLAS. IOW you have a different set of requirements.
For both languages (Fortran and C++), the performance will come from four areas in order of importance:
1) Your ability to perform data layout that is favorable to vectorization
2) The compiler ability to optimize the code to maximally incorporate vectorization
3) Language features to permit you to express the problem efficiently
4) Language features to permit effective parallelization
Both languages can produce the same efficiency as well as inefficiencies depending on the abilities of the programmer. The programming emphasis for the C++ programmer is to make collections of objects. This tends to organize data as Array of Structures (AOS). AOS format is not favorable to vectorization. Whereas Fortran, at least for the older programmers, tends to organize data dimensions (all X's, all Y's, ...) which is Structure of Arrays (SOA). SOA format is favorable to vectorization.
I'd vote for Fortran for computational side, but you may wish to use something else for the presentation side. It is relatively easy to write a mixed language application.
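To make the AOS/SOA distinction from Comment 1 concrete, here is a minimal C++ sketch of the two layouts for a set of 3D points. The type and function names (NodeAoS, NodesSoA, shift_x_aos, shift_x_soa) are hypothetical and chosen only for this illustration. Whether a given compiler vectorizes either loop depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: one object per node, coordinates interleaved in memory
// as x0 y0 z0 x1 y1 z1 ...
struct NodeAoS {
    double x, y, z;
};

// Structure of Arrays: one contiguous array per coordinate,
// x0 x1 x2 ... then y0 y1 y2 ... then z0 z1 z2 ...
struct NodesSoA {
    std::vector<double> x, y, z;
};

// Shift all x coordinates. In the AoS layout the x values are 24 bytes apart,
// so the loop reads and writes memory with a stride.
void shift_x_aos(std::vector<NodeAoS>& nodes, double dx)
{
    for (NodeAoS& n : nodes)
        n.x += dx;
}

// Same operation on the SoA layout: the x values are adjacent in memory,
// which is the access pattern that vector instructions handle best.
void shift_x_soa(NodesSoA& nodes, double dx)
{
    assert(nodes.x.size() == nodes.y.size() && nodes.y.size() == nodes.z.size());
    for (double& v : nodes.x)
        v += dx;
}
```

The SoA variant corresponds to the "all X's, all Y's" organization that the comment attributes to typical Fortran codes.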
Comment 2 (by Tim Prince):
For work on large matrices, you would likely use a library (e.g. Intel MKL), so that performance doesn't depend on which programming language you are calling from. Then, it's largely a matter of convenience, with Fortran offering more relevant built-in language features, while calling a library from C++ with Fortran-compatible interface is not pretty. For example, the most widely used Fortran compilers offer options to implement the MATMUL intrinsic as a call to the optimized library.
Array assignment notation is a likely convenience feature built into Fortran. Intel has added the similar Cilk(tm) Plus extended array notation feature to their compiler, and an effort is continuing to implement it in gcc, but that, and libraries such as Boost, aren't strictly part of C++.
The most popular C++ compilers (except Microsoft) have added the __restrict qualifier extension (same functionality as C99 restrict) to facilitate optimization, and are adding the OpenMP 4 parallel simd pragmas. C++ compilers disagree on the syntax required to optimize in certain situations, such as std::max() which is excellent in Intel C++ but has to be changed for optimization in g++ or MSVC++. It's usually possible to achieve full performance in either Fortran or C++ (by testing C vs. C++ alternatives for the compiler of choice), but there's more effort involved in C++.
Microsoft C++ is the only C++ compiler considered by many Windows developers; it has a much more limited selection of auto-vectorizations, supports an old version of OpenMP standard, and supports very little of C99. So you will need to choose another C++ (Intel or latest g++) if you want one which competes with Fortran.
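As an illustration of the __restrict extension mentioned in Comment 2, here is a minimal C++ sketch of a DAXPY-style loop (scalar times vector plus vector). It assumes a compiler that accepts the non-standard `__restrict` spelling; the function name daxpy and its signature are chosen for this example and are not the BLAS interface. The qualifier is a promise by the programmer that x and y do not overlap, which gives C++ the kind of no-aliasing guarantee that Fortran has for dummy arguments.

```cpp
#include <cassert>
#include <cstddef>

// y[i] = alpha * x[i] + y[i] for i in [0, n).
// __restrict is a compiler extension, not standard C++. It tells the
// optimizer that x and y never refer to overlapping memory; calling the
// function with overlapping arrays is an error on the caller's side.
void daxpy(std::size_t n, double alpha,
           const double* __restrict x, double* __restrict y)
{
    assert(n == 0 || x != y);
    for (std::size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

Without the qualifier the compiler has to assume that writing y[i] might change some x[j], which can force runtime overlap checks or prevent vectorization of the loop.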
@Theodoros: I am talking about things that weren't necessary when just writing the entire code in Fortran. So that you don't misunderstand me: I did not choose C++ because it couldn't have been done in Fortran. I chose C++ because it was easier for me to write in that language. The overengineering part came from the fact that I tried to write a general class that either could be used stand-alone or directly access the data provided by Fortran. This involved a solution with several pointers and references and a boolean flag to distinguish the two cases. Unfortunately, making this distinction prevented many optimizations, so I finally dumped this solution. The part that I was talking about, which couldn't be done in Fortran, is the low-level programming stuff with pointers and references that I used to adapt my C++ structures to Fortran. I guess you agree that Fortran is not made for low-level operations. It is a lot easier to teach C++ Fortran's calling conventions than the other way around (if that is even possible at all). So, the real problem could be solved in Fortran alone, but in interfacing with C++ I introduced too many unnecessary problems that at first slowed things down. Since there is no real low-level programming in Fortran, these are mistakes you can never make in Fortran, but only in C++.
@George
Thank you for starting this discussion. It was a good thing to dig deeper and actually test my hypothesis that there is no noticeable speed difference between Fortran and C++ in general. Thank you very much for your last post as well. There is only one thing from the forums that I cannot agree with: It is not entirely easy to interface Fortran with C++ when you are concerned with performance. I had to use array descriptors (you get those when you are using pointers to arrays or assumed-shape arrays in Fortran) to achieve good performance. This is a part that is not portable between different compilers and also not well documented. In my case I am using the Intel Fortran compiler and actually the documentation is wrong. Just using array descriptors to previously allocated arrays in C++ was easy to implement, but it got more complicated once I tried to re-allocate arrays within C++. So, general mixed language programming is easy, but beware of array descriptors.
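For the simple case where no descriptor is needed, the C++ side of a mixed-language call can be sketched as below. The assumption is that the Fortran caller declares an interface with bind(C) from the Fortran 2003 C interoperability features and passes a contiguous explicit-shape array, so that only the base address and the extents cross the language boundary. The function name scale_matrix is hypothetical. This does not cover assumed-shape arrays, pointer arrays or re-allocation from C++, which are the cases where the descriptors come in.

```cpp
#include <cassert>
#include <cstddef>

// Scale a Fortran matrix in place. The data arrives as a plain pointer plus
// the two extents; the layout is column-major, so element (i, j) with
// zero-based indices lives at a[i + j*rows].
extern "C" void scale_matrix(double* a, int rows, int cols, double factor)
{
    assert(a != nullptr && rows >= 0 && cols >= 0);
    for (int j = 0; j < cols; ++j)        // outer loop over columns
        for (int i = 0; i < rows; ++i)    // inner loop over adjacent elements
            a[static_cast<std::size_t>(j) * rows + i] *= factor;
}
```

On the Fortran side the matching interface would declare the array as real(c_double) with dimensions (rows, cols) and pass rows, cols and factor with the value attribute.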
Hi everybody,
C++ is regarded as an efficient programming language for dealing with big data sets.
It is also more recent than Fortran.
However, for big data issues, many firms (and many people from the statistical and/or econometrics area) favor rather SAS Software and SAS IML for programming.
The computational time is quite impressive...
Regards.
@Simon. In general Jim's worry was that the programmer can end up, without knowing it, with a loss of performance because both languages expect to be in control of the multi-threading capabilities. As I told him, there is a solution that was also implemented by Femap with NXNastran, where the solver and the graphical interface are two independent applications that communicate through input and output files. One can therefore use C++ for the graphical part and Fortran for the solver, avoiding all the problematic issues.
Perhaps it would be better to use Intel C++ with the parallel optimizer, but the algorithm would have to be modified to use parallel threads.
I would recommend to use Task Manager to see how many CPU cores are working
http://software.intel.com/en-us/c-compilers
@Hayette. Thank you for the additional information. If you are familiar with SAS then it would be very interesting if you could spend 10minutes and program the test algorithm with the 2x30billion calculation given above. Then we can verify if SAS is faster or slower than the two languages (Fortran/C++).
@Javier. If we go parallel, it becomes unclear what we are measuring in terms of language capabilities (or rather compiler/language capabilities). Programming multi-threaded algorithms is not that simple (for our simple code it would not be complicated, but it was decided to keep things simple), and the scalability of the code would not give a clear picture of how the language handles the calculations and the matrices/arrays. This is why I said in the initial statement that we would use a serial solution approach (no GPUs and no multi-threading).
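To make the "serial only" rule concrete, here is a minimal sketch of how a single-threaded timing could be taken in C++. The function name time_serial_sweeps and the arithmetic inside the loop are placeholders I made up for illustration; this is not the test algorithm from the first post, only the shape of a serial measurement.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `passes` serial sweeps over a flat array and
// returns the elapsed wall-clock time in milliseconds.
// The update formula is a placeholder, not the benchmark kernel of this thread.
double time_serial_sweeps(std::vector<double>& data, int passes) {
    assert(passes >= 0);
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        // One plain loop on one thread: no OpenMP, no std::thread, no GPU.
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = data[i] * 0.5 + 1.0;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Because the array is modified in place, the work has a visible side effect. If the results are never used afterwards, an optimizing compiler may still remove part of the loop, so it is safer to print or sum a few elements after timing.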
@Simon Schröder: The FORTRAN 2003 and 2008 standards fully support object oriented programming (as well as pointers and interoperability with C). I think it would have been better if you had tried to create your objects in FORTRAN and then to bind these objects/variables with C, in a mixed language project. This would have given you the ability to write your solver(s) in FORTRAN and then present your results/data using a visualizer/library written e.g. in C++ (since we all know that C++ has much better graphic capabilities than FORTRAN). Therefore, I will insist that it is a better choice to always do your number crunching with FORTRAN, because it will always work faster (FORTRAN supports parallel processing too).
@Theodoros: The choice of the Fortran standard is not up to me. I am glad that the source code has already been converted from FORTRAN 77 to Fortran 90. There is also some object oriented programming included, but it is completely impossible to rewrite the code in modern Fortran (in that case we would rewrite it in C++ anyway, as this option has been discussed internally a couple of times). As for interfacing Fortran with C++, the only problem is handling array descriptors. As far as I know these are not standardized (and there is also no library to be used by other programming languages). In my case there is no visualizer or anything else attached to the solver. It is just a plain solver to be run on an HPC cluster, originally written in Fortran. I implemented advanced algorithms for geometry handling and chose to use C++ for this task. My preferred language is just C++. And as this discussion has shown so far, your claim that FORTRAN "will always work faster" has no solid ground.
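For readers who have not run into this: the descriptor problem only shows up with assumed-shape or pointer arrays. A sketch of the descriptor-free alternative, assuming the Fortran side can expose or accept explicit-shape arrays through a C-compatible interface (which is not the case in the solver described above), is to exchange a raw pointer plus the extents and index the buffer in column-major order. The names col_major_index and column_sum are hypothetical, and column_sum is written in C++ here only as a stand-in for a routine with such a signature.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (i, j), zero-based, in a flat buffer stored in
// Fortran (column-major) order with `rows` rows.
inline std::size_t col_major_index(std::size_t i, std::size_t j, std::size_t rows) {
    return i + j * rows;
}

// Hypothetical stand-in for a routine with a C-compatible signature:
// pointer + extents, no array descriptor. Returns the sum of column j.
extern "C" double column_sum(const double* a, int rows, int cols, int j) {
    assert(a != nullptr && rows >= 0 && j >= 0 && j < cols);
    double sum = 0.0;
    for (int i = 0; i < rows; ++i) {
        sum += a[col_major_index(static_cast<std::size_t>(i),
                                 static_cast<std::size_t>(j),
                                 static_cast<std::size_t>(rows))];
    }
    return sum;
}
```

With this layout a std::vector<double> allocated in C++ can be handed over as v.data() together with its dimensions. The price is that the array cannot be re-allocated on the other side of the call, which is exactly the case where descriptors become necessary.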
@Simon. As an overall conclusion, though, Fortran will outperform C++ in most cases, so it is the better choice when it comes to handling large matrices and arrays. If we also take into consideration the time you save by not having to find out why an algorithm is not performing optimally and then fix it, Fortran definitely gets the thumbs up.
@Simon Schröder: I am not expressing an opinion based on personal assessments only. I recommend you also have a look at the comments from the Intel Forum by Jim Dempsey and Tim Prince that have been posted earlier by George. Additionally, please note that full interoperability with C and FORTRAN is supported from the F2003 standard and onwards; it was not supported in F90/95, which is the standard you used. I also found this article on handling of array descriptors in FORTRAN 2003 (including interoperability with C/C++):
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-mac/GUID-5474C7B4-78E5-4A7E-ACBD-E8A8501605A0.htm
I hope you find it somewhat useful for you. Cheers.
Sorry for the comment, but working with the R project and using FORTRAN DLLs to do the 'hard computations' is probably the best solution: fast computations and great statistical and visual interpretation of the results.
It is better to use FORTRAN. I have been using both Fortran and C since 2003 (Windows). From my experience, I feel Fortran is better than C/C++.