I would recommend that you develop the algorithm first, then worry about what language to implement it in. Once you select a language, you will unknowingly limit how the algorithm can be developed as you won't consider doing operations/actions that are *not* supported in the language.
Having said that - some of the other things to consider are: 1) the environment that the algorithm will execute in; 2) external performance requirements - each will be a consideration for any language selection.
Are you using a huge SMP type machine (lots of cores)? Are you looking to do pieces (or all) of the algorithm in parallel by partitioning the data? Does it have to solve the problem in a certain amount of time? Compute intensive? I/O intensive?
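One of the questions above, partitioning the data so that pieces of the algorithm run in parallel on a many-core machine, can be sketched in Python. This is a minimal illustration, not anything from the thread: the chunk size, worker count, and the per-chunk function are all illustrative assumptions.

```python
# Sketch: partitioning a large numeric data set across cores with
# Python's multiprocessing.Pool. Each worker gets one partition and
# the partial results are combined at the end.
from multiprocessing import Pool

def partial_sum(chunk):
    # Compute-intensive work on one partition of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=2, chunk_size=1000):
    # Split the data into fixed-size partitions.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_workers) as pool:
        # Map each partition to a worker, then reduce the partial sums.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    data = list(range(10_000))
    print(parallel_sum_of_squares(data))
```

Whether this pays off depends on the cost of the per-chunk work relative to the overhead of moving data between processes, which is exactly the kind of architecture question raised above.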
Hi Hayford. It seems to me that there are different perspectives to take into account to address your question.
When implementing algorithms to handle large data sets, one consideration is the resources a particular language offers in terms of pre-built components (or building blocks) that would facilitate your work.
You may also consider the resources available for representing and processing data structures, as well as support for multithreaded and distributed programming.
I believe that R, Python and Java would be on your list of choices, possibly among others, and not necessarily in that order.
As a personal choice, I would look at Java very carefully, for several reasons. The new release of JDK 8 brings powerful data structures and processing capabilities to the Java Collections API that, together with generics, give you powerful resources for data processing.
Still in the Java world, there are newer languages that run on top of the JVM, such as Scala, which brings powerful resources for multithreaded and functional programming. Java also allows smooth integration with the WEKA and Hadoop frameworks (the latter including HDFS, Cassandra, HBase and Mahout), which can be a big advantage.
R, as a language for statistical programming, also offers powerful capabilities and certainly must be considered. The same can be said of Python.
I would like to follow Sérgio's initial point on pre-built components that facilitate your work. That perspective measures efficiency in terms of development effort, not necessarily execution speed (as he mentioned, it depends on your perspective).
If you are talking instead about execution speed, you have to take several factors into account when selecting a programming language:
a) Interpreter execution speed: how fast the bytecode (or equivalent intermediate representation) is processed by the interpreter.
b) Library dependencies: generic procedures and functions gain flexibility, but they lose execution speed.
c) The programming language's goal: while most programming languages are presented as general-purpose, each has its own goal and optimization focus. For example, the initial goal of Java was portability, Perl focused on string processing, R on statistics, and so on. Each language shows its strengths in its area of focus.
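Point (b) above can be made concrete in Python. This is a small illustrative sketch, not from the thread: a generic reduction written at interpreter speed versus the specialized built-in `sum()`. Both compute the same value; the generic version pays per-element call overhead. Timings vary by machine, so only the results are compared below.

```python
# Sketch: a generic, pure-Python reduction versus the specialized
# built-in sum(). The generic version works for any associative
# operation but runs a Python-level function call per element.
import timeit
from functools import reduce

data = list(range(100_000))

def generic_reduce(seq):
    # Generic: any combining operation could be passed here.
    return reduce(lambda acc, x: acc + x, seq, 0)

# Same result either way.
assert generic_reduce(data) == sum(data)

t_generic = timeit.timeit(lambda: generic_reduce(data), number=10)
t_builtin = timeit.timeit(lambda: sum(data), number=10)
print(f"generic: {t_generic:.3f}s  built-in: {t_builtin:.3f}s")
```

On a typical interpreter the built-in is noticeably faster, which is the trade-off point (b) describes.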
Some of my work focuses on image processing and speech processing, both of which are processing-intensive and require interfaces with input devices. I generally code these in C (along with some CUDA routines), which gives me a low-level interface, and I have managed to build a set of common processing routines optimized for the kinds of processing I need, to handle "big data" in the fastest time possible. This sometimes implies breaking modularity for the sake of execution time and memory usage (which should be another area of concern). Alternatively, you can use an aspect-oriented language for this last point, depending on the particulars of your problem.
What architecture will your implementation run on? The design of the algorithm will need to conform to the architecture in mind. I'd suspect this is a multiprocessor environment if you are handling large data sets ("big data"). If so, the choice of programming language will follow from what your algorithm needs to do on that architecture.
Dear Shrikant, you are correct. But the algorithms must meet the structural requirements of the Map-Reduce/BSP parallel programming paradigms if you want to implement them there.
For example, if you implemented an algorithm in plain Java, you cannot simply reimplement the same algorithm in the above parallel paradigms; you need to restructure the algorithm to fit their model.
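To illustrate the restructuring described above, here is a hedged sketch of word counting recast into the map/shuffle/reduce structure, in plain Python with no framework. A sequential loop over a shared counter does not transfer directly to MapReduce; the algorithm must instead emit (key, value) pairs from independent mappers.

```python
# Sketch: the same algorithm expressed in MapReduce structure.
# No framework is used; these three functions mirror what a
# framework like Hadoop runs in its map, shuffle, and reduce phases.
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair per word; no shared state allowed.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values emitted for one key.
    return key, sum(values)

docs = ["big data big compute", "big data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'compute': 1}
```

The key structural change is that the mappers are stateless and the grouping is left to the framework, which is exactly what lets the paradigm distribute the work.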
Also, it depends on the type of large data set. It is not the same if the data are purely numeric as when they are text. For large numeric data sets, I recommend HDF5 with Python or C++.
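The HDF5-with-Python suggestion can be sketched with the `h5py` package (an assumption here; the thread names only HDF5 and Python, and the file and dataset names below are illustrative). Chunked, compressed datasets let you read slices of a large numeric array without loading the whole thing into memory.

```python
# Sketch: storing and slicing a large numeric data set in HDF5 via
# h5py (requires the h5py and numpy packages).
import numpy as np
import h5py

with h5py.File("measurements.h5", "w") as f:
    dset = f.create_dataset(
        "samples",
        shape=(1_000_000,),
        dtype="f8",
        chunks=(10_000,),      # stored and read in 10k-element chunks
        compression="gzip",
    )
    dset[:] = np.arange(1_000_000, dtype="f8")

with h5py.File("measurements.h5", "r") as f:
    # Only the chunks covering this slice are read from disk.
    window = f["samples"][500_000:500_010]
    print(window)
```

For text data the trade-offs differ, which is the point made above: the storage format, like the language, should follow the type of data.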
Develop the algorithm first. The choice of programming language depends on the type of large data set; for large numeric data sets, HDF5 with Python or C++ is a good choice.