I would recommend that you develop the algorithm first, then worry about what language to implement it in. Once you select a language, you will unknowingly limit how the algorithm can be developed as you won't consider doing operations/actions that are *not* supported in the language.
Having said that - some of the other things to consider are: 1) the environment that the algorithm will execute in; 2) external performance requirements - each will be a consideration for any language selection.
Are you using a huge SMP type machine (lots of cores)? Are you looking to do pieces (or all) of the algorithm in parallel by partitioning the data? Does it have to solve the problem in a certain amount of time? Compute intensive? I/O intensive?
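One of the questions above, partitioning the data so that pieces of the algorithm run in parallel on a many-core machine, can be sketched in Python. This is a minimal illustration, not anything from the thread: the chunk size, worker count, and the per-chunk function are all illustrative assumptions.

```python
# Sketch: partitioning a large numeric data set across cores with
# Python's multiprocessing.Pool. Each worker gets one partition and
# the partial results are combined at the end.
from multiprocessing import Pool

def partial_sum(chunk):
    # Compute-intensive work on one partition of the data.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=2, chunk_size=1000):
    # Split the data into fixed-size partitions.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_workers) as pool:
        # Map each partition to a worker, then reduce the partial sums.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    data = list(range(10_000))
    print(parallel_sum_of_squares(data))
```

Whether this pays off depends on the cost of the per-chunk work relative to the overhead of moving data between processes, which is exactly the kind of architecture question raised above.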
Hi Hayford. It seems to me that there are different perspectives to take into account to address your question.
When implementing algorithms to handle large data sets, one consideration is the resources a particular language offers in terms of pre-built components (or building blocks) that would facilitate your work.
You may also consider the resources available for representing and processing data structures, as well as support for multithreaded and distributed programming.
I believe that R, Python and Java would be on your list of choices, possibly among others, and not necessarily in that order.
As a personal choice, I would look at Java very carefully, for several reasons. The new release of JDK 8 brings powerful data structures and processing capabilities to the Java Collections API that, together with generics, give you powerful resources for data processing.
Still in the Java world, there are newer languages that run on top of the JVM, such as Scala, which brings powerful resources for multithreaded and functional programming. Java also allows smooth integration with the WEKA and Hadoop frameworks (the latter including HDFS, Cassandra, HBase and Mahout), which can be a big advantage.
R, as a language for statistical programming, also offers powerful capabilities and certainly must be considered. The same can be said of Python.
I would like to follow Sérgio's initial point on pre-built components that facilitate your work. That perspective measures efficiency in terms of development effort, not necessarily execution speed (as he mentioned, it depends on your perspective).
If you are talking instead about execution speed, you have to take several factors into account when selecting a programming language:
a) Interpreter execution speed: how fast the bytecode (or equivalent intermediate representation) is processed by the interpreter.
b) Library dependencies: generic procedures and functions gain flexibility, but they lose execution speed.
c) The programming language's goal: while most programming languages are presented as general-purpose, each has its own goal and optimization focus. For example, the initial goal of Java was portability, Perl focused on string processing, R on statistics, and so on. Each language shows its strengths in its area of focus.
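Point (b) above can be made concrete in Python. This is a small illustrative sketch, not from the thread: a generic reduction written at interpreter speed versus the specialized built-in `sum()`. Both compute the same value; the generic version pays per-element call overhead. Timings vary by machine, so only the results are compared below.

```python
# Sketch: a generic, pure-Python reduction versus the specialized
# built-in sum(). The generic version works for any associative
# operation but runs a Python-level function call per element.
import timeit
from functools import reduce

data = list(range(100_000))

def generic_reduce(seq):
    # Generic: any combining operation could be passed here.
    return reduce(lambda acc, x: acc + x, seq, 0)

# Same result either way.
assert generic_reduce(data) == sum(data)

t_generic = timeit.timeit(lambda: generic_reduce(data), number=10)
t_builtin = timeit.timeit(lambda: sum(data), number=10)
print(f"generic: {t_generic:.3f}s  built-in: {t_builtin:.3f}s")
```

On a typical interpreter the built-in is noticeably faster, which is the trade-off point (b) describes.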
Some of my work focuses on image processing and speech processing, both of which are processing-intensive and require interfaces with input devices. I generally code these in C (along with some CUDA routines), which gives me a low-level interface, and I have managed to build a set of common processing routines optimized for the kinds of processing I need, to handle "big data" in the fastest time possible. This sometimes implies breaking modularity for the sake of execution time and memory usage (which should be another area of concern). Alternatively, you can use an aspect-oriented language for this last point, depending on the particulars of your problem.
What architecture will your implementation run on? The design of the algorithm will need to conform to the architecture in mind. I'd suspect this is a multiprocessor environment if you are handling large data sets ("big data"). If so, the choice of programming language will follow from what your algorithm needs to do on that architecture.
Dear Shrikant, you are correct. But the algorithms must meet the structural requirements of the Map-Reduce/BSP parallel programming paradigms if you want to implement them there.
For example, if you implemented an algorithm in plain Java, you cannot simply reimplement the same algorithm in the above parallel paradigms; you need to restructure the algorithm to fit their model.
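To illustrate the restructuring described above, here is a hedged sketch of word counting recast into the map/shuffle/reduce structure, in plain Python with no framework. A sequential loop over a shared counter does not transfer directly to MapReduce; the algorithm must instead emit (key, value) pairs from independent mappers.

```python
# Sketch: the same algorithm expressed in MapReduce structure.
# No framework is used; these three functions mirror what a
# framework like Hadoop runs in its map, shuffle, and reduce phases.
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair per word; no shared state allowed.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values emitted for one key.
    return key, sum(values)

docs = ["big data big compute", "big data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'compute': 1}
```

The key structural change is that the mappers are stateless and the grouping is left to the framework, which is exactly what lets the paradigm distribute the work.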
Also, it depends on the type of large data set. It is not the same if the data are purely numeric as when they are text. For large numeric data sets, I recommend HDF5 with Python or C++.
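The HDF5-with-Python suggestion can be sketched with the `h5py` package (an assumption here; the thread names only HDF5 and Python, and the file and dataset names below are illustrative). Chunked, compressed datasets let you read slices of a large numeric array without loading the whole thing into memory.

```python
# Sketch: storing and slicing a large numeric data set in HDF5 via
# h5py (requires the h5py and numpy packages).
import numpy as np
import h5py

with h5py.File("measurements.h5", "w") as f:
    dset = f.create_dataset(
        "samples",
        shape=(1_000_000,),
        dtype="f8",
        chunks=(10_000,),      # stored and read in 10k-element chunks
        compression="gzip",
    )
    dset[:] = np.arange(1_000_000, dtype="f8")

with h5py.File("measurements.h5", "r") as f:
    # Only the chunks covering this slice are read from disk.
    window = f["samples"][500_000:500_010]
    print(window)
```

For text data the trade-offs differ, which is the point made above: the storage format, like the language, should follow the type of data.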
Develop the algorithm first. The choice of programming language depends on the type of large data set; for large numeric data sets, HDF5 with Python or C++ is a good choice.