Which frameworks are available for Big Data processing? Is any framework available that supports workflow processing? What additional configuration of the distributed environment is required to get real power out of Big Data processing?
There are a lot of Hadoop-based frameworks you can use: Hive and Pig for data preprocessing and aggregation, and Mahout, which offers a range of large-scale machine learning algorithms implemented on the underlying MapReduce paradigm. If you are interested in graph algorithms, have a look at Giraph.
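As a quick illustration of the aggregation side, here is a minimal sketch of querying Hive from Java over JDBC. It assumes a running HiveServer2 instance; the host, port, and the web_logs table are hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveAggregation {
        public static void main(String[] args) throws Exception {
            // Load the HiveServer2 JDBC driver and open a connection.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "", "");
            Statement stmt = conn.createStatement();
            // Hive compiles this SQL-like query into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT status, COUNT(*) FROM web_logs GROUP BY status");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }

Hive turns the query into one or more MapReduce jobs behind the scenes, so the same statement scales from a small test table to terabytes.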
If you are familiar with Java or Python, you can extend these frameworks' functionality.
Of course, you can also implement your own algorithms on Hadoop.
The default is Hadoop (plus the many languages on top of Hadoop, such as Hive, Pig, Cascading, etc.). Spark is widely believed to be the Hadoop replacement, and Apache Drill is not far behind.
If you need results fast (Hadoop jobs take minutes to hours), then you need stream processing (e.g. Storm) or complex event processing (Esper, WSO2 CEP).
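For a flavour of how Storm structures a streaming computation, here is a minimal topology sketch. It assumes Storm 1.x package names (older releases used the backtype.storm namespace); SensorSpout and its threshold are hypothetical placeholders for a real data source.

    import java.util.Map;
    import java.util.Random;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class AlertTopology {

        // Hypothetical spout emitting one synthetic sensor reading per tuple;
        // a real spout would pull from a queue such as Kafka.
        public static class SensorSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final Random rand = new Random();

            @Override
            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                Utils.sleep(100);
                collector.emit(new Values(rand.nextDouble() * 150.0));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("reading"));
            }
        }

        // Bolt that flags high readings as they stream past, with latencies
        // in milliseconds rather than the minutes-to-hours of a batch job.
        public static class ThresholdBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                double reading = input.getDoubleByField("reading");
                if (reading > 100.0) {
                    System.out.println("ALERT: reading = " + reading);
                }
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // This bolt only consumes tuples; it declares no output fields.
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("readings", new SensorSpout());
            builder.setBolt("alerts", new ThresholdBolt(), 4)  // 4 parallel tasks
                   .shuffleGrouping("readings");
            new LocalCluster().submitTopology("alerts", new Config(),
                    builder.createTopology());
        }
    }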
I do not know of a workflow processing language for big data, though tools like Cascading are a step in that direction. There was a project called Dryad by Microsoft that planned to do this, but it is now dead in favour of MapReduce.
As others have replied, I would also suggest that it depends on which data is to be analyzed or processed. Anything from data mining algorithms to Hadoop and Storm can be used.
If you have access to parallel hardware and like to use R for analytics, consider pbdR (see r-pbd.org). It is the most scalable framework for R, tested up to 50,000 cores and terabytes of data.
I agree with the others that there is huge momentum building behind Hadoop. However, could you share more about your goals and data set? "Big data" is increasingly used to mean many different things. Hadoop lends itself to problems that can be solved through distributed strategies; other problems just need a hardware solution (MPP) with current mainstream analytics/database products.
Arguably, Hadoop is almost at the center of Big Data processing. Cloudera Impala can do faster SQL-based analytics (MPP) on Hadoop data (in HDFS) than the alternative, Hive. For machine learning, Mahout has implemented a large number of algorithms. You can look into Spark for iterative algorithms, Giraph for graph processing, and Apache Hama for Bulk Synchronous Parallel (BSP) and graph processing. These are all based on Hadoop. Cascading and Oozie can be used for Hadoop workflows.
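To make the workflow point concrete, here is a sketch of a Cascading "flow" (a word count) that chains several steps into one unit, which Cascading then plans onto MapReduce jobs. Class names follow the Cascading 2.x API and may differ in other versions; the HDFS paths are placeholders.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCountFlow {
        public static void main(String[] args) {
            // Source and sink taps bind the flow to concrete HDFS paths.
            Tap docTap = new Hfs(new TextLine(new Fields("line")), "hdfs:///in/docs");
            Tap wcTap  = new Hfs(new TextLine(), "hdfs:///out/wordcount");

            // Pipe assembly: split lines into tokens, group by token, count.
            Pipe docPipe = new Each("wc", new Fields("line"),
                    new RegexSplitGenerator(new Fields("token"), "\\s+"),
                    Fields.RESULTS);
            Pipe wcPipe = new GroupBy(docPipe, new Fields("token"));
            wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

            FlowDef flowDef = FlowDef.flowDef().setName("wordcount")
                    .addSource(docPipe, docTap)
                    .addTailSink(wcPipe, wcTap);

            // The connector plans the assembly into MapReduce job(s) and runs them.
            Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
            flow.complete();
        }
    }

The point is that the pipe assembly is a reusable, composable unit: longer workflows are built by chaining more pipes, and Cascading decides how many MapReduce jobs are actually needed.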
GraphLab is another alternative machine learning framework for Big Data.
First of all, there are many parallel processing frameworks, but each is designed for specific applications, and the general-purpose ones usually require a lot of work on the programmer's side. So, based on your application and skills, you can choose what is suitable.
In my experience, the MapReduce framework introduced by Google in 2004 is the state of the art for highly parallel, repetitive processing of large data, especially log analysis on data sets that do not fit into the memory of a cluster of machines.
The open-source version of it is Hadoop, provided by Apache; you can learn about it at http://hadoop.apache.org/docs/r0.19.0/quickstart.html. The good part is that it can be installed on a single node and then, with a configuration change, scaled out to multiple nodes depending on your cluster. The package also contains some common examples such as WordCount (i.e. the "Hello World" of Hadoop), which counts the number of occurrences of each word in a large dataset.
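For reference, here is a minimal version of that WordCount, close to the example shipped with Hadoop (written against the newer org.apache.hadoop.mapreduce API rather than the 0.19 API of the linked quickstart): the mapper emits a (word, 1) pair per token, and the reducer sums the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);  // emit (word, 1) per token
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();          // sum all counts for this word
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Compile it into a jar and run it with "hadoop jar wordcount.jar WordCount <input> <output>"; the same jar runs unchanged on a single node or a full cluster.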
There are other frameworks such as HPCC Systems (High-Performance Computing Cluster) and Dryad from Microsoft. You can find more at this link: http://stackoverflow.com/questions/19310293/is-hadoop-the-only-framework-in-big-data-space
If Hadoop is an architecture for performing distributed computing on large data, how does Grid computing differ from it, and how does Hadoop differ from Grid computing? Why was there a need for one more architecture (Hadoop) if the Grid computing architecture was already available at the time?
I don't have exposure to Grid computing, but I think http://developer.yahoo.com/hadoop/tutorial/module1.html will shed more light on how Hadoop addresses concerns related to distributed computing. The key difference is that in Hadoop, the computing code is moved to the node where the data resides (in the ideal case).
In Grid computing, code in the form of an executable is moved to a computing machine; the code is executed there, and the output, in the form of a file, is transferred using the GridFTP protocol.
Hadoop is the best choice, but its batch processing is very slow (jobs typically run on a daily basis). The most valuable advantage of Hadoop is its virtually unlimited scalability.
If you will accept commercial products, Netezza is better. Revolution R Enterprise for Netezza is a good framework that works with R.
The MapReduce framework is very important here: it is distributed in nature, tasks are executed in parallel, and the framework manages their execution, starting and stopping tasks as needed. Hadoop is used for storage, and processing can be done with MapReduce. In Hadoop, large clusters can be built from commodity machines, so for big data processing, Hadoop can be the right choice. Readers can correct me if there are other points as well.
You can check the different big data processing paradigms and technologies at http://www.slideshare.net/Datadopter/the-three-generations-of-big-data-processing
For MS SQL Server 2012, Microsoft is building connectors for Hadoop, which is an extremely popular NoSQL platform. We need to explore the implementation details for these.
Also have a look at http://stratosphere.eu. It is a general-purpose data processing system developed in Europe. It has some unique features, such as an optimizer, native support for iterative algorithms, and more.
Beyond that, it generalizes MapReduce, allowing many more operators and advanced data flows.
The nice thing about it being a general-purpose system is that it covers a broad range of use cases, from traditional relational workloads to more advanced machine learning or graph analysis algorithms.
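As an illustration, here is a small sketch in the style of Stratosphere's Java API. The package and method names follow the 0.4-era API and should be treated as assumptions (the project later continued as Apache Flink); the paths are placeholders.

    import eu.stratosphere.api.java.DataSet;
    import eu.stratosphere.api.java.ExecutionEnvironment;
    import eu.stratosphere.api.java.aggregation.Aggregations;
    import eu.stratosphere.api.java.functions.FlatMapFunction;
    import eu.stratosphere.api.java.tuple.Tuple2;
    import eu.stratosphere.util.Collector;

    public class TokenCount {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<String> text = env.readTextFile("hdfs:///in/docs");

            // Unlike raw MapReduce, operators such as join, cross, cogroup
            // and native iterations can be chained into one data flow; the
            // system's optimizer then picks the execution strategies.
            DataSet<Tuple2<String, Integer>> counts = text
                    .flatMap(new Tokenizer())
                    .groupBy(0)
                    .aggregate(Aggregations.SUM, 1);

            counts.writeAsCsv("hdfs:///out/counts");
            env.execute("token count");
        }

        public static class Tokenizer
                extends FlatMapFunction<String, Tuple2<String, Integer>> {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                for (String token : line.toLowerCase().split("\\W+")) {
                    if (!token.isEmpty()) {
                        out.collect(new Tuple2<String, Integer>(token, 1));
                    }
                }
            }
        }
    }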
Lambdoop (www.lambdoop.com) is a really cool framework for developing Big Data applications. You can check out its features at http://www.slideshare.net/Datadopter/lambdoop-a-framework-for-easy-development-of-big-data-applications