We work in bioinformatics and chemoinformatics and need your help. I would greatly appreciate suggestions for the best software for mining big data, preferably free or open source.
Dear Mr Gajendra, in my humble opinion there is no single best software; each has advantages and disadvantages. Sometimes combining independent functions from several packages gives a better result. I suggest you take a look at the following software:
Weka (http://www.cs.waikato.ac.nz/ml/weka/)
R (http://cran.r-project.org/)
Orange (http://orange.biolab.si/)
Knime (https://www.knime.org/)
RapidMiner Community Edition (https://rapidminer.com)
I think the choice depends solely on the type of data. For example, if you have data from next-generation sequencing machines, Python may help. If the data are metabolomics or network data, consider a graph database such as Neo4j. R is definitely a good option for medium-sized datasets, but it can only be used after preprocessing the data coming from any high-throughput technology. If you want to model data and use machine learning approaches, then WEKA is a widely preferred choice. You may also have a look at HDF5 (http://www.hdfgroup.org/HDF5/).
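As a rough illustration of the Python suggestion for sequencing data, the sketch below streams FASTQ records one at a time, so the whole file never has to fit in memory. The function names and the tiny in-memory sample are made up for illustration, and the parser assumes plain four-line records with no blank lines.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line carries no data we need
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality  # drop the leading '@'


def gc_fraction(sequence: str) -> float:
    """Return the fraction of G and C bases in a sequence."""
    if not sequence:
        return 0.0
    gc_count = sum(1 for base in sequence.upper() if base in "GC")
    return gc_count / len(sequence)


# Hypothetical two-read sample; a real run would pass an open file instead.
sample = io.StringIO("@read1\nGGCC\n+\nIIII\n@read2\nATAT\n+\nIIII\n")
for read_id, seq, _quality in read_fastq(sample):
    print(read_id, gc_fraction(seq))
```

Because `read_fastq` is a generator, the same loop works on a multi-gigabyte file opened with `open(path)`; only one record is held in memory at a time.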
Thanks for the suggestions. Our group already uses R, Weka, SVMlight, SNNS, and RapidMiner for developing prediction methods. One of the major problems is speed: if I build a model on a large number of patterns, techniques like SVM take a huge amount of time. "Big data" is a new term that we heard only recently. I am interested in whether there are software tools that can mine large data in reasonable time. I mean, is there any tool specifically designed to develop models on huge data? For example, Hadoop was developed specifically for managing big data.
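On the speed problem itself: one commonly used workaround, which is not one of the tools named in this thread, is to train a linear SVM by stochastic gradient descent, so that each update touches a single pattern and the cost of one pass grows linearly with the number of patterns. The sketch below is a minimal Pegasos-style version in plain Python. The function names, the toy data, and the choice to omit a bias term (which assumes the data are roughly centered on the origin) are all illustrative assumptions, not anything the packages above provide.

```python
import random
from typing import List, Sequence


def train_linear_svm_sgd(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    lam: float = 0.01,
    epochs: int = 5,
    seed: int = 0,
) -> List[float]:
    """Pegasos-style SGD for a linear SVM with labels in {-1, +1}, no bias."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)  # visit patterns in a new random order each pass
        for i in order:
            step += 1
            eta = 1.0 / (lam * step)  # decreasing learning rate
            x, y = samples[i], labels[i]
            margin = y * sum(w * xi for w, xi in zip(weights, x))
            # Regularization step: shrink the weights toward zero.
            weights = [(1.0 - eta * lam) * w for w in weights]
            # Hinge-loss step: only patterns inside the margin move the weights.
            if margin < 1.0:
                weights = [w + eta * y * xi for w, xi in zip(weights, x)]
    return weights


def predict(weights: Sequence[float], x: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1


# Toy, linearly separable data (made up for illustration).
toy_samples = [(2.0, 1.5), (1.0, 2.5), (-2.0, -1.0), (-1.5, -2.5)]
toy_labels = [1, 1, -1, -1]
toy_weights = train_linear_svm_sgd(toy_samples, toy_labels)
print(predict(toy_weights, (3.0, 3.0)), predict(toy_weights, (-3.0, -3.0)))
```

This only covers the linear case; a kernel SVM such as the ones in SVMlight solves a different, more expensive optimization problem, so the sketch is a trade of model flexibility for training speed rather than a drop-in replacement.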
In my view, the best statistical analysis and mining package at present is R, and the best infrastructure for handling big data is Hadoop, so combining the two can help make sense of big data. I suggest using the "RHadoop" package.
You may like to read about how to install it at this link:
"RHadoop" uses the streaming feature of Hadoop (MR2) and implicitly turns the R code into MapReduce code that runs on data stored in HDFS.
If you are looking for batch-level processing of the data, then the Hadoop stack is your best choice. However, if you are looking for more real-time data mining, then Spark or DataStax (http://www.datastax.com/) is a better route.