I partly disagree with Juan Luis Herrera Cortijo's answer. Statistics and Data Mining are two different things, except that certain Data Mining approaches use methods of Statistics. Statistics is a centuries-old and well-established methodology of science. Data Mining is a relative neologism, grossly misused. Methods of Statistics are generally pretty well founded and mathematically sound (and those Data Mining approaches which use them are probably the most successful ones). However, more generally, Data Mining methodologies also use elements of so-called 'Artificial Intelligence', which is much murkier terrain.
I also do not quite agree with the statement that Data Mining does not try to model the 'world'. Data Mining operates without an 'a priori' model. However, getting closer to establishing even the simplest characteristics of relationships within the data is the essence and the goal of Data Mining techniques.
In my view, Data Mining has seen its heyday. If you have data which you understand, apply Statistics and interpret the results. If you do not understand your data, you are unlikely to gain much knowledge by throwing it into a Data Mining algorithm unless you understand exactly what it is doing. And if you understand what it is doing, you will likely be referring to this algorithm by its name, such as clustering, regression, etc.
Both Statistics and Data Mining try to do the same thing: find a mapping between some input and some output in some world.
Statistics approaches this problem by trying to model the world using stochastic processes. Once you have a model, you can draw more samples from it and reason about the whole population.
On the other hand, in Data Mining you don't depend on a model of the world. You just try to find a function that maps inputs into outputs without any assumption about how the world works (or at least with the fewest possible assumptions).
Edited: My statement "in Data Mining you don't try to model the world" was not exactly what I had in mind. Replaced with "in Data Mining you don't depend on a model of the world".
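To make the contrast above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy data, the linear-plus-Gaussian-noise model standing in for the "stochastic process" view, and the k-nearest-neighbours predictor (with the hypothetical name `knn_predict`) standing in for a function fitted with few assumptions. It is not meant as the definitive form of either approach.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: the "world" here is y = 2x + 1 plus Gaussian noise.
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.5) for x in xs]

# Statistics-style: assume a stochastic model y = a + b*x + N(0, sigma^2),
# estimate its parameters by least squares, then sample from the model.
mx, my = statistics.fmean(xs), statistics.fmean(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sigma = statistics.stdev(residuals)  # rough estimate of the noise level


def simulate(x):
    """Draw a new observation at x from the fitted stochastic model."""
    return a + b * x + random.gauss(0, sigma)


# Data-mining-style: no model of the world, just a mapping from input to output.
def knn_predict(x, k=5):
    """Average the outputs of the k training points closest to x."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean(y for _, y in nearest)


print("fitted model:", a, b, sigma)
print("new sample from the model at x=2.5:", simulate(2.5))
print("model-free prediction at x=2.5:", knn_predict(2.5))
```

The fitted model can generate as many new samples as you like and supports reasoning about the population; the nearest-neighbour function only answers "what output goes with this input".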
I fully agree with the interpretation of Zbigniew Struzik.
For me, the name Data Mining, or any of its many synonyms (Archaeology of Data, Information Harvesting, Information Discovery, Knowledge Extraction, "Knowledge Discovery in Databases", "Fishing Data" or "Data Dredging"), refers to a set of statistical and database techniques used to discover relevant information in large databases. Rather than being a Science per se, Data Mining seems to be a virtuous blend of statistical methods and the development of efficient algorithms for databases. Data Mining may also be considered a specialization of statistics, just like Biometrics, Econometrics, Demography, etc., but strictly speaking, Data Mining seems to lack any exclusive method: in the first place, it cares little about the assumptions and foundations behind the methods used, and in the second place, those methods were borrowed from other sciences. Its first concern is discovering some useful meaning in the data.
Data Mining is a really broad term that includes data management and analysis. The question posted seems to be directed at the analysis aspects, and my answer is more related to the machine learning techniques often involved.
In Data Mining you no more throw your data at a random algorithm to see what happens than you would apply a random statistical technique under the assumption of a random probability distribution to see what comes out. You need to understand the data and check that it is well suited to the algorithm you intend to apply, which in turn must be appropriate for your objective (classification, for example). It's just that you care less about why the input data is related to the output data than about how you can relate the two.
Statistics is not "just representation of data". It is the study of the collection, organization, analysis, interpretation and presentation of data. It deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments (see for example http://en.wikipedia.org/wiki/Statistics). There, Data Mining is presented as a specialized discipline: "Applying statistics and pattern recognition to discover knowledge from data".
You are also confusing the names: Statistics with Statics.
Statics is the branch of mechanics that is concerned with the analysis of loads (force and torque, or "moment") on physical systems in static equilibrium, that is, in a state where the relative positions of subsystems do not vary over time, or where components and structures are at a constant velocity.
I would add that an important difference between statistics and data mining is inference.
At least in the social sciences, most of the time statistics are used on a sample and one tries to generalize the conclusions to the population (hypothesis testing, p-values, ...).
Data mining procedures, by contrast, are more often used on the whole population (or on very large data sets). The idea is then to find patterns in the data, with no further concern for confidence in generalisation.
This is why data mining is less concerned with data collection, sampling techniques, hypothesis testing, and so on.
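A small sketch of that difference, under loud assumptions: the population below is simulated (made-up numbers), the sample size of 100 is arbitrary, and the interval uses the usual normal approximation with 1.96 standard errors. It only illustrates the sample-to-population step that statistics adds and that a full-data analysis skips.

```python
import random
import statistics

random.seed(1)

# Hypothetical population of 100,000 values (simulated for illustration).
population = [random.gauss(50, 10) for _ in range(100_000)]

# Statistics-style: observe only a sample, then generalise with stated uncertainty.
sample = random.sample(population, 100)
mean = statistics.fmean(sample)
sem = statistics.stdev(sample) / len(sample) ** 0.5  # standard error of the mean
ci = (mean - 1.96 * sem, mean + 1.96 * sem)          # approximate 95% interval

# Data-mining-style: the whole data set is at hand, so the summary is computed
# directly and there is no generalisation step to quantify.
population_mean = statistics.fmean(population)

print("sample mean and approximate 95% interval:", mean, ci)
print("mean computed on the full data:", population_mean)
```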
In statistics, the knowledge is not hidden; rather, you are able to observe it directly on your own. Statistics only lets you prove your observation (hypothesis) scientifically so that the community accepts it. Data mining, on the contrary, is an exploratory tool. You have no idea about the knowledge hidden inside the data. Data mining lets you "discover" that invisible knowledge.
Data mining deals with large-scale data sets, usually with complex interactions between the data items. The larger the scale and the greater the complexity, the more difficult it is for statistics to uncover the knowledge inside the data, because more data means more complex interactions between data items, which are harder to observe directly.
In statistics you provide a theory first and test it using statistical tools. In data mining you dig into the data and find some patterns, and then make up some theories :) This is the main difference between them.
You are right that both are used in data analysis. However, they are different tools; in other words, statistics is a tool used by Data Mining to obtain its results.
For example, we use statistics to calculate Support and Confidence so that we can obtain the data mining result and see, for instance, the association between items.
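As a rough illustration of the Support and Confidence calculation mentioned above, here is a minimal Python sketch. The transactions are made up, and the helper names `support` and `confidence` are hypothetical, not taken from any particular library; the definitions used are the standard ones from association rule mining (support = fraction of transactions containing the itemset, confidence = count of transactions with both sides divided by count with the antecedent).

```python
# Hypothetical market-basket transactions (made-up data for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        return 0.0  # rule never applies in this data
    return count(set(antecedent) | set(consequent), transactions) / n_antecedent


print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```

Here the rule "bread -> milk" holds in 3 of the 4 baskets that contain bread, and bread and milk appear together in 3 of the 5 baskets overall.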
According to my understanding, you can't study data mining without knowing statistics, but you can study statistics without knowing data mining.
Bottom line: Statistics is a general tool used in data analysis that (in absolute terms) does not take any context into consideration, whereas in Data Mining you need to use statistics plus contextual knowledge to get the intended results.
I disagree with your comment about what the foundation of Statistics is. As stated before, it is the study of the collection, organization, analysis, interpretation and presentation of data. It deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments (see for example http://en.wikipedia.org/wiki/Statistics). There, Data Mining is presented as a specialized discipline: "Applying statistics and pattern recognition to discover knowledge from data". So when you use data mining on the basis of sound mathematical methods, by means of heuristic software, and you study the facts, you are doing Statistics.
One more difference is in the use of heuristics. Data mining algorithms make generous use of heuristics, while there is no scope for heuristics in statistics.
Generally speaking, data mining is considered more valid, as it applies computational and algorithmic techniques automatically and can play a major role in exploring the features and obtaining results without much human interference.
On the other hand, statistics could be seen as a pure (black box) field for manipulating the data sets or the period of study themselves. DM and Statistics nevertheless complement each other in many cases; for example, someone can use data mining to preprocess the data and make it ready for later statistical analysis. Many research problems can use both, in different orders and in different ways.
The second difference is that data mining is much more closely related to machine learning, while statistics is not.
Finally, the two can contradict each other in many cases. I advise reading the book "Learning from Data" by Yaser Abu-Mostafa and trying to learn Matlab.
1. Statistical analysis is designed to deal with structured data in order to solve structured problems. Results are software and researcher independent. Inference reflects statistical hypothesis testing.
2. Data mining is designed to deal with structured data in order to solve unstructured business problems. Results are software and researcher dependent (absence of implementation standards). Inference reflects the computational properties of the data mining algorithm at hand.
Note: Accurate prediction is more important than the explanation
The two resemble each other in exploratory data analysis, but statistics focuses on data sets far smaller than those used by data mining researchers.
• Statistics is useful for verifying relationships among a few parameters when the relationships are linear. It includes everything from planning the collection of data and subsequent data management to end-of-the-line activities such as drawing inferences from numerical facts called data and presenting the results. Statistics is concerned with one of the most basic of human needs.
• Data mining builds many complex, predictive, nonlinear models which are used for predicting behavior affected by many factors. Data mining is used to discover patterns and relationships in your data in order to help you make better business decisions. Data mining cannot be ignored: the data are there, and the methods are numerous.
Data mining is, like statistics, a discipline that deals with data analysis. The main difference is that data mining is a science that was born and grew up with the computer and information technology revolution in mind.