I'd like to add to this discussion the perspective of a professional statistician with 40 years' experience in the analysis of medical and stock market data.
In the early part of the twentieth century, statistics developed techniques to assist in the analysis of experimental data. The key idea was that, in realms where "noise" often masked the "signal", properly designed experiments using randomization could, by way of the law of large numbers and the central limit theorem, support high-quality hypothesis tests that cleverly removed the distorting effects of the noise.
A big wall existed between experimental data (that which has relatively little noise, e.g. physics, or that which explicitly uses randomization) and observational data. In the latter case, one usually observes the time course of a complicated system that does not allow experimentation, or of sentient beings engaging in decision making. Moreover, the world of human activity is highly structured and, as many have written, often hierarchical – and because it is so nonrandom, statistical hypothesis testing, per se, does not apply. Thus, while one should expect patterns in observational data of human activity, e.g. consumer choice data, stock market data and macroeconomic data, the use of such time-honored statistics as t-tests, analyses of variance and, yes, even regression is questionable at best.
Now, mathematics is a study of how narrowly formulated problems can be solved. Mathematics creates its own patterns, but these do not necessarily map faithfully onto the real world. And statistics, which was originally developed to help scientists design experiments, has in recent times become more and more a study of how to make sense of large data sets. But note, these are studies of observational data – and statistics has competitors in this arena: computer science, machine learning, pattern recognition and, quite recently, "data mining."
None of this should mask what is going on, though. Finding patterns in data involving human activity is really more like social psychology and sociology than math, statistics or computer science. And finding patterns in big data has another problem, the "needle in the haystack" problem: if a dataset is large enough, then under most probability models there will be many spurious patterns. In statistics, this is called "the problem of multiplicity."
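As a small illustration of the multiplicity problem, the following sketch (Python with numpy; the sample sizes are arbitrary choices of mine) generates many mutually independent random variables and shows that the largest pairwise correlation can still look impressive even though every variable is pure noise.

```python
# Illustration of the multiplicity problem: with enough independent
# variables, some pairwise correlations look "significant" purely by chance.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 500                     # 100 observations of 500 unrelated variables
X = rng.standard_normal((n_obs, n_vars))

corr = np.corrcoef(X, rowvar=False)          # 500 x 500 correlation matrix
np.fill_diagonal(corr, 0.0)                  # ignore the trivial self-correlations
print("largest spurious |correlation|:", np.abs(corr).max())
# Typically around 0.4 here, even though every variable is pure noise.
```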
Statisticians today have adapted somewhat to the changed conditions. For one thing, they are much more knowledgeable about the subject matter they're dealing with. It is inappropriate to use the same methods to analyze studies of rats in a laboratory and the patterns of purchases from a retail enterprise with 50,000 SKUs. Thus, for big (observational) data, one needs first to identify the goals: what is it you want to know from these data, and how do you intend to use what you learn? Once that question is answered, statisticians recognize two ways to reduce the problem of multiplicity. The first is to become familiar enough with what is known about the subject matter, and then to look only for the relevant types of patterns. The second is to invoke omnibus analysis methods that will likely produce many spurious patterns, and then filter those patterns to eliminate the ones that "make no sense" in the light of circumstances.
First, I agree with Prof. Arina Tsam's view on big data analysis. Beyond this, I want to add some comments that may or may not serve your purpose; it depends on the area of study and the subject itself:
My suggestions are:
1. Compute the correlation matrix for the large number of variables.
2. Perform a factor analysis using the correlation matrix.
3. Find the communality and specificity of the variables.
4. Reduce the number of variables by keeping the most effective ones (a rough sketch follows).
5. Draw a conclusion.
Otherwise, from a statistical point of view, the problem can be tackled with a least-squares approximation or an LU decomposition technique. All the best.
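For what it is worth, here is a minimal sketch of steps 1–4 in Python (a principal-component style of factor extraction; the placeholder data, the number of factors and the 0.3 communality cutoff are illustrative assumptions, not recommendations):

```python
# Sketch of steps 1-4: correlation matrix -> factor extraction ->
# communality/specificity -> variable reduction (principal-component factoring).
import numpy as np

def factor_summary(X, n_factors=3):
    R = np.corrcoef(X, rowvar=False)              # 1. correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)          # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]             # sort factors by variance explained
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])  # 2. loadings
    communality = (loadings ** 2).sum(axis=1)     # 3. shared variance per variable
    specificity = 1.0 - communality               #    unique variance per variable
    return loadings, communality, specificity

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))                # placeholder data: 200 obs, 10 vars
loadings, communality, specificity = factor_summary(X, n_factors=3)
keep = np.where(communality > 0.3)[0]             # 4. keep variables the factors explain well
print("variables retained:", keep)
```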
@Hemanta, I have never said that big data analysis is not mathematical! Just one sentence from the attached blog, "• Learn enough about the underlying math and statistics 'to have a general idea of how the model works and when it might be invalid,'" speaks for itself, does it not? Regards!
It is interesting to look for relationships within the data via computer-aided analysis. My normal use of this approach has been to understand a scene for which only incomplete data are available (either non-commensurate or fading). Where mathematics becomes useful is in building the underlying scene-analysis process and then in understanding the limitations of the resulting estimated relationships. With that caveat, it is possible to begin to understand intent from the resulting relationships, whether in time, space or population.
How does this apply to business-oriented decisions for big data analysis? In each case, there is a set of underlying assumptions about discovering either time-varying or time-invariant relationships. I suspect one of the key issues is to see how transient relationships form a scene, how that scene evolves into subsequent scenes, and to discover the likely independent variables that drive the scene forward.
Had a minute or two and am just throwing in a couple of cents worth.
Hi Prof. Hemanta and others. My opinion is that big data analysis today is as dependent on computers, and as supported by mathematical and logical analysis, as small data analysis is. Mathematics is a nice language tool with its own logical power and its own limits; that knowledge has helped to develop better computers, better numerical methods and better data analysis practices. Data size is not the main point; the analytical quality of the fundamentals is central. Thus mathematical and logical education has contributed to those tasks in general, but not always and not in all realms. Thanks, emilio
1. Whatever the method used to probe big data, one needs some concept, that is, some model to represent the patterns we are looking for. It happens that in a large class of models these patterns exist and take a definite mathematical form (Hilbert spaces). From there it is possible to proceed against a more comfortable mathematical background, and to set up the basis for the definition of observables, laws of evolution, phases, ... And this kind of analysis holds whatever the field to which the data relate.
2. One very interesting aspect is the study of big data not by one observer / system / statistician but by a network of interrelated such operators. Interacting systems are not equivalent to the same set of isolated systems, so they can draw very different insights from the same set of big data.
The forum question is "Does mathematics fully support big data?" The literal answer to this question is: yes, mathematics does support big data, in the sense that any mathematics or mathematical algorithms applied to it will produce results. Most standard, out-of-the-box big data algorithms are supported by software packages, e.g. SAS, Matlab, R, etc. But that does not mean that indiscriminate mathematics applied to big data is relevant or a good idea. Therefore, I do not believe that the question interpreted literally is really being asked here, because mathematics can be applied to any numerical (and also lots of non-numerical) data.
Instead, I think the heart of the question is whether big data algorithms are justifiable. To this question I do not believe there is any one answer for all cases, but in general, I'd say no. To me it depends on the nature of the data, and especially whether it was generated by randomized experiments or is just observed. In a previous post, I discussed the relationship between experimental data and observational data. Most big data is observational - it does not arise from experiments and often involves human activity and decision making. This type of data is decidedly nonrandom, and the use of statistical methods is generally inappropriate.
The use of non-statistical methods, especially unsupervised fishing trips using, for example, principal components, cluster analysis, self-organizing maps, syntactic methods or rule-based discovery, is quite useful, but such methods should be applied with caution. The reason: such algorithms will find lots of spurious patterns. Thus, using them requires a strategy for detecting spurious patterns; one crude possibility is sketched below.
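One crude but practical strategy is to compare what an unsupervised method finds in the real data with what it finds after the joint structure has been destroyed, e.g. by shuffling each column independently. A hedged sketch, assuming scikit-learn's KMeans is available (the number of clusters and permutations are arbitrary choices):

```python
# Crude check for spurious clusters: compare clustering quality on the real
# data with the quality obtained after destroying the joint structure by
# shuffling each column independently (a permutation "null" baseline).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_inertia(X, k=5, seed=0):
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_

def permutation_baseline(X, k=5, n_perm=20, seed=0):
    rng = np.random.default_rng(seed)
    inertias = []
    for _ in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in X.T])
        inertias.append(kmeans_inertia(Xp, k))
    return np.mean(inertias)

# If the real inertia is not clearly smaller than the shuffled baseline,
# the "clusters" are probably not worth reporting.
```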
In the previous post, I suggested that knowledge of the domain was essential, be it retail, medical or psychological, because such knowledge will help identify “patterns that don't make any sense.” The alternative is to code the data in a way that limits the search to patterns that are believed to be meaningful – a very difficult undertaking.
In our daily life we wake up with mathematics and go to bed with mathematics, so the question of whether big data is mathematics or not is meaningless. Rather, for each specific application we should check the reliability coefficient as well as the validity of the data set (however big or small it is). Then we may go for the necessary statistical methods. Thank you all.
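As one concrete example of a reliability coefficient, here is a minimal sketch of Cronbach's alpha (assuming the items intended to measure the same underlying quantity form the columns of a numeric matrix; the example data are made up):

```python
# Minimal Cronbach's alpha: a common reliability coefficient for a set of
# items (columns) intended to measure the same underlying quantity.
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)        # shape: (respondents, items)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)         # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Example: alpha for five noisy measurements of the same latent score.
rng = np.random.default_rng(4)
latent = rng.standard_normal((300, 1))
items = latent + 0.5 * rng.standard_normal((300, 5))
print("Cronbach's alpha:", round(cronbach_alpha(items), 2))
```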
Why was the 'big data' buzzword invented in the first place? I think the problem is largely of a practical nature, not mathematical. More precisely, the data we would like to analyze are indeed big; in other words, they don't fit into a single computer's memory, sometimes not even into distributed cluster memory. Worse yet, they continuously change and therefore are not synchronized most of the time: some records are outdated and no longer valid, while others are unavailable at the time of analysis simply because they are currently being updated. All this may come as a surprise to many researchers, but not to physicists. We physicists know very well that our experimental knowledge is always uncertain, and the trick is to reduce those uncertainties as much as we can. Statistics is very helpful, but inventing a new experimental protocol may prove much more successful. Similarly with big data: perhaps we need new statistical ideas? Anyway, I think those new tools will be of a purely mathematical nature, even (or especially?) when the data of interest are neither numerical nor plain text.
Practically speaking, the dimensionality of the data matrices is key, where the rows represent variables and the columns observations. Classical methods were designed for short, wide data matrices, with a few variables measured in many samples. "Big data" matrices are tall and narrow: many (thousands to tens of thousands of) variables measured in a relatively small number of samples. New mathematical and statistical methods are needed to handle big data matrices. For example, novel methods for the integrated analysis of multiple data matrices (data fusion) should help to reduce spurious, random signals.
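A tiny sketch (Python/numpy, using the more common samples-by-variables layout in code) of why classical least squares breaks down in this tall-and-narrow regime, which is one reason penalized or integrative methods are needed:

```python
# When variables (p) far outnumber samples (n), the matrix that ordinary
# least squares needs to invert is rank-deficient: no unique solution exists.
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2000                              # few samples, many variables
X = rng.standard_normal((n, p))

print("rank of X:", np.linalg.matrix_rank(X), "but p =", p)
# rank <= n = 30, so X.T @ X (the p x p matrix behind the normal equations)
# is singular and infinitely many coefficient vectors fit the data exactly.
```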
More to the point of Dr. Baruah's question, big data analysis is still very much a craft as well as a science, especially in the "softer" domains, e.g., political and commercial arenas. Each data set is different and the trick is to know what always works and what varies from problem to problem. This requires a good bit of experience as well as talent and technical ability.
You mean, big data analysis is data dependent, and one has to see which particular 'trick' works well for a given set of data. Have I understood correctly?
Big data analysis is a multi-step process and these steps are quite constant no matter what the data. The steps include normalization, transformation, data reduction, data modeling and validation. There are many computational methods available at each step and what to do at each step is data dependent to a large extent. Moreover, the goals of the analysis (e.g., prediction, classification, signal detection, etc.) will also impact how the data is analyzed. Lastly, leveraging prior knowledge will also suggest a specific method to use. For example, if you know that the signal of interest is sparse, then L1 regularization should work better than standard methods based on the L2 norm (e.g., least-squares). Basically, the better you know your data, the better your results. So, in that sense, big data analysis is most definitely data dependent.
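As a hedged illustration of that last point (synthetic data from scikit-learn; the penalty strengths are arbitrary, untuned choices), an L1 penalty recovers a sparse coefficient vector while an L2 penalty spreads weight over all variables:

```python
# Sparse ground truth: only a handful of the many variables matter.
# L1 (lasso) tends to recover that sparsity; L2 (ridge) spreads weight around.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y, true_coef = make_regression(n_samples=100, n_features=500,
                                  n_informative=10, noise=5.0,
                                  coef=True, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)      # alpha values here are arbitrary choices
ridge = Ridge(alpha=1.0).fit(X, y)

print("nonzero true coefficients: ", np.sum(true_coef != 0))
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))
print("nonzero ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-6))
```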
If by heuristic you mean data (and context) dependent, then yes it is heuristic. But the tools you decide to use for each step of the analysis must be mathematically grounded. How the tools are chosen and strung together is probably the most heuristic part of the process.
For example, clustering analysis is data dependent in the sense that a certain algorithm may return better results than another for some particular set of data. Why it is better is not mathematically defined. This is what I meant by heuristic.
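A small illustration of that data dependence, using scikit-learn's toy two-moons data (the eps and min_samples settings are illustrative guesses, not tuned values): k-means assumes roughly spherical clusters and tends to split each moon, while a density-based method usually recovers them.

```python
# Clustering is data dependent: k-means assumes compact, roughly spherical
# clusters; DBSCAN groups by density and handles the curved "moons" shape.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("k-means agreement with true shape:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN agreement with true shape: ", adjusted_rand_score(y_true, db_labels))
# On this data DBSCAN typically recovers the two moons almost perfectly while
# k-means splits them; on compact, well-separated blobs the ranking can reverse.
```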
Various applications, such as recommendation systems and stock prediction, are built on big data analysis. It is generally believed that high-quality data benefit model learning, especially when a large number of labeled samples is available for machine learning. In fact, most current big data analysis techniques have their roots in statistical learning theory (SLT).
This is precisely what I am asking! What is the mathematical foundation behind Big Data Analysis?
I believe the mathematical foundation behind big data analysis is nonlinear mathematics, including, for example, fractal geometry, power-law statistics, chaos theory and complexity science; see the related papers listed below (a small sketch of the head/tail breaks idea follows the references):
Jiang B. (2015), Wholeness as a hierarchical graph to capture the nature of space, International Journal of Geographical Information Science, xx(x), xx-xx, Preprint: http://arxiv.org/abs/1502.03554
Jiang B., Yin J. and Liu Q. (2014), Zipf’s Law for all the natural cities around the world, International Journal of Geographical Information Science, xx(x), xx-xx, Preprint: http://arxiv.org/abs/1402.2965
Jiang B. and Miao Y. (2014), The evolution of natural cities from the perspective of location-based social media, The Professional Geographer, xx(xx), xx-xx, DOI: 10.1080/00330124.2014.968886, Preprint: http://arxiv.org/abs/1401.6756
Jiang B. (2015b), Geospatial analysis requires a different way of thinking: The problem of spatial heterogeneity, GeoJournal, 80(1), 1-13.
Jiang B. (2015a), Head/tail breaks for visualization of city structure and dynamics, Cities, 43, 69-77.
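Since head/tail breaks comes up in the last reference, here is a minimal sketch of the idea as I understand it (my own paraphrase, not code from the papers): heavy-tailed data are split at the mean, and the head above the mean is split again as long as it remains a small minority.

```python
# Minimal head/tail breaks sketch for heavy-tailed data: split at the mean,
# keep recursing on the "head" (values above the mean) while the head stays
# a small minority of the current values.
import numpy as np

def head_tail_breaks(values, head_limit=0.4):
    values = np.asarray(values, dtype=float)
    breaks = []
    while len(values) > 1:
        mean = values.mean()
        head = values[values > mean]
        if len(head) == 0 or len(head) / len(values) > head_limit:
            break                      # head no longer a clear minority: stop
        breaks.append(mean)
        values = head                  # recurse on the head only
    return breaks

# Example with a Pareto (power-law-like) sample:
rng = np.random.default_rng(3)
data = rng.pareto(1.5, size=10_000)
print("class boundaries:", head_tail_breaks(data))
```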