Some authorities, like Davenport, have already explained that traditional (small?) data relates to corporate operations, while big data relates to corporate products and services.
In these papers, I have argued that small data analysis is based on Gaussian statistics while big data analysis is based on power-law statistics, and that small data analysis rests on Euclidean geometry while big data analysis rests on fractal geometry.
Jiang B. (2015b), Geospatial analysis requires a different way of thinking: The problem of spatial heterogeneity, GeoJournal, 80(1), 1-13, Preprint: http://arxiv.org/ftp/arxiv/papers/1401/1401.5889.pdf
Jiang B. (2015a), Head/tail breaks for visualization of city structure and dynamics, Cities, 43, 69-77, Preprint: http://arxiv.org/ftp/arxiv/papers/1501/1501.03046.pdf
Jiang B. and Miao Y. (2014), The evolution of natural cities from the perspective of location-based social media, The Professional Geographer, xx(xx), xx-xx, DOI: 10.1080/00330124.2014.968886, Preprint: http://arxiv.org/abs/1401.6756
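To make the statistical contrast concrete, here is a minimal sketch of the head/tail breaks idea from the cited Jiang (2015a) paper. This is my own paraphrase in Python, not the paper's reference implementation; the 40% head limit and the Zipf-like example data are illustrative assumptions.

```python
# Minimal sketch of head/tail breaks (after Jiang 2015a): recursively split
# the values around the arithmetic mean for as long as the "head" (values
# above the mean) stays a small minority. Repeated splits only keep happening
# for heavy-tailed (power-law-like) data, not for Gaussian-like data.

def head_tail_breaks(values, head_limit=0.4):
    """Return the class boundaries (successive means) found by recursive splitting."""
    breaks = []
    data = list(values)
    while len(data) > 1:
        mean = sum(data) / len(data)
        head = [v for v in data if v > mean]
        if not head or len(head) / len(data) > head_limit:
            break  # the head is no longer a clear minority: stop splitting
        breaks.append(mean)
        data = head
    return breaks

# Zipf-like rank-size data (illustrative): several breaks emerge,
# whereas near-Gaussian data would stop after the first split.
city_sizes = [1000 / rank for rank in range(1, 101)]
print(head_tail_breaks(city_sizes))
```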
It seems that Bin's definition/description is based on estimation methodology, while Davenport's definition/description is based on application. As far as I can see here, Bin's is the more inclusive. It would be difficult to cover all applications in one definition/description, but I would think Davenport may be considering an area too narrow to really cover the official statistics in which I was involved for a number of years. Looking at energy statistics, usually from establishment surveys, the finite populations were numerous, and it was not practical to take a census of all of them without increased nonsampling error and impractical burden, yet these populations were generally far from providing "big data."
My concern is that "big data" is now just a term to cover cases where we are able to collect so much data that we are overwhelmed by it.
With the "traditional" data, we find that if we try to collect too much , nonsampling error is a huge problem. The idea of the sample is to reduce burden and expense and nonsampling error, by including sampling error, which may still leave us with less overall error (total survey error). There was a good story by Ken Brewer about his mentor, Ken Forster in the 1940's in official statistics in Australia, I think in an ISR journal article, about how the knighted head of the primary statistical agency complained that not doing a census was adding errors! But we know now that a sample can do better.
Now that we can collect huge amounts of data, the problem is somewhat different, but similar: "How appropriate are the data?" Nonsampling error may sometimes be less about measurement error and more about scope, or some other bias, but one still needs to be wary of the data. Now, however, because there is so much of it, there are many more interactions between data, and such complications may muddle our understanding of what the data are expressing. The temptation to let observed data relationships take the lead, and explain what they mean later, may be OK for some exploratory purposes, but we have to remember that correlation is not causation.
With traditional data, we may have more quality control over collection (maybe) and perhaps somewhat more straightforward interpretations, but if we have too little data, sampling error is more of a problem. Also, the finite population correction (fpc) factor may be important. For big data, we may have an abundance of data, but its appropriateness for a given use may be a challenge, as may the increased complexity of relationships and perhaps too great a tendency to "let the data speak for themselves" without proper subject-matter theory to support it, thus increasing the likelihood of confusing correlation for causality.
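To make the fpc remark concrete, here is a minimal sketch using the standard textbook formula for the standard error of a mean under sampling without replacement; the establishment counts in the example are purely hypothetical.

```python
import math

def se_of_mean(sample_sd, n, N=None):
    """Standard error of a sample mean, with the finite population
    correction (fpc) applied when the population size N is given."""
    se = sample_sd / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))  # fpc for sampling without replacement
    return se

# Hypothetical numbers, purely for illustration: sampling 200 establishments
# from a finite population of 500 versus from a population of 5 million.
print(se_of_mean(sample_sd=10.0, n=200, N=500))        # fpc shrinks the SE noticeably
print(se_of_mean(sample_sd=10.0, n=200, N=5_000_000))  # fpc is essentially 1
```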
So Davenport's explanation may be a good example of splitting data by size (n and/or N), with "small" being traditional and new, huge data sets being big data. But to generalize, I think you really need to say the "line" is drawn more by a distinction between smaller sets of data that may be more under a study's control and huge sets of data that may now be available but require more caution in interpretation. Naturally, these are overlapping, non-exclusive problems. Many challenges for traditional as well as big data may be the same, just weighted differently.
Semantics can pose a substantial problem: people will often miscommunicate about what they even mean by "big data."
The subject-matter application is key to whether we even face a traditional versus a "big" data scenario, and Davenport seems to have that in mind, but "product" in many energy and other applications may not involve big data at all.
--
It would seem the distinction is more related to how single-mindedly one must obtain data for a specific use (traditional), versus how one must sift through overwhelming data that may just be available ('big' data). The analytical needs are skewed accordingly.
--
Often one might obtain administrative data, better fitting the "big" data mode, that can be used as auxiliary (regressor) data for a traditional data collection; this demonstrates that statistical methodologies often overlap, and one must examine the practical implications of every application.
I often use the following diagram to show the differences between small and big data. I have another view to add: small data is like an elephant as perceived by a blind man, while big data is the elephant itself.
Thank you, Bin and James. I have derived great benefit from your contributions. I even think I am now better prepared to follow my webinar discussions about two hours hence.
You might also draw the line by adopting the perspective of available big data solutions and how they operate.
With really large data sets, you soon get into problems if your system architecture is based on vertical scaling only, the approach taken over the last decades. The idea behind big data solutions (big data = big data sets + data analysis) is to split the data set into pieces and distribute it across a cluster of commodity hardware (see e.g. HDFS, or NoSQL solutions like Cassandra, MongoDB, etc.). The analysis part then requires distributed computing algorithms (e.g. MapReduce, Mahout, Storm, Spark) that can deal with the partitioned and distributed data set in order to achieve high analysis performance even for big data sets.
But not every "traditional" analysis/mining algorithm can be parallelized, or it might be hard work to do so. So, for big data, there is a great need to have analysis algorithms working in a massive parallel fashion.
From a technical perspective, that's where you can draw the line too.
The size of what people mean by "big data" changes over time (and varies by discipline); 20 years ago it would have been a lot smaller than now. It might be whatever you can't store (or do simple analysis of) on a new desktop. Here is a quote from a White House report:
There are many definitions of "big data" which may differ depending on whether you are a computer scientist, a financial analyst, or an entrepreneur pitching an idea to a venture capitalist. Most definitions reflect the growing technological ability to capture, aggregate, and process an ever-greater volume, velocity, and variety of data.