I guess there is some confusion around the very meaning of Big Data. Just because a data set is large doesn't automatically make it “Big Data”. So how does one know if he/she actually has a Big Data problem?
Short answer: dimensionality, nonlinearity, and "confidence" that whatever was measured/observed actually produced the data it did, and that these data reflect some phenomenon or phenomena that are "real" (see #3 below for more detail).
Long answer:
1) Dimensionality:
Let's say I'm interested in whether temperatures show greater variability at particular ages. If I take the temperature of 6 million people, I have a lot of data points, but in 2-dimensional space. I can plot them and learn quite a bit without running a single statistical test. In fact, standard tests can be LESS accurate than my informal, visual scan, because I can "see" how much my plot approximates a linear function, whether there are potential outliers, etc. The fact that I took the temperatures of 6 million people helps a whole lot as well (provided I did a decent job balancing the ages of my sample).
What happens, though, if instead of taking people's temperatures I am interested in global surface temperatures over a single year? Let's assume that I have a thermometer every square meter across the globe. I still know that surface effects, ranging from anthropogenic changes in land morphology due to farming to the UHI effect, come into play. Also, while I can use a thermometer to measure the temperature of a person living in Canada and another in Egypt without worrying about location, thermometers measuring surface temperatures can't be treated like that (as if location didn't matter). Even with a simple model that only takes into account a few constants to correct for things like the UHI effect and relatively few variables (humidity, surrounding urban density, altitude, proximity to the poles or the tropics, distance above the ground, etc.), I am comparing thermometers along multiple axes of varying scales. Things get subtle in higher dimensions. In R2 or R3, it's pretty simple in most applications to judge when points are close to one another. But when I am comparing one thermometer to another as points in, say, 15-dimensional space, where one variable (temperature) can range from -1 degrees Celsius to 36 while another variable (altitude) can range from -100 meters to a few thousand, etc., small changes along enough axes are easily missed (they can't be easily plotted) and can render a number of statistical measures/methods useless.
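To make the scaling problem concrete, here's a toy sketch (entirely my own illustration; the three stations and their numbers are made up). The question "which thermometer is closest to which?" flips depending on whether you put the axes on a comparable scale first, and with 15+ axes you can't eyeball that the way you can in 2D:

```python
import numpy as np

# columns: temperature (deg C), altitude (m), humidity (%)
stations = np.array([
    [-1.0,  -100.0, 80.0],   # A: cold, below sea level, humid
    [36.0,  -100.0, 20.0],   # B: hot, below sea level, dry
    [-1.0,  2500.0, 80.0],   # C: cold, high altitude, humid
])

def dist(a, b):
    return np.linalg.norm(a - b)

# In raw units, altitude (thousands of meters) swamps every other axis,
# so A looks far closer to B than to C, even though A and B have opposite climates.
print(dist(stations[0], stations[1]), dist(stations[0], stations[2]))

# Standardize each column (zero mean, unit variance) and compare again:
# now A's nearest neighbour is C, not B. Same data, different answer.
z = (stations - stations.mean(axis=0)) / stations.std(axis=0)
print(dist(z[0], z[1]), dist(z[0], z[2]))
```

Neither answer is "the" right one; the point is that in high dimensions the choice of scaling quietly decides what "close" means, and no plot will warn you about it.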
2) Linear/nonlinear:
Lorenz practically founded what is commonly known as chaos theory by imagining the simplest atmosphere possible. However, even a single molecule in such an idealized model turned out to behave in a very complex way. Its dynamics are nonlinear: they are governed by sudden, qualitative changes like bifurcations, by unpredictable changes large and small, they are far from equilibrium, etc.
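Here's a minimal sketch of Lorenz's famous system (the standard sigma = 10, rho = 28, beta = 8/3 parameters; the crude integrator and step size are just my choices for illustration). Two trajectories that start a hundred-millionth apart end up in completely different states:

```python
import numpy as np

def lorenz_step(state, dt=0.001, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz equations (fine for illustration)."""
    x, y, z = state
    return state + dt * np.array([sigma * (y - x),
                                  x * (rho - z) - y,
                                  x * y - beta * z])

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-8, 0.0, 0.0])   # perturb one coordinate by 10^-8

for step in range(40001):            # ~40 "time units" at dt = 0.001
    if step % 10000 == 0:
        print(f"t = {step / 1000:4.0f}   separation = {np.linalg.norm(a - b):.3e}")
    a, b = lorenz_step(a), lorenz_step(b)
```

No amount of extra data fixes this: the sensitivity is a property of the dynamics themselves, which is why "nonlinear" matters at least as much as "big".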
Dimensionality crops up here too. Imagine a pile of rocks that a wave carries and deposits on the shore vs. the sand it carries, creating a sandpile onshore. Knowing information about many dimensions can give you a pretty good approximation of the final configuration of the rocks (initial conditions, from the positions of the rocks to the geometry of the ocean floor). But is the difference between knowing the final configuration of the sandpile vs. the rock pile just a matter of there being more sand grains than rocks? No. For the rocks, the problem is already hard enough because there are so many dimensions one needs to account for, and the trajectories will not be linear. But you can make it simpler and reduce the dimensionality with some safe assumptions: grouping the rocks into size & weight ranges instead of keeping a separate value of each for each rock (which combines two variables into one and reduces the set's cardinality), using more approximate units of position, averaging the forces of currents and reducing their scale or the effect they will have, etc. Sand, however, makes all of this impossible. Small changes in currents result in a completely different configuration. Sand will do little to impede the trajectory of a rock propelled by a water wave, but the reverse isn't true, so debris has to be taken into account. And so on.
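That "safe simplification" move for the rocks can be made concrete with a toy sketch (the cut-offs and variable names are mine, purely for illustration): collapse each rock's exact mass and diameter into a single coarse size/weight class, which is exactly the kind of dimensionality reduction the sand won't tolerate:

```python
import numpy as np

rng = np.random.default_rng(0)
mass_kg = rng.uniform(0.5, 50.0, size=1000)       # exact mass of each rock
diameter_cm = rng.uniform(5.0, 60.0, size=1000)   # exact diameter of each rock

# Combine two continuous variables into one ordinal class (0 = small,
# 1 = medium, 2 = large), shrinking a continuum of states to three bins.
score = mass_kg / 50.0 + diameter_cm / 60.0       # crude combined scale in (0, 2)
size_class = np.digitize(score, bins=[0.7, 1.4])

print(np.bincount(size_class))   # how many rocks land in each class
```

For rocks, that coarse-graining barely costs you anything; for sand, throwing away the fine detail is precisely throwing away what determines the outcome.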
3) Validity:
Finally, there's the question of what your data consist of and how confident you are that you are using valid measurements of something that is a valid phenomenon. It is very difficult to model living systems. Single cells have too many strongly interacting parts. But at least you can look at the cell and at your model and see how close they are (most of the time and to a certain extent, anyway). What if I'm interested in the effect of religiosity & political orientation on intelligence and mental health? With the exception of neurological disorders, all mental disorders are based on classifications of symptoms, and intelligence on certain constructs that are supposed to be measures of whatever intelligence is. I can have a small population and a good but small sample, but how do I know I am measuring what I think I am? I can't x-ray religiosity. I can't use an fMRI scan to determine intelligence. I can't even use neuroimaging to diagnose mental disorders. So as complicated as modelling, e.g., the weather can be, if the model predicts rain tomorrow and it doesn't rain, then my model was wrong. Most importantly, I know that.