Should we be collecting data more intelligently rather than collecting all sorts of garbage and then applying a "statistical hammer" to break it into smaller pieces? Why not use a more intelligent statistical design to collect the data in the first place?
This topic is exceedingly important, in my opinion, as I have observed great divergence from rigorous statistical experimental (study) design as data have become "cheap" and more easily analyzed with a "calculation hammer". I use "calculation" instead of "statistical" because, without a probabilistic basis through sampling and design, statistical inference is in doubt.
Generally speaking, the huge advances in computer science, allowing nearly boundless analysis methods, have to a degree rendered academic statistics departments irrelevant, with new jargon emerging such as "Data Scientist" and "Data Engineer". Often these new "calculators" are expert in algorithms but poorly trained in experimental design and sampling theory, and they lack appreciation of the intimate link between experimental design and its role as the foundation for subsequent statistical analysis and inference.
Large sample sizes, and more critically, large numbers of variables with small sample sizes, do not eliminate these issues.
I agree with your viewpoint that statistical methods based on probability theory are in doubt when they are used to analyze data that were likely collected without a well-thought-out statistical design. However, there is useful information hidden within observational studies, and the problem becomes how to extract such information from such haphazardly collected data using rigorous probabilistic methods.
Large sample sizes are generally good for statistical analyses, and given that nowadays collecting data is getting cheaper and easier, it is tempting to believe that typical large-sample theory would apply. But the assumptions under which such large-sample methods work (e.g., some form of repeatability or ergodic set-up) may not always hold for such cheaply collected data. Maybe one should use the observational studies to create a statistical design for collecting similar future data...
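To make that point concrete, here is a minimal sketch (my own illustration, assuming a Python environment with NumPy; the selection mechanism is invented) of how a non-probability sample misleads: the selection bias does not shrink as n grows, while the standard error does, so the wrong answer simply looks more precise.

```python
# Sketch: with a self-selected ("cheap") sample, a larger n does not remove
# the bias; it only shrinks the standard error around the wrong value.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)  # true mean = 50

# Hypothetical selection mechanism: units with larger values are more likely
# to be observed (e.g., more active users generate more records).
weights = np.exp(population / 20.0)
weights /= weights.sum()

for n in (100, 10_000, 500_000):
    biased_sample = rng.choice(population, size=n, replace=True, p=weights)
    mean = biased_sample.mean()
    se = biased_sample.std(ddof=1) / np.sqrt(n)
    print(f"n={n:>7}: estimate = {mean:6.2f} +/- {1.96 * se:4.2f}  (true mean = 50.00)")
```

With this selection mechanism the estimate sits above the true mean of 50 at every sample size, with ever-narrower intervals around the wrong value.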
You seem to have read my mind... Much of my consulting work involves "reverse engineering" the approximate study design that would have generated available observational data, with an eye toward differentiating the rigorous from the more tenuous inferences, particularly for inferring causation.
Well, the main problem with "collecting data intelligently" is that you do not have control over the data sources. For example, take one of the biggest sources of data today: social media.
Hence the "statistical hammer".
The same goes for many other sources.
As for structured data sources, there are already systems that collect data intelligently; that is what traditional RDBMSs have been doing for years. No need for big data for that.
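As a small aside, here is a minimal sketch (using Python's built-in sqlite3; the table and columns are hypothetical) of the kind of write-time discipline a traditional RDBMS enforces, rejecting malformed records at collection time rather than cleaning them up afterwards.

```python
# Sketch: schema constraints act as "intelligent collection" by refusing
# records that violate the design, instead of repairing them later.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE measurement (
        subject_id   INTEGER NOT NULL,
        visit_date   TEXT    NOT NULL,
        systolic_bp  INTEGER NOT NULL CHECK (systolic_bp BETWEEN 60 AND 260),
        PRIMARY KEY (subject_id, visit_date)
    )
""")

conn.execute("INSERT INTO measurement VALUES (1, '2015-03-01', 128)")    # accepted
try:
    conn.execute("INSERT INTO measurement VALUES (2, '2015-03-01', -5)")  # rejected
except sqlite3.IntegrityError as err:
    print("rejected at collection time:", err)
```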
There is no doubt that there are major differences between small data and big data. In the following papers, I have discussed some of them, e.g., Gaussian statistics for small data and power-law statistics for big data.
Jiang B. (2015), Geospatial analysis requires a different way of thinking: The problem of spatial heterogeneity, GeoJournal, 80(1), 1-13.
Jiang B. and Miao Y. (2014), The evolution of natural cities from the perspective of location-based social media, The Professional Geographer, xx(xx), xx-xx, DOI: 10.1080/00330124.2014.968886, Preprint: http://arxiv.org/abs/1401.6756
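For a quick feel for that contrast, here is a minimal sketch (my own illustration with NumPy, not code from the papers above) comparing a Gaussian sample with a heavy-tailed, power-law-like one: the mean and standard deviation summarize the former well, while in the latter most values lie far below the mean and a few extremes dominate it.

```python
# Sketch: Gaussian versus heavy-tailed (power-law-like) samples, and why the
# mean is a good summary for one but not the other.
import numpy as np

rng = np.random.default_rng(1)
gaussian = rng.normal(loc=100.0, scale=15.0, size=100_000)
power_law = (rng.pareto(a=1.5, size=100_000) + 1.0) * 10.0  # Pareto tail, alpha = 1.5

for name, x in (("Gaussian", gaussian), ("Power law", power_law)):
    share_below_mean = (x < x.mean()).mean()
    print(f"{name:>9}: mean={x.mean():8.1f}  median={np.median(x):6.1f}  "
          f"max={x.max():10.1f}  below mean={share_below_mean:.0%}")
```

In the Gaussian case roughly half of the values fall below the mean; in the power-law case the share is much larger, which is the heavy-tailed pattern the remark on power-law statistics refers to.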