I am in the process of writing the introduction to an article that proposes changing the way a particular type of data is stored, giving it more flexibility for certain tasks.

One argument I have seen from reviewers in the past is that they do not see the need to keep additional data that no one is using right now (the reasoning being that humans only make distinctions to a certain level of detail, so models that can identify thousands of differences, while still preserving the information that covers what is currently established, are unnecessary).  I would like to shed light on why it is important to stay ahead of the data curve, so that unforeseen problems arising in an area can be tackled when they appear.  I feel like an article addressing this topic exists somewhere, but I cannot find the right search terms to surface it.

For context, I am not changing the amount of data that is stored, but rather the way in which it is stored.  Rather than maintaining just binary set intersections between an object and the partition tiles of a space, I maintain a topological relationship between the object and the partition tiles, providing information about boundary contact and exterior interactions.  The boundary information in particular drastically expands the granularity with which one object can be discerned from another.
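
To make the contrast concrete, here is a minimal Python sketch (all names hypothetical, not my actual implementation) of the two storage schemes: a binary intersection flag per tile versus a per-tile topological relation that also records boundary contact.

```python
from enum import Enum

class TileRelation(Enum):
    """Topological relation between an object and one partition tile."""
    DISJOINT = 0   # tile lies entirely in the object's exterior
    BOUNDARY = 1   # tile meets the object's boundary
    INTERIOR = 2   # tile lies entirely in the object's interior

def binary_index(intersecting_tiles, all_tiles):
    """Current scheme: one bit per tile -- does the object intersect it?"""
    return {tile: (tile in intersecting_tiles) for tile in all_tiles}

def topological_index(interior_tiles, boundary_tiles, all_tiles):
    """Proposed scheme: a relation per tile, so boundary contact is preserved."""
    index = {}
    for tile in all_tiles:
        if tile in boundary_tiles:
            index[tile] = TileRelation.BOUNDARY
        elif tile in interior_tiles:
            index[tile] = TileRelation.INTERIOR
        else:
            index[tile] = TileRelation.DISJOINT
    return index

# Example: tiles are labelled by integer ids; the object's interior covers
# tiles {4, 5} and its boundary passes through tiles {1, 2, 3}.
all_tiles = range(8)
binary = binary_index({1, 2, 3, 4, 5}, all_tiles)
topo = topological_index({4, 5}, {1, 2, 3}, all_tiles)

# The binary index cannot tell tile 2 (boundary contact) apart from tile 4
# (fully interior); the topological index can.
print(binary[2], binary[4])   # True True
print(topo[2], topo[4])       # TileRelation.BOUNDARY TileRelation.INTERIOR
```

The total number of tiles recorded is the same in both schemes; only the per-tile value is richer, which is what lets finer distinctions be made later without re-deriving anything.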

If anyone knows of relevant literature that argues for the benefits of keeping extra data beyond what is currently used, I would be much obliged.
