This is a weird question on a taboo topic.
I have been discussing some issues about data manipulation with colleagues. Some of them believe that considerable manipulation is done in statistical description of experimental data when unethical researchers want to "prove" their point with statistical analysis. This is made easier by the traditional practice of not publishing raw data behind statistical tests and data descriptions.
However, asking to see the raw data is often prized as the ultimate test of veracity. My friends insist that there must be a simple tool, even in Excel, for generating random numbers that would fit any given (plausible) description of mean +/- SD within an interval, which could be used to bypass such proof tests by giving a false superficial impression of data veracity. I personally think that such generated random numbers would not fit statistical tests perfectly, especially if the index values given were also manipulated. This comparison sounds like an interesting way of using the same tool to double-check statistical data, and it seems like it could be automated and even applied to randomly chosen published literature as a (very controversial yet interesting) scan test.
This is an awkward idea that crossed my mind, and I got curious. I could not find any discussion on this and I find this relevant.
Maybe others here would know more about this?
Let me rephrase what I think you are asking for. Given the parameters of a probability distribution, you seek to generate a data set for which the empirical estimates of the parameters are identical to the specified population parameters. That should be relatively easy for most standard distributions.
For example, if you specify a Gaussian distribution with mean and SD and seek to generate a data set with N observations, you have (N-2) degrees of freedom in your data. That is, you could generate N-2 observations by randomly drawing from the distribution and analytically determine the remaining two observations, such that the mean and SD come out as you have specified them.
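A minimal sketch of this in R (the sample size and the target values below are arbitrary; if the two remaining observations would have to be complex, the sketch simply redraws):
exact_mean_sd <- function(n, target.mean, target.sd) {
  repeat {
    x <- rnorm(n - 2, target.mean, target.sd)              # the N-2 "free" observations
    p <- n * target.mean - sum(x) - 2 * target.mean        # required sum of the two remaining deviations
    Q <- (n - 1) * target.sd^2 - sum((x - target.mean)^2)  # required sum of their squared deviations
    disc <- Q / 2 - p^2 / 4
    if (disc >= 0) {                                       # otherwise no real solution: redraw
      return(c(x, target.mean + p / 2 + sqrt(disc),
                  target.mean + p / 2 - sqrt(disc)))
    }
  }
}
y <- exact_mean_sd(50, 10, 2)
c(mean(y), sd(y))   # exactly 10 and 2 (up to floating point)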
Thus, it is easy to see that publishing the raw data is not good enough to ensure veracity. I think the real strength of having access to raw data is the increased chance of understanding and reproducing the results and how they came about after the data were collected. Generally, it is easier to judge how much voluntary or involuntary nudging went into the choice of preprocessing and methods.
The gold standard for ensuring veracity remains independent replicability.
I agree with Andreas. Like anyone with some knowledge of statistical theory, I could generate a random sample from any distribution I want. I only need the distribution to be completely specified.
In most cases I could do this job in Excel, because it has the inverse distribution function of several statistical distributions such as the Normal, Chi-square, t, Snedecor's F, lognormal and Uniform, and for many others, like the discrete distributions, I can compute the distribution function for most of the relevant values.
If you have the distribution function, you generate a pseudo-random sample from the uniform(0,1) distribution and then, for each value, look up the corresponding quantile in the distribution function. After doing this n times you have a sample of size n from the required distribution.
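For instance, a small sketch of this inverse-CDF trick in R (the normal distribution and the parameters below are arbitrary):
u <- runif(1000)                   # pseudo-random uniform(0,1) values
x <- qnorm(u, mean = 10, sd = 2)   # look up the corresponding quantiles
# this is equivalent to rnorm(1000, 10, 2); in recent Excel versions one would
# put =NORM.INV(RAND(), 10, 2) in each cell
hist(x)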
I usually do this when preparing example data for my statistics students.
A sample of moderate size, so obtained, can pass any test proposed to detect its lack of fit. Excel has some problems with its pseudo-random number generation that may produce detectable defects when the sample is large, but other software, such as SAS, can pass any test of randomness because of the good quality of its pseudo-random generating routines.
Off-topic: "I usually do this when preparing example data for my statistics students." --- Guillermo, why don't you take real data?
Dear Jochen
Thank you for asking.
I use real data when I can get it.
I am an agronomic engineer with a Master's in Biometry, and I have extensive experience in agricultural research statistics because I worked as a statistician for INTA (Instituto Nacional de Tecnología Agropecuaria) from 1984 to 1994, but I have worked on other matters since 1994.
I usually work on data from Education, Opinion Polls, Audits, Public Employment Statistics and Health, but I teach statistics for the Food Engineering and Agricultural Engineering degrees of my University.
So I usually work in areas of knowledge different from those for which I prepare my classes. On some occasions I have real data but I am not its owner; when I can get the owner's authorization, I prefer to use real data.
So there are groups at your university generating data - and publishing the results. You should urge these investigators to make their data available after publication for your courses. I mean... it is a university, so it's a public matter and it's (hopefully, mostly) paid for by the public, and after the results are published - what reason could there be to hide the data? Also make clear that the labs will profit, because they get students trained on *their* specific kind of data, with its typical problems and properties... it should be a win-win(-win) situation.
I don't know what's going on here :D. But I recommend using the Fleishman method to generate any type of data with different distributions.
http://www.diva-portal.org/smash/get/diva2:407995/FULLTEXT01.pdf
Dear Jochen
I agree that what you said is a win-win situation.
But for my examples, for an exam or a practice session, I sometimes need data that shows particular characteristics which may be difficult to obtain in the available time, so the use of simulated data is a good option too. Other times I find an article in the literature that is very interesting for illustrating some theoretical point, but the raw data are not available, so I may simulate data compatible with the available information in order to study the statistical problem with the class.
OK fellas, since it is so easy to reconstruct any kind of given raw data, I am presenting here a data set already posted in another topic, and I am waiting for you to reproduce all its properties by any suitable and efficient method at hand.
Thanks to everybody who will deal with it.
PS Format: decimal symbol is the dot.
Demetris, I repeat myself: looking at data without any "scientific background" in mind (the purpose for which the data were obtained, under what conditions, to answer what kind of question, in what context...) is a waste of time. I'd like to refer to information theory: data can be the medium to transport information, but the kind and amount of information depends on the context of sender and receiver (i.e. the experiment/study design and the "state of knowledge" of the scientific community). My 2 pence. I would be happy to be convinced that this point of view is faulty.
Dear Demetris,
Many thanks for stepping forward and adding a practical example. I am not sure if this is what you wanted to test, but I have inputted the values into an online Descriptive Statistics Calculator. The results are below:
Minimum: -160.4551832
Maximum: 9390.108698
Range: 9550.5638812
Count: 2000
Sum: 21508.859269371
Mean: 10.75
Median: 5.008
Standard Deviation: 211.3
Variance: 44650
Mid Range: 4614.8267574
Quartiles:
Q1 --> 4.0161
Q2 --> 5.0075
Q3 --> 6.0397
Interquartile Range (IQR): 2.0236
Sum of Squares: 89260000
Mean Absolute Deviation: 13.25
Root Mean Square (RMS): 211.5
Std Error of Mean: 4.725
Skewness: 43.77
Kurtosis: 1942
Coefficient of Variation: 19.65
Relative Standard Deviation: 1965%
Frequency of every value = 1
It would indeed be interesting to test whether one could generate another random set of numbers fitting most of these sample descriptive parameters.
I certainly agree that scientific knowledge should be based on critical and technical thinking, with numbers merely illustrating the conclusions reached, and that the gold standard will always remain independently repeating the results for confirmation.
Yet in my field this is far from reality. In the biological and biomedical sciences, conclusions in papers are drawn from statistical analyses illustrated by very simple representations of the obtained results (even in the most "trusted" periodicals), and repeating experiments under exactly the same conditions is very rare, quite often impossible in practical terms. Nowadays, on PubPeer, papers are questioned post-publication and exposed based on obvious signs of image manipulation, particularly in blots from cancer research papers. This has revealed much about the true evidence behind these papers and made it evident that the traditional published, peer-reviewed, impact-factor standards are not indicative of reliability in the scientific literature. Basic statistical descriptions and tests on the obtained data are seldom challenged, and when they are, any big set of numbers produced by the authors will generally silence the questioners. Most believe descriptive statistics are a reliable fingerprint of the original data, and I think this naive notion must be challenged as well; thus I appreciate any ideas and tools for demonstration here.
Dear Eduardo, so can you re-generate the data in order to have values close to the above measures? Or do you have any idea about the distribution originally used?
Thanks.
Dear Demetris
What I said previously requires that the distribution be completely specified. That is lacking in your example.
Let's see an example of what I said: an article presents the sample means and standard deviation estimates of k treatments after transforming the variable by taking the natural logarithm, plus the ANOVA table for a randomized complete block design. It does not present the raw data, but it may be assumed that the data follow a lognormal distribution, because the variable was transformed with ln() before the analysis and the assumptions of the ANOVA were met. Then we have all the information needed to build a plausible set of data that exactly reproduces the ANOVA and the estimates of the parameters of the k treatment populations.
For a sample like yours we may compute the sample moments (at a minimum the mean, variance, skewness and kurtosis), then study which family of statistical distributions is most compatible with the moments found, then estimate the parameters of the selected distribution via maximum likelihood, then verify that the distribution fits well, and finally, if we are satisfied, use these estimates to compute the distribution function and generate a simulated sample as described at the beginning. This last procedure may become a long one.
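A rough sketch of that workflow in R, assuming the candidate family turned out to be lognormal (the data below are only a stand-in for the published raw values):
library(MASS)                                  # for fitdistr()
x <- rlnorm(500, meanlog = 1, sdlog = 0.8)     # stand-in for the raw data one wants to imitate
fit <- fitdistr(x, "lognormal")                # maximum-likelihood estimation
ks.test(x, "plnorm", fit$estimate["meanlog"], fit$estimate["sdlog"])        # crude adequacy check
x.sim <- rlnorm(length(x), fit$estimate["meanlog"], fit$estimate["sdlog"])  # the simulated look-alike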
Dear Guillermo, I agree with you, and that was my reason for posting it: to show how difficult such a "reproduction from nothing" task could be. It would be interesting, I think, to find a minimum-time technique for doing it...
Dear Demetris,
"can you re-generate the data in order to have values close to the above measures?"
Sorry, I cannot and I do not know if that can be done, and this is the core of my original question. Can you please enlighten me on this?
My colleagues feel that it could be done with simple procedures, like the suggested regressions in Excel, and that this could be used to double-check statistical data, as not every phoney statistical description would fit an original set of results.
Concerning the original distribution that you and Guillermo mentioned, I inputted the data by copy/pasting from your file. I hope this is what was meant? I think for descriptive statistics this would not affect the end result, would it?
Guillermo, I am sorry, but I can only superficially understand the general idea of your last comment. I guess it would indeed be interesting to add another dataset, or divide the present one into two populations for performing some ANOVA test, then re-generate a set of random data fitting the same sample moments (I just learnt this term from you) and see how different the ANOVA test results would come out.
@Eduardo: you write "and repeating experiments under exactly the same conditions is very rare, quite often impossible in practical terms." I find this an important point to discuss. I agree that replication is rarely done, and the whole scientific system discourages thorough replication of things considered "already known" or at least already published. It's not new, it's not sexy, so who will fund it? That is one of the major structural problems of this system. However, I have to admit that I cannot offer any promising solution :(
But what strikes me more is the part "under exactly the same conditions". This is a major point. When a result depends so strongly on the precise conditions, leading to different conclusions if the conditions were (slightly?) different, how valuable is such a result? For instance, if I measure the influence of CO2 on plant growth under lab conditions where the soil pH is exactly controlled at 6.5 and get results that would tell me a different story if the pH were 7.0... what should I conclude from such results? (Just a silly example to clarify my point.)
The more sensitive the results and conclusions are to the specific (controlled) conditions, the more specific the conclusions should be. This is valuable for getting ideas about how things can be related, but one needs to point out clearly that a generalization, even to slightly different conditions, may render much of the observed effects irrelevant.
Results should be robust, i.e. replication should be simple, when a generalization is sought.
@Demetris: here is some random data with the same distribution as your data. I generated 20000 values. Is this what you wanted?
Dear Jochen,
Yes, I understand your view, however I will not comment much as this should be added as a separate topic.
Just a quick note: in biology I deal with living beings, which are complex, unpredictable experimental subjects, and complex biochemistry is just one thing behind that. Unfortunately the scientific method relies on reproducibility of results, yet in biology one can never be sure results will come out the same (actually they never will, but there should be some similarity) every time and everywhere, even under the same conditions. This is dramatically illustrated by the extreme variability in the responses of cloned mice with well-known genes to the same trial done at the same time under the same conditions, and with a low number of repetitions (limited by price, time, manpower, bioethics). This last example is the most common justification given by cancer researchers for data manipulation (removal of "weird results") and irreproducibility. If one ponders this too much, one becomes more afraid of medicines than of diseases and loses all faith in biomedical research; yet this is crude reality and cannot be hidden by data manipulation and beautiful oratory, as is done today. I would like people to question data and start accepting that, more often than not, controversial data are all there ever was. OK, not such a quick note.
Thanks for your response, Eduardo. You are right that this should be another topic, but since we are here... I think a big source of the problem is our understanding of the produced results. Especially in biomedical research (where I am working, btw!), scientists too often look for simple yes/no answers and rely on p-values, painting black-and-white pictures of reality (significant or not). They too often ignore that generalization is trustworthy only within the range of conditions under which the data were obtained, and that if the variation of conditions is kept within tight bounds, generalization beyond these boundaries is risky. It would be a big step forward if the limits of generalization were better communicated, and if results were not presented as "yes or no" but with a sceptical estimate of the uncertainties about the findings (and, again, p-values are often mistakenly thought to do this job!).
(and, yes, one has to be in a very good state of health to take the risk going to a doctor ;) )
Dear Jochen,
Regarding the dataset you sent, please note that the original N=2,000 while yours is N=20,000. I cannot compute all those statistics now for lack of memory; could you make another one with the same size of N=2,000? And please, could you explain in detail how you did it? I am eager to compare the results.
On reproducibility, I will not prolong the discussion, sorry. I really do think there is a lot of masking of the true nature of results to make data fit what the scientific method expects in biology, and that even the findings within tight boundaries are quite often not quite as presented. On the present topic, I think numbers could also be used more often to illustrate how the scientific literature distorts our vision of natural phenomena to please funding agencies and publishers.
@Jochen, the number of sample data points is crucial in statistics; please give an N=2000 sample. Thanks.
Demetris, this is a random sample of a distribution as given by your data. Simply randomly select N=2000 of these values and you have what you want...
These are random numbers, following the empirical distribution of the 2000 values you provided. Any one of them is a random number from this distribution. It really doesn't matter which or how many of them you like to consider. If 20000 were not enough, I could generate more, but if you wish to have fewer values, just take fewer values. I do not see the problem.
@Eduardo: I did write that I generated 20000 values. As with Demetris, you are free to make a random subselection of whatever size you like.
Sorry Jochen,
Maybe I did not express myself well. My original question is ultimately whether descriptive statistics and statistical test results can be used as a reliable fingerprint of the original data, or whether they would fit an infinitude of random numbers that could be generated.
I have now computed the same descriptive statistics for a subsample of the first 2,000 numbers, and naturally the end results are quite different from the previous sample, even though the original would give the same results (I cannot compute those on this computer). I would be searching for another sample with exactly the same parameters.
Dear Demetris,
Do you actually know whether this can be done in every case, and why?
Thanks all, this is an interesting discussion.
Sorry, Eduardo, so I misunderstood you. I can give you the statistics of the entire 20000 values:
min -160.400
Q1 4.001
Q2 5.000
Q3 8.139
max 9212.000
mean 6.005
sd 130.6406
MAD to the median 6.902573
MAD to the mean 8.644351
skewness 52.98082
kurtosis 3029.28
I think that all these measures are well within the margin of error. To explicitly check this, one would need to bootstrap the confidence intervals for these statistics. I did some quick checks and found that the sampling distributions for N=2000 are quite pathological here. So even confidence intervals for the statistics may not be very instructive, and "highest density sampling distribution intervals" should be used instead. But I have no means to get such intervals in a short time.
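As a rough illustration, a percentile bootstrap interval for one of these statistics could be obtained in R like this (the skewness formula and the placeholder data are arbitrary choices for the sketch):
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3     # simple moment-based skewness
x <- rlnorm(2000, meanlog = 1, sdlog = 1)               # placeholder for the 2000 posted values
b <- replicate(5000, skew(sample(x, replace = TRUE)))   # bootstrap resamples
quantile(b, c(0.025, 0.975))                            # 95% percentile interval for the skewness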
Thanks, Jochen, for posting the stats on the whole dataset. It actually shows different parameters. In a typical situation in my field, one would present results from it as mean +/- SD or SE, followed by the min-max interval and the result of an ANOVA test against another sample. When challenged for raw data, if the researcher produced a fake dataset generated by your method (which I still do not know), it would yield parameters quite different from the published ones, and one would have to come up with a good explanation.
Would this mean that descriptive statistics are indeed a good fingerprint of an original set of raw results? If so, isn't there any software capable of reconstructing the original data based on descriptive statistics? Actually, I would reason that there is a limited number of parameters that would fully specify the original dataset, and thus a minimum amount of stats that can be used to investigate datasets. Would anyone know about this?
Eduardo, you must consider the uncertainties. Surely any sample will give different results, so it is no big deal to find that different samples give different results. The important question is: is the difference really unexpectedly large?
Also please note that statistics like the mean, SD and SE only have a particular meaning in certain circumstances (just because something can be calculated does not mean that the result is meaningful or sensible). Demetris' data is one example where all these statistics are particularly nonsensical.
A statistic is always a *summary* of (some properties of) data. Summarizing necessarily means a loss of information (which is intended, btw; the aim is to reduce the confusing mass of information so that the remaining part might tell us something we can grasp and understand). After the information is lost, there is no way to reconstruct it. This would be like reconstructing the burned piece of oak wood from the ash, water, and CO2 that were left by the fire.
To answer your last question: each statistic/summary will only contain a part of the information in the data. So one can build a collection of statistics that, taken together, will contain all the information. The simplest way to construct such a collection - which is, btw, the smallest possible one! - is to take the original values (i.e. the quantiles at 1/n, 2/n, ... up to n/n). You see, this is simply the data itself. There is no more efficient way to report all the information without reporting all the data.
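A three-line check of this in R (any toy data set will do):
x <- rlnorm(50)                                                               # any data set
q <- quantile(x, probs = (1:length(x)) / length(x), type = 1, names = FALSE)  # quantiles at 1/n ... n/n
all.equal(sort(x), q)                                                         # TRUE: these quantiles are just the sorted data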
I do not see the point in "data reconstruction". Today it should not be a problem to submit all the original data that were used in a publication. The journals would just have to demand it and provide some server space. Bigger datasets (where server space may be a limiting factor), such as microarray and NGS data, are usually published in specialized public repositories.
I wish that journals would demand submission of the complete data. Having these data, one could analyse whether there was some manipulation (at least in principle). But I see the main advantage in the fact that the data might be re-used, re-analyzed, meta-analyzed and so on. If we have reasons to distrust scientists, science is actually dead (and has become just a business, at best). Science requires faith: faith in the understanding and expertise of the scientists and in the correct application of methods (for generating and for analysing data). We have to create an environment to make such faith possible (e.g. by independent funding; also, careers should not depend on "good results" but rather on "good science" [but who will judge this?]. A rather harsh option would be to decouple career and science - why should a scientist make a career anyway? There is much to think about and I do not have good solutions! I am just mentioning the problems! As soon as science and money get entangled, science will sooner or later degrade to a business and work like a business - effective in creating anticipated results that are not very trustworthy, but ineffective at being creative and thorough).
I appreciate these points of view, yet they stray from the original question, unless the assertion "Surely any sample will give different results" is essentially true.
I will not discuss directly on the current dilemmas of scientific publication and investigation as these would belong in another topic of discussion, and I think there are whole forums dedicated to that. That is the drawback of discussing one specific aspect inside a polemic context.
I am curious to know whether there can exist, and whether one can then generate, one or many datasets with exactly the same statistical description (not necessarily all of the statistics) as the one provided by Demetris, as an example. This is very specific.
Depending on the kind and number of given statistics, this can be a tough job, but I would say: in principle yes, this will be possible.
Dear Eduardo
I will try to clarify my last comment with an example. I looked for a data set on the web and found an example of a randomized block design in an FAO manual: http://www.fao.org/docrep/003/x6831e/x6831e07.htm.
Suppose that a paper presents the results highlighted in red in sheet1 of the attached Excel file (ejemplo1.xls). I put the real data in sheet2.
Can we simulate data compatible with this results?
Yes we can.
The original data form a matrix of 8 rows and 3 columns, because there are 8 treatments and 3 blocks. I will simulate data compatible with the ANOVA table and the means of the 8 treatments.
1) I generate a pseudo-random sample of 24 values (in cells B22:D29) using an Excel function for the inverse of the standard normal distribution.
2) I compute the mean of each row and column of this random table. I adjust each random value by subtracting the row and column means and adding the general mean (in cells B33:D40).
3) I guess values for the mean of each block consistent with the SSBlock reported in the ANOVA table. I decided to give the 2nd block the general mean of the treatment means, and to put the mean d units lower in the 1st block and d units higher in the 3rd. Then, using the SSBlock, I computed d and obtained the 3 block means.
4) I compute the variance of all the random data from step 2).
5) I divide each value from step 2) by the square root of that variance [step 4)] and multiply it by the square root of the MSE of the ANOVA table, adjusted by the degrees of freedom (in cells B47:D54).
6) I build the new data by adding to the randomly generated values the row (treatment) and column (block) means and subtracting the general mean.
7) I rounded the new data to two decimals (blue data).
Voila!
You may compare these data with the real ones in sheet2.
A little below I put the new ANOVA table in blue.
I hope this helps.
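A compact sketch of the same trick in R for a randomized complete block design (the treatment means, block effects and error mean square below are made up for illustration; in practice they would be the published values):
trt.means <- c(20.5, 18.2, 22.1, 19.7, 21.3, 17.9, 23.0, 20.0)  # assumed published treatment means
blk.eff   <- c(-0.6, 0.1, 0.5)                                   # assumed block effects (sum to zero)
mse       <- 1.7                                                 # assumed published error mean square
nt <- length(trt.means); nb <- length(blk.eff)
set.seed(1)
e <- matrix(rnorm(nt * nb), nt, nb)                              # raw noise
e <- e - rowMeans(e) - rep(colMeans(e), each = nt) + mean(e)     # double-centre: noise rows and columns sum to 0
e <- e * sqrt(mse * (nt - 1) * (nb - 1) / sum(e^2))              # residual SS becomes (nt-1)(nb-1)*MSE exactly
y <- outer(trt.means, blk.eff, "+") + e                          # treatment mean + block effect + noise
dat <- data.frame(y   = as.vector(y),
                  trt = factor(rep(seq_len(nt), times = nb)),
                  blk = factor(rep(seq_len(nb), each = nt)))
anova(lm(y ~ trt + blk, data = dat))                             # the error mean square is reproduced
rowMeans(y)                                                      # equals trt.means exactly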
Dear Guillemo,
Thanks for the detailed explanation of a complex procedure; I will check it in a couple of days when I get time.
Still, I am very curious about the "fingerprint" of Demetris' data and whether it could be reproduced with another dataset generated by regression procedures.
For any distribution, we can generate a series of numbers which, as a whole, approximate that distribution. See the Metropolis algorithm:
http://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm
But, if you have only some "parameters" (as you said, quartiles, mean, etc.) but not the complete description (formula) of the distribution, who knows!
I think you could choose the missing parameters and get the formula, or apply Metropolis many times and then by approximation (trial and error) get a series of values close enough to your description.
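For what it's worth, a bare-bones random-walk Metropolis sketch in R (the unnormalised target density f below is just an arbitrary example):
f <- function(x) exp(-(x - 2)^2 / 2) + 0.5 * exp(-(x + 3)^2 / 0.5)   # any unnormalised density
n <- 10000
x <- numeric(n)                                   # x[1] = 0 is the starting value
set.seed(42)
for (i in 2:n) {
  prop <- x[i - 1] + rnorm(1)                     # symmetric random-walk proposal
  x[i] <- if (runif(1) < f(prop) / f(x[i - 1])) prop else x[i - 1]   # accept with prob min(1, ratio)
}
hist(x, breaks = 60)                              # histogram approximates the shape of f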
@Jochen, I didn't mean doing bootstrapping [for example sample(x, 20000, replace=TRUE)] from the data given. I meant what Guillermo Ramos correctly understood: finding the underlying distribution that generated the data and then taking a sample from it. It is a difficult task, and I respect the lack of time that every one of us is facing, but that was the motivation.
@Eduardo, you have to read a little about bootstrapping (i.e. resampling with replacement: each time you randomly take one value from the given data and then "put it back into the sample", so next time it is possible to choose it again). In general we can take any data set and resample it, but that is not what I meant for the data I uploaded.
The challenge is to find a minimal-time procedure for identifying the hidden distribution behind any given data set.
Demetris, I used the "statistics" (i.e. your series of 2000 quantiles) you provided to generate the sample. It is not a "bootstrap sample" of your data (as you can easily verify). If I should use only a few specific statistics, then you should provide those statistics instead of the data. I think there is still some disagreement about what a statistic is. Quantiles (extremes, median, quartiles, percentiles, ...) are statistics, and the more of them are given, the better one can simulate a distribution with similar quantiles. So I still think that I did exactly what you requested, just in a different way than you expected.
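For illustration, one way to generate such values in R without simply resampling the data (a sketch with placeholder data, not necessarily the exact procedure used here) is to interpolate the empirical quantile function at uniform random probabilities:
x <- rlnorm(2000, 1, 1)                                            # placeholder for the 2000 posted values
new <- quantile(x, probs = runif(20000), type = 7, names = FALSE)  # interpolated empirical quantiles
mean(new %in% x)                                                   # small: mostly new values, not a bootstrap sample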
So, Jochen, what is the underlying distribution of the data, i.e. which distribution generated that data set?
Hello all. Eduardo asks for a way to generate random/"fake" data from a predesigned statistical-mathematical model. This is possible, and this kind of simulation is valid and common, as Guillermo commented. I have experience in handling it.
My opinion is that each representative dataset contains not only a mean but also an inner structural distribution that may be found in many cases if there is a good method at hand. With the method we may simulate a function of random cumulative population fractions P with values in (0;1] and, applying them to the variable's distribution (which is a function of P), we may build, compare and obtain proxy fits of the empirical dataset (the latter being the measured reference for the fit).
The key point is the method and the premise of the Laplace criterion applied to empirical datasets: each value is the mean of a sector with frequency 1/N. We have no idea of the real frequencies, so we had better assume this.
I have shown some examples of this kind of simulation in other Q&A threads on RG and discussed them with Demetris and Jochen. But keep in mind that what you call a random data set can only be obtained theoretically from a function applied to random values generated between two limits. That function has P as the dependent variable, and P is randomly chosen between zero and one in fractions.
I chose a sample of 100 values from Jochen's 20000 data points and will show my results later. Thanks, emilio
Eduardo and other readers: the results are included here. The sample consisted of 100 values taken from different sectors of Jochen's 20K dataset. The distribution has a very low dispersion, so the Gini index obtained was 0.12. It would be nice to compare these results with others from different methods and packages. Regards, emilio
Sorry, I used a file that referenced another file, so the graphs did not appear when downloaded. I am uploading it again here. emilio
So, dear Emilio, according to your analysis and if you remember that the data you just analysed are the "data2k.xlsx" which I had uploaded in a thread about MLS & LSE (Jochen took the same data and built a 10 times larger sample), how do the results of the current 100 sub-sample taken from a 20000 super-sample of the original 2000 sample differ? Do you find significant changes? Thank you.
Demetris, I used only 100 values from Jochen's 20000, not your 2000. Perhaps the mean is different, but the structural distribution found is almost the same. Do you refer to the W equation that I used this time? If so, it is only another one, because there are many possible equations that may fit a dataset rather well; LSM did not work, and I tried this more complex shape, which has its advantages. Another reason may be that the minimum value of my 100 values can differ from that of your dataset. I only interpreted this 100-value data set, which also contained negative values. I do not know if this answer explains your question. OK, in any case it is a good question and of interest to me. emilio
Emilio, the fact that the "structural distribution is almost the same" supports the validity of your analysis, so don't worry! BTW, I have to read your paper more deeply when I find time. Can you estimate some central tendency measures with your method? It would be interesting to see the results...
Fisher argued that made-up data tends not to have sufficient variation; that is, it tends to fit the theory too well.
On this basis he pointed to Mendel's data on peas; the agreement between his genetic theory and the outcome of his experiments was much too good.
The debate is covered in depth in
Carlson, Elof Axel (2004). "Doubts about Mendel's integrity are exaggerated". Mendel's Legacy. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. pp. 48–49. ISBN 978-0-87969-675-7.
More recent work has cast some doubt on Fisher’s view:
Hartl, Daniel L.; Fairbanks, Daniel J. (1 March 2007). "Mud Sticks: On the Alleged Falsification of Mendel's Data". Genetics 175 (3): 975–979. ‘deliberate falsification can finally be put to rest, because on closer analysis it has proved to be unsupported by convincing evidence’
The software MLwiN (see http://www.bristol.ac.uk/cmm/software/mlwin/ ) which is used for estimating very complex multilevel models allows you to specify the coefficients of the general relation for a given structure and the stochastic variation at multiple levels and then simulate easily from this with just one command (SIMUL) – so you could create whatever you wished to find easily at a few clicks of a mouse. Reproducibility studies need more than just the data.
Demetris, thanks for your interesting remarks. My comments are:
1) Let's avoid using samples with negative values. It is simpler to work with only positive values of the measured variable.
2) The hardest point to deal with is the top tail of distribution. The bottom tail is frequently simpler because researchers have more field information about it.
3) The best distribution to start non-linear analysis is Pareto´s because: a) structural function has just one constant parameter, so W(p)=c, and 0
I could not understand the use of the word 'fake' in the question. Why 'fake'?
@ Hemanta Baruah
"Fake" in this context is taken somewhat as an equivalent of "artificial". I personally create artificial random data to test against something else which I suspect is NOT random, i.e., as a means of comparing given "natural" data against artificially generated data which still "obeys" some of the rules (constitutive/natural laws) that seem to act upon the phenomena that rendered the natural data in the first place. Actually, the very word "random" means nothing to a mathematician, because you need to specify the assumptions entailed behind this word. For example, does random mean "equal probability within a whole interval"? In this case, given the classic (normalized and closed, or even possibly open) interval 0-1, it is declared that any real number in that interval has an equal chance of appearing as any other.
Having said that, the discussion here is whether someone can actually produce artificial data which would resemble the result of an experiment. Real science is about results being reproducible, meaning that given the same experimental conditions (say, pressure, volume, temperature, purity of materials and accuracy of measurement), you ought to be able to get similar results. A random number generator, as I said, entails equal probabilities. Natural phenomena are not supposed to have those equal probabilities under two different conditions, otherwise we wouldn't have natural laws, would we? If someone attempts to generate artificial data with something that looks like the distribution of raw data coming from natural phenomena, for me it has at least a flavor of naivety (because one is assuming a probability distribution which the real data may not have), or it is a moot point, because you have fallen into a tautology: you have already arrived at THE law governing the generation of your natural data, so of course you can get artificial data; such is called numerical simulation, and if it successfully reproduces some raw data, that is a paper. Can we tell the "good" data from the "bad"? For experiments, the only way I can see (and which was suggested throughout the answers) is reproducibility. However, I think that if you know enough about the subject, you may suspect the data are either fake or incorrect when you read the conditions the experiment was carried out under, because you have the experience that such data cannot be obtained under such conditions.
Minimum: -838.1291705
Maximum: 971.0693132
Range: 1809.1984837
Count: 100
Sum: 40233.613725026
Mean: 402.3
Median: 464.4
Standard Deviation: 300.6
Variance: 90330
Mid Range: 66.47007135
Quartiles:
Q1 --> 294.1471
Q2 --> 464.3842
Q3 --> 550.0032
Interquartile Range (IQR): 255.8561
Sum of Squares: 8943000
Mean Absolute Deviation: 212.6
Root Mean Square (RMS): 501.3
Std Error of Mean: 30.06
Skewness: -1.376
Kurtosis: 5.893
Coefficient of Variation: 0.747
Relative Standard Deviation: 74.7%
Dear Emilio,
Many thanks for your time and technical insights. Regarding your phrase "we may create any distribution and data set we want, so we may lie with statistics if we want. I made this question in R&G months ago, but only received 2 or 3 answers." -- could you please indicate the link to the original question you asked?
The descriptive statistics for the dataset you sent are posted above, as computed from the leftmost column, named "K orig data". All parameters seem different. If this is right, I am so far actually becoming convinced that descriptive statistics are a good fingerprint of the original data, and that generating another dataset fitting the given parameters is at the very least quite complicated (I cannot properly follow most of the discussion), if it is possible at all, as no fitting alternative dataset has been presented.
I am actually quite impressed; I thought faking random data to fit any challenged min-max intervals and means +/- SD or SE would be easier.
Thanks, Kelvyn, for recommending the paper on Mendel's data; this is a truly relevant application of this discussion. I will try the software you indicate when I have time; however, it also sounds like it demands specific technical knowledge and will not directly address min-max and mean +/- SE parameters? I will look into it later.
Dear Arturo and Kelvyn, I totally agree with you, as mentioned before, that reproducibility is the best available means of testing a challenged description of phenomena; however, note that in real life this is both rare and actually discouraged by the current system of scientific publication. Currently, peers have been questioning papers presenting "fishy data", usually by exposing obvious image manipulation. Such practice has been invaluable in showing the reality of modern science and the flaws of closed peer review and impact factors as filters for quality.
One of my main ideas with the present question is that maybe there are many overlooked "fishy numbers" that could also be employed in exposing questionable data, possibly in the same batch of challenged papers.
I have found the compilation of methods below just for generating numbers fitting a given mean +/- SD. I am guessing that doing this for further parameters becomes too complicated.
http://rosettacode.org/wiki/Random_numbers
Yes Eduardo
In the example that I did before, I simulated the data behind an ANOVA table where I wanted to rebuild the published results. There I had to take care of many means and sums of squares simultaneously to achieve my goal.
In the SAS code you quoted, for example, you are building a pseudo-random sample of 1000 observations from a Normal distribution with mean=1 and SD=0.5. But the sample mean and SD are:
Mean Std Dev
----------------------------
0.9907408 0.4844051
If you want to simulate published data with mean=1 and sd=0.5, you would have to transform this sample a little.
Dear Eduardo, the link is
https://www.researchgate.net/post/How_to_control_the_ethical_practices_of_statistics_and_sampling
The method is easy and short to understand, but my Excel files are not always so clear. OK, emilio
My contribution might be late.
Yes, the way to generate random/fake data is through simulation. It is a common practice for illustrating the application of an estimation method. A very simple example is simple linear regression: a bivariate normal distribution with a given covariance matrix (variances and correlation) is generated. After the random variates are obtained, simple plots such as boxplots and scatterplots show and confirm the given values.
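A tiny illustration of that kind of simulation with MASS::mvrnorm (the means, variances and correlation below are arbitrary choices):
library(MASS)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)     # unit variances, correlation 0.8
xy <- mvrnorm(n = 200, mu = c(0, 0), Sigma = Sigma)
plot(xy[, 1], xy[, 2])                            # scatterplot shows the imposed correlation
cor(xy)[1, 2]                                     # close to 0.8
# with empirical = TRUE, mvrnorm even makes the sample mean and covariance match exactly
xy2 <- mvrnorm(n = 200, mu = c(0, 0), Sigma = Sigma, empirical = TRUE)
cor(xy2)[1, 2]                                    # exactly 0.8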
I consider the explicit problem given by Fox: forge 100 data points yielding the prescribed 23 values of descriptive stats.
In essence this is a solvable problem. However, whether a solution is feasible depends on the descriptive stats at hand. Here it should be feasible. Let's analyze it ("data forgery for experts"):
(1) eliminate all redundant descriptive stats:
range=max-min
sum=count*mean
standarddev=sqrt(variance)
midrange=(max+min)/2
Q2=median
IQR=Q3-Q1
SumSquares=(count-1)*variance
errorOfMean=standarddev/sqrt(count)
CoeffOfVar=relStandarddev=standarddev/mean
If these identities are not satisfied, the descriptive statistics are in themselves inconsistent and no data can be found. Here this is OK (although some descriptives, such as CoeffOfVar, do not make sense here since the data also include negative values).
(2) two values are fixed by the descriptives. Set, e.g.,
x0=min
x99=max
(3) take care of the middle quartiles:
We are left with a system of 6 nonlinear equations for the remaining count-2 = 98 unknowns x1 ... x98 and the two restrictions xmin <= xi <= xmax.
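As a quick illustration of step (1), the redundancy identities can be checked in R against the descriptive stats reported earlier for the 100-value sample (the numbers are copied from that post; the small non-zero differences are just rounding in the online calculator):
count <- 100; minv <- -838.1291705; maxv <- 971.0693132
meanv <- 402.3; sdv <- 300.6; varv <- 90330; sumv <- 40233.613725026
c(range    = (maxv - minv)     - 1809.1984837,   # range = max - min
  sum      = count * meanv     - sumv,           # sum = count * mean
  sd       = sqrt(varv)        - sdv,            # sd = sqrt(variance)
  midrange = (maxv + minv) / 2 - 66.47007135,    # midrange = (max + min) / 2
  se       = sdv / sqrt(count) - 30.06,          # SE of mean = sd / sqrt(count)
  cv       = sdv / meanv       - 0.747)          # CV = sd / mean
# all differences are (close to) zero, so the reported stats are internally consistent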
I could also envision an evolutionary algorithm to generate random numbers that will have all the desired statistics.
But I wonder if it is possible to give (make up) a set of statistics that cannot be fulfilled by any (finite) set of data. I think this should be possible, for instance when mean >> median and skewness < 0.
There are definitely many possible inconsistencies, from subtrivial ones (negative variance) to trivial ones (e.g. MAD greater than half the range) to more subtle ones having to do, e.g., with conditions between quantiles and variances.
Martin, hi. In the method I have worked with, it is not relevant to use SD, variance, kurtosis, skewness, etc., mainly because they change somewhat depending on the number of points you use to simulate. As an example, I attach an Excel file with graphs for 100 points predesigned for a trial-and-error structural equation model with U=1, which I call W(p), where p = cumulative population fraction in descending order. I obtained Mean Absolute Deviation = 0.2, median = 1.096 means, and K = 1 mean at p = 0.626.
W uses 9 coefficients and W(p) = a + bP^B + cP^C + dP^D + eP^E.
This shows that creative simulations are possible from a perspective different from traditional statistical theory based on SD, variances, degrees of freedom, etc. If you want to estimate those parameters, you can do it from the tables shown.
As Prof. Hemanta asked, why does the main question call it "fake" if it is only an experiment done with mathematics plus trial and error? I agree that developing a distributive structure for the parameters you mention must be very difficult and full of conditions. Thanks, emilio
I am glad this discussion is generating so many interesting approaches. The one proposed by Martin seems very logical. Still, I can see it is not exactly trivial enough that anyone could do it, so I am somewhat inclined to trust raw data for double-checking experimental data. I still think that some evolutionary algorithm, as proposed, running on a supercomputer could work out many such datasets from the given figures, and I think that would be a very useful tool for data evaluation in science.
Minitab is easiest for generating random data. Excel generates only pseudo-random data.
Minitab and SAS produce pseudo-random data too, but of higher quality, because the pseudo-random data from Excel may easily be detected as non-random with a pertinent statistical test.
The last system I know of that generated "real" random data was the C64. All algorithmically generated random numbers are by definition pseudo-random numbers. However, this leads to a discussion about the meaning of "random" that was fought elsewhere. So I'll stick to the definition (be it sensible or not) that all computer-generated random numbers are, essentially, pseudo-random (because they could be predicted with certainty if the algorithm and the starting conditions were known).
Hi Jochen, please could you elaborate your point somewhat more? Are you speaking of N simple random numbers generated between two limits, for example from 0 to 1, or are you speaking of generating N random variables to be used as a dataset? Could you give us a simple numerical example for N=5 with a few significant figures so we can understand your point better? Thanks, emilio
Peter, a question. If a dataset with several "random variables" shows that they are assigned to the same series of receptor population fractions, can we declare that they are not "independent, non-correlated data" - only because all of them are related to a common population variable that links them? How does classical statistics justify and explain this? Thanks, emilio
Peter, your answer - for which I thank you - states that collinearity is still a big unsolved problem, but that as "statisticians we do the best we can under the circumstances". If there is not yet a satisfactory answer to this question, I cannot understand why so many statistical packages, books and teachers insist on this kind of recipe, still unconfirmed and not studied enough. Instead of promoting them, they should withdraw them until they are sure about what they do. Is it an inertial weakness of statistical theory and practice in this particular field? Your answer is honest, but it worries me with regard to published research results based on multivariate linear analysis. Thanks, emilio
For the normal distribution, it's easy. Every scientist should know how to do this.
First, generate some normally distributed random numbers (a sampling algorithm for the normal distribution is easily found). Then calculate the actual mean of this random set and subtract it from every number. Then calculate the standard deviation. The ratio of the desired to the calculated deviation is the coefficient by which to multiply each number in the set. Now add the desired mean value to each number. Voila! What a neat dataset we have now!
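This recipe, written out in R (the targets below are arbitrary), gives the desired mean and SD exactly:
n <- 100; desired.mean <- 5; desired.sd <- 1.2   # arbitrary targets
z <- rnorm(n)                                    # any normally distributed numbers
z <- z - mean(z)                                 # subtract the actual mean
z <- z * desired.sd / sd(z)                      # rescale by desired / calculated SD
y <- z + desired.mean                            # add the desired mean
c(mean(y), sd(y))                                # exactly 5 and 1.2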
Dear all,
Several years have elapsed since this discussion started. It still stands as my "most read" question on this network and the "most read" question of my current institution. Quite remarkable.
Nowadays I have been working in R, where random (and non-random) samples following certain parameters can easily be generated. I was wondering if anyone here has ever tried to crack this using the latest R packages?
The algorithm is as follows:
1) generate n-k random numbers from the desired distribution, where k is the number of parameters of the distribution
2) calculate the k missing values so that the sample statistics match the desired distribution parameters.
Example: a random sample (n=30) from a Poisson distribution should have a mean (= lambda) of exactly m = 15 (here, k=1); see the sketch below.
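A minimal sketch of those two steps in R for this example (for a convincing fake one would also check that the derived value is a plausible non-negative count):
m <- 15; n <- 30
x <- rpois(n - 1, m)          # step 1: n - k random draws from the desired distribution
x <- c(x, n * m - sum(x))     # step 2: the one missing value that forces the sample mean
mean(x)                       # exactly 15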
Thanks @Wilhem, it will take me some tests to digest that. Still, would you know of a certain set of provided statistical parameters that would unmistakably "fingerprint" the original values? That is, whether sd, min-max, median, etc., would lead me to a *unique* set of, say, n=30 integers? I'd be delighted to test that in R.
Thanks!
Yes. You simply need n-k non-redundant* statistics that let you derive n-k values.** The remaining k values are given as described previously.
---
*The range, the min and the max are only 2 non-redundant statistics, as the third follows from any given two of those (you eventually get the min and the max as two values). The sd and the mean for a normal distribution would also not be independent as they are parameters of the model and determined by adjusting the remaining k (=2) values.
**Quantiles like the median can (usually) not be used to derive data values, unless you say that the quantiles themselves are data values.
Yes, just close the loop, filter the result, override unwanted results with real noise and measure the signal... sorry for the intermediate data loss... but this is what happens always and everywhere...
import numpy as np
import seaborn
import statsmodels.api as sm

seaborn.set_style("darkgrid")

np.random.seed(1)
X = np.random.random_sample([200, 20])   # 200 "observations" of 20 purely random predictors
np.random.seed(2)
y = np.random.random_sample([200, 1])    # a random response, unrelated to X

model = sm.OLS(y, X)
fit = model.fit()
print(fit.summary2())                    # some coefficients will look "significant" by pure chance
Is there any way to generate simulated data? I have a time-series dataset of temperature values from a sensor. I have the mean, min, max and correlation parameters of the available dataset. Now, using these parameters, I want to simulate or create synthetic data resembling the available data.
Hi, it is a pretty old question, and Andreas and others have explained how to get there; however, strictly assuming a normal distribution (and assuming you wanted a ready-to-use software solution), you can simply use ermvnorm from the R package SimComp to generate data with an exact mean and SD.
Some examples of the practical application are given in:
https://www.biorxiv.org/content/10.1101/810408v1