Mohammad, I believe pdf refers to probability density function.
Krishna, after carefully establishing your hypothesis (either given or devised) with precise definitions (dependence/independence, continuous/discrete* variables, etc.), you must ALWAYS plot or graph the data first, because only then can you decide how to analyse its frequency distribution over time and across samples (time series, multi-sample populations or groups for verification). Your data depend on the observed phenomena and on your hypothesis about those phenomena (how, and how strongly, phenomena and hypothesis are linked or correlated; confidence intervals and margins of error). This is followed by using statistical methods to test your hypothesis AND its opposite, the correctly formulated null hypothesis, i.e. hypothesis testing. You use statistical techniques to test the strength (or weakness) of both the validity and reliability of your hypothesis, because observations can be intentionally or unintentionally faulty and heavily biased. As you know, this is simply the scientific method at work.

Furthermore, in your analysis (hypothesis testing, fractiles, regression, etc.) of your plotted data's still-to-be-determined distribution, look at (1) the mean, median, and mode (the central tendencies) for possible skewness (lopsidedness) of your data, right or left of the normal probability distribution curve (if skewed, employ fractile analysis, for which box plots may be best suited). Then look at (2) how the data are dispersed on the graph, that is, establish the confidence interval (via the Central Limit Theorem: the sample of size n approximates the probability distribution of the entire population N, i.e. part = whole, per Dirac's Large Number Hypothesis) and the variance (s^2) about the mean, from which the standard deviation (s) is derived. This scientific approach will determine the pdf's of your observed and collected data: normal, skewed, binomial (*discrete, e.g. independent rolls of dice), linear, curved, curvilinear, etc., ultimately revealing the truth or falsity of your carefully formulated hypothesis/quest.
There are a couple of things getting confused here.
1) probability distributions are not frequency distributions. They represent a state of knowledge we have about some outcome.
2) for fitting statistical models, it is not the frequency distribution of the data that is of interest but the probability distribution of the errors (residuals)
3) the probability distribution of the errors is usually derived from theoretical considerations, using a minimum number of assumptions and taking as much care as possible of what kind and how much knowledge we can have about the errors. Six typical results for different kinds of data are the binomial distribution (for dichotomous data), the Poisson distribution (for counts), the exponential distribution (for waiting times), the beta distribution (for percent values), the normal distribution (for measurements), and the gamma distribution (for measurements with correlated mean and variance, and for waiting times until the nth event). Some other distributions exist that give more flexibility in modelling known/expected relations between means and variances (e.g. the negative binomial). A short R sketch follows after point 4).
4) in principle, one is free to choose *any* function that fulfills the requirements of a probability function to fit the statistical model. The problem is not what is "right" or "wrong" but what convinces the reviewers. There are always many "rights" and many "wrongs" possible. Clearly wrong would be a distribution that does not consider major sources of ignorance or knowledge you (should) have.
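To illustrate point 3), here is a minimal R sketch of matching the error distribution to the kind of data. The data frame d and all column names are hypothetical, just placeholders for your own variables:

m_bin  <- glm(success ~ treatment, family = binomial, data = d)          # dichotomous (0/1) data
m_pois <- glm(count ~ treatment, family = poisson, data = d)             # counts
m_gam  <- glm(conc ~ treatment, family = Gamma(link = "log"), data = d)  # positive measurements, variance tied to the mean
m_norm <- lm(weight ~ treatment, data = d)                               # ordinary measurements: normal errors
# percentages strictly between 0 and 1: beta regression, e.g. betareg::betareg()
# overdispersed counts: negative binomial, e.g. MASS::glm.nb()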
To clear up the connection between these points:
If you do null hypothesis tests following Neyman-Pearson, you should control the error rates. To achieve this, the probability distribution must be equal to the distribution of the relative frequencies. This prerequisite is impossible to meet exactly, but one can get close to it. However, there remain two problems: 1) how can I estimate the "closeness", especially if I have limited data? and 2) how close is "close enough"?
Question 1 is usually tackled by QQ plots comparing the empirical quantiles to the quantiles of a distribution model. For question 2 there is no general answer available. Luckily, for 99% of the models of measured quantities (not counts, percentages, or other strange measures) I have encountered, the normal distribution is an appropriate distribution to model my ignorance about the residuals. Hence, if the Normal-QQ plot shows obvious patterns, the model is still not good enough (missing predictors, interactions, non-linearities), but I stay with the normal distribution to model (my ignorance of) the residuals.
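To make question 1 concrete, a minimal R sketch, assuming a fitted model called fit (a hypothetical name):

r <- residuals(fit)
qqnorm(r)    # empirical quantiles against theoretical normal quantiles
qqline(r)    # reference line; clear curvature or S-shapes are the warning sign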
I agree with what Jochen said and here is my "step-by-step" version of a similar thought (which was going to be a short version but is turning out to be the long version, sorry):
I. Fit your statistical model. And then plot...
A. ...the residuals versus fitted values (if it's a t-test or classic ANOVA, it's just a boxplot or stripchart of residuals on the Y-axis and groups on the X-axis). Are the residuals symmetrically distributed around 0? Do some groups have residuals that vary noticeably more than other groups?
B. ...the residuals against what ideal normal residuals would look like (the QQ plot Jochen mentions). Are they close to a straight diagonal line, with only a few points in the tails trailing off, and are those tails roughly similar in size? (A short R sketch of both plots follows below.)
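Here is a minimal R sketch of the plots in I.A and I.B, assuming a fitted model fit and a data frame d with a grouping factor group (all names hypothetical):

plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")  # I.A
abline(h = 0, lty = 2)
boxplot(residuals(fit) ~ d$group, ylab = "Residuals")   # per-group spread (t-test / ANOVA setting)
qqnorm(residuals(fit)); qqline(residuals(fit))          # I.B: Normal QQ plot of the residuals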
II. If the answer to any of the above questions is a strong no, here are some things you can do:
A. Plot the residuals against all possible "independent" variables (which I will call predictor variables), including independent variables you didn't include in the model. (For example, if you're comparing control versus experimental but you collected the data over three different days, the day would be a potential predictor variable.) A short R sketch of this step follows after point D below.
i. If the residuals go up or down with a new predictor variable, it might be necessary to include it in the model, possibly also its interaction with one or more of the other variables.
ii. If the residuals go up or down with an existing numeric variable, you might need to add a power of that variable to the model (For example, if you have Mass as a predictor that correlates with your residuals, you might need to also include Mass^2).
iii. If the residuals go up or down with an existing categorical variable, you might need to subdivide that category into more specific subcategories, and this will not always be possible.
B. Another alternative is transforming your data, for example taking the log, square root, or reciprocal of the "dependent" variable (which I will call the response variable). This process can be somewhat automated by a Box-Cox transformation, which some statistical software provides. I don't know about other software, but in R you can install the MASS package and use the command boxcox(X), where X is a statistical model you fitted to the data. If it indicates that the highest likelihood is near 2, you should try refitting the model with the response variable squared; if the highest likelihood is near -2, try the reciprocal of the square; if near -1, try just the reciprocal; if near 1/2, try the square root; and so on. If near 1, leave it alone: the functional form of the response is not the problem. If near 0, take the log. Note that this isn't always practical/possible when the response variable has negative values. (A short boxcox sketch follows after point D below.)
C. Or, you can try using generalized least squares. This is available in the nlme package for R.
D. If there are random effects in your data (e.g. which person, which testing facility, which day, which technician: variables that affect the data but are not relevant to the study AND would not be the same ones if someone repeated your study somewhere else), you can use a mixed-effects model. These are also available in the nlme package for R. (A short nlme sketch for C and D follows below.)
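A minimal R sketch of step II.A, using the hypothetical predictors from the examples above (a Day variable that was not in the original model, and a numeric Mass that was):

plot(residuals(fit) ~ d$Day)            # pattern across days? -> consider including Day
fit2 <- update(fit, . ~ . + Day)        # possibly also + Day:treatment for an interaction

plot(d$Mass, residuals(fit))            # curvature? -> consider a quadratic term
fit3 <- update(fit, . ~ . + I(Mass^2))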
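A minimal R sketch of the Box-Cox step in II.B, assuming a hypothetical response y:

library(MASS)
fit <- lm(y ~ treatment, data = d)
bc  <- boxcox(fit)                   # plots the profile log-likelihood over lambda
lambda <- bc$x[which.max(bc$y)]      # lambda with the highest likelihood
# then refit with the matching transformation, e.g.
fit_sqrt <- lm(sqrt(y) ~ treatment, data = d)   # lambda near 1/2
fit_log  <- lm(log(y)  ~ treatment, data = d)   # lambda near 0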
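And a minimal nlme sketch for II.C and II.D, again with hypothetical variables (y, treatment, technician):

library(nlme)
fit_gls <- gls(y ~ treatment, data = d,
               weights = varIdent(form = ~ 1 | treatment))          # II.C: a separate variance per group
fit_lme <- lme(y ~ treatment, random = ~ 1 | technician, data = d)  # II.D: random intercept per technician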
The above things to try are arranged roughly in order of increasing technical difficulty. If at some point you encounter one that you have problems with, that is the time to ask a statistician for help.
Actually, if your local statistician is readily available, the best time to ask for help is when designing the experiment. So, the sooner you can get hold of them, the better. However, trying the above steps might be good practice for better understanding what the statistician does with your data and why, and eventually becoming less dependent on him or her.
All the above is just my subjective synthesis of what I read and what more experienced colleagues have explained to me. If any advice I'm giving is incorrect, I will gladly vote up people who post corrections.
Very comprehensive answer, Alex. I cannot find anything wrong, but I still have a comment: in my opinion(!), transformations should only be applied when it is clear what they do. Applying a transformation just because it makes (one aspect of!) your data look nicer (i.e. fit better to your possibly misguided expectations) is not a good way to go. The square root is often used for count data, but Poisson models are available, so there should be no need for a square-root transformation when the data really are counts. If it isn't count data, I need to work out what the square root of the response actually means. A similar thing holds for the log-transformation: it is very often used when the data are right-skewed, but the standard (normal) model on the log data assumes a multiplicative error structure. This has to be kept clearly in mind; otherwise, a gamma model might be more appropriate. And finally, the Box-Cox transformation leaves you completely alone with working out what the transformed response tells you. No doubt, the transformed data can be used well for hypothesis tests, but to me it seems unsatisfying to be left claiming that there is evidence to believe that, for instance, the means of the strangely transformed values of two samples are not similar.
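To make the log-transformation point concrete, a hedged R sketch (hypothetical names as in the sketches above) contrasting the two modelling routes for a positive, right-skewed response:

fit_lognormal <- lm(log(y) ~ treatment, data = d)                            # normal model on log(y): multiplicative errors
fit_gamma     <- glm(y ~ treatment, family = Gamma(link = "log"), data = d)  # gamma model: response keeps its original scale and meaning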