In the application of the Central Limit Theorem to sampling statistics, the key assumptions are that the samples are independent and identically distributed. How do you justify these assumptions (i.e. why are they likely to be true)?
Assumptions follow by logical reasoning from the study design. It is the design, i.e. the way the samples are obtained, that makes these assumptions reasonable.
NB: The CLT is not restricted to identically distributed values. Generalized versions (e.g. Lindeberg's or Lyapunov's) cover sums of independent variables with different distributions, as long as each has a finite variance and no single term dominates the sum. The finite variance is the more important point: even if all values are identically distributed, the variance of that distribution must be finite.
The independence assumption is also important (if you sum the same measurement n times, you will not obtain Gaussian behavior). In practice, knowledge of the sampling scheme can be sufficient to decide whether the independence assumption holds.
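To make the remarks above concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: the uniform distributions, the widths 1, 2, 3, the 30 terms per sum, and the use of excess kurtosis as a rough indicator of Gaussian shape (it is near 0 for a normal distribution and about -1.2 for a uniform one).

```python
import random
import statistics

def excess_kurtosis(values):
    """Sample excess kurtosis; close to 0 for Gaussian-like data."""
    m = statistics.fmean(values)
    var = statistics.fmean([(v - m) ** 2 for v in values])
    fourth = statistics.fmean([(v - m) ** 4 for v in values])
    return fourth / var ** 2 - 3.0

def simulate(n_terms=30, n_sums=5000, seed=0):
    rng = random.Random(seed)
    # Hypothetical setup: independent but NOT identically distributed terms,
    # uniform on (0, w) with widths cycling through 1, 2, 3.
    widths = [1.0 + (i % 3) for i in range(n_terms)]
    independent, repeated = [], []
    for _ in range(n_sums):
        # Sum of independent draws, one per width.
        independent.append(sum(rng.uniform(0.0, w) for w in widths))
        # "The same measurement summed n times": one draw, multiplied by n.
        repeated.append(n_terms * rng.uniform(0.0, 1.0))
    return excess_kurtosis(independent), excess_kurtosis(repeated)

k_independent, k_repeated = simulate()
print(f"independent, non-identical terms: excess kurtosis = {k_independent:.2f}")
print(f"same value repeated n times:      excess kurtosis = {k_repeated:.2f}")
```

In this sketch the sum of independent terms should look roughly Gaussian even though the terms are not identically distributed, while the repeated measurement keeps the flat shape of a single uniform draw however large n is.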
Ravi, the CLT holds for some predesigned distributions that assume central symmetry, like the normal distribution. So the conclusion is implicit in the main assumptions, and proving this tautology is easy. Another case is when you generate random numbers between two values, such as 1 to 5; this leads to U=3, right in the middle of the interval.
But when you study a nonlinear function of a random value X, for example Y=1-0.2*X², the CLT does not work at all, even if you have N=1 million random values on the interval (0;1).
It seems I am the only one who considers the CLT irrelevant for most research and analysis of databases. Its assumptions are neither reasonable nor frequent in research, except in Gaussian statistics. That is my view, and it is the result of my experience with different simulations and analyses. If the CLT is not well justified, I ask people not to use it to support statistical arguments; doing so would be an act of faith. Thanks, emilio
The first version of the CLT was valid only for i.i.d. variables. This condition was relaxed in modern proofs of the CLT. The essential restriction is that the variables have a finite variance (e.g. Cauchy-distributed variables are out). So there is no tautology.
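The Cauchy exception can be checked with a short simulation. This is only a sketch under my own assumptions: a standard Cauchy variable is drawn by the inverse-CDF formula tan(pi*(u-0.5)), the uniform distribution on (0,1) serves as a finite-variance comparison, and the interquartile range (IQR) of 2000 sample means is used as the measure of spread.

```python
import math
import random
import statistics

def iqr(values):
    """Interquartile range of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def iqr_of_sample_means(draw, n, n_means=2000):
    """Spread of the mean of n draws, estimated from n_means replications."""
    means = [statistics.fmean([draw() for _ in range(n)]) for _ in range(n_means)]
    return iqr(means)

rng = random.Random(1)

def draw_uniform():
    return rng.random()

def draw_cauchy():
    # Inverse-CDF draw from a standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))

results = {}
for name, draw in (("uniform", draw_uniform), ("cauchy", draw_cauchy)):
    results[name] = (iqr_of_sample_means(draw, 1), iqr_of_sample_means(draw, 100))
    print(f"{name}: IQR of means, n=1: {results[name][0]:.3f}, n=100: {results[name][1]:.3f}")
```

For the uniform case the spread of the mean shrinks roughly like 1/sqrt(n); for the Cauchy case it should stay about the same, because the mean of n standard Cauchy variables is again standard Cauchy.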
Your example with a "nonlineal function" (Y=1-0.2*X²) is not clear. What distribution should X² have? If its variance is finite, then the variance of Y is finite, too, and the CLT applies.
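A quick way to check this numerically is sketched below. It assumes, as my own reading of the example, that X is uniform on (0,1), so that E[Y] = 1 - 0.2/3 and Var(Y) = 0.04*(1/5 - 1/9). The sample size n=50 and the 4000 replications are arbitrary choices. If the normal approximation is adequate, about 95% of the standardized sample means should fall within ±1.96.

```python
import math
import random
import statistics

# Exact moments of Y = 1 - 0.2*X^2 under the ASSUMPTION X ~ Uniform(0, 1).
MU = 1 - 0.2 / 3                          # E[Y], since E[X^2] = 1/3
SIGMA = 0.2 * math.sqrt(1 / 5 - 1 / 9)    # sd of Y, since Var(X^2) = 1/5 - 1/9

def coverage(n=50, n_means=4000, seed=2):
    """Fraction of standardized sample means of Y that lie within +/-1.96."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_means):
        mean_y = statistics.fmean([1 - 0.2 * rng.random() ** 2 for _ in range(n)])
        z = (mean_y - MU) / (SIGMA / math.sqrt(n))
        inside += abs(z) <= 1.96
    return inside / n_means

frac = coverage()
print(f"fraction of standardized means within +/-1.96: {frac:.3f}")
```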
In the given nonlinear example (Y=1-0.2*X²), the variable Y is the measured phenomenon, assumed to be a perfect CDF for data sorted in descending order of Y. X is its cumulative frequency, between 0 and 1.
With this information we can estimate exact values of the mean U and of the Lorenz curve L(X). Then, using a computer, we fit a polynomial curve by the least squares method, and its derivative gives Y@(X), the dimensionless Y function expressed in units of the distribution mean. Multiplying U*Y@(X) yields a curve very close to Y(X) in its original units.
In this case X is called the “independent variable”, although it is closely related to and dependent on Y. I think this is due to an ingrained mathematical custom of calling the variable on the horizontal axis “independent” just because we assign values to it in order to graph a polynomial relation.
You may observe that variance is not needed at all in the process.
If we want to simulate random values of Y for this case, we first draw a random value from 0 to 1; suppose it gives Xi=0.3, so that Y(x=0.3)=1-0.2*0.3². I did this for N=100 values from 0.01 to 1.00 and made the analysis and graphs. This is a simulation exercise.
I did this exercise for (Y=1-0.2*X²), (Y=1-0.2*X^0.5) and (Y=1-0.2*X^1) using the Excel sheet you may download. Only the third case gives a value of U centered at the median. The main point is that variance is not needed as a parameter; but you may work the case using variances and any parametric function you like, in order to compare both results and test the CLT on them.
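For readers without the spreadsheet, the comparison of U with the median can be reproduced roughly as follows. This is a sketch based on my reading of the description (an evenly spaced grid of N=100 values of X from 0.01 to 1.00), not the original Excel file; the names xs and mean_and_median are hypothetical.

```python
import statistics

# Grid described in the post: X = 0.01, 0.02, ..., 1.00 (N = 100).
xs = [k / 100 for k in range(1, 101)]

def mean_and_median(power):
    """Mean U and median of Y = 1 - 0.2 * X**power over the grid."""
    ys = [1 - 0.2 * x ** power for x in xs]
    return statistics.fmean(ys), statistics.median(ys)

summary = {p: mean_and_median(p) for p in (2, 0.5, 1)}
for p, (mean_y, median_y) in summary.items():
    print(f"power {p}: mean U = {mean_y:.4f}, median = {median_y:.4f}, "
          f"difference = {mean_y - median_y:+.4f}")
```

Only the linear case has its mean at the median. Note that this gap is a property of the distribution of the individual Y values; the CLT is a statement about the distribution of averages of many such values, so the two should be examined separately.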
OK, thanks to Ravi for the question and to you and all persons who answered it. emilio
I agree that "“Symmetry” of the “density” (of the Random Variables) is not very important ... as n -> infinity", but I would add that this holds even if N is not big (less than 20). That is important because it means that most statistical texts and teachers may be outdated, if not wrong. You have repeated this idea several times for the sake of good science.
Your sentence "It is USEFUL to reach a “reasonable” approximation also when n is as small as 4 or 5!" must be taken cautiously, because it requires that each value of the dataset be reasonably close to the average of one of the N possible intervals, in order to be "representative" and to give a close value of U. (I do not care about sigma variances.)
"The Central Limit Theorem is a PROBABILITY Theorem (proved AND comprised in the field of Probability Theory, NOT in the field of “Statistics”)." My view is that statistics must relate probabilities (frequencies and related terms such as chances) to datasets, building proxy models of the measured phenomenon.
Another approach is to consider the problem in the context of time series analysis. In the theory of time series, researchers need to estimate and test hypotheses about the mean of an original stationary time series or the mean of some transformation of a non-stationary time series. Therefore CLT-type results are very important.
There are two ways to verify the assumptions: either rely on empirical knowledge about the time series and on autocovariance plots, or formally test that there is no significant correlation (only weak correlation) between observations over time.
Of course, many real time series are strongly correlated. Therefore, as mentioned in some answers above, CLT-type results have been generalized (mainly in numerous econometrics publications) to much more general assumptions about the summands and to non-Gaussian asymptotic distributions. See the introduction and references in
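As a rough illustration of the second route, one common informal check compares the sample autocorrelation at a given lag with the approximate band ±1.96/sqrt(n) expected under white noise. The sketch below makes several assumptions of its own: the two simulated series (white noise and an AR(1) process with coefficient 0.8), the length n=500, and the helper names autocorr and ar1 are all chosen only for illustration.

```python
import math
import random
import statistics

def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    m = statistics.fmean(series)
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag))
    return num / denom

def ar1(phi, n, rng):
    """Simulate x_t = phi * x_{t-1} + e_t with standard normal noise e_t."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

rng = random.Random(3)
n = 500
white = ar1(0.0, n, rng)        # phi = 0: uncorrelated observations
persistent = ar1(0.8, n, rng)   # phi = 0.8: strongly correlated observations

band = 1.96 / math.sqrt(n)      # approximate 95% band under white noise
r_white = autocorr(white, 1)
r_persistent = autocorr(persistent, 1)
print(f"band = +/-{band:.3f}")
print(f"white noise lag-1 autocorrelation: {r_white:+.3f}")
print(f"AR(1) lag-1 autocorrelation:       {r_persistent:+.3f}")
```

A value far outside the band, as expected for the AR(1) series, suggests that the plain i.i.d. version of the CLT should not be relied on for that series.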
Article Limit Theorems for Weighted Functionals of Cyclical Long-Ran...