The variance of a set of data is obtained by calculating the mean of the squared deviations of the individual observations. Why not take the mean of the absolute deviations (ignoring the positive or negative signs)?
There is a history of how the "variance" entered statistics. There it has a particular meaning. Mathematicians generalized the once-recognized concept, with two consequences: (i) applications were found that nobody had thought of before, but also (ii) people forgot (or never learned) the "real-life meaning" of variance. In my opinion, the argument that the squares remove the negative signs and give a positive finite average is a made-up post-hoc excuse and by no means the real reason for the way "variance" is calculated.
The concept was (afaik) introduced by Gauss, possibly prepared by the work of Tycho Brahe, to get the "best estimate" of the position of a star based on a series of individual measurements that are all different from each other and do not set up a solvable system of equations. So the question to be answered (using the available data) is: "where should we most likely expect the star?". To answer this question, two assumptions were required: (i) such a fixed position in fact exists, and (ii) the data scatter symmetrically (and independently) around this position. From these assumptions Gauss derived a probability distribution for the reasonable expectations about the "errors" (deviations of the measurements from the center). This distribution turned out to have two parameters: the location (0 at the position of the center) and the precision.(*) Gauss was not much interested in the precision. It was just a nuisance parameter, given by the data anyway, that did not influence the estimation of the most likely position of the star. The maximum-likelihood estimate of the position turned out to be simply the average of the measurements. Gauss assumed a uniform prior when he noted that this should indicate the most likely position.
Calculations with likelihoods were too difficult at that time, but Gauss could show that, for the probability model used, the same solution can be obtained by the mathematically equivalent minimization of the squared deviations. It turned out later that the precision parameter is also related to the squared deviations. Today we often use a re-parametrization of this parameter, its reciprocal, and call it "variance". Today it serves as a measure of the variability of the data. It became indirectly important when students of Gauss (Lüroth, Dedekind) and later also Student, Pearson and Fisher started to investigate the "probable error of the mean". For this they needed the profile likelihood, which turned out to have the shape of a t-distribution. This distribution again has a location and a dispersion parameter.(**) The latter is today called the "standard error of the mean". This was the actually important measure to describe how precise the estimate of the center (mean) is, based on the available data. The standard error is of course related to the variance, so it could simply be estimated from the squared residuals again.
The whole story is surely more complex than given here, and a lot of aspects are left out. However, I think it makes the main point clear: the variance (or its square root, the standard deviation) is a mathematical necessity for the probabilistic description of expectations and can only be sensibly interpreted under certain assumptions. The relation to the properties of the actual data is quite indirect and less important. The more "fundamental" precision measure is actually the standard error, and the variance is the "reversely derived" measure relating back to the probability distribution that was constructed to describe our expectations about individual values.
To look at the dispersion of the actual data, I find the MAD or IQR or IPR (inter-percentile range) more instructive and understandable.
(*) Yes, it is the normal distribution. Note that it has nothing to do with the actual empirical frequency distribution of the deviations! The distribution was derived several times independently for similar purposes. Gauss's derivation was never well understood; his notes were never that clear. The derivations of Landon, Herschel, and Maxwell, for instance, are better documented (see e.g. www-biba.inrialpes.fr/Jaynes/cc07s.pdf).
(**) Today only a standardized version with location = 0 and dispersion = 1 [the dispersion is again parametrized as the expected squared difference from the expected value] is given in textbooks and used in most software. The actual profile likelihood would be obtained by scaling with the standard error and shifting by the mean. This way, confidence intervals for the mean are calculated from the quantiles of the "standard" t-distribution, multiplied by the standard error, and added to the mean.
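As a minimal numerical sketch of that last calculation (a standard t-based confidence interval for the mean; the data values below are invented for illustration, and numpy/scipy are assumed to be available):

```python
import numpy as np
from scipy import stats

x = np.array([4.9, 5.1, 5.0, 5.3, 4.8, 5.2])   # hypothetical measurements
n = len(x)
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)                 # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)           # quantile of the "standard" t-distribution
ci = (mean - t_crit * se, mean + t_crit * se)   # scale by the SE and shift by the mean
print(ci)                                       # 95% confidence interval for the mean
```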
PS: I am grateful for correction of any conceptual mistakes I may have.
For similar reasons to why we don't simply take the deviations from the mean (which sum to 0). Simply taking absolute values gives us a measure that is often difficult to compute (well, more so in some fields than in others). More importantly, it doesn't have the properties, both practical and geometric, of the sum of squared deviations. The advantage of the latter is akin to that of absolute deviations in that it
1) ensures a value rather than consistently yielding 0 (as simply using deviations from the mean would)
&
2) it leads naturally to the definition of standard deviation.
The sum of absolute deviations can be a useful metric (it provides a measure of the average "width", which the standard formulation of variance doesn't). However, the formulation of standard deviation provides a superior measure of average "width" for a given distribution of expected values.
Truthfully, I find neither useful much of the time. There is a great book, Dictionary of Distances by Deza & Deza, and there are numerous clustering and classification algorithms. For nonlinear data points in multi-dimensional space, measures of central tendency and those based on them can fail to capture important properties of many datasets. However, the absolute value of deviations from the mean fails in this way more than the standard deviation does. In general, it's always best to use different metrics and graphs/plots to understand the distribution of points for some dataset. That's the average of my 2 cents.
Uhm... because that's how everybody defines variance, probably? And you don't want to reinvent the names and reassign a name to another concept?
There are other quantities describing the spread of the data, including the IQR (the 75th percentile minus the 25th percentile) and the MAD (median of the absolute deviations from the median; the mean of absolute deviations from the sample mean is not nearly as robust as the MAD). See http://en.wikipedia.org/wiki/Robust_measures_of_scale. There's been debate about what the best measure is, and this debate has been going on for a couple of centuries. In the 20th century, variance has been winning as it is (1) a sufficient statistic with important theoretical properties; and (2) easier to compute (which @AndrewMessing pointed out). Working with absolute deviations, medians, etc. requires sorting the data, which is an O(n log n) operation, and this needs to be done twice for the MAD (although I am sure there are more efficient implementations).
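For illustration, here is a small sketch (Python/numpy assumed; the data are invented, with one outlier) comparing the robust measures of scale mentioned above with the standard deviation:

```python
import numpy as np

x = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.2, 9.9])   # invented data, one outlier
iqr = np.percentile(x, 75) - np.percentile(x, 25)    # interquartile range
mad = np.median(np.abs(x - np.median(x)))            # median absolute deviation
sd = x.std(ddof=1)                                   # sample standard deviation
print(iqr, mad, sd)   # the outlier inflates sd far more than IQR or MAD
```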
I think the very basic reason is that the mean of squared deviations is a differentiable function, while the mean of absolute values is not. In many cases we would like to differentiate the variance (or SSQ) function (or its estimate); least squares is the most conventional example. Whether the variance, defined this way, is the BEST method of determining the "width" of the distribution of a random variable (rv) is another story. It's known that small deviations from normality make the population mean and variance (not their sample estimates, but the population parameters themselves) non-robust parameters, so the median and IQR may be preferred to describe the location and width of an rv.
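To spell out the differentiability point with a one-line derivation (standard calculus, not specific to any poster here): setting the derivative of the sum of squared deviations with respect to a candidate location m to zero immediately yields the mean, whereas the sum of absolute deviations has no derivative at the data points.

$$
\frac{d}{dm}\sum_{i=1}^{n}(x_i - m)^2 \;=\; -2\sum_{i=1}^{n}(x_i - m) \;=\; 0
\quad\Longrightarrow\quad
m \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i
$$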
Dear Burak Alakent:
I mostly agree. The "mean value theorem" one encounters in elementary calculus, and then (hopefully) learns to prove in some analysis course, does not share the term "mean" for no reason. I would say this has more to do with integration than differentiation, but as the two are so related it hardly matters. Also, there is the fact that one can minimize the least-squares linear function of 2 variables in ways that one can't, in the same way, when it comes to absolute deviations from the mean. That said, a population can have values distributed arbitrarily close to a normal distribution and random sampling can still fail to be robust to violations of assumptions. So I agree that certain measures based upon the median are superior to those based upon the mean (again, though, granted certain conditions are met; I can't stress enough how understanding the distribution of one's dataset, in terms of whatever dimensional space one is working in, is key).
An additional motivation is the natural geometric interpretation of summing squares and then taking the square root, as is done to obtain a standard deviation. Euclidean distance is deeply entrenched in our early mathematical development. There is an entire area in mathematics that explores properties of Lp spaces. These spaces define a metric where one raises each value to the power p, sums, and then takes the pth root.
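A small sketch of the Lp idea (Python/numpy assumed; the vector of deviations is arbitrary): raise the absolute values to the power p, sum, and take the p-th root.

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])        # arbitrary vector of deviations

def lp_norm(v, p):
    """Lp norm: (sum of |v_i|^p)^(1/p)."""
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

print(lp_norm(v, 1))        # L1: sum of absolute values (absolute-deviation flavour)
print(lp_norm(v, 2))        # L2: Euclidean length (squared-deviation flavour)
print(np.max(np.abs(v)))    # p -> infinity limit: the largest single deviation
```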
As stated, there are many ways to look at this, and computational ease was a much bigger consideration in the past, but variance does behave very well. In particular, I use a great deal of regression through the origin, where WLS is very important. You can use weighting with other methods, but WLS regression and the study of heteroscedasticity has very nice application. Still, as indicated above, experimenting with different measures may tell you a great deal. - Nice question.
Dear Debabrata, there also exists the Mean Absolute Deviation (MAD), which is exactly what you mentioned. In general there is the concept of moments, and the variance is one of them (the second central moment).
I spent some time looking for free sources on the historical reasons and (mostly) modern justifications for the least squares method as well as a treatment of alternatives. For the most comprehensive historical treatment, see:
A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713 to 1935
(http://www.math.ku.dk/~sjo/papers/HaldBook.pdf)
For a more specific historical treatment of the least squares method and theories of errors/residuals/dispersal, as well as derivations and other mathematical aspects of the least squares method vs. alternatives, see
Less than the Least: An Alternative Method of Least Squares Regression
(http://www.mcm.edu/mathdept/sarah.pdf)
but note that the discussion of pros and cons of least squares regression and the alternative is both outdated and not all that comprehensive. That is to say, don't use it as a guide for regression analysis or optimal dispersion measures. For that kind of treatment, see e.g.,
Modern Insights About Pearson’s Correlation and Least Squares Regression
(https://www.soph.uab.edu/statgenetics/People/MBeasley/Courses/Wilcox-insightOLS-2001.pdf)
For a more technical (but with a brief history) account of the problems with least squares regression and alternative methods as well as applications, see
Regression Analysis of Censored Data with Applications in Perimetry
(http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=39438&fileOId=1670650)
For a discussion on robust measures of central tendency, see
Modern Robust Data Analysis Methods: Measures of Central Tendency
(http://www.researchgate.net/publication/9028424_Modern_robust_data_analysis_methods_measures_of_central_tendency/file/79e4150a40b6c01d39.pdf)
Hopefully, these give some idea as to how the least squares method and definition of variance arose, as well as an idea as to its problems and alternatives.
The "nicer" mathematical nature of squaring is apparent when teaching variance and standard deviation to a group of non-statistics students, like medical students for example. As soon as you start to talk about absolute values, they glaze over in zombie form and you've lost the class. Tell the students to square something and they are comfortable with it. And as posted here by others - just much nicer properties.
Dear Melanie Poulin-Costello:
You raise an interesting and (in my opinion) important issue. I have tutored maybe 2 or 3 medical students in intro stats (and, interestingly, a fair number of nursing students), but almost without exception my students (in research methods classes, intro stats classes, or tutoring students taking such classes on the side) have been in the social & behavioral sciences. I think it is safe to say that such students are at least as badly off when it comes to mathematical competency/literacy as medical students. Pre-med programs/degrees generally require at least passing Calc I (admittedly, this doesn't actually teach calculus but rather how to use pre-calculus to apply a bunch of rules, thereby avoiding the conceptual challenges a rigorous introduction to calculus poses). Regardless, many of my students are unable to apply basic algebraic manipulations to formulae or equations, are stumped by the basic logic underlying probability and elementary statistical methods, and aren't interested in more than learning the bare minimum to pass courses.
Worse still, those who go on to graduate school and take multivariate statistics will more often than not learn just enough to associate X method with Y statistical technique and how to implement it using Z statistical software package.
When I was an undergrad, my interesting mix of majors and my job working for the academic support department meant tutoring an odd mix of students. I mention this only because I happened to tutor a young woman in Latin or in Greek (I think it was Latin) who, I found out, was an economics major before she switched to classical studies. The only reason she switched is because she couldn't pass calculus. Naturally, I was sympathetic: with very few exceptions, there's no point in taking Calc I unless one takes subsequent calculus courses, real analysis, or some combination of linear algebra and multivariable calculus (e.g., a course based on Shifrin's Multivariable Mathematics, Hubbard & Hubbard's excellent text, or even the classic treatment by Apostol). However, modern calculus courses can't really be simplified any more than they already are, and universities typically don't want to grant B.S. degrees to students whom they can't even pretend to have exposed to a certain amount of mathematics, as math is the tool and language of the sciences.
So why don't we teach the foundations of statistics in intro stats and require those who can't pass such a course to take some transitional quantitative reasoning course(s) or to switch majors? What is the point in arming future researchers, scientists, etc., with statistical methods whose logic they don't understand, so that at best they never use them? These techniques were known to be inadequate by those who developed them: "Many of the statistical methods routinely used in contemporary research are based on a compromise with the ideal (Bakeman et al., 1996). The ideal is represented by permutation tests, such as Fisher’s exact test or the binomial test, which yield exact, as opposed to approximate, probability values (P-values). The compromise is represented by most statistical tests in common use, such as the t and F tests, where P-values depend on unsatisfied assumptions."
Mielke, P. W., & Berry, K. J. (2007). Permutation Methods: A Distance Function Approach (2nd Ed.) (Springer Series in Statistics).
Permutation tests, like many methods superior to those taught in intro stats courses, were formulated in the first half of the 20th century. However, they were basically unusable without computers. Other advances over the standard methods taught can be found in Rand R. Wilcox's Basic Statistics (or his other two intro stats textbooks), which is as elementary as any intro stats textbook but doesn't teach poor, inadequate measures and tests while leaving their limitations insufficiently explicated.
In short, let us grant that students required to take intro stats classes (and even graduate stats courses) often lack the familiarity with conceptual mathematics, probability theory, or (in the case of multivariate stats) linear algebra to really "get" even fairly elementary statistics. Does this mean we should teach them outdated methods we know are inferior, partly due to tradition and partly because some of the advances are more complicated?
Heisenberg is usually credited as having developed the matrix mechanics formulation of quantum mechanics. However, he didn't know what matrices were (Born told him, and neither one really understood them enough to take Hilbert's advice). Today, students in physics are required to learn matrix algebra (granted, with a less powerful notation immensely irritating to those of us who had to learn it after learning linear algebra). We changed what we teach based upon developments in physics. Boole, Turing, Shannon, and others all developed theoretical aspects of the computer sciences which either pre-dated computers or pre-dated all but the earliest computer, just as Pearson, Fisher, Cox, and others did for statistics. Yet we don't teach computer science students the state of computer science when there weren't any computers. We require them to learn logic, discrete mathematics, and more, including more modern formulations of information theory and Boolean algebra than Shannon and Boole developed. So why are students in the social, behavioral, medical, cognitive, & psychological sciences using simplified versions of methods developed anywhere from over 2 centuries to almost 100 years ago when those who developed them knew their flaws and inadequacies?
Dear Dr. Corneli:
There are a few issues with your response that perhaps you are well aware of but others may not be.
1) Linear algebra is LINEAR. Variance, probability distributions, and every single use of any regression are NOT (you don't need to fit points to a line that they already fall on). Thus defining the deviation of points from a line by the way lines in n-dimensional space are themselves defined gets us nothing (at least, not alone).
2) The initial question concerned why variance is defined as it is. By noting that variance so defined defines the standard deviation, you don't really address the question. As variance isn't the result of linear algebra (it is defined in linear algebra as it is mostly for reasons that predate the development of matrices by Cayley, let alone linear algebra, and in any event the least squares method predates pretty much the entirety of linear algebra), one can't really offer a circular defense of variance by reference to the magnitude of vectors.
3) The MEDIAN, which minimizes the sum of absolute deviations, is often superior as a measure of location. Nor is there anything wrong with the logic of using absolute deviations from the mean. Indeed, a primary issue relates to computational difficulties, not logical flaws.
4) The acceptance of squared deviations as THE standard has a great deal to do with error distributions that are continuous, assume normality (indeed, are defined by it), and depend upon assumptions that are generally violated in modern usage and are often not robust to such violations.
I feel compelled to chime in, as Mr Messing's comments are correct and I feel that Ms Corneli conflates a variety of ideas and appears unclear, except for her statement that the sd is the square root of the variance. Given that Ms Corneli is MIScorrecting someone, and given that there is an audience of learners present, I must indicate that no, Ms Corneli, you are the one who is mistaken. While I am all for open discussions across people's fields of expertise, I suggest we all restrict our pronouncements to our professional areas of education and primary competence, whether it be statistics or biology.
I suggest we take up your misconceptions with Simon Tarvare, then, as there is an impasse and need of some clarification. I searched for him, but he is not on ResearchGate. Feel free to ask his opinion on the matter. Happy to have him arbitrate a scientific matter. Please, ask him.
I did not suggest you apologize. I asserted that it is you who are confused and mistaken to MIScorrect someone. I started to answer questions here because I thought it had a framework of scientific professionalism at its core. If this were a random crank page where anything goes, or Facebook, I wouldn't take the time.
I think the original question raises some very subtle and deep issues about the slightly arbitrary choice of why variance is defined the way it is. Interesting, sort of along the lines of contemplating why we operate in base 10 when base 2 and 16 are quite nice alternatives. Yes, we have 10 fingers, and yes, we have the Pythagorean theorem, but we have become so deeply cognitively wedded to these that we rarely stop to think about why, and about the alternatives. What would three-fingered men do on a planet whose radius is so small that the Pythagorean theorem didn't work on its surface? Would they have first mastered base-3 great-arc-segment geometry instead of planar geometry?
I thought this was a really interesting thread and really enjoyed what Jochen and Andrew wrote. Clear and good scholarship. I loved the history of mathematics in my studies, and it is rare to get to listen and talk about these ideas. It bothered me to see them wrongly dismissed.
Thank you all for your scholarly contributions in response to my question.
Why is variance calculated by squaring the deviations? Perhaps the reason is that negative numbers may exist in Cartesian graphs and classical mathematical methods, but not in the real natural world. It is like when you look at a rotating object over a plane: if you look from one side it rotates clockwise, if you look from the other side it rotates counter-clockwise, and both refer to the same situation. Negative numbers are a useful resource of mathematical language, but they create this kind of confusion, which is apparently solved by appealing to "squaring the deviations". This may mean that the grammatical rules of our methods affect our interpretations of measured realities. Thanks, emilio
There is no problem with negative numbers. If we wanted to measure the average distance between an observation and the mean, then we could use the absolute value of the difference. And, indeed, this produces the mean deviation.
The answer lies in the idea of variance, which is a measure of the error produced by the use of a single parameter. The mean is the parameter that minimises the total error so long as we define error as the squared difference between the observed (the data) and the expected (the parameter). So the mean is the best single guess in the absence of any other information. We can test other parameters for their power to reduce error by comparing them with the most precise single-parameter model so long as we match the parameter with the error. If we are to test error using mean deviation, then the appropriate parameter is the median, which minimises absolute error. But if we are to use squared deviation, then the appropriate parameter is the mean.
So the definition of error has to be linked with the model parameter. In the case of the mean, the error that is minimised is the squared deviation.
Or so I was given to understand many years ago.
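A quick numerical check of this pairing (a sketch with invented data and a crude grid search, Python/numpy assumed): the squared-error criterion is minimized near the mean, the absolute-error criterion near the median.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])        # invented data with an outlier
grid = np.linspace(0.0, 110.0, 2201)             # candidate single-parameter "guesses"

sq_err  = np.array([np.sum((x - m) ** 2)  for m in grid])
abs_err = np.array([np.sum(np.abs(x - m)) for m in grid])

print(grid[np.argmin(sq_err)],  x.mean())        # minimizer of squared error: the mean (22.0)
print(grid[np.argmin(abs_err)], np.median(x))    # minimizer of absolute error: the median (3.0)
```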
Debabrata, I hope you will not mind if I point out something that you might have overlooked. Taking the mean of the absolute deviations will not result in the "variance", which has a squared unit; it will result in the mean deviation, which has an un-squared unit.
Getting the average of the absolute deviations is one option for a standard-deviation-like measure of spread, aside from taking the square root of the averaged squared deviations. Getting the average of the deviations themselves (with positive and negative values that always sum to zero) is not a choice.
We are actually trying to get an "average", a "central value", of the deviations, as Fausto and Xin Yan suggest.
If we use the mean of the absolute deviations, it will not be a "center" but a maximum. The only option left is to take the square root of the averaged squared deviations, i.e., to compute the variance and then take its square root.
Accepting the average of the deviations, which will always be zero, as the standard deviation would have a tremendous impact on Statistics. Ed
Why variance? ... Because of the bias!
If you can calculate it, it is because you have data, a sample, probably from an experiment. Actually, it is called the sample variance. Now imagine that you do another experiment, collect other data, and get a slightly different variance. Which is right? If the data source is assumed to be the same, you can assume a theoretical distribution with a unique variance, also called the population variance. If you do an infinite number of experiments and compute the average of the sample variances, that is, the expected variance, you get the population variance. If you use the absolute deviation instead, you get a bias, a value different from the population value. In practice we have only a sample, and we work with squares instead of absolute values to avoid a bias.
I do not understand your point, Pablo. Variance is defined as E[(E[X]-X)²], and E[X] is the expected value, which is the sum (or integral) of all possible values multiplied by their respective probability (density). This has nothing to do with the sample variance. If you have a sample, you can estimate the variance, and such an estimate can be biased. When each observed value (x1, x2, ... xn) is assigned the same probability, an estimate can be calculated as (1/n)*sum_i((x_i - mean(x))²). For sampling with replacement or for samples from infinite populations this estimate is known to be biased, and the bias is a factor of (n-1)/n. Therefore the sample variance, as an unbiased estimator of the variance, is given as (1/(n-1))*sum_i((x_i - mean(x))²).
The same applies to any other measure of variability. The estimate for the mean absolute deviation E[|E[X]-X|], (1/n)*sum_i(|x_i - mean(x)|), is also biased, for instance.
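A simulation sketch of the bias point (Python/numpy assumed; the sample size and number of replications are arbitrary choices): with normal data of variance 4 and n = 5, the 1/n estimator averages to about (n-1)/n · σ² = 3.2, while the 1/(n-1) estimator averages to about 4.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, reps = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
v_biased   = samples.var(axis=1, ddof=0)    # (1/n)     * sum of squared residuals
v_unbiased = samples.var(axis=1, ddof=1)    # (1/(n-1)) * sum of squared residuals

print(v_biased.mean())      # ~ sigma2 * (n-1)/n = 3.2
print(v_unbiased.mean())    # ~ sigma2           = 4.0
```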
I will try to explain the reasons from two points of view: one theoretical and the other practical.
Theoretical.
In mathematical statistics, we learn the properties of a good estimator: consistency, efficiency, and sufficiency. To describe a distribution in the statistical sense, only a measure of central tendency and a measure of dispersion are needed. The idea is to focus on the latter, because it is the problem under discussion. Of all the measures of dispersion, the best one based on the previous concepts is the average squared deviation (for the population). Other measures of variability, such as the mean deviation, the range, and to a lesser extent the average of the absolute values of the deviations from the mean, do not have better properties than the variance.
Practical.
Naturally, taking the square root of the average squared deviation leads us to a measure of dispersion in the units in which the character is expressed.
Your method does measure the spread of the data, it just is not called variance. Your method might be more appropriate for some applications than variance, but there are mathematical reasons why people stick to the root-of-average-square thingy.
Jochen, in some part of your interesting post (June 18) you mention:
"The maximum-likelihood estimate of the position turned out to be simply the average of the measurements. Gauss assumed a uniform prior when he noted that this should indicate the most likely position." I think that it should say "the most likely position given the randomness of the dataset".
It means that the expected mean assumed equal frequencies for each of the N dataset values. This is called the Laplace criterion, and it also implies that each dataset value behaves as the mean Ui of a tiny 1/N interval. But when someone uses the sample U to estimate the variance as a second parameter and uses them to estimate a new distribution called the Normal, or Gauss bell, then he/she has assumed two distinct structural distributions for the same dataset: the Laplace criterion and the Normal structure (with other inner traits such as symmetry and U at the median position). Both premises cannot coexist in the same analysis, because that accepts a contradictory duality in the fundamentals of the analysis. The question is about why squared deviations, something I cannot answer. I only know that the estimated U may be higher or lower than the real U, and this sharply affects the variance, or the standard deviation, etc., and even more the parametric normal model obtained.
From Laplace's book "A Philosophical Essay on Probabilities" there is this quotation:
"The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally probable, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all the cases possible is the measure of this probability, which is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all cases possible." Fifty pages later, dealing with another topic, he mentioned: "One may draw from the preceding theorem this consequence which ought to be regarded as a general law, namely, that the ratios of the acts of nature are very nearly constant when these acts are considered in great number". (Laplace, P.S.; 1902; p.61)
Laplace, Pierre Simon de. A Philosophical Essay on Probabilities (John Wiley and Sons, 1902; translated from the 6th French edition), pp. 11-12 and 61. Consulted on Sept. 4, 2014, at https://archive.org/stream/philosophicaless00lapluoft#page/n5/mode/2up
Thanks, emilio
Debabrata -
Under survey sampling, consider the case of a population divided into (independent) strata. As an analogy, if we take the variance to be the area of a square formed by a standard deviation on each axis, for each stratum, then we can add the areas of the squares to obtain the variance for the population as a whole. But we cannot just add the standard errors.
This relates to the difference between the sum of squares and the square of sums, and there is a body of mathematics (Gaussian) built around this, which leads to the Central Limit Theorem.
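A tiny numerical illustration of that additivity (a sketch with simulated independent components, Python/numpy assumed): variances of independent parts add, while standard deviations do not.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 2.0, 1_000_000)    # independent component 1, sd = 2 (variance 4)
y = rng.normal(0.0, 3.0, 1_000_000)    # independent component 2, sd = 3 (variance 9)

print(np.var(x + y))                          # ~ 13 = 4 + 9: variances add
print(np.std(x + y), np.std(x) + np.std(y))   # ~ 3.61 vs 5: standard deviations do not
```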
It is also useful to include the concept of bias (really bias-squared) with variance as the MSE.
But why keep this if we may do better with other methods, and with more computer intensive methods, say for the case of regression? - (1) The "bad" reason is momentum; the same thing that keeps us stuck with overuse and misuse of the p-value. (2) The "good" reason is that this relic of the past, unlike many others, actually works well, much of the time. Every measure has its pros and cons; situations to which they are or are not very sensitive. Variance - leading to standard errors - is often a good and useful measure. Others may be also.
Jim
Variance conceptually gives an idea about the heterogeneity of the data set. The more the variance of the data, the more the spread of the data set. Generally, for computational purposes we use the squaring method, as this is a better and more convenient process than taking the average of the absolute deviations.