Mean and median are often both presented as descriptive statistics, but this is actually not quite the case. The median is a central value of the data: it is the value for which one expects half of the (possible or observed) values to be smaller and the other half to be larger.
It is more of a coincidence that the mean is also (often, but not always!) close to the center of the data; its derivation is completely different. The mean is the result of a probability model for the "errors". One expects the values to scatter around a common center, and this center is determined as the value for which the observed data have the highest likelihood. This likelihood has to be calculated from a probability distribution for the deviations of the observed values from this hypothetical center. Maxwell, Herschel and others derived an appropriate probability distribution from two simple assumptions: the expectation about the errors is symmetric (positive and negative errors of the same absolute size are expected with the same probability), and all errors are congeneric (they share the same probability distribution). The result of this derivation is the normal distribution. Maximizing the likelihood under the normal distribution then leads to the mean as the "best guess" for this assumed center. This is quite a different interpretation than just being the "center of the data".
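To make that last step concrete, here is a minimal sketch of the calculation, in standard notation with x_1, ..., x_n denoting the observed values:

\[
\ell(\mu) = \text{const} - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2,
\qquad
\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
\;\Longrightarrow\;
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\]

So the value maximizing the normal likelihood is exactly the sample mean.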
Therefore I'd say that you should use the median to describe the center of the data, and you should use the mean if your aim is to model such a common center for which your expectations about the errors are in accordance with the two assumptions given above.
If the data are normally distributed, the mean is appropriate. If the distribution is not, for example log-normal or similarly skewed, you could use the median.
Thanks a lot, Matthew, for the response. Your suggestion is well noted. I asked this question in line with a criticism raised by a professor of statistics I know against the use of the mean by most economists to depict the central value of a sample/population. Besides the normality (or otherwise) of a data set, what other attributes of the data help in identifying when to use the mean or the median?
In general, the median is more robust with respect to outliers. Hence, if you expect outliers, the median is the better choice. The mean, however, is sometimes more convenient, because it is directly involved in other statistical measures or decision models/statistical tests. Yet there are also tools for the median to calculate confidence intervals and/or statistical tests for significant differences between the medians of two observation sets. Thus, I would prefer the median if I cannot assume a normal distribution (which should be tested in advance).
I agree with Matthew and Jochen. I have found that if the data follow a normal distribution, the mean works well; however, most of the research I have worked with has had some sort of skew due to outliers, which then requires the median to be used as the measure of central tendency. This is why average income is often reported using the median rather than the mean, because of major outliers such as multi-billionaires.
It is important to consider the purpose of your analysis as well as the distribution of the data, particularly when you are analysing costs. The mean has the property that total = N * mean. So if I run a hospital, to plan my budget I want to know the mean expenditure per patient and the mean income per patient, because those numbers tell me whether my hospital can cover its bills. And it is the mean that I need even though both of those distributions are undoubtedly very skewed. Knowing the median expenditure and median income would tell me very little about this critical question; the hospital could be losing money even if the median income is much greater than the median expenditure.
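A minimal sketch in Python, with made-up per-patient figures (purely illustrative, not real hospital data), of how the mean recovers the totals while the medians can point the wrong way:

    from statistics import mean, median

    # Hypothetical, heavily skewed per-patient figures (illustrative only)
    expenditures = [500, 600, 700, 800, 90000]    # one very expensive patient
    incomes      = [2000, 2100, 2200, 2300, 2400]
    n = len(expenditures)

    # N * mean reproduces the totals that decide whether the bills are covered
    print(n * mean(expenditures), n * mean(incomes))   # 92600 vs 11000 -> losing money

    # The medians suggest the opposite: median income > median expenditure
    print(median(expenditures), median(incomes))       # 700 vs 2200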
David has hit on an important and frequently overlooked issue in the selection of the mean or the median as an appropriate metric for a particular analysis. His example is paralleled by similar situations when one is interested in estimating volume, mass, exposure to a chemical, etc. Any time the parameter of interest is represented by an integral (a total), the median will not be appropriate.
Suppose one is interested in total exposure to a chemical over a period of 4 months with monthly doses of 1, 1, 1, 97 (units). The median dose is 1 (unit) whereas the mean dose is 25, which, as in David's example, yields the total intake (100 units) when multiplied by the number of months, 4. Scaling up the median in the same way would estimate the total intake at only 4 units, which is obviously biased low.
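The same arithmetic as a quick check in Python, using the doses from the example above:

    from statistics import mean, median

    doses = [1, 1, 1, 97]        # monthly doses from the example
    n = len(doses)               # 4 months

    print(n * mean(doses))       # 4 * 25 = 100 -> matches the true total intake
    print(n * median(doses))     # 4 * 1  = 4   -> badly underestimates the total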
Interesting... Thanks a lot, Kern, for the brilliant point and the example given. Your point is very well noted and much appreciated. I do, however, seek clarity on the phrase "parameter of interest is represented by an integral" in your statement above. Are you referring to parameters associated with continuous variables, or also discrete variables? I ask this because I know integrals are mostly used with continuous variables, and talking of a parameter being represented by an integral sounds a little confusing to me.
David, the integral over a continuous variable is analogous to the sum over a discrete variable. In practice we almost always have discrete variables. Income, for instance, is given as money per person, so to "integrate" the income over all persons simply means summing the individual incomes. The same holds for the example with the monthly doses: "dose per month" is again discrete, and summing up all the discrete monthly doses gives the total dose over the whole considered period of time. If one really considers continuous variables, the integral has a different meaning (and different units) than the integrated variable. For instance, if your variable is speed, the integral over time is a distance. When the variable is a probability *density*(!) of a continuous variable, the integral over the quantiles is a probability value.
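A small Python sketch of the analogy (the numbers are made up, purely for illustration):

    # Discrete case: "integrating" income over persons is just a sum
    incomes = [1200, 1500, 900, 2500]            # money per person (made up)
    total_income = sum(incomes)                  # 6100

    # Continuous case: integrating speed over time gives a distance,
    # approximated here by a simple Riemann sum with small time steps
    dt = 0.01                                    # hours
    def speed(t):
        return 50.0 + 10.0 * t                   # km/h, increasing with time

    distance = sum(speed(i * dt) * dt for i in range(int(2 / dt)))  # ~120 km over 2 h
    print(total_income, round(distance, 1))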
Very interesting responses. When dealing with continuous data, I like to get a clear picture of the data by not only looking at the central tendency (mean, median, mode) but also studying the shape and spread of the data.
The mean is essential, of course, and if you divide the variable underlying the CDF by the mean, you obtain the CDF expressed in units of the mean, i.e. as a dimensionless function. I call this method "adimensional" (dimensionless) statistics, because it separates the analysis of the evolution of the mean from the evolution of the distribution. It then becomes possible to observe how the distributive structure and its median (measured in units of the mean) fluctuate when comparing two samples or two points in time. In econometrics, average income usually grows (though not in these crisis times), but its dimensionless distribution does not necessarily improve; this can be seen when the median declines and when half of the total distributed mass of income is received by a smaller fraction of the whole population (which we may call the elite). In the end, the analysis must integrate the real value of the mean with the dimensionless structure of the distribution to understand the whole matter. All of this is related to another important dimensionless representation of a distribution: the Lorenz curve. From this perspective, the median as an indicator is somewhat polysemic and not very informative for those working with samples that do not resemble normal distributions, which only behave as expected when standard deviations are very close to zero and cover only a tiny portion of the cases of interest. This is a theme that incites nice debates among different opinions and experiences.
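A rough sketch in Python of what this looks like in practice (the incomes are invented for illustration): rescale the data by the mean and then ask Lorenz-curve style questions about cumulative shares.

    # Illustrative incomes (made up); rescale by the mean for a dimensionless view
    incomes = sorted([400, 600, 800, 1000, 1200, 1500, 2000, 3000, 5000, 14500])
    mean_income = sum(incomes) / len(incomes)                 # 3000.0
    scaled = [x / mean_income for x in incomes]               # incomes measured in "means"

    # Median measured in units of the mean
    n = len(scaled)
    median_scaled = (scaled[n // 2 - 1] + scaled[n // 2]) / 2  # 0.45

    # Lorenz-curve style question: what share of the population receives
    # half of the total income?
    total = sum(incomes)
    cum, people = 0, 0
    for x in reversed(incomes):        # start from the richest
        cum += x
        people += 1
        if cum >= total / 2:
            break
    print(median_scaled, people / n)   # 0.45 and 0.2 -> the top 20% receive half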
When the distribution of a continuous variable is not normal, it is recommended to report the median rather than the mean, as the median is not affected by extreme values.
What if the median would be more appropriate for non-normally distributed data, but other studies have reported the mean? In order to compare our results with those findings, should we then prefer to go with the mean? And presenting both is not really a good option for us.