The "likelihood" got its name because it is a quantity proportional to the probability of seeing the data you see, assuming a certain value for the parameter you are trying to estimate.
Every sample can be seen as a collection of random variables. In other words, every observation is the result of a random process: ideally, the observation is a random draw from the population.
This implies that every observation can be linked to an underlying distribution with a specific density function (for continuous variables) or probability mass function (for discrete variables). Let's call that function F(). The key point here is that this function is defined by the parameter you want to estimate.
Now you can calculate the probability of observing that single observation, or, in the case of a continuous variable, a measure proportional to that probability, i.e. the density. Call that observation Xi and the parameter Theta; then you write this down as:
F(Xi | Theta)
An example:
You want to estimate the mean and the variance of students' weights. To do that, you select 10 students and weigh them. You calculate the mean and the variance: these are the two parameters describing your density function. Every student's weight is an observation Xi.
Say you assume a normal distribution. Now you have a density function you can use, defined by the two parameters you estimate. So you can calculate the likelihood of every single observation. To get the likelihood of the complete sample, you apply the rules of probability. And this is the most important part:
**You assume that the students are selected independently**
Only in this case can you use the rule given by Eik: the probability of getting exactly this sample can be calculated by simply multiplying the probabilities of every single observation. For continuous variables, you work with the density, which is proportional to the probability.
So the multiplicative nature of the likelihood is simply due to:
- the assumption of independent observations
- the fact that you look at the likelihood of seeing observation 1 AND 2 AND 3 AND 4 AND ..., given certain values for the estimated parameters
- the rules of probability.
In a maximum likelihood approach, you do this for multiple values of the parameters of interest, and look for those values where the likelihood is maximized.
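To make this concrete, here is a minimal sketch in Python. The weights and the grid of candidate parameter values are made up for illustration, and the brute-force grid search only mimics the idea of "trying multiple parameter values"; it is not how the estimates would be computed in practice.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical weights (kg) of 10 independently selected students.
weights = np.array([62.1, 70.4, 58.9, 75.2, 66.3, 69.8, 61.5, 72.0, 64.7, 68.2])

def likelihood(mu, sigma, data):
    # Product of the normal densities F(Xi | mu, sigma) over all observations.
    # Multiplying is justified only by the independence assumption above.
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

# Evaluate the likelihood on a grid of candidate parameter values and keep
# the pair that maximizes it (a crude maximum likelihood search).
mus = np.linspace(55, 80, 251)
sigmas = np.linspace(1, 15, 141)
best_value, best_mu, best_sigma = max(
    (likelihood(m, s, weights), m, s) for m in mus for s in sigmas
)
print("approximate maximum likelihood estimates:", best_mu, best_sigma)
```

For a normal model, these maximizing values coincide (up to the grid resolution) with the sample mean and the standard deviation computed with divisor n, so the grid search simply rediscovers the familiar estimates.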
-----------
On a side note: very often software uses the log likelihood, exactly to get around the multiplicative nature. The log of a product is a sum, and sums are easier to work with.
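As a quick numerical illustration of this side note, re-using the hypothetical `weights` and the `likelihood` function from the sketch above (the parameter values 66.9 and 5.0 are arbitrary candidates):

```python
# The log of the product of densities equals the sum of the log densities,
# and a sum is much better behaved numerically than a product of many
# small numbers.
def log_likelihood(mu, sigma, data):
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

print(np.log(likelihood(66.9, 5.0, weights)))  # log of the product ...
print(log_likelihood(66.9, 5.0, weights))      # ... equals the sum of the logs
```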
This is a very nice question indeed. You are asking why we need to 'multiply' the densities. It is obvious that you are a person who wants to understand the basics first, before entering into deeper mathematics.
When you estimate the population mean M from a number of observations, you use the sample mean m as the estimator. Indeed, if x1, x2 and x3 are the observations, we assume that
x1 = M + e1,
x2 = M + e2, and
x3 = M + e3,
where e1, e2 and e3 are the measurement errors involved. (I have taken three observations here, as is actually done in laboratory experiments in school-level Physics. If you have to measure the length of a rod, you measure it three times and then take the mean, the sample mean, as an estimate of the 'unknown parameter', which is the length in this case.)
Now, the scalar product of the error vector (e1, e2, e3) with itself is the sum of squares of the errors, which is nothing but the square of the Euclidean norm of this vector, i.e. the squared distance from the origin (0, 0, 0).
We aim to estimate M so as to minimize this sum of squares of the errors. To minimize this expression, using simple calculus we differentiate the sum of squares with respect to M and set the derivative equal to zero; this leads to the conclusion that the least squares estimate of M is nothing but the sample mean m.
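Spelling that calculus step out in the same notation (a routine computation, written here only to make the step explicit):

d/dM [ (x1 - M)^2 + (x2 - M)^2 + (x3 - M)^2 ] = -2 [ (x1 - M) + (x2 - M) + (x3 - M) ] = 0,

which gives 3M = x1 + x2 + x3, i.e. the least squares estimate is M = (x1 + x2 + x3)/3 = m, the sample mean.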
Now, I am coming to your question. The normal probability density function is directly proportional to the exponential of a squared term with a minus sign. Therefore, if we take the natural logarithm of the product of normal probability densities with unit variance, with reference to x1, x2, x3 for example, we end up with the negative of a sum of squares (plus a constant), so maximizing it is the same as minimizing that sum of squares.
Have you got the answer you were looking for? The basic idea is that the sample mean is a least squares estimate of the population mean even when we do not assume any underlying probability law. When we assume that the underlying probability law followed by the errors is normal, we proceed to maximize the so-called likelihood function. Our actual aim is to minimize a sum of squares, which, in the case of normally distributed errors, appears only when we take the logarithm of the 'product of the densities'. That is why we multiply the densities, for otherwise we would not be minimizing a sum of squares. Just observe that the sample mean is a least squares estimate of the population mean. It is a maximum likelihood estimate as well, once we assume normality of the errors.
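To make the link explicit in the same notation: with unit variance, the product of the three normal densities is proportional to

exp( -[ (x1 - M)^2 + (x2 - M)^2 + (x3 - M)^2 ] / 2 ),

so its natural logarithm is, up to an additive constant, -(1/2) [ (x1 - M)^2 + (x2 - M)^2 + (x3 - M)^2 ]. Maximizing this over M is therefore exactly the same as minimizing the sum of squares above, and the maximizer is again the sample mean m.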
Thanks Hemanta for your explanation. My understanding problem is different: if we assume that
x1 = M + e1, x2 = M + e2, and x3 = M + e3,
then we are assuming that each and every one of the N measured values contains errors, so necessarily the mean M is also erroneous. So in the end our estimates of e1, e2 and e3 contain a second level of errors, and we may speak of "errors of errors" too. At the same time, the same Gaussian distribution formula is assumed for the variable X and for the errors. There are so many strange assumptions that I wonder whether this is circular reasoning based on a Gaussian fantasy behind a mathematical mask, one that may be formally right but has no real significance, because it ultimately states that the measured data are wrong, yet declares that it can "minimize" errors built from errors. Is this observation valid? I think the only valid approach requires assuming that the data are well measured, so we may work with them as they come. Thanks, emilio
Emilio, I think your problem is somehow related to your conception of the "errors". I think "error" is a wrong/misleading word. It would better be called "deviation", in particular the deviation between observed and predicted values. Since any model is a simplification, there will always be deviations that cannot be "explained" by the model (we like to use). However, we can ask what we can *expect* from such unexplainable deviations. Such expectations are formulated as probability distributions (Gauss is only one of many!) that can often be derived from simple and (hopefully) reasonable basic assumptions (e.g. "real numbers" and "symmetry" lead to the Gauss model; "counts" lead to the Poisson model, and so on). Now, if we can express expectations about deviations (always relative to a given model), we can calculate the likelihood of the whole data set, and eventually we can seek the model that obtains the highest likelihood.
You further say that there is an error associated with the model itself. This point of view is nonsensical, in my opinion. Models are not reality, and they are not features of reality. All models are wrong, so to speak. The key is to find a model that is useful (e.g. to get new ideas and insights, make predictions, or estimate effects). So we are left with interpreting which and how many features of the available data can be "explained" by some given model.
Thanks Jochen for your prompt comment. My view, and yours, is that models must follow data; the model must contain some isomorphism with the data as a fixed referent. Each data set may be represented by many models, and in the case of statistics, quality control requires that we keep the best models and discard those that neither fit nor explain the data in a "satisfactory" way. In teaching, the simplest models and methods that fit the data well should be presented to students; otherwise, teaching them becomes an unintentional torture of the students. I want to be radical about this point: it is time to send Gauss curves and their derived theories into retirement, with honors if possible. If we keep promoting statistical packages and questions to prop up those theories, there is a risk that RG turns itself into a marketplace of researcher points, conventional wisdom, and low-quality statistical models and packages. Criticism of ourselves and of others is important as a precondition of quality.

I partially agree with "Models are not reality, and they are not features of reality", because they are interpretations, and as such they become part of reality when they are applied, discussed here, and taught to young people who always ask for convincing reasons. I have no problem recognizing the importance of Poisson (and other models) in some specific cases. What bothers me is the quantification of subjective feelings in statistics (like confidence, credibility, likelihood, degrees of freedom, robustness, fuzziness, and others). The problem is to develop models and methods to interpret any set of data, as you mention, without appealing to feelings, and this is a practical issue that always requires making additional subjective premises; those who can do it best are the researchers in their own fields, because they work with facts and real data. I have already proposed a method to advance in this direction, and it is open to criticism; it has the advantage of being understandable, free, and teachable. I am sure other good ones may already exist. Thanks, emilio
Thank you Emilio for your detailed answer. As far as I can see, you seem to "discredit" the use of the Gauss distribution in statistical models, almost in general, because of its abuse. You give the analysis of psychological/mental parameters as an example. I agree that in these cases the required basic assumptions are clearly violated. Such data would be better modeled with (ordered) logit models or some other, more specialized models (don't ask me which; I am no expert here). However, discrediting the use of the Gauss distribution in general is to throw out the baby with the bath water. There are valuable fields of application, and the analysis is based on the likelihood function, whose shape is relatively often well enough approximated by a Gauss curve (CLT). What I wanted to point out in the last post was that the (typically taught and stated) justification of its use is wrong, misleading, and disastrous for any understanding of statistical reasoning. It is taught in the wrong way, and so wrong conclusions about its applicability and wrong interpretations of its results are drawn. The problem is not the distribution, nor even its theoretical foundation, but the wrong concepts with which it is taught and discussed.
Right now, after reading your post again, I understood it differently: are you saying that probability is a "subjective measure" and that this is your problem? If so, we (at least I) can stop, or at least postpone, the further discussion, because this is a hotly debated topic among many people who are all smarter than I am and know much more about it than I do. My personal view is that probability must be subjective, since it describes a "state of mind" we have (-> expectations). There is nothing objective about probabilities. We can only further demand or wish that our expectations are (1) not contradictory to the available data and (2) reasonable w.r.t. some accepted measure of reasonability (utility? precision? costs? ...). As probabilities are tools to represent our knowledge, they are no less subjective than knowledge itself.
Ok, then there is one point where I disagree: "In teaching, the simplest models and methods that fit the data well should be presented to students; otherwise, teaching them becomes an unintentional torture of the students." (*) Teaching students should be oriented toward the "needs of science", not the "needs of students" and not the "needs of institutions". Teaching simple recipes brought us to the point where we are today. (Almost) no one is getting to the bottom of "science", really asking what knowledge is and how much any newly available data can change this knowledge. By this teaching behaviour ("give simple recipes for all your problems"), students will be further encouraged to see science just as a tool to make a career and earn a lot of money. If this is the aim, the simplest available tools will be used to achieve it. This in turn leads to excessive and unreflective use of hypothesis tests, selection of "positive" results, severe publication bias, and a huge accumulation of worthless publications. Further, the students are left alone in their later research, where they will face more difficult problems. They have never learned to think critically about information, knowledge, models, ..., and they will inevitably use the well-known "standard solutions" without reflecting on whether or not they are suitable. (There is additional strong pressure toward this wrong, or at least disadvantageous, behaviour from other publications: many others did it this way, so one cannot go a different way. Scientists/reviewers/editors are so conservative in such things...)
(*) Models that do not fit the data well are worthless anyway. What would be the benefit of a model that does not fit the observations? But good models may be more complicated than those taught to students. Clearly, one starts with simple cases, but then students must learn to tackle real-life problems, including the real-life uncertainties: uncertainties not only about the estimates but also about the interpretation, the philosophical foundation, and the practical implications.
Jochen: Your comments are always welcome and make me think a lot, whether or not they match my views. I will reply to your sentence "discrediting the use of the Gauss distribution in general is to throw out the baby with the bath water" with two other sentences of Anglo-American popular wisdom and humour: 1) "if you have a serious problem, there are two choices: either fix the problem or learn to live with it", and 2) "in order to know that an apple is rotten, I do not need to eat it all". The Gaussian normal distribution is not a nice baby recipe; it is like a grain of sand in the eye that does not permit a good image of many important statistical situations, so we had better remove it from our eyes than keep it, even if it has been in our eyes for two centuries and has been part of the history of science and education. And if this grain of sand has distorted our images of reality through the various methods and working parameters derived from the initial sand premise, then we should think twice before making students swallow the whole rotten apple. Learning to live with the problem is not the best option; neither is asking our students to swallow the whole discourse derived from the problem (I was also a victim of it during my university years). I believe we'll find new, better solutions.
In your text you gradually develop a flexible position, and the last part shows your main concerns about the teaching and practice of statistics in research and in the modern world. They are plainly justified and show your sense of humanist responsibility about the matter.
In general I admire Gauss for reasons other than his bell curve, which I criticize for its lack of good theoretical foundations and premises, without discrediting him. I understand he never used the subjective terms that bother me: others did that after he died.
I believe that we can fix the problem and develop a new approach (not perfect, but better than the Gaussian models). I have taken some steps, but I need the help of good mathematicians to present them (and to modify them if needed). It is linked to basic concepts behind the Lorenz curve.
I appreciate very much your dedication and participation in this RG forum, thanks, Emilio.