A parameter of a statistical model is to be estimated from iid data. For simplicity, let's consider a single-parameter model. The maximum-likelihood principle says that the best estimate of the parameter is the one under which the observed data are most likely (the MLE). This was already noted by Gauss and later developed further by Fisher. The method requires knowing (or assuming) a probability distribution for the residuals (the "observational error", as Gauss called it).
For a variety of models (normal, Poisson, binomial, exponential, ... in fact all models where the parameter determines the expected value), the MLE of that parameter is the arithmetic mean (average) of the observations [this also holds for k-parameter models such as the gamma, the beta, or the normal with unknown variance, ...].
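To make this concrete, here is my own quick sketch of the Poisson case (the standard textbook derivation, not taken from the paper): with iid observations $x_1,\dots,x_n$ the log-likelihood in the mean parameter $\lambda$ is

$$
\ell(\lambda)=\sum_{i=1}^{n}\bigl(x_i\log\lambda-\lambda-\log x_i!\bigr),
\qquad
\frac{d\ell}{d\lambda}=\frac{\sum_{i} x_i}{\lambda}-n=0
\;\Longrightarrow\;
\hat\lambda_{\mathrm{MLE}}=\bar{x}.
$$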
On the other hand, the average is the solution of the least-squares condition. Therefore I thought that the LSE and the MLE of the expected value are numerically identical. Further, for normally distributed residuals it follows from the likelihood that the MLE is mathematically identical to the LSE.
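As I understand it, the normal case works like this (again my own sketch): with iid residuals $x_i-\mu\sim N(0,\sigma^2)$ the log-likelihood is

$$
\ell(\mu)=-\frac{n}{2}\log\!\left(2\pi\sigma^{2}\right)-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(x_i-\mu\right)^{2},
$$

so maximizing $\ell$ over $\mu$ is exactly the same as minimizing the sum of squares, and the solution is again $\hat\mu=\bar{x}$.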
Now I read the following in "Gauss's Least Squares Conjecture" (Joakim Ekström), Chapter 3, page 7:
"Contrary to the discussed claim of Gauss (1809, §178), that the probability distribution of the observational error is of no importance in practice, the probability distribution is crucial as to whether the method of generalized least squares yields the most probable value under the density criterion or not. Given an independence assumption, the normal distribution is not one of many, but the only probability distribution under which the method of generalized least squares yields the most
probable value."
Why? How? I am lost. For instance, the average is the MLE for the expected count (i.e. for a Poisson-distributed variable), and the average is also the least-squares solution. Doesn't this contradict the quoted paragraph?
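Here is a small numerical sanity check I put together for the Poisson example (the simulated data and variable names are mine, purely for illustration): it maximizes the Poisson likelihood and minimizes the sum of squares over the same sample, and both optima land on the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Simulated iid Poisson counts (illustration only; any non-negative sample works)
rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=200)

def neg_loglik(lam):
    # Negative Poisson log-likelihood in lam, dropping the lam-free log(x!) term
    return -(np.sum(x) * np.log(lam) - len(x) * lam)

def sum_of_squares(mu):
    # Least-squares criterion in the location parameter mu
    return np.sum((x - mu) ** 2)

mle = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded").x
lse = minimize_scalar(sum_of_squares, bounds=(0.0, 50.0), method="bounded").x

print(mle, lse, x.mean())  # all three agree up to numerical tolerance
```

So numerically the MLE and the LSE of the expected count really do coincide here, which is exactly why the quoted claim confuses me.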
Can someone help me get rid of the knots in my brain?