In statistics, when we estimate parameters, sometimes the least-squares estimator is used and sometimes the maximum-likelihood estimator (MLE). Which one is better, and when can each be applied?
.
Hi Jochen,
I forgot almost everything about such topics at least two decades ago, but I'll try to connect the dots:
LSE = ML for estimators with a normal distribution; ML estimators are asymptotically normal,
the last assertion being
- true for the exponential family
- true outside the exponential family under a pretty ugly supplementary assumption (you can find it in Theorem 7.63, page 421 of "Theoretical Statistics", M. Schervish, Springer 1995 ... that's a very big book!)
Radford Neal has a simple example of a distribution leading to an inconsistent MLE and violating Schervish's assumption, which helps in understanding the rationale behind it:
http://radfordneal.wordpress.com/2008/08/09/inconsistent-maximum-likelihood-estimation-an-ordinary-example/
.
The least-squares estimator arises in linear regression problems:
Y = Xb + e, where e is the error term.
Minimize ||e||^2 = ||Y - Xb||^2, where ||v|| is the l2 norm (hence the term "least squares").
On the other hand, when your model cannot be linearised, you perform MLE, i.e. you choose those values (a vector b if there is more than one) that maximize the likelihood function:
f(x1;b) f(x2;b) ... f(xn;b), where f() is the pdf of the distribution that the variable X follows and x1, x2, ..., xn are the values that you have as data for X.
If you are lucky you can find the b with Calculus methods.
Otherwise you have to use iterative methods (starting from a reasonable first guess for your unknown parameter vector b).
Anyway all statistical packages routinely perform the above MLE process.
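To make the two routes concrete, here is a minimal sketch in Python (invented toy data, not from this thread): the LSE of b is computed in closed form and, assuming normally distributed errors, the same b is recovered by numerically maximizing the likelihood.

# Sketch: LSE in closed form vs. MLE by numerical optimization (invented toy data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])   # design matrix with intercept
b_true = np.array([2.0, 0.7])
y = X @ b_true + rng.normal(0, 1.5, n)                      # normal errors assumed below

# Least squares: minimize ||y - Xb||^2; closed-form solution b = (X'X)^(-1) X'y
b_lse, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum likelihood under a normal error model: minimize the negative log-likelihood
# over b and sigma (parametrized as log(sigma) to keep it positive).
def negloglik(theta):
    b, sigma = theta[:2], np.exp(theta[2])
    resid = y - X @ b
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum(resid**2) / sigma**2

b_mle = minimize(negloglik, x0=np.zeros(3)).x[:2]

print("LSE:", b_lse)   # the two estimates of b agree up to numerical tolerance
print("MLE:", b_mle)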
The least-squares estimator (LSE) is a special case of a maximum-likelihood estimator (MLE). The special case is that the probability distribution used for the likelihood is the normal distribution.
The MLE is the parameter value for which the observed data are most likely. This likelihood (of the data) can be calculated for any (assumed) parameter value. If the data are independent, the likelihood is simply the product of the individual probabilities of the observed values. This probability has to be evaluated, and for this a probability model is required. Depending on the kind of data, this could be specified as a binomial distribution, or a Poisson distribution, or an exponential distribution, or... (many many many more)... or a normal distribution.
The MLE is obtained by varying the parameter of the distribution model until the highest likelihood is found. The parameter value giving this result is called the MLE. One can do this a little more analytically and derive a likelihood function, giving the likelihood as a function of the parameter value. Then the derivative can be determined and solved for its root. However, it is often much simpler and numerically more convenient to look at the log of the likelihood function (where the ugly product of probabilities turns into a simple sum). The position of the maximum is not changed by such a monotone transformation. It is thus practical to determine the maximum of the log-likelihood.
In the special case that the normal distribution is used as the probability model, the log-likelihood turns out to be proportional to the negative sum of the squared residuals. Hence, the maximum likelihood is where the sum of the squared residuals is minimal. So here we have a nice shortcut, and the MLE can be found as the parameter for which the sum of the squared residuals becomes minimal (= LSE).
One can always use the MLE. When the data are normally distributed, one can also take the shortcut via the LSE, giving the very same result as the MLE (because in this case the LSE *is* the MLE; the calculations are only made simpler).
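A tiny numeric illustration of this shortcut (again with invented toy data): scanning a grid of candidate means, the value that maximizes the normal log-likelihood and the value that minimizes the sum of squared residuals both land on the arithmetic mean.

# Sketch: for a normal model, the MLE of the mean and the LSE coincide (invented data).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=30)     # sample assumed normally distributed

mu_grid = np.linspace(3, 7, 2001)
loglik = np.array([norm.logpdf(x, loc=m, scale=2.0).sum() for m in mu_grid])
ssr    = np.array([np.sum((x - m)**2) for m in mu_grid])

mu_mle = mu_grid[np.argmax(loglik)]   # maximizes the log-likelihood
mu_lse = mu_grid[np.argmin(ssr)]      # minimizes the sum of squared residuals
print(mu_mle, mu_lse, x.mean())       # all three agree up to the grid resolution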
Fausto, please read more carefully. Being a "special case" already indicates that this is a special case, doesn't it? And just one sentence later I wrote to which exceptional case this relates. I think this is very clear to everyone who is willing and able to read.
In principle, least squares is a special case of the maximum-likelihood methods. If the amplitudes of the structure factors were distributed as Gaussians with known dispersion, then maximum likelihood would become least squares. But it is the structure factors themselves that are distributed according to the Gaussian law, not the amplitudes. But at the end stages of refinement, when the model is complete and has a small error, the maximum likelihood can be approximated by the least squares.
Fausto, if a patient who is coughing comes to a doctor, it might be some rare viral infection, or a strange autoimmune disease, or some never-before-seen disease, possibly with high virulence, starting some pandemic horror... but it is most likely just a simple cough. And, Fausto, yes yes yes, I very clearly said that. I really wonder if you need some help.
Briefly (and therefore not necessarily a detailed explanation): the least-squares estimator seeks the best way to explain a random variable with a deterministic variable.
Maximum likelihood seeks the best probability distribution to explain the dispersion of a random variable.
The first (LSE) is a regression method; the second (MLE) is a method for measuring and managing the robustness of a probabilistic model.
Views may be wrong, or incorrectly stated (I am not talking about this specific case), but I do not think there is any point in flaming!
Anyway, for his case, as far as I can remember, for static (or steady-state) data, MLE and linear regression (LR) give equal results if the random error term is Gaussian. However, again if I recall correctly, in the seminal book Time Series Analysis by Box and Jenkins, the authors, while deriving the parameter estimates of time series models, start with an MLE analysis and then make a couple of simplifying assumptions (neglecting a number of terms) to arrive at LSEs. 8-10 years must have passed since I examined that derivation, and since I do not have the book with me right now, I cannot look it up. Nonetheless, LSEs are, at the very least, very close to MLEs in many cases, given that the error is normal.
Burak, what you write seems to relate to the central limit theorem (CLT). For larger sample sizes, the likelihood function approximates the shape of the normal density curve. The log of the normal density curve is a parabola, and so the problem of finding the MLE simplifies to the math of finding the LSE.
From another angle, without stressing the CLT, we might consider any arbitrary log-likelihood function of any arbitrary peaked shape. A common algorithm to find the maximum is to express the function by a Taylor series expansion around the point where the first derivative is zero. The result will thus have no linear term (it is zero by definition) but the quadratic, cubic and higher terms. The contribution of the higher-order terms is negligible, and the simplest form is again the parabola as an approximation, again simplifying the math to the LSE (here not to get the maximum but rather to get the variance or "spread" of the likelihood function: the "standard error" of the estimate). The cost of this simplification is that the result is just an approximation. How good this approximation is depends on many factors.
As Fausto stated vigorously several times, all this is then and ONLY then NO approximation (i.e. the *correct* MLE and the LSE are identical) if and ONLY if the probability distribution of the variable is normal. If this is not the case, the *approximation* is often still acceptable (the LSE is acceptably close to the MLE; the data have a very comparable likelihood under both the LSE and the MLE), but for sure there are cases where the LSE is too far from the MLE and therefore not a good or useful estimate (because one would think that for this value the data are most likely when in fact there is another value for which the data are much more likely; more severe, though, is the faulty conclusion about the variance or standard error).
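To see the parabola argument in numbers, here is a small sketch with an arbitrary Poisson example (invented counts): the exact log-likelihood is compared with its quadratic (Taylor) approximation around the MLE, whose curvature gives the usual standard error.

# Sketch: quadratic (parabola) approximation of a log-likelihood around its maximum.
# Arbitrary example: Poisson counts; the MLE of lambda is the sample mean.
import numpy as np
from scipy.stats import poisson

x = np.array([3, 5, 2, 4, 6, 3, 4, 5, 2, 4])       # invented count data
n, lam_hat = len(x), x.mean()                       # MLE of lambda

lam = np.linspace(2.0, 6.5, 400)
loglik = np.array([poisson.logpmf(x, l).sum() for l in lam])   # exact log-likelihood

# The second derivative of the Poisson log-likelihood at the MLE is -n/lam_hat
# (minus the observed information), so the quadratic approximation is a parabola:
quad = poisson.logpmf(x, lam_hat).sum() - 0.5 * (n / lam_hat) * (lam - lam_hat)**2
se = np.sqrt(lam_hat / n)                           # "standard error" from the curvature

print("MLE:", lam_hat, " approx. SE:", se)
# Near the maximum, loglik and quad are close; farther away they diverge, which is
# exactly where this LSE-type approximation becomes questionable.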
Fausto, I agree that a lot of wrong things are repeated over and over again by millions of "academic" people. Some of these things are harmful, but many are not generally harmful (or only in very special cases). Most of them can be seen as "convenient lies". This is not "academic", sure, but this is how the world is. We (not you or I in particular... just anyone who is smart enough) cannot and will never make all people understand everything correctly. But history has shown that despite this fact there are always some great minds looking behind such "common explanations", digging deeper and gaining some better understanding. And this is usually related to the fields of research they are mainly involved in. That means a physician won't debunk common tales about probability theory, and a statistician won't pinpoint common misconceptions in the definition of pathologies. Even if most non-statisticians use actually incorrect approximations or explanations or even interpretations for their data analysis, I consider this OK as long as their judgements are based on reasonability, coherence, and expert knowledge. Often many different pieces of evidence are taken together to come to a reasonable and coherent conclusion. In my experience it is often more productive and successful not to distract researchers with statistical details, so that they have more capacity left to interpret their observations in light of their models and their specialist expertise. This works surprisingly well in many cases. The cases where the neglect of statistical principles, incorrect applications of statistical analyses, or wrong approximations lead to disasters are generally pointed out (somewhat later, though) by statisticians, forcing the introduction of modified methods or rules (to give an example: the screening of genes for differential expression by new high-throughput methods first led to a disaster in which too many false-positive results were produced; but this was recognized! And today terms like "false-discovery rate" are known to many researchers in this field, whereas many were not at all aware of "error inflation" and "multiple testing" some years ago). So there is a development of the community, but it is slow.
Moaning and criticizing the way you do will not speed this process up. It might rather have adverse effects, I am afraid. Instead, you should demonstrate what the benefit (for the actual researcher!) is when he changes his behavior according to your suggestions. As far as I see, this benefit won't be significant for most researchers (in the life sciences). Those working with survival data and censored data should be informed more specifically. But I think there are a lot of books specialized on these topics and read by the people who work with such data. If the contents of these books are seriously wrong, you should find some talented people with whom you can write a better, attractive, understandable, enjoyable and beneficial book and get it promoted and distributed to the target audience. This would be my advice to help you reach your aim.
The purpose of statistics should be to provide a general method to handle any data set; in this light it makes no sense to provide a particular method that can only be applied under normal-distribution premises after censoring the data set. I suggest working here on a concrete non-normal univariate data set of 40 points, without any censoring, and presenting here the different proposed solutions found. While Jochen has tried to explain the orthodox method, Fausto is calling for a deep change in its analysis in order to free statistical teaching from the jail of premises that traps it. This may be painful for many people who teach conventional statistical wisdom, but not for science, which values better interpretations and better methods for approaching the solution of problems. We must face the situation with honesty, at the risk of losing our most cherished prejudices and textbooks. Thanks, emilio.
MLEs are obtained with the objective of maximizing the likelihood of the observed sample, while LSEs are meant to minimize the error sum of squares. In particular cases they may coincide, as discussed in the ongoing posts.
OK Fausto, here are the data for the participants' analysis. The mean is exactly 1000. Good luck to all of you. emilio
X variable (40 points)
1 287 1043 1732
6 344 1130 1763
17 407 1216 1784
32 475 1299 1799
52 547 1379 1807
78 623 1455 1815
109 702 1525 1829
145 785 1588 1869
187 870 1645 1991
234 956 1693 2784
If only some useful information had been provided... What exactly is the aim of the analysis? Are there any reasonable assumptions? Are there any critical assumptions? The task is like "Here is a map, now please tell me the destination!" - There is a lot of crucial information missing for a meaningful analysis. It all sounds too stupid to be considered for an answer, and this is possibly a reason for the lack of response. But I guess you consider me too stupid to understand.
Fausto, let's be patient about answers. Potential contributors are free to define the aim of their analysis, the rationale and vulnerable aspects of their assumptions, their own methods, tools and limits. Sometimes people who want to participate do not have the courage to expose themselves; sometimes they think they have good arguments but prefer to observe debates from a prudent distance; sometimes they recognize at once that the debate may be inconvenient for their preferred theories, packages and practices; etc. There are other Q&As here on RGate where people offer their views, with graphs and results, about concrete given datasets, without questioning the data at all. In any case, silence is eloquent by itself. When Don Quixote was confronted by Sancho's good reasons, like many bosses today, he ordered him: "Be silent, Sancho, it is not convenient to stir it up" ("Calla Sancho que no conviene menearlo"). In the Spanish tradition there is a saying among Catholic people: "The priest preaches but does not practise it" ("el cura predica pero no lo aplica"). OK, Fausto, we share our dissatisfaction with the state of today's statistical practice and teaching, and at our age we just wish that young researchers were able to say without fear that the Emperor is naked. Cheers to you and the other readers of this Q&A. emilio
@Fausto, can you give me a link with your main views written in English?
Thank you.
Demetris
@Fausto
I stand by my post, and by Jochen's explanation of why least squares is a special case of the maximum-likelihood methods. In support of my assertion I am attaching an article which shows in its last part when there is no difference between LSE and MLE.
Mohammad, hi. Can you apply your method to the 40-point univariate non-normal sample proposed here and tell us your results of fitting a curve to this dataset? Thanks, emilio
Dhruba,
You need to take part in the discussions! When so many senior researchers are writing their views with reference to your question, you should ask back if you have failed to understand any point in an answer. Only then will this discussion be fruitful for you.
Professor Firoz Khan has mentioned in his first answer that the LS method is a special case of the ML method. To be precise, the ML method is based on the LS method. That is why he has said so.
Your question is a very basic one, and that is why this discussion would be a very fruitful one for you. Take part in the discussion actively.
ResearchGate is in fact an open classroom. Junior researchers like you can learn a lot through this medium.
Dear Mohammad, according to your uploaded work, what is the case when epsilon ~ U(-r, r), i.e. the error term follows a uniform distribution and not a Gaussian one?
Before questioning me, one should go through this sentence in my first response:
"But at the end stages of refinement, when the model is complete and has a small error, the maximum likelihood can be approximated by the least squares."
And, in a later response, this phrase:
"which shows in its last part when there is no difference between LSE and MLE."
I reiterate: if the residual variation is homoscedastic, independent, and Gaussian, then least squares (the LSE) is especially useful and usually yields the MLE. However, the value of the MLE is sometimes limited to large samples, because its small-sample properties can be quite unattractive.
The LSE is not enough when the relationships of interest to us are not linear in their parameters; an attractive LSE is then difficult, or even impossible, to come by.
As such, the linear model E(y|x) = xb may not be enough in a lot of cases. The conditional expectation is just one parameter of the distribution of y conditional on x. The idea of MLE is to base estimation of the parameters not on the conditional expectation but on the whole conditional distribution f(y|x). Therefore MLE, as a strategy for obtaining asymptotically efficient estimators, is THE PRINCIPAL ONE from a large-sample perspective.
Dhruba,
As I had said earlier, your question is a very basic one. But there are doubts and suspicions in this regard! Various questions are coming up as you can see. You need to participate in the discussion!
Actually, if the observations follow a normal distribution around the mean, the MLE is the same as the OLS estimate.
For the most frequently used probability models (I'd guess for all models, but I have no proof), the least squares estimate of a location parameter is IDENTICAL to the maximum likelihood estimate of this location parameter. Some distributions have no "proprietary" location parameter, but usually the model can be reparametrized with respect to the expected value of the distribution. It is clear for logical reasons that the MLE of the parameter representing the expected value is identical to the LSE. However, I showed it for some of the typical distributions in the attached document.
So I would make the statement much stronger: the LSE of a location parameter (such as the expected value, concretely the sample mean) is necessarily identical to the MLE of the same parameter. This is true for all(?) distributions, not only for the normal distribution.
The drawback of the LSE is that the precision of this estimate cannot always be taken from the sample variance as it can for a normally distributed variable. When the distribution is not normal, the shape of the likelihood is not symmetric around the MLE (or LSE), and the standard error loses its meaning, as the confidence interval will be asymmetric as well. Here, correct confidence intervals can only be obtained from the likelihood function.
Only the central limit theorem assures that the likelihood function approximates the normal distribution (as the sampling distribution of the respective statistic), and thus the standard error can again be used to assess the (approximate) precision of the estimate.
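A quick numeric check of this claim for one concrete non-normal case, the exponential distribution parametrized by its expected value (invented data): the grid value minimizing the squared residuals and the grid value maximizing the likelihood both land on the sample mean. The intervals around them are a different matter, as noted above.

# Sketch: for an exponential sample (non-normal), the LSE of the expected value
# and the MLE of the expected value are both the arithmetic mean.
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)
x = rng.exponential(scale=4.0, size=25)    # invented data, true expected value 4

mu_grid = np.linspace(1.0, 10.0, 5001)
ssr    = np.array([np.sum((x - m)**2) for m in mu_grid])
loglik = np.array([expon.logpdf(x, scale=m).sum() for m in mu_grid])

print("LSE :", mu_grid[np.argmin(ssr)])
print("MLE :", mu_grid[np.argmax(loglik)])
print("mean:", x.mean())                   # all three agree up to the grid resolution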
Jochen, hi. It is clear you have a favorite set of models: this is your first premise. Then you extrapolate it to all models as a "guess", without proof, to state LSE = MLE; later you admit that even without a "location parameter" the model can be "reparametrized" with respect to the mean of the distribution and declare that "it is clear for logical reasons", which I am not able to see. As a proof, you offer ten "typical distributions", which are theoretical-mathematical constructions that only prove your point for that set of models. And to close the theoretical discourse you invoke the central limit theorem to declare that "thus the standard error can again be used to assess the (approximate) precision of the estimate". I did not understand your logical sequence. Perhaps this is a problem requiring the help of experts in the epistemology of logic, mathematics and statistical models, like Deborah Mayo, Fausto and others.
I can design thousands of parametric distributions as my set of models, but that does not authorize me to recommend it as a general recipe for analyzing real data. Can you apply just one of your models to analyze the 40-point data set proposed? Or a mixture of them, if you prefer? If you want, let's do the inverse operation: you give me a 40-point data set and I will analyze it, assuming the points are representative of the sample sent. I use an alternative general method to do it without such assumptions, without standard deviations, predictors, errors of the mean, errors of the standard deviation, or confidence intervals. By the way, the means of each interval do not usually correspond to the midpoints of the intervals. That is decided by the model employed to represent your statistical curves over the dataset. OK, let's be self-critical, starting with myself. With due respect, emilio
Dear Jochen, all your examples are a subset of the exponential family of distributions. You could take the general form and do the work once, instead of doing it so many times. There is no more information among the different examples you presented. Anyway, you did do the work!
Respected @Fausto Galetto,
Can you tell me a little about your standpoint on why a censored sample causes problems for the LSE?
@Demetris: In fact, this probably would have covered the gamma as well. I chose these examples and made them explicit because they are very often used in my field of research (life sciences, biomedical research).
@Emilio: the arithmetic mean is the LSE, and it is the expected value.
@Fausto: I attached the solution to your example. It feels like doing your homework. You are right that the LSE and MLE are not identical here. However, there is a simple transformation that again makes them identical. Further, I never claimed that the LS method gives good approximate CIs, so it is a little silly to make a big issue of this. In fact, your example leads to very bad LS approximations of the CIs (see attached file). However, the large-sample approximation works. This is where the CLT necessarily shows up. And finally I'd like to note that for such asymmetric distributions I would prefer likelihood intervals over confidence intervals: the CI leaves the same tail area on both sides, leading here to very different likelihoods at the borders. Thus the data can have a considerably different likelihood for an estimate at the lower and at the upper bound of the interval, which I find kind of counter-intuitive.
Fausto, then set eta' = sqrt(eta) and estimate eta'.
Ok, for the values
0.288 0.140 0.553 0.308 0.203 0.636 0.390 0.162 0.323 0.400
(generated from a distribution with eta=0.3) the MLE is 0.373 and the 95% CI is from 0.291 to 0.559.
As you and I noted, the LSE is not the same and is only useful as a (very) large-sample approximation. The LS method can be used to estimate eta' on the transformed values Z = X², and eta is then obtained as (eta')².
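For readers who want to retrace this example: assuming (my reading of the context, supported by the Weibull link further down) that the ten values are treated as a Weibull sample with known shape k=2 and unknown scale eta, the closed-form MLE of the scale reproduces the 0.373 quoted above, and a numerical maximization of the log-likelihood agrees.

# Sketch (an assumption on my part): treating the 10 values as a Weibull sample with
# known shape k=2 and unknown scale eta, the closed-form MLE of the scale,
# eta_hat = (mean(x^k))^(1/k), reproduces the 0.373 quoted above.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.288, 0.140, 0.553, 0.308, 0.203, 0.636, 0.390, 0.162, 0.323, 0.400])
k = 2.0                                        # assumed known shape

eta_closed = (np.mean(x**k))**(1.0 / k)        # closed-form MLE of the scale

def negloglik(eta):                            # Weibull(k, eta) negative log-likelihood
    return -np.sum(np.log(k / eta) + (k - 1) * np.log(x / eta) - (x / eta)**k)

eta_numeric = minimize_scalar(negloglik, bounds=(0.05, 2.0), method="bounded").x
print(eta_closed, eta_numeric)                 # both are about 0.373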
.
see section 3.3 of the attached document
also notice in the introduction :
"MLE has many optimal properties in estimation: sufficiency (complete information about the parameter of interest contained in its MLE estimator); consistency (true parameter value that generated the data recovered asymptotically, i.e. for data of sufficiently large samples); efficiency (lowest-possible variance of parameter estimates achieved asymptotically); and parameterization invariance (same MLE solution obtained independent of the parametrization used). In contrast, no such things can be said about LSE. As such, most statisticians would not view LSE as a general method for parameter estimation, but rather as an approach that is primarily used with linear regression models."
also, the LSE is "BLUE" (Best Linear Unbiased Estimator)... which is a very nice property explaining the ubiquity of the LSE in the linear-model context (under further restrictive hypotheses, homoscedasticity notably)
review the Gauss-Markov theorem (and read the hypotheses carefully!)
http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem
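A small simulation sketch of the unbiasedness part of that theorem (invented data satisfying the hypotheses: linear model, uncorrelated homoscedastic errors, here deliberately non-normal):

# Sketch: under the Gauss-Markov hypotheses the OLS estimator is unbiased,
# even when the (homoscedastic, uncorrelated) errors are not normal.
import numpy as np

rng = np.random.default_rng(3)
n, b_true = 30, 1.5
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])

slopes = []
for _ in range(5000):
    e = rng.uniform(-2, 2, n)                     # non-normal but homoscedastic errors
    y = 0.5 + b_true * x + e
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(b_hat[1])

print(np.mean(slopes))                            # close to the true slope 1.5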
.
@Emilio, can you give us a set of tasks for the 40-point dataset that you have uploaded?
I want to analyse it with some of my methods, but please give us the objective.
Demetris: I was expecting each contributor to state the task of the analysis, just to see what they consider their main objectives, methods, tools, premises, etc.
My answer to your question is this: 1) Give a well-fitting smooth curve graph for the cumulative distribution function and its mathematical expression. If you have a smoothly fitting Lorenz curve, do the same. If possible, include the data points in the graph so the fit can be seen. 2) Briefly explain the main method and mention the data ordering employed. 3) If standard deviations or variances are used, please calculate them for 100 points obtained from the model with increments of 1/100 of the population, and explain any difference from the 40-point SD or variance.
I would like to see answers that do not use probability density functions, but if they are used, I expect a continuous mathematical expression for them. Given that the dataset is not normal, please do not send Gaussian formulas.
Thanks for your interest, emilio
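Not an answer to the challenge, but for readers who want to reproduce the basic objects requested in 1), here is a minimal sketch computing the empirical CDF and an empirical Lorenz curve from the 40 rounded values posted above; any smooth parametric expression would then be fitted to these points.

# Sketch: empirical CDF and empirical Lorenz curve for the 40 rounded values.
import numpy as np

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

xs = np.sort(x)
n = len(xs)
ecdf = np.arange(1, n + 1) / n                 # empirical CDF at the sorted values

# Empirical Lorenz curve: cumulative share of the total vs. cumulative population
# fraction (ascending order here; the thread also works in descending order).
lorenz = np.concatenate([[0.0], np.cumsum(xs) / xs.sum()])
pop    = np.arange(0, n + 1) / n

print("mean:", x.mean())                       # about 1000, as discussed above
for p, share in zip(pop[::10], lorenz[::10]):
    print(round(p, 2), round(share, 4))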
Fausto, your reaction is not helpful; I find it rather offensive. I'll stop the discussion here with this link: http://en.wikipedia.org/wiki/Weibull_distribution (noting that E(X) = lambda*Gamma(1+1/k)). I'll come back when you have changed your mode of action from "destructive" to "constructive".
.
for those interested, two references comparing the performance of MLE, LSE and others
1) for the exponential distribution (in French ... but with a long abstract in English) :
.
.
2) for the Weibull distribution (oh, well ... they do not tell much about LSE except that it performed poorly for small samples and concentrate on other estimators) :
.
Fausto, "all models are wrong, but some are useful". Thus one can always scream WRONG WRONG WRONG wherever you wish. I'd find it constructive to help others to understand where they can be severely wrong (in a pracitcal sence). But this requires to understand the aims of these researchers. For me, for instance, it would have been helpful if you would have said something like "for the exponential family of distributions this is the case because of this and that... but nut neccesarily for distributions that do not belong to this family. As an example take the Weibull. Because here one can not ... bla bla .." or "this is the case for some distributions which have this and this property, because then this and this is the case. For other distributions, where that an that is the case, this does not work because of these reasons... bla bla". Instead, I repeatedly hear you moaning that I (and others) don't answer your "questions" where we (well, at least I) do not understand what you actually want (e.g. I said that the estimators are not the same, and even if they are the same, then the CI can be considerably different, and that a large-sample approximation might require *really* large samples to be useful -- but this all ist not what you wanted to hear or what you seem to recognize...). What am I doing here? It is not worth the time invested.
@Fabrice. I read Professor Sambou's (Senegal) article. Very interesting for this debate. It says: "Empirical plots are unstable for low sample sizes, are sensitive to sampling, and are very difficult to explain. Analytical expressions for the asymptotic statistical properties of the two estimators are needed for realistic comparison." Well, that is what we are asking for here with the 40-point dataset. I have clues for handling those low-sample-size expressions requested by Dr. Sambou. OK, thanks, emilio
Dear Emilio
I am new to this topic, because I was traveling on vacation for a month through beautiful Italy ("la bellísima Italia"). I am very delighted with the interesting comments of most participants. I copied your data and computed the arithmetic mean: it is 1000.075 instead of 1000.
Your data is:
1 287 1043 1732
6 344 1130 1763
17 407 1216 1784
32 475 1299 1799
52 547 1379 1807
78 623 1455 1815
109 702 1525 1829
145 870 1645 1991
234 956 1693 2784
Am I doing something wrong?
Hello, Guillermo. I think you may be right, because I rounded the numbers after defining U=1000 and multiplying it by the dimensionless variable (in fractions of U). I used Excel and recalculated the mean, but I am not sure about the decimal format used then. Use your own mean, because it is not different enough to cause big precision problems in interpretation. If in your opinion it deserves severe criticism, I will gladly accept and work on it. Thanks for your interest, emilio
Guillermo, the data you show are 9*4 = 36 points. My original data are 10*4 = 40 points. Please check your transcription. emilio
Guillermo: the original data, with several decimals before rounding, were these:
2783.705475
1991.428297
1868.684519
1828.805339
1814.684284
1807.424134
1798.729361
1784.479055
1762.575404
1732.05301
1692.649012
1644.567605
1588.335851
1524.706428
1454.586666
1378.983731
1298.960719
1215.600832
1129.978062
1043.133402
956.0559273
869.6682351
784.8157987
702.2598213
622.673176
546.6390307
474.6517607
407.1197647
344.3698194
286.6526333
234.1492871
186.9782835
145.2029638
108.8390821
77.86236568
52.2159235
31.81739515
16.5657642
6.347785068
1.043996707
Their average is exactly 1000, but the average of the rounded 40 data points is 999.025. So I accept there is an error of about 1/1000 in the reported mean of the rounded values. I hope this clarification helps; thanks for your timely remark, and let's see its effects. Thanks, emilio
Dear Emilio
Thank you for your correction. Your original list is:
1 287 1043 1732
6 344 1130 1763
17 407 1216 1784
32 475 1299 1799
52 547 1379 1807
78 623 1455 1815
109 702 1525 1829
145 785 1588 1869
187 870 1645 1991
234 956 1693 2784
I copied and pasted with problems in my previous post.
I rounded your last list in Excel and obtained the same values as before. But their sum is 40003, so the mean is 1000.075. I do not think this is important, but I asked because at first I was worried by the difference.
Dear Fausto
I congratulate you on your guess of a distribution behind the sample. The Kolmogorov-Smirnov test gives 0.052552915, which is very far from the critical value for alpha = 0.05 (0.2150). The graph of Fe(t) vs F(t) is close to the identity (Fe = the empirical distribution function).
I recognize, however, that after reading the previous remarks in this topic, I had not guessed that the purpose of this challenge was to find a distribution function that works well. You were talking about the difference between the MLE and the LSE, so I share Jochen's perplexity. Maybe Emilio and you work in the same field, or very close ones, and can guess each other's intentions very quickly, but before beginning to compute the MLE, I need to know from which of the infinitely many distributions we are sampling. I think there may be many alternatives to the distribution you postulate that fit these data very well; the estimator of the mean would be similar to yours and to the usual LSE, but the MLE estimators of the remaining parameters will be different, because they will be other parameters.
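For those who want to retrace this test: assuming the fitted CDF is the F(t) = 1 - exp[-((t/1944.4)^0.6 + (t/1810.6)^9)] quoted later in this thread, the Kolmogorov-Smirnov distance to the 40 rounded values can be computed as sketched below. The exact value depends on the fitted parameters, and the standard critical value is only approximate when the parameters were estimated from the same data.

# Sketch: Kolmogorov-Smirnov distance between the 40 values and the fitted CDF
# F(t) = 1 - exp(-((t/1944.4)^0.6 + (t/1810.6)^9)) quoted later in this thread.
import numpy as np
from scipy.stats import kstest

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

def F(t):
    t = np.asarray(t, dtype=float)
    return 1.0 - np.exp(-((t / 1944.4)**0.6 + (t / 1810.6)**9))

res = kstest(x, F)                 # D should be close to the 0.0526 reported above
print(res.statistic, res.pvalue)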
Guillermo, I repeated the calculations and obtained the same result as yours: 1000.075. I will use it to build the cumulative distribution function (CDF) model that gives a smooth fitting curve to the 40-point dataset. I will show it here once it is ready. Thanks for your contribution on correcting this point. It teaches me, once again, that we must let the data speak for themselves; in this case I wrongly used the model's predesigned mean instead of the dataset mean.
@Fausto. Thanks for your answer. I have two questions about your curve: 1) Did you order the data from top to low, or in ascending order? 2) Could you explain, with one numerical example using one of the 40 points, the F(t) obtained? Is F(t) a cumulative distribution function, a probability density function, or something else? Your answers will help me understand the method that produced your fitted model for this particular dataset. My regards to you and all followers, emilio
Hi dear Fausto, I usually order the data from top to low values of the X variable, which I call K because the mathematics becomes easier for me to handle; this affects your population order as a concept. It seems that your variable t means the cumulative fraction of the population,
K(t | K >= Ko), but I am not sure (if the X order is ascending, then I must transform your equation to F(z), where z = 1 - t for the same value of F, getting the points (zi; Fi)). All I want is to be sure about your model F(t) = 1 - exp[-((t/1944.4)^0.6 + (t/1810.6)^9)] before working it out in Excel and observing how it fits the data points. So please give one value of t and its F(t), which I suppose corresponds to one of the dataset points I gave. If I am wrong, just tell me what t and F(t) mean to you. I am not questioning anything; my interest is to interpret you properly when I graph your expression F(t). In this case my problem (not yours) is that I use different names from yours for the horizontal and vertical variables of the curve, and a different ordering method. The data table is fine as you showed it. Thanks a lot, emilio
Dear Fausto, look at the graphs where I have plotted the empirical CDF approximation from R, plus a natural spline of n=201 points, together with your approximation of F(x).
I have some objections around the value ~1800 (see the second plot): your F(x) seems to 'get stuck' there.
Another objection is the value of -0.4781865491e-29 = -0.4781865491*10^(-29) in your formula:
F(t)=1.-1.*exp(-0.1063478300e-1*t^0.6-0.4781865491e-29*t^9)
It seems to me to have been produced from an ill-conditioned matrix.
Any details?
Dear Fausto, another issue with your proposed CDF is that the corresponding PDF has a local maximum near ~1800, see the plot. Shouldn't it have its maximum around Mean(X)=1000 instead of 1800?
I think this conversation is too passionate for me, since nobody can upload anything without being downvoted by somebody!
So, weighing the pros and cons, I am leaving. Bye!
PS By the way, Fausto: I am not a normal-distribution lover, and if you look at my work I am a totally non-parametric scientist, so I wouldn't do any kind of regression in order to present a formula for the above data. But if you want to label everybody, it's OK, no problem, we (still) live in a world with freedom of writing. Ciao...
I want to give a short statement on why I don't play this game here: the "empirical estimation of a CDF" from a given set of data, without knowing ANYTHING about the kind of data, possible underlying mechanisms, scientific background... is, in my opinion, really unscientific. What can we learn from discussing any (parametric!!) fits of the empirical CDF? If this data is all you know, then use it. Take the empirical CDF as is. If you are going for inference: bootstrap. If you know that the data are related to particular processes, possibly from reliability experiments, you can go and check whether a particular (composite?) CDF from this field fits well and thus tells you something more about the data. But this information was not given, and it would also require being an expert in this particular topic (which is not the case for most of the readers here).
Further, the question about the mean is, in my (!) opinion, not very sensible here. The distribution is clearly bi- (tri-?) modal (see the attached picture of the estimated [and smoothed] density curve). If you want a mean, you can bootstrap it. The result I got is 999 with a 95% CI from 772 to 1230 (giving equal weights to all observations and taking random samples of 40 with replacement).
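A minimal sketch of exactly this bootstrap (equal weights, resampling the 40 values with replacement, percentile limits); the numbers will vary slightly with the random seed.

# Sketch: nonparametric bootstrap of the mean of the 40 values (equal weights,
# resampling with replacement), with a percentile 95% interval.
import numpy as np

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

rng = np.random.default_rng(4)
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(10000)])

print("mean:", x.mean())
print("95% percentile CI:", np.percentile(boot_means, [2.5, 97.5]))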
And funnily I have the feeling that the "circle of statistical quality" is confused about his own problem and solution... but likewise here :) I am keen to learn how this all will be resolved. Thank you.
By using only the empirical CDF, interpolated by splines (N=1001), and with iterative use of the Extremum Surface Estimator (ESE) and Extremum Distance Estimator (EDE) from the R package 'inflection', we can find the two critical points:
mu[1]=1022.281
mu[2]=1809.774
No assumption at all.
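Not the 'inflection' package itself, but a rough sketch of the same idea in Python: interpolate the empirical CDF with a cubic spline and look for sign changes of its second derivative. An interpolating spline through all 40 points is wiggly, so this lists several candidate points; the iterative ESE/EDE step described above is what narrows them down to the two reported values.

# Sketch: candidate inflection points of a spline-interpolated empirical CDF
# (a rough Python analogue of the approach described above, not the R package).
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

xs = np.sort(x)
Fhat = np.arange(1, len(xs) + 1) / len(xs)        # empirical CDF at the data points
spl = CubicSpline(xs, Fhat)                        # interpolating spline of the ECDF

grid = np.linspace(xs[0], xs[-1], 1001)
d2 = spl(grid, 2)                                  # second derivative on a grid
sign_change = np.where(np.diff(np.sign(d2)) != 0)[0]
print(grid[sign_change])                           # candidate inflection points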
Nice solution, Demetris. But shouldn't the 3rd derivative at the *used* inflection points be positive (-> maximum in the density)? This is not the case for the point at 1022, which reflects a local minimum in the density. Would this make sense? An earlier "inflection point" should/could be close to 0. This gives the same solutions as seen in the density curve.
Using splines, there should be a good solution with 3-4 knots only (rather than 1001). One can additionally use the constraint that the derivative at the largest quantile must be 0.
Jochen, that's the reason why I called them 'critical points': the interpretation is another story... The choice of N=1001 knots was made just to increase the accuracy of the ESE and EDE methods.
Fausto, I told you that I don't want to do regressions, but this does not mean that I cannot do them. See a very simple symbolic regression with HeuristicLab 3.3. I don't find it very good, but I don't want to spend time improving it. So what?
Hi all of you, thanks for your participation. I include here an Excel file that you may study slowly for my interpretation and graphs of the dataset. It is necessary to handle Excel well to follow its logic and the graphs included. It is based on these main premises:
1) The Laplace criterion (each data point is the median of an interval with frequency 1/40).
2) From 1) we may infer/derive the Lorenz curve (Xcum; Lcum). Note: I use a descending order, and I call the dataset variable K and the cumulative population fraction X.
3) There is a structural function W(x) that defines the whole model, after adding a premise for the lowest K value: at x=1, K(1)=W(1) is the minimum.
4) The W model obtained from the computer's least-squares method for 5 selected points was W(x) = -0.6617x^4 + 2.2448x^3 - 1.8246x^2 - 0.4956x + 0.736.
5) From 4) we obtain L(x) = x^W(x).
6) K>=(x) = L(x)*(W(x)/x + ln(x)*W'(x)), where W'(x) is the derivative of W with respect to x.
If you want to express this in the standard ascending order, you must make the adequate transforms for all graphs, with z = 1 - x as the cumulative fraction of the population for descending order, etc. This is long to do and I leave it to those interested.
The main goal is to show the model that results from a different method that only uses datasets as they come. It is not necessary to use standard deviations or a priori models. It lets the data speak for themselves.
I think it is enough for the moment. If you have questions please ask them and I will try to answer them. Thanks, emilio.
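To make the construction easier to retrace, here is a small sketch that simply evaluates the formulas as posted: the fitted W(x), the Lorenz curve L(x) = x^W(x), and K>=(x) = L(x)*(W(x)/x + ln(x)*W'(x)), at a few values of the cumulative population fraction x (descending order). The resulting K values are in fractions of the mean, so multiplying by roughly 1000 puts them on the scale of the posted data; this only restates the formulas above, it is not a check or an endorsement of the method.

# Sketch: direct evaluation of the posted model (descending order, x = cumulative
# population fraction): W(x) fitted by least squares, L(x) = x^W(x), and
# K(x) = L(x) * (W(x)/x + ln(x) * W'(x)). K is in fractions of the mean (~1000).
import numpy as np

coef = [-0.6617, 2.2448, -1.8246, -0.4956, 0.736]   # W(x), coefficients from x^4 down
W  = np.polynomial.Polynomial(coef[::-1])            # Polynomial wants ascending order
Wd = W.deriv()                                        # W'(x)

def L(x):                                             # Lorenz curve
    return x**W(x)

def K(x):                                             # the modelled variable
    return L(x) * (W(x) / x + np.log(x) * Wd(x))

for xi in (0.1, 0.25, 0.5, 0.75):                     # x = 0 is excluded (log and 1/x)
    print(xi, round(L(xi), 4), round(K(xi), 4))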
Emilio, why did you choose that model for W:
W = b4*x^4 + b3*x^3+b2*x^2+b1*x + b0+error?
Demetris, I did not choose that model shape; I just let the computer assign the W function by using the polynomial trendline feature of Excel. There are many possible solutions for the same points, as Fabrice observed some posts ago. As a matter of fact, I predesigned W using a function of the shape W = a + bx + cx^2 + dx^n, where n is a real fractional number. I used trial and error just to get the median quite close to the mean, just to show that this can happen with non-normal, non-symmetric distributions. OK, emilio
Dear Emilio, the solution that you gave us is a model-dependent solution and a purely regression-based LSE one. By changing the above class of functions from polynomial to something else you could obtain different results: that's the big weakness of all model-selection solutions. In order not to be in the situation of apologizing for why this class of model and not another, we could instead work without any model assumption and obtain our critical values (whatever they are: moments, etc.) directly from the empirical CDF. Anyway, you did a good job, although it was done using a specific model.
Demetris, of course LSE is very useful for these cases. I agree it is possible to study functions other than polynomials, but the main point is not W(x), it is the combined functional structure of the CDF. My main interest is to change statistical education for young people towards something simpler, understandable and teachable, with theoretical fundamentals. I would like to know more about the infinite number of functions that are possible inside a 1x1 square box. I have made trials with W = a(1-bx)^n and other shapes, to design but not to fit datasets. But each family of functions has its own limits, and you need to check that the descending-order premise does not break. OK, thanks for your good comments, emilio
Dear Friends
For the record, I want to mention that I tried to fit Emilio's data with a four-parameter beta distribution and obtained its MLE estimators in SAS, but this distribution turned out much worse than the one obtained before by Fausto.
I think the problem is not resolved yet, because, for example, Fausto gave us the MLE estimators of his proposed distribution, but we do not have its LSE, nor the properties of either, and so they were not compared. We know that asymptotically the MLE is efficient, but for a sample of size 40, who knows which is better? It may be that in this particular example the LSE is as good as, or better than, the MLE.
Dear Fausto. The dataset was first sent 12 days ago (February 29, I guess); 3 days ago (March 8) I gave the data with several decimals, before rounding, to answer Guillermo's observation about the exact average of the data sent. You may confirm this by looking at all the answers. I remember working with "K ave Ch" to answer Guillermo, but I erased those tables and graphs made with the wrong average and made them again with Guillermo's corrected value during the last 2 days. I hope this helps, emilio
Fausto, my results with the initial mean "K ave Ch" were very close to the corrected ones shown one or two days ago. I repeated them to accept Guillermo's observation. Your estimates were close at the extremes, the lower and the middle points, but somewhat distant elsewhere. In your analysis you employ a function that gives the cumulative population as a function of the variable (a kind of inverse function); I graphed X vs T (cumulative population) and it was different. But that is not a big problem in my opinion, because you contributed a proposal and took the risk. Have you observed that I do not need "estimators" of any kind? The main point is that methods determine and produce models that do not fit the data in many cases, something that you have made clear, in my opinion. Demetris and Guillermo rejected my solution, saying that I used the least-squares method to produce the structural function W from the data, which is only the resulting model for a small part of my analysis. Well, they are not the owners of that method, to decide who may use it or not. But if we try to use the same LS method for the CDF of the given dataset and the middle points of the intervals, we will find that it does not work: the curve becomes sinusoidal and fits only a few points (this problem increases if you use too many points and very high polynomial degrees), so the R² indicator falls.
I believe that this discussion will be important for improving the road toward better models, if models are seen as by-products of methods, understanding that methods have limits of application according to the datasets.
Thanks for the instant and all your support, emilio
Emilio, you state you have a "model". Could you explain in words what this model means? (Honestly, I just do not understand it and hope for your help.) What can we see in this model, what can we learn from it? What is the practical use? (I hope you get the intention of these questions.)
Further, given your data, if one had to make a prediction about future observations: what do you predict? And with what confidence or credibility?
Or am I asking the wrong questions? I have to admit that I have no clue about the aims and interpretations of this analysis at all. (As I said above, I do not see the point in "describing a curve" (call it "fitting" if you want) that follows an f_cum vs. quantile plot, or vice versa.)
Jochen, your ironic questions and comments are not conducive to an open debate. That is not a good example for young researchers, and we should not use RG as a ring for ego confrontations. Please read my last answer to Fausto, where I treated some of your points, and have another look at the Excel file I sent two days ago. After a pause, post your concrete questions and answers, and your own proposed curves for modelling and interpreting the 40-point dataset, and I will be glad to consider them. Thanks, emilio
Emilio, that's not fair. My questions are NOT ironic. I really do not understand and would appreciate your help. I have a solution fitting your 40 points, slightly closer than Fausto's solution, but with 9 parameters (Fausto used fewer). But this I did for fun, not because I consider my solution useful for the problem.
Once again, I said that I fitted a curve, and sure, I used ML to find the "best" parameter values. This is not a simple interpolation (that would have been better, at least going perfectly through all the points, don't you think?!). And as I said, I estimated 9 parameters from the data. I know that these are many, and a good fit with only 4 parameters is better than a slightly better fit with 9. So please do not blame me for that.
So here is the Excel sheet. Take it or leave it. For my convenience, the formula is divided into several parts. It would be cumbersome to put it together into one formula, but it is possible. The calculations are all in the table; feel free to do it.
There will be 1001 formulas that can fit the points reasonably well. Now we have two of them. I don't see the point in doing this. To say it again (!): having only the data, without any concept or theory about how the data were generated, will not allow any good analysis. It's worthless. If parameters are to be estimated, then they should have a meaning, and this meaning is not provided by the data.
Dear Emilio
I do not reject your solution.
I only said that we have not yet answered the question that originated all this interesting matter: is the MLE the same as, better than, or worse than the LSE?
I see that your proposal fits your original data very well, but I share some of Jochen's questions, because I do not see the parameters in your formula for the distribution function. For sure, I am not the owner of any of the methods discussed here. I only hope to use them well enough to solve my statistical problems.
OK Guillermo, excuse me if things are as you explained. If you look at the second graph in my Excel file, you may observe that it contains five data points for the function W(x) and also a polynomial equation of degree 4 obtained by using the trendline function of Excel. This function contains the coefficients (a, b, c, d, e) of W = a + bx + cx^2 + dx^3 + ex^4, a function fitted to the 5 chosen points, with R² = 1 according to Excel. I use that expression, plus its derivative, in the section on the dimensionless model to compute values at each data point for W(x), the Lorenz curve L(x), K ave(x), and K>=(x). You may check the formulas used directly in the file and observe them in the graphs made from the formulas. Remember that I order the data from top to low values of the dataset variable K.
This ordering has a close relationship with the commonly used ascending order, and you can obtain the corresponding graphs using z = 1 - x for the population, plus the same K value, and then graphing (z; K). But if you want the Lorenz curve you must use (zi; 1 - Li) to obtain the proper transform and graphs for the ascending data order.
Observe that I do not use predictors, estimators, standard deviations, or a priori values of functions. I only use the Laplace criterion: the data are the means of smaller intervals of frequency 1/N, as I understand it.
If you try to make a regression directly with the least-squares method for (X, K), the program produces oscillating curves that fit 5 points but behave badly between each pair of data points. But if you apply it to a simpler structural function such as W(x) at the preliminary stage of the analysis, then you obtain good results by using the formulas derived from mathematical analysis and the properties of the Lorenz curve. I hope I was clear; if not, I will try other ways another time. Thanks, emilio
Dear Emilio, I think you misunderstood me. First of all, I don't want, and do not have the power, to dictate what somebody uses in his/her analysis. Secondly, I think we are all scientists working with different methodologies, and it is not necessary to agree; we are just exchanging opinions here on RG. Finally, I have some remarks:
1) The question was about MLE vs LSE. What did the data you provided contribute to that question?
2) You took the ratio of every value divided by the mean value (~1000). What is the justification for doing this? Why not divide by the median, for example?
3) At the end of the day, after doing your analysis, what can you tell us, practically, about the given data? I mean, what is the practical advantage of the analysis presented in your Excel file?