In statistics, when we estimate parameters, sometimes the least-squares estimator is used and sometimes the maximum-likelihood estimator (MLE). Which one is better, and when can each be applied?
.
Hi Jochen,
I forgot almost everything about such topics at least two decades ago, but I'll try to connect the dots:
LSE = ML for estimators with a normal distribution; ML estimators are asymptotically normal,
the last assertion being
- true for the exponential family
- true outside the exponential family under a pretty ugly supplementary assumption (you can find it in Theorem 7.63, page 421 of "Theoretical Statistics", M. Schervish, Springer 1995 ... that's a very big book!)
Radford Neal has a simple example of a distribution leading to an inconsistent MLE and violating Schervish's assumption, which helps in understanding the rationale behind it:
http://radfordneal.wordpress.com/2008/08/09/inconsistent-maximum-likelihood-estimation-an-ordinary-example/
.
The least-squares estimator arises in linear regression problems:
Y = Xb + e, where e is the error term.
Minimize ||e||^2 = ||Y - Xb||^2, where ||v|| is the l2 norm (hence the term "least squares").
On the other hand, when your model cannot be linearised, you perform MLE, i.e. you choose those values (a vector b if there is more than one) that maximize the likelihood function:
f(x1;b) f(x2;b) ... f(xn;b), where f() is the pdf of the distribution that the variable X follows and x1, x2, ..., xn are the values that you have as data for X.
If you are lucky you can find the b with Calculus methods.
Otherwise you have to use iterative methods (starting from a reasonable first guess for your unknown parameter vector b).
Anyway all statistical packages routinely perform the above MLE process.
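To make the two routes concrete, here is a minimal sketch in Python (invented toy data, not from this thread): the LSE of b is computed in closed form and, assuming normally distributed errors, the same b is recovered by numerically maximizing the likelihood.

# Sketch: LSE in closed form vs. MLE by numerical optimization (invented toy data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])   # design matrix with intercept
b_true = np.array([2.0, 0.7])
y = X @ b_true + rng.normal(0, 1.5, n)                      # normal errors assumed below

# Least squares: minimize ||y - Xb||^2; closed-form solution b = (X'X)^(-1) X'y
b_lse, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum likelihood under a normal error model: minimize the negative log-likelihood
# over b and sigma (parametrized as log(sigma) to keep it positive).
def negloglik(theta):
    b, sigma = theta[:2], np.exp(theta[2])
    resid = y - X @ b
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + 0.5 * np.sum(resid**2) / sigma**2

b_mle = minimize(negloglik, x0=np.zeros(3)).x[:2]

print("LSE:", b_lse)   # the two estimates of b agree up to numerical tolerance
print("MLE:", b_mle)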
The least-squares estimator (LSE) is a special case of a maximum-likelihood estimator (MLE). The special case is that the probability distribution used for the likelihood is the normal distribution.
The MLE is the parameter value for which the observed data are most likely. This likelihood (of the data) can be calculated for any (assumed) parameter value. If the data are independent, the likelihood is simply the product of the individual probabilities of the observed values. This probability has to be evaluated, and for this a probability model is required. Depending on the kind of data, this could be specified as a binomial distribution, or a Poisson distribution, or an exponential distribution, or... (many many many more)... or a normal distribution.
The MLE is obtained by varying the parameter of the distribution model until the highest likelihood is found. The parameter value giving this result is called the MLE. One can do this a little more analytically and derive a likelihood function, giving the likelihood as a function of the parameter value. Then the derivative can be determined and solved for its root. However, it is often much simpler and numerically more convenient to look at the log of the likelihood function (where the ugly product of probabilities turns into a simple sum). The position of the maximum is not changed by such a monotone transformation. It is thus practical to determine the maximum of the log-likelihood.
In the special case that the normal distribution is used as the probability model, the log-likelihood turns out to be proportional to the negative sum of the squared residuals. Hence, the maximum likelihood is where the sum of the squared residuals is minimal. So here we have a nice shortcut, and the MLE can be found as the parameter for which the sum of the squared residuals becomes minimal (= LSE).
One can always use the MLE. When the data are normally distributed, one can also take the shortcut via the LSE, giving the very same result as the MLE (because in this case the LSE *is* the MLE; the calculations are only made simpler).
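A tiny numeric illustration of this shortcut (again with invented toy data): scanning a grid of candidate means, the value that maximizes the normal log-likelihood and the value that minimizes the sum of squared residuals both land on the arithmetic mean.

# Sketch: for a normal model, the MLE of the mean and the LSE coincide (invented data).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=30)     # sample assumed normally distributed

mu_grid = np.linspace(3, 7, 2001)
loglik = np.array([norm.logpdf(x, loc=m, scale=2.0).sum() for m in mu_grid])
ssr    = np.array([np.sum((x - m)**2) for m in mu_grid])

mu_mle = mu_grid[np.argmax(loglik)]   # maximizes the log-likelihood
mu_lse = mu_grid[np.argmin(ssr)]      # minimizes the sum of squared residuals
print(mu_mle, mu_lse, x.mean())       # all three agree up to the grid resolution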
Fausto, please read more carefully. Being a "special case" already indicates that this is a special case, doesn't it? And just one sentence later I wrote to which exceptional case this relates. I think this is very clear to everyone who is willing and able to read.
In principle, least squares is a special case of the maximum-likelihood methods. If the amplitudes of the structure factors were distributed as Gaussians with known dispersion, then maximum likelihood would become least squares. But it is the structure factors themselves that are distributed according to the Gaussian law, not the amplitudes. But at the end stages of refinement, when the model is complete and has a small error, the maximum likelihood can be approximated by the least squares.
Fausto, if a patient who is coughing comes to a doctor, it might be some rare viral infection, or a strange autoimmune disease, or some never-before-seen disease, possibly with high virulence, starting some pandemic horror... but it is most likely just a simple cough. And, Fausto, yes yes yes, I very clearly said that. I really wonder if you need some help.
Briefly (and therefore not necessarily a detailed explanation): the least-squares estimator seeks the best way to explain a random variable with a deterministic variable.
Maximum likelihood seeks the best probability distribution to explain the dispersion of a random variable.
The first (LSE) is a regression method; the second (MLE) is a method for measuring and managing the robustness of a probabilistic model.
Views may be wrong, or incorrectly stated (I am not talking about this specific case), but I do not think there is any point in flaming!
Anyway, for his case, as far as I can remember, for static (or steady-state) data, MLE and linear regression (LR) give equal results if the random error term is Gaussian. However, again if I recall correctly, in the seminal book Time Series Analysis by Box and Jenkins, the authors, while deriving the parameter estimates of time series models, start with an MLE analysis and then make a couple of simplifying assumptions (neglecting a number of terms) to arrive at LSEs. 8-10 years must have passed since I examined that derivation, and since I do not have the book with me right now, I cannot look it up. Nonetheless, LSEs are, at the very least, very close to MLEs in many cases, given that the error is normal.
Burak, what you write seems to relate to the central limit theorem (CLT). For larger sample sizes, the likelihood function approximates the shape of the normal density curve. The log of the normal density curve is a parabola, and so the problem of finding the MLE simplifies to the math of finding the LSE.
From another angle, without stressing the CLT, we might consider any arbitrary log-likelihood function of any arbitrary peaked shape. A common algorithm to find the maximum is to express the function by a Taylor series expansion around the point where the first derivative is zero. The result will thus have no linear term (it is zero by definition) but the quadratic, cubic and higher terms. The contribution of the higher-order terms is negligible, and the simplest form is again the parabola as an approximation, again simplifying the math to the LSE (here not to get the maximum but rather to get the variance or "spread" of the likelihood function: the "standard error" of the estimate). The cost of this simplification is that the result is just an approximation. How good this approximation is depends on many factors.
As Fausto stated vigorously several times, all this is then and ONLY then NO approximation (i.e. the *correct* MLE and the LSE are identical) if and ONLY if the probability distribution of the variable is normal. If this is not the case, the *approximation* is often still acceptable (the LSE is acceptably close to the MLE; the data have a very comparable likelihood under both the LSE and the MLE), but for sure there are cases where the LSE is too far from the MLE and therefore not a good or useful estimate (because one would think that for this value the data are most likely when in fact there is another value for which the data are much more likely; more severe, though, is the faulty conclusion about the variance or standard error).
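To see the parabola argument in numbers, here is a small sketch with an arbitrary Poisson example (invented counts): the exact log-likelihood is compared with its quadratic (Taylor) approximation around the MLE, whose curvature gives the usual standard error.

# Sketch: quadratic (parabola) approximation of a log-likelihood around its maximum.
# Arbitrary example: Poisson counts; the MLE of lambda is the sample mean.
import numpy as np
from scipy.stats import poisson

x = np.array([3, 5, 2, 4, 6, 3, 4, 5, 2, 4])       # invented count data
n, lam_hat = len(x), x.mean()                       # MLE of lambda

lam = np.linspace(2.0, 6.5, 400)
loglik = np.array([poisson.logpmf(x, l).sum() for l in lam])   # exact log-likelihood

# The second derivative of the Poisson log-likelihood at the MLE is -n/lam_hat
# (minus the observed information), so the quadratic approximation is a parabola:
quad = poisson.logpmf(x, lam_hat).sum() - 0.5 * (n / lam_hat) * (lam - lam_hat)**2
se = np.sqrt(lam_hat / n)                           # "standard error" from the curvature

print("MLE:", lam_hat, " approx. SE:", se)
# Near the maximum, loglik and quad are close; farther away they diverge, which is
# exactly where this LSE-type approximation becomes questionable.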
Fausto, I agree that a lot of wrong things are repeated over and over again by millions of "academic" people. Some of these things are harmful, but many are not generally harmful (or only in very special cases). Most of them can be seen as "convenient lies". This is not "academic", sure, but this is how the world is. We (not you or I in particular... just anyone who is smart enough) cannot and will never make all people understand everything correctly. But history has shown that despite this fact there are always some great minds looking behind such "common explanations", digging deeper and gaining some better understanding. And this is usually related to the fields of research they are mainly involved in. That means a physician won't debunk common tales about probability theory, and a statistician won't pinpoint common misconceptions in the definition of pathologies. Even if most non-statisticians use actually incorrect approximations or explanations or even interpretations for their data analysis, I consider this OK as long as their judgements are based on reasonability, coherence, and expert knowledge. Often many different pieces of evidence are taken together to come to a reasonable and coherent conclusion. In my experience it is often more productive and successful not to distract researchers with statistical details, so that they have more capacity left to interpret their observations in light of their models and their specialist expertise. This works surprisingly well in many cases. The cases where the neglect of statistical principles, incorrect applications of statistical analyses, or wrong approximations lead to disasters are generally pointed out (somewhat later, though) by statisticians, forcing the introduction of modified methods or rules (to give an example: the screening of genes for differential expression by new high-throughput methods first led to a disaster in which too many false-positive results were produced; but this was recognized! And today terms like "false-discovery rate" are known to many researchers in this field, whereas many were not at all aware of "error inflation" and "multiple testing" some years ago). So there is a development of the community, but it is slow.
Moaning and criticizing the way you do will not speed this process up. It might rather have adverse effects, I am afraid. Instead, you should demonstrate what the benefit (for the actual researcher!) is when he changes his behavior according to your suggestions. As far as I see, this benefit won't be significant for most researchers (in the life sciences). Those working with survival data and censored data should be informed more specifically. But I think there are a lot of books specialized on these topics and read by the people who work with such data. If the contents of these books are seriously wrong, you should find some talented people with whom you can write a better, attractive, understandable, enjoyable and beneficial book and get it promoted and distributed to the target audience. This would be my advice to help you reach your aim.
The purpose of statistics should be to provide a general method to handle any data set; in this light it makes no sense to provide a particular method that can only be applied under normal-distribution premises after censoring the data set. I suggest working here on a concrete non-normal univariate data set of 40 points, without any censoring, and presenting here the different proposed solutions found. While Jochen has tried to explain the orthodox method, Fausto is calling for a deep change in its analysis in order to free statistical teaching from the jail of premises that traps it. This may be painful for many people who teach conventional statistical wisdom, but not for science, which values better interpretations and better methods for approaching the solution of problems. We must face the situation with honesty, at the risk of losing our most cherished prejudices and textbooks. Thanks, emilio.
MLEs are obtained with the objective of maximizing the likelihood of the observed sample, while LSEs are meant to minimize the error sum of squares. In particular cases they may coincide, as discussed in the ongoing posts.
OK Fausto, here are the data for the participants' analysis. The mean is exactly 1000. Good luck to all of you. emilio
X variable (40 points)
1 287 1043 1732
6 344 1130 1763
17 407 1216 1784
32 475 1299 1799
52 547 1379 1807
78 623 1455 1815
109 702 1525 1829
145 785 1588 1869
187 870 1645 1991
234 956 1693 2784
If only some useful information had been provided... What exactly is the aim of the analysis? Are there any reasonable assumptions? Are there any critical assumptions? The task is like "Here is a map, now please tell me the destination!" - There is a lot of crucial information missing for a meaningful analysis. It all sounds too stupid to be considered for an answer, and this is possibly a reason for the lack of response. But I guess you consider me too stupid to understand.
Fausto, let's be patient about answers. Potential contributors are free to define the aim of their analysis, the rationale and vulnerable aspects of their assumptions, their own methods, tools and limits. Sometimes people who want to participate do not have the courage to expose themselves; sometimes they think they have good arguments but prefer to observe debates from a prudent distance; sometimes they recognize at once that the debate may be inconvenient for their preferred theories, packages and practices; etc. There are other Q&As here on RGate where people offer their views, with graphs and results, about concrete given datasets, without questioning the data at all. In any case, silence is eloquent by itself. When Don Quixote was confronted by Sancho's good reasons, like many bosses today, he ordered him: "Be silent, Sancho, it is not convenient to stir it up" ("Calla Sancho que no conviene menearlo"). In the Spanish tradition there is a saying among Catholic people: "The priest preaches but does not practise it" ("el cura predica pero no lo aplica"). OK, Fausto, we share our dissatisfaction with the state of today's statistical practice and teaching, and at our age we just wish that young researchers were able to say without fear that the Emperor is naked. Cheers to you and the other readers of this Q&A. emilio
@Fausto, can you give me a link with your main views written in English?
Thank you.
Demetris
@Fausto
I stand by my post, and by Jochen's explanation of why least squares is a special case of the maximum-likelihood methods. In support of my assertion I am attaching an article which shows in its last part when there is no difference between LSE and MLE.
Mohammad, hi. Can you apply your method to the 40-point univariate non-normal sample proposed here and tell us your results of fitting a curve to this dataset? Thanks, emilio
Dhruba,
You need to take part in the discussions! When so many senior researchers are writing their views with reference to your question, you should ask back if you have failed to understand any point in an answer. Only then will this discussion be fruitful for you.
Professor Firoz Khan has mentioned in his first answer that the LS method is a special case of the ML method. To be precise, the ML method is based on the LS method. That is why he has said so.
Your question is a very basic one, and that is why this discussion would be a very fruitful one for you. Take part in the discussion actively.
ResearchGate is in fact an open classroom. Junior researchers like you can learn a lot through this medium.
Dear Mohammad, according to your uploaded work, what is the case when epsilon ~ U(-r, r), i.e. the error term follows a uniform distribution and not a Gaussian one?
Before questioning me, one should go through this sentence in my first response:
"But at the end stages of refinement, when the model is complete and has a small error, the maximum likelihood can be approximated by the least squares."
And, in a later response, this phrase:
"which shows in its last part when there is no difference between LSE and MLE."
I reiterate: if the residual variation is homoscedastic, independent, and Gaussian, then least squares (the LSE) is especially useful and usually yields the MLE. However, the value of the MLE is sometimes limited to large samples, because its small-sample properties can be quite unattractive.
The LSE is not enough when the relationships of interest to us are not linear in their parameters; an attractive LSE is then difficult, or even impossible, to come by.
As such, the linear model E(y|x) = xb may not be enough in a lot of cases. The conditional expectation is just one parameter of the distribution of y conditional on x. The idea of MLE is to base estimation of the parameters not on the conditional expectation but on the whole conditional distribution f(y|x). Therefore MLE, as a strategy for obtaining asymptotically efficient estimators, is THE PRINCIPAL ONE from a large-sample perspective.
Dhruba,
As I had said earlier, your question is a very basic one. But there are doubts and suspicions in this regard! Various questions are coming up as you can see. You need to participate in the discussion!
Actually, if the observations follow a normal distribution around the mean, the MLE is the same as the OLS estimate.
For the most frequently used probability models (I'd guess for all models, but I have no proof), the least squares estimate of a location parameter is IDENTICAL to the maximum likelihood estimate of this location parameter. Some distributions have no "proprietary" location parameter, but usually the model can be reparametrized with respect to the expected value of the distribution. It is clear for logical reasons that the MLE of the parameter representing the expected value is identical to the LSE. However, I showed it for some of the typical distributions in the attached document.
So I would make the statement much stronger: the LSE of a location parameter (such as the expected value, concretely the sample mean) is necessarily identical to the MLE of the same parameter. This is true for all(?) distributions, not only for the normal distribution.
The drawback of the LSE is that the precision of this estimate cannot always be taken from the sample variance as it can for a normally distributed variable. When the distribution is not normal, the shape of the likelihood is not symmetric around the MLE (or LSE), and the standard error loses its meaning, as the confidence interval will be asymmetric as well. Here, correct confidence intervals can only be obtained from the likelihood function.
Only the central limit theorem assures that the likelihood function approximates the normal distribution (as the sampling distribution of the respective statistic), and thus the standard error can again be used to assess the (approximate) precision of the estimate.
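A quick numeric check of this claim for one concrete non-normal case, the exponential distribution parametrized by its expected value (invented data): the grid value minimizing the squared residuals and the grid value maximizing the likelihood both land on the sample mean. The intervals around them are a different matter, as noted above.

# Sketch: for an exponential sample (non-normal), the LSE of the expected value
# and the MLE of the expected value are both the arithmetic mean.
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(2)
x = rng.exponential(scale=4.0, size=25)    # invented data, true expected value 4

mu_grid = np.linspace(1.0, 10.0, 5001)
ssr    = np.array([np.sum((x - m)**2) for m in mu_grid])
loglik = np.array([expon.logpdf(x, scale=m).sum() for m in mu_grid])

print("LSE :", mu_grid[np.argmin(ssr)])
print("MLE :", mu_grid[np.argmax(loglik)])
print("mean:", x.mean())                   # all three agree up to the grid resolution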
Jochen, hi. It is clear you have a favorite set of models: this is your first premise. Then you extrapolate it to all models as a "guess", without proof, to state LSE = MLE; later you admit that even without a "location parameter" the model can be "reparametrized" with respect to the mean of the distribution and declare that "it is clear for logical reasons", which I am not able to see. As a proof, you offer ten "typical distributions", which are theoretical-mathematical constructions that only prove your point for that set of models. And to close the theoretical discourse you invoke the central limit theorem to declare that "thus the standard error can again be used to assess the (approximate) precision of the estimate". I did not understand your logical sequence. Perhaps this is a problem requiring the help of experts in the epistemology of logic, mathematics and statistical models, like Deborah Mayo, Fausto and others.
I can design thousands of parametric distributions as my set of models, but that does not authorize me to recommend it as a general recipe for analyzing real data. Can you apply just one of your models to analyze the 40-point data set proposed? Or a mixture of them, if you prefer? If you want, let's do the inverse operation: you give me a 40-point data set and I will analyze it, assuming the points are representative of the sample sent. I use an alternative general method to do it without such assumptions, without standard deviations, predictors, errors of the mean, errors of the standard deviation, or confidence intervals. By the way, the means of each interval do not usually correspond to the midpoints of the intervals. That is decided by the model employed to represent your statistical curves over the dataset. OK, let's be self-critical, starting with myself. With due respect, emilio
Dear Jochen, all your examples are a subset of the exponential family of distributions. You could take the general form and do the work once, instead of doing it so many times. There is no more information among the different examples you presented. Anyway, you did do the work!
Respected @Fausto Galetto,
Can you tell me a little about your standpoint on why a censored sample causes problems for the LSE?
@Demetris: In fact, this probably would have covered the gamma as well. I chose these examples and made them explicit because they are very often used in my field of research (life sciences, biomedical research).
@Emilio: the arithmetic mean is the LSE, and it is the expected value.
@Fausto: I attached the solution to your example. It feels like doing your homework. You are right that the LSE and MLE are not identical here. However, there is a simple transformation that again makes them identical. Further, I never claimed that the LS method gives good approximate CIs, so it is a little silly to make a big issue of this. In fact, your example leads to very bad LS approximations of the CIs (see attached file). However, the large-sample approximation works. This is where the CLT necessarily shows up. And finally I'd like to note that for such asymmetric distributions I would prefer likelihood intervals over confidence intervals: the CI leaves the same tail area on both sides, leading here to very different likelihoods at the borders. Thus the data can have a considerably different likelihood for an estimate at the lower and at the upper bound of the interval, which I find kind of counter-intuitive.
Fausto, then set eta' = sqrt(eta) and estimate eta'.
Ok, for the values
0.288 0.140 0.553 0.308 0.203 0.636 0.390 0.162 0.323 0.400
(generated from a distribution with eta=0.3) the MLE is 0.373 and the 95% CI is from 0.291 to 0.559.
As you and I noted, the LSE is not the same and is only useful as a (very) large-sample approximation. The LS method can be used to estimate eta' on the transformed values Z = X², and eta is then obtained as (eta')².
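For readers who want to retrace this example: assuming (my reading of the context, supported by the Weibull link further down) that the ten values are treated as a Weibull sample with known shape k=2 and unknown scale eta, the closed-form MLE of the scale reproduces the 0.373 quoted above, and a numerical maximization of the log-likelihood agrees.

# Sketch (an assumption on my part): treating the 10 values as a Weibull sample with
# known shape k=2 and unknown scale eta, the closed-form MLE of the scale,
# eta_hat = (mean(x^k))^(1/k), reproduces the 0.373 quoted above.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.288, 0.140, 0.553, 0.308, 0.203, 0.636, 0.390, 0.162, 0.323, 0.400])
k = 2.0                                        # assumed known shape

eta_closed = (np.mean(x**k))**(1.0 / k)        # closed-form MLE of the scale

def negloglik(eta):                            # Weibull(k, eta) negative log-likelihood
    return -np.sum(np.log(k / eta) + (k - 1) * np.log(x / eta) - (x / eta)**k)

eta_numeric = minimize_scalar(negloglik, bounds=(0.05, 2.0), method="bounded").x
print(eta_closed, eta_numeric)                 # both are about 0.373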
.
see section 3.3 of the attached document
also notice in the introduction :
"MLE has many optimal properties in estimation: sufficiency (complete information about the parameter of interest contained in its MLE estimator); consistency (true parameter value that generated the data recovered asymptotically, i.e. for data of sufficiently large samples); efficiency (lowest-possible variance of parameter estimates achieved asymptotically); and parameterization invariance (same MLE solution obtained independent of the parametrization used). In contrast, no such things can be said about LSE. As such, most statisticians would not view LSE as a general method for parameter estimation, but rather as an approach that is primarily used with linear regression models."
also, the LSE is "BLUE" (Best Linear Unbiased Estimator)... which is a very nice property explaining the ubiquity of the LSE in the linear-model context (under further restrictive hypotheses, homoscedasticity notably)
review the Gauss-Markov theorem (and read the hypotheses carefully!)
http://en.wikipedia.org/wiki/Gauss%E2%80%93Markov_theorem
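A small simulation sketch of the unbiasedness part of that theorem (invented data satisfying the hypotheses: linear model, uncorrelated homoscedastic errors, here deliberately non-normal):

# Sketch: under the Gauss-Markov hypotheses the OLS estimator is unbiased,
# even when the (homoscedastic, uncorrelated) errors are not normal.
import numpy as np

rng = np.random.default_rng(3)
n, b_true = 30, 1.5
x = np.linspace(0, 10, n)
X = np.column_stack([np.ones(n), x])

slopes = []
for _ in range(5000):
    e = rng.uniform(-2, 2, n)                     # non-normal but homoscedastic errors
    y = 0.5 + b_true * x + e
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slopes.append(b_hat[1])

print(np.mean(slopes))                            # close to the true slope 1.5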
.
@Emilio, can you give us a set of tasks for the 40-point dataset that you have uploaded?
I want to analyse it with some of my methods, but please give us the objective.
Demetris: I was expecting each contributor to state the task of the analysis, just to see what they consider their main objectives, methods, tools, premises, etc.
My answer to your question is this: 1) Give a well-fitting smooth curve graph for the cumulative distribution function and its mathematical expression. If you have a smoothly fitting Lorenz curve, do the same. If possible, include the data points in the graph so the fit can be seen. 2) Briefly explain the main method and mention the data ordering employed. 3) If standard deviations or variances are used, please calculate them for 100 points obtained from the model with increments of 1/100 of the population, and explain any difference from the 40-point SD or variance.
I would like to see answers that do not use probability density functions, but if they are used, I expect a continuous mathematical expression for them. Given that the dataset is not normal, please do not send Gaussian formulas.
Thanks for your interest, emilio
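Not an answer to the challenge, but for readers who want to reproduce the basic objects requested in 1), here is a minimal sketch computing the empirical CDF and an empirical Lorenz curve from the 40 rounded values posted above; any smooth parametric expression would then be fitted to these points.

# Sketch: empirical CDF and empirical Lorenz curve for the 40 rounded values.
import numpy as np

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

xs = np.sort(x)
n = len(xs)
ecdf = np.arange(1, n + 1) / n                 # empirical CDF at the sorted values

# Empirical Lorenz curve: cumulative share of the total vs. cumulative population
# fraction (ascending order here; the thread also works in descending order).
lorenz = np.concatenate([[0.0], np.cumsum(xs) / xs.sum()])
pop    = np.arange(0, n + 1) / n

print("mean:", x.mean())                       # about 1000, as discussed above
for p, share in zip(pop[::10], lorenz[::10]):
    print(round(p, 2), round(share, 4))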
Fausto, your reaction is not helpful; I find it rather offensive. I'll stop the discussion here with this link: http://en.wikipedia.org/wiki/Weibull_distribution (noting that E(X) = lambda*Gamma(1+1/k)). I'll come back when you have changed your mode of action from "destructive" to "constructive".
.
for those interested, two references comparing the performance of MLE, LSE and others
1) for the exponential distribution (in French ... but with a long abstract in English) :
.
.
2) for the Weibull distribution (oh, well ... they do not tell much about LSE except that it performed poorly for small samples and concentrate on other estimators) :
.
Fausto, "all models are wrong, but some are useful". Thus one can always scream WRONG WRONG WRONG wherever you wish. I'd find it constructive to help others to understand where they can be severely wrong (in a pracitcal sence). But this requires to understand the aims of these researchers. For me, for instance, it would have been helpful if you would have said something like "for the exponential family of distributions this is the case because of this and that... but nut neccesarily for distributions that do not belong to this family. As an example take the Weibull. Because here one can not ... bla bla .." or "this is the case for some distributions which have this and this property, because then this and this is the case. For other distributions, where that an that is the case, this does not work because of these reasons... bla bla". Instead, I repeatedly hear you moaning that I (and others) don't answer your "questions" where we (well, at least I) do not understand what you actually want (e.g. I said that the estimators are not the same, and even if they are the same, then the CI can be considerably different, and that a large-sample approximation might require *really* large samples to be useful -- but this all ist not what you wanted to hear or what you seem to recognize...). What am I doing here? It is not worth the time invested.
@Fabrice. I read Professor Sambou's (Senegal) article. Very interesting for this debate. It says: "Empirical plots are unstable for low sample sizes, are sensitive to sampling, and are very difficult to explain. Analytical expressions for the asymptotic statistical properties of the two estimators are needed for realistic comparison." Well, that is what we are asking for here with the 40-point dataset. I have clues for handling those low-sample-size expressions requested by Dr. Sambou. OK, thanks, emilio
Dear Emilio
I am new to this topic, because I was traveling on vacation for a month through beautiful Italy ("la bellísima Italia"). I am very delighted with the interesting comments of most participants. I copied your data and computed the arithmetic mean: it is 1000.075 instead of 1000.
Your data is:
1 287 1043 1732
6 344 1130 1763
17 407 1216 1784
32 475 1299 1799
52 547 1379 1807
78 623 1455 1815
109 702 1525 1829
145 870 1645 1991
234 956 1693 2784
Am I doing something wrong?
Hello, Guillermo. I think you may be right, because I rounded the numbers after defining U=1000 and multiplying it by the dimensionless variable (in fractions of U). I used Excel and recalculated the mean, but I am not sure about the decimal format used then. Use your own mean, because it is not different enough to cause big precision problems in interpretation. If in your opinion it deserves severe criticism, I will gladly accept and work on it. Thanks for your interest, emilio
Guillermo, the data you show are 9*4 = 36 points. My original data are 10*4 = 40 points. Please check your transcription. emilio
Guillermo: the original data, with several decimals before rounding, were these:
2783.705475
1991.428297
1868.684519
1828.805339
1814.684284
1807.424134
1798.729361
1784.479055
1762.575404
1732.05301
1692.649012
1644.567605
1588.335851
1524.706428
1454.586666
1378.983731
1298.960719
1215.600832
1129.978062
1043.133402
956.0559273
869.6682351
784.8157987
702.2598213
622.673176
546.6390307
474.6517607
407.1197647
344.3698194
286.6526333
234.1492871
186.9782835
145.2029638
108.8390821
77.86236568
52.2159235
31.81739515
16.5657642
6.347785068
1.043996707
Their average is exactly 1000, but the average of the rounded 40 data points is 999.025. So I accept there is an error of about 1/1000 in the reported mean of the rounded values. I hope this clarification helps; thanks for your timely remark, and let's see its effects. Thanks, emilio
Dear Emilio
Thank you for your correction. Your original list is:
1 287 1043 1732
6 344 1130 1763
17 407 1216 1784
32 475 1299 1799
52 547 1379 1807
78 623 1455 1815
109 702 1525 1829
145 785 1588 1869
187 870 1645 1991
234 956 1693 2784
I copied and pasted with problems in my previous post.
I rounded your last list in Excel and obtained the same values as before. But their sum is 40003, so the mean is 1000.075. I do not think this is important, but I asked because at first I was worried by the difference.
Dear Fausto
I congratulate you on your guess of a distribution behind the sample. The Kolmogorov-Smirnov test gives 0.052552915, which is very far from the critical value for alpha = 0.05 (0.2150). The graph of Fe(t) vs F(t) is close to the identity (Fe = the empirical distribution function).
I recognize, however, that after reading the previous remarks in this topic, I had not guessed that the purpose of this challenge was to find a distribution function that works well. You were talking about the difference between the MLE and the LSE, so I share Jochen's perplexity. Maybe Emilio and you work in the same field, or very close ones, and can guess each other's intentions very quickly, but before beginning to compute the MLE, I need to know from which of the infinitely many distributions we are sampling. I think there may be many alternatives to the distribution you postulate that fit these data very well; the estimator of the mean would be similar to yours and to the usual LSE, but the MLE estimators of the remaining parameters will be different, because they will be other parameters.
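For those who want to retrace this test: assuming the fitted CDF is the F(t) = 1 - exp[-((t/1944.4)^0.6 + (t/1810.6)^9)] quoted later in this thread, the Kolmogorov-Smirnov distance to the 40 rounded values can be computed as sketched below. The exact value depends on the fitted parameters, and the standard critical value is only approximate when the parameters were estimated from the same data.

# Sketch: Kolmogorov-Smirnov distance between the 40 values and the fitted CDF
# F(t) = 1 - exp(-((t/1944.4)^0.6 + (t/1810.6)^9)) quoted later in this thread.
import numpy as np
from scipy.stats import kstest

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

def F(t):
    t = np.asarray(t, dtype=float)
    return 1.0 - np.exp(-((t / 1944.4)**0.6 + (t / 1810.6)**9))

res = kstest(x, F)                 # D should be close to the 0.0526 reported above
print(res.statistic, res.pvalue)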
Guillermo, I repeated the calculations and obtained the same result as yours: 1000.075. I will use it to build the cumulative distribution function (CDF) model that gives a smooth fitting curve to the 40-point dataset. I will show it here once it is ready. Thanks for your contribution on correcting this point. It teaches me, once again, that we must let the data speak for themselves; in this case I wrongly used the model's predesigned mean instead of the dataset mean.
@Fausto. Thanks for your answer. I have two questions about your curve: 1) Did you order the data from top to low, or in ascending order? 2) Could you explain, with one numerical example using one of the 40 points, the F(t) obtained? Is F(t) a cumulative distribution function, a probability density function, or something else? Your answers will help me understand the method that produced your fitted model for this particular dataset. My regards to you and all followers, emilio
Hi dear Fausto, I usually order the data from top to low values of the X variable, which I call K because the mathematics becomes easier for me to handle; this affects your population order as a concept. It seems that your variable t means the cumulative fraction of the population,
K(t | K >= Ko), but I am not sure (if the X order is ascending, then I must transform your equation to F(z), where z = 1 - t for the same value of F, getting the points (zi; Fi)). All I want is to be sure about your model F(t) = 1 - exp[-((t/1944.4)^0.6 + (t/1810.6)^9)] before working it out in Excel and observing how it fits the data points. So please give one value of t and its F(t), which I suppose corresponds to one of the dataset points I gave. If I am wrong, just tell me what t and F(t) mean to you. I am not questioning anything; my interest is to interpret you properly when I graph your expression F(t). In this case my problem (not yours) is that I use different names from yours for the horizontal and vertical variables of the curve, and a different ordering method. The data table is fine as you showed it. Thanks a lot, emilio
Dear Fausto, look at the graphs where I have plotted the empirical CDF approximation from R, plus a natural spline of n=201 points, together with your approximation of F(x).
I have some objections around the value ~1800 (see the second plot): your F(x) seems to 'get stuck' there.
Another objection is the value of -0.4781865491e-29 = -0.4781865491*10^(-29) in your formula:
F(t)=1.-1.*exp(-0.1063478300e-1*t^0.6-0.4781865491e-29*t^9)
It seems to me to have been produced from an ill-conditioned matrix.
Any details?
Dear Fausto, another issue with your proposed CDF is that the corresponding PDF has a local maximum near ~1800, see the plot. Shouldn't it have its maximum around Mean(X)=1000 instead of 1800?
I think this conversation is too passionate for me, since nobody can upload anything without being downvoted by somebody!
So, weighing the pros and cons, I am leaving. Bye!
PS By the way, Fausto: I am not a normal-distribution lover, and if you look at my work I am a totally non-parametric scientist, so I wouldn't do any kind of regression in order to present a formula for the above data. But if you want to label everybody, it's OK, no problem, we (still) live in a world with freedom of writing. Ciao...
I want to give a short statement on why I don't play this game here: the "empirical estimation of a CDF" from a given set of data, without knowing ANYTHING about the kind of data, possible underlying mechanisms, scientific background... is, in my opinion, really unscientific. What can we learn from discussing any (parametric!!) fits of the empirical CDF? If this data is all you know, then use it. Take the empirical CDF as is. If you are going for inference: bootstrap. If you know that the data are related to particular processes, possibly from reliability experiments, you can go and check whether a particular (composite?) CDF from this field fits well and thus tells you something more about the data. But this information was not given, and it would also require being an expert in this particular topic (which is not the case for most of the readers here).
Further, the question about the mean is, in my (!) opinion, not very sensible here. The distribution is clearly bi- (tri-?) modal (see the attached picture of the estimated [and smoothed] density curve). If you want a mean, you can bootstrap it. The result I got is 999 with a 95% CI from 772 to 1230 (giving equal weights to all observations and taking random samples of 40 with replacement).
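A minimal sketch of exactly this bootstrap (equal weights, resampling the 40 values with replacement, percentile limits); the numbers will vary slightly with the random seed.

# Sketch: nonparametric bootstrap of the mean of the 40 values (equal weights,
# resampling with replacement), with a percentile 95% interval.
import numpy as np

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

rng = np.random.default_rng(4)
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(10000)])

print("mean:", x.mean())
print("95% percentile CI:", np.percentile(boot_means, [2.5, 97.5]))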
And funnily I have the feeling that the "circle of statistical quality" is confused about his own problem and solution... but likewise here :) I am keen to learn how this all will be resolved. Thank you.
By using only the empirical CDF, interpolated by splines (N=1001), and with iterative use of the Extremum Surface Estimator (ESE) and Extremum Distance Estimator (EDE) from the R package 'inflection', we can find the two critical points:
mu[1]=1022.281
mu[2]=1809.774
No assumption at all.
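Not the 'inflection' package itself, but a rough sketch of the same idea in Python: interpolate the empirical CDF with a cubic spline and look for sign changes of its second derivative. An interpolating spline through all 40 points is wiggly, so this lists several candidate points; the iterative ESE/EDE step described above is what narrows them down to the two reported values.

# Sketch: candidate inflection points of a spline-interpolated empirical CDF
# (a rough Python analogue of the approach described above, not the R package).
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([1, 6, 17, 32, 52, 78, 109, 145, 187, 234,
              287, 344, 407, 475, 547, 623, 702, 785, 870, 956,
              1043, 1130, 1216, 1299, 1379, 1455, 1525, 1588, 1645, 1693,
              1732, 1763, 1784, 1799, 1807, 1815, 1829, 1869, 1991, 2784])

xs = np.sort(x)
Fhat = np.arange(1, len(xs) + 1) / len(xs)        # empirical CDF at the data points
spl = CubicSpline(xs, Fhat)                        # interpolating spline of the ECDF

grid = np.linspace(xs[0], xs[-1], 1001)
d2 = spl(grid, 2)                                  # second derivative on a grid
sign_change = np.where(np.diff(np.sign(d2)) != 0)[0]
print(grid[sign_change])                           # candidate inflection points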
Nice solution, Demetris. But shouldn't the 3rd derivative at the *used* inflection points be positive (-> maximum in the density)? This is not the case for the point at 1022, which reflects a local minimum in the density. Would this make sense? An earlier "inflection point" should/could be close to 0. This gives the same solutions as seen in the density curve.
Using splines, there should be a good solution with 3-4 knots only (rather than 1001). One can additionally use the constraint that the derivative at the largest quantile must be 0.
Jochen, that's the reason why I called them 'critical points': the interpretation is another story... The choice of N=1001 knots was made just to increase the accuracy of the ESE and EDE methods.
Fausto, I told you that I don't want to do regressions, but this does not mean that I cannot do them. See a very simple symbolic regression with HeuristicLab 3.3. I don't find it very good, but I don't want to spend time improving it. So what?
Hi all of you, thanks for your participation. I include here an Excel file that you may study slowly for my interpretation and graphs of the dataset. It is necessary to handle Excel well to follow its logic and the graphs included. It is based on these main premises:
1) The Laplace criterion (each data point is the median of an interval with frequency 1/40).
2) From 1) we may infer/derive the Lorenz curve (Xcum; Lcum). Note: I use a descending order, and I call the dataset variable K and the cumulative population fraction X.
3) There is a structural function W(x) that defines the whole model, after adding a premise for the lowest K value: at x=1, K(1)=W(1) is the minimum.
4) The W model obtained from the computer's least-squares method for 5 selected points was W(x) = -0.6617x^4 + 2.2448x^3 - 1.8246x^2 - 0.4956x + 0.736.
5) From 4) we obtain L(x) = x^W(x).
6) K>=(x) = L(x)*(W(x)/x + ln(x)*W'(x)), where W'(x) is the derivative of W with respect to x.
If you want to express this in the standard ascending order, you must make the adequate transforms for all graphs, with z = 1 - x as the cumulative fraction of the population for descending order, etc. This is long to do and I leave it to those interested.
The main goal is to show the model that results from a different method that only uses datasets as they come. It is not necessary to use standard deviations or a priori models. It lets the data speak for themselves.
I think it is enough for the moment. If you have questions please ask them and I will try to answer them. Thanks, emilio.
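To make the construction easier to retrace, here is a small sketch that simply evaluates the formulas as posted: the fitted W(x), the Lorenz curve L(x) = x^W(x), and K>=(x) = L(x)*(W(x)/x + ln(x)*W'(x)), at a few values of the cumulative population fraction x (descending order). The resulting K values are in fractions of the mean, so multiplying by roughly 1000 puts them on the scale of the posted data; this only restates the formulas above, it is not a check or an endorsement of the method.

# Sketch: direct evaluation of the posted model (descending order, x = cumulative
# population fraction): W(x) fitted by least squares, L(x) = x^W(x), and
# K(x) = L(x) * (W(x)/x + ln(x) * W'(x)). K is in fractions of the mean (~1000).
import numpy as np

coef = [-0.6617, 2.2448, -1.8246, -0.4956, 0.736]   # W(x), coefficients from x^4 down
W  = np.polynomial.Polynomial(coef[::-1])            # Polynomial wants ascending order
Wd = W.deriv()                                        # W'(x)

def L(x):                                             # Lorenz curve
    return x**W(x)

def K(x):                                             # the modelled variable
    return L(x) * (W(x) / x + np.log(x) * Wd(x))

for xi in (0.1, 0.25, 0.5, 0.75):                     # x = 0 is excluded (log and 1/x)
    print(xi, round(L(xi), 4), round(K(xi), 4))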
Emilio, why did you choose that model for W:
W = b4*x^4 + b3*x^3+b2*x^2+b1*x + b0+error?
Demetris, I did not choose that model shape; I just let the computer assign the W function by using the polynomial trendline feature of Excel. There are many possible solutions for the same points, as Fabrice observed some posts ago. As a matter of fact, I predesigned W using a function of the shape W = a + bx + cx^2 + dx^n, where n is a real fractional number. I used trial and error just to get the median quite close to the mean, just to show that this can happen with non-normal, non-symmetric distributions. OK, emilio
Dear Emilio, the solution that you gave us is a model-dependent solution and a purely regression-based LSE one. By changing the above class of functions from polynomial to something else you could obtain different results: that's the big weakness of all model-selection solutions. In order not to be in the situation of apologizing for why this class of model and not another, we could instead work without any model assumption and obtain our critical values (whatever they are: moments, etc.) directly from the empirical CDF. Anyway, you did a good job, although it was done using a specific model.
Demetris, of course LSE is very useful for these cases. I agree it is possible to study functions other than polynomials, but the main point is not W(x), it is the combined functional structure of the CDF. My main interest is to change statistical education for young people towards something simpler, understandable and teachable, with theoretical fundamentals. I would like to know more about the infinite number of functions that are possible inside a 1x1 square box. I have made trials with W = a(1-bx)^n and other shapes, to design but not to fit datasets. But each family of functions has its own limits, and you need to check that the descending-order premise does not break. OK, thanks for your good comments, emilio
Dear Friends
For the record, I want to mention that I tried to fit Emilio's data with a four-parameter beta distribution and obtained its MLE estimators in SAS, but this distribution turned out much worse than the one obtained before by Fausto.
I think the problem is not resolved yet, because, for example, Fausto gave us the MLE estimators of his proposed distribution, but we do not have its LSE, nor the properties of either, and so they were not compared. We know that asymptotically the MLE is efficient, but for a sample of size 40, who knows which is better? It may be that in this particular example the LSE is as good as, or better than, the MLE.
Dear Fausto. The dataset was first sent 12 days ago (February 29, I guess); 3 days ago (March 8) I gave the data with several decimals, before rounding, to answer Guillermo's observation about the exact average of the data sent. You may confirm this by looking at all the answers. I remember working with "K ave Ch" to answer Guillermo, but I erased those tables and graphs made with the wrong average and made them again with Guillermo's corrected value during the last 2 days. I hope this helps, emilio
Fausto, my results with the initial mean "K ave Ch" were very close to the corrected ones shown one or two days ago. I repeated them to accept Guillermo's observation. Your estimates were close at the extremes, the lower and the middle points, but somewhat distant elsewhere. In your analysis you employ a function that gives the cumulative population as a function of the variable (a kind of inverse function); I graphed X vs T (cumulative population) and it was different. But that is not a big problem in my opinion, because you contributed a proposal and took the risk. Have you observed that I do not need "estimators" of any kind? The main point is that methods determine and produce models that do not fit the data in many cases, something that you have made clear, in my opinion. Demetris and Guillermo rejected my solution, saying that I used the least-squares method to produce the structural function W from the data, which is only the resulting model for a small part of my analysis. Well, they are not the owners of that method, to decide who may use it or not. But if we try to use the same LS method for the CDF of the given dataset and the middle points of the intervals, we will find that it does not work: the curve becomes sinusoidal and fits only a few points (this problem increases if you use too many points and very high polynomial degrees), so the R² indicator falls.
I believe that this discussion will be important for improving the road toward better models, if models are seen as by-products of methods, understanding that methods have limits of application according to the datasets.
Thanks for the instant and all your support, emilio
Emilio, you state you have a "model". Could you explain in words what this model means? (Honestly, I just do not understand it and hope for your help.) What can we see in this model, what can we learn from it? What is the practical use? (I hope you get the intention of these questions.)
Further, given your data, if one had to make a prediction about future observations: what do you predict? And with what confidence or credibility?
Or am I asking the wrong questions? I have to admit that I have no clue about the aims and interpretations of this analysis at all. (As I said above, I do not see the point in "describing a curve" (call it "fitting" if you want) that follows an f_cum vs. quantile plot, or vice versa.)
Jochen, your ironic questions and comments are not conducive to an open debate. That is not a good example for young researchers, and we should not use RG as a ring for ego confrontations. Please read my last answer to Fausto, where I treated some of your points, and have another look at the Excel file I sent two days ago. After a pause, post your concrete questions and answers, and your own proposed curves for modelling and interpreting the 40-point dataset, and I will be glad to consider them. Thanks, emilio
Emilio, that's not fair. My questions are NOT ironic. I really do not understand and would appreciate your help. I have a solution fitting your 40 points, slightly closer than Fausto's solution, but with 9 parameters (Fausto used fewer). But this I did for fun, not because I consider my solution useful for the problem.
Once again, I said that I fitted a curve, and sure, I used ML to find the "best" parameter values. This is not a simple interpolation (that would have been better, at least going perfectly through all the points, don't you think?!). And as I said, I estimated 9 parameters from the data. I know that these are many, and a good fit with only 4 parameters is better than a slightly better fit with 9. So please do not blame me for that.
So here is the Excel sheet. Take it or leave it. For my convenience, the formula is divided into several parts. It would be cumbersome to put it together into one formula, but it is possible. The calculations are all in the table; feel free to do it.
There will be 1001 formulas that can fit the points reasonably well. Now we have two of them. I don't see the point in doing this. To say it again (!): having only the data, without any concept or theory about how the data were generated, will not allow any good analysis. It's worthless. If parameters are to be estimated, then they should have a meaning, and this meaning is not provided by the data.
Dear Emilio
I do not reject your solution.
I only said that we have not yet answered the question that originated all this interesting matter: is the MLE the same as, better than, or worse than the LSE?
I see that your proposal fits your original data very well, but I share some of Jochen's questions, because I do not see the parameters in your formula for the distribution function. For sure, I am not the owner of any of the methods discussed here. I only hope to use them well enough to solve my statistical problems.
OK Guillermo, excuse me if things are as you explained. If you look at the second graph in my Excel file, you may observe that it contains five data points for the function W(x) and also a polynomial equation of degree 4 obtained by using the trendline function of Excel. This function contains the coefficients (a, b, c, d, e) of W = a + bx + cx^2 + dx^3 + ex^4, a function fitted to the 5 chosen points, with R² = 1 according to Excel. I use that expression, plus its derivative, in the section on the dimensionless model to compute values at each data point for W(x), the Lorenz curve L(x), K ave(x), and K>=(x). You may check the formulas used directly in the file and observe them in the graphs made from the formulas. Remember that I order the data from top to low values of the dataset variable K.
This ordering has a close relationship with the commonly used ascending order, and you can obtain the corresponding graphs using z = 1 - x for the population, plus the same K value, and then graphing (z; K). But if you want the Lorenz curve you must use (zi; 1 - Li) to obtain the proper transform and graphs for the ascending data order.
Observe that I do not use predictors, estimators, standard deviations, or a priori values of functions. I only use the Laplace criterion: the data are the means of smaller intervals of frequency 1/N, as I understand it.
If you try to make a regression directly with the least-squares method for (X, K), the program produces oscillating curves that fit 5 points but behave badly between each pair of data points. But if you apply it to a simpler structural function such as W(x) at the preliminary stage of the analysis, then you obtain good results by using the formulas derived from mathematical analysis and the properties of the Lorenz curve. I hope I was clear; if not, I will try other ways another time. Thanks, emilio
Dear Emilio, I think you misunderstood me. First of all, I don't want, and do not have the power, to dictate what somebody uses in his/her analysis. Secondly, I think we are all scientists working with different methodologies, and it is not necessary to agree; we are just exchanging opinions here on RG. Finally, I have some remarks:
1) The question was about MLE vs LSE. What did the data you provided contribute to that question?
2) You took the ratio of every value divided by the mean value (~1000). What is the justification for doing this? Why not divide by the median, for example?
3) At the end of the day, after doing your analysis, what can you tell us, practically, about the given data? I mean, what is the practical advantage of the analysis presented in your Excel file?