Suppose a non-linear smooth function is fitted to some data (e.g. means and standard errors for cell survival after various radiation doses). What are some useful ways to assess goodness of fit for the model, without comparing to other models?
Basically, "fitted to some data" describes a particular optimization problem that has been solved, which I guess is a least-squares minimization in your case. Goodness of fit then translates into the distance of your solution from the global optimum solution.
I would suggest estimating the confidence intervals of your parameters. Tight intervals and identifiable parameters mean that you have a "good" model, suitable e.g. for prediction and control/therapy purposes. To this end, the uncertainty of your data needs to be described and taken into account (e.g. using set-based methods).
If you have non-identifiable parameters, then the model is over-parametrized.
A model should always be as simple as possible, but no simpler.
You inserted the qualifier "without comparing to other models" which really limits the possible answers. I think you're asking for something equivalent to the R-squared or other effect metrics that are in standard use in linear modeling. If your error distribution is approximately normal, then the standard metrics can be used although curve fitting like you're describing is prone to overfitting and would necessitate something like a cross-validated assessment of the metric.
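To make the cross-validated assessment concrete, here is a minimal sketch (Python with NumPy/SciPy; the linear-quadratic survival model and the dose/survival/standard-error arrays are purely illustrative, not from the original question) of a leave-one-out cross-validated R-squared:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical linear-quadratic cell-survival model: S(D) = exp(-a*D - b*D^2)
def lq_model(dose, a, b):
    return np.exp(-a * dose - b * dose**2)

# Made-up example data: dose (Gy), mean surviving fraction, standard error
dose = np.array([0.0, 1.0, 2.0, 4.0, 6.0, 8.0])
surv = np.array([1.00, 0.70, 0.45, 0.15, 0.04, 0.01])
se   = np.array([0.02, 0.05, 0.05, 0.03, 0.01, 0.005])

# Leave-one-out cross-validation: refit without each point, then predict it
cv_pred = np.empty_like(surv)
for i in range(len(dose)):
    keep = np.arange(len(dose)) != i
    popt, _ = curve_fit(lq_model, dose[keep], surv[keep],
                        sigma=se[keep], p0=[0.3, 0.04], absolute_sigma=True)
    cv_pred[i] = lq_model(dose[i], *popt)

# Cross-validated R^2 (can be much lower than the in-sample R^2 if overfitting)
ss_res = np.sum((surv - cv_pred)**2)
ss_tot = np.sum((surv - np.mean(surv))**2)
print("LOO cross-validated R^2:", 1 - ss_res / ss_tot)
```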
You can always just plot the fit against the raw data to judge the goodness of fit - just use your eyes!
I think some criteria for semi- and nonparametric models could be applied to your model, such as the average squared error, the mean average squared error, the integrated squared error, the average predictive squared error, (generalized) cross-validation and so on. These criteria do not require comparing with alternative models. For details, see Fahrmeir and Tutz, 2001.
If the experiment includes replications (independent trials with the same predictors, e.g. the same radiation dose), calculating the pure error is usually useful. It is an easy calculation if you are fitting by least squares, and the pure error is the minimum residual error any regression function can achieve.
In a sense, though, it is a comparison to a model: the so-called saturated model, in which there is a separate prediction for each set of predictor values used in the experiment.
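A minimal numerical sketch of the pure-error and lack-of-fit calculation (assuming Python with NumPy/SciPy; the replicated survival readings and the fitted-model predictions below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical replicated measurements: three survival readings per dose
dose = np.array([1, 1, 1, 2, 2, 2, 4, 4, 4], dtype=float)
surv = np.array([0.72, 0.68, 0.70, 0.46, 0.44, 0.48, 0.16, 0.14, 0.15])
pred = np.array([0.71, 0.71, 0.71, 0.45, 0.45, 0.45, 0.17, 0.17, 0.17])  # fitted-model values
n_params = 2  # number of parameters in the fitted nonlinear model

# Pure-error SS: scatter of replicates around their own dose-group mean
# (this is the residual SS of the "saturated" model with one mean per dose)
groups = [surv[dose == d] for d in np.unique(dose)]
ss_pe = sum(np.sum((g - g.mean()) ** 2) for g in groups)
df_pe = sum(len(g) - 1 for g in groups)

# Residual SS of the fitted model and its lack-of-fit component
ss_res = np.sum((surv - pred) ** 2)
df_res = len(surv) - n_params
ss_lof = ss_res - ss_pe
df_lof = df_res - df_pe

# Lack-of-fit F test: a large F (small p) suggests the model misses real structure
F = (ss_lof / df_lof) / (ss_pe / df_pe)
print("F =", F, " p =", stats.f.sf(F, df_lof, df_pe))
```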
Pearson's chi-square test for goodness-of-fit and Fisher's F-test for the number of parameters. Norman Albright in Berkeley studied all the robust fitting procedures for survival curves. See Radiation Research 1987 Nov;112(2):331-40.
Dear H.E. Lehtihet, if the model used for fitting is given (and not compared to other models, as stated by the OP), and your objective function is fixed too, then the best you can do is to compute the global optimal solution. The question discussed here, as far as I understood the OP, is which criterion to use to qualify the fitted model, and we have seen several useful suggestions here. Hope this helps you, best
Thank you for your last clarifications. The reason I asked for them is the ambiguity of your first answer, which seemed to mix up the best-fit parameters (the solution of the optimization process) and the goodness of fit (GoF) of the model.
When fitting data, evaluating the GoF is almost never a trivial task. Even in the linear case there are some issues, as can be read in the following paper:
http://arxiv.org/pdf/1008.4686v1.pdf
In the nonlinear case the problem becomes much more complicated and, of course, is not free of issues either (see, for example, the following paper regarding the use of R-squared).
In the case of the question discussed in this thread, the problem is even more complicated, given the fact that Igor has imposed (as you have said) quite stringent additional constraints. Here, Michael has pointed out what I believe is the major difficulty, namely: the cross-validation. Unfortunately, such a difficulty cannot be eliminated using a simple metric such as the distance to the optimal solution. Such a technique is equivalent to the "chi-by-eye" technique mentioned by Michael in his side remark.
Thank you everyone for your suggestions! Especially useful are the references from Marco and HE. I will read them and ask more details as soon as possible.
In essence, I asked this question because I am interested in the following: how to formally (not only subjectively) tell whether or not a model fits the data reasonably? This is a different question from "does model A fit better than model B".
If model fits to the same data are compared, AICc seems like a good method. But if there is only one model, AICc will not help. Is reduced chi-squared a good choice for goodness of fit assessment for one model? For example, the paper (http://arxiv.org/abs/1012.3754) claims that it is not. I would appreciate your suggestions!
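For reference, the reduced chi-squared itself is trivial to compute once predictions, data and standard errors are in hand (a Python sketch assuming Gaussian errors; the interpretation caveats raised in the paper cited above still apply):

```python
import numpy as np

def reduced_chi_squared(y_obs, y_pred, sigma, n_params):
    """chi^2 per degree of freedom; a value near 1 is conventionally read as a
    'good' fit, but only if the sigma values are realistic and errors Gaussian."""
    chi2 = np.sum(((y_obs - y_pred) / sigma) ** 2)
    dof = len(y_obs) - n_params
    return chi2 / dof
```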
Indeed, both AIC and BIC are useful when you have several models and thus will not be of much help in your case.
Regarding the use of the reduced chi2, besides the issues you point out, this tool would not be applicable anyway if the errors do not follow a Gaussian distribution, as already underlined in the paper you cite.
The following paper might be of some interest to you, although it does not offer an answer to your original question.
Thanks again for your comments and reference! So, what would you suggest as a simple way to estimate goodness of fit for a single nonlinear model? Perhaps some Monte Carlo based methods?
Are you asking me this question because you know from our previous discussions in other threads that I like MC-based methods ? (LOL !!!)
More seriously, I don't feel competent to give a knowledgeable answer to your original question on absolute GoF testing. This is like your other interesting but difficult question about the case of a small number of data points; I have followed that thread from the beginning without contributing, except for the side remark about the hitchhiker.
Non-parametric MC-based methods will help you get confidence limits for your parameters but they already assume that your model is good, so I don't think they can be used as a reliable absolute-GoF test.
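For completeness, here is a minimal sketch of the kind of resampling meant here (a case-resampling bootstrap, assuming SciPy and a generic `model(x, *params)` function; all names are illustrative). It yields percentile confidence limits for the parameters, but, as said, it presumes the model form is adequate:

```python
import numpy as np
from scipy.optimize import curve_fit

def bootstrap_param_ci(model, x, y, sigma, p0, n_boot=2000, seed=0):
    """Case-resampling bootstrap 95% confidence intervals for fitted parameters."""
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample data points with replacement
        try:
            popt, _ = curve_fit(model, x[idx], y[idx], sigma=sigma[idx], p0=p0)
            estimates.append(popt)
        except RuntimeError:                      # skip resamples where the fit fails
            continue
    estimates = np.array(estimates)
    return np.percentile(estimates, [2.5, 97.5], axis=0)  # percentile intervals
```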
The only thing I can suggest would be to use several different absolute-GoF indices. If you manage to get an acceptable score for each of these indices, then you could conclude with some confidence that your model is indeed good. The problem is that not all of them might be applicable in your case.
For some Abs-GoF indices, see the excellent paper "Structural Equation Modelling: Guidelines for Determining Model Fit" by Hooper, Coughlan and Mullen (2008).
If you don't find it, I can send you a pdf version.
A very quick and efficient solution is simply to compute Y(est) = f(X), where f is the nonlinear model of interest, X the independent variable(s), and Y(est) the model's estimate of the observed variable Y(obs). The goodness of fit of the model can then be immediately estimated in terms of the Pearson correlation coefficient between Y(est) and Y(obs). Moreover, since we expect the best model linking Y(est) and Y(obs) to be simply Y(est) = Y(obs), i.e. a line with intercept 0 and slope 1, the best linear fit between Y(est) and Y(obs) is informative: an intercept significantly different from zero indicates a systematic effect that was not taken into account, while a slope different from unity suggests that a regressor was omitted or that the order of the model is wrong.
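A short sketch of this predicted-vs-observed check (assuming Python with NumPy/SciPy, and arrays `y_obs` and `y_est` already computed from the fitted model):

```python
import numpy as np
from scipy import stats

def pred_vs_obs_check(y_obs, y_est):
    """Correlate observed vs model-estimated values and regress one on the other.
    Ideally r is close to 1, the intercept close to 0 and the slope close to 1;
    a nonzero intercept hints at a systematic offset, a slope != 1 at a missing
    regressor or a wrong model order."""
    r, _ = stats.pearsonr(y_est, y_obs)
    reg = stats.linregress(y_est, y_obs)
    return {"pearson_r": r,
            "intercept": reg.intercept, "intercept_se": reg.intercept_stderr,
            "slope": reg.slope, "slope_se": reg.stderr}
```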
With the additional clarification, I still think that the best way is the weighted chi-square goodness-of-fit test. In robust fitting procedures, when estimating the experimental errors is difficult, the experimental uncertainties are multiplied by the normalized chi-square to reduce the weighted chi-square of the fit. Of course this procedure does not apply if you want to test the model (rather than the quality of the data). Hope this helps, interesting discussion!
The index you propose is reminiscent of the piece-wise GoF indices mentioned in one of the references I gave previously. However, besides the fact that I do not know which assumptions your approach implicitly makes about the error distribution (and perhaps also about the (in)dependence between the errors at distinct data points), I suspect that your technique would tend to downgrade an otherwise quite acceptable fit, perhaps more so in the case of a large number of data points.
On the other hand, you can still test somewhat the consistency of your approach. For example, you could check the literature for some benchmark data, fitting models and AIC rankings. Then, you could simply apply your own approach on those same models and data to see if you manage to get, at least, the same ranking.
I am not a statistician. The formula I wrote certainly assumes that the errors are independent and Gaussian. Probably it also implicitly assumes lots of other things which I am not aware of.
In general I think that if the value generated by this formula is high (close to 1), this suggests that model predictions are (on average) within the range of the error bars of the data. However, systematic defects in model predictions (for example if the model consistently overpredicts the data by a small amount, so all the residuals are small but positive) will be missed. To check for such things, the approach suggested by Alessandro (doing a linear regression of the predicted values vs the data and checking how the intercept and slope differ from 0 and 1, respectively) sounds reasonable to me.
However, if the formula generates a small value (e.g. close to 0), this would suggest that the model predictions are, on average, well outside the error bars of the data, i.e. that the fit is poor.
"Why do you think sample size should affect the results a lot?"
My statement was simply based on a crude estimation of the behavior of your proposed GoF. It seems to me that a bad fit at a single location, no matter how good the fit might be elsewhere, would considerably downgrade the overall score given by your GoF. In the case of a large number of points, the probability that such spurious points exist does not decrease.
Usually, GoF indices are the result of some averaging operation that accounts for all the data points but is not too sensitive to how good or bad the fit is at any single point.
Thanks again for your answer! I have the following thoughts:
1. Perhaps the effect of "outlier" data points on any GoF index is easiest to test by bootstrapping methods, i.e. by seeing how sensitive the GoF of the proposed model is to perturbations of the data set?
2. To "discourage" the model from small systematic deviations (e.g. from overestimating all data points by a small amount), perhaps an easy way is to multiply the GoF index by the binomial coefficient n!/(k!*(n-k)!), where n is the number of data points and k is the number of positive residuals? For large n this can of course be approximated (a tiny sketch is below).
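To make point 2 concrete, a tiny sketch of such a "balance factor" (plain Python; the normalization by the maximal coefficient is my own choice here, so that the factor lies in (0, 1]):

```python
from math import comb

def residual_balance_factor(residuals):
    """Binomial-coefficient weight that is largest when positive and negative
    residuals are equally frequent, and small when the fit lies systematically
    above or below the data. Normalized so the balanced case equals 1."""
    n = len(residuals)
    k = sum(r > 0 for r in residuals)
    return comb(n, k) / comb(n, n // 2)
```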
As always, would be grateful for input from anybody interested!
1) Indeed, bootstrapping may give you some information regarding sensitivity. However, I don't see how this information can be used subsequently to assess the goodness of fit.
2) I don't think so (or perhaps I do not understand exactly what you mean). A fitting procedure includes two phases: (A) obtaining the best-fit parameters for the selected model; then (B) evaluating the GoF index for the resulting fitted model. The first operation is an optimization, whereas the second is an evaluation. Therefore, if you want to 'discourage' the model or include any bias, you can do so, but only in phase (A), to help guide the optimization process. However, if I understood correctly, you intend to include a modifier in phase (B) and not in phase (A). By doing so, you will of course modify the evaluation, and thus the score of the fitted model, but you will not modify how this model was obtained in the first place.
Thanks again for your reply! I will try to be more clear about the points above:
1. I was thinking, for example, about the following situation: suppose perturbing the data set by bootstrapping shows that good fits of the model are obtained whenever a particular data point happens to be excluded, but that when that point is included the fits are much worse. This could be an argument for saying that the model is generally not too bad for this data set, but that one point happened to be an outlier, perhaps by chance, or perhaps because it represents some yet unexplained effect. Does this make sense? (A small sketch of what I mean follows below the list.)
2. Perhaps a clearer way to write what I meant is the following: Suppose there are n data points. The goal of the fitting procedure is to minimize some function G(n) = SUM[ g1(i)-g2(i), i=1..n ], where, for example, g1(i)=ln[(f(i)-y(i))^2/s(i)^2] and g2(i)=ln[(i)!/[(k(i))!*(i-k(i))!]], y(i) are measured data, s(i) are standard deviations, f(i) are model predictions, and k(i) is the number of positive residuals. The goal of using this would be to "strongly encourage" the model to go through the "middle" of the data (i.e. to have equal numbers of positive and negative residuals) and discourage systematic deviations.
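A minimal jackknife-style sketch of the point-deletion check in point 1 (assuming SciPy, a generic model function and a generic GoF function; all names are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def loo_gof_sensitivity(model, x, y, sigma, p0, gof):
    """Refit the model with each data point left out in turn and return the
    resulting GoF values computed on the remaining points; one entry that is
    much better than the rest flags a point whose inclusion dominates the fit."""
    scores = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        popt, _ = curve_fit(model, x[keep], y[keep], sigma=sigma[keep], p0=p0)
        scores.append(gof(y[keep], model(x[keep], *popt), sigma[keep]))
    return np.array(scores)
```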
Once again, would be grateful for input from you and anybody else interested!
1) Yes it does make sense as you are not rejecting the possibility that the outlier might actually be due to "some yet unexplained effect."
2) My objection concerned the use of a modifier in phase (B) as a way to bias the model. On the other hand, as long as you are working in phase (A), i.e. the optimization, you can introduce modifiers to guide the optimization process and to promote the best-fit solution according to what you think is desirable. This is like modifying the optimization criteria. A simple example can be given in linear fitting when minimizing sum(e²). One drawback is the "square effect": any outlier that happens to be very far off will strongly attract the best-fit solution, which will end up being offset with respect to most data points. One way to "discourage" this undesirable effect is to use sum(|e|) instead, but at the price of greatly complicating the computation (non-smooth optimization). In any case, this modification is done in phase (A) and not in the next phase (evaluation of the GoF).
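A small illustration of that phase-(A) modification (assuming SciPy; a straight-line fit with one gross outlier, comparing the sum-of-squares and sum-of-absolute-deviations criteria; all data are made up):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: straight line y = 2x + 1 with noise and one gross outlier
x = np.linspace(0, 10, 11)
y = 2 * x + 1 + np.random.default_rng(1).normal(0, 0.3, size=x.size)
y[5] += 15.0                                     # the outlier

line = lambda p, x: p[0] * x + p[1]
sse  = lambda p: np.sum((y - line(p, x)) ** 2)   # sum(e^2): pulled toward the outlier
sae  = lambda p: np.sum(np.abs(y - line(p, x)))  # sum(|e|): much less sensitive

fit_l2 = minimize(sse, x0=[1.0, 0.0])
fit_l1 = minimize(sae, x0=[1.0, 0.0], method="Nelder-Mead")  # non-smooth objective
print("L2 slope/intercept:", fit_l2.x)
print("L1 slope/intercept:", fit_l1.x)
```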
My next questions concern the following hypothetical situations:
1. Suppose a customized fitting procedure is used (e.g. the one minimizing the G(n) I described above, which uses binomial coefficients to favor fits with equal numbers of positive and negative residuals). Can this produce any estimate of an "absolute GoF" criterion? For example, use this procedure on several data sets and several models and plot the resulting fits and the values of, say, G(n)/n. This could in principle produce a situation where values of G(n)/n below some threshold X represent "reasonable" fits and values above X represent "unreasonable" ones. Of course this is an approximation, but does it make sense in principle?
2. Suppose two models (A and B) are fitted to the same data set using the customized procedure minimizing G(n). Is it then reasonable to compare the fits of these two models by calculating, say, AICc for each, using the data and the predictions from each model, and saying that the one with the lower AICc (say model B) fits somewhat better? I ask because it is in principle possible that, if these same models are fitted to the same data using a different procedure (say minimizing AICc instead of the custom function G(n)), model A (instead of model B) may turn out to have the lower AICc.
If my understanding of your post is correct, I would answer no to both questions.
1) It seems that you would like to use basically the same function for both phase (A) and phase (B). If you do so, the result will never be credible. It is the same as being the reviewer of your own paper. Usually, the function we use in phase (A) reflects only our fitting criteria (and not the goodness of fit). We use this function in an optimization process to get the best-fit parameters for our model (it is like doing our best when presenting a paper). Once this is done, we must turn to phase (B) to evaluate the GoF. This evaluation must be blind and independent (like any good reviewing process).
2) Here, it seems that you would like to do the opposite. In other words, you would like to use a standard GoF (AICc) as a function to be used in the optimization process. I don't think you can do that for the same reason (independence) I described above.
Your point about question 1 seems very reasonable to me. However, I am confused about question 2. There what I intended was to use different functions for different stages: use some custom function (e.g. G(n)) during the optimization to get best-fit model predictions, and then use a standard function (e.g. AICc) to evaluate the GoF of these predictions. Does this make sense?
I was simply misled by your use of the word "instead" in your previous post : "...minimizing AICc instead of the custom function G(n)".
Now, I fully understand what you intend to do and it makes much more sense with perhaps a few words of caution.
Please note that AICc is not an absolute GoF. It provides a ranking between models while including some parsimony criteria so that the best-ranked model is not necessarily the one that fits best the data.
Thank you once again for your interest and for a very useful reference!
I am aware that AICc is useful for comparing models, but not for absolute GoF estimation.
I now actually wonder whether the following may be a very straightforward (but simplistic) way to estimate absolute GoF: simply report the percentage of data points for which the model prediction is more than 1 and more than 2 standard deviations away from the measured value. Along with a plot showing the model fit to the data, this simple summary should give a reasonable idea of what percentage of the data points is fitted poorly. Does this make sense?
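In code form this is just a couple of lines (a sketch assuming NumPy arrays of data, predictions, and standard deviations):

```python
import numpy as np

def residual_exceedance(y_obs, y_pred, sigma):
    """Fraction of points whose model prediction misses the datum by more than
    1 and 2 standard deviations; for Gaussian errors and a correct model one
    would expect roughly 32% and 5%, respectively."""
    z = np.abs(y_obs - y_pred) / sigma
    return {"beyond_1_sd": np.mean(z > 1), "beyond_2_sd": np.mean(z > 2)}
```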
Absolute GoF evaluation is not an easy task. The technique you propose is simplistic indeed and I would not trust it as a GoF index. On the other hand, this technique might be useful as a 'BoF' index that will help evaluate the 'badness' of a fit and reject inadequate models.
Thank you for your useful comment! Indeed, it makes sense that the simplistic method of calculating the percentage of model predictions which are more than 1 standard deviation away from the data points can identify a "bad" fit (where such a percentage would be large), but cannot tell the difference between a "pretty good" and a "very good" fit (in both cases the percentage may be zero).
Igor, my view is that if you construct a test to explore your data, that test must contain some link with the model to be used as a fitting curve. For example, you start from a Lorenz curve with 10 data points (Xi, Li), ordered from high to low values of the variable, and you test it by creating a graph (Xi, Fi) where Fi = ln(Li)/ln(Xi). Behind this lies a model of the Lorenz curve of the form L(X) = X^F(X), and if you obtain a continuous fitting curve for F(X), you only need to differentiate L(X) to obtain the CDF, in mediae of the distribution, with the shape:
CDF(X) = L(X)*[ F(X)/X + ln(X)*F'(X) ]
where F'(X) is the derivative of F(X).
When F(X) = Fc is constant you have a Pareto distribution, so F'(X) = 0 and the CDF becomes
CDF(X) = L(X)*Fc/X, and since L(X) = X^Fc this gives CDF(X) = Fc*X^(Fc-1).
This Paretian expression comes in mediae vs. the cumulative fraction of the population.
I have worked many data sets in this way. I normally measure the fit in mean-absolute-deviation terms, and in most cases the fit is good, even for extreme values and high dispersions.
There are some logical restrictions, because the CDF is very sensitive to the derivative F'(X), and in some cases the CDF may increase, which breaks the decreasing-order premise. But that is another question for another moment.
The advantage is that you may compare several samples and observe their structural functions F(X), L(X) and CDF(X) graphically to support further analysis.
If you have found some useful information or literature references, please share them with me; I am also working on the same problem. Igor Shuryak
While fitting some spectra with a sum of Debye functions, or fitting an autocorrelation function with a sum of decaying exponentials, an additional Debye or exponential term always gives rise to a better fit in terms of R^2 (goodness of fit). However, so many basis functions may not explain the experimental data reasonably, since they may overfit it. I therefore tried to perform a t-test on every fitting parameter to check whether it is reliable; if not, the latest fit expression is discarded. But I don't know whether this makes sense.
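A sketch of that parameter t-test (assuming SciPy's `curve_fit`, which returns the parameter covariance matrix; the model and data arrays are generic placeholders):

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

def parameter_t_test(model, x, y, p0, sigma=None):
    """t statistic and two-sided p-value for each fitted parameter, testing
    whether it differs from zero; dropping terms whose parameters are not
    significant is one pragmatic guard against piling up basis functions."""
    popt, pcov = curve_fit(model, x, y, p0=p0, sigma=sigma)
    se = np.sqrt(np.diag(pcov))          # standard errors from the covariance matrix
    t = popt / se
    dof = len(x) - len(popt)
    p = 2 * stats.t.sf(np.abs(t), dof)
    return popt, se, t, p
```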
Often the nonlinear correlation to be fitted to the data can be 'somehow' linearized as a first stage. The linearized correlation and the corresponding linearized plot are often quite convenient for qualitatively showing the scatter around the trendline and for emphasizing the major effects and the physical meaning of the correlation parameters. Parameter estimates derived by least squares from the linearized correlation can then be refined by iterative nonlinear least-squares regression, to obtain unbiased least-squares estimates for the original correlation. It may be advisable to compare both the linearized and the nonlinear correlations.
Late addition: some have suggested measures based on the model's prediction error and bias. There is even a version for censored time series data, which might fit quite well with the cell-mortality example.
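A simple instance of that two-stage strategy (a sketch assuming NumPy/SciPy and a single-exponential decay as a hypothetical correlation): log-transform to obtain starting values by linear least squares, then refine by nonlinear least squares on the original scale:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical decay data: y = A * exp(-k * t) plus multiplicative noise
t = np.linspace(0, 5, 20)
y = 3.0 * np.exp(-0.8 * t) * (1 + np.random.default_rng(2).normal(0, 0.05, t.size))

# Stage 1: linearize, ln(y) = ln(A) - k*t, and fit by ordinary least squares
slope, intercept = np.polyfit(t, np.log(y), 1)
A0, k0 = np.exp(intercept), -slope               # rough (biased) starting values

# Stage 2: refine by nonlinear least squares on the original scale
decay = lambda t, A, k: A * np.exp(-k * t)
popt, pcov = curve_fit(decay, t, y, p0=[A0, k0])
print("linearized estimate:", A0, k0)
print("refined nonlinear estimate:", popt)
```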
preprint: https://arxiv.org/pdf/1611.03063.pdf
R package: https://rdrr.io/cran/PAmeasures/
This is a concrete implementation related to the point about cross-validation made by @Michael E Young.