I think you should try cubic splines because of their computational simplicity, smoothness (continuity constraints up to the second derivative), and flexibility. They should help you get a good result.
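Just to illustrate the idea, here is a minimal sketch in Python using scipy.interpolate.CubicSpline (the data are made up for demonstration; note this is an interpolating spline, so each piece is a cubic with continuous first and second derivatives at the knots):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical data; x must be strictly increasing for CubicSpline
x = np.linspace(0.0, 10.0, 11)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=x.size)

# Fit a cubic spline: piecewise cubics, continuous up to the
# second derivative at the knots
spline = CubicSpline(x, y)

# Evaluate the spline and its derivatives on a fine grid
x_fine = np.linspace(0.0, 10.0, 200)
y_fit = spline(x_fine)
slope = spline(x_fine, 1)      # first derivative
curvature = spline(x_fine, 2)  # second derivative
```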
Dear Jabar H. Yousif, it was not clear to me what your goal is in the first place. Is it to choose the model (among several candidate models) with the best predictive ability? This may not correspond to the best fit (e.g., the best R2). If you want the best prediction, you may compute the PRESS (Prediction Residual Sum of Squares) for the several models and choose the one with the smallest PRESS.
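For what it's worth, here is a minimal sketch of how PRESS can be computed for an ordinary least-squares fit (the function name and data layout are my own assumptions). The leave-one-out residuals follow directly from the hat matrix, so no explicit refitting is needed:

```python
import numpy as np

def press(X, y):
    """PRESS for an ordinary least-squares fit of y on X.

    Uses the leave-one-out shortcut: the deleted residual is
    e_i / (1 - h_ii), where h_ii is the i-th leverage (diagonal
    of the hat matrix X (X'X)^{-1} X').
    """
    X = np.column_stack([np.ones(len(y)), X])  # add intercept column
    H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix
    resid = y - H @ y                          # ordinary residuals
    h = np.diag(H)                             # leverages
    return np.sum((resid / (1.0 - h)) ** 2)
```

One would compute this for each candidate model's design matrix and keep the model with the smallest PRESS.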
I don't know the size of your data or what it represents (its origin, the underlying precision, etc.), but obtaining such high R2 and adjusted R2 values (0.9999+) may be due to overfitting and, simultaneously, to very, very low imprecision. Is there an underlying theoretical model that should explain the experimental observations? Are you willing to disclose some more details on your data?
Dear Zeeshan Anwar, I didn't understand the rationale for your sentence "...choose that have less R2". Could you please elaborate?
I think there's no hard-and-fast rule for determining the best order n of a regression model; that is, taking n=1 for linear regression, n=2 for quadratic regression, and n>2 for higher-order polynomial regression. This is especially the case when the nature of your data (i.e., the degree of its non-linearity) is unknown. Basically, there are two issues in fitting curves to data:
1. Obtaining a function that closely approximates the input-output mapping based on the training set, which is usually only a snapshot of all the data available for the domain of interest. Here, the goal is to use an objective/cost function (usually mean squared error for continuous variables) to rank the candidate mapping functions/models. Typically, the model which gives the lowest error on the training set is "considered" the best. However, such a model may suffer from a serious problem commonly referred to as over-fitting! Over-fitting occurs when the estimated model parameters are so finely tuned to the training set (i.e., they give extremely low training error) that the model produces much larger error on unseen/test data for the problem. Alternatively, over-fitting can result from models which have too many free parameters, and therefore enough degrees of freedom to fit the training data perfectly (or almost perfectly). Note that the complexity (degrees of freedom) of a polynomial regression model increases with its order n.
2. To overcome over-fitting, another data set, referred to as the "validation set", is set aside to check that the estimated models do not over-fit. The basic idea is to use the validation set to control the degree of complexity (or freedom) of the model, which is what brings about over-fitting (see the sketch after this list).
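To make points 1 and 2 concrete, here is a minimal Python sketch (with made-up data) showing how training error keeps falling as the polynomial order n grows, while validation error eventually starts rising, which is the signature of over-fitting:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical noisy data from an unknown curve, split into
# training and validation halves
x = np.sort(rng.uniform(0.0, 1.0, 40))
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)
x_tr, y_tr = x[::2], y[::2]      # training half
x_va, y_va = x[1::2], y[1::2]    # validation half

for n in range(1, 9):
    coeffs = np.polyfit(x_tr, y_tr, deg=n)  # fit order-n polynomial
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    print(f"n={n}: train MSE={mse_tr:.4f}, validation MSE={mse_va:.4f}")
```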
Now, to specifically answer your question while bearing the above points in mind. Generally, in regression modelling, one starts with linear regression (i.e., n=1; the simplest and least complex model), then performs a test of significance which estimates how much of the training data the obtained model can explain. A parameter known as the coefficient of determination, r2, is usually used (note that r itself is the correlation coefficient). The error obtained on the validation set is also noted.

Then one does quadratic regression (i.e., n=2; a more complex model than linear regression) and again records r2 and the validation error. If the validation error for the model with order n=2 is smaller than that of the model with order n=1, and the r2 for n=2 is significantly higher than the r2 for n=1, then the model with order n=2 is better than the model with n=1. This same process can be repeated for n=3, 4, etc., until the validation error for the model with order n+1 is greater than the validation error for the model with order n.

Furthermore, you should note the difference in r2 between the model with order n and the model with order n+1. When that difference becomes negligible, and the validation error is no longer improving or is getting worse (i.e., higher compared to the immediately lower-order model), the current model should be a nice model and probably the best. Hope this helps.
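If it helps, the whole step-up procedure can be written down as a short Python sketch (the function name and the exact stopping rule are my own reading of the steps above, not a standard library routine):

```python
import numpy as np

def select_order(x_tr, y_tr, x_va, y_va, max_order=10):
    """Step-up order selection: increase n while the validation
    error keeps dropping, and stop at the first order whose
    validation error is worse than the previous one."""
    best_n, best_mse = None, np.inf
    for n in range(1, max_order + 1):
        coeffs = np.polyfit(x_tr, y_tr, deg=n)
        mse_va = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
        # training r2, for comparing successive orders
        ss_res = np.sum((y_tr - np.polyval(coeffs, x_tr)) ** 2)
        ss_tot = np.sum((y_tr - y_tr.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        print(f"n={n}: validation MSE={mse_va:.4f}, training r2={r2:.4f}")
        if mse_va >= best_mse:
            break  # validation error got worse: stop searching
        best_n, best_mse = n, mse_va
    return best_n
```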