Hello Vinayak! I agree with Mario that all commercial statistical software packages are good options for statistical analysis in general, from basic statistics to more complex models. What I would also comment on is the programming skill required and the ease of obtaining the software. Of the recommended statistical software, R is free, and there is a lot of supporting material for learning the programming language. So, as another colleague said, in regard to statistical software: "one letter: R!"
There is never something like a best tool in general. Each tool has its niche. The better you specify your particular needs, abilities, possibilities, and problems, the more useful will (can) be the answers you get.*
---
*In my opinion there has not been a single really useful answer yet (so far the bottom line is that any software that does regression analysis is software you could use for regression analysis... this does not address the question of which one would be good, or even the best, and why). To make this clear: the answers are OK, but the question was far too unspecific to get any useful answer.
Vinayak, I agree with Jochen that the answer is "it depends." Some packages give you exquisite control over the analysis which is great for a sophisticated user (e.g., R, SAS). Others do the basics simply but don't include advanced features or they have to be done the hard way (e.g., Excel). Some have a nice middle ground where the defaults help to prevent someone from doing something terribly wrong (e.g., I like JMP's automatic centering of predictors in interactions and its use of effect coding of categorical predictors).
You also have to consider whether your regression needs the flexibility to seamlessly include transformations, polynomials, categorical predictors, and splines (again, JMP works well here given its menu orientation that includes these features).
Finally, there's cost. R is free and powerful, but it's easy to do something wrong if you're a casual regression user. Software like SPSS is getting really pricey. Some software has nice licenses for academics (a form of SAS is available for free, and JMP has a $50/year license). I'm not sure whether these same licenses are available to international users.
For me, I use JMP for everyday analysis and teaching and R for the sophisticated stuff like nonlinear regression and multilevel logistic regression, so even I don't use just one package for all regressions.
With MS Excel one can fit up to a sixth-degree polynomial equation. It gives the goodness-of-fit value (R²). However, the F-test requires some additional calculations, which are cumbersome but not very difficult.
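For anyone doing that F-test by hand, the overall F statistic follows directly from R². Here is a minimal Python sketch (made-up data, simple one-predictor case; the function name is my own) of the calculation that Excel leaves to the user:

```python
# Hedged sketch: computing R^2 and the overall F statistic for a
# one-predictor least-squares fit. Data are invented for illustration.

def ols_r2_and_f(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    ssr = slope ** 2 * sxx                     # regression sum of squares
    sst = sum((yi - my) ** 2 for yi in y)      # total sum of squares
    r2 = ssr / sst
    k = 1                                      # one predictor
    f = (r2 / k) / ((1 - r2) / (n - k - 1))    # overall F statistic
    return r2, f

r2, f = ols_r2_and_f([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r2, 3), round(f, 3))  # 0.6 4.5
```

For a degree-k polynomial fit, the same formula applies with k numerator degrees of freedom and n - k - 1 denominator degrees of freedom.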
Hi Sandip! Thank you for sharing these excellent sources! I have bookmarked all!
I hope that this thread will continue to attract comments; I am definitely interested in learning about each software's properties because, for example, I would have no problem recommending Excel for learning regression analysis, but I would be cautious about using it for "serious" work in which accuracy matters. Thanks again for the links!
SAS offers many general linear models (GLMs), but you must study how to use them.
JMP supports many of SAS's GLMs. You can start using it within ten minutes; try and evaluate the free trial version. There are several points to consider when choosing software:
1) Cost
2) Ease of use
3) Variety of methods
If you choose the best one, it improves your intellectual productivity. Many researchers choose a bad option and struggle with low productivity.
I agree with Michael and partly with Patricia. We must check software from the viewpoint of whether it improves our intellectual productivity: how many papers can we produce with the software? This is a clear standard for choosing software. See my Springer book entitled "New Theory of Discriminant Analysis after R. Fisher." You can understand what JMP can do in data analysis, including microarray data.
"The best software must improve our intellectual productivity" - that's great. I never really thought about that aspect. So true for academics, although I think that industry might have a different focus.
I agree with all the comments above; it depends on your needs, and each software has its own pros and cons, but from my point of view R is the best one for numerical analysis. It has an open-source license, which means that anyone can download, modify and improve the code.
I think it depends on many factors. For instance, if you like small but powerful applications, then Eviews is better than SPSS. I suggest you use SPSS if you need its AMOS feature. Good luck!
Unfortunately, for a more serious analysis users need specialized software such as SPSS, Eviews or Stata. These programs are expensive and not everyone can afford them. Luckily, there are plenty of free alternatives that can be used instead.
The Best Free and Open Source Software for Statistical Analysis
1) The R Project for Statistical Computing
R is by far the most widely used free statistical environment. It can be used for many different types of analysis. It has a large community, and numerous packages are developed for it. Learning it will require a bit of programming knowledge, but there are plenty of tutorials and online courses available for that purpose. This is something we definitely recommend you learn, because it is slowly becoming the standard in many professional data analytics communities.
2) PSPP
This is a free alternative for SPSS, and a pretty mature project which can be used for regression analysis, non-parametric tests, T-tests, cluster analysis and much more. It supports over 1 billion variables.
3) Gretl
Gretl is a free alternative to Eviews. It can be used for a wide range of econometric analyses, data series and regression.
4) MicrOsiris
MicrOsiris is a lightweight freeware for performing all sorts of data analysis.
5) RegressIt
A completely free add-in for Excel, RegressIt can be used for multivariate descriptive data analysis and multiple linear regression analysis.
6) MacAnova
MacAnova is developed at the University of Minnesota and can be used for statistical analysis and matrix algebra.
7) GNU Octave
This tool presents an excellent alternative to Matlab. Not only can it be used for multiple numerical computations, it also has great data visualization capabilities.
8) DMelt
This tool is a successor to a couple of other statistical environments. It can be used for mathematical analysis, visualization and much more.
Mahesh Kumar, have you ever tried to fit a multiple regression model with interactions, or a model with categorical predictors in Excel? Or have you ever made residual diagnostic plots in Excel?
Regression covers a very broad set of techniques, some of which are not available for the software listed (unless you write your own functions). It would be useful if the questioner says what type of regression is being asked about, or is it for any regression? If you just want a least squares regression with simple diagnostic plots and a couple of variables and a linear model, any should do so it is whichever you (or your co-authors) are used to using. If it is some new technique only discussed in one paper, you'll need to see what those authors used or wrote.
If you compare statistical software, SAS supports more regression methods than other software, but it is expensive and somewhat difficult for users.
Thus, I wrote four guidebooks in Japanese; I do not know of an English one.
JMP is supported by the JMP division of SAS.
It is the easiest software, and the cost is low.
First, check it out with the free trial version.
I wrote two guidebooks, one of which is a best seller.
By 2015, I was the first to succeed in cancer gene analysis using JMP and LINGO, supported by LINDO Systems Inc. I developed three optimal LDFs and three SVMs with LINGO. You can download many articles from RG.
For me, the aforementioned applications are good. I like Eviews because this application is relatively small, which means you can run Eviews smoothly on a relatively slow laptop.
Eviews combines a GUI and coding for your statistical analysis. Eviews also provides varied options of statistical tests for multicollinearity, serial correlation and so on.
There are a lot of similar questions about good software, and the answer differs according to the situation of the user. In terms of superior functionality, it is SAS. For ease of operation, JMP is easy to use and implements many of SAS's functions. Cost depends on each person's position.
I first introduced SAS to Japan and sold it to 32 pharmaceutical manufacturers, where the companies purchased SAS. I then moved to a university and taught students in the Faculty of Economics. Because a colleague claimed SAS was too expensive, the university changed to SPSS, despite my recommendation of SAS as the standard. Although I purchased SPSS/Windows for my own research at first, paying the update cost was a burden, so I bought JMP in 2010 and I am still using it. Recently, SAS has lent licenses to university researchers free of charge. However, because of its huge functionality, some of my colleagues did not use it even though I lent them the SAS reference book I wrote. That is, the learning time needed to master software differs among individuals.
Many students became familiar with JMP and LINGO through my books. A Japanese industry analysis by DEA was published on Amazon; it is free for unlimited users, and the list price is $2.99. I introduce the important points on RG.
I usually use Stata to run regression analysis. I prefer the output generated by Stata to that of most other software. Anyway, other software such as SPSS, SAS, Excel and others generate it too. It depends on your field of study and your preference. Try Stata if you are studying economics. Best regards.
Based on my experience, I think SAS is the best software for regression analysis and many other data analyses, offering many advanced, up-to-date approaches.
For example, Eviews is smaller than SPSS, so Eviews is better if you have an old laptop. In my experience, smaller applications like Eviews and Stata offer more statistical test options than IBM SPSS. But SPSS offers you the AMOS architecture. Hope this helps!
You should try RegressIt, a free Excel add-in which offers very high quality interactive output in Excel for both linear and logistic regression and many novel features to support quality work in both teaching and applications. It now includes an interface with R that allows you to perform very sophisticated regression analysis in R from a menu interface in Excel, with output in both RStudio and Excel. DON'T use Excel's own regression tool in the Analysis Toolpak, which is a legacy from 1995 (and it wasn't good for that time either).
I am actually looking for the answer to that same question, but what I figured out was that MS Excel can take care of bivariate statistics, while SPSS is very good software for multivariate data.
Did you mean multivariate regression or multiple regression (i.e., multiple dependent or multiple independent variables; it could be both)? I have not used SPSS for a very long time, so I don't know.
Whatever software you use, I suggest that you see how it handles regression weights. In SAS PROC REG, for example, you can enter a formula for the regression weight, such as w = 1/x for the model-based classical ratio estimator. Please see https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression, and its updates, and https://www.researchgate.net/publication/333642828_Estimating_the_Coefficient_of_Heteroscedasticity, and https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity for regressions of the form
y = y* + (e_0)(z^gamma),
for finite populations.
(You may want to see how it handles autocorrelation as well.)
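For readers without SAS at hand, here is a minimal Python sketch (toy data; the function name is my own) of what the w = 1/x regression weight does in a no-intercept model: the weighted least squares slope collapses to the classical ratio estimator sum(y)/sum(x).

```python
# Hedged sketch: the effect of the regression weight w = 1/x in a
# no-intercept (ratio) model. Data are invented for illustration.

def wls_slope_no_intercept(x, y, w):
    # minimize sum_i w_i * (y_i - b*x_i)^2  =>  b = sum(w*x*y) / sum(w*x^2)
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

x = [2.0, 4.0, 5.0, 9.0]
y = [3.0, 7.0, 11.0, 19.0]
w = [1.0 / xi for xi in x]          # the w = 1/x weight discussed above
b = wls_slope_no_intercept(x, y, w)
print(abs(b - sum(y) / sum(x)) < 1e-12)  # True: classical ratio estimator
```

With equal weights instead of 1/x, the no-intercept slope would be sum(xy)/sum(x²), a different estimator, which is why the choice of weight matters.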
In the old days I used the very impressive TableCurve 2D or 3D, which allowed one to try some thousands of equations to find the best interpolating curve or surface. However, in this manner the physical view of the problem was lost, having to consider very strange equations even if they correlated the experimental data very well.
Now I'm using the Matlab Curve Fitting Toolbox, which is very satisfying to me, as it allows full control of the form of the correlating equation in real time.
James R Knaub Of course you should use weights, but they should be scientifically meaningful. Whenever the data are measurements of physical properties, (measurement error)^-2 should be used. Many programs don't even offer the possibility to enter them. With FittingKVdm you can't even start without them, since there is no such thing as a measurement with zero error.
FittingKVdm uses the errors on both the y and the x values to calculate the weights. And also the x and y residuals, which is unique, I think!
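A minimal Python sketch of the inverse-variance scheme described above (w_i = 1/sigma_i², with sigma_i the measurement error of y_i; errors in x, which FittingKVdm also handles, are not modeled here, and the data and function name are invented):

```python
# Hedged sketch: weighted least squares with inverse-variance weights,
# w_i = 1 / sigma_i^2, where sigma_i is the measurement error of y_i.

def wls_fit(x, y, sigma):
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw      # weighted means
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = num / den
    return slope, ybar - slope * xbar

# Points lying exactly on y = 1 + 2x, with different measurement errors:
slope, intercept = wls_fit([1, 2, 3], [3, 5, 7], [0.5, 1.0, 2.0])
print(round(slope, 9), round(intercept, 9))  # 2.0 1.0
```

Because these points lie exactly on the line, any weights recover slope 2 and intercept 1; with scattered data, the small-sigma (precisely measured) points dominate the fit.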
James R Knaub I didn't read it all in detail, sorry (and btw your first link doesn't work), but I think that is what I'm trying to say: in regression, you need to use realistic error estimations for the measurements, since they have a serious influence on the result. I would appreciate very much your comments on my software! It can be tried here: www.lerenisplezant.be/fitting.htm.
James R Knaub: In the article you mention, weighted least squares regression gives Progeny = 0.12796 + 0.2048 Parent for Galton's peas.
Symmetrical (multidirectional) regression gives Progeny = 0.12563 + 0.21548 Parent. This is the most precise because you get the same result if you switch the 'dependent' and 'independent' variable.
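FittingKVdm's exact algorithm isn't given here, but the swap-invariance property being claimed can be illustrated with geometric-mean (reduced major axis) regression, one well-known symmetric method; a Python sketch on made-up data:

```python
import math

def gmr(x, y):
    # Geometric-mean (reduced major axis) regression: a symmetric
    # alternative to OLS. NOT necessarily the poster's algorithm.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    return slope, my - slope * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b_yx, a_yx = gmr(x, y)      # fit y = a + b*x
b_xy, a_xy = gmr(y, x)      # fit x = a' + b'*y
# Swapping the variables inverts the SAME line: b' = 1/b and a' = -a/b.
print(abs(b_xy - 1 / b_yx) < 1e-12, abs(a_xy + a_yx / b_yx) < 1e-12)  # True True
```

Ordinary least squares does not have this property: the OLS line of y on x and the OLS line of x on y generally differ whenever the correlation is less than perfect.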
I have seen that before. I suppose you might want to do that for some applications in physical science, but generally in regression, y is the random variable, and based on the predicted-y, that is, the model, used as a size measure, the data will tell you the regression weight, though you might use a default for various kinds of applications if you do not have enough good data for that.
With y as the random variable, usually we want one or more x's to use with a sample of y, to "predict" (estimate) for unknown y. That is the usual problem.
James R Knaub Where could you have seen this before? I haven't seen any other software doing this. And what is 'random' about y???
The error flags are quite big. I suspect the SD in the dataset is that of the population, so I guess we should divide by sqrt(N_i - 1), with N_i the number of children from parent i, to obtain the measurement error? But those N_i are not given.
y vs x and x vs y I've seen at least three times before. Once in a book by Deming circa 1940. Another in a book by Ken Brewer, 2002.
The reason y is a random variable is because you can write
y = (predicted-y) + e.
You can look at errors-in-variables, but that isn't completely the same thing. Regressions often have too many, or too few, or just not the right independent variables. This impact can be spread around by impacting coefficients, an intercept, and the estimated residuals.
Many people may think that the estimated residual, e (estimated from a given model with estimated coefficients, not the actual relationship where you'd use an epsilon), is a measurement error. It isn't really, although measurement error may be a substantial influence. The estimated residual is the difference between the regression model "predicted" (i.e., estimated) value for y, and the observed or collected value of y. e is random and you have model-unbiasedness if the expected sum of e is zero. For weighted least squares regression, we have e factored into a random factor and a nonrandom, systematic factor. (The regression weight comes from this systematic factor.) Since e is added to the predicted-y to make y, y is a "random variable."
I can see using "errors-in-variables" regression for some experiments, but remember that you could be missing some variables, such as temperature, when doing "calibration" in analytical chemistry, for example. ("Calibration" in survey statistics has a very different meaning, by the way.) Omitted variables may often be a problem, but so is throwing in extraneous variables.
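The decomposition above is easy to see numerically; a minimal Python sketch (invented data) showing y_i = predicted_i + e_i and that, with an intercept in the model, the estimated residuals sum to (numerically) zero:

```python
# Hedged sketch: OLS fit illustrating y = (predicted-y) + e, and the
# model-unbiasedness property that the residuals e sum to zero when an
# intercept is included. Data are invented for illustration.

def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b           # intercept, slope

x = [1.0, 2.0, 3.0, 4.0]
y = [2.3, 2.9, 4.2, 4.6]
a, b = ols_fit(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
# Each observation decomposes exactly as y_i = predicted_i + e_i:
print(abs(sum(resid)) < 1e-9)  # True
```

Note this zero-sum property concerns the estimated residuals of the fitted model, not the measurement errors; as the post above explains, the two are not the same thing.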
James R Knaub No, Deming's algorithm is not the same as mine. Please have a closer look at my program. It's very simple to use, no hard learning curve!
I think you are making things unnecessarily complicated. If there is something systematic in the residuals it simply means there is another independent variable in the game, OR the model is wrong, isn't it?
If you don't like the Penn State and my references that I gave to you, then you could look up "weighted least squares model."
To understand regression better, you might look up "random variable."
Also, here are some books that might help:
Maddala, G.S. (2001), Introduction to Econometrics, 3rd ed.
Kutner, Nachtsheim, and Neter (2004), Applied Linear Regression Models, McGraw-Hill.
Fox, J. (2008), Applied Regression Analysis and Generalized Linear Models, 2nd ed., Sage.
Carroll and Ruppert (1988), Transformation and Weighting in Regression, Chapman & Hall, London, UK.
Draper, N.R. and Smith, H. (1998), Applied Regression Analysis, 3rd ed., Wiley.
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. (corrected at 7th printing 2013), Springer.
Särndal, C.-E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag.
Brewer, K.R.W. (2002), Combined Survey Sampling Inference: Weighing Basu's Elephants, Arnold, London, and Oxford University Press.
Chambers, R. and Clark, R. (2012), An Introduction to Model-Based Survey Sampling with Applications, Oxford Statistical Science Series.
Valliant, R., Dorfman, A.H., and Royall, R.M. (2000), Finite Population Sampling and Inference: A Prediction Approach.
James R Knaub Believe me, I have a lot of experience with real-life data from different branches of science, and I am certain that many number crunchers make things WAY too complicated. I would appreciate it if you could just take a quick look at my software instead of pushing me to read a whole library.
Everyone: SAS PROC REG accommodates regression where you can enter a regression weight (formula) w. You might look for that or similar software documentation. By the way, last I checked, when using a regression weight, the estimated residuals that SAS PROC REG outputs aren't the raw estimated residuals. I think they were the random factors of the estimated residuals, which probably do not have expected total zero.