I can arrange the data so that everything for each regression is in the same row or column. It would be straightforward to run each regression by hand, but it would also be time-consuming.
R is certainly software that can solve your problem (https://www.r-project.org). It is free, very professional, and very versatile, but not easy to start with. However, the investment in learning R will be very, very rewarding.
Assuming your data are organized in a simple tab-delimited text file with 50,000 rows, where the columns contain the Y-values for a given vector of predictors (these may be simply X-values or, for more complex regression models, group indicators and/or X-values), this is the R code that will perform all the regressions (lines starting with a # are comments):
# read the data into a variable named "tab":
tab = read.delim("name of the textfile.txt", header=FALSE)
# define a vector for the predictor. This vector must contain as many values as "tab" has columns.
# The form of this vector depends on your data/model.
# The simplest form is a simple integer sequence
# (you may adapt this for your model):
x = 1:ncol(tab)
# Now you can run the regression (a "linear model": lm) over all the rows in "tab":
# here the model is a simple straight-line regression, y ~ x.
# You may adapt this for your purpose.
res = apply(tab, MARGIN=1, FUN=function(y) lm(y~x))
# "res" is then a "list" with all the fitted models.
# You can now use further functions on this list to get any
# information you wish. For instance the slopes
# (the slope is the second coefficient in the model fitted here):
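# for example, collect all slopes into one vector
# ("slopes" is just an illustrative name; coef() returns the
# fitted coefficients of a model, element 2 being the slope):
slopes = sapply(res, function(m) coef(m)[2])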
# Or get the summary table for the model fitted to the data in row 200:
summary(res[[200]])
Final note:
If computation time is an issue and not all information about the regressions is required, then the package "limma" (http://www.bioconductor.org/packages/release/bioc/html/limma.html) provides a very fast way to fit hundreds of thousands of such linear models and to get the coefficients, t-values, and p-values.
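For illustration, a minimal sketch of the limma approach (assuming the package is installed from Bioconductor; the design matrix below simply mirrors the straight-line model y ~ x from above, and lmFit/eBayes/topTable are limma's standard fitting and reporting functions):
library(limma)
# design matrix for the straight-line model y ~ x
# (the column names become the coefficient names):
design = cbind(Intercept = 1, x = x)
# fit all 50,000 row-wise models in a single call:
fit = eBayes(lmFit(as.matrix(tab), design))
# table of coefficients, t-values, and p-values for the slope:
topTable(fit, coef = "x", number = Inf)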
I believe packages like SPSS have a scripting language which you can use to automate these types of large analyses. In fact, most modern statistics packages should support this kind of functionality. Failing that, you could figure out the math and just code it yourself, using results from stats software to validate your implementation of the relevant algorithm.
Great answer, Jochen. I've done similar tasks in Python as well (using NumPy broadcasting techniques you can get speeds close to C++).
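The same vectorization idea carries over to R. As a sketch (reusing tab and x as defined in Jochen's answer; the variable names here are only illustrative), the slopes and intercepts of all straight-line fits can be obtained in closed form with a single matrix product instead of 50,000 lm() calls:
Y = as.matrix(tab)
xc = x - mean(x)                          # centered predictor
slopes = as.vector(Y %*% xc) / sum(xc^2)  # all 50,000 slopes at once
intercepts = rowMeans(Y) - slopes * mean(x)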
Clyde, can you specify the format of your input file? (For example: an ASCII text file with 50,000 rows and n columns, each column being series1, series2, id, name, etc.)
Just another suggestion for SPSS users or newcomers: you can also use PSPP, which is free and open source (http://www.gnu.org/software/pspp/) and may make tasks like this even easier.
In addition, you can consider parallel computing with R using the snow or parallel package. You may look at the bootstrap example here: http://homepage.stat.uiowa.edu/~luke/R/cluster/cluster.html and modify it according to your needs (50,000 distinct regressions) in case you have access to a computer or server with a dozen processors.
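A minimal sketch with the parallel package that ships with R (reusing tab and x from Jochen's answer; the worker count here is simply detectCores()):
library(parallel)
cl = makeCluster(detectCores())   # one worker per available core
clusterExport(cl, "x")            # the workers need the predictor vector
res = parApply(cl, tab, MARGIN = 1, FUN = function(y) lm(y ~ x))
stopCluster(cl)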
These are great answers! I confess, I was asking preemptively; my coauthor is still entering the data. However, as for how it's formatted, I'm expecting to see it in a Stata file. It is a very big data set, but we're only looking at a few variables for this regression, so we could cut the data down a lot. There are some great suggestions here, and I look forward to trying them once we have the complete data set. Thank you all!
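(One possible route once the Stata file arrives, as a hedged sketch: the foreign package that ships with R reads .dta files; the file name below is just a placeholder.)
library(foreign)
dat = read.dta("your_data.dta")   # placeholder file name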