I need a way to run 50,000 distinct regressions using data from a large sample. What software exists for this?

Jochen Wilhelm

R is certainly software that can solve your problem (www.r-project.org). It is free, very professional, and very versatile, but not easy to get started with. However, the investment in learning R will be very rewarding.

Suppose your data are organized in a simple tab-delimited text file with 50000 rows, where the columns contain the Y-values for a given vector of predictors (these may simply be X-values or, for more complex regression models, group indicators and/or X-values). Then the following R code will perform all the regressions (lines starting with # are comments):

# Read the data into a variable named "tab":
tab = read.delim("name of the textfile.txt", header=FALSE)

# Define a vector for the predictor. This vector must contain as many
# values as "tab" has columns. The form of this vector depends on your
# data/model. The simplest form is a simple integer sequence
# (you may adapt this for your model):
x = 1:ncol(tab)

# Now you can run the regression (a "linear model": lm) over all the
# rows in "tab". Here the model is a simple straight-line regression,
# y ~ x. You may adapt this for your purpose.
res = apply(tab, MARGIN=1, FUN=function(y) lm(y~x))

# "res" is then a "list" with all the fitted models.
# You can now use further functions on this list to get any
# information you wish. For instance the slopes
# (the slope is the second coefficient in the model fitted here):
slopes = sapply(res, function(model) coef(model)[2])

# Plot a histogram of these slopes:
hist(slopes)

# Get the summary table for the model fitted to the data in row 200:
summary(res[[200]])

Final note:

If calculation time is an issue and not all information about the regressions is required, then the package "limma" (http://www.bioconductor.org/packages/release/bioc/html/limma.html) provides a very fast way to fit hundreds of thousands of such linear models and get the coefficients, t-values, and p-values.

Paul D Barrows

I believe packages like SPSS have a scripting language which you can use to automate these types of large analyses. In fact, most modern statistics packages should support this kind of functionality. Failing that, you could figure out the math and just code it yourself, using results from stats software to validate your implementation of the relevant algorithm.
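As a rough illustration of "figure out the math and code it yourself", here is a minimal Python sketch that solves the normal equations for ordinary least squares and checks the result against a library routine. The function name `ols_coefficients` and the simulated data are hypothetical, and `np.linalg.lstsq` merely stands in for whatever statistics software you would use for validation.

```python
import numpy as np

def ols_coefficients(X, y):
    """Solve the normal equations (X'X) b = X'y for the OLS coefficients b.

    X is the design matrix (include a column of ones for the intercept),
    y is the response vector.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical example data: intercept plus two predictors, with noise.
rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

beta = ols_coefficients(X, y)

# Validate the hand-written solution against a library least-squares solver.
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta, beta_ref)
```

Solving the normal equations directly is fine for small, well-conditioned designs like this one; for nearly collinear predictors a QR- or SVD-based solver is the safer choice.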

Pedro Correia

Great answer, Jochen. I've done similar tasks in Python as well (with NumPy broadcasting techniques you get speeds close to C++).
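To give an idea of what the NumPy approach might look like, here is a minimal sketch that fits a straight line y ~ x to every row of a matrix at once, without a Python-level loop. It assumes the same layout as in Jochen's answer (one series per row, one shared predictor vector); the function name `fit_lines_rowwise` and the simulated data are hypothetical, and only intercepts and slopes are computed, not standard errors or p-values.

```python
import numpy as np

def fit_lines_rowwise(Y, x):
    """Least-squares fit of y ~ x for every row of Y in one vectorized pass.

    Y: array of shape (n_series, n_points), one series per row.
    x: array of shape (n_points,), the predictor shared by all rows.
    Returns (intercepts, slopes), each of shape (n_series,).
    """
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    x_centered = x - x.mean()
    # Broadcasting subtracts each row's own mean from that row.
    Y_centered = Y - Y.mean(axis=1, keepdims=True)
    slopes = Y_centered @ x_centered / (x_centered @ x_centered)
    intercepts = Y.mean(axis=1) - slopes * x.mean()
    return intercepts, slopes

# Hypothetical data: 50,000 series of 12 points, each with its own slope.
rng = np.random.default_rng(1)
n_series, n_points = 50_000, 12
x = np.arange(1, n_points + 1)
true_slopes = rng.normal(size=n_series)
Y = true_slopes[:, None] * x + rng.normal(scale=0.01, size=(n_series, n_points))

intercepts, slopes = fit_lines_rowwise(Y, x)
```

Because all rows share the same predictor, the slope formula reduces to a single matrix-vector product, which is where the speed comes from.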

Clyde, can you specify the format of your input file? (For example: an ASCII text file with 50 000 rows and "n" columns, with the columns being series1, series2, id, name, etc.)

Just another suggestion, for SPSS users or newcomers: you can also use PSPP, which is free and open source (http://www.gnu.org/software/pspp/) and may make tasks like this even easier.
