Does anyone have experience using R to automatically determine sensible new data for predict.lm?

07 July 2014

I'd like to have or write a function that automatically provides "adjusted predictions" for a linear model. For simplicity I could accept that the models must not have any interactions. By "adjusted predictions" I mean that the predictions for the values of a specific predictor ("target predictor") are calculated with all factors at their reference levels and all metric predictors at their mean values (these can be determined from the model frame).

For this task the function has to find a sensible range for the continuous target predictors. This is relatively simple when the predictors are untransformed. However, I have not found any solution for transformed predictors.

Example:

consider

model = lm(Y ~ A + log(B) + exp(C) + sqrt(D) + poly(E,3))

I will not consider other transformations (like I(.)). The given model is just an example. It could look different, with different variable names and transformations.

To get the adjusted predictions for a target predictor the desired function should be called like

predictions = calcPrediction(model, "C")

to get the fitted values for varying values of the variable "C" while all other variables (A, B, D, E) are held fixed.

My problem is to set up (automatically, given only the model passed as argument to the function) the required data.frame "newdata" for predict.lm. This data.frame must have the colnames "A", "B", "C", ... = the names of the variables. How can I extract the variable names from the model? All the names I can extract contain the transformations/functions.

The second problem is to find an appropriate range for the target predictor. model$model contains the values used by lm, and they are transformed. To determine the range for newdata I need the untransformed values. How can the applied transformation be undone (automatically)?
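One possible route for both problems, sketched under the assumption that the model was fitted with a `data` argument and that this data frame is still reachable when the function runs: `all.vars()` returns the variable names without the transformations, and `get_all_vars()` re-evaluates the bare variables, so nothing has to be back-transformed. The data frame `dat` and all of its values are made up for illustration.

```r
# Hypothetical example data; variable names mirror the model in the question.
set.seed(1)
dat <- data.frame(
  A = factor(rep(c("a", "b"), 25)),
  B = runif(50, 1, 10),
  C = runif(50, 0, 2),
  D = runif(50, 1, 10),
  E = runif(50, 0, 5)
)
dat$Y <- 1 + log(dat$B) + exp(dat$C) + sqrt(dat$D) + dat$E + rnorm(50)
model <- lm(Y ~ A + log(B) + exp(C) + sqrt(D) + poly(E, 3), data = dat)

# The term labels still carry the transformations ...
term_labels <- attr(terms(model), "term.labels")

# ... whereas all.vars() returns the bare predictor names.
pred_names <- all.vars(delete.response(terms(model)))
# "A" "B" "C" "D" "E"

# Re-evaluate the untransformed variables from the data named in the stored call.
# Assumption: model$call$data still resolves to the original data frame.
raw <- get_all_vars(formula(model),
                    data = eval(model$call$data, environment(formula(model))))
range_C <- range(raw$C)  # range of C on its original scale
```

Note that `get_all_vars()` returns every row of the data, so if the model was fitted with `subset` or rows were dropped because of missing values, the range may be wider than the range actually used in the fit.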

I hope this was understandable...

Thanks in advance for any help!

Shane McGee McMahon

Jochen,

Perhaps something like this (see R code below) can solve part of the problem. You can give the transformed data whatever names are convenient. The original data will also be in the model, accessible by model$model$A etc., but apparently not used for anything.

It's a little bit ugly and carries around extra data rather than undoing the transform, so if you have a very large data set it's possible the extra space could be problematic. But it's simple to implement and may be suitable.

I hope this addresses your question, please let me know if I'm missing the point.

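# Untransformed predictors
A = 1:10
B = 1:10
C = 1:10
D = 1:10
E = 1:10

# Transformed copies under new names
A2 = A
B2 = log(B)
C2 = exp(C)
D2 = sqrt(D)
E2 = (poly(E,3))[,1]
E3 = (poly(E,3))[,2]
E4 = (poly(E,3))[,3]

data.1 = data.frame(A,B,C,D,E,A2,B2,C2,D2,E2,E3,E4)

# Response built from the transformed predictors (no noise)
Y = A + 2*log(B) + 3*exp(C) + 4*sqrt(D) + 5*(poly(E,3))[,1] + 6*(poly(E,3))[,2] + 7*(poly(E,3))[,3]

# The 0*A, 0*B, ... terms are meant to carry the untransformed columns along in model$model
model = lm(Y ~ A2 + B2 + C2 + D2 + E2 + E3 + E4 + 0*A + 0*B + 0*C + 0*D + 0*E,data=data.1)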

model

Call:
lm(formula = Y ~ A2 + B2 + C2 + D2 + E2 + E3 + E4 + 0 * A + 0 *
    B + 0 * C + 0 * D + 0 * E, data = data.1)

Coefficients:
A2  B2  C2  D2  E2  E3  E4
 1   2   3   4   5   6   7

model$model$A

[1] 1 2 3 4 5 6 7 8 9 10

model$model

             Y A2        B2           C2       D2          E2          E3         E4  A  B  C  D  E
1     10.63710  1 0.0000000     2.718282 1.000000 -0.49543369  0.52223297 -0.4534252  1  1  1  1  1
2     31.38609  2 0.6931472     7.389056 1.414214 -0.38533732  0.17407766  0.1511417  2  2  2  2  2
3     73.12858  3 1.0986123    20.085537 1.732051 -0.27524094 -0.08703883  0.3778543  3  3  3  3  3
4    178.51731  4 1.3862944    54.598150 2.000000 -0.16514456 -0.26111648  0.3346710  4  4  4  4  4
5    460.94530  5 1.6094379   148.413159 2.236068 -0.05504819 -0.34815531  0.1295501  5  5  5  5  5
6   1226.94732  6 1.7917595   403.428793 2.449490  0.05504819 -0.34815531 -0.1295501  6  6  6  6  6
7   3308.29063  7 1.9459101  1096.633158 2.645751  0.16514456 -0.26111648 -0.3346710  7  7  7  7  7
8   8964.55554  8 2.0794415  2980.957987 2.828427  0.27524094 -0.08703883 -0.3778543  8  8  8  8  8
9  24336.55939  9 2.1972246  8103.083928 3.000000  0.38533732  0.17407766 -0.1511417  9  9  9  9  9
10 66115.43621 10 2.3025851 22026.465795 3.162278  0.49543369  0.52223297  0.4534252 10 10 10 10 10

Jochen Wilhelm

Shane, thank you for the answer. However, either I did not understand your answer or you did not get my problem...

I do have the model (as a result from lm). I do not call lm(). The model I have has the parts $model and $terms (and all other gimmicks of an lm-object).

This is what I have.

Now consider, as an example, that I want to predict the response for the given range of a predictor (let's call it D) by a call of

predict(model, newdata=newDF)

newDF is a data.frame containing the values of all predictors. I want all predictors in the model (except D) to have the reference level or the mean (as in model$model). This is no problem so far. The problem is to set the values for D. If the model were Y ~ ...+D+.... then it would be simple:

get the range of the values in model$model[,"D"] and create a sequence of equally spaced values of a desired length covering this range. That's it.

But when the model was Y~...+log2(D)+... then I could still find the corresponding column in model$model, but it contains the log2 values, whereas in newDF I would need the values on the linear scale. The sequence of values for log2(D) can be obtained as explained above. But eventually in newDF I must give these values on the original scale (D, not log2(D)). So these values need to be back-transformed (here by 2^x). I do not know how to find the inverse function automatically. I could program a large case structure that checks for every possible function and then applies the corresponding inverse transformation. But this is extremely ugly and it won't work with functions that are not known (and, for instance, how would I invert poly(D,3)?).

So I want a smart work-around for this (and it would be an optional plus if it also inverted "unknown" functions). I am not sure if something like this is possible at all. I did not find anything yet. Therefore I am asking.
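One work-around that avoids inversion altogether is to build newDF from the untransformed data instead of from model$model, because predict() re-applies every transformation in the formula (including poly(D,3)) to the raw values it finds in newdata. The following is only a sketch of what the calcPrediction() function named in the question might look like, not a tested solution. It assumes that the model was fitted with a `data` argument that is still reachable, that there are no interactions, that the "reference level" is the first factor level, and that metric predictors are held at the mean of their raw values (which is not the same as back-transforming the mean of the transformed values). The example data and the argument `n` are made up.

```r
# Hypothetical implementation of the calcPrediction() helper from the question.
calcPrediction <- function(model, target, n = 50) {
  # Assumption: the fitting data can be recovered through the stored call.
  raw <- get_all_vars(formula(model),
                      data = eval(model$call$data, environment(formula(model))))
  preds <- all.vars(delete.response(terms(model)))
  stopifnot(target %in% preds, is.numeric(raw[[target]]))

  # One fixed value per predictor: reference level for factors, mean otherwise.
  typical <- lapply(raw[preds], function(x) {
    if (is.character(x)) x <- factor(x)
    if (is.factor(x)) factor(levels(x)[1], levels = levels(x))
    else mean(x, na.rm = TRUE)
  })

  # Equally spaced grid over the observed, untransformed range of the target.
  typical[[target]] <- seq(min(raw[[target]], na.rm = TRUE),
                           max(raw[[target]], na.rm = TRUE),
                           length.out = n)

  newdata <- as.data.frame(typical)  # length-1 entries are recycled to n rows
  cbind(newdata, fit = predict(model, newdata = newdata))
}

# Made-up data to try it out.
set.seed(1)
dat <- data.frame(A = factor(rep(c("a", "b"), 25)),
                  B = runif(50, 1, 10), C = runif(50, 0, 2),
                  D = runif(50, 1, 10), E = runif(50, 0, 5))
dat$Y <- 1 + log(dat$B) + exp(dat$C) + sqrt(dat$D) + dat$E + rnorm(50)
model <- lm(Y ~ A + log(B) + exp(C) + sqrt(D) + poly(E, 3), data = dat)

pred <- calcPrediction(model, "C", n = 20)
```

The grid is equally spaced on the original scale of the target. If equal spacing on the transformed scale is wanted instead, an inverse of the transformation would still be needed.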

Shane McGee McMahon

Sorry, I guess I didn't read your question carefully enough. Hopefully I did a better job understanding your question this time. I don't know of any way to automatically find the inverse transformation. The best approach I can think of is to parse the variable names into commands and find the inverse numerically. A partial implementation follows. What I put together will handle things like exp(A) or log(A), or anything where the variable name is the only argument, but would fail on poly(D,3) as it's written. But with a little bit of work, it should be able to invert any transformation.
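To illustrate the idea of parsing a term and inverting it numerically, here is a minimal independent sketch (it is not the code posted in this thread). It assumes a monotone transformation whose only argument is the variable name, and a search interval that brackets the solution. The function name `invert_term` and its arguments are hypothetical.

```r
# Numerically invert a monotone, single-argument transformation given as a term label.
# invert_term() and its arguments are hypothetical names used for illustration.
invert_term <- function(label, var, y, interval) {
  expr <- parse(text = label)[[1]]                     # "log2(D)" becomes the call log2(D)
  f <- function(x) eval(expr, setNames(list(x), var))  # evaluate the term with var set to x
  vapply(y, function(yi) {
    # Root of f(x) - yi; the interval must bracket the solution.
    uniroot(function(x) f(x) - yi, interval = interval, tol = 1e-10)$root
  }, numeric(1))
}

# Values on the log2 scale, mapped back to the original scale of D.
back <- invert_term("log2(D)", "D", y = c(0, 1, 3), interval = c(0.5, 16))
```

Like the partial implementation described above, this sketch would not handle poly(D,3), which has a second argument and produces several columns.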

If you think it's a worthwhile approach I could flesh it out a little bit more. Have you tried the [email protected] mailing list yet? Somebody there may have a better solution than what I came up with.

Please let me know if I missed the point completely (again).

R code:

Substring.Location

Dave Armstrong

How about something like this?

predDF
