R programming language

I am wondering whether it is appropriate to draw two different random samples from one large database and run a separate logistic regression on each, both addressing the same question. The main reason is that my computer is not powerful enough, so I cannot run my own multi-matching function, which binarizes the whole data set into 0 and 1 (follow / not follow), on the full data.

The database is a data.frame with 1,500,000 observations and 54 variables. The dependent variable (DV) records which of two presidential candidates a user follows (1 or 0), and the independent variables (IVs) record whether the user follows particular media outlets on Twitter (also 1 or 0). The aim is to show the association between media and political agenda and to assess the predictive power of particular media outlets.

Unfortunately, computing time forces me to sample the data. I therefore plan to draw two random samples (2 x 100k records), fit the regression on the first, and then use the second to confirm the results. Is this methodologically and statistically sound? Thank you in advance.
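
To make the plan concrete, here is a minimal sketch in base R of what I have in mind. It is only an illustration under stated assumptions: the data are simulated, and the names `tweets`, `candidate`, and `media1` to `media5` are hypothetical stand-ins for my real data.frame and its columns. The sizes are scaled down so that it runs quickly; with the real data, `n_sample` would be 100000.

```r
set.seed(42)

# Simulated stand-in for the real data (hypothetical names and sizes).
n_total <- 20000   # the real data.frame has about 1,500,000 rows
n_media <- 5       # the real data has many more media variables
X <- matrix(rbinom(n_total * n_media, 1, 0.3), ncol = n_media)
colnames(X) <- paste0("media", seq_len(n_media))
true_beta <- c(1.0, -0.8, 0.5, 0, 0)   # made-up effects, for simulation only
candidate <- rbinom(n_total, 1, plogis(-0.2 + as.vector(X %*% true_beta)))
tweets <- data.frame(candidate = candidate, X)

# Draw both samples in one call, without replacement, so they cannot overlap.
n_sample <- 5000   # would be 100000 with the real data
idx <- sample(nrow(tweets), 2 * n_sample)
train   <- tweets[idx[1:n_sample], ]
holdout <- tweets[idx[(n_sample + 1):(2 * n_sample)], ]

# Fit the logistic regression on the first sample.
fit <- glm(candidate ~ ., data = train, family = binomial)

# Check predictive power on the second sample, which the model has not seen.
pred <- predict(fit, newdata = holdout, type = "response")
accuracy <- mean((pred > 0.5) == holdout$candidate)

# Refit on the second sample and compare the coefficients side by side.
fit2 <- glm(candidate ~ ., data = holdout, family = binomial)
coef_compare <- cbind(sample1 = coef(fit), sample2 = coef(fit2))

print(accuracy)
print(round(coef_compare, 3))
```

Drawing both index sets from a single `sample()` call keeps the two samples disjoint, which is what I would want if the second sample is supposed to confirm the first. The 0.5 cut-off for the accuracy check is an arbitrary choice made for this sketch.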
