Factor analysis with missing values. Different methods give different results?

Agner Fog @Agner_Fog

08 August 2019

I have tried different methods for making a factor analysis with missing values, and I get very different results.

Here is a comparison of a two-factor analysis for a 73x40 data set with 43% missing values, using four different methods:

Cumulative variance for two factors, by method:

Method   Factor 1   Factors 1+2
A        0.285      0.408
B        0.425      0.591
C        0.193      0.258
D        0.414      0.636

Method A: Missing values are replaced by the mean.

Method B: Based on covariance matrix computed by CRAN package 'norm'.

Method C: Based on covariance matrix computed by CRAN package 'norm2'.

Method D: Function umxEFA in CRAN package umx.

The software packages in B, C, and D are all designed to analyze data with missing values by means of structural equation modeling.

(The functions in B and D needed minor modifications to overcome limits on the size of the data set, but nothing that changed the algorithm.)

Other methods:

E: Random imputation: This gives non-reproducible results with excessive weight on variables with few missing values.

F: Multiple imputation with an auxiliary variable (Hot deck method). Missing values are replaced by values from another observation with the same value of the auxiliary variable. This method is useful, but I suspect that subsequent correlation tests will be invalid because the auxiliary variable must be something that is assumed a priori to correlate with everything.
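For concreteness, the hot deck idea in F can be sketched roughly as follows. This is only a minimal Python illustration, not the code used for the results above. The function names, the toy scores, and the region labels are all hypothetical.

```python
import random


def hot_deck_impute(values, aux, rng):
    """Fill each None in `values` with an observed value from a randomly
    chosen donor that has the same auxiliary value.

    Entries whose auxiliary group has no observed donor stay None.
    """
    # Collect the observed values (donors) for each auxiliary group.
    donors = {}
    for v, a in zip(values, aux):
        if v is not None:
            donors.setdefault(a, []).append(v)

    result = []
    for v, a in zip(values, aux):
        if v is None and a in donors:
            result.append(rng.choice(donors[a]))
        else:
            result.append(v)
    return result


def multiple_hot_deck(values, aux, m=5, seed=0):
    """Return m completed copies of `values`, one per random draw."""
    rng = random.Random(seed)
    return [hot_deck_impute(values, aux, rng) for _ in range(m)]


if __name__ == "__main__":
    # Hypothetical toy data: one survey variable for six countries,
    # with a made-up regional grouping as the auxiliary variable.
    scores = [0.8, None, 0.6, 0.1, None, 0.3]
    region = ["north", "north", "north", "south", "south", "south"]
    for completed in multiple_hot_deck(scores, region, m=3):
        print(completed)
```

Because every imputed value is copied from a donor in the same auxiliary group, the completed data inherit the group structure of the auxiliary variable, which is the source of my concern about later correlation tests.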

I would appreciate an explanation of why the results are so different, and how to get useful and valid results.

FYI: My data set consists of data from different cross-cultural surveys. Each survey reports various cultural variables for a number of different countries. Each survey covers a different subset of countries, and no survey covers all the countries. This is the reason for the missing values.

See also the answers to https://www.researchgate.net/post/Any_suggestions_on_missing_values_in_factor_analysis

David Eugene Booth

Of course there are different methods, because different analyses have different requirements. D. Booth

David Morse

Hello Agner,

As David Eugene Booth's pithy reply suggests, the methods are different, so results are different as well. In brief:

0. 43% is a lot of missing data, so method differences will be more notable when looking at results.

Method A. Mean substitution will tend to attenuate correlations, so the lower "common variance" accounted for isn't surprising. Of all methods, this is probably the one mentioned most often in text discussions, easiest to program (and so is likely to show up as an option in statistical software), but least satisfying from a number of considerations.

Methods B, C, D: You'd have to look at the algorithm used by each to figure out the difference. Asserting that all three are based on SEM methods doesn't pin down the specific missing data strategies.

Method E: Yes, random imputation will be the least replicable; you'd expect that.

Method F: To the extent that you choose auxiliary variable(s) that do correlate with others, you'll tend to bias correlation estimates upwards. To the extent that the auxiliary variable(s) do not correlate with others, a downward bias would be expected.
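The attenuation under mean substitution (Method A) and the run-to-run variation under random imputation (Method E) can be seen in a small simulation. The sketch below is an illustration in plain Python under assumed conditions, not a reanalysis of your data: two standard normal variables with a chosen correlation `rho`, values deleted completely at random at the same rate in both, and helper names that are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of floats."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def make_missing(values, rate, rng):
    """Set each value to None with probability `rate` (completely at random)."""
    return [None if rng.random() < rate else v for v in values]


def impute_mean(values):
    """Method A: replace missing entries by the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_random(values, rng):
    """Method E: replace missing entries by randomly drawn observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def simulate(n=5000, rho=0.6, rate=0.43, seed=1):
    """Return (complete-data r, mean-imputed r, list of random-imputed r)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    x_miss = make_missing(x, rate, rng)
    y_miss = make_missing(y, rate, rng)

    r_full = pearson(x, y)
    r_mean = pearson(impute_mean(x_miss), impute_mean(y_miss))
    # Repeat the random imputation with three different seeds.
    r_rand = []
    for s in range(3):
        draw = random.Random(s)
        r_rand.append(pearson(impute_random(x_miss, draw),
                              impute_random(y_miss, draw)))
    return r_full, r_mean, r_rand


if __name__ == "__main__":
    r_full, r_mean, r_rand = simulate()
    print("complete data:     ", round(r_full, 3))
    print("mean substitution: ", round(r_mean, 3))
    print("random imputation: ", [round(r, 3) for r in r_rand])
```

Under these assumptions, with independent missingness at rate p in both variables, the mean-imputed correlation is roughly (1 - p) times the complete-data correlation. At p = 0.43 that is a substantial shrinkage. The random-imputation values change with each seed.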

Good luck with your work!

Kelvyn Jones

I find this an interesting question! It seems to me that the answer is likely to be not just technical but also substantive, in that there may be different structures in different countries and that the missingness is non-ignorable and informative. As I understand it, you are forced to drop an entire country, and the methods are all ignoring the levels in the data.

It would be interesting to see what happens with a multilevel factor analysis with responses nested within individuals (a multivariate model) and the individuals nested in countries.

E.g., Mplus software capability: https://www.statmodel.com/usersguide/chapter9.shtml

and MLwiN software

http://www.bristol.ac.uk/cmm/software/mlwin/mlwin-resources.html#multipleresp

Harvey Goldstein and co-workers have been working on a very general multiple imputation procedure that exploits this sort of model in a Bayesian framework; you can add in auxiliary variables:

Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms

https://www.semanticscholar.org/paper/Fitting-multilevel-multivariate-models-with-missing-Goldstein-Carpenter/00c069769c659f89c72faaa0225f4ab163d45637

and associated software

https://www.bristol.ac.uk/cmm/media/software/statjr/downloads/manuals/missing_data/v2/missing-data-with-statjr.pdf

