What is the common framework for filling missing data?

Tina Kretschmer

Probably not. The main options are multiple imputation or an estimation method that "borrows" existing information to fill in missing values (e.g., full information maximum likelihood estimation, FIML). So far FIML is available at least in Mplus and Stata. In my opinion it's the more straightforward option, but if you have a large amount of missing data I would compare results from imputed data with those from FIML. I'm sorry I can't help you regarding R...

Patrick S Malone

Adrian,

I agree with Tina. Multiple imputation and raw-data methods such as FIML are the two major defensible ways to handle data missing at random (the compromise assumption most often used between plausibility and practicality, at least in my fields).

FIML has the advantage of usually being trivially easy to implement in almost any SEM software, including lavaan in R (and likely in other SEM R packages, but I'm not an expert in R).

MI has the advantage that all variables in the imputation dataset contribute to the MAR assumption, rather than just those in the model. Mplus gets around that with the AUXILIARY option, but I don't think lavaan has that facility yet.

If the variables contributing to MAR are the same, the two methods are asymptotically equivalent.

Pat

Hosein Jafarkarimi

If your missing data exceeds 15%, you need to remove the observation. If your missing data is less than 5%, you may use mean replacement. If it is more than 5%, you can first determine the demographic profile of the respondents with missing data and then calculate the mean for the sample subgroup representing the identified demographic profile.

If your question with missing data is associated with a construct with multiple items, you can use the average of the responses to all the items associated with the construct.

For more info you can take a look at Hair's "A primer on partial least squares structural equation modeling" (pages 50-51).
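As a rough illustration of the two replacement rules described above, here is a minimal Python sketch. The function names, the toy scores, and the group labels are made up for the example and do not come from Hair's book. It also shows a known side effect: filling with the mean shrinks the variance of the variable.

```python
from statistics import mean, pvariance

def mean_impute(values):
    """Replace None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def subgroup_mean_impute(values, groups):
    """Replace None with the mean of the observed values in the same group."""
    by_group = {}
    for v, g in zip(values, groups):
        if v is not None:
            by_group.setdefault(g, []).append(v)
    overall = mean(v for v in values if v is not None)
    fills = {g: mean(vs) for g, vs in by_group.items()}
    # Fall back to the overall mean when a group has no observed values.
    return [fills.get(g, overall) if v is None else v
            for v, g in zip(values, groups)]

# Hypothetical item scores and demographic groups.
scores = [4, 5, None, 2, None, 1, 2]
groups = ["a", "a", "a", "b", "b", "b", "b"]

observed = [v for v in scores if v is not None]
overall_filled = mean_impute(scores)
group_filled = subgroup_mean_impute(scores, groups)

# Mean replacement leaves the mean unchanged but reduces the variance.
var_observed = pvariance(observed)
var_filled = pvariance(overall_filled)
```

The subgroup version only differs in which mean it borrows, so it inherits the same variance shrinkage within each group.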

Ton De Waal

Whether you need MI depends on what you want to do with the imputed data. If you're only interested in estimating totals or means (and not in estimating variances or coverage intervals), you can, for instance, use hot deck imputation or regression imputation.

If you want to use multiple imputation and are looking for software to do so (e.g. in R), you can look at the website of Stef van Buuren: http://www.stefvanbuuren.nl/mi/Software.html.
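For readers unfamiliar with hot deck imputation, a minimal Python sketch of one common variant (random hot deck within imputation classes) could look like this. The function name, the income figures, and the region labels are hypothetical, and real survey software offers more refined donor selection.

```python
import random

def hot_deck_impute(values, classes, rng):
    """Fill each None with a value drawn from an observed donor in the same class."""
    donors = {}
    for v, c in zip(values, classes):
        if v is not None:
            donors.setdefault(c, []).append(v)
    filled = []
    for v, c in zip(values, classes):
        if v is None:
            if c not in donors:
                raise ValueError(f"no donor available for class {c!r}")
            # Copy the value of a randomly chosen donor from the same class.
            v = rng.choice(donors[c])
        filled.append(v)
    return filled

rng = random.Random(0)  # fixed seed so the sketch is reproducible
income = [30, None, 35, 80, 90, None]
region = ["north", "north", "north", "south", "south", "south"]
imputed = hot_deck_impute(income, region, rng)
```

Because every imputed value is a real observed value, the filled-in data stay within the plausible range, but a single imputed data set still understates the uncertainty, which is the reason for the caveat about variances above.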

Kelvyn Jones

This site is full of practical help:

http://missingdata.lshtm.ac.uk/

Timo Lorenz

Here is a useful link with R code:

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

Adrian Fanaca

Lavaan indeed. I am curious about deleting data if it represents more than 15% of the answers. I see that other people do not fully agree with this method. Thoughts?

I read Hosein's 15% suggestion as a heuristic for within-case missingness. I sometimes use cutoffs like that in an attempt to omit cases who didn't comply with the research protocol. In a recent example, I dropped cases who completely skipped more than 1/3 of the individual questionnaires in a large packet, using my judgment (and also looking at the frequency distribution of such skipping) as to whether I trusted the remaining responses to be valid vs. largely arbitrary. But it's context-specific -- I would not argue for a firm number, or for using such a rule at all, across all studies.

Mark Huisman

I agree with Patrick that it is context-specific, and advising a firm number is not wise. Moreover, if a person has 20% missing scores (or in the same way, a variable has 20% missing), there still is 80% observed. Removing these observed scores is wasteful. And you should not forget that deleting incomplete respondents/variables is also a missing data treatment: complete case, available case, listwise deletion, pairwise deletion. These are usually not the best treatments in terms of power and especially bias, as they assume that data are missing completely at random. The suggested methods (MI and model-based methods like FIML) work under the weaker assumption of Missing at Random (MAR) and can correct some biases.
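The bias point can be checked with a small simulation. The Python sketch below is only an illustration under assumed settings (the sample size, the 80% missingness rate, and the linear model are all made up). It uses plain regression imputation as a simple stand-in for a method that exploits MAR, not MI or FIML themselves.

```python
import random
from statistics import mean

rng = random.Random(42)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]  # true mean of y is 0

# MAR: the probability that y is missing depends only on the observed x.
y_obs = [None if (xi > 0 and rng.random() < 0.8) else yi
         for xi, yi in zip(x, y)]

# Complete-case (listwise deletion) estimate of the mean of y.
cc = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
cc_mean = mean(yi for _, yi in cc)

# Regress y on x using the complete cases, then predict the missing y values.
mx = mean(xi for xi, _ in cc)
slope = (sum((xi - mx) * (yi - cc_mean) for xi, yi in cc)
         / sum((xi - mx) ** 2 for xi, _ in cc))
intercept = cc_mean - slope * mx
reg_mean = mean(yi if yi is not None else intercept + slope * xi
                for xi, yi in zip(x, y_obs))
```

In this setup the complete-case mean lands well below zero, because high-x cases are mostly dropped, while the estimate that uses x to fill in y stays close to the true value.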

Thanks a lot, your answers are really helpful. One last thing: I am using Mplus 6, and I can send you the script I am using. Since the software handles the missing data itself, do I still have to address the issue when I present the results in my paper?

Yes. You should always discuss the nature and extent of your missing data problem, your approach to dealing with it, and the assumptions that approach requires. In psychology papers these days, that information is typically included in an analysis strategy section in the methods.

For multiple imputation there are good packages in R. Norm is one, if you have (approximately) multivariate normal data (for the variables with missing values). Mice is another, based on chained equations, where you can specify a different model (distribution) for each incomplete variable. For the latter, you may want to check the paper of White, Royston and Wood in Statistics in Medicine (2010).
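To make the logic of multiple imputation concrete, here is a minimal Python sketch with one incomplete variable y and one complete predictor x. It is not the algorithm used by Norm or Mice: it assumes a linear model, uses a bootstrap of the complete cases as a rough substitute for drawing parameters from their posterior, and pools only the mean of y with Rubin's rules. All names and simulated data are hypothetical.

```python
import random
from statistics import mean, variance

def fit_line(pairs):
    """Least-squares intercept, slope, and residual SD for y on x."""
    mx = mean(px for px, _ in pairs)
    my = mean(py for _, py in pairs)
    sxx = sum((px - mx) ** 2 for px, _ in pairs)
    slope = sum((px - mx) * (py - my) for px, py in pairs) / sxx
    intercept = my - slope * mx
    resid = [py - (intercept + slope * px) for px, py in pairs]
    sd = (sum(r * r for r in resid) / (len(pairs) - 2)) ** 0.5
    return intercept, slope, sd

def multiply_impute_mean(x, y, m, rng):
    """Pool the mean of y over m imputed data sets using Rubin's rules."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    estimates, within = [], []
    for _ in range(m):
        # Bootstrap the complete cases so each imputation uses different
        # regression parameters (a rough stand-in for a posterior draw).
        boot = [rng.choice(complete) for _ in complete]
        a, b, sd = fit_line(boot)
        # Impute with prediction plus random noise, not the bare prediction.
        filled = [yi if yi is not None else a + b * xi + rng.gauss(0, sd)
                  for xi, yi in zip(x, y)]
        estimates.append(mean(filled))
        within.append(variance(filled) / len(filled))
    q_bar = mean(estimates)             # pooled point estimate
    w = mean(within)                    # within-imputation variance
    b_var = variance(estimates)         # between-imputation variance
    total = w + (1 + 1 / m) * b_var     # Rubin's total variance
    return q_bar, total ** 0.5

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [2 + xi + rng.gauss(0, 1) for xi in x]  # true mean of y is 2
y_mis = [None if (xi > 0 and rng.random() < 0.5) else yi
         for xi, yi in zip(x, y)]
pooled_mean, pooled_se = multiply_impute_mean(x, y_mis, m=20, rng=rng)
```

The between-imputation term is what a single imputed data set cannot provide, and it is the reason the pooled standard error reflects the extra uncertainty due to the missing values.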