Which variables (target only, or all?) to enter into Mahalanobis distance analyses when searching for multivariate outliers?

29 June 2023

Hello,

I have a data set with N = 369 individuals measured at a single time point. The goal of the study is to create an assessment of psychological safety (PS). The assessment is a self-report measure asking participants to indicate how psychologically safe they feel using a unipolar 5-point Likert scale ranging from 1 (not at all) to 5 (extremely).

In addition to the assessment I am creating, I also measured a number of demographic variables (e.g., age, salary) and a few additional measures of team environment for validation (e.g., an existing measure of PS, level of team interdependence).

My primary goal is to run exploratory factor analysis (EFA). This is the first time anyone has conceptualized PS as multidimensional, so one of the primary goals is to uncover the potential factor structure of PS. A second goal is to identify candidate items for deletion.

To prepare for the EFA, I am cleaning the data by following recommendations in (the excellent) Tabachnick & Fidell (2013, 6th ed.).

I am currently at the point where I am checking the data for multivariate outliers, starting with Mahalanobis distance. And I cannot find explicit guidelines regarding which variables I should be including as "IVs" in the analysis.

QUESTION: Which variables should I be including in my search for multivariate outliers? Do I include all variables, or only my target variables?

Specifically, do I include only the variables that represent the item pool for my forthcoming PS assessment? Or do I include all the PS items AND demographic variables, the existing PS assessment, interdependence measure, etc.??

I ran the Mahalanobis distance analyses 2 times using both approaches, and found substantial differences:

TIME 1 - With just the PS assessment variables --> I identified n = 28 multivariate outliers.
TIME 2 - With PS items + demographics, etc. --> I identified n = 10 multivariate outliers (all identified as outliers in the TIME 1 analysis).

Syntax I am using - the bolded variables are the ones I am questioning if I should include or not:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Subjno
  /METHOD=ENTER Age Salary Edu WorkStructure TeamSize Tenure_OnTeam JapaneseBizEnviron EdmondsonPS_TOT Interdep_TOT Valued_TOT PS_1 PS_48 PS_141 PS_163 PS_43 PS_53 PS_73 PS_133 PS_135 PS_19 PS_60_xl26 PS_93 PS_106_xl26 PS_143 PS_58 PS_86 PS_182 PS_56 PS_69 PS_103 PS_164 PS_22 PS_35 PS_91 PS_30 PS_59 PS_63 PS_90 PS_131 PS_140
  /RESIDUALS = OUTLIERS(MAHAL)
  /SAVE MAHAL.

(**Note: the PS assessment variable list is truncated because of the large number of items.)
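For readers who want to try this check outside SPSS, here is a minimal Python sketch of the same idea. It is not the poster's actual procedure. It assumes NumPy, complete cases (listwise deletion already applied), and a chi-square criterion with df equal to the number of variables entered. The function names and the toy data are hypothetical.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the column centroid."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample covariance (n - 1 denominator); atleast_2d covers the one-column case.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Pseudo-inverse so that a near-singular covariance does not raise an error.
    cov_inv = np.linalg.pinv(cov)
    # Row-wise quadratic form: d2_i = c_i' * cov_inv * c_i
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

def flag_outliers(X, cutoff):
    """Row indices whose squared distance exceeds the chosen chi-square cutoff."""
    return np.flatnonzero(mahalanobis_sq(X) > cutoff)

# Toy data (hypothetical): 200 cases, 3 variables, one planted extreme case.
rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 3))
toy[0] = [6.0, -6.0, 6.0]

# 16.266 is the chi-square critical value for df = 3 at p = .001.
flagged = flag_outliers(toy, cutoff=16.266)
print(flagged)
```

If SciPy is available, the cutoff can be computed as `scipy.stats.chi2.ppf(0.999, df)` rather than read from a table. Because df equals the number of variables entered, the cutoff rises with every variable added, so running the check on the PS items alone and on the full variable list applies different criteria to different distance values.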

David Morse

Hello Melissa,

If your immediate aim is that of trying to identify a plausible factor structure for your new measure, then the brief answer is to rely only on the scores from that measure (and not the demographic attributes).

However, there are a couple of points you may wish to consider.

First, as the individual responses on the target measure are ordinal strength, you'll need to consider how best to proceed; treating them as interval strength (which is what both the computation for distance measures and the subsequent factoring would do by computing covariances/correlations) is likely not the best choice. Some options here would be: (a) using polychoric correlations; (b) using IRT to explore both dimensionality and unusual (non-model-fitting) cases; (c) using some other scaling method, such as Guttman, S-P (subject-problem), or Mokken; or (d) using factoring models that allow for ordinal scores.

Second, a lot of folks may (rightly) question why you think that outlying cases should be jettisoned from your data set. Cleaning data of impossible values or miscoded scores is an obvious step. Deciding that your sample cannot have atypical cases, albeit with otherwise legitimate scores, is not so obvious a need or requirement.

Good luck with your work.

Ma'Mon Abu Hammad

When conducting Mahalanobis distance analyses to search for multivariate outliers, it is generally recommended to include all variables in the analysis, including both the target variables and the independent variables (IVs). This means including the variables that represent the item pool for your psychological safety (PS) assessment as well as the demographic variables, existing PS assessment, interdependence measure, etc.

Including all variables in the analysis allows for a comprehensive examination of multivariate outliers, considering the relationships among all the variables in your dataset. This approach ensures that potential outliers are identified based on their multivariate distance from the overall data distribution, taking into account the entire set of variables.

Excluding certain variables from the analysis may lead to biased results, as outliers in one variable may not necessarily be outliers when considering the multivariate space. By including all variables, you are better able to capture the overall structure and patterns of outliers within your dataset.

In your case, it would be appropriate to include all the variables, including the PS assessment items, demographic variables, existing PS assessment, interdependence measure, etc., in your Mahalanobis distance analyses. This will provide a comprehensive assessment of multivariate outliers in relation to the entire set of variables you have measured.

Therefore, it is recommended to use the syntax you provided, including all the variables, to conduct the Mahalanobis distance analyses and identify multivariate outliers.

Melissa Tarantola

Thank you so much, both of your answers were extremely helpful! I ultimately made the decision to run the analyses 3 times and then compare results - (1) full data set, (2) with n = 10 multivariate outliers removed, and (3) with n = 28 multivariate outliers removed. It was time-consuming, but luckily there were very few differences in results, so I didn't have to make any hard decisions.
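A comparison like this can be scripted. The sketch below is a rough proxy, not the analysis described in this post: it assumes NumPy, uses the eigenvalues of the item correlation matrix (the quantities a scree plot displays) as a stand-in for full EFA results, and takes a hypothetical list of row indices for the flagged cases.

```python
import numpy as np

def correlation_eigenvalues(items):
    """Eigenvalues of the item correlation matrix, largest first."""
    corr = np.corrcoef(np.asarray(items, dtype=float), rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def compare_after_removal(items, drop_rows):
    """Eigenvalues with all cases, with drop_rows removed, and the largest
    absolute difference between the two sets."""
    items = np.asarray(items, dtype=float)
    keep = np.ones(len(items), dtype=bool)
    keep[np.asarray(drop_rows, dtype=int)] = False
    full = correlation_eigenvalues(items)
    trimmed = correlation_eigenvalues(items[keep])
    return full, trimmed, float(np.max(np.abs(full - trimmed)))

# Toy 5-point responses (hypothetical): 100 cases, 6 items.
rng = np.random.default_rng(0)
toy_items = rng.integers(1, 6, size=(100, 6))
full, trimmed, max_shift = compare_after_removal(toy_items, drop_rows=[3, 17, 42])
print(np.round(full, 3))
print(np.round(trimmed, 3))
print(round(max_shift, 3))
```

Small shifts in the eigenvalues only suggest that the factor solution is unlikely to change much. Loadings and item retention decisions would still need to be compared directly.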

I will have to look into the analyses David Morse suggested. In selecting my rating scale labels, I used research performed by Beckstead (2014). This study attempted to map the strength of adjectives that are typically attached to Likert scales, allowing researchers to select rating scale labels that are "spaced an equal distance apart" and so to proceed with analyses under a crude assumption of interval scaling. But given that this spacing assumption is based on a single study, it's hardly satisfactory.

Thank you again both of you!
