The following is a copy of my question on Stack Exchange's Cross Validated site: http://stats.stackexchange.com/questions/90386/optimal-sampling-strategy-for-efa-cfa-and-sem. Please feel free to answer there or here at ResearchGate at your convenience.

I'm wondering what the optimal sampling strategy for my dissertation research should be. I have four data sources (two meta-repositories of open source software (OSS) projects and two global startup databases). I'd like to perform EFA to discover (or, rather, to confirm my theory-based assumptions about) the factor structure of the study's constructs. Then I plan to perform CFA to determine the validity and reliability of the measurement model. Finally, I plan to perform SEM analysis to test the study's structural model and hypotheses.

Having said that, I plan to perform the data analysis (at least SEM; I'm not sure about EFA/CFA) on two data sets: pilot and main. I believe that the pilot analysis will allow me to modify the model if the fit indices are inadequate. Then I plan to perform the main SEM analysis of the modified model (and, possibly, alternative models) on the main data set. In addition, I plan to perform both covariance-based (CB) and partial least squares (PLS) SEM analyses to compare them for my study. What would be the optimal approach, and its steps, in terms of the following:

1. Sampling technique. I was thinking about random sampling of data from each OSS meta-repository or from a merged data set, then selecting the corresponding data from the startup databases.

2. Strategy for dividing the sample data set into pilot and main data sets.

3. Any special steps for sampling due to multiple methods (EFA/CFA/SEM).

4. Any special steps for sampling due to alternative models.

5. Any special steps for sampling due to analyzing via both CB-SEM and PLS-SEM.
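
For concreteness, the plan in items 1 and 2 could be sketched roughly as below. This is only an illustration under stated assumptions, not a recommended design: the source names, project IDs, per-source sample size, and the 20% pilot fraction are all hypothetical placeholders, and whether sampling per source (rather than from a merged data set) is appropriate is exactly what the question asks.

```python
import random


def sample_and_split(records_by_source, n_per_source, pilot_fraction, seed=0):
    """Draw a simple random sample from each source, then split each
    sample into disjoint pilot and main subsets."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    pilot, main = [], []
    for source, records in sorted(records_by_source.items()):
        # Sample without replacement; the result is already in random order,
        # so slicing it gives a random pilot/main split.
        drawn = rng.sample(records, min(n_per_source, len(records)))
        n_pilot = round(len(drawn) * pilot_fraction)
        pilot.extend((source, r) for r in drawn[:n_pilot])
        main.extend((source, r) for r in drawn[n_pilot:])
    return pilot, main


# Hypothetical project IDs standing in for the two OSS meta-repositories.
records_by_source = {
    "oss_repo_a": [f"a{i}" for i in range(1000)],
    "oss_repo_b": [f"b{i}" for i in range(800)],
}
pilot, main = sample_and_split(
    records_by_source, n_per_source=500, pilot_fraction=0.2
)
```

Keeping the pilot and main subsets disjoint means that a model modified on the pilot data is later tested on observations it has not seen; the matching startup-database records would then be looked up for the sampled projects only.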

Bonus question: :-) I plan my study as cross-sectional, but the data in the meta-repositories do not cover exactly the same time frames. For example, data from seven OSS repositories fall within the range from September 2012 to December 2013. I think that in the OSS world the variance in projects' characteristics should not be dramatic, as the OSS ecosystem is not very dynamic on average. The question is whether using this semi-cross-sectional approach will allow me to retain statistical validity, and what statistical tests exist to confirm that?

Your help and advice on this are greatly appreciated!
