Can I collapse my data by mean for a very unbalanced dataset and still get good results?

12 April 2021

I have a complicated issue. I have a dataset of 44,000 surveyed individuals from 68 countries for different time periods. The time periods are all non-successive: for example, data for Egypt are available for 2008 and 2012, data for Argentina for 2000 and 2005, etc. This unbalanced pooled dataset is not representative of each country as a whole, because I am selecting individuals of a certain religion. I ended up with very few observations from some countries and very many from countries in which this religion is dominant. Some countries have as few as 1 observation and others have about 3,000. I am testing the impact of a follower's religiosity on his/her economic outcome under governments of different quality. The data on government quality are macro data, so I tried collapsing my dataset by mean so that I could add the government quality variable. I ended up with 114 observations, on which I ran a fixed-effects model and got good, convincing results with an R-squared of about .50.

hettest produced Prob > chi2 = 0.7608. My main model produces Corr(u_i, Xb) = -0.8220, which indicates that the fixed effects are strongly correlated with my explanatory variables and that FE is essential to take care of endogeneity, and rho = .7888, suggesting good reliability.

Was collapsing such a very unbalanced dataset the right thing to do? Is there a way to weight countries based on the number of observations they have? What would be the alternative if collapsing the dataset is not the right thing to do?

Federico Nutarelli

Hi,

I would suggest several approaches that you can try, depending on your needs.

1) If you are on Stata, you can try the xtbalance2 package, which balances your data while maximising the number of observations kept. Working with balanced data might be an advantage if you care about nice statistical properties. As for FE, the estimator is almost invariant to whether the data are balanced or unbalanced (please see, for instance, Wooldridge, 2002 for further details).

2) If you want to compute a weighted mean (again only on Stata, but I guess there are similar ways in other software too), there are two ways: you can either use the "asgen" module or specify your weights within collapse, as follows: collapse (mean) variable [aweight=weightvar], by(groupvars).
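
For readers not on Stata, the weighted collapse in point 2 can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the variable names and the survey weights below are made up, and the cell is assumed to be country-year. It also keeps the number of individuals behind each cell mean, which could later serve as a weight in the country-level regression.

```python
from collections import defaultdict

# Hypothetical individual-level records (not the asker's data):
# (country, year, religiosity, survey_weight)
records = [
    ("Egypt", 2008, 4.0, 1.0),
    ("Egypt", 2008, 2.0, 3.0),
    ("Egypt", 2012, 5.0, 1.0),
    ("Argentina", 2000, 3.0, 2.0),
]


def collapse_weighted_mean(rows):
    """Collapse rows to one weighted mean per (country, year) cell.

    Returns {(country, year): (weighted_mean, n_obs)}.
    """
    sum_wx = defaultdict(float)  # running sum of weight * value per cell
    sum_w = defaultdict(float)   # running sum of weights per cell
    n_obs = defaultdict(int)     # number of individuals per cell
    for country, year, value, weight in rows:
        key = (country, year)
        sum_wx[key] += weight * value
        sum_w[key] += weight
        n_obs[key] += 1
    return {key: (sum_wx[key] / sum_w[key], n_obs[key]) for key in sum_w}


collapsed = collapse_weighted_mean(records)
for (country, year), (mean, n) in sorted(collapsed.items()):
    print(country, year, round(mean, 3), n)
```

Keeping n_obs next to each mean matters here because a cell built from 1 individual is far noisier than one built from 3,000. Weighting the collapsed observations by n_obs is one possible way to reflect that, not the only one.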

Now, regarding your question about collapsing the unbalanced data: it depends. You first need to see whether the data are censored, that is, whether there is a particular reason why data are missing and whether that reason is connected to your dependent variable. These are standard checks to guard against endogeneity. As a simple example, take a look at selection bias: you don't want your observations to be endogenously selected into your sample.

That said, if none of the above problems emerges, whether or not to collapse the data depends on the specific reason why you want to collapse them.

Mahdi Movahed-Abtahi

Based on your assumptions, you should design and arrange the different variables as a fuzzy system and compute their fuzzification. You should revise your design. Indeed, you must use multi-valued logic to explain your findings via a single causal network of influencing factors.
