If anyone has any suggestions or could point me to useful references, it would be much appreciated. I'm trying to weigh the costs and benefits of balancing my dataset by cutting out observations.
The term panel data refers to a multi-dimensional data set in which the same units are observed over multiple time periods. It is also called longitudinal data. Panel data should not be confused with data obtained from a panel of experts, e.g. a country risk analysis in which a panel of experts is convened and asked to answer a question. A panel data model has the form:
(1) Y = a + bX + u
… where a = Y-intercept, b = slope, and u = random error.
(2) u = mu + v
… where mu = mean of the random error distribution, and v = random error.
BALANCED & UNBALANCED DATA
A balanced data set contains every unit observed in every time period, whereas in an unbalanced data set some units are not observed in certain years. Recall that in the balanced panel the error term is u = mu + v; in the unbalanced panel, however, there is an additional component in “u”; therefore:
(4) u = mu + v + e
… where “e” is the additional disturbance from the unbalanced random effect term. The unbalanced panel becomes a problem when “e” has a significant effect on the system, thereby inflating the error term in equation (1). ANOVA-type methods, MIVQUE (minimum variance quadratic unbiased estimation) and maximum likelihood (MLE) can be used to estimate this error component.
See: Baltagi, B. H. (2005): Econometric Analysis of Panel Data. John Wiley & Sons, Chichester, England; and Cameron, A. C., and P. K. Trivedi (2009): Microeconometrics Using Stata. Stata Press, College Station, Texas.
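As a quick way to see which case applies to your data, here is a minimal sketch that checks whether a panel is balanced and lists the unit-period cells that are missing. It assumes the data can be reduced to (unit, period) pairs; the function names and the toy countries and years are hypothetical, for illustration only.

```python
from collections import defaultdict
from itertools import product


def is_balanced(pairs):
    """Return True if every unit is observed in every period.

    pairs: iterable of (unit, period) tuples, one per observation.
    """
    periods_by_unit = defaultdict(set)
    all_periods = set()
    for unit, period in pairs:
        periods_by_unit[unit].add(period)
        all_periods.add(period)
    return all(seen == all_periods for seen in periods_by_unit.values())


def missing_cells(pairs):
    """Return the sorted (unit, period) combinations that are not observed."""
    observed = set(pairs)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    return sorted(set(product(units, periods)) - observed)


# Hypothetical toy panel: country B is not observed in 2002.
toy_panel = [
    ("A", 2001), ("A", 2002), ("A", 2003),
    ("B", 2001), ("B", 2003),
    ("C", 2001), ("C", 2002), ("C", 2003),
]

print(is_balanced(toy_panel))
print(missing_cells(toy_panel))
```

The list of missing cells is usually more informative than the yes/no answer, because it shows whether the gaps cluster in particular units or years.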
Your choice depends on the frequency of, and the reasons for, missing data.
If 1) your panel data set is almost complete, that is, missing observations are infrequent, or at least only a few items of each observation are lost, and 2) you can justify that the data are missing at random, then converting an unbalanced panel into a balanced one is not costly (the "cost" in this case is just a small loss of efficiency).
If missing data are frequent, the efficiency loss might be considerable.
If missing data are nonrandom, then converting into a balanced panel may result in a biased sample.
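To get a feel for the efficiency cost described above, one can count how many observations are lost by balancing. The sketch below assumes one particular way of balancing, namely dropping every unit that is not observed in all periods (dropping periods instead is another option); the function name and the toy rows are hypothetical.

```python
def balance_by_dropping_units(rows):
    """Keep only units observed in every period.

    rows: list of (unit, period, value) tuples.
    Returns (kept_rows, share_of_rows_lost).
    """
    all_periods = {period for _, period, _ in rows}
    periods_by_unit = {}
    for unit, period, _ in rows:
        periods_by_unit.setdefault(unit, set()).add(period)
    complete_units = {
        unit for unit, seen in periods_by_unit.items() if seen == all_periods
    }
    kept = [row for row in rows if row[0] in complete_units]
    share_lost = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, share_lost


# Hypothetical toy panel: unit B has no observation for 2002.
toy_rows = [
    ("A", 2001, 1.0), ("A", 2002, 1.2), ("A", 2003, 1.1),
    ("B", 2001, 0.7), ("B", 2003, 0.9),
    ("C", 2001, 2.0), ("C", 2002, 2.1), ("C", 2003, 2.4),
]

balanced_rows, share_lost = balance_by_dropping_units(toy_rows)
print(len(balanced_rows), share_lost)
```

A small share lost suggests the efficiency cost of balancing is minor; it says nothing about whether the dropped units differ systematically from the ones kept, which is the separate question of bias.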
The reason for missing data is important. If data are systematically missing, then the exogeneity assumption does not hold. In Paul's notation, your unbalanced panel has the structure below:
i) If d = 1;
Y* = a + bX + u, Observed
ii) If d = 0;
Unobserved
If the residual "u" has a systematic relationship with the indicator "d", then the estimate of "b" will be biased because of the endogeneity.
You can address this problem by modeling the sample selection mechanism together with the equation of interest:
Y* = a + bX + u .
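d = c + gZ + e , where Z contains variables that explain whether an observation is observed and g is its coefficient.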
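To illustrate why a link between "u" and the indicator "d" matters, here is a small simulation sketch. It is not an estimator for the selection model, only a demonstration of the bias. The assumptions are mine: the selection coefficient is written g, Z is taken to be X itself, c = 0 and g = 1, an observation is kept when the latent index c + gZ + e is positive, and in the "systematic" case e is set equal to u. All function names are hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple one-regressor OLS fit of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, a=1.0, b=2.0, selection_on_u=True, seed=0):
    """Return (slope on the full sample, slope on the observed subsample)."""
    rng = random.Random(seed)
    full_x, full_y, obs_x, obs_y = [], [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        u = rng.gauss(0, 1)
        y = a + b * x + u
        # Assumption: e equals u when selection is systematic,
        # otherwise e is an independent draw (random missingness given X).
        e = u if selection_on_u else rng.gauss(0, 1)
        observed = (0.0 + 1.0 * x + e) > 0  # c = 0, g = 1, Z = X
        full_x.append(x)
        full_y.append(y)
        if observed:
            obs_x.append(x)
            obs_y.append(y)
    return ols_slope(full_x, full_y), ols_slope(obs_x, obs_y)


b_full, b_systematic = simulate(selection_on_u=True)
_, b_random = simulate(selection_on_u=False)
print(b_full, b_systematic, b_random)
```

Under these assumptions the slope estimated on the systematically selected subsample falls noticeably below the true b = 2, while the full sample and the subsample with unrelated e both recover it closely.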
(Reference)
Paul's recommendation, Cameron and Trivedi (2009), is a practical reference, as the book is designed for direct use with Stata.
If you need a more textbook-style presentation of the background concepts, Cameron and Trivedi (2005), "Microeconometrics: Methods and Applications" (Cambridge), would be another good reference.
Hi Frances. The fact that your panel is unbalanced should not be ignored, even if it turns out not to be problematic. You should check that attrition in your panel is random, i.e., that the units composing your panel leave it randomly. If the units do not leave your panel dataset in a random way, they may have some unobserved characteristics that you should control for. Cutting observations in order to get a balanced panel might be a solution (depending on the question that you are studying, as it can also raise selectivity problems and bias), but you may be able to use richer information if you work with the unbalanced panel and take into account the potential bias caused by attrition.
Maybe this reference might be helpful:
An Analysis of Sample Attrition in Panel Data: The Michigan Panel Study of Income Dynamics, by John Fitzgerald, Peter Gottschalk, Robert Moffitt (1998)
When estimating the effects of independent variables on a dependent variable with regression models, a balanced panel is simpler to work with than an unbalanced one.
We need to first understand the reason for the absence of the data. In the case of randomly missing data, most commands can be applied to the unbalanced panel. The problem arises in the case of non-randomly missing data.