I would like to know how we can compare two groups when both groups have unbalanced panel data. What statistical technique or method can be used on such data?
If you are looking for a tool for the analysis, R has the "plm" and "pglm" packages for panel data analysis. The method of analysis will depend on the question you want to answer. Estimating systems of regression equations by generalized least squares (GLS) or maximum likelihood (ML) is one option.
Thanks, Inder, for your direction. Actually, the whole data set is divided into two groups depending on the presence or absence of one independent variable, which has made the panel unbalanced. So I wanted to check which of the two groups has the more positive, significant effect on the dependent variable. There are other independent variables as well.
DATA: There are two possible data types: (i) continuous data (quantitative) and (ii) discrete data, i.e. categorical (Yes | No).
(i) Continuous Data: If the data are of the continuous quantitative type, the comparison is generally done by location analysis, i.e. analysis of variance. There are two scenarios: (a) two samples of equal length, i.e. n1 - n2 = 0, and (b) two samples of unequal length, i.e. n1 - n2 not equal to zero. Assuming that the data in both instances are normally distributed, there are two types of applicable T-test.
[a] Data with Equal Length. Suppose that one group of data has a sample size of n1 and a second group has a sample size of n2, and suppose further that n1 = n2 = n. The appropriate location analysis is d-bar analysis. Let the first data set be Xi: (x1, x2, ..., xn) and the second be Yi: (y1, y2, ..., yn). Since Xi and Yi have equal counts, we can pair them, provided each xi and yi are matched observations (for example, the same unit measured twice), and determine their differences:
(x1 – y1) = d1
(x2 – y2) = d2
…
(xn – yn) = dn(pair)
The mean of the difference is simply denoted as d^ or d-bar:
(1) d^ = (1/n)Σdi
With the standard deviation of:
(2) Sd = sqrt((1/(n - 1)) Σ(di - d^)^2)
The test statistic is given by:
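(3) Td = d^ / (Sd / sqrt(n)), where n is the number of pairs
Set your confidence level, e.g. 0.95, 0.99, etc., and read the critical value T from the T-table with n - 1 degrees of freedom. The decision rule follows: H0: if Td(obs) < T, the difference is not statistically significant; HA: if Td(obs) > T, the difference is statistically significant.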
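As a rough illustration of steps (1) to (3), here is a minimal Python sketch of the d-bar (paired) analysis. The function name `paired_t` and the sample values are made up for the example, and the standard deviation uses the usual n - 1 divisor, which is an assumption on my part.

```python
import math

def paired_t(x, y):
    """Return (d_bar, s_d, t) for two paired samples of equal length."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    n = len(x)
    d = [xi - yi for xi, yi in zip(x, y)]          # pairwise differences
    d_bar = sum(d) / n                              # mean difference, eq. (1)
    # sample standard deviation of the differences (n - 1 divisor assumed)
    s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n - 1))
    t = d_bar / (s_d / math.sqrt(n))                # test statistic, eq. (3)
    return d_bar, s_d, t

# Hypothetical illustrative data, not taken from the thread
x = [12.0, 15.0, 11.0, 14.0, 13.0]
y = [10.0, 14.0, 10.0, 11.0, 12.0]
d_bar, s_d, t = paired_t(x, y)
print(d_bar, s_d, t)
```

The resulting t is then compared with the T-table value for n - 1 = 4 degrees of freedom at the chosen confidence level.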
[b] Data of Unequal Length. If n1 is not equal to n2, then the unpaired t-test may be used. The test statistic is given by:
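Formula (4) is not shown in this thread, so the following is only one possible form of the unpaired statistic: Welch's version, which does not assume equal variances. That choice, the function name `welch_t`, and the sample values are all assumptions made for illustration.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's unpaired t statistic and its approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)       # sample variances (n - 1 divisor)
    se2 = v1 / n1 + v2 / n2                  # squared standard error of the mean difference
    t = (mean(x) - mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples of unequal length
x = [5.0, 7.0, 9.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
t_welch, df_welch = welch_t(x, y)
print(t_welch, df_welch)
```

If the two variances can be assumed equal, the pooled-variance form of the unpaired t-test would be used instead.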
(ii) Discrete Data. If the data are dichotomous, the test becomes a binomial distribution test. The one-population binomial test may be given by:
(5) Z = (p - p0) / sqrt(p0(1 - p0) / n)
... where p = observed proportion of the specified class; p0 = proportion of the class of interest in the population; n = sample size. This assumes that the data readings are taken at the same time. If the data are taken at two separate times, i.e. pre- and post-introduction of a stimulus, then a Z-test for two counts (Poisson distribution) is used. From the stated facts, "... divided into two groups depending on presence or absence of independent variable ...", it appears you have dichotomous data: presence = yes and non-presence = no; therefore, yes = 1 and no = 0. Otherwise, follow the non-discrete routine.
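Formula (5) can be sketched in a few lines of Python. The function name `proportion_z` and the counts (60 "yes" out of 100, tested against p0 = 0.5) are hypothetical values chosen only to show the arithmetic.

```python
import math

def proportion_z(successes, n, p0):
    """One-sample Z statistic for a proportion, following eq. (5)."""
    p = successes / n                               # observed proportion
    return (p - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical example: 60 "yes" readings out of 100, null proportion 0.5
z = proportion_z(60, 100, 0.5)
print(z)
```

The observed Z is then compared with the Z-table value for the chosen confidence level, e.g. 1.96 for a two-sided test at 0.95.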
NON-NORMALLY DISTRIBUTED DATA: If the data sets are continuous (non-dichotomous) and not normally distributed, whether both sets or only one of them, then the T-test is precluded. The underlying assumption of the T-table and Z-table is that the data are normally distributed. For non-normally distributed data in a two-set comparison study, use the F-test. The F-test compares the shape of the two data distributions through their respective variances, not their means. The F-test in this case is given by:
(6) F = S1^2 / S2^2
... where Si^2 = sample variance of group i with ni - 1 degrees of freedom. Compute the observed ratio under (6), then compare it to the critical value in the F-table, read where the degrees of freedom df1 and df2 intersect; that is the critical value F(standard). H0: if F(obs) < F(standard), there is no significant difference. HA: if F(obs) > F(standard), there is a significant difference between the two samples.
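The ratio in (6) and the two degrees of freedom needed for the F-table lookup can be computed as in this short sketch. The function name `variance_ratio` and the two samples are hypothetical.

```python
from statistics import variance

def variance_ratio(x, y):
    """Return (F, df1, df2) for the variance-ratio test of eq. (6)."""
    s1_sq = variance(x)          # sample variance of group 1 (n1 - 1 divisor)
    s2_sq = variance(y)          # sample variance of group 2 (n2 - 1 divisor)
    return s1_sq / s2_sq, len(x) - 1, len(y) - 1

# Hypothetical samples of unequal length
x = [5.0, 7.0, 9.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]
f_obs, df1, df2 = variance_ratio(x, y)
print(f_obs, df1, df2)
```

F(obs) is then compared with the F-table entry for (df1, df2) at the chosen confidence level.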
REFERENCES: For your reference and use, the T-table and F-table are attached. I have also provided a link to Gopal Kanji's book, 100 Statistical Tests. I hope this has been helpful. Cheers.
But I need more clarification, so I feel I must elaborate on my problem. I am working on panel data with large n and t = 10. I cannot use the binomial distribution test, since the discrete variable is an independent variable and the binomial test requires a binary outcome. Apart from the other independent variables, I want to know whether the model is better with this independent variable present or absent.
Is there some other method to find out the importance of such a discrete variable?
Whether the data are balanced or unbalanced should not be a problem as long as the degree of 'unbalancedness' is not extreme. The problem you are actually facing is then just one of specification: you have to choose criteria for evaluating the candidate models (obtained by varying the inclusion of the other independent variables).
If I understood your question correctly, you want to see whether the model that includes this particular variable fits the data better than the model that excludes it, or vice versa. In that case the unbalanced nature of the data should not be a problem. What you need to do is look at measures such as R-squared, adjusted R-squared, AIC and BIC: the first two indicate how well a statistical model fits the data, and the other two compare the relative quality of statistical models.
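To make the AIC/BIC comparison concrete, here is a minimal Python sketch for two linear models with Gaussian errors, using the residual sum of squares (RSS) of each fit. The formulas drop a constant common to both models, so only differences are meaningful. The RSS values, the number of observations and the parameter counts are hypothetical, and both models are assumed to be fitted on exactly the same observations.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k

def gaussian_bic(rss, n, k):
    """BIC of a Gaussian linear model, up to an additive constant."""
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical fits on the same 200 observations
n_obs = 200
rss_with, k_with = 480.0, 7          # model including the dummy variable
rss_without, k_without = 500.0, 6    # model excluding it

aic_with = gaussian_aic(rss_with, n_obs, k_with)
aic_without = gaussian_aic(rss_without, n_obs, k_without)
bic_with = gaussian_bic(rss_with, n_obs, k_with)
bic_without = gaussian_bic(rss_without, n_obs, k_without)

# Lower values indicate the preferred model
print("AIC:", aic_with, aic_without)
print("BIC:", bic_with, bic_without)
```

With these made-up numbers both criteria favour the model that includes the dummy; with real data the two criteria can disagree, since BIC penalises extra parameters more heavily.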
The first thing to establish is whether the panel is incomplete because observations are missing at random or not at random. It seems you have one group of observations for which an explanatory variable is completely missing. Do the observations within the two groups (with and without the variable) have anything else in common besides having or not having the variable? If they don't, and you separated them only for that reason, you may have a randomly-missing-observation problem. However, I don't think this is an unbalanced panel. It may be a problem of model specification, as pointed out by Carlos. If economic theory says nothing about the need to include this variable in your model, then you should test for model misspecification. Several omitted-variable tests can be used to check this. Some of them are included as canned routines, or are even run alongside the most common regression estimation models. Three simple, common tests are the Wald test, the Lagrange Multiplier test, and the Likelihood Ratio test. A somewhat more complex but very popular test that is easily adapted to this problem is the Hausman test.
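Of the tests mentioned, the Likelihood Ratio test is the easiest to sketch by hand for nested linear models with Gaussian errors. The following Python example is only an illustration: the RSS values and sample size are hypothetical, the function name `lr_statistic` is made up, and the critical value 3.841 is the 5% chi-square value for one restriction (one omitted variable).

```python
import math

# Chi-square critical value for 1 degree of freedom at the 5% level
CHI2_CRIT_DF1_5PCT = 3.841

def lr_statistic(rss_restricted, rss_unrestricted, n):
    """LR statistic for nested linear models with Gaussian errors."""
    return n * math.log(rss_restricted / rss_unrestricted)

# Hypothetical fits on the same 200 observations
n_obs = 200
rss_without = 500.0   # restricted model (variable omitted)
rss_with = 480.0      # unrestricted model (variable included)

lr = lr_statistic(rss_without, rss_with, n_obs)
reject_h0 = lr > CHI2_CRIT_DF1_5PCT   # True means the omitted variable matters
print(lr, reject_h0)
```

Rejecting H0 here would suggest that leaving the variable out misspecifies the model.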
No, sir. There is no missing observation, nor do I want to judge the effect of including a variable. I may not be phrasing my question properly.
I am working on panel data with large n and t = 10 years. There are 6 independent variables and 3 dependent variables. One of the independent variables is a discrete dummy variable: for example, whether or not a committee is present in a company in a given year. The observations show that for some years it was present and for other years it was not. For Company A, the committee was present for 2 years and absent for the remaining 8 years. For Company B, the committee was present for all 10 years. For Company C, the committee was not present in any year. This was observed for n companies, and the number of years with and without the committee is not the same for all companies, as the example shows. I then divided the whole data set into two groups to find out whether the presence or absence of the committee has any effect on my dependent variable. This has made both groups unbalanced, with a different t for each company. I can run a regression on each group individually, but I want to know how I can compare the two groups. Is there some better method or technique to find out the results?