The standard functions in the stats package deal with unbalanced data (the functions lm, glm and aov).
If you do not have enough memory to run these on your computer (e.g. because you have many millions of observations), there is the biglm function in the biglm package. I think these functions do not deal well with unbalanced datasets.
Thanks Frank for your helpful answer. What do you mean by "I think these do not deal well with unbalanced datasets"? Do you mean that lm, glm and aov cannot fit a huge dataset? I indeed have a big dataset.
OK, a more fundamental explanation. When you do an ANOVA (or a similar analysis) by hand, you use simple, computationally undemanding equations that do not correct for unbalanced data. Equations that do correct for unbalanced data are computationally more intensive, and those are what the base functions (aov, lm and glm) in R implement.
R also has a second set of equations, implemented in the biglm package (the biglm function), that can deal with huge datasets. I suspect that these assume balanced data.
So, first test whether you are able to estimate your models with the basic functions (lm, aov or glm). If the functions fail due to insufficient memory, then do some reading on how to minimize memory use, and you might want to shift to the biglm function. I am not sure what the implications are if you use it with unbalanced data.
If your data are strongly unbalanced, it might be worthwhile to do some serious reading about the possible implications of unbalanced data.
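In case it helps, here is a minimal sketch of how biglm is typically used when the data do not fit in memory: the model is initialised on a first chunk and then updated chunk by chunk. The file name "big_data.csv", the chunk size and the variable names y, x1, x2 are only placeholders for your own data.

```r
## Sketch only: file name, chunk size and variable names are placeholders.
library(biglm)

ff  <- y ~ x1 + x2                        # model formula
con <- file("big_data.csv", open = "r")

## First chunk (with header) initialises the model
chunk <- read.csv(con, nrows = 100000)
cols  <- names(chunk)
fit   <- biglm(ff, data = chunk)

## Remaining chunks update the fit without holding all rows in memory
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 100000, header = FALSE, col.names = cols),
    error = function(e) NULL)             # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)

summary(fit)
```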
Frank, thank you so much for your time and effort; you are highly appreciated. My plan is to use the lm command. I have data from 1996 to 2011 with different numbers of locations (quite a few; in some years there are 40), replications and genotypes. I am going to analyze each year and use BLUPs to correct the data with the model Location + Genotype + L:G. Then I will combine all years and use lm with the model Y + G + Y:G.
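A rough sketch of how those two model formulas could be written in R, assuming a data frame dat with columns Year, Location, Genotype and Yield (these names, and the use of lme4 for the BLUP stage, are my assumptions, not part of the original plan). I am not sure exactly how you intend to carry the per-year BLUP adjustment into the combined step, so the second stage below simply shows the Y + G + Y:G formula applied to the pooled data.

```r
## Sketch only: `dat` and its column names are placeholders for your own data.
library(lme4)

dat$Year <- factor(dat$Year)

## Stage 1: per-year model of the form Location + Genotype + L:G, with the
## terms treated as random so that ranef() returns the genotype BLUPs
d1996 <- subset(dat, Year == "1996")
m1996 <- lmer(Yield ~ (1 | Location) + (1 | Genotype) + (1 | Location:Genotype),
              data = d1996)
ranef(m1996)$Genotype             # genotype BLUPs for 1996

## Stage 2: combined analysis across years with the model Y + G + Y:G
fit_all <- lm(Yield ~ Year + Genotype + Year:Genotype, data = dat)
anova(fit_all)
```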
Another option is to coerce the data into a balanced design via resampling and running multiple tests; whether this is feasible or helpful will depend on how your data are structured. The basic idea is to leave the data as is for the cells with the fewest observations, and to randomly sample down to that N from the cells with more observations. With such a large dataset it is probably impossible to permute all the possible cases, but you could run hundreds or thousands of iterations and get an idea of how stable the results are.
While doing this relieves you of worrying about the effects of unbalanced data, by populating most cells with fewer observations than you actually have, you are obviously losing information. How much information is lost depends on the structure of the data.
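A rough illustration of that resampling idea, assuming a data frame dat with two factors A and B and a response y (all placeholder names): each cell is downsampled to the size of the smallest cell, the model is refit, and the spread of the test statistics across iterations shows how stable the results are.

```r
## Sketch only: `dat`, the factors A and B, and the response y are placeholders.
set.seed(1)

cells <- split(dat, interaction(dat$A, dat$B, drop = TRUE))
n_min <- min(sapply(cells, nrow))              # size of the smallest cell

## Repeatedly downsample every cell to n_min and refit the model
f_stats <- replicate(1000, {
  balanced <- do.call(rbind, lapply(cells, function(d)
    d[sample(nrow(d), n_min), , drop = FALSE]))
  fit <- aov(y ~ A * B, data = balanced)
  setNames(summary(fit)[[1]][["F value"]][1:3], c("A", "B", "A:B"))
})

## Spread of the F statistics across resamples indicates how stable they are
apply(f_stats, 1, quantile, probs = c(0.025, 0.5, 0.975))
```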
Even though this is an older question, I would like to add that the aov function is intended for balanced designs (the results can be hard to interpret without balance).
Please see the "Note" here: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/aov
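A small simulated demonstration of why that note matters: with unbalanced (non-proportional) cell counts, the sequential (Type I) sums of squares reported by aov/anova change with the order of the terms, which is part of what makes the output hard to interpret. The data below are made up purely for illustration.

```r
## Simulated data just for illustration: with unbalanced cell counts the
## sequential sums of squares depend on the order in which terms enter.
set.seed(42)
dat <- data.frame(
  A = factor(rep(c("a1", "a1", "a2", "a2"), times = c(20, 5, 5, 10))),
  B = factor(rep(c("b1", "b2", "b1", "b2"), times = c(20, 5, 5, 10)))
)
dat$y <- rnorm(40) + (dat$A == "a2") + 0.5 * (dat$B == "b2")

anova(lm(y ~ A + B, data = dat))   # A entered first
anova(lm(y ~ B + A, data = dat))   # B entered first: different SS for A and B
```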