The standard functions in the stats package deal with unbalanced data (the functions lm, glm and aov).
If you do not have enough memory to run these on your computer (e.g. because you have many millions of observations), there is the biglm function in the biglm package. I think these functions do not deal well with unbalanced datasets.
Thanks Frank for your helpful answer. What do you mean by "I think these do not deal well with unbalanced datasets"? Do you mean that lm, glm and aov cannot fit a huge dataset? I indeed have a big dataset.
OK, a more fundamental explanation. When you do an ANOVA (or a similar analysis) by hand, you use simple, computationally undemanding equations that do not correct for unbalanced data. Equations that do correct for unbalanced data are computationally more intensive, and those are what the base functions (aov, lm and glm) in R implement.
R also has a second set of equations, implemented in the biglm package (the biglm function), that can deal with huge datasets. I suspect that these assume balanced data.
So, first test whether you are able to estimate your models with the basic functions (lm, aov or glm). If the functions fail due to insufficient memory, then do some reading on how to minimize memory use, and you might want to shift to the biglm function. I am not sure what the implications are if you use it with unbalanced data.
If your data are strongly unbalanced, it might be worthwhile to do some serious reading about the possible implications of unbalanced data.
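In case it helps, here is a minimal sketch of how biglm is typically used when the data do not fit in memory: the model is initialised on a first chunk and then updated chunk by chunk. The file name "big_data.csv", the chunk size and the variable names y, x1, x2 are only placeholders for your own data.

```r
## Sketch only: file name, chunk size and variable names are placeholders.
library(biglm)

ff  <- y ~ x1 + x2                        # model formula
con <- file("big_data.csv", open = "r")

## First chunk (with header) initialises the model
chunk <- read.csv(con, nrows = 100000)
cols  <- names(chunk)
fit   <- biglm(ff, data = chunk)

## Remaining chunks update the fit without holding all rows in memory
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 100000, header = FALSE, col.names = cols),
    error = function(e) NULL)             # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)
}
close(con)

summary(fit)
```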
Frank, thank you so much for your time and effort; you are highly appreciated. My plan is to use the lm command. I have data from 1996 to 2011 with different numbers of locations (quite a few; in some years there are 40), replications and genotypes. I am going to analyze each year and use BLUPs to correct the data with the model Location + Genotype + L:G. Then I will combine all years and use lm with the model Y + G + Y:G.
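A rough sketch of how those two model formulas could be written in R, assuming a data frame dat with columns Year, Location, Genotype and Yield (these names, and the use of lme4 for the BLUP stage, are my assumptions, not part of the original plan). I am not sure exactly how you intend to carry the per-year BLUP adjustment into the combined step, so the second stage below simply shows the Y + G + Y:G formula applied to the pooled data.

```r
## Sketch only: `dat` and its column names are placeholders for your own data.
library(lme4)

dat$Year <- factor(dat$Year)

## Stage 1: per-year model of the form Location + Genotype + L:G, with the
## terms treated as random so that ranef() returns the genotype BLUPs
d1996 <- subset(dat, Year == "1996")
m1996 <- lmer(Yield ~ (1 | Location) + (1 | Genotype) + (1 | Location:Genotype),
              data = d1996)
ranef(m1996)$Genotype             # genotype BLUPs for 1996

## Stage 2: combined analysis across years with the model Y + G + Y:G
fit_all <- lm(Yield ~ Year + Genotype + Year:Genotype, data = dat)
anova(fit_all)
```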
Another option is to coerce the data into a balanced design via resampling and running multiple tests; whether this is feasible or helpful will depend on how your data are structured. The basic idea is to leave the data as is for the cells with the fewest observations, and to randomly sample down to that N from the cells with more observations. With such a large dataset it is probably impossible to permute all the possible cases, but you could run hundreds or thousands of iterations and get an idea of how stable the results are.
While doing this relieves you of worrying about the effects of unbalanced data, by populating most cells with fewer observations than you actually have, you are obviously losing information. How much information is lost depends on the structure of the data.
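A rough illustration of that resampling idea, assuming a data frame dat with two factors A and B and a response y (all placeholder names): each cell is downsampled to the size of the smallest cell, the model is refit, and the spread of the test statistics across iterations shows how stable the results are.

```r
## Sketch only: `dat`, the factors A and B, and the response y are placeholders.
set.seed(1)

cells <- split(dat, interaction(dat$A, dat$B, drop = TRUE))
n_min <- min(sapply(cells, nrow))              # size of the smallest cell

## Repeatedly downsample every cell to n_min and refit the model
f_stats <- replicate(1000, {
  balanced <- do.call(rbind, lapply(cells, function(d)
    d[sample(nrow(d), n_min), , drop = FALSE]))
  fit <- aov(y ~ A * B, data = balanced)
  setNames(summary(fit)[[1]][["F value"]][1:3], c("A", "B", "A:B"))
})

## Spread of the F statistics across resamples indicates how stable they are
apply(f_stats, 1, quantile, probs = c(0.025, 0.5, 0.975))
```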
Even though this is an older question, I would like to add that the aov function is intended for balanced designs (the results can be hard to interpret without balance).
Please see the "Note" here: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/aov
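A small simulated demonstration of why that note matters: with unbalanced (non-proportional) cell counts, the sequential (Type I) sums of squares reported by aov/anova change with the order of the terms, which is part of what makes the output hard to interpret. The data below are made up purely for illustration.

```r
## Simulated data just for illustration: with unbalanced cell counts the
## sequential sums of squares depend on the order in which terms enter.
set.seed(42)
dat <- data.frame(
  A = factor(rep(c("a1", "a1", "a2", "a2"), times = c(20, 5, 5, 10))),
  B = factor(rep(c("b1", "b2", "b1", "b2"), times = c(20, 5, 5, 10)))
)
dat$y <- rnorm(40) + (dat$A == "a2") + 0.5 * (dat$B == "b2")

anova(lm(y ~ A + B, data = dat))   # A entered first
anova(lm(y ~ B + A, data = dat))   # B entered first: different SS for A and B
```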