Hello everyone! I want to calculate the correlation between two variables, but their sample sizes are different: N1=45 and N2=20. Does anyone know how to solve this problem? Thanks!
Correlation analysis should be done for two variables measured on the same sample. If you conduct a survey on two variables, each respondent should answer both of them, so that every response on one variable can be paired with the same respondent's response on the other. It is not possible to run a correlation analysis across two different samples.
It is only possible provided that
* the sample size is the same for both variables
* the respondents are the same for both variables, listed in the same serial order
Hope this will help you understand the requirements of implementing correlation analysis. Good luck in your studies.
One straightforward way to do it is to just ignore missing values and use the samples for which you have both.
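A minimal sketch of that "use only the cases where you have both" idea, in plain Python so it runs without extra packages. It assumes each measurement carries a unit ID so the two variables can be matched; the names `pearson`, `complete_case_correlation`, `x_by_id` and `y_by_id` and the numbers are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of paired values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def complete_case_correlation(x_by_id, y_by_id):
    """Correlate only the units that have a value for both variables."""
    shared = sorted(set(x_by_id) & set(y_by_id))
    xs = [x_by_id[i] for i in shared]
    ys = [y_by_id[i] for i in shared]
    return pearson(xs, ys), len(shared)

# Made-up illustration: x measured on 5 units, y on only 3 of them.
x_by_id = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
y_by_id = {"a": 2.0, "c": 6.0, "e": 10.0}

r, n_used = complete_case_correlation(x_by_id, y_by_id)
print(f"r = {r:.3f} from {n_used} complete pairs")
```

Reporting `n_used` next to r makes it clear how many of the original observations actually went into the estimate.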
I would take it one step further: try out different imputation methods (kNN, MICE ...).
Since (linear) correlation can be seen as a linear regression model, I would use the estimated regression line to impute the missing data, then recompute the correlation, let's say 5000 times. This would help you estimate the uncertainty due to the missingness.
In these cases it is always better to use multiple methods and compare the results.
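A rough sketch of the regression-line imputation idea, under a few assumptions of mine: the shorter variable (y) is missing for some units that do have x, the line is fitted on the complete pairs, and each imputed y gets random noise with the residual standard deviation so that repeated runs actually differ. It does not redraw the slope and intercept themselves, so it probably understates the uncertainty compared with proper multiple imputation. All names and numbers are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of paired values."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, math.sqrt(rss / (n - 2))

def imputed_correlations(pairs, x_only, n_draws=5000, seed=0):
    """Impute missing y as line + noise, n_draws times; return every r."""
    rng = random.Random(seed)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    a, b, sd = fit_line(xs, ys)
    all_x = xs + list(x_only)
    results = []
    for _ in range(n_draws):
        y_imputed = [a + b * x + rng.gauss(0.0, sd) for x in x_only]
        results.append(pearson(all_x, ys + y_imputed))
    return results

# Made-up data: 6 complete pairs, 4 units with x only.
pairs = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.4), (4.0, 3.8), (5.0, 5.1), (6.0, 6.3)]
x_only = [1.5, 2.5, 4.5, 5.5]

rs = sorted(imputed_correlations(pairs, x_only, n_draws=200))
print(f"r ranges from {rs[0]:.3f} to {rs[-1]:.3f} over {len(rs)} imputations")
```

The spread of `rs` gives a feel for how much the answer depends on the imputed values. If you impute without the noise term, every run gives the same r, and it will tend to be too high.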
I think Stefano Nembrini is suggesting imputation because he is assuming that you have lots of other variables. Is this true, Vince Ablaza? I would only use this if the imputation was predicting values with high certainty.
And as with others, this is assuming the data are from one sample just with missing values. If you have two completely separate samples, then please define what you mean by correlation.
I take for granted that there's some overlap between the two.
I guess leaving out one sample at a time makes sense. Data can then be generated using the estimate and the prediction error, to get at least some idea of the variability of the underlying sampling distribution of the correlation.
Given the sample size, I would trust the bootstrap distribution over a normal approximation.
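For the bootstrap part, something along these lines could work. It is only a sketch: it resamples the complete (x, y) pairs with replacement and takes a simple percentile interval, which is one of several bootstrap intervals you could use. The function names, the number of resamples and the data are my own choices for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of paired values."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_correlation_interval(pairs, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for r, resampling whole (x, y) pairs."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = []
    while len(stats) < n_boot:
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        xs = [x for x, _ in resample]
        ys = [y for _, y in resample]
        # Skip degenerate resamples where one variable is constant,
        # because r is undefined there.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        stats.append(pearson(xs, ys))
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

# Made-up paired data standing in for the overlapping cases.
pairs = [(1.0, 2.1), (2.0, 2.9), (3.0, 3.2), (4.0, 5.1), (5.0, 4.8),
         (6.0, 6.5), (7.0, 6.1), (8.0, 8.3), (9.0, 8.8), (10.0, 9.4)]

low, high = bootstrap_correlation_interval(pairs, n_boot=500)
print(f"95% percentile interval for r: [{low:.3f}, {high:.3f}]")
```

Resampling whole pairs (rather than x and y separately) is what keeps the link between the two variables intact in each bootstrap sample.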
It would help to know a bit more about the problem. I need to make assumptions in order to develop an answer.
Say you grew wheat. At the grain filling stage you picked some heads and analyzed them for three enzymes and four nutrients. These tests are destructive, but on neighboring plants you let the heads mature and gathered yield data. Within each plot you could have taken subsamples, then averaged the subsamples to remove some of the within-plot variability, and paired the plot averages to run a correlation. Just document how this was done in the methods.
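To make the plot-averaging idea concrete, here is a small sketch. It assumes every measurement is recorded with a plot ID, and that the two variables may have different numbers of subsamples per plot. The plot IDs, values and function names are invented for the example.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of paired values."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def plot_means(records):
    """Average the subsample values within each plot.

    records is an iterable of (plot_id, value) tuples.
    """
    by_plot = defaultdict(list)
    for plot_id, value in records:
        by_plot[plot_id].append(value)
    return {p: sum(v) / len(v) for p, v in by_plot.items()}

# Made-up subsamples: 6 enzyme readings but only 4 yield readings.
enzyme = [("p1", 2.0), ("p1", 4.0), ("p2", 5.0), ("p2", 7.0),
          ("p3", 9.0), ("p3", 9.0)]
grain_yield = [("p1", 30.0), ("p2", 60.0), ("p2", 60.0), ("p3", 90.0)]

enzyme_mean = plot_means(enzyme)
yield_mean = plot_means(grain_yield)

# Pair the plot averages on the plots present in both data sets.
plots = sorted(set(enzyme_mean) & set(yield_mean))
r = pearson([enzyme_mean[p] for p in plots], [yield_mean[p] for p in plots])
print(f"r = {r:.3f} across {len(plots)} plots")
```

Note that the sample size for the correlation is now the number of plots, not the number of subsamples.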
If the samples were measured on the same experimental units but there are missing values, the simplest approach is to remove the experimental units that have missing values. The problem is that if these missing experimental units have something in common then you have biased your results. You could try imputation if you have enough other information, but there are problems with imputing over half of your data. Here too there is a problem if all of your imputed values are from one subpopulation.
Just adding to the discussion: if these are two different sets of (I will assume) people, are they matched so that each person in the n=20 group has, say, 2 or 3 people they are matched with in the n=45 group? If so, say something about the matching.
To calculate the correlation between two variables you need a set of pairs. Your data must be in the joint format: (x1,y1), (x2,y2), ... , (xn,yn). I guess that you are looking for another concept. Good luck. Babak Jamshidi
I think there are various ways of tackling this kind of problem. Unequal numbers of observations in two or more variables can be handled with generalized imputation, similar-case imputation, a prediction model, etc., to make the sample sizes equal before conducting the correlation analysis.