How to do correlation analysis with two variables that have different sample sizes?

Dear Vince Ablaza,

Correlation analysis should be done on two variables measured in the same sample. If you conduct a survey covering two variables, each respondent should answer both of them at the same time. Only such paired responses from the same respondents can be used for correlation analysis. It is not possible to run a correlation analysis across two different samples.

It is possible only if

* both samples have the same size

* the respondents are the same for both variables, listed in the same serial order in both samples

Hope this will help you understand the requirements of implementing correlation analysis. Good luck in your studies.

Regards,

S. Senthilnathan

Stefano Nembrini

One straightforward way to do it is to ignore the missing values and use only the observations for which you have both variables.

I would take it one step further: try out different imputation methods (kNN, MICE ...)

Since (linear) correlation can be seen as a linear regression model, I would use the estimated regression line to impute the missing data, then recompute the correlation, let's say 5000 times. This would help you estimate the uncertainty due to the missingness.

In these cases it is always better to use multiple methods and compare the results.

Daniel Wright

I think Stefano Nembrini is suggesting imputation because he is assuming that you have lots of other variables. Is this true, Vince Ablaza? I would only use this if the imputation predicted values with high certainty.

And as with others, this is assuming the data are from one sample just with missing values. If you have two completely separate samples, then please define what you mean by correlation.

Stefano Nembrini

I take for granted that there's some overlap between the two.

I guess that leaving out one sample at a time makes sense. Then data can be generated using the estimate and the prediction error, to get at least some idea of the variability of the underlying sampling distribution of the correlation.

Given the sample size, I would trust the bootstrap distribution over a normal one.

Timothy A Ebert

It would help to know a bit more about the problem. I need to make assumptions in order to develop an answer.

Say you grew wheat. At the grain filling stage you picked some heads and analyzed them for three enzymes and four nutrients. These tests are destructive, but on neighboring plants you let the heads mature and gathered yield data. Within each plot you could have taken subsamples, then averaged the subsamples to remove some of the within-plot variability and paired the plot averages to run a correlation. Just document how this was done in the methods.

If the samples were measured on the same experimental units but there are missing values, the simplest approach is to remove the experimental units that have missing values. The problem is that if these missing experimental units have something in common then you have biased your results. You could try imputation if you have enough other information, but there are problems with imputing over half of your data. Here too there is a problem if all of your imputed values are from one subpopulation.

Daniel Wright

Just adding to the discussion, if these are two different sets of (I will assume) people, are they matched so that each person in the n=20 group has say 2 or 3 people they are matched with in the n=45 group? If so, say something about the matching.

Babak Jamshidi

You need a set of paired observations on the two variables if you want to calculate their correlation. Your data must be in the joint format: (x1,y1), (x2,y2), ... , (xn,yn). I guess that you are looking for another concept. Good luck. Babak Jamshidi

G. Paulraj

Yes, what Stefano Nembrini said is right. Instead ... you may go for regression to assess the extent to which one variable is associated with another.

Nofiu Idowu Badmus

I think there are various ways of tackling this kind of problem. The problem of an unequal number of observations in two or more variables can be solved by using generalized imputation, similar case imputation, a prediction model, etc. to make the sample sizes equal before conducting correlation analysis.
