I'm performing a correlational study of two time series of data in order to identify positive or negative correlations between them. Which correlation coefficient is better to use: Spearman or Pearson?
The Pearson correlation coefficient is the most widely used. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, it may be more appropriate to use the Spearman rank correlation method.
There is a very interesting paper about the differences between these two correlation coefficients on the same sets of data:
http://geoinfo.amu.edu.pl/qg/archives/2011/QG302_087-093.pdf
Hope it will be useful!
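To make the distinction concrete, here is a minimal Python sketch (using NumPy, on illustrative made-up data): for a perfectly monotonic but nonlinear relationship, Spearman's coefficient is exactly 1 while Pearson's falls noticeably short.

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # Spearman's rho is just Pearson's r computed on the ranks
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

x = np.linspace(0.1, 5.0, 100)
y = np.exp(x)                  # perfectly monotonic, but strongly nonlinear in x

r_pearson = pearson(x, y)      # noticeably below 1: the relation is not linear
r_spearman = spearman(x, y)    # exactly 1: the relation is perfectly monotonic
```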
It depends on what kind of data you have. If your data are on a ratio scale, use Pearson's CC; if ordinal, Spearman's CC. The page http://en.wikipedia.org/wiki/Statistical_data_type will be your guide. Spearman's CC is the same as Pearson's, but calculated on ranks instead of the actual data values, so Spearman's CC can also identify both positive and negative correlations.
I agree with Oleg Devinyak, but don't forget another parameter you should keep in mind in order to identify the optimal and correct (for your type of data) result. You should test the distribution of your data, i.e. whether or not they follow a normal distribution (in the first case you should use a parametric test, in the second a non-parametric test). Pearson's is a parametric test, whereas Spearman's is a non-parametric test that assesses how well the relationship between two variables can be described by a monotonic function, and it is flexible enough for your data. If you want to test the distribution of your data, you could apply a normality test such as Kolmogorov-Smirnov; it's easy and quick, providing a direct answer to your question (normal or not).
I agree with Dimitra. Pearson's CC relies on the assumption of bivariate normality for the two variables, while no distributional assumptions are required for Spearman's CC. The main purpose of running a coefficient of correlation is to provide a value that describes the strength of the relationship between two variables, and it may be negative or positive.
Pearson's CC measures the strength of the linear relationship between two variables and does not rely on any assumptions to be correct (as, generally, is true of most other statistics). It is the p-value for the statistical significance of the obtained Pearson CC that is determined under an assumption of normality. The original question asked about the correlation coefficient (its value), not about its significance.
Just some additional points.
The Pearson correlation coefficient is most appropriate for measurements taken from an interval scale,
while the Spearman correlation coefficient is more appropriate for measurements taken from ordinal scales.
Examples of interval scales include "weight in kg" and "height in inches", in which the individual units (1 kg, 1 inch) are meaningful.
Things like "stress scores" tend to be of the ordinal type, since while it is clear that a stress score of 5 indicates more stress than a stress score of 3, the question is how you interpret, say, a stress score of 1.
But when you add up many measurements of the ordinal type, you end up with a measurement which is really neither ordinal nor interval, and is difficult to interpret.
Everything starts with the type of the data you have.
Since Pearson's CC takes into consideration more information (not only the ranks, but the proportionality between ranks), it would be better if your data can meet, at least apparently, the assumptions. If not, Spearman's CC should be used, even if the data are on an interval scale.
But remember to take care with Pearson's CC pitfalls. For instance, if you increase the total variance, you increase the r-value artificially. Try to split your ordered data, examine the r-value for each split, and then compare with the r-value from the full data, and you will see it. Look for the work of Bland and Altman (1983 and 1986) on limits of agreement; there, they discuss some of these pitfalls.
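The variance pitfall above can be demonstrated with simulated data (a hypothetical example, not from any real study): the same linear relationship with the same noise gives a larger r over the full range of x than over either half of the range.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 400))        # ordered predictor over a wide range
y = x + rng.normal(0, 2.0, 400)             # one linear trend, same noise throughout

r_full = np.corrcoef(x, y)[0, 1]            # wide range -> large total variance in x
r_lo = np.corrcoef(x[:200], y[:200])[0, 1]  # lower half of the range only
r_hi = np.corrcoef(x[200:], y[200:])[0, 1]  # upper half of the range only
# r_full exceeds both split r-values, although the underlying relation is identical
```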
Thanks for the answers, folks! I appreciate all of them... In order to advance the discussion, I would like to exemplify with the case in which I am interested: comparing the number of sunspots during several solar cycles (11-year cycles) and the number of solar flares during this same time interval (these phenomena are expected to present in-phase oscillations). I start my analysis by fitting to both datasets the sum of a third-order polynomial (necessary to fit secular trends) and a sinusoidal function (necessary to fit periodic oscillations). After this, I subtract from the two datasets the polynomial parts previously fitted (secular trends), leaving only their oscillatory features. Then, I compare the two datasets by calculating the correlation coefficients between them. As in this example the oscillations are approximately in phase, the obtained correlation coefficient is expected to be near +1. However, when I perform these same procedures comparing sunspot data with other geomagnetic data (which may present a counter-phase relationship with solar cycles), negative values of the correlation coefficients are expected to be found. I have noticed in my analyses that Spearman seems to work better, being considerably less sensitive to outliers...
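That detrending procedure can be sketched roughly as follows (synthetic stand-in data, and a plain cubic fit only, without the sinusoidal term of the full procedure): two in-phase oscillations riding on different secular trends correlate near +1 once the fitted cubic is subtracted.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 44.0, 528)            # four hypothetical 11-year cycles

# two series with in-phase oscillations on top of different secular trends
s1 = 0.02 * t**2 + 50 + 10 * np.sin(2 * np.pi * t / 11) + rng.normal(0, 1, t.size)
s2 = -0.5 * t + 80 + 7 * np.sin(2 * np.pi * t / 11) + rng.normal(0, 1, t.size)

def detrend(series, t, deg=3):
    # subtract a fitted third-order polynomial, keeping the oscillatory residual
    return series - np.polyval(np.polyfit(t, series, deg), t)

r = np.corrcoef(detrend(s1, t), detrend(s2, t))[0, 1]   # close to +1 (in phase)
```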
Are you really sure that your data demand a LINEAR correlation? Take extra care in these situations when working with dependent data (since your data seem to follow a cyclic oscillation). See this interesting classical example, Anscombe's quartet (http://pt.wikipedia.org/wiki/Quarteto_de_Anscombe), where four different data sets show the same correlation coefficient, but...
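Anscombe's quartet is easy to verify numerically; the values below are the published data sets, and all four give a Pearson r of about 0.816 despite looking completely different when plotted.

```python
import numpy as np

# Anscombe's quartet: four very different (x, y) data sets, one shared x for I-III
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

rs = [np.corrcoef(x, y)[0, 1] for x, y in quartet]   # all four are about 0.816
```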
Good remark, Sandro! It is an interesting link you posted... Actually, this is the first time in my research career that I need to correlate two oscillating (periodic) time series. Up to now, all the other cases I studied were between datasets presenting linear or monotonic increases or decreases. Another similar situation can be seen in my recently published article, titled "Suicide seasonality: Evidence of 11-year cyclic oscillations in Brazilian suicide rates" (see Fig. 3 of my article, available here on RG). The preliminary analysis showed that the cyclic oscillations observed in male suicide rates in Brazil are apparently in counter-phase with sunspot cycles. However, at that stage of my research, I only performed a visual comparison of the datasets...
hello friend...
I have gone through all the answers and I somewhat agree with them, but I will expand on Katuli's answer.
As we all know, "the efficiency of the results of any experiment depends on the choice of an appropriate measure". Pearson's CC is generally used when we are dealing with quantitative data that can be measured physically, like height, weight, pressure, etc. On the other hand, Spearman's rank CC is used when we are dealing with qualitative data that cannot be measured quantitatively but can be ordered or ranked, like intelligence, health, etc. So it merely depends upon the problem under study which measure you are going to use, and which will be better for that particular problem depends upon the behavior of the problem itself.
regards...
Given your description, Walter, I wonder if fitting B-splines to each oscillation, along with 95% confidence intervals, might not be better. You could then depict the comparison of the two oscillations graphically, with X = time and Y = number of events.
Sandro's right, a linear estimate of association might not be very useful to you.
Sandro and Mary are right; maybe a non-linear model would be preferable for your data.
This is something you should test first! I am just giving an example through the following link. It's non-linear canonical correlation analysis, which aims to determine how similar two or more sets of variables are to one another. It can be applied in various software packages (in my opinion the best choices are MATLAB or R, but the easiest and quickest way might be with more "commercial" software such as SPSS, Stata or Statistica).
http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.cs%2Foverals_table.htm
Of course, there is a range of non-linear methods that may suit your data better than canonical analysis, e.g. non-linear time series analysis if you wish to do something more with your data, for instance detect noise, dimensionality, or chaos... this would be interesting too as a next step.
Thank you for the advice and suggestions, Dimitra, Mary and Savitri... Despite the results of the Spearman CC being more interesting at first sight, I agree that the Pearson CC is the correct choice to use in my analyses. With respect to the linearity (or not) between the studied datasets, it is necessary to remark that the data do not present a linear relationship with respect to time (they independently follow approximately sinusoidal functions of time), but they may present a linear relationship between them. For instance, supposing the function X = A * sin(wt) and another function Y = B * sin(wt + k) (both with the same frequency w), we note that if k = pi*n (n = 0, 2, 4, ...), Y will vary linearly with X and will be positively correlated (in-phase related) with this variable, that is, Y = (B/A) * X. On the other hand, if k = pi*n (n = 1, 3, 5, ...), then Y will be linearly and negatively correlated (counter-phase related) with X, that is, Y = (-B/A) * X. For all other possible values of k, the dependence between Y and X is not linear. In a case like this example, is it possible to use the Pearson CC to quantify in-phase and counter-phase relationships between X and Y?
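For sinusoids of equal frequency sampled uniformly over whole periods, Pearson's r works out to cos(k), so it does quantify in-phase (r = +1) and counter-phase (r = -1) relationships, with intermediate phase lags giving intermediate values. A quick numerical check (arbitrary amplitudes A and B, which cancel out):

```python
import numpy as np

t = np.linspace(0.0, 22.0, 2201)[:-1]   # two full 11-year periods, uniform sampling
w = 2 * np.pi / 11
A, B = 3.0, 5.0

rs = []
for k in [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4, np.pi]:
    x = A * np.sin(w * t)               # X = A sin(wt)
    y = B * np.sin(w * t + k)           # Y = B sin(wt + k)
    rs.append(np.corrcoef(x, y)[0, 1])
# over whole periods r equals cos(k): +1 in phase, 0 in quadrature, -1 in counter-phase
```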
As far as I can see, your data are not independent samples, but rather a time series of two parameters (suicide rates and sunspot occurrence). In that case, be careful when simply correlating the data, since both tests also assume the data points are independent, random samples. Also, there might be confounding factors which you don't consider in your simple correlation analyses. In that case the correlation might indicate a relationship between the two, when in fact they are linked by a third, confounding variable. I would at least recommend looking at possibilities to incorporate the time dependence and confounding factors in your analysis. For time series and oscillating relationships see e.g. Chapter 17 in Alain F. Zuur's book "Analysing Ecological Data". Sorry... I don't have a fast and simple answer ;)
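The warning about dependent observations can be illustrated with the classic spurious-correlation experiment (simulated data; the exact fractions depend on the seed and series length): two independent random walks frequently show a large sample |r|, while independent white-noise series of the same length almost never do.

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 200, 500

def frac_large_r(make_series, threshold=0.5):
    # fraction of independently generated pairs whose sample |r| exceeds threshold
    hits = 0
    for _ in range(trials):
        if abs(np.corrcoef(make_series(), make_series())[0, 1]) > threshold:
            hits += 1
    return hits / trials

frac_iid = frac_large_r(lambda: rng.normal(size=n))              # essentially zero
frac_walk = frac_large_r(lambda: np.cumsum(rng.normal(size=n)))  # substantial
```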
Ok Christian,
thank you for answering my question.
About confounding factors and the possibility of a third variable linking these two phenomena, I agree with you... In reality, several other authors who are also studying these issues have searched for a possible connection involving the production of the neurohormone melatonin, which supposedly (at this moment still an unproven hypothesis) would act as the linking variable...
Thank you for the reference suggestion. I will try to find the Zuur book you quoted...
Dear Walter. As an ecologist, I would also think of another variable in nature that is linked to solar activity (e.g. temperatures, light intensities, etc.) and which then affects suicide rates. It would be interesting to see, then, whether in another system the same variable has previously been linked to suicide incidence.
I think the question should not be which is better, but which is more applicable to your type of data. Spearman's correlation coefficient is a standardized measure of the strength of the relationship between two variables that does not rely on the assumptions
of a parametric test. It is the equivalent of Pearson's correlation coefficient performed on data that have been converted into ranked scores.
Conversion of the data into ranked scores may actually make your analysis more sensitive for detecting significant relationships than if the data were not ranked.
In fact, as soon as we prepare a frequency distribution table, we cross over from cardinal numbers to ordinal numbers; much statistical analysis is effectively performed on ordinal data, and even Pearson's method operates on grouped data once we construct a frequency distribution table.
Spearman's method is used for ordinal (ranked) data. If the data are already in ranked form, we have to use Spearman's method anyway.
However, if the data are on an interval scale, it is better to go for Pearson's method. This is what I feel.
I would like to add one point: the hypothesis that the relationship is linear must first be statistically non-rejectable. If the relationship is clearly nonlinear, then computing the (Pearson) correlation coefficient is meaningless.
Spearman's is a rank correlation coefficient. Pearson's uses the raw (original) data and is therefore a parametric correlation coefficient.
Always compute them both and compare them; if they are not nearly the same, examine the scatterplots to determine which is the more reliable in each case. You could do the scatterplots first and then choose Spearman vs Pearson, but it is faster to calculate all of the correlations and examine the cases where they differ.
Bad reasoning, Matthew. Spearman correlation is only for data in ranks, and is largely used when the normality assumptions do not hold in the data. Pearson's r is not to be used when your data are in ranks. Read a statistics book!!
Lots of data that appear to be higher than ranks are merely ordinal: body temp as a measure of disease; blood pressure as a measure of health, and so on. And they often have non-gaussian distributions, so p-values that assume normality are incorrect. The way to determine whether the inaccuracy of an assumption matters (linearity, normality) is to do an analysis that assumes them, and compare to an analysis that does not assume them. If the analyses agree, then the departures from the assumptions are not severe enough to affect conclusions. All medical data, and probably all data, do not conform exactly to the usual assumptions, and the only way to determine whether the degree of departure affects the error rate in conclusions is to perform multiple analyses with different errors of approximation.
By coincidence, I just finished reading a statistics book. When it's published you'll be able to read it too.
The theoretical properties of either Pearson's or Spearman's (rank) correlation are derived under the assumption that the data are a random sample. You've acknowledged that the data are time series observations. There is no reason whatever to believe that the observations are independent (or even exchangeable). In fact, in the case of the sunspot series (at least) we know the observations are not independent.
So, shorter answer: you need to use time series methods. As it happens, these are based on Pearson correlations and cross-correlations.
Yep, simple distinction. Pearson's is the commonly used one; however, if the data are only ordinal for at least one of X or Y, then Spearman's correlation is appropriate. Pearson's is parametric, Spearman's non-parametric.
Pearson's measures the linear relationship; Spearman's measures a general (monotonic) relationship, which includes the linear one.
We use Pearson's because the linear model is the simplest empirical model, but in almost all cases we need only the relationship (with the exception of match-up analysis and the q-q normality test), so Spearman's could be the most usable.
Please note that it is not only a question of scale (interval for Pearson's, rank for Spearman's).
You can apply both coefficients to interval data, and in general Spearman's will come out higher (not in every case, but in general). To say that Pearson's is only for interval scales is to say that such data can only be analyzed with parametric statistics; the interval scale and normality are just the two conditions for applying that kind of statistics.
Note that if F is the distribution function of X, and G is the distribution function of Y, then (provided F and G are continuous, increasing functions) both F(X) and G(Y) are uniform on [0,1], and Spearman(X,Y) = Pearson(F(X),G(Y)). In particular this means that the Spearman correlation measures the dependence contained in the copula of the joint distribution of X and Y, while the Pearson correlation is "contaminated" by the marginals.
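An empirical check of this identity, using normalized ranks as the empirical version of F and G (illustrative simulated data): Pearson's r computed on the empirical F(X), G(Y) coincides with Spearman's rho, since correlation is invariant under linear rescaling of the ranks.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)           # dependent pair for illustration

def ranks(v):
    return np.argsort(np.argsort(v)).astype(float)

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

u, v = ranks(x) / len(x), ranks(y) / len(y)   # empirical analogue of F(X), G(Y)

rho_spearman = pearson(ranks(x), ranks(y))    # Spearman's rho: Pearson on the ranks
rho_copula = pearson(u, v)                    # Pearson on the empirical F(X), G(Y)
# identical, because rank -> rank/n is just a linear rescaling
```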
Pearson is used for parametric data (e.g. large sample size, at least 30 subjects in each group, normal distribution of the data, etc.);
Spearman is used for non-parametric data (as it ranks the data).
You have to read about the differences in detail (I agree with Abdulvahed Khaledi Darvisha); you should read this paper:
http://geoinfo.amu.edu.pl/qg/archives/2011/QG302_087-093.pdf
Spearman's correlation is just Pearson's for ranks. Pearson's should not be used for non-normal distributions, especially for negatively correlated skewed distributions. For example, for negatively dependent exponentially distributed variables, the minimal value of Pearson's correlation is 1 - pi^2/6 = -0.644934. In the case of the Weibull distribution with shape parameter lower than 1, this minimal value is even greater (even close to zero!).
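The 1 - pi^2/6 bound can be checked by Monte Carlo, generating perfectly negatively dependent Exp(1) pairs from a shared uniform (a sketch with an arbitrary seed): even under perfect negative dependence, Pearson's r cannot reach -1 for these marginals.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=500_000)
x = -np.log(u)        # Exp(1)
y = -np.log1p(-u)     # Exp(1), countermonotonic (perfectly negatively dependent) with x

r = np.corrcoef(x, y)[0, 1]    # Monte Carlo estimate of the most negative attainable r
bound = 1 - np.pi ** 2 / 6     # theoretical minimum, about -0.6449
```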
Pearson benchmarks linear relationship, Spearman benchmarks monotonic relationship.
The Pearson correlation method measures the strength of the linear relationship between normally distributed variables. This is appropriate most of the time for financial returns data. When the variables are not normally distributed, or the relationship between the variables is not linear, it may be more appropriate to use the Spearman rank correlation method. The Spearman rank correlation method makes no assumptions about the distribution of the data. It may therefore be more appropriate for data with large outliers that hide meaningful relationships between series, or for series that are not normally distributed.
A Pearson’s correlation is the same as a standardized regression coefficient. It is used to determine the linear relationship between two variables which are normally distributed. Pearson’s correlation can be strongly affected by extreme scores or outliers. Consequently, if the scores are not normally distributed, the scores can be ranked and a Pearson’s correlation carried out on these ranked scores. This type of correlation is known as Spearman’s rank order correlation coefficient.
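A small simulated illustration of that outlier sensitivity (made-up data): one discordant extreme point wrecks Pearson's r, while Spearman's rho, computed on ranks, barely moves.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(0, 0.6, 50)   # a clean linear relationship

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

r_clean, s_clean = pearson(x, y), spearman(x, y)

x_out = np.append(x, 10.0)             # one wild, discordant outlier
y_out = np.append(y, -10.0)
r_out, s_out = pearson(x_out, y_out), spearman(x_out, y_out)
# Pearson's r collapses (it may even change sign); Spearman's rho barely moves
```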
Pearson's correlation is bad even for linear relationships. Let X be a Cauchy random variable and let Y = a + bX, with a and b given constants. X and Y have a perfect linear relationship, and yet it is not possible to calculate Pearson's coefficient. But even if second moments exist, Pearson is a bad choice, since it is not a copula-based measure. As a consequence of Sklar's theorem (1959), all the information about the dependence between continuous random variables is in the underlying copula, not in the marginal distributions. It is easy to prove that you may keep the underlying copula unchanged (and so the dependence unchanged) and, just by changing one of the marginal distributions, Pearson's coefficient changes. For example, take (X,Y) a random vector of positive continuous random variables; it is easy to check that the underlying copula for (X,Y) is the same as for (log X, log Y) (so the dependence doesn't change), but certainly corr(X,Y) is not equal to corr(log X, log Y). Check the following reference:
Embrechts, P., McNeil, A.J., Straumann, D. (1999). Correlation: pitfalls and alternatives. Risk Magazine 5, 69-71.
So forget about Pearson. Spearman is a copula-based measure, but it still shares a flaw with Pearson: if Spearman (or Pearson) equals zero, this does not necessarily imply independence. It is better to use "real" dependence measures: copula-based ones that are equal to zero IF AND ONLY IF the random variables are independent, for example Hoeffding (1940) or Schweizer and Wolff (1981), or any Lp distance between the underlying copula and the independence copula. Check:
Nelsen, R.B. (2006) An introduction to copulas. Springer
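The copula argument in the post above can be checked directly (simulated positive variables; the construction of y is arbitrary, chosen only to create dependence): applying the strictly increasing log transform leaves the copula, and hence Spearman's rho, exactly unchanged, while Pearson's r does change.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.lognormal(size=2000)                  # positive, heavy-tailed marginal
y = x * rng.lognormal(sigma=0.5, size=2000)   # positive and dependent on x

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

# log is strictly increasing, so (X,Y) and (log X, log Y) share the same copula
p_raw, p_log = pearson(x, y), pearson(np.log(x), np.log(y))
s_raw, s_log = spearman(x, y), spearman(np.log(x), np.log(y))
# the copula-based Spearman is identical; the marginal-sensitive Pearson is not
```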
I agree with almost all the answers. I would like to add that the formula for computing Pearson's r is the same as the formula used for Spearman's rho, although there is a shorter formula for the latter. Inasmuch as ratio/interval variables are required for Pearson's r (where the distance between sample values can be arbitrarily small) and ordinal variables are required for Spearman's rho (where the distance between adjacent rank values can only be 0.5 or 1), there is a wider range of Pearson r values that can be computed than Spearman rho values.
You should look at distribution of your data. In case of normal distribution (Gauss's distribution), you can use Pearson correlation coefficient. In case of non-normal distribution Spearman's correlation coefficient should be used.
Normal distribution is rarely (if ever) present in real observational data. Usually we only want to know whether the normal distribution is a good enough approximation for our data in our situation. I've calculated lots of these coefficients. In most cases they were very similar (when I calculated both).
See Chapter 8 from "Statistical Methods in Water Resources" by Helsel and Hirsch:
http://pubs.usgs.gov/twri/twri4a3/pdf/twri4a3-new.pdf
I would calculate both correlation coefficients. If they differ very little, go with Pearson's, since it is better known. If they differ a lot, take a good look at the distribution of the data. Usually, in such a case, Spearman's coefficient is more applicable. A plot of the data helps a lot here. Are (x,y) linearly related, or are they related linearly in the ranks? Such statistics are among the most used/abused/overused around us. It is important to clarify to others how and when to use these basic statistics. It is quite bad how frequently we see the expression "x is correlated with y" when the person saying it does not in fact mean a coefficient of correlation.
Although the Spearman coefficient assumes nothing about the data distribution, it is simply the Pearson coefficient applied to ranks, and is derived from it. If both coefficients are applicable to your data set, please select the Pearson coefficient, since it uses more information, which leads to more accuracy.
Verify the validity of Pearson's correlation coefficient. If valid, go with it, otherwise, Spearman's rank correlation coefficient.
The Pearson correlation coefficient is the most widely used. It measures the strength of the linear relationship between normally distributed variables. When the variables are not normally distributed or the relationship between the variables is not linear, it may be more advisable to use the Spearman rank correlation method.
A coefficient of correlation does not have any distributional assumptions. Only the inferential procedures for the correlation coefficient require normality.
(x,y) are linearly related ==>. Pearson
(x,y) are linear in the ranks ==> Spearman
If the relationship between x and y is not monotonic, neither coefficient should be used.
I must agree with Raid. I would only add that Spearman's rho is Pearson's r applied to ranks, and the rank transformation compensates for non-symmetrically distributed data. If the distribution of either X or Y appears to have one long tail, then Spearman's is probably better.
Spearman is Pearson applied to ranks. But any monotone transformation of ranks would be just as meaningful. So why not compute Pearson correlation of the normal quantiles corresponding to the ranks? It will be very close to ordinary Pearson when the data is bivariate normal, but will be much more robust to outliers when the data is close to bivariate normal. But ... your data is time-series data, not an iid sample. So the usual distributional theory, whether of Pearson or Spearman or whatever you like, is not relevant.
You want to summarize the dependence between two variables in a single number. Maybe a one-number statistic does not do justice to the form of dependence in question. Look at a scatterplot; investigate transformations of the two variables. What kind of relationship do you expect, from your scientific knowledge of the field in question?
I suspect that asking whether to use Spearman or Pearson is the wrong question. Study your data first.
I often use normal quantiles transformations first (such as the one by Blom in SAS), and then if I calculate Pearson's correlation coefficient, then in fact the results are similar to those obtained with Spearman's coefficient on the raw data, as Richard has stated above.
Study your data. Exploratory data analysis is very important, as Tukey has pointed out in the past.
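A sketch of that normal-scores approach (a Blom-type transformation written out by hand, on made-up data with one planted outlier): Pearson's r on the normal scores stays close to the clean value, while raw Pearson's r is badly distorted.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(9)
x = rng.normal(size=100)
y = 0.7 * x + rng.normal(0, 0.7, 100)
x[0], y[0] = 8.0, -8.0            # plant one gross, discordant outlier

def normal_scores(v):
    # Blom-style normal scores: Phi^{-1}((rank - 3/8) / (n + 1/4))
    n = len(v)
    rk = np.argsort(np.argsort(v)) + 1
    inv = NormalDist().inv_cdf
    return np.array([inv((r - 0.375) / (n + 0.25)) for r in rk])

r_raw = np.corrcoef(x, y)[0, 1]                                   # hit by the outlier
r_scores = np.corrcoef(normal_scores(x), normal_scores(y))[0, 1]  # much more robust
```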
I do not have much to add to the many exhaustive replies to this question, only the suggestion to check whether the data to correlate are sufficiently close to normality before using the Pearson correlation. Spearman's method is designed for data in the form of ranks: to use this method you have to ascertain whether the data are in ranks and, if not, whether it is possible to rank your data.
The Pearson correlation coefficient will be suitable in the case of normally distributed data; otherwise, use the Spearman test.
Raid has a very good point.
The principal confusion arises because the Pearson coefficient formula is based on the covariance and the standard deviation of each variable; for this reason it is called a parametric coefficient. But, unlike other parametric tests, it does not need a normal distribution (we use the mean only to calculate the SD and covariance).
Please note that you can use it to obtain the linear association between two variables on different scales, for example a variable on an interval scale (like temperature in degrees) and one on a percentage scale (like the relative abundance of some species).
Spearman sees the global association.
The normal distribution and interval scale are requirements for parametric tests based on mean comparisons. Another big requirement of this kind of test is...
If you need to know more about the scales, you can check Siegel's nonparametric statistics book (in particular Table 1): http://www.mun.ca/biology/quant/siegel.pdf
In conclusion: Pearson shows only the linear relationship between two variables,
while Spearman shows any monotonic relationship between two variables, including the linear one.
I think the method to be used depends on the kind of data. I do not want to repeat what has been mentioned above. Whatever you use to test correlation, you have to explain the logical relationships. Ms. Angel, in her conclusion, summarized the differences clearly. Both methods are valid and indicate the relationship's value and significance (one to one).
I assume that you've adjusted your time series for seasonality and other factors that affect such data. It would be important also to examine them graphically. A "number" by itself is not going to convey much.
I think that there is no clear preference for choosing the Spearman or Pearson test; both are statistically meaningful, and the choice depends on whether a rank correlation coefficient (Spearman) or a measure of linear association (Pearson) is more useful for the project being analyzed.
It is time to stop this discussion. Use Spearman only when your data are in ranks, and the Pearson product-moment correlation when the data are normally distributed or approximately normal. This discussion has gone on too long!
My compliments and support to Jeff Jarrett! I want only to add a simple consideration: many statistical software packages do not even need the researcher's choice, because they automatically choose the type of correlation to use according to the distribution of the data entered. So long, G. Ruggieri
In summary, then: the Pearson correlation is for interval/ratio-scale, normally distributed and/or parametric data, while Spearman's rho is for ordinal-scale or non-parametric data. I think the Spearman correlation is good for the relationship between two variables, i.e. one independent variable, while Spearman's rho is for more than one independent variable. What do you think?
I want to add that the Spearman correlation can be very useful for detecting outliers or subgroupings among the data of the variables to be correlated in a linear regression, because of the regular intervals of the rank values compared with the possible anomalies among the items of a variable: a strong difference between the p of the linear regression and the Spearman p of the same data converted into rank scores may suggest the need to check the characteristics of the original data and of their distribution.
It depends. The Pearson correlation coefficient is the most widely used; when the variables are not normally distributed, or the relationship between the variables is not linear, it may be more appropriate to use the Spearman rank correlation method.
Hi. You have to find out if the continuous variables in the study are normally distributed. If normally distributed, Pearson's correlation coefficient is appropriate. If not normally distributed, Spearman correlation coefficient is to be selected.
Normality is not needed for either correlation coefficient. This has been repeatedly discussed above. The normality assumption is only needed for some inferences on the true correlation coefficient. When the association between X and Y is linear, then we can consider Pearson's correlation coefficient. When it is non-linear and monotonic, then Spearman's correlation coefficient may work better.
Sorry about my last answer; of course the information needs correction, because I lost some words. Please see the remarks:
The Pearson product-moment correlation is for continuous, mound-shaped distributions. The normality assumption is required only for inference, as someone said.
The Spearman correlation is used for ordinal data; most of the time we rank the data and then compute the correlation.
Kendall's correlation is likewise for non-parametric data.
I prefer to report both the correlation coefficient and the p-value; these two values strengthen each other and give strong confidence in the result.
According to what I know, it is impossible to use them interchangeably. The reply by Maria José Dos Santos is correct: the Pearson correlation is used on data having a bell-shaped distribution, and it gives us the strength of the correlation and the level of significance of this correlation when regressing. The Spearman correlation gives the correlation of ordinal data and, as I said in a previous comment, we can usefully explore the presence of outliers or subgroups in data files A and B when regressing A versus B, simply by converting their data into ranks and applying the Spearman correlation to them.
If you want to measure linear dependence between y and x: Pearson's correlation coefficient. If you want to measure general (monotonic) association between y and x: Spearman's.
In the case where one has a strictly monotonically increasing relationship which is not very linear, Spearman's coefficient is better at picking up such associations (higher correlation coefficient and lower p-value) than Pearson's.
In exploratory analyses, I initially look for associations between y & x (if one increases X, does Y increase (or decrease)) without paying much importance to the relative size of the increase (or decrease) in Y: hence I prefer Spearman's.
NB That is the absolute size of the correlation coefficient which is important. p-values for correlation coefficients are notorious to become stat sign. even when the association is weak, given enough sample size.
If your two variables are continuous, use Pearson's correlation coefficient, but if your variables are ordinal, use Spearman's correlation coefficient.
I would say that when you assume that for each increase in X there is a fixed increase (or decrease) in Y, then we measure a linear association in the data. When we assume that for each increase in X there is "some increase (or decrease)" in Y, we should be measuring a monotonic (not necessarily linear) association with Spearman's rank correlation.
I have the idea that by now sufficient interventions have defined the difference between the Pearson and Spearman correlations and their different uses. The first is to be used when between two variables A and B there exists a relationship in which B changes in conformity with the change of A, and vice versa, according to a mathematical link expressed by an equation; in the Spearman correlation, the relationship is instead based on the difference in order between the rank items of the two variables. This also allows us to verify whether two variables to be correlated by the Pearson correlation have sufficiently similar distributions to yield a statistically significant correlation. Converting their data into the corresponding ranks and applying the Spearman correlation to them: a significant Spearman p tells us that A and B have significantly concordant distributions, while a non-significant or weak Spearman p points to differing distributions caused by subgroupings of data and/or outliers. I vote up for Raid Amin.
I want to add that another useful result can be obtained by checking a statistically significant Pearson correlation. Applying the Pearson correlation to the ranks of the regressed data can give either a significant or a non-significant p. In the first case we have the information that the data really have a significant correlation of their distributions. A non-significant Spearman p strongly suggests exploring the regressed data, because there is a high probability that the significant Pearson p is due to one or a few outliers masking the actually low correlation of the remaining data; that is to say, after eliminating the outliers, the Pearson correlation p should turn out non-significant.
Both give the magnitude of the relationship between the two variables. I agree with Darvishan.
There is no doubt that the Pearson and Spearman correlations both give the magnitude of the relationship between two variables, but, as I noted before, the two magnitudes are fundamentally different from each other: applied to the same file of data as actual values (Pearson correlation) and as ranks of the same data (Spearman correlation), they could both yield a significant regression (p < 0.05), or the Pearson regression could be significant (p < 0.05) while the Spearman one is not (p > 0.05).
Hello, Giancarlo
Please correct me if I'm wrong, but are these aspects critical if we focus on the phase of exploratory data analysis (EDA)? I understand that EDA worries more about type II errors than type I...
Of course the two techniques are applicable to different situations: one for parametric and the other for non-parametric data. Since others have explained this, I did not write the details. I agree with you.
Although they are not directly in your field of investigation, I suggest further reading on the alternative between exploratory data analysis and modeling in the analysis of correlation between time series, which is where your question about the choice of correlation coefficient falls. For example:
i) for EDA
Detecting trends using Spearman's Rank Correlation by Thomas Gauthier
(available at : http://plaza.ufl.edu/yiz21cn/refer/trend%20detection%20using%20spearman%20rank%20cc.pdf)
and
ii) for modeling :
Correlation Testing in Time Series, Spatial and Cross-Sectional Data by P.M.Robinson
available at http://eprints.lse.ac.uk/25470/1/Correlation_Testing_in_Time_Series,_Spatial_and_Cross-Sectional_Data.pdf
Hi there,
When the data do not meet parametric assumptions, there are two rank correlation coefficients that may be estimated: Spearman's rho and Kendall's tau. How does one best select between the two?
Walter, this may sound strange, but have you tried any regression with your data? For example, a linear regression may give you more details compared to a Pearson correlation. You can also account for confounders/covariates.
In general, if both variables are measured on the ordinal scale, or one is ordinal and the other is on an interval or ratio scale, use Spearman's rho or Kendall's tau. If both are interval or ratio, use Pearson's product-moment coefficient. See also the distance covariance [V] test of Gabor J. Szekely [2005].
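As a rough sketch of how Kendall's tau differs from Spearman's rho: tau counts concordant versus discordant pairs rather than correlating ranks. A minimal Python version of tau-a (which assumes no ties; the example data are invented):

```python
# Kendall's tau-a: (concordant - discordant pairs) / total number of pairs.
# Assumes no ties; with ties, the tau-b correction (not shown) is needed.
from itertools import combinations

def kendall_tau(x, y):
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)      # same sign in x and y => concordant
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]                    # mostly increasing, two swaps
print(kendall_tau(x, y))               # 0.6 : 8 concordant, 2 discordant pairs
```

A common rule of thumb is that tau tends to be smaller in absolute value than rho on the same data but has a more direct probabilistic interpretation (probability of concordance minus probability of discordance).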
Remember: 'correlation is not necessarily causation!'
George Hart
For data exploration, both methods can be used, although Spearman correlation is computed on ranks whereas Pearson correlation is calculated on the actual values and depicts linear relationships.
Spearman's correlation coefficient can be applied when one or both of the two continuous variables are not normally distributed. A second assumption for Spearman's correlation is that there should be a monotonic relationship between the two variables. To check that, draw a scatter diagram to see whether the relationship is monotonic.
Pearson's correlation can be applied when the variables are normally distributed and there are no outliers in either continuous variable.
I am still in doubt. Does Pearson's coefficient of correlation assume a normal distribution or not? There are many data sets which are not normally distributed but free from outliers and measured on interval/ratio scales. Should we use Spearman's rho on them strictly as a rule of thumb, or can we still apply Pearson's?
Moreover, if Pearson's correlation relies so much on normality and it doesn't make any difference whether we use Spearman's or Pearson's, then what is the logic behind the existence of Pearson's coefficient of correlation?
Pearson’s coefficient of correlation was discovered by Bravais in 1846, but Karl Pearson was the first to describe, in 1896, the standard method of its calculation and to show it to be the best one possible. Pearson also offered some comments about an extension of the idea made by Galton (who applied it to anthropometric data). He called this method the “product-moments” method (or the Galton function for the coefficient of correlation r). An important assumption in Pearson’s 1896 contribution is the normality of the variables analysed, which could be true only for quantitative variables. Pearson’s correlation coefficient is a measure of the strength of the linear relationship between two such variables.
Sir, thank you so much for your reply, but I read in many books that a normality test should be performed before applying Pearson's coefficient of correlation in SPSS, and now I am unable to normalize the data on which Prof. Andy Field applied Pearson's correlation.
I'll be really thankful if you can suggest some good article for resolving this problem with a valid proof.
Distributional assumptions are only needed when inferential procedures are used, Jyoti. If your only wish is to obtain the value of Pearson's correlation coefficient, then you need continuous data but not normality.
Many excellent texts have chapters on the correlation coefficient. Use texts with authors known for their excellent reputations.
The correlation coefficient is just a statistic. If we calculate the value of the statistic (sample mean) we do not assume normality to calculate the mean. If we want to use the sample mean in a t-test, we have to assume normality.
Sir, when I applied bivariate correlation, the Pearson's value came out as -0.441 and because of the p value
I agree with Raid; no normality is needed for either. For testing, you may use a permutation test; the normality assumption is still not needed.
Only when using the Fisher transformation or the t-test is the assumption of a bivariate normal distribution (uncorrelated under the null) needed.
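A permutation test for a correlation coefficient needs no normality: shuffle one variable many times and count how often the shuffled correlation is at least as extreme as the observed one. A sketch in Python (the data and the 2,000-permutation count are arbitrary choices for illustration):

```python
# Permutation test for Pearson's r: no distributional assumption, only
# exchangeability of the pairings under the null of no association.
import random
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    hits = 0
    ys = list(y)
    for _ in range(n_perm):
        rng.shuffle(ys)                      # break any X-Y pairing
        if abs(pearson(x, ys)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)         # add-one rule avoids p = 0

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 6, 9, 8]                 # strongly associated with x
print(permutation_pvalue(x, y))              # small p: association is real
```

The same scheme works for Spearman's rho or Kendall's tau by swapping in the corresponding statistic.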
Spearman's Rank Correlation Coefficient is more robust than Pearson Correlation Coefficient, which is affected by outliers.
Say we have data sets
group x y
1 2 1
1 2 2
1 3 4
1 4 2
1 5 5
2 2 1
2 2 2
2 3 4
2 4 2
2 500 500
For Pearson you will get, for group 1: 0.72; for group 2: 0.9995.
For Spearman, there is no difference between the two groups: 0.73.
Which one you should use depends on your data.
If you think (500, 500) is a typo, use Spearman; if (500, 500) gives you valuable information, use Pearson.
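Re-running this example in Python reproduces the pattern (a sketch; ties are ranked with average positions, under which rho comes out near 0.76 rather than 0.73 — the exact figure depends on the tie-handling convention): the (500, 500) point inflates Pearson's r for group 2 towards 1, while Spearman's rho is identical for the two groups because their rank orderings coincide.

```python
# Verifying the two-group example: Pearson is driven by the outlier,
# Spearman is not, because it only sees the rank order.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

g1_x, g1_y = [2, 2, 3, 4, 5], [1, 2, 4, 2, 5]
g2_x, g2_y = [2, 2, 3, 4, 500], [1, 2, 4, 2, 500]
print(round(pearson(g1_x, g1_y), 2))    # 0.72
print(pearson(g2_x, g2_y))              # very close to 1: outlier-driven
print(round(spearman(g1_x, g1_y), 2))   # 0.76
print(round(spearman(g2_x, g2_y), 2))   # 0.76 -- same ranks, same rho
```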
While studying the relationship between two or more variables, we use a correlation coefficient. The type of correlation we should use is based on the scale of the variable values. If the values are on a ratio or interval scale, we can use a parametric test; if the variable is on an ordinal or nominal scale, we use a non-parametric test. Karl Pearson's is a parametric test and Spearman's rank correlation is a non-parametric one. Spearman's is the appropriate test when the data are on an ordinal scale, and hence it is also called Spearman's rank correlation.
Raid Amin, your solutions are excellent and most helpful. Thanks a lot for the great support.