I have two separate groups which are asked, under different conditions, to sort a list of items in order of preference (building a hierarchy). The total number of items is 5 and the data is categorical (nominal).
I want to do a comparative analysis between the two hierarchies and measure how significantly they differ from each other. What analysis should I use?
The first approach I suggested is necessary in any case, because it enables you to compare survey 1 and survey 2 for each item, and then for the aggregated scores. You can then compare all the items at once to see whether preference differs significantly among them; this is possible using the Friedman test for several related variables. At this level, however, you have to work directly with the raw data in SPSS, not with the summary contingency table. First run Friedman for the entire data set (the two surveys combined), then split the file (Data menu) on the grouping variable (survey) so as to run Friedman for surveys 1 and 2 separately.

As you can see, this is pure mathematics; in social surveys, emphasis should also be placed on visual appreciation of trends, which is why designing a contingency table like the one we agreed on is essential before any calculation of P-values. You can complement Friedman (which compares more than two related variables) with a pairwise comparison test for two variables (available under the non-parametric test group) to compare, within each item, survey 1 against survey 2. The Friedman output can be compared with that obtained from Chi-Square. You can also assess the consistency with which respondents rank the various items using Cronbach's alpha reliability coefficient, and the relationships between items (for instance, does preference for A imply preference for B?) using inter-item correlation coefficients; do this for the entire data set as well as separately for surveys 1 and 2, using the Split File or Select Cases function, to appreciate the differences.

These tests are complementary, and this triangulation approach will give a good appreciation of the variability in your data if it is done properly. You can go ahead and collect your data and we will see after that. Regards.
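For readers without SPSS, the Friedman step above can be sketched in Python. This is a minimal illustration with made-up ranks for three items from five respondents (the values are assumptions, not real survey data):

```python
# Minimal sketch of the Friedman test on raw rank data, using scipy
# instead of SPSS. The ranks below are hypothetical.
from scipy.stats import friedmanchisquare

# Each list holds one item's rank as given by five respondents;
# each respondent's row across the three items is a permutation of 1..3.
item_a = [1, 1, 2, 1, 3]
item_b = [2, 3, 1, 2, 1]
item_c = [3, 2, 3, 3, 2]

stat, p = friedmanchisquare(item_a, item_b, item_c)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.3f}")
```

Splitting by survey, as described above, would simply mean running this once per group on that group's rows.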
I thought about correlation tests, but these won't tell me anything about the two hierarchies and whether they differ from each other.
I could use descriptive statistics, basically use a point system to build the hierarchies and then report the results, but I want to do it a bit more scientifically if there is a way.
I also thought of using Kruskal-Wallis or Mann-Whitney, but I can only apply these to each item of the hierarchy; they can't tell me whether the overall order of the hierarchy is similar to or different from the other group's.
I have no idea if there is a test of variance for hierarchies...
John, thank you very much for your answer. I think I should have been clearer, and I am sorry about that; it sounded understandable in my head, but I didn't convey my message properly. By my understanding, correlations are a test between two variables, which I don't think is the case here.
Put simply, I have two groups of randomly assigned participants (let's say 50 and 50 people).
I give them a list (bananas, apples, carrots, strawberries, oranges) and ask them to rearrange it based on their preference. So what I have at the end is two groups, each of which returned its own hierarchies of these items.
How do I test whether the order of these items (the hierarchies) is the same for the two groups? I can't think of a proper methodology to assert this.
Since you are interested in whether the two groups are similar or not, it suffices to obtain the rank correlation as a measure of co-movement or concordance and test it for significance. Kendall's tau is valid too.
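One hedged way to read this suggestion: summarise each group's hierarchy by the mean rank per item, then correlate the two summary orderings. The mean ranks below are invented for illustration:

```python
# Sketch: rank correlation between the two groups' aggregate orderings.
# The mean ranks are assumed values, in the order:
# bananas, apples, carrots, strawberries, oranges.
from scipy.stats import kendalltau, spearmanr

mean_rank_group1 = [1.4, 2.1, 3.0, 3.9, 4.6]
mean_rank_group2 = [1.8, 1.9, 3.2, 4.1, 4.0]

tau, p_tau = kendalltau(mean_rank_group1, mean_rank_group2)
rho, p_rho = spearmanr(mean_rank_group1, mean_rank_group2)
print(f"Kendall tau = {tau:.2f} (p = {p_tau:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```

Note that aggregating to mean ranks discards the within-group variability that later posts in this thread discuss.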
Dear Colleague,
First, you should use a nonparametric method, but please give more information on your data: for example, is the response variable continuous or categorical? What are the explanatory variable(s)? Generally, if the response variable is categorical, you can use a non-parametric method. If you want to compare two independent groups on a categorical response variable, you should use the Mann-Whitney U test.
Please feel free to contact me.
Regards
Dr. Ecevit EYDURAN
Recently, I used a partial-order generalization of rough set theory. Several statistical tools are also available.
I have to agree with Dr. Ecevit EYDURAN; a non-parametric test is more adequate in this case. Maybe a Mann-Whitney U test is enough. You could use Kruskal-Wallis too, but with 2 groups it is not necessary.
Dear Michael Tsikerdekis,
First, I agree with Ecevit Eyduran that we all need to be clear about your data. With your fruit example: i) is it quantitatively defined (numerical) or qualitative (apples, bananas...)? ii) is it classified in groups or only ranked, as in your example, by "preferences"? iii) what is the number in each sample (5 in your case), which matters for choosing a parametric or non-parametric test?
iv) do you want descriptive statistics (comparison of distributions, independence of groups, or correlation), or something more complex, such as "a preference" of subjects in two groups and a comparison of the two groups, and so on?
I think that in each case some statistical methods are appropriate, while in others they cannot be used.
Finally, if a transformation of the qualitative data into quantitative data (banana = 1) is possible and appropriate for the analysis, it may be more useful: "a preference" can become a "score" (1, 2, 3...), or equivalently a "weight" attributed to the data, which can then be used in tests (e.g. factorial analysis).
Regards
Didier J
Thank you all for the answers! You've been really helpful. :-)
I think both the Hamming index and the rank correlation coefficients are problematic, because the data I get from each participant are more complex. Each participant from group A gives me back a list (e.g. [b,a,c,o,t]) which can be compared with the list of each participant from group B (e.g. [a,b,c,o,t]). The way I see it, and please correct me if I am wrong, is that if I want to use the Kendall tau distance, I would have to cross-compare each list from one group with each participant's list from the other group and then average the results, and do the same for the rest of the participants. Then, having a mean Kendall tau distance for each participant of group A against group B (and vice versa), I could use a t-test to determine whether the Kendall tau distance between the two groups differs (and therefore whether the original rank-ordered data differ). Like John said, Kendall tau distance is a metric, and I don't know whether the whole approach is scientific.
Another solution is to use, as Tatiana suggested, multinomial regression analysis, or, as Ecevit suggested, Mann-Whitney U, provided that I code all possible combinations from the hierarchies into categorical data. The problem with this is that with 5 items arranged in every possible order I end up with 5! = 120 possible combinations, i.e. 120 unique categories, each representing one possible ordering. Chances are I will end up with the groups being significantly different from each other, but I don't think this will be accurate, because it completely ignores distances: the lists [a,b,c,o,t] and [a,b,c,t,o] are more similar than [a,b,c,o,t] and [c,o,t,b,a], yet as categories each becomes a unique number, and any "similarity" is ignored.
To be honest I am more in favor of the first method, but can it really work, or be revised so that it becomes solid?
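The cross-comparison idea described above can be sketched as follows (toy lists reusing the b/a/c/o/t example; the final t-test step is left out, and whether that step is statistically sound is exactly the open question):

```python
# Sketch: for each participant in group A, average the Kendall tau
# distance (number of discordant item pairs) to every participant in
# group B. The lists are toy data.
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Count item pairs ordered differently in the two rankings."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )

group_a = [["b", "a", "c", "o", "t"], ["a", "b", "c", "o", "t"]]
group_b = [["a", "b", "c", "t", "o"], ["c", "o", "t", "b", "a"]]

mean_dist_a = [
    sum(kendall_tau_distance(r, s) for s in group_b) / len(group_b)
    for r in group_a
]
print(mean_dist_a)
```

With 5 items the distance ranges from 0 (identical order) to 10 (fully reversed), so the per-participant means are directly comparable.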
Didier, I should have specified this from the beginning :( . The data is definitely categorical; that's why I used the example of fruits. The problem is that each participant's response is an ordered set (e.g. [bananas, apples, carrots], while another participant's is [apples, carrots, bananas]).
So I have 2 groups, with each participant giving an answer like the two examples above.
I need to test whether the answers from the two groups are similar or not, and how similar.
Dear Sir
You may use the Co-integration test (E-Views software or from SPSS package) for comparative analysis.
For your last request, you should use Spearman rank correlation to determine the similarity of the answers given by the two independent groups, with the help of the SPSS statistical package. Please click the following link to see how to do a Spearman rank correlation:
http://statistics.laerd.com/spss-tutorials/spearmans-rank-order-correlation-using-spss-statistics.php
Good luck.
Ecevit EYDURAN, Assist.Prof.
Manuel, indeed I think you are right that Hamming distance could be good for testing the similarities. But the result will still be a comparison of one individual's answer against another's. The question is how I can do this for groups.
As an example imagine that your table looks like this:
PARTICIPANT_ID | GROUP | ANSWER
1 | 1 | (1,2,3,4,5)
2 | 1 | (1,2,3,5,4)
3 | 2 | (1,2,3,4,5)
4 | 2 | (3,4,5,2,1)
How do I establish the level of similarity between the answers of group 1 and group 2?
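One simple summary for exactly this toy table is the mean pairwise Hamming distance between the groups. This is only a sketch of one candidate measure, not a significance test:

```python
# Mean between-group Hamming distance for the four toy answers above.
def hamming(r1, r2):
    """Count positions where the two rankings disagree."""
    return sum(a != b for a, b in zip(r1, r2))

group1 = [(1, 2, 3, 4, 5), (1, 2, 3, 5, 4)]
group2 = [(1, 2, 3, 4, 5), (3, 4, 5, 2, 1)]

between = [hamming(r, s) for r in group1 for s in group2]
print(between, sum(between) / len(between))
```

Here the four cross-group distances are 0, 5, 2 and 5, giving a mean of 3.0; later posts discuss why Hamming alone can be misleading.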
Hi! Well, I am more of an ecologist, but anyway... I was wondering if you could build a matrix with 5 variables (bananas, apples, carrots, strawberries, oranges), where each person is a case and you reverse the order of preference (5 becoming the highest "score" for a variable). Then you calculate a triangular matrix of something like Kendall's rank correlations between all cases. Now you can use something like ANOSIM or PERMANOVA to test for differences between the two groups using the correlation matrix directly (they use permutation tests to calculate a P-value; ANOSIM is strictly non-parametric, and PERMANOVA can even use Monte Carlo methods if the number of unique permutations is not large enough). I am not 100% sure this can be done, but I can look at this better later... I just thought of this now and don't have a lot of time on my hands at this moment, sorry :).
Michael, there seem to be two different 'steps' to your analysis; actually, I believe Manuel already addressed both of them.
(1) Evaluating the (dis)similarity between your groups is basically a distance-measurement problem. Each possible hierarchy can be considered a point in a 5-dimensional (hyper)space, so each of the two tested groups can be represented as a cloud of points in this space. The distance between the groups can be summarised by the distance between the group centroids. The inter-centroid distance is of course a summary measure; it is valid, but it carries no information on the 'distributional characteristics' of the groups (e.g. overlap of the clouds, their pattern and orientation, etc.), which, however, (IMHO) are relevant.
That said, one can calculate the centroid distance for both groups in your example above. A word of caution: Hamming distance seems appropriate; nevertheless, it might be reasonable to consider some others as well.
The issue with the inter-centroid distance probably remains that, although objective, it is still likely to be somewhat difficult to interpret.
(2) The second step would be testing the overall difference between the clouds. A bootstrapping approach (as suggested by Manuel) seems the way to go, as it will eventually provide a significance estimate of the between-groups difference. Here (now just thinking aloud :) ) one could designate one group as the reference and then evaluate the probability of the other being drawn from the same population/distribution. (It would probably make less sense to test both groups against a cloud drawn from the uniform distribution in the hyperspace.)
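The two steps just described can be sketched together: measure the inter-centroid distance, then judge its size by randomly relabelling participants. The rankings below are toy data, and the L1 centroid distance is just one possible choice:

```python
# Sketch: permutation test on the distance between group centroids.
# Toy rankings; L1 distance between centroids is an assumed choice.
import random

def centroid(rankings):
    n = len(rankings)
    return [sum(r[i] for r in rankings) / n for i in range(len(rankings[0]))]

def centroid_distance(g1, g2):
    # L1 distance between the two group centroids
    return sum(abs(a - b) for a, b in zip(centroid(g1), centroid(g2)))

group1 = [(1, 2, 3, 4, 5), (1, 2, 3, 5, 4), (2, 1, 3, 4, 5)]
group2 = [(3, 4, 5, 2, 1), (4, 3, 5, 1, 2), (3, 5, 4, 2, 1)]

observed = centroid_distance(group1, group2)
pooled = group1 + group2
random.seed(0)
n_perm = 2000
exceed = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    if centroid_distance(pooled[:3], pooled[3:]) >= observed:
        exceed += 1
p_value = exceed / n_perm
print(f"observed distance = {observed:.2f}, permutation p = {p_value:.3f}")
```

With only 3 + 3 participants there are very few distinct relabellings, which echoes the later discussion about permutation counts; with 50 + 50 the test becomes much more meaningful.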
Best, Jaro
Jaro Lajovic
rho sigma research and statistics
www.rosigma.si
Dear Michael,
Thanks for these details. In this case, some of the statistical methods suggested previously cannot be used.
According to Jaro
"Evaluation of (dis)similarity between your groups is basically a distance measurement problem. Each possible hieararchy can be considered a point in a 5-dimensional (hyper)space, and so each of both tested groups can be represented as a cloud of points in this space. The distance between both groups can be summarised by the distance between the group centroids. The distance between the centroids is of course a summary measure; it is valid, but does not bring information on 'distributional characteristics' of the groups (e.g. overlapping of clouds, their pattern and orientation etc.), which, however, (IMHO) are relevant."
Not being a hyperspecialist, I think this is what is called "factorial correspondence analysis" (a literal translation of the French term), and software packages make the calculations, don't they? The distance to a centroid corresponds, for me, to what I called a "weight" for the data: the point nearest the centroid is more strongly linked than the farthest point. This notion, again for me, joins Miguel's suggestion of scoring the variables (1, 2, 3...), which could be considered equivalent to the "weight" of a variable, or its distance to a centroid.
Regards
Didier
For your problem, other alternatives are:
1) Chi-Square and G statistics are used to test an association between two categorical variables, but the total sample size should be more than 150-200.
2) Kendall tau correlation, a non-parametric test like Spearman correlation, can be used if the sample size is small.
3) Multiple Correspondence Analysis is used to visualize graphically all the interactions among the levels of more than 2 categorical variables, if the sample size is about 5 or 10 times the number of variables.
4) Power analyses for the Chi-Square and G statistics can be used to determine the required sample size for two categorical variables, using the SAS statistical package.
I wish you great success.
Dr. Ecevit EYDURAN
Assist. Prof
You can make a table (group 1 vs. group 2) in terms of frequencies and percentages. Your data is not sufficient for any statistical testing. The Spearman rank correlation coefficient is for ordinal data, not for nominal.
To look at the difference between two proportions, chi-square is appropriate, but your sample size is not adequate. So better make a table, as I said, or draw a multiple/component bar diagram.
Dear friend
Would you like to compare two groups of respondents that ranked given items, or two groups of items that given respondents ranked? These require different statistical analyses. Anyway, in SPSS, consider each item as a variable, i.e. one column, and enter the rank that each person gave to the item. Suppose I ranked 5 items: first rank to item 4, second rank to item 1, third rank to item 5, and so on; every respondent takes one row in SPSS. After data entry you should decide on your main research question. Are you looking for correlations among modes of ranking? For example, do respondents who gave the first rank to item 5 also rank item 3 fifth? In that case, Spearman is one of the best choices: it lets you analyze the direction and strength of the relationship as well as its significance. If you have two groups of respondents, you will need one more variable: simply add a column for "type of respondent" and code the groups 1 and 2. In that case, I think one-way ANOVA would be better, as it can show the differences between groups. Please pay careful attention to the assumptions of ANOVA. ALL THE BEST
As is known, a good solution is a non-parametric method for categorical data analysis, such as Likert-type data. For determining an association between two categorical variables, the Chi-Square and G statistics should be used on contingency tables (r x c tables). Also, the contingency coefficient (= sqrt(chi-square / (chi-square + total sample size))) can be calculated from the chi-square.
Kind Regards
For those joining the discussion: I do plan to treat the data as categorical just in case; however, with 5! = 120 categories of answers, I will probably get a significant difference between the groups while totally ignoring any similarity between the true answers, which are ordered sets or arrays (a single answer is of the form [a,b,c,d,e]).
Manuel, John and Jaro described a process which will allow for an analysis of the similarities. I need, however, some pointers on the overall process. I found that I can use Levenshtein distance along with Hamming distance, but as far as I understand I will probably get the same results. Kendall tau distance might be a promising alternative, as well as the L1 distance of ranks (sum of absolute differences). If I find software that can help me with all the cross-participant comparisons, I don't see why I couldn't try all of them and produce the mean difference in similarity between the groups. That first part is pretty much straightforward.
Calculating CIs via bootstrapping is something that I haven't done, so if you have a guide for it that would be extremely helpful.
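For reference, a percentile bootstrap for a mean between-group distance can be sketched as follows. The data and the choice of Hamming distance are assumptions for illustration; resampling each group with replacement and taking the 2.5th/97.5th percentiles of the recomputed statistic is the core idea:

```python
# Sketch: percentile bootstrap CI for the mean between-group Hamming
# distance (toy data, assumed distance measure).
import random

def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

def mean_between(g1, g2):
    return sum(hamming(r, s) for r in g1 for s in g2) / (len(g1) * len(g2))

group1 = [(1, 2, 3, 4, 5), (1, 2, 3, 5, 4), (2, 1, 3, 4, 5)]
group2 = [(3, 4, 5, 2, 1), (4, 3, 5, 1, 2), (1, 2, 3, 4, 5)]

random.seed(0)
boots = []
for _ in range(5000):
    b1 = [random.choice(group1) for _ in group1]  # resample with replacement
    b2 = [random.choice(group2) for _ in group2]
    boots.append(mean_between(b1, b2))
boots.sort()
lo, hi = boots[int(0.025 * len(boots))], boots[int(0.975 * len(boots)) - 1]
print(f"mean distance = {mean_between(group1, group2):.2f}, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```

The same loop works for any of the distances mentioned above (Kendall tau distance, L1 rank distance) by swapping the `hamming` function.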
I had some time to check PAST and the ANOSIM procedure suggested by Miguel. Indeed, ANOSIM provides an ANOVA-style analysis of distances based on a number of distance measures. Hamming is one of them, and it is used for DNA sequence similarity analysis. Could I use this procedure to compare the two groups?
PS: Manuel, I am having trouble entering sequence data in one field in PAST (1,2,3,4,5). How did you manage to get it to work?
Hi,
You are faced with ordinal variables, because the items involved in the ranking can be ranked from the most desired (R1, for instance) to the least desired (R5). We will handle the problem in three simple stages.
Stage 1: Define each item under classification as a single variable in SPSS, PSPP, Stata, etc., and assign to each item (variable) the rank given to it by each respondent. If you have 40 respondents, you will have 40 rows in your database; and if you have 4 items, for instance, you will have 4 columns, each column standing for one variable (item).
Stage 2: Run a frequency analysis for each variable (item). Then organize the frequency outputs for all the items in a contingency table like the model shown below.
You may have a result such as:
Rank   Item 1      Item 2      Item 3     Item 4
R1     20 (50%)    10 (25%)    10 (25%)    5 (12.5%)
R2     10          10           8          5
R3      5          10           8          6
R4      2           5           4          4
R5      3 (7.5%)    5 (12.5%)  10 (25%)   20 (50%)
N      40          40          40         40
Visually, a common reader can already appreciate which item is the most preferred and which the least.
Stage 3: Calculate the P-value (significance value)
You can use the Chi-Square test of equality of proportions to compare rankings or preferences between items. Epi-Info 6.04d offers the possibility of calculating the Chi-Square significance level from such a complex contingency table. If you use the Chi-Square test, you have to work with the values (proportions) in the contingency table.
Otherwise, you can use a test for comparing several related variables; I think it is Friedman (verify). If you use such a test, you have to work with the raw data just as they were entered in the spreadsheet.
Set your alpha (e.g. 0.05 if you are working at the 95% confidence level). If P < alpha, the difference is significant and you can say that the items do not enjoy the same preference, backing your argument with their respective weights as presented in the contingency table.
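Stage 3 can also be run outside Epi-Info/SPSS. This is a minimal sketch with assumed frequencies shaped like the model contingency table (ranks R1-R5 in rows, 4 items in columns, 40 respondents per item):

```python
# Sketch: chi-square test on a rank-by-item contingency table.
# Frequencies are assumed for illustration; each column sums to 40.
from scipy.stats import chi2_contingency

table = [
    [20, 10, 10,  5],   # R1
    [10, 10,  8,  5],   # R2
    [ 5, 10,  8,  6],   # R3
    [ 2,  5,  4,  4],   # R4
    [ 3,  5, 10, 20],   # R5
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

With 5 ranks and 4 items the test has (5-1) x (4-1) = 12 degrees of freedom.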
You can call me at (237) 74 54 16 19 or (237) 33 12 14 99
Regards and good luck.
Nana, I don't think your solution can be applied to the current problem. What I need to test is not differences of preference on items within a group, but between groups. My data has a form such as the one below, and I want to test whether the answers from Group 1 are similar to those from Group 2.
Participant ID, Group, Answer
1, 1, [1,2,3,4]
2, 1, [2,3,4,5]
3, 2, [1,2,3,4]
4, 2, [4,3,2,1]
-------------------------
On a second note, concerning the Hamming distance, it also seems to be prone to error. Consider these two pairs of answers:
1,2,3,4
4,3,2,1
and
1,2,3,4
2,1,4,3
The Hamming distance is 4 in both cases, but the second pair is more similar than the first. Is there another measure of distance that can balance this?
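This point is easy to verify: the L1 rank distance mentioned earlier in the thread (the Spearman footrule, the sum of absolute rank differences) does separate the two cases even though Hamming cannot:

```python
# Hamming vs. L1 rank distance (Spearman footrule) on the two pairs above.
def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

def footrule(r1, r2):
    return sum(abs(a - b) for a, b in zip(r1, r2))

base = (1, 2, 3, 4)
full_reverse = (4, 3, 2, 1)
two_swaps = (2, 1, 4, 3)

print(hamming(base, full_reverse), hamming(base, two_swaps))    # both 4
print(footrule(base, full_reverse), footrule(base, two_swaps))  # 8 vs 4
```

The footrule gives 8 for the full reversal but only 4 for the two adjacent swaps, matching the intuition that the second pair is closer.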
Here I am again. I am not a statistician, so I may be missing some really important underlying issue here, but I will just show you an example.
I tried something quickly using ANOSIM and PERMANOVA with this very simple example (of course we will not achieve great power with 4 cases):
Participant ID, Group, Answer
1, 1, [1,2,3,4]
2, 1, [2,1,3,4]
3, 2, [2,1,4,3]
4, 2, [4,3,2,1]
I used Kendall's tau among participants:
        1        2        3
1
2    0.667
3    0.333    0.667
4   -1       -0.667   -0.333
Then, using this correlation matrix I ran an ANOSIM:
Sample group
S1 1
S2 1
S3 2
S4 2
Global Test
Sample statistic (Global R): 0.375
Significance level of sample statistic: 33.3%
Number of permutations: 3 (All possible permutations)
Number of permuted statistics greater than or equal to Global R: 1
Of course very little can be achieved with only 3 possible permutations, so if you don't have a lot of cases, permutations won't help much.
Then I used PERMANOVA, which has the option of using a Monte Carlo method to calculate the significance, so the power in PERMANOVA is not that dependent on the number of permutations, but more on the number of replicates (on the denominator degrees of freedom). Moreover, PERMANOVA analyses the actual values in the "distance" matrix, while ANOSIM ranks the values in the "distance" matrix first (so it tests relationships among the values and not the values themselves).
PERMANOVA Results:
Data type: Correlation
Selection: All
Resemblance: Kendall rank correlation
Sums of squares type: Type III (partial)
Fixed effects sum to zero for mixed terms
Permutation method: Unrestricted permutation of raw data
Number of permutations: 999
Factors
Name|Type|Levels
group|Fixed|2
PERMANOVA table of results
Source   df   SS        MS        Pseudo-F   P(perms)   nr unique perms   P(Monte Carlo)
group    1    1.3611    1.3611    2.8824     0.345      3                 0.246
Res      2    0.94444   0.47222
Total    3    2.3056
Details of the expected mean squares (EMS) for the model
Source EMS
group 1*V(Res) + 2*S(gr)
Res 1*V(Res)
Construction of Pseudo-F ratio(s) from mean squares
Source   Numerator   Denominator   Num.df   Den.df
gr       1*gr        1*Res         1        2
Estimates of components of variation
Source     Estimate   Sq.root
S(group)   0.44444    0.66667
V(Res)     0.47222    0.68718
So this would of course need more cases to be analyzable, but I was just experimenting... :p
So, both these analyses will give you numbers and P-values for the two groups, using any measure of "difference" among cases you see fit. The thing is, I am not a statistician and I don't know if, in your particular case, this can be done (conceptually, theoretically and philosophically speaking).
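As a cross-check, the tau matrix from that toy example can be reproduced in a few lines; this is just a verification sketch using the same four answers, and could serve as the starting point for a home-made permutation test if the commercial packages are unavailable:

```python
# Reproducing the Kendall tau matrix for the four toy answers above.
from itertools import combinations
from scipy.stats import kendalltau

answers = [[1, 2, 3, 4], [2, 1, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1]]

for i, j in combinations(range(len(answers)), 2):
    tau, _ = kendalltau(answers[i], answers[j])
    print(f"tau({i + 1},{j + 1}) = {tau:+.3f}")
```

The output matches the triangular matrix in the post: 0.667 for participants 1-2, 0.333 for 1-3, -1 for 1-4, and so on.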
Miguel, what software did you use to perform the ANOSIM and PERMANOVA? I tried using PAST, but it doesn't have the Kendall tau distance. It supports Hamming and Manhattan distance for ANOSIM, but both fall short in the example that I used: there are certain differences they can't detect properly. Kendall tau distance seems better at this, but I am not an expert.
In general, variance components in an ANOVA model are used to estimate genetic parameters (heritability, repeatability, etc.) in the biological sciences. An ANOVA model can consist of a continuous (dependent) variable with discrete variable(s). This is different from your case, where there is no normality. The best choice for you is a non-parametric method. My other suggestion is that the Multiple Correspondence Analysis technique can be used effectively to visualize interactions among the levels of many discrete variables if you have a large sample size. For more information on performing Multiple Correspondence Analysis, please click the following link: http://www.unt.edu/rss/class/Jon/SPSS_SC/Module9/M9_Correspondence/SPSS_M9_Correspondence1.htm
I hope it will be useful for you.
Ecevit EYDURAN,
Editor of The JIST
Dear Michael,
When you expose your problem as below, with your main purpose being whether the "repartition" (as for a statistical distribution) of a "preference" (1 to n), with the importance of rank, is similar between the two groups, it makes me think of set theory, which I learned during my mathematical studies: at a basic level, the aim is to put two sets in relation, and we can define a common part (intersection) and relations between the groups (unilateral, bilateral).
In your case and with this representation, the two groups are different, and the "preferences" of each participant differ from one another according to the size of the common surface between the groups (the number of characteristics found in common, and the link between each member of one group and all the members of the second group...).
Perhaps some specialists in this branch of mathematics could study your problem (see the attached representation).
Best regards
Didier
Both ANOSIM and PERMANOVA are non-parametric; however, they may be influenced if there are significant differences in multivariate dispersion (which you can test using a procedure called PERMDISP). I used PRIMER-E v6 with the PERMANOVA package (http://www.primer-e.com/), which is really great and ecology-oriented, but you have to pay for a license. However, specific FORTRAN programs to run PERMANOVA and PERMDISP can be found on Marti Anderson's website: http://www.stat.auckland.ac.nz/~mja/Programs.htm
However, I think you can perform both ANOSIM and PERMANOVA in R (package vegan): http://perceval.bio.nau.edu/downloads/igert/IntroR-Course_Notes/R-Course_Day3.pdf
One thing about ecological data (the only data I am used to) is that assumptions of normality or homoscedasticity are rarely met, so the most recent multivariate methods available are non-parametric, such as these (and they allow the use of any distance or dissimilarity measure you think best illustrates your hypothesis). Anyway, please read the fundamentals of these analyses to see whether you can apply them to your data (you can even try to contact Marti herself).
Anyway, I am just an ecologist, so please give more weight to what statisticians tell you. :)
The easiest but erroneous way of doing it may be to sum responses within each group and perform a between-group comparison. This is erroneous in the sense that a respondent who scores 2251 will have the same sum of scores as one who scores 4411, while the two perceptions or appreciations of the situation are quite different.
SPSS provides a solution to your problem. Using Multiple Response Analysis, SPSS will count and aggregate, for each group, the number of occurrences of all the categories (possible responses). The same group of tests offers the possibility of crosstabulating Multiple Response Sets; you can crosstab the categories' scores with the grouping variable (the variable under which the groups, two in your case, are labelled) to distribute the MRS scores across the two groups. A Chi-Square test can then help you appreciate the difference statistically. Learn more about Multiple Response Analysis in SPSS and you will better appreciate what I am trying to explain. MRA is a simple counting technique, but it is the most accurate tool for analyzing and comparing multiple responses to a set of variables.
Regards.
Ecevit, I definitely agree that non-parametric is the way to go with my data. I tried the MCA example, but I have no idea how to enter my data so that it makes sense in SPSS. In the example, family income and class standing are given per individual. My data would need one variable (if we perceive the answer as a set), or 5 variables if each part of the set is considered separately. The problem with the latter is that this is not exactly what the participants answered. Consider these two surveys:
1. Please rearrange the list ABCDE in any order you think is preferable.
2. i. Rate A on a scale from 1-5 based on your preference.
ii. Rate B on a scale from 1-5 based on your preference.
iii. Rate C on a scale from 1-5 based on your preference.
...
In the first survey each letter gets a unique value which no other letter can take, while in the second survey someone can rate all the letters 5, or any other number, if they want. Treating the data as standalone variables ignores the fact that participants could not reuse a rating (1-5) once they had used it for another letter. I hope this makes sense.
-----
Didier, the theory of sets looks extremely promising, at least visually, but I am going to have to find a procedure in a statistical package and see if I can apply it to my problem.
-----
Nana, I created a set in SPSS (small example). The MRA will also force me to make the same assumption I describe above, but I'm having trouble understanding the results. The table in SPSS looks like this:
Group, A, B, C, D
1.0 1.0 2.0 3.0 4.0
2.0 2.0 1.0 4.0 3.0
1.0 1.0 2.0 3.0 4.0
2.0 1.0 2.0 4.0 3.0
I grouped ABCD using Analyze > Multiple Response > Define Variable Sets and set them as categorical with a range from 1-4. See the file I attached for the results.
I believe the problem is that MRA is designed for questionnaires where individuals can give multiple responses but are not forced to do so for each case. In my case, where I ask them to reorder the list, they always have to return a preference number for each of the 5 variables (4 in this example), and each number can be used only once, for one variable.
---
Please let me know if there is something that i misunderstood.
Michael, this is a typical example of an analysis of ranking responses. The process I explained in my first intervention still holds here. The error we often commit is to believe that every type of data set can be handled globally; no. Some statistical analyses are more structural, and this is often the case with ranking responses. Following your explanation, your design is OK to me, but how do we analyze the data?
First of all, handle the two surveys separately.
In SPSS, define income as a grouping variable, and then a variable for each indicator subjected to ranking. I suggest that you categorize income to shift it from scale to ordinal; for instance 1 (5-10000), 2 (10001-20000), 3 (20001-40000). Categorize based on a local socio-economic indicator, for instance the minimum salary recommended by law: a first group may be those who fall below this line, and the other groups can be defined subsequently.
Let's assume that Nana is a respondent and my income is 10000. I rank indicator A '1', indicator B '2', indicator C '4', indicator D '4' and indicator E '5' for the first survey, and then indicator A '2', indicator B '2', indicator C '4', indicator D '4' and indicator E '5' for the second survey.
My variable names are as follows: Survey for the two surveys (two categories: 1 for the first survey and 2 for the second), Incom for income, IdA for indicator A, IdB for indicator B, IdC for indicator C, IdD for indicator D and IdE for indicator E.
This is how my information will look like in the data base:
Survey  Incom  IdA  IdB  IdC  IdD  IdE
1       1      1    2    4    4    5
2       1      2    2    4    4    5
Do the same for all the respondents.
First of all forget about income and run simple crosstabulation using Survey as the independent variable (push in Row) and the five Indicators for ranking as dependent variable (Push in Column). Count within Row.
SPSS will generate five crosstabs (tables).
Design a contingency table with survey and the five indicators having a column in the table. Organize the results of the five crosstabs generated by SPSS in this your own table just by filling information (frequency and percent) in the corresponding cells and appreciate it. The table in fact will have 3 rows (the first one for the variable, the second on for the first survey and the third one for the second survey). Even without running any significance test, you can already appreciate the trend of responses. If you earlier asked SPSS to calculate Chi-Square test, you can freely appreciate the level of difference between the first and the second survey for each of the five indicators. You can now also calculate a chi-Square test comparing all the five indicators between the two surveys at once. This can be done using SPSS but you should know how to go about it. With Epi-Info 6.04d, it more straight forward.
You will be more confident if you succeed in this first phase. Once this is done, integrate income into the analysis as follows.
Crosstab Survey and the five indicators layered by income (that is, pushing Income into the third box, Layer, of the Crosstabs window) and run the crosstabs.
Redesign your contingency table, but now put income in the first column and divide each income row in two (the first row for the first survey and the second for the second survey), then fill the cells systematically with the corresponding information from the SPSS table. SPSS has also generated Chi-Square significance values, if you did not omit ticking this option.
Examine your table, comment on it and draw your conclusions.
Social science analysis, unlike experimental analysis, is much more complex because, besides mathematical calculations, it requires a lot of structural analysis and organization. Also, results shall be presented in a way that allows readers to freely appreciate the distribution of weights between categories. The comparison shall first of all be visual, before being complemented with mathematical calculations of P-values.
Regards.
Nana, I made all the calculations up to the contingency table, but it still doesn't make sense. Maybe I am doing something wrong.
Okay, so I built a dummy table where none of IdA, IdB, IdC, IdD can be 0 and they are always within {1,4}. I skipped income because I just compare the two surveys.
Survey, IdA, IdB, IdC, IdD
1, 1, 2, 4, 3
1, 2, 1, 3, 4
2, 3, 4, 2, 1
2, 4, 3, 1, 2
I ran a crosstabs analysis and the results are in the PDF I attached to this message. Based on this, the contingency table will look like this:
Survey | IdA, IdB, IdC, IdD
Survey 1 | 2, 2, 2, 2
Survey 2 | 2, 2, 2, 2
I may be wrong, but the contingency table will always have equal frequencies for each survey, because each participant always provides an answer for each variable.
I could create for each survey a contingency table based on the answers, which would look like this (each row represents a value, 1-4):
Survey 1
I , IdA , IdB, IdC, IdD
1, 1, 1, 0, 0
2, 1, 1, 0, 0
3, 0, 0, 1, 1
4, 0, 0, 1, 1
Survey 2
I , IdA , IdB, IdC, IdD
1, 0, 0, 1, 1
2, 0, 0, 1, 1
3, 1, 1, 0, 0
4, 1, 1, 0, 0
Looking at this, I can definitely say that the hierarchies (the order into which participants put A-D) differ between the two surveys. The challenge from this point on is how to compare these two contingency tables.
Your crosstabs are semantically very OK, indicating that you can define variables and run some statistical tests in SPSS. Congrats. The summary tables for survey 1 and 2 are also fine by me. From your crosstabs I could build a sample compiled contingency table for you and calculate the level of difference between the two surveys; but my problem is your sample size, just the sample size for now. How many people were effectively surveyed? Anyway, this is how your compiled contingency table will look. Using the Chi-Square test, compare survey 1 and 2 for each of the IDs and for the aggregate (the total scores for each rank; for instance, the total number of those who ranked 1 in surveys 1 and 2 respectively. Do the same for ranks 2, 3, 4 and 5).
Rank | IDA           | IDB           | IDC           | IDD           | IDE           | Aggregate
     | Surv1   Surv2 | Surv1   Surv2 | Surv1   Surv2 | Surv1   Surv2 | Surv1   Surv2 | Surv1   Surv2
1    |
2    |
3    |
4    |
5    |
Chi-Square test
Regards.
My table came out scattered. However, it is simple to explain: Rank stands as a column, and each ID and the Aggregate have a column each. Each of the columns for the IDs and the Aggregate is split into two sub-columns (Surv 1 and Surv 2).
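Compiling that table from the raw rankings is mechanical. A small sketch with invented data (three respondents per survey, ranks 1-5 over IdA-IdE):

```python
def rank_counts(data, n_items=5, n_ranks=5):
    """counts[item][rank-1] = number of respondents who gave that
    item that rank. Each row of `data` is one respondent's ranks
    for the items, in order IdA..IdE."""
    counts = [[0] * n_ranks for _ in range(n_items)]
    for respondent in data:
        for item, rank in enumerate(respondent):
            counts[item][rank - 1] += 1
    return counts

survey1 = [[1, 2, 3, 4, 5], [2, 1, 3, 4, 5], [1, 2, 4, 3, 5]]  # invented
survey2 = [[5, 4, 3, 2, 1], [4, 5, 3, 1, 2], [5, 4, 2, 3, 1]]  # invented

c1, c2 = rank_counts(survey1), rank_counts(survey2)
# Aggregate column: total number of rank-1 answers, rank-2 answers, ...
agg1 = [sum(c1[item][r] for item in range(5)) for r in range(5)]
```

Note that with a forced ranking each respondent uses every rank exactly once, so the aggregate column comes out as the same constant for both surveys; the per-indicator columns are where the two surveys can actually differ.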
The sample size is not determined yet, because I want to make sure that the survey is designed in such a way that I can properly analyze the data afterwards. I could get even 150 participants to complete the survey (maybe more, considering that it will be a volunteer sample).
Nana, thank you for the detailed answer. I now understand your analysis. Basically, I isolate each item as if it were a single variable between two groups and then perform non-parametric tests to see the difference between each variable for the two groups. That is indeed a fairly straightforward analysis. The question is: does it apply in my case, and will it stand up to peer review?
As I wrote previously, the survey can be designed in two ways:
1. Please rearrange the list of ABCDE in any order you think is preferable
2. i. Rate A on scale from 1 - 5 based on your preference
ii. Rate B on scale from 1 - 5 based on your preference
iii. Rate C on scale from 1 - 5 based on your preference
...
The difference here is that the first case imposes a restriction on participants: A, B, C, D and E must be placed into a single order. This restriction is desirable for me because I want them to set an order based on preference.
So is the separate analysis of each variable valid in my case? Could it stand to reason?
The first approach I suggested is first of all necessary, because it enables you to compare survey 1 and 2 for each of the items, and then for the aggregated scores. You can then compare all the items at once to see whether preference differs significantly among them. This is possible using the Friedman test for several related variables. At this level, however, you have to work directly with the raw data in SPSS, not with the summary contingency table. You can first run Friedman for the entire data set (the two surveys combined), then split the file (Data menu) on the grouping variable (Survey) so as to have Friedman for survey 1 and survey 2 separately. As you can see, this is pure mathematics; in a social survey, emphasis shall be placed on visual appreciation of trends, and that is why designing a contingency table like the one we agreed on is essential before any calculation of P-values.

You can complement Friedman (which compares more than two related variables) with a pairwise comparison test for two independent samples (you can find one under the non-parametric tests group) so as to compare, within each item, survey 1 against survey 2. The output of Friedman can be compared with that obtained with Chi-Square. You can also appreciate the consistency with which respondents rank the various items using the Cronbach Alpha reliability coefficient, and the relationships between items (for instance, does preference for A imply preference for B?) using inter-item correlation coefficients; this will be done with the entire data set as well as separately for surveys 1 and 2, using the Split File or Select Cases function, so as to appreciate the difference. These tests are complementary, and this triangulation approach will definitely give a good appreciation of the variability of your data if it is properly done. You can now go ahead and collect your data, and we will see after that. Regards.
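As a sanity check on what the SPSS Friedman procedure reports, the statistic is simple to compute by hand when the data are already ranks with no ties. A sketch with invented data: four respondents who all give the identical ranking, which yields the maximum possible value n(k-1).

```python
def friedman(ranks):
    """Friedman chi-square statistic and degrees of freedom.
    Each row of `ranks` is one respondent's ranking of k items
    (ranks 1..k, no ties)."""
    n, k = len(ranks), len(ranks[0])
    col_sums = [sum(col) for col in zip(*ranks)]  # rank sum per item
    q = 12.0 / (n * k * (k + 1)) * sum(s * s for s in col_sums) - 3 * n * (k + 1)
    return q, k - 1

# Four respondents, perfect agreement on the order A < B < C < D < E.
q, df = friedman([[1, 2, 3, 4, 5]] * 4)
```

With n = 4 and k = 5, perfect agreement gives q = 16, the ceiling n(k-1); the statistic is referred to the chi-square distribution with k-1 degrees of freedom.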
I had some time to test all of the above, so I created a dummy table and filled it with random cases.
The contingency table is indeed visually extremely helpful, especially with graphs. For the chi-square, though, I am not sure I can perform it, since 100% of the cells have expected counts less than 5. However, it doesn't matter, because I can conduct an analysis of variance on the actual data.
For the variable tests within the same survey, I do expect the Friedman and Wilcoxon matched-pairs tests to show statistically significant differences, because the survey is designed that way (a forced ordering of ABCDE).
For comparing the two surveys on each of their 5 variables, I think a Mann-Whitney test (for pairs) is appropriate, since the samples (surveys) are independent. This will allow me to test for differences between the surveys for each variable.
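A Mann-Whitney U for one indicator between the two independent surveys can be sketched as follows (ranks are invented; midranks handle ties, and the smaller of the two U values is returned, as in most tables):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U for two independent samples."""
    pooled = sorted(x + y)

    def midrank(value):
        first = pooled.index(value) + 1   # rank of first occurrence
        ties = pooled.count(value)        # number of tied values
        return first + (ties - 1) / 2.0   # average (mid) rank

    r1 = sum(midrank(v) for v in x)       # rank sum of sample 1
    n1, n2 = len(x), len(y)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    return min(u1, n1 * n2 - u1)

# Ranks given to one indicator by respondents in survey 1 vs survey 2.
u = mann_whitney_u([1, 2, 3], [4, 5, 6])  # complete separation -> U = 0
```

SPSS's two-independent-samples procedure reports the same U together with its P-value.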
However, in order to be able to tell whether the independent variable (the two surveys) affects all 5 of my dependent variables, I would need an equivalent of a non-parametric MANOVA test. PERMANOVA or NPMANOVA seems able to do that. The paper referenced for NPMANOVA in PAST is here for anyone interested: http://stg-entsoc.bivings.com/PDF/MUVE/6_NewMethod_MANOVA1_2.pdf
I am going to have to read it in detail and see if I can use it in my case. In any case, based on all of your advice, even without NPMANOVA I think that with the contingency table, Friedman and Mann-Whitney tests I can sufficiently show whether the two surveys produced similar hierarchies.
Congrats to the ResearchGate Mathematics and Applied Statistics link for their high sense of selflessness and sharing.
I am not sure I understood you completely, but if you want to measure the difference in agreement between different groups about their preferences for the list of items, you can use Cohen's kappa, Fleiss' kappa or Krippendorff's alpha. These links may help you understand the details.
http://stats.stackexchange.com/questions/132609/comparing-inter-rater-agreement-between-classes-of-raters
http://www.real-statistics.com/reliability/
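For the simplest two-rater case, Cohen's kappa is straightforward to compute by hand; Fleiss' kappa generalizes the same idea to many raters. A sketch with invented labels:

```python
def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters labelling the same n items."""
    n = len(rater1)
    categories = set(rater1) | set(rater2)
    # Observed agreement: fraction of items labelled identically.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal proportions.
    p_e = sum((rater1.count(c) / n) * (rater2.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

kappa = cohens_kappa([1, 1, 2, 2], [1, 1, 2, 2])  # perfect agreement -> 1.0
```

Kappa runs from -1 (systematic disagreement) through 0 (chance-level agreement) to 1 (perfect agreement).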