I'm not sure whether I follow your question. There are several statistical software packages that already give you the mean based on Rasch estimates; for example, you can use Rasch, XCalibre, R, and others.
If I understood you correctly, the calibration was done with a Rasch program and the results were then exported to SPSS?
If so, you should have two files, one for persons and one for items. If that is true, you can simply request descriptive analyses for both. Doing that, you will obtain the mean of the persons' ability and the mean of the items' difficulty, respectively.
For a more accurate response, please add more information to your question.
I am using Winsteps 3.92.1 software to analyse the survey data. The data achieved construct validity (both internal and external) and internal consistency (including the separation reliability) as per the requirements of the Rasch model. I have also exported the person measures for use in SPSS, and I have done the item difficulty analysis. The results were interesting.
I am now running the parametric statistical tests in SPSS 24 using the input data generated from Winsteps (person measures, after deleting the outliers). SPSS provided the descriptive statistics for the four scales/domains of the survey in this research. The mean scores range from -0.32 to 3.56. Based on the Rasch calibration, I have divided the mean scores into Very low (below -1.0), Low (-0.9 to 0.69), Middle (0.70 to 1.5), High (1.6 to 2.6), and Very high (above 2.6). I am not sure whether the mean scores that fall within these bands validate the construct under study. I looked for literature pertaining to this but couldn't find any. I am seeking the experts' views on this: how do we interpret these mean results provided by SPSS, based on the input data from the Rasch analysis?
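For concreteness, here is a minimal sketch of how such a banding could be applied to the exported person measures (the file name and column name are hypothetical; the cut-points simply restate the bands above, and the exact boundary handling is a judgment call):

```python
import pandas as pd

# Hypothetical export of Winsteps person measures (logits), one row per respondent.
measures = pd.read_csv("person_measures.csv")  # assumed column: "measure"

# Band edges approximating the scheme above; pd.cut uses right-inclusive intervals,
# so e.g. (0.69, 1.5] corresponds to the "Middle" band.
bins = [float("-inf"), -1.0, 0.69, 1.5, 2.6, float("inf")]
labels = ["Very low", "Low", "Middle", "High", "Very high"]
measures["band"] = pd.cut(measures["measure"], bins=bins, labels=labels)

# Count of persons and mean measure per band.
print(measures.groupby("band", observed=False)["measure"].agg(["count", "mean"]))
```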
If you have references pertaining to this question, I would be more than happy to read and get insights.
I would advise you to follow the APA guidelines for psychometric studies. I'm not sure why the quartiles would be evidence of validity without an external measurement to compare with. Do you have an external measurement to validate your findings? For example, in an intelligence test you would have students who scored higher and lower on your test, and it is common to correlate such scores with GPA.
From a psychometric perspective, it doesn't matter whether you are using the raw scores (classical test theory) or IRT scores.
Please find below three links that may help you out.
The analysis was based on an attitudinal test, where a group of secondary school teachers was surveyed on a four-point Likert scale (strongly disagree to strongly agree) across different questionnaire items. Since the test items were designed specifically for this study (for the first time), there is no external measurement to validate the findings. However, the survey items achieved construct validity, reliability, and the other psychometric requirements of the Rasch model. I am now undertaking ANOVA, MANOVA, and regression tests (in SPSS) based on the person-measure input data to ascertain the significance of group differences and the predictability of the independent variables on the dependent variables.
Anyway, thank you very much for your comments, suggestions and reference links. The articles were insightful.
Pema, the score ranges that you constructed are not evidence of the validity of the instrument until they correspond to a definition of those ranges as they pertain to the underlying construct. This definition can come from a separate measure of the same construct or it can come from a team of experts who can make determinations about the degree of performance on the scale.
Select individuals from each of your four classifications of very low, middle, high, and very high. Does their inclusion within one of the four groups coincide with external evidence? Under a second opinion that uses different information, would they fall within the same classification?
The score ranges were constructed based on an expert's suggestion. He is an expert in the Rasch model and SPSS who helps me when he is free. I thought I would get some confirmatory views from the audience here, not to say that I question the expert's claim.
I will definitely try your suggestion about selecting individuals from the four classifications and ascertaining whether they fall within the same classification. There is no external measure to cross-check against, as this test is the first of its kind.
According to Bond and Fox (2007), the reliability cutoff value is .70, and the acceptable infit mean square range is from .70 to 1.30. Can we consider these infit cutoff values as the external evidence to compare with the newly constructed classification?
The Rasch construct validity cutoff ranges from -3 to +3 logits, so can I use this as an external measure to validate the newly constructed mean score classification (Very low (below -1.0), Low (-0.9 to 0.69), Middle (0.70 to 1.5), High (1.6 to 2.6), and Very high (above 2.6))?
I would remain grateful for your insights on this, please.
Construct validity is the degree to which an instrument measures what it is intended to measure. It’s not clear to me why you need these five categories of estimates of ability. Is there an expected relationship between inclusion within these five categories and the underlying construct? Otherwise, comparing your results to something else that is known (e.g., expert opinion of ability estimate) will help with validity claims.
Initially, the raw data were collected from the survey participants based on the following rating scale:
Strongly disagree - 1
Disagree - 2
Agree- 3
Strongly agree - 4
The above raw scores were Rasch analysed and achieved construct validity along with case and person reliability. The person scores were then used to run analyses of variance (ANOVA, MANOVA, etc.) in SPSS after deleting the extreme scores (which were identified from the Winsteps output).
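As a rough illustration of the same step outside SPSS (the file and column names are hypothetical, and the grouping variable is only an example), a one-way ANOVA on the exported person measures might look like this:

```python
import pandas as pd
from scipy import stats

# Hypothetical file of person measures (logits) after removing extreme scores,
# with a grouping variable to compare (e.g., school or district).
df = pd.read_csv("person_measures_trimmed.csv")  # assumed columns: "measure", "group"

# One-way ANOVA: does the mean person measure differ across groups?
samples = [g["measure"].to_numpy() for _, g in df.groupby("group")]
f_stat, p_value = stats.f_oneway(*samples)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```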
Now the question is how do I determine the performance level of the persons on the different scales? I have five scales, GNHVITAL, PRF, TPGC, ISS, and ESS, each with a different mean score. For example, GNHVITAL has a mean score of -.32. How do I decide the level of performance of the participants on the GNHVITAL scale? Does the mean score of -.32 tell us that the performance level on the GNHVITAL scale is good, low, not good, or poor, or does it simply state that agreeability with the scale is low because its logit value is below 0.00? How exactly do I interpret the person mean scores in logits based on the Rasch calibration or thresholds? I need a spelled-out specification table that determines the performance levels, or the endorsability of the scale by the participants. The scales are not dichotomous but polytomous.
During a new calibration process, the persons in your calibration sample serve as the basis for not only the scale but for the interpretation of the scale. On an unknown interval scale, it becomes important to understand the characteristics of the persons in the calibration sample. Rather than defining the intervals and understanding the persons within them, do the opposite. Understand the persons along the full scale and then define the intervals. Fortunately, you do have scaled scores for these persons.
Here are some ways you could assign your cut-points:
1) Estimate the level of the trait via concurrent validity or expert opinion for persons along various portions of the scale. Set the cut-points based on these external definitions.
2) Construct a histogram of the persons' scaled scores and see if there may be natural cut-points that develop (see the sketch after this list).
3) Are there expected proportions of inclusion within each range of the scale? If so, use those proportions as a basis for your intervals.
4) Are the multiple measures correlated well? Locate the persons that score high on all of the measures and base the interval for scoring high on the ranges for these persons. Do the same for those that score low on all of the measures. This will give you two of the ranges.
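For option 2, a minimal sketch assuming the person measures have been exported to a CSV file (the file and column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of person measures (logits) for one scale.
measures = pd.read_csv("person_measures.csv")["measure"]  # assumed column name

# Inspect the distribution for natural gaps or modes that could serve as cut-points.
plt.hist(measures, bins=30, edgecolor="black")
plt.xlabel("Person measure (logits)")
plt.ylabel("Number of persons")
plt.title("Distribution of person measures")
plt.show()
```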
Feel free to message me directly if you have further questions.
Thanks for the very informative and insightful suggestions. I couldn't feel more relaxed now. Your suggestions seem valid and reliable. I will now prepare my paper following your suggestions, and I will cite you here if you don't mind.
I have constructed the histograms from the persons' scaled scores for all the survey dimensions, and they seem to provide a clear basis for determining the cutoff points. Different scales seem to provide different bases for ascertaining the mean cutoff values.
Your suggestions align with those of Dr. Vine here at the University, who helps me when he is free. I think I am now settled on how to go about determining the interval-scaled mean scores. I think I can also apply this procedure to the item scores while analysing the item difficulty hierarchy.
On the logit scale, the items or test questions are placed from the easiest to the most difficult. If the mean score is below 0.00 logits (on a scale calibrated from -5 to +5), that means there are more easy questions or items to endorse, and if the mean score is above 0.00 logits, there are more difficult questions or items to endorse. That is with regard to item difficulty analysis. Now, if my person mean score on the vital scale is 1.87, does that mean there are more able persons, or more people who agreed with the scale? I am still pondering over this question.
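One way to look at it concretely, as a sketch under the dichotomous Rasch model (the polytomous rating scale model adds category thresholds on top of this): the probability that a person located at theta logits endorses an item located at delta logits is exp(theta - delta) / (1 + exp(theta - delta)).

```python
import math

def rasch_prob(theta, delta):
    """Probability of endorsement under the dichotomous Rasch model."""
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

# Person mean of 1.87 logits against the item mean anchored at 0.00 logits:
print(round(rasch_prob(1.87, 0.0), 2))  # ~0.87, so endorsement is more likely than not
```

If that reading is right, a person mean of 1.87 against an item mean of 0.00 would indicate that, on average, respondents stand above the average item, i.e. they were more likely than not to endorse an item of average difficulty.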
The Likert-type degree of agreement items serve as one method of locating a person's placement along the continuum of the underlying trait. Is it assumed that the person who selects the strongly disagree rating has a lower level of the underlying trait than someone else who selects the strongly agree rating (accounting for any negatively-worded items)? If so, then you have an estimate of the level of the underlying trait.
A more substantial rating process would define characteristics of people with varying degrees of the underlying trait and have raters select the level that most resembles their own characteristics. These rating levels would be used to assign ordinal level scores for use in a Rasch analysis. For your study, the degree of agreement was used for the assignment of the rating level.