I have devised a test–retest analysis plan for my new nurse survey on current nurse-led models and preferences. A subset of items is Likert-type, grouped within domains (e.g., education). Eleven participants completed these items twice, 7 days apart.

Context: I want to evidence reliability and stability.

Proposed analyses and what each would mean:

  • ICC(3,1) per item. What it does: checks whether each Likert item yields consistent scores at Time 1 and Time 2. How I read it: higher ICC means better item stability. Useful thresholds: <0.50 poor, 0.50–0.75 moderate, 0.75–0.90 good, ≥0.90 excellent. Action: flag items below ~0.60 for review.
  • Spearman correlation of totals or subscales (T1 vs T2). What it does: checks whether respondents keep the same rank order overall between the two administrations. How I read it: higher rho means stronger whole-scale stability. Useful threshold: ≥0.70 suggests strong stability.
  • Wilcoxon signed-rank on totals or subscales. What it does: tests whether overall scores shifted up or down between Time 1 and Time 2. How I read it: a non-significant result (p ≥ 0.05) is consistent with no systematic change, though it does not prove its absence. Why I need it: helps rule out practice effects or drift.
  • Bland–Altman plot on totals or subscales. What it does: plots the difference between Time 1 and Time 2 against their average for each participant. How I read it: a mean difference near zero suggests no systematic bias, and most points falling within the 95% limits of agreement suggests good agreement. Why I need it: gives a clear visual of agreement and highlights outliers.
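To make the plan concrete, here is a minimal sketch of all four analyses in Python with NumPy and SciPy, run on made-up Likert responses for a single item (the `t1`/`t2` arrays are hypothetical, not real survey data); the ICC(3,1) is computed from the standard two-way ANOVA mean squares rather than a dedicated package:

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 Likert responses for one item, n = 11, two occasions
t1 = np.array([4, 3, 5, 2, 4, 3, 5, 1, 4, 2, 3], dtype=float)
t2 = np.array([4, 4, 5, 2, 3, 3, 5, 2, 4, 2, 3], dtype=float)

def icc31(x, y):
    """ICC(3,1): two-way mixed effects, consistency, single measurement."""
    data = np.column_stack([x, y])        # rows = subjects, cols = occasions
    n, k = data.shape
    grand = data.mean()
    ss_rows = k * np.sum((data.mean(axis=1) - grand) ** 2)   # between subjects
    ss_cols = n * np.sum((data.mean(axis=0) - grand) ** 2)   # between occasions
    ss_tot = np.sum((data - grand) ** 2)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_tot - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

icc = icc31(t1, t2)                       # item-level stability

# Rank-order stability of totals/subscales (here just the single item)
rho, p_s = stats.spearmanr(t1, t2)

# Systematic shift between occasions (zero differences are dropped by default)
w_stat, p_w = stats.wilcoxon(t1, t2)

# Bland-Altman statistics: bias and 95% limits of agreement
diffs = t2 - t1
bias = diffs.mean()
sd = diffs.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd

print(f"ICC(3,1) = {icc:.3f}")
print(f"Spearman rho = {rho:.3f} (p = {p_s:.3f})")
print(f"Wilcoxon p = {p_w:.3f}")
print(f"Bland-Altman bias = {bias:.3f}, LoA = [{loa_low:.3f}, {loa_high:.3f}]")
```

For a plot, the Bland–Altman figure would put `(t1 + t2) / 2` on the x-axis and `diffs` on the y-axis with horizontal lines at `bias`, `loa_low`, and `loa_high`. Note that with n = 11 all of these estimates will have wide confidence intervals, so they are best read descriptively.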

Please offer advice on the above.
