I wish to analyse normally distributed data by categorising it into three categories: low, moderate and high. What would be the best way to go about it? Will working out percentiles help, since the data is normally distributed?
Briefly, it could depend on the data and the question you wish to answer; you could do it based upon some sort of theoretical background. Perhaps see if any literature exists that offers definitions of 'low', 'medium', or 'high' for the variable in question.
If no theoretical definitions exist for what the categories are, a general and relatively simple solution may be to find tertiles: the two cut points that divide the sample into three groups of equal size.
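To make the tertile idea concrete, here is a minimal sketch using only Python's standard library. The sample is simulated (300 draws, mean 5, SD 1, chosen purely for illustration), and the names `tertile_label`, `lower_cut` and `upper_cut` are my own, not from any package.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible
# Hypothetical sample: 300 draws from a normal distribution (mean 5, SD 1)
data = [random.gauss(5, 1) for _ in range(300)]

# n=3 returns the two cut points that split the sample into three equal-sized groups
lower_cut, upper_cut = statistics.quantiles(data, n=3)

def tertile_label(x):
    """Assign a value to 'low', 'moderate' or 'high' using the tertile cut points."""
    if x <= lower_cut:
        return "low"
    if x <= upper_cut:
        return "moderate"
    return "high"

labels = [tertile_label(x) for x in data]
counts = {name: labels.count(name) for name in ("low", "moderate", "high")}
print(f"cut points: {lower_cut:.3f}, {upper_cut:.3f}")
print(counts)
```

With real data you would replace the simulated list with your own values; the three groups should come out (very nearly) the same size.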
I think it depends on what interpretation you want "high" to mean. In some cases "high" might mean greater than the 99th percentile or greater than the 95th percentile. A more general classification might say "low" is less than the 25th percentile and "high" is greater than the 75th percentile. This would also match how the data are divided by a box plot.
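As a rough illustration of how the choice of percentiles changes the group sizes, the sketch below compares a 25th/75th split with a 5th/95th split. It uses simulated data (1000 draws, mean 5, SD 1, an assumption for the example only) and a hypothetical helper `group_sizes`; only the standard library is used.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible
# Hypothetical sample: 1000 draws from a normal distribution (mean 5, SD 1)
data = [random.gauss(5, 1) for _ in range(1000)]

# n=100 returns 99 cut points; percentiles[p - 1] is the p-th percentile
percentiles = statistics.quantiles(data, n=100)

def group_sizes(lower_pct, upper_pct):
    """Count how many values fall below, between and above two percentiles."""
    low_cut = percentiles[lower_pct - 1]
    high_cut = percentiles[upper_pct - 1]
    sizes = {"low": 0, "moderate": 0, "high": 0}
    for x in data:
        if x < low_cut:
            sizes["low"] += 1
        elif x > high_cut:
            sizes["high"] += 1
        else:
            sizes["moderate"] += 1
    return sizes

quartile_split = group_sizes(25, 75)  # box-plot style division
tail_split = group_sizes(5, 95)       # stricter meaning of "low" and "high"
print("25th/75th:", quartile_split)
print("5th/95th: ", tail_split)
```

The stricter the percentiles, the fewer observations end up in the "low" and "high" groups, which matters if you plan to compare the groups afterwards.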
I may be confusing myself here, so I'm asking for clarity. When you ask, "Will working out percentiles help?", do you mean a percentage of the data or a percentage of the range of the data?
Take, for example, the data in the screenshot: normal data with a mean of 5 and a standard deviation of 1, split into halves by the median at 4.9496 and halved along the range of the data at 4.9942.
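The difference between the two kinds of split can be sketched as follows. This is not the data from the screenshot; it is a freshly simulated sample (1000 draws, mean 5, SD 1), so the numbers will differ from those quoted above. The variable names are mine.

```python
import random
import statistics

random.seed(2)  # fixed seed so the illustration is reproducible
# Hypothetical sample: 1000 draws from a normal distribution (mean 5, SD 1)
data = [random.gauss(5, 1) for _ in range(1000)]

# Split by percent of the data: half of the observations lie on each side
median = statistics.median(data)

# Split by percent of the range: the point halfway between min and max
midrange = (min(data) + max(data)) / 2

below_median = sum(x < median for x in data)
below_midrange = sum(x < midrange for x in data)
print(f"median   = {median:.4f}, observations below: {below_median}")
print(f"midrange = {midrange:.4f}, observations below: {below_midrange}")
```

The median always halves the observations, whereas the midpoint of the range depends entirely on the two most extreme values, so the two cut points need not coincide even for normal data.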
At any rate, I think Keston raises an important point about theoretical divisions. I can think of only a very few situations in which binning continuous data is more useful for analysis or description than keeping it continuous. (https://www.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html) (https://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable)
Hi Mehreen. Dividing continuous scales into categories is problematic, as reviewers will often question your logic. We've had some success (not too many questions from reviewers) in categorizing ordinal scales,
Article The courage to change: Patient perceptions of 12-Step fellowships
but I suppose you have a continuous variable. If the description of your scale gives no hint on how to divide it into meaningful categories (as @Keston talks about), you are left with the suggestions above: dividing it into three based on the 25th and 75th percentiles, or just "forcing" the data into three equally sized groups. As you have normally distributed data, I believe it is the tails that are most interesting (you could try to use that as an argument for your division in the article), and I would recommend @Salvatore's suggestion (25th/75th percentiles). I suppose the 5th and 95th percentiles are also possible, but they will leave you with very few observations in the "high" and "low" groups.
Dear Faisal, I think you have been ably answered. Remember that once you change continuous data into low, moderate, and high, it becomes categorical data, and more specifically ordinal data. If it is the outcome of interest, that will dictate the approach to the statistical analysis. I think it would also be prudent to use a well-established scale (where feasible) to set the cut-offs.