Categorising normally distributed data?

Hi Mehreen,

Briefly, it could depend on the data and the question you wish to answer; you could do it based upon some sort of theoretical background. Perhaps see if any literature exists that offers definitions of 'low', 'medium', or 'high' for the variable in question.

If no theoretical definitions exist for the categories, a general and relatively simple solution is to use tertiles: find the two cut points that divide the sample into three groups of equal size.
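As a rough illustration of the tertile idea, here is a minimal Python sketch. The data are simulated (normal, mean 5, SD 1) purely as a stand-in for the real variable, and the names `values` and `tertile_label` are made up for this example.

```python
import random
import statistics

random.seed(0)
# Simulated stand-in for the real variable (illustrative only)
values = [random.gauss(5, 1) for _ in range(300)]

# quantiles(n=3) returns the two cut points that split the sample into thirds
lower_cut, upper_cut = statistics.quantiles(values, n=3)

def tertile_label(x):
    """Assign 'low', 'medium' or 'high' from the two tertile cut points."""
    if x <= lower_cut:
        return "low"
    if x <= upper_cut:
        return "medium"
    return "high"

labels = [tertile_label(x) for x in values]
counts = {name: labels.count(name) for name in ("low", "medium", "high")}
print(f"cut points: {lower_cut:.3f}, {upper_cut:.3f}")
print(counts)
```

By construction the three groups come out (almost exactly) the same size, whatever the shape of the distribution.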

Hopefully that helps!

Keston

Salvatore S. Mangiafico

I think it depends on what interpretation you want "high" to mean. In some cases "high" might mean greater than the 99th percentile or greater than the 95th percentile. A more general classification might say "low" is less than the 25th percentile and "high" is greater than the 75th percentile. This would also match how the data are divided by a box plot.
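A small sketch of the 25th/75th percentile version, again on simulated data rather than the real variable. The function name `quartile_label` is only illustrative.

```python
import random
import statistics

random.seed(42)
# Simulated data for illustration only (normal, mean 5, SD 1)
values = [random.gauss(5, 1) for _ in range(400)]

# quantiles(n=4) returns the three quartile cut points: Q1, median, Q3
q1, q2, q3 = statistics.quantiles(values, n=4)

def quartile_label(x):
    """'low' below the box, 'high' above it, 'medium' inside it."""
    if x < q1:
        return "low"
    if x > q3:
        return "high"
    return "medium"

labels = [quartile_label(x) for x in values]
share_medium = labels.count("medium") / len(labels)
print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, share inside the box = {share_medium:.2f}")
```

With these cut points about half of the observations land in "medium", since that group corresponds to the box of the box plot.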

Caleb A. Aldridge

I may be confusing myself here, so I'm asking for clarity. When you ask, "Will working out percentiles help?", do you mean percentiles of the data (the share of observations) or percentages of the range of the data?

Take for example the data in the screenshot: normal data with a mean of 5 and a std. dev. of 1. The median, which splits the observations into two equal halves, is 4.9496, while the midpoint of the range of the data is 4.9942.
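To make the distinction concrete, a short sketch on freshly simulated data (so the numbers will not match the 4.9496 and 4.9942 from the screenshot). It compares the median, which halves the observations, with the midpoint of the range, which halves the scale.

```python
import random
import statistics

random.seed(1)
# Simulated normal data, mean 5 and SD 1 (not the data from the screenshot)
values = [random.gauss(5, 1) for _ in range(1000)]

median = statistics.median(values)            # halves the observations
midrange = (min(values) + max(values)) / 2    # halves the range

below_median = sum(x < median for x in values)
below_midrange = sum(x < midrange for x in values)

print(f"median   = {median:.4f}, observations below: {below_median}")
print(f"midrange = {midrange:.4f}, observations below: {below_midrange}")
```

The median always leaves half the sample on each side, whereas the midrange depends only on the two most extreme observations, so the group sizes it produces can be uneven.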

At any rate, I think Keston raises an important point about theoretical divisions. I can think of only a very few situations in which binning continuous data is more useful for analysis or description than keeping it continuous. (https://www.ma.utexas.edu/users/mks/statmistakes/dividingcontinuousintocategories.html) (https://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable)

John-Kåre Vederhus

Hi Mehreen. Dividing continuous scales into categories is problematic, as reviewers will often question your logic. We've had some success (not too many questions from reviewers) in categorizing ordinal scales,

Article The courage to change: Patient perceptions of 12-Step fellowships

but I suppose you have a continuous variable. If the description of your scale gives no hint on how to divide it into meaningful categories (as @Keston talks about), you are left with the suggestions above: dividing it into three groups based on the 25th and 75th percentiles, or just "forcing" the data into three equally sized groups. As you have normally distributed data, I believe it is the tails that are most interesting (you could try to use that as an argument for your division in the article), and I would recommend @Salvatore's suggestion (25th/75th percentile). I suppose the 5th and 95th percentiles are also possible, but that will leave you with very few in the "high" and "low" groups.
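To show how the choice of cut-offs drives the group sizes, here is a hedged sketch on simulated data with a hypothetical sample of 400. The helper `group_sizes` is invented for this example.

```python
import random
import statistics

random.seed(2)
# Simulated sample of 400 (illustrative only; use your own data instead)
values = [random.gauss(5, 1) for _ in range(400)]

# quantiles(n=100) returns 99 cut points: the 1st to the 99th percentile
percentiles = statistics.quantiles(values, n=100)

def group_sizes(low_pct, high_pct):
    """Count observations below the low and above the high percentile."""
    low_cut = percentiles[low_pct - 1]
    high_cut = percentiles[high_pct - 1]
    low = sum(x < low_cut for x in values)
    high = sum(x > high_cut for x in values)
    return {"low": low, "medium": len(values) - low - high, "high": high}

sizes_25_75 = group_sizes(25, 75)
sizes_5_95 = group_sizes(5, 95)
print("25th/75th:", sizes_25_75)
print("5th/95th: ", sizes_5_95)
```

With 400 observations, the 5th/95th split leaves only about 20 cases in each extreme group, against about 100 with the 25th/75th split.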

Best

John-Kåre

Mehreen Riaz Faisal

Thank you everyone for your valuable input. Appreciate it.

Jonathan Izudi

Dear Faisal, I think you have been ably answered. Remember that once you change continuous data into low, moderate, and high, it becomes categorical data, and more specifically ordinal data. If it is the outcome of interest, that will dictate the approach to the statistical analysis. I think it would also be prudent to use a well-established scale (where feasible) to set the cut-offs.
