Personally, I think referring to previous studies (prior knowledge) is the best approach. Alternatively, ROC analysis may help to suggest a cut-off based on the study sample. For example, using blood pressure as a proxy measure for the risk of heart attack (I am making this up), you can first test the association between blood pressure (continuous independent variable) and the binary outcome of heart attack (dependent variable) in a logistic regression model. You can then obtain the ROC curve and find the best cut-off according to the sample distribution, adopt this cut-off for blood pressure, and treat the dichotomized variable as your study outcome in the subsequent analysis. There is no standard method for doing this. I think the key is to make sure your design can be compared with previous studies and that the findings can be generalized to the wider population rather than being specific to your sample.
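As a rough illustration of the ROC idea, here is a minimal sketch that scans every observed value as a candidate cut-off and keeps the one with the highest Youden's J (sensitivity + specificity - 1). Youden's J is my assumption for what "best" means; other criteria exist. The function name `youden_cutoff` and the readings are made up, and a statistical package would normally do this for you.

```python
def youden_cutoff(values, outcomes):
    """Return (cutoff, sensitivity, specificity) maximising Youden's J.

    A subject is classified as positive when value >= cutoff.
    Assumes outcomes are coded 1/0 and both classes are present.
    """
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    best = None
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, outcomes) if v >= cutoff and y == 1)
        tn = sum(1 for v, y in zip(values, outcomes) if v < cutoff and y == 0)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Made-up systolic blood pressure readings (mmHg) and heart attack outcomes
sbp = [110, 118, 122, 125, 130, 135, 138, 142, 150, 155, 160, 170]
mi = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

cutoff, sens, spec = youden_cutoff(sbp, mi)
print(cutoff, round(sens, 2), round(spec, 2))
```

Keep in mind that a cut-off chosen this way is tuned to the sample at hand, which is why comparing it against values used in previous studies matters.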
To set a reference point or cut-off for converting quantitative variables into binary variables for use in logistic regression, proceed as follows:
1. First, collect the values of the relevant risk factors.
2. Collect some articles related to your study.
3. Find out how the risk factors are classified in your study references.
For example:
If you are doing research related to diabetic retinopathy, you would fix the classifications as follows.
For binary logistic regression analysis:
Dependent variable: binary (1 and 0)
Independent variables: gender, family history of DM, and history of hypertension as binary values; HDL, VLDL, triglycerides, HbA1c, systolic and diastolic blood pressure, BUN, and other variables as numerical values.
Then you would perform a binary logistic regression analysis.
Thank you very much for your answer. My problem is finding the cut-off point that divides my continuous data into 2 groups. I have read a few articles, and they use the median, the 25th percentile, the 75th percentile, and so on. Is there any advice regarding the best choice?
Yes, I think that with normally distributed data you first have to compare the groups using a t-test / ANOVA; then, if there is a significant difference, you can proceed with one group above and one group below the mean. This is valid if you want two groups; otherwise you have to look for a universally accepted reference value, e.g. for body mass index (20-25).
You may also find it helpful to look at the implications of the cut-off values. I will use systolic blood pressure readings as an example of a continuous variable whose cut-off value can have different implications. One option is to choose a cut-off value that warrants a medical diagnosis, such as a systolic blood pressure reading equal to or greater than 140 mmHg (=1) versus a reading below 140 mmHg (=0). Another option is to choose a cut-off value that calls for a different type of intervention, such as a systolic blood pressure reading equal to or greater than 180 mmHg with signs and symptoms of malignant hypertension (=1), which may require emergency admission, compared with readings below 180 mmHg (=0), which may require self-administered oral treatment without the need for emergency admission.
Furthermore, a reasonable cut-off value is one whose practical implications you can expand on when discussing your statistical findings. Hence, the first cut-off value (i.e., 140 mmHg) may be preferable if you are examining an association of hypertension with another illness such as atherosclerosis. On the other hand, the second cut-off value (i.e., 180 mmHg with signs and symptoms of malignant hypertension) may be used in a study assessing factors related to the financial cost of emergency admission.
Finally, a cut-off value with practical implications may provide equally valuable information for practice whether or not the association is statistically significant.
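To make the two codings concrete, here is a small sketch using the 140 mmHg and 180 mmHg thresholds from the example above. The readings, the `symptoms` flags, and the helper name `dichotomize` are made up for illustration.

```python
def dichotomize(values, cutoff):
    """Code each reading as 1 when it is >= cutoff, otherwise 0."""
    return [1 if v >= cutoff else 0 for v in values]


# Made-up systolic readings (mmHg) and a hypothetical flag for
# signs and symptoms of malignant hypertension
sbp = [118, 139, 140, 165, 180, 192]
symptoms = [False, False, False, True, False, True]

# Coding 1: diagnostic threshold (>= 140 mmHg)
hypertension = dichotomize(sbp, 140)

# Coding 2: intervention threshold (>= 180 mmHg AND symptoms present)
emergency = [
    1 if high and s else 0
    for high, s in zip(dichotomize(sbp, 180), symptoms)
]

print(hypertension)
print(emergency)
```

The same readings give two different binary variables, so the research question should decide which coding is used.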
You can find a cut-off from past reviews or make an educated guess, but whatever you do, you have to write down the rationale for the cut-off, which has to have practical or medical significance depending on the context of your research.
Thank you all very much; your contributions are really valuable. The problem is that I didn't find any previous studies using the same data and converting it from continuous to binary.
I will try a couple of the suggested approaches and see how things go.
If there are no previous studies, I would use the median as the cut-off and then check whether it makes sense to compare the two resulting groups with regard to the outcome.
I agree with Celine Cardoso: using the median (a central parameter not affected by extreme values) is the best approach. That holds even if you do not test the data for normality.
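A median split can be sketched in a few lines. The scores below are made up, and sending values equal to the median to the lower group is my assumption; you would need to state your own rule for ties.

```python
from statistics import median


def median_split(scores):
    """Return the median and a 1/0 coding (1 = strictly above the median)."""
    m = median(scores)
    return m, [1 if s > m else 0 for s in scores]


# Made-up total scores; the extreme value 40 does not move the median
scores = [12, 15, 15, 18, 20, 22, 25, 40]
m, groups = median_split(scores)
print(m, groups)
```

With many tied scores the two groups can end up unequal in size, so it is worth checking the group counts after the split.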
My research is about the impact of an accreditation program on quality improvement in public hospitals. One aspect of the research is concerned with accreditation benefits, and this is where I need to convert the quantitative variables into binary ones to divide the responses into 2 groups: people who see accreditation as beneficial and others who are sceptical of it.
How many questions are in your scale, and what are they? You may want to determine the weight of each question (for example, high importance or low importance) and then assign scores. After that, decide whether you want to use the median, the mean (if there are no large outliers in the aggregate scores, as these may significantly skew your measure of central tendency), or a trimmed mean.
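Here is a minimal sketch of that scoring idea, assuming made-up item weights and totals. The helper names `weighted_score` and `trimmed_mean` and the 10% trimming proportion are hypothetical choices for illustration only.

```python
from statistics import mean, median


def weighted_score(responses, weights):
    """Sum of item responses, each multiplied by its (hypothetical) weight."""
    return sum(r * w for r, w in zip(responses, weights))


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the given proportion of values from each tail."""
    k = int(len(values) * proportion)
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


# One respondent: three items, the first weighted as high importance
example_score = weighted_score([4, 5, 3], [2, 1, 1])

# Made-up aggregate scores with one large outlier (90)
totals = [30, 32, 35, 36, 38, 40, 41, 43, 45, 90]
print(example_score, mean(totals), median(totals), trimmed_mean(totals))
```

In this made-up sample the outlier pulls the plain mean upward, while the median and the trimmed mean stay close to each other.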
Accreditation is one independent variable; I have 6 more, for a total of 7 independent variables.
Accreditation benefits are 15 in total, and this is the dependent variable to be used in the multiple logistic regression analysis once it is converted into a binary variable, i.e. an appropriate cut-off point is required to classify staff as either having a good perception of accreditation benefits or being sceptical of them.
Two entirely pragmatic approaches would be: (1) take your continuous predictor and use quartiles to divide it into 4 groups of equal size, which gives similar standard errors for each group. Treat this new variable as ordinal (1st, 2nd, 3rd, 4th group) and then fit orthogonal polynomials of the 1st order (underlying linear), 2nd order (underlying quadratic), and so on, testing for improved fit as you make the model more complex; choose the most parsimonious model.
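As a sketch of the coding step in approach (1), the snippet below assigns quartile groups and attaches the standard orthogonal polynomial contrast coefficients for 4 equally spaced levels. The data and the names `quartile_group` and `CONTRASTS` are made up; fitting the nested models and testing the improvement in fit is left to your statistical package.

```python
from statistics import quantiles

# Standard orthogonal polynomial contrasts for 4 equally spaced levels
CONTRASTS = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}


def quartile_group(values):
    """Return a group index 0..3 for each value, based on sample quartiles."""
    q1, q2, q3 = quantiles(values, n=4)
    # Count how many quartile boundaries each value exceeds
    return [sum(v > q for q in (q1, q2, q3)) for v in values]


# Made-up continuous predictor
x = [110, 118, 122, 125, 130, 135, 138, 142, 150, 155, 160, 170]
groups = quartile_group(x)

# Polynomial terms for each subject; add them to the model one order at a time
linear_term = [CONTRASTS["linear"][g] for g in groups]
quadratic_term = [CONTRASTS["quadratic"][g] for g in groups]
print(groups)
```

Because the contrasts are mutually orthogonal, adding the quadratic term does not change what the linear term measures when the groups are of equal size.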
(2) Use tree regression to find the best cross-validated cutoff -see for example
These methods split the continuous predictor into finer and finer parts as long as doing so predicts a hold-out sample more accurately; they can also handle interactions automatically.
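To show the underlying idea only, here is a heavily simplified stand-in: a one-split tree (a "stump") whose cut-off is chosen on a training half and checked on a hold-out half. It assumes higher values predict the outcome, uses misclassification count instead of an impurity measure, and uses a single hold-out split instead of full cross-validation. The data and function names are made up; a real tree package does far more.

```python
def best_split(values, outcomes):
    """Cut-off that minimises misclassifications when predicting 1 for value >= cutoff."""
    ordered = sorted(set(values))
    # Candidate cut-offs are midpoints between adjacent distinct values
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

    def errors(c):
        return sum((v >= c) != bool(y) for v, y in zip(values, outcomes))

    return min(candidates, key=errors)


def accuracy(cutoff, values, outcomes):
    """Share of subjects classified correctly by the cut-off."""
    hits = sum((v >= cutoff) == bool(y) for v, y in zip(values, outcomes))
    return hits / len(values)


# Made-up predictor values and binary outcomes
x = [110, 118, 122, 125, 130, 135, 138, 142, 150, 155, 160, 170]
y = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

# Even positions form the training half, odd positions the hold-out half
x_train, y_train = x[0::2], y[0::2]
x_test, y_test = x[1::2], y[1::2]

cutoff = best_split(x_train, y_train)
train_acc = accuracy(cutoff, x_train, y_train)
test_acc = accuracy(cutoff, x_test, y_test)
print(cutoff, train_acc, test_acc)
```

In this tiny made-up sample the cut-off fits the training half perfectly but does much worse on the hold-out half, which is the kind of over-fitting that cross-validation is meant to expose.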