How do you determine the cut-off value of independent variables (e.g. pack-years of smoking, parity, amount of alcohol, etc.) to treat as a risk factor when analyzing ORs in a case-control study?
Hi Dipesh, I realize that the medical literature is rather obsessed with odds ratios, but in general, it's best not to throw away information, which is what you do when you dichotomize continuous data, assuming that's what you are starting with. Some things are spectacularly bad ideas, such as the "median split," where you divide the data at the median. The reason this is such a bad idea is that you are drawing an arbitrary line right where most of the data are. So imagine, for example, you are talking about the number of cigarettes smoked per day. You might be drawing the line between a large number of people who smoke 20 and a similarly large number of people who smoke 21. In other words, just a difference of one reported cigarette per day would have moved these smokers from one analytic category to the other. So let me repeat: despite medical tradition, you are most often best off retaining all your original data.
But moving on, in many specialties you will find that the literature already has a working standard. This would be true for things such as "low birth weight" and "very low birth weight." The values used for birth weight have been associated with any number of short-term and long-term clinical outcomes. Nonetheless, when I modeled birth weight as an outcome, I used a continuous variable, as you can see in the article "Neighborhood support and the birth weight of urban infants," which you can see on page 2 of my profile, but the reviewers asked us to insert stuff on LBW as a dichotomous outcome. On alcohol, for example, Walter Willett at Harvard came up with a dichotomous version of binge drinking that is pretty widely used.
So my answer is basically two parts: first, use continuous variables as much as you can, as they contain more information; second, before trying to establish your own cutoffs, look for established cutoffs in the literature. Bob
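To make the point about the median split concrete, here is a rough sketch in Python (simulated data; the variable names, sample size, and effect sizes are all made up for illustration, and the statsmodels formula interface is assumed to be available) comparing a logistic model that keeps the exposure continuous with one that splits it at the median:

```python
# Rough sketch on simulated data: continuous exposure vs. a median split.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 1000
cigs_per_day = rng.poisson(lam=20, size=n)             # hypothetical exposure
logit_p = -3.0 + 0.08 * cigs_per_day                   # assumed "true" log-odds
case = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))     # simulated outcome

df = pd.DataFrame({"case": case, "cigs": cigs_per_day})
df["cigs_high"] = (df["cigs"] > df["cigs"].median()).astype(int)   # median split

continuous_fit = smf.logit("case ~ cigs", data=df).fit(disp=False)
split_fit = smf.logit("case ~ cigs_high", data=df).fit(disp=False)

# The continuous model gives an OR per extra cigarette; the split model
# lumps a 21-a-day smoker together with a 40-a-day smoker.
print("OR per extra cigarette:", round(float(np.exp(continuous_fit.params["cigs"])), 2))
print("OR high vs low        :", round(float(np.exp(split_fit.params["cigs_high"])), 2))
print("AIC continuous / split:", round(continuous_fit.aic, 1), "/", round(split_fit.aic, 1))
```

On data like this, the dichotomised model usually fits noticeably worse (higher AIC), because all of the within-category variation in exposure has been discarded.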
Thank you, Robert... but in the literature there is also no consistency in how cutoffs are used (some make a dichotomous variable, some make multiple categories, and even among dichotomous versions the cut-off values vary). In such a situation, what should we do? What is the principle behind establishing a cut-off value? And if a cut-off value is established for one type of disease, will it be equally applicable to another disease (e.g. smoking vs lung cancer, smoking vs gallbladder cancer)?
Well, I fear that the true reason for such widespread use of cutoffs in the literature is the inability of many people to intellectually cope with continuous data (no political implication, though).
That said, as outlined by Robert (with whom I fully agree), there are sometimes good reasons for discretising bona fide continuous data: when there are traditional, clinically meaningful cutoffs used in the field ("low birth weight"), or when you want to make your results comparable with other work (Willett's binge drinking).
Closer to the data, relationships between biological phenomena are not always monotonic (and even less often linear). Both a deficiency and an excessive intake of the same nutrient may, for instance, be risk factors for the same cancer (although I do not expect such a phenomenon for smoking). In that case the use of cutoffs, either derived from the data or from thresholds otherwise established in the field, may be useful for presenting the results.
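If you suspect such a U-shaped (non-monotonic) relationship, one way to check before committing to any cutoff is to compare a plain linear logistic model with one that adds a quadratic term (or splines). A minimal sketch on simulated data; all names and numbers are hypothetical:

```python
# Minimal sketch: test for a non-monotonic (U-shaped) exposure-outcome relationship.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
intake = rng.normal(loc=50, scale=15, size=n)          # hypothetical nutrient intake
logit_p = -2.0 + 0.002 * (intake - 50) ** 2            # assumed risk rises at both extremes
case = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))
df = pd.DataFrame({"case": case, "intake": intake})

linear_fit = smf.logit("case ~ intake", data=df).fit(disp=False)
quad_fit = smf.logit("case ~ intake + I(intake ** 2)", data=df).fit(disp=False)

# A clearly better fit for the quadratic model hints that a single
# "high vs low" cutoff would misrepresent the dose-response shape.
print("AIC linear   :", round(linear_fit.aic, 1))
print("AIC quadratic:", round(quad_fit.aic, 1))
```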
But in many cases there is no biological rationale justifying splitting your data, and "similar studies" have all used different cutoff schemes. Besides, as your example shows, there is no a priori reason why the same cutoff of an exposure factor should be equally relevant for different outcomes.
This is why, before deciding on data splitting, you should carefully study the distribution of your continuous exposure variable as well as its relationship with the outcome. You should also be wary of "automatic" methods such as median or n-tile splits (quintiles were quite fashionable a decade ago, for computational convenience, I guess) that are unlikely to be clinically meaningful, cannot be compared between studies, and may yield quite heterogeneous strata.
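As a quick illustration of how an automatic n-tile split can behave on skewed exposure data, here is a small sketch (simulated, right-skewed "pack-years"; the numbers are purely illustrative) showing how unequal the resulting strata can be:

```python
# Small sketch: what a quintile split does to a right-skewed exposure.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
pack_years = pd.Series(rng.gamma(shape=1.5, scale=10, size=500))   # hypothetical, right-skewed

quintile = pd.qcut(pack_years, q=5)   # "automatic" quintile split
# The top quintile typically spans a far wider range of exposure than the
# bottom one, so the strata are quite heterogeneous.
print(pack_years.groupby(quintile).agg(["count", "min", "max"]))
```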
In short, you should know your data and devise the best, most honest way of presenting it. In many cases, sticking with continuous data is the most scientifically sound approach, in that it minimises assumptions about the data structure. Yet it is not always the easiest way to publish in a medical journal. It is sometimes possible to present a continuous analysis alongside discretised results, for illustration purposes.
It is true that using continuous variables generally helps you get the most information out of your data, but there are times when using a cutoff is necessary, for example to establish treatment guidelines. Blood pressure is an example of this: anything over 140 mmHg systolic is considered high blood pressure because pressures above this level are associated with adverse health outcomes, and the cutoff may also be used to guide treatment with BP medications. The 140 figure used to be higher, until epidemiologic studies showed that systolic BP as low as 140 was associated with bad outcomes too.
The cutpoint at which to categorize your variable can be based on many things. But first, if a statistical method (such as linear regression) can be used with the continuous data, by all means use it. Logistic regression will let you use either continuous or binary predictors. One way to tweak the cutpoints when categorizing your data for logistic regression is to compute sensitivity, specificity, and positive/negative predictive values for the results, to see how your model performs at predicting the outcome at various cutpoints.
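Here is a minimal sketch of that idea (simulated data; the exposure, the candidate cutpoints, and the effect size are all hypothetical): treat "exposure at or above the cutpoint" as the classifier and tabulate sensitivity, specificity, PPV and NPV at each candidate value:

```python
# Minimal sketch: screen candidate cutpoints by sensitivity/specificity/PPV/NPV.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1000
exposure = rng.gamma(shape=2.0, scale=10, size=n)          # e.g. pack-years (hypothetical)
p = 1 / (1 + np.exp(-(-2.5 + 0.06 * exposure)))            # assumed true risk model
case = rng.binomial(1, p)

rows = []
for cut in [10, 20, 30, 40]:                               # candidate cutpoints
    exposed = exposure >= cut
    tp = np.sum(exposed & (case == 1))                     # exposed cases
    fp = np.sum(exposed & (case == 0))                     # exposed controls
    fn = np.sum(~exposed & (case == 1))                    # unexposed cases
    tn = np.sum(~exposed & (case == 0))                    # unexposed controls
    rows.append({
        "cutpoint": cut,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    })

print(pd.DataFrame(rows).round(2))
```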
Also, for putting your results into practical terms, consider an NNT (number needed to treat) analysis.
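The arithmetic behind NNT is simple, as the tiny sketch below shows; the event rates are made-up illustration values:

```python
# Tiny sketch: NNT is the reciprocal of the absolute risk reduction (ARR).
risk_without_intervention = 0.12   # hypothetical event rate without the intervention
risk_with_intervention = 0.08      # hypothetical event rate with it
arr = risk_without_intervention - risk_with_intervention
nnt = 1 / arr
print(f"ARR = {arr:.2f}, NNT = {nnt:.0f}")   # ARR = 0.04, NNT = 25
```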
If you can use more than one cutpoint, you may wish to stratify your data and do a stratified analysis. This will (hopefully) establish, for example, an increasing risk of the outcome with increasing exposure: the classic example is increasing mortality rates as the number of cigarettes smoked per day increases. (You can do this with one cutpoint too, but using three or more makes a stronger case for a dose-response effect.)
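A rough sketch of such a stratified dose-response check (simulated data; the cutpoints and effect size are hypothetical), computing the odds ratio of each exposure stratum against the lowest one:

```python
# Rough sketch: odds ratios across exposure strata as a dose-response check.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 4000
cigs = rng.integers(0, 41, size=n)                     # cigarettes per day (hypothetical)
p = 1 / (1 + np.exp(-(-2.5 + 0.07 * cigs)))            # assumed true risk model
df = pd.DataFrame({"case": rng.binomial(1, p), "cigs": cigs})

# Three cutpoints -> four ordered strata; the lowest is the reference.
df["stratum"] = pd.cut(df["cigs"], bins=[-1, 10, 20, 30, 40],
                       labels=["0-10", "11-20", "21-30", "31-40"])

ref = df[df["stratum"] == "0-10"]
a0 = ref["case"].sum()                                 # cases in the reference stratum
b0 = len(ref) - a0                                     # controls in the reference stratum
for level in ["11-20", "21-30", "31-40"]:
    grp = df[df["stratum"] == level]
    a = grp["case"].sum()
    b = len(grp) - a
    print(f"{level}: OR vs 0-10 = {(a * b0) / (b * a0):.2f}")
```

Steadily rising odds ratios across the strata would support a dose-response effect.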
Some studies use the bottom ten percent of a value as the cutpoint, for example for birth weight, if you can find normative data with which to classify your data. Or you can convert your continuous data to Z-scores or a percentile rank and use, for example, the lower 25% or upper 10% as a cutoff. It depends on what your variable is, of course.
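A minimal sketch of the Z-score / percentile-rank approach (simulated birth weights; the lower-25% and upper-10% cutoffs are just the examples mentioned above):

```python
# Minimal sketch: Z-scores, percentile ranks, and percentile-based cutoffs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
birthweight_g = rng.normal(loc=3300, scale=500, size=800)   # hypothetical values, in grams

z = stats.zscore(birthweight_g)                             # (x - mean) / sd
pct_rank = stats.rankdata(birthweight_g) / len(birthweight_g) * 100

low_quartile = birthweight_g <= np.percentile(birthweight_g, 25)   # lowest 25%
top_decile = birthweight_g >= np.percentile(birthweight_g, 90)     # highest 10%

print("first record: z =", round(float(z[0]), 2),
      "| percentile rank =", round(float(pct_rank[0]), 1))
print("flagged low (bottom 25%):", int(low_quartile.sum()),
      "| flagged high (top 10%):", int(top_decile.sum()))
```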
In short, you just have to play with the numbers. But as others have said, try to find some basis for your cutpoint in the literature, or something that makes biological sense (physiology frequently exhibits thresholds, for example in muscle fatigue or the transition of cells into carcinoma). Or consider cutpoints that have practical value, such as for diagnosing or treating disease, or for quantifying the population-level health impact of the risk factor, including economic impacts.