Hello everyone,

I have a dataset of diabetic patients that has been used to train an XGBoost model on several outcomes, such as stroke and amputation, among others. Originally we used the continuous numeric variables as-is, but we found the results too vague: for example, age only told us that the older the patient, the higher the risk of stroke.

As physicians, however, we need a narrower range, so we divided those variables into bins. This did give us more insight. Nonetheless, we are seeing that some contiguous intervals of the same variable appear very close together in our results.
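
For readers who want to reproduce the binning step, here is a minimal sketch using only the Python standard library. Only the 64-78 and 79-88 bins come from our results; the other bin edges, the labels, the function name `age_bin`, and the sample ages are made up for illustration, and ages are assumed to be non-negative whole years.

```python
from bisect import bisect_right

# Lower bound of each bin. Only 64-78 and 79-88 are real bins from the post;
# the remaining edges are hypothetical placeholders.
LOWER_BOUNDS = [0, 50, 64, 79, 89]
LABELS = ["<50", "50-63", "64-78", "79-88", "89+"]

def age_bin(age):
    """Return the label of the bin that contains `age` (whole years)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many lower bounds are <= age,
    # so subtracting 1 gives the index of the containing bin.
    return LABELS[bisect_right(LOWER_BOUNDS, age) - 1]

ages = [45, 64, 78, 79, 88, 93]          # invented sample ages
bins = [age_bin(a) for a in ages]
print(bins)
```

With pandas available, `pandas.cut` would do the same job on a whole column at once.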

Continuing with the age example, bin(64-78) and bin(79-88) appear one after the other, and no other bin of the age variable appears. So we think that the best approach in this case is to find the optimal cutpoint at which age starts to become a risk factor for stroke.

Then I came across this document (https://www.mayo.edu/research/documents/biostat-79pdf/doc-10027230), which explains how to find such cutpoints in SAS. I am not experienced enough to program this myself, so I would like to know: how can I find these cutpoints in Python?

Please restrict answers to Python. I have already seen R, SAS, and even SPSS examples, but none in Python. There must be some way to do this in Python.
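
To make the question concrete, the following is a minimal sketch of one simple way to look for a single cutpoint with a binary outcome. It scans every observed value and keeps the one that maximises Youden's J (sensitivity + specificity - 1). This is an assumption about what "optimal" could mean, not necessarily the method of the linked document. The function name `best_cutpoint` and the toy data are invented for illustration.

```python
def best_cutpoint(values, outcomes):
    """Return (cutpoint, J) maximising Youden's J.

    A patient counts as 'high risk' when value >= cutpoint.
    `outcomes` holds 1 for an event (e.g. stroke) and 0 otherwise.
    """
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")

    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(values)):              # every observed value is a candidate
        tp = sum(1 for v, y in zip(values, outcomes) if v >= cut and y == 1)
        fp = sum(1 for v, y in zip(values, outcomes) if v >= cut and y == 0)
        # sensitivity + specificity - 1 simplifies to TPR - FPR
        j = tp / n_pos - fp / n_neg
        if j > best_j:                           # ties keep the lowest cutpoint
            best_cut, best_j = cut, j
    return best_cut, best_j

# Invented toy data, not taken from the real cohort.
ages   = [50, 55, 60, 62, 64, 66, 70, 75, 80, 85]
stroke = [0,  0,  0,  1,  0,  1,  1,  1,  1,  1]

cut, j = best_cutpoint(ages, stroke)
print(cut, round(j, 3))
```

Because the cutpoint is chosen to look as good as possible on the same data, it should be checked on held-out patients before it is trusted. With SciPy or scikit-learn available, `scipy.stats.chi2_contingency` or `sklearn.metrics.roc_curve` could replace the hand-written loop.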
