Dear peers,

I'm about to design a nested case-control study in which I will try to identify a biomarker (or biomarker panel) from protein expression levels (determined with mass spectrometry).

I've got a cohort of >1000 patients with endometrial cancer, of whom approx. 150 later died from the disease.

I want my biomarker(s) to be able to predict who will live and who will die from the disease.

Given that my cohort is population-based and consecutive, I believe this is exactly the type of situation in which a nested case-control (NCC) study is the proper design: obtaining new information on biomarkers is expensive and time-consuming and thus cannot be done on the entire cohort.

Now, I've come across statistical methods papers with recommendations on how to calculate the proper sample size, specifically "Improving the Quality of Biomarker Discovery Research: The Right Samples and Enough of Them" by Pepe, Li and Feng.

In their paper, they describe a situation in which a researcher wants to develop a biomarker for recurrence in low-risk colon cancer, which is very analogous to my research question. They introduce the concepts of Discovery Power and False Leads Expected, and derive sample sizes from the desired performance of the new biomarker by specifying what would be a "useful" biomarker and what would be a "useless" biomarker. In their colon cancer example, a "useful" biomarker would have the same positive predictive value (PPV) as the recurrence rate in high-risk colon cancer, i.e. 30%, and a "useless" one would have the same PPV as the recurrence rate in low-risk colon cancer, i.e. 10%.

To me, this makes sense. In a low-risk population, you would be happy with a biomarker that can identify a subset of patients with the same risk of recurrence as those with high-risk colon cancer, and you would be unhappy if your biomarker only had the same PPV as the baseline risk of recurrence in the low-risk population. For endometrial cancer, the corresponding figures are PPV=40% for a useful biomarker and PPV=10% for a useless one.

Still, I fail to calculate the sample size. They provide Supplementary Material with a worked example (see attached file), section B3. What I don't get is how a PPV of 30% results in ROC1(f)=0.39. For this, they refer back to the main paper, where they state that logit(PPV) = logit(p) + log(ROC(f)/f).
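For what it's worth, here is my attempt at inverting that formula. Rearranging logit(PPV) = logit(p) + log(ROC(f)/f) gives ROC(f) = f × odds(PPV) / odds(p). If I assume p is the baseline risk (10% in the colon cancer example) and f is 0.10 (which would match the fp_rate of 0.1 in their example; I am not certain this is the intended reading), the numbers seem to work out:

```python
from math import isclose

def roc_at_f(ppv, p, f):
    """Invert logit(PPV) = logit(p) + log(ROC(f)/f) for ROC(f).

    Rearranging gives ROC(f) = f * odds(PPV) / odds(p),
    where odds(x) = x / (1 - x).
    """
    odds = lambda x: x / (1.0 - x)
    return f * odds(ppv) / odds(p)

# Colon cancer example from the paper: useful PPV = 30%,
# baseline recurrence risk p = 10%, and (my assumption) f = 0.10.
print(round(roc_at_f(0.30, 0.10, 0.10), 2))  # 0.39

# Endometrial cancer: useful PPV = 40%, p = 10%, same f = 0.10.
print(round(roc_at_f(0.40, 0.10, 0.10), 2))  # 0.6
```

So under those assumptions I do recover 0.39, but I would appreciate confirmation that this is actually how they got it.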

They also provide a Stata command, rocsize, which can be found at https://research.fredhutch.org/diagnostic-biomarkers-center/en/software.html (or type net from https://research.fhcrc.org/content/dam/stripe/diagnostic-biomarkers-statistical-center/files/stata/ in the Stata command window and then press screensize). This command runs a simulation which gives you the Discovery Power, given your sample size input and desired alpha-level. You are expected to provide:

  • tp_rate: true positive rate, which in their example is set to ROC1(f)=0.39 (I don't understand why, and again, I don't understand how they got the value 0.39 from a PPV of 30%)
  • fp_rate: false positive rate, which in their example is set to 0.1. I don't really understand why it is 0.1.
  • nd(#): number of diseased. From their formula-based sample size calculation they got 40.
  • ndb(#): number of non-diseased. They had specified a 1:4 ratio of cases:controls, thus 160.
  • tpn(#): "the joint fixed null true rate for a sample drawn from a population with binormal ROC defined specified alternative rates (tp_rate & fp_rate) and specified sample sizes for diseased (nd(#)) and non-diseased (ndb(#)) subjects." This is verbatim from the Stata help section. I don't understand any of this, but in their example it is set to 0.1.
  • fpn(#): same as tpn(#), but for the false positive rate instead of the null true rate. In their example set to 0.1.

I tried to take a short-cut and just experiment with the values in the simulation, but I have too many unknowns.

Can anyone walk me through this paper step by step, or better yet, does anyone have an Excel file on hand that calculates the formulas they provide in their paper? I've tried making an Excel file myself, but there are too many steps missing in the calculations of the formulas provided, and some algebraic terms are not explained, so I cannot follow what they did and how they got their results.

(The paper is HHS Public Access, so I took the liberty of uploading the Supplementary Material.)

BR

Rasmus Green
