A year and a half later, I can now answer my own question. I couldn't find an existing script, so I attempted to code the DOE calculation myself. My code could reproduce some, but not all, of the scores of the DUD ligands reported by Vogel et al. Recently, though, I found existing code for calculating the DOE, along with other useful metrics for benchmarking ligand sets, on the DeepCoy GitHub page, which also cites the relevant publication.

To calculate the DOE on an existing ligand set without generating new decoys, the code has to be revised; I was able to do this without much difficulty. The revised DeepCoy DOE scores agreed with those reported by Vogel et al. to within a 9% difference (except for one unexplained outlier). I'm considering posting the code revision, since it adds a different use case. The actual code for calculating the DOE is in the "decoy_utils.py" script in the "evaluation" directory, which is used in tandem with the "select_and_evaluate_decoys.py" script.
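To show what the metric computes, here is a minimal sketch of a DOE-style calculation, not the DeepCoy implementation itself. It assumes my reading of Vogel et al.: for each active, rank the other actives and all decoys by distance in (pre-normalized) physicochemical property space, build the ROC curve for recovering actives, and take the absolute area between that curve and the diagonal; the DOE score is the mean over actives (0 = perfect embedding, 0.5 = worst). The function name `doe_sketch` and the plain Euclidean distance are my own choices; check "decoy_utils.py" for the actual implementation details.

```python
import numpy as np

def doe_sketch(active_props, decoy_props):
    """Illustrative DOE-style score: mean absolute area between each
    active's property-space ROC curve and the diagonal.

    active_props, decoy_props: (n, n_properties) arrays, assumed already
    normalized to comparable scales. Needs at least two actives.
    """
    actives = np.asarray(active_props, dtype=float)
    decoys = np.asarray(decoy_props, dtype=float)
    deviations = []
    for i, a in enumerate(actives):
        # Rank the remaining actives (positives) and the decoys
        # (negatives) by distance to this active in property space.
        others = np.delete(actives, i, axis=0)
        d_act = np.linalg.norm(others - a, axis=1)
        d_dec = np.linalg.norm(decoys - a, axis=1)
        dists = np.concatenate([d_act, d_dec])
        labels = np.concatenate([np.ones(len(d_act)), np.zeros(len(d_dec))])
        labels = labels[np.argsort(dists, kind="stable")]
        # ROC curve: cumulative true/false positive rates down the ranking.
        tpr = np.concatenate([[0.0], np.cumsum(labels) / labels.sum()])
        fpr = np.concatenate(
            [[0.0], np.cumsum(1.0 - labels) / (len(labels) - labels.sum())]
        )
        # Trapezoidal integration of |ROC - diagonal| over FPR.
        dev = np.abs(tpr - fpr)
        area = np.sum(0.5 * (dev[1:] + dev[:-1]) * np.diff(fpr))
        deviations.append(area)
    return float(np.mean(deviations))
```

For actives and decoys that are completely separated in property space, every ROC curve hugs the axes and the score comes out at the worst case of 0.5; decoys drawn from the same property distribution as the actives push it toward 0.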
In doing this work I also found a separate set of statistical metrics useful, described by Rohrer and Baumann in a 2008 paper in J. Chem. Inf. Model. Those statistics have been coded by Dr. Ellingson and are on GitHub as "SmallMolEval". Using that script isn't completely straightforward, because usage instructions are lacking, but the workflow can be figured out by examining the code if you have enough coding experience.
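For flavor, here is a minimal sketch in the spirit of the nearest-neighbour spatial statistics Rohrer and Baumann describe: an empirical G(t) (fraction of actives whose nearest other active lies within distance t) and F(t) (fraction of decoys whose nearest active lies within t). The function names, the Euclidean distance, and the exact form of the statistics are my assumptions for illustration; consult the paper and the SmallMolEval code for the real definitions.

```python
import numpy as np

def nn_ecdf(distances, thresholds):
    """Empirical cumulative distribution of nearest-neighbour distances."""
    d = np.asarray(distances, dtype=float)[:, None]
    return (d <= np.asarray(thresholds, dtype=float)).mean(axis=0)

def spatial_stats_sketch(actives, decoys, thresholds):
    """Illustrative G(t)/F(t) nearest-neighbour statistics on points in a
    (pre-normalized) descriptor space. Returns (G, F), one value per
    threshold t: G(t) for active-to-active distances, F(t) for
    decoy-to-nearest-active distances.
    """
    actives = np.asarray(actives, dtype=float)
    decoys = np.asarray(decoys, dtype=float)
    # Nearest-neighbour distance among the actives (excluding self).
    aa = np.linalg.norm(actives[:, None] - actives[None, :], axis=-1)
    np.fill_diagonal(aa, np.inf)
    g_dists = aa.min(axis=1)
    # Distance from each decoy to its nearest active.
    da = np.linalg.norm(decoys[:, None] - actives[None, :], axis=-1)
    f_dists = da.min(axis=1)
    return nn_ecdf(g_dists, thresholds), nn_ecdf(f_dists, thresholds)
```

Comparing the two curves indicates how clustered the actives are relative to how the decoys are embedded around them: G(t) rising well above F(t) at small t signals self-clustered actives poorly covered by decoys.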