12 November 2017

I am working on the topic "Utility Enhancement for Textual Document Redaction and Sanitization". In the literature on de-identification of medical documents, I have noticed that privacy models sanitize more than necessary because they also sanitize negated assertions (for example, "AIDS negative"). I want to exclude negated assertions before sanitizing a medical document, which should improve the utility of the sanitized document.

I would like to know which dataset is appropriate for this work. I tried to use the 2010 i2b2 dataset, but I could not find its metadata. For comparison, the 2014 i2b2 de-identification challenge (Task 1) consists of 1304 medical records for 296 patients, of which 790 records (178 patients) are used for training and the remaining 514 records (118 patients) for testing. The records are a fully annotated gold-standard set of clinical narratives. The PHI categories are grouped into seven main categories with 25 associated sub-categories, and the distribution of PHI categories in the training and test corpora is known; for example, the test corpus contains 764 ages, 4980 dates, 875 hospitals, etc. I am looking for the same information for the 2010 i2b2 dataset, which I have not been able to find so far.
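To make the intended filtering step concrete, here is a minimal sketch of excluding negated assertions before sanitization. It is only an illustration under stated assumptions, not an existing tool or a method from the literature: the trigger list `NEGATION_TRIGGERS`, the three-token window, and the function names `is_negated` and `terms_to_sanitize` are all hypothetical, and a real system would need a proper negation lexicon and scope rules.

```python
import re

# Hypothetical, deliberately tiny trigger list; a real lexicon would be much larger.
NEGATION_TRIGGERS = {"negative", "no", "not", "denies", "without"}


def _tokens(text):
    """Lowercase word tokens of a string."""
    return re.findall(r"\w+", text.lower())


def _occurrences(tokens, term_tokens):
    """Start indices at which term_tokens occurs inside tokens."""
    n = len(term_tokens)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == term_tokens]


def is_negated(sentence, term, window=3):
    """True if some occurrence of `term` has a trigger within `window` tokens.

    The fixed window is an assumption made for this sketch.
    """
    tokens = _tokens(sentence)
    term_tokens = _tokens(term)
    n = len(term_tokens)
    for i in _occurrences(tokens, term_tokens):
        before = tokens[max(0, i - window):i]
        after = tokens[i + n:i + n + window]
        if any(t in NEGATION_TRIGGERS for t in before + after):
            return True
    return False


def terms_to_sanitize(sentence, sensitive_terms):
    """Keep only sensitive terms that appear in the sentence and are not negated."""
    tokens = _tokens(sentence)
    return [
        term for term in sensitive_terms
        if _occurrences(tokens, _tokens(term)) and not is_negated(sentence, term)
    ]


if __name__ == "__main__":
    print(terms_to_sanitize("Patient is AIDS negative.", ["AIDS"]))
    print(terms_to_sanitize("Patient was diagnosed with AIDS in 2009.", ["AIDS"]))
```

In this sketch, the term in the first sentence is skipped because a trigger word follows it, while the term in the second sentence is still passed on to the sanitizer.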

Thank you.
