Here are some recommendations for large unstructured medical text datasets that could be valuable for natural language processing research:
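- MIMIC-III Critical Care Database: Contains deidentified clinical notes, discharge summaries, and radiology reports covering roughly 60,000 ICU admissions (over 40,000 patients). Access requires credentialing through PhysioNet, including research ethics training and a signed data use agreement.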
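- n2c2 Dataset Collections (formerly i2b2): Includes deidentified discharge summaries and other clinical text, annotated for shared tasks such as named entity recognition. Free for research use, but requires registration and a signed data use agreement rather than being openly downloadable.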
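- PubMed Central: Free full-text archive of biomedical literature with over 6 million articles. Only the Open Access Subset is licensed for bulk download and text mining, and the content is published literature rather than clinical notes. Great source but requires preprocessing.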
- UK NHS clinical notes: NHS primary care records hold millions of unstructured notes and letters, but they are not released as a single public dataset. Access requires a direct request to the data holder and ethics approval.
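As a rough illustration of the preprocessing mentioned for PubMed Central: its full-text articles are distributed as JATS XML, so a first step is usually pulling plain paragraphs out of the markup. The sketch below uses only Python's standard library. The `SAMPLE_XML` snippet is hand-written for illustration, not a real article, and `extract_paragraphs` is a hypothetical helper name. Real articles have tables, references, and formulas that this ignores.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, hand-written JATS-like snippet standing in for a downloaded article.
SAMPLE_XML = """<article>
  <front><article-meta><title-group>
    <article-title>Example title</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Background</title>
      <p>Sepsis is a <italic>leading</italic> cause of ICU admission.</p>
    </sec>
    <sec><title>Methods</title>
      <p>We reviewed   discharge
      summaries.</p>
    </sec>
  </body>
</article>"""


def extract_paragraphs(xml_text: str) -> List[str]:
    """Return the plain text of each <p> element found under <body>."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        return []
    paragraphs = []
    for p in body.iter("p"):
        # itertext() flattens inline markup such as <italic> into plain text.
        text = "".join(p.itertext())
        # Collapse runs of whitespace left over from XML line wrapping.
        paragraphs.append(" ".join(text.split()))
    return paragraphs


paragraphs = extract_paragraphs(SAMPLE_XML)
for para in paragraphs:
    print(para)
```

Section titles are skipped on purpose here, since only `<p>` elements are collected. Whether to keep them depends on the downstream task.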
For privacy reasons, unstructured medical text containing real protected health information is rarely released, so most accessible options are deidentified or synthetic. The UC Irvine Machine Learning Repository also hosts some processed, annotated sets.
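Deidentified notes usually mark removed identifiers with placeholders, and these tend to need normalizing before modeling. MIMIC-III, for example, wraps them in `[** ... **]`. The sketch below maps such placeholders to generic tokens. The token names `<DATE>` and `<PHI>`, the helper `normalize_placeholders`, and the sample note are all assumptions made for illustration, and the date check only covers one simple pattern.

```python
import re

# Matches MIMIC-III style placeholders such as [**2151-7-16**] or [**Hospital1 18**].
PLACEHOLDER = re.compile(r"\[\*\*(.*?)\*\*\]")
# Simple year-month-day pattern; other date formats fall through to <PHI>.
DATE_LIKE = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$")


def normalize_placeholders(note: str) -> str:
    """Replace each deidentification placeholder with a generic token."""

    def repl(match: "re.Match[str]") -> str:
        inner = match.group(1).strip()
        if DATE_LIKE.match(inner):
            return "<DATE>"
        return "<PHI>"

    return PLACEHOLDER.sub(repl, note)


# Hypothetical note text written for this example, not taken from any dataset.
SAMPLE_NOTE = (
    "Pt seen by Dr. [**Last Name (STitle) 1234**] on [**2151-7-16**] "
    "at [**Hospital1 18**]."
)
normalized = normalize_placeholders(SAMPLE_NOTE)
print(normalized)
```

Collapsing placeholders this way keeps the tokenizer from splitting them into many meaningless pieces. Other datasets use different placeholder conventions, so the pattern would need adjusting.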
Let me know if you need any help with data wrangling or modeling once you obtain a relevant dataset. Translating unstructured medical text into actionable insights is an impactful application of AI. Wishing you the very best in advancing this effort.
Unstructured data is immensely valuable to healthcare. “If you approach it from a high level, clinical notes are a glimpse into the physician’s brain,” says Brian Laberge, solution engineer at software and solutions provider Wolters Kluwer.