I'm working on a project to identify repeated musculoskeletal cases in a dataset of 3597 electronic medical records (EMRs) covering various conditions. The filtering options available to me are limited, so I'm looking for advice on the most effective sampling methodology for this task.
Given the size of the dataset and the specific focus on musculoskeletal cases, which sampling techniques or methodologies would you recommend for efficiently identifying and separating out the repeated cases? Are there particular statistical approaches or strategies that would help optimize this process while keeping the sample representative?