Governments and public health systems need accurate and timely information about the characteristics and behaviour of COVID-19 to respond appropriately to this ongoing public health emergency. Researchers, public health authorities, and the general public all benefit from reliable and promptly available data for evaluating the impact of the coronavirus pandemic on health care systems and for planning an appropriate policy response at all levels of government. Currently, governments and policymakers throughout the world are being forced to make decisions and take action based on mathematical models developed for other diseases and/or on the experience of countries where the outbreak was detected and developed earlier. In this situation, high-quality institutional datasets are a prerequisite for the analysis that public health, an inherently data-intensive domain, requires. Effective data quality assessment during data collection would help ensure that studies conducted in different parts of the world reach concordant results.
There are several institutional repositories of public health data capable of electronic data collection and dissemination, such as the datasets of public health information systems (PHIS), each with its own data quality assessment methods and standards. However, poor data quality and coding errors in PHIS are not new issues, and they can lead to inaccurate inferences about health interventions. For COVID-19, the multi-source datasets of the World Health Organization (WHO), the European Centre for Disease Prevention and Control, and the Chinese Center for Disease Control and Prevention (Chinese CDC) are reputable references for global BI dashboards and academic research. They report counts of confirmed, severe, suspected and recovered cases, as well as deaths. These resources are widely used to monitor trends in the outbreak and to assess the risks of the pandemic in several countries and regions.
In the recently published paper
Ashofteh, Afshin and Bravo, Jorge M. ‘A Study on the Quality of Novel Coronavirus (COVID-19) Official Datasets’. 1 Jan. 2020 : 1 – 11. (available at https://content.iospress.com/articles/statistical-journal-of-the-iaos/sji200674 or https://doi.org/10.3233/SJI-200674)
we analysed and compared the quality of the official datasets available for COVID-19. We used comparative statistical analysis to evaluate the accuracy of data collection by one national organisation (Chinese Center for Disease Control and Prevention) and two international organisations (World Health Organization; European Centre for Disease Prevention and Control), based on the size of their systematic measurement errors. We combined Excel files, text mining techniques and manual data entry to extract the COVID-19 data from official reports and to build an accurate profile for comparison. The findings show noticeable measurement errors in all three datasets, and these errors grew as the outbreak expanded and more countries contributed data to the official repositories. This raises concerns about data comparability and points to the need for better coordination and harmonized statistical methods. The study offers a combined COVID-19 dataset and dashboard with minimum systematic measurement errors, together with valuable insights into the potential problems of using databanks without carefully examining the metadata and additional documentation that describe the overall context of the data.
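As a rough illustration of what comparing sources on systematic error can look like, the following Python sketch computes, for each source, its average signed relative deviation from the cross-source median on each date. This is not the procedure used in the paper: the choice of the median as reference, the function names, the source labels and all counts are assumptions made up for the example.

```python
from statistics import mean, median

# Hypothetical cumulative confirmed-case counts for the same four dates,
# as reported by three sources. All values are invented for illustration.
reports = {
    "source_a": [100, 220, 450, 900],
    "source_b": [100, 210, 450, 880],
    "source_c": [98, 220, 470, 900],
}


def relative_deviations(reports):
    """Relative deviation of each source from the cross-source median, per date.

    Assumes every source covers the same dates and the median is non-zero.
    """
    names = list(reports)
    n_dates = len(reports[names[0]])
    deviations = {name: [] for name in names}
    for t in range(n_dates):
        reference = median(reports[name][t] for name in names)
        for name in names:
            deviations[name].append((reports[name][t] - reference) / reference)
    return deviations


def mean_signed_deviation(deviations):
    """Average signed deviation per source.

    A value far from zero suggests the source is consistently above or
    below the reference, i.e. a systematic rather than random discrepancy.
    """
    return {name: mean(values) for name, values in deviations.items()}


bias = mean_signed_deviation(relative_deviations(reports))
for name, value in bias.items():
    print(f"{name}: {value:+.4f}")
```

In this toy data, a source that always matches the median gets a value of zero, while a source that tends to under-report gets a negative value. A real comparison would also need to account for differences in reporting times and case definitions between sources, which a simple numeric deviation cannot capture.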
The dataset and dashboard are available at:
Ashofteh, Afshin; Bravo, Jorge (2020), “COVID-19 data set resulted from a study on the quality of Novel Corona-virus official datasets”, Mendeley Data, v1 https://dx.doi.org/10.17632/nw5m4hs3jr.1 with reference to dashboard. doi: 10.17632/nw5m4hs3jr.2, available from: http://dx.doi.org/10.17632/nw5m4hs3jr.2
The comparison of the datasets provides valuable insights into the potential problems of using databanks that pool information from many countries without carefully examining the metadata and additional documentation that describe the content and overall context of the data. Developing guidelines, standards, and ontologies for data documentation is crucial if researchers and policymakers are to understand the context in which data were created and collected. Moreover, the changes in how confirmed cases and deaths have been classified in China point to similar problems that may arise in other countries. Such problems call for careful, regular forensic analysis to understand how definitions are applied and to what extent data are comparable. There is a growing need to harmonize and standardize data gathering, reporting, and analysis.
Although this analysis was conducted at a relatively early stage of the epidemic and additional datasets have since become available, the discussion of how to identify measurement errors remains timely, useful, and important.