Large electronic data sources such as electronic health care records, insurance claims data and administrative data have opened up new opportunities for investigators to rapidly conduct pharmacoepidemiological studies and clinical trials in real-world health care settings and with a large number of patients. A concern is that these data have not been collected systematically for research on the utilisation, safety and effectiveness of medicinal products, which could affect the validity, reliability and reproducibility of the investigation. Attempts have therefore been made to create a systematic methodology for data quality assessment in order to understand the strengths and limitations of the data to answer a research question, the impact they may have on the study results and the measures to be taken to improve or complement the available data. Several data quality frameworks, which are generally concordant as regards their main quality components, have been published.
A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data (eGEMs. 2016;4(1):1244) describes a framework with three data quality categories: Conformance (with sub-categories of Value, Relational and Computational Conformance), Completeness and Plausibility (with sub-categories of Uniqueness, Atemporal and Temporal Plausibility). These categories are applied in two contexts: Verification and Validation. This framework is used by the US National Patient-Centered Clinical Research Network (PCORnet), which adds a further component, Persistence, and by the Observational Health Data Sciences and Informatics (OHDSI) network. Based on the same framework, the Data Analytics chapter of the Book of OHDSI (2020) provides an automated tool that runs data quality checks on databases conforming to the OMOP common data model. Increasing Trust in Real-World Evidence Through Evaluation of Observational Data Quality (medRxiv. 2021) describes an open source R package that executes and summarises over 3,300 data quality checks on databases conforming to the OMOP common data model.
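To make the three categories above more concrete, the following minimal Python sketch runs one illustrative check per category on a handful of toy patient records. It is not taken from any of the cited frameworks or tools: the record layout, the field names (person_id, sex, birth_date, death_date), the allowed sex codes and all function names are assumptions made for illustration only. All four checks rely solely on expectations internal to the data, which roughly corresponds to the Verification context; a Validation check would instead compare the data with an external benchmark.

```python
from datetime import date

# Hypothetical toy records; field names and values are illustrative assumptions,
# not the schema of any real common data model.
RECORDS = [
    {"person_id": 1, "sex": "F", "birth_date": date(1950, 3, 1), "death_date": None},
    {"person_id": 2, "sex": "M", "birth_date": None, "death_date": None},
    {"person_id": 3, "sex": "X9", "birth_date": date(1980, 7, 15), "death_date": date(1975, 1, 1)},
    {"person_id": 3, "sex": "F", "birth_date": date(1990, 2, 2), "death_date": None},
]

# Assumed set of permitted codes for the sex field.
ALLOWED_SEX = {"F", "M", "U"}


def value_conformance_failures(records):
    """Conformance (Value): person_ids of rows whose sex code is not permitted."""
    return [r["person_id"] for r in records if r["sex"] not in ALLOWED_SEX]


def completeness(records, field):
    """Completeness: fraction of rows in which the given field is populated."""
    if not records:
        return 0.0
    return sum(r[field] is not None for r in records) / len(records)


def temporal_plausibility_failures(records):
    """Plausibility (Temporal): person_ids of rows where death precedes birth."""
    return [
        r["person_id"]
        for r in records
        if r["birth_date"] is not None
        and r["death_date"] is not None
        and r["death_date"] < r["birth_date"]
    ]


def duplicate_person_ids(records):
    """Plausibility (Uniqueness): person_ids that occur in more than one row."""
    seen, duplicates = set(), set()
    for r in records:
        pid = r["person_id"]
        if pid in seen:
            duplicates.add(pid)
        seen.add(pid)
    return sorted(duplicates)


def run_checks(records):
    """Run all illustrative checks and collect the results in one dictionary."""
    return {
        "value_conformance_failures": value_conformance_failures(records),
        "birth_date_completeness": completeness(records, "birth_date"),
        "temporal_plausibility_failures": temporal_plausibility_failures(records),
        "duplicate_person_ids": duplicate_person_ids(records),
    }


if __name__ == "__main__":
    for name, value in run_checks(RECORDS).items():
        print(f"{name}: {value}")
```

In practice each check would be compared against a pre-specified threshold (for example, a maximum acceptable proportion of missing birth dates), and the thresholds themselves would depend on the research question.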
The Duke-Margolis Center’s Characterizing RWD Quality and Relevancy for Regulatory Purposes (2018) specifies that determining whether a real-world dataset is fit for regulatory purpose is a contextual exercise, as a data source that is appropriate for one purpose may not be suitable for other evaluations. An RWD set should be evaluated as fit-for-purpose if, within the given clinical and regulatory context, it satisfies two dimensions: Data Relevancy (including Availability of key data elements, Representativeness, Sufficient subjects and Longitudinality) and Data Quality (Accuracy, Completeness, Provenance and Transparency of data processing).
Data quality frameworks have been described for specific data sources. For example, the EMA’s Draft Guideline on Registry-based studies describes four quality components for use of patient registries (mainly disease registries) for regulatory purposes: Consistency, Completeness, Accuracy and Timeliness.