Print page Resize text Change font-size Change font-size Change font-size High contrast

Home > Standards & Guidances > Methodological Guide

ENCePP Guide on Methodological Standards in Pharmacoepidemiology


5.1. Definition and validation of drug exposure, outcomes and covariates


Historically, pharmacoepidemiology studies relied on patient-supplied information or searches through paper-based health records. This reliance has been reduced with the rapid expansion of access to electronic healthcare records and existence of large administrative databases. Nevertheless, these data sources have led to variation in the way exposures and outcomes are defined and measured, each requiring validation. Chapter 41 of Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 5th Edition, Wiley, 2012) includes a literature review of the studies that have evaluated the validity of drug, diagnosis and hospitalisation data and the factors that influence the accuracy of these data. The book presents information on primary data sources available for pharmacoepidemiology studies including questionnaires and administrative databases. Further information on databases available for pharmacoepidemiology studies is available in resources such as the ENCePP resource database and the Inventory of Drug Consumption Databases in Europe.


5.1.1. Assessment of exposure


In pharmacoepidemiology studies, exposure data originate mainly from four sources: data on prescribing (e.g. CPRD primary care data), data on dispensing (e.g. PHARMO outpatient pharmacy database), data on payment for medication (namely claims data, e.g. IMS LifeLink PharMetrics Plus) or from data collected from surveys. The population included in these data sources follows a process of attrition: drugs that are prescribed are not necessarily dispensed, and drugs that are dispensed are not necessarily ingested. In Primary non-adherence in general practice: a Danish register study (Eur J Clin Pharmacol 2014;70(6):757–63), 9.3% of all prescriptions for new therapies were never redeemed at the pharmacy, although with some differences between therapeutic and patient groups. The attrition from dispensing to ingestion is even more difficult to measure, as it involves uncertainties about what dispensed drugs are actually taken by the patients and about the patients’ ability to account accurately for their intake. In particular, paediatric adherence is additionally dependent on parents.


Exposure definitions can include simple dichotomous variables (e.g. ever exposed vs. never exposed) or they can be more detailed, including estimates of exposure windows (e.g. current vs. past exposure) or levels of exposure (e.g. current dosage, cumulative dosage over time). Consideration should be given to the level of detail available from the data sources on the timing of exposure, including the quantity prescribed, dispensed or ingested and the capture of dosage instructions when evaluating the feasibility of constructing such variables. This will vary across data sources and exposures (e.g. estimating contraceptive pill ingestion is typically easier than estimating rescue medication for asthma attacks). Discussions with clinicians regarding sensible assumptions will inform variable definition.


The Methodology chapter of the book Drug Utilization Research. Methods and Applications (M. Elseviers, B. Wettermark, A.B. Almarsdottir et al. Ed. Wiley Blackwell, 2016) discusses different methods for data collection on drug utilisation.


5.1.2. Assessment of outcomes

A case definition compatible with the observational database should be developed for each outcome of a study at the design stage. This description should include how events will be identified and classified as cases, whether cases will include prevalent as well as incident cases, exacerbations and second episodes (as differentiated from repeat codes) and all other inclusion or exclusion criteria. The reason for the data collection and the nature of the healthcare system that generated the data should also be described as they can impact on the quality of the available information and the presence of potential biases. Published case definitions of outcomes, such as those developed by the Brighton Collaboration in the context of vaccinations, are not necessarily compatible with the information available in a given observational data set. For example, information on the duration of symptoms may not be available, or additional codes may have been added to the data set following publication of the outcome definition.


Search criteria to identify outcomes should be defined and the list of codes should be provided. Generation of code lists requires expertise in both the coding system and the disease area. Researchers should also consult clinicians who are familiar with the coding practice within the studied field. Suggested methodologies are available for some coding systems (see Creating medical and drug code lists to identify cases in primary care databases. Pharmacoepidemiol Drug Saf 2009;18(8):704-7). Coding systems used in some commonly used databases are updated regularly so sustainability issues in prospective studies should be addressed at the protocol stage. Moreover, great care should be given when re-using a code list from another study as code lists depend on the study objective and methods. Public repository of codes as is available and researchers are also encouraged to make their own set of coding available.


In some circumstances, chart review or text entries in electronic format linked to coded entries can be useful for outcome identification. Such identification may involve an algorithm with use of multiple code lists (for example disease plus therapy codes) or an endpoint committee to adjudicate available information against a case definition. In some cases, initial plausibility checks or subsequent medical chart review will be necessary. When databases have prescription data only, drug exposure may be used as a proxy for an outcome, or linkage to different databases is required.


5.1.3. Assessment of covariates


In pharmacoepidemiology studies, covariates are often used for selecting and matching study subjects, comparing characteristics of the cohorts, developing propensity scores, creating stratification variables, evaluating effect modifiers and adjusting for confounders. Reliable assessment of covariates is therefore essential for the validity of results. Patient characteristics and other key covariates that could be confounding variables need to be evaluated using all available data. A given database may or may not be suitable for studying a research question depending on the availability of these covariates.

Some patient characteristics and covariates vary with time and accurate assessment is time dependent. The timing of assessment of the covariates is an important factor for the correct classification of the subjects and should be clearly specified in the protocol. Assessment of covariates can be done using different periods of time (look-back periods or run-in periods).


Fixed look-back periods (for example 6 months or 1 year) are sometimes used when there are changes in coding methods or in practices or when is not feasible to use the entire medical history of a patient. Estimation using all available covariates information versus a fixed look-back window for dichotomous covariates (Pharmacoepidemiol Drug Saf. 2013; 22(5):542-50) establishes that defining covariates based on all available historical data, rather than on data observed over a commonly shared fixed historical window will result in estimates with less bias. However, this approach may not be applicable when data from paediatric and adult periods are combined because covariates may significantly differ between paediatric and adult populations (e.g., height and weight).


5.1.4. Validation


In healthcare databases, the correct assessment of drug exposure, outcome and covariate is crucial to the validity of research. The validation of electronic information on drug exposure, outcome or covariate is crucial for database studies and definitions should be included in the technical handbook of every database, ideally providing estimates of sensitivity, specificity, and the positive and negative predictive value. Validity of diagnostic coding within the General Practice Research Database: a systematic review (Br J Gen Pract 2010;60:e128-36), the book Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 5th Edition, Wiley, 2012) and Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned (Pharmacoepidemiol Drug Saf. 2012 Jan;21 Suppl 1:82-9) contain examples.


Completeness and validity of all variables used as exposure, outcomes, potential confounders and effect modifiers should be considered. Assumptions included in case definitions or other algorithms may need to be confirmed. For databases routinely used in research, documented validation of key variables may have been done previously by the data provider or other researchers. Any extrapolation of previous validation should, however, consider the effect of any differences in variables or analyses and subsequent changes to health care, procedures and coding. A full understanding of both the health care system and procedures that generated the data is required. This is particularly important for studies relying upon accurate timing of exposure, outcome and covariate recording such as in the self-controlled case series.  External validation against chart review or physician/patient questionnaire is possible with some resources. However, the questionnaires cannot always be considered as ‘gold standard’.


Review of records against a case definition by experts may also be possible. While false positives are more easily measured than false negatives, specificity of an outcome is more important than sensitivity when considering bias in relative risk estimates (see A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol 2005;58(4):323-37). Alternatively, internal logic checks can test for completeness and accuracy of variables. For example, one can investigate whether an outcome was followed by (or proceeded from) appropriate exposure or procedures.


Concordance between datasets such as comparison of cancer or death registries with clinical or administrative records can validate individual records or overall incidence or prevalence rates.

Linkage validation can be used as well, when another database is used for the validation of current one, through linkage methods (Using linked electronic data to validate algorithms for health outcomes in administrative databases., J Comp Eff Res. 2015 Aug;4(4):359-66.)



Individual Chapters:


1. Introduction

2. Formulating the research question

3. Development of the study protocol

4. Approaches to data collection

4.1. Primary data collection

4.1.1. Surveys

4.1.2. Randomised clinical trials

4.2. Secondary data collection

4.3. Patient registries

4.3.1. Definition

4.3.2. Conceptual differences between a registry and a study

4.3.3. Methodological guidance

4.3.4. Registries which capture special populations

4.3.5. Disease registries in regulatory practice and health technology assessment

4.4. Spontaneous report database

4.5. Social media and electronic devices

4.6. Research networks

4.6.1. General considerations

4.6.2. Models of studies using multiple data sources

4.6.3. Challenges of different models

5. Study design and methods

5.1. Definition and validation of drug exposure, outcomes and covariates

5.1.1. Assessment of exposure

5.1.2. Assessment of outcomes

5.1.3. Assessment of covariates

5.1.4. Validation

5.2. Bias and confounding

5.2.1. Selection bias

5.2.2. Information bias

5.2.3. Confounding

5.3. Methods to handle bias and confounding

5.3.1. New-user designs

5.3.2. Case-only designs

5.3.3. Disease risk scores

5.3.4. Propensity scores

5.3.5. Instrumental variables

5.3.6. Prior event rate ratios

5.3.7. Handling time-dependent confounding in the analysis

5.4. Effect measure modification and interaction

5.5. Ecological analyses and case-population studies

5.6. Pragmatic trials and large simple trials

5.6.1. Pragmatic trials

5.6.2. Large simple trials

5.6.3. Randomised database studies

5.7. Systematic reviews and meta-analysis

5.8. Signal detection methodology and application

6. The statistical analysis plan

6.1. General considerations

6.2. Statistical analysis plan structure

6.3. Handling of missing data

7. Quality management

8. Dissemination and reporting

8.1. Principles of communication

8.2. Communication of study results

9. Data protection and ethical aspects

9.1. Patient and data protection

9.2. Scientific integrity and ethical conduct

10. Specific topics

10.1. Comparative effectiveness research

10.1.1. Introduction

10.1.2. General aspects

10.1.3. Prominent issues in CER

10.2. Vaccine safety and effectiveness

10.2.1. Vaccine safety

10.2.2. Vaccine effectiveness

10.3. Design and analysis of pharmacogenetic studies

10.3.1. Introduction

10.3.2. Identification of generic variants

10.3.3. Study designs

10.3.4. Data collection

10.3.5. Data analysis

10.3.6. Reporting

10.3.7. Clinical practice guidelines

10.3.8. Resources

Annex 1. Guidance on conducting systematic revies and meta-analyses of completed comparative pharmacoepidemiological studies of safety outcomes