ENCePP Guide on Methodological Standards in Pharmacoepidemiology


5.3. Definition and validation of drug exposure, outcomes and covariates


Historically, pharmacoepidemiology studies relied on patient-supplied information or searches through paper-based health records. The rapid increase in access to electronic healthcare records and large administrative databases has changed the way exposures and outcomes are defined, measured and validated. All variables should be defined with care, taking into account the fact that information is often recorded for purposes other than pharmacoepidemiology. Misclassification of exposure, outcome or any covariates, or incorrect categorisation of these variables, may lead to information bias, i.e. a distortion of the value of the point estimate.


5.3.1. Assessment of exposure

In pharmacoepidemiology studies, exposure data originate mainly from four data sources: prescription data (e.g. CPRD primary care data), data on dispensing (e.g. PHARMO outpatient pharmacy database), data on payment for medication (namely claims data, e.g. IMS LifeLink PharMetrics Plus) and data collected in surveys. The population included in these data sources follows a process of attrition: drugs that are prescribed are not necessarily dispensed, and drugs that are dispensed are not necessarily ingested. In Primary non-adherence in general practice: a Danish register study (Eur J Clin Pharmacol 2014;70(6):757-63), 9.3% of all prescriptions for new therapies were never redeemed at the pharmacy, with different percentages per therapeutic and patient groups. The attrition from dispensing to ingestion is even more difficult to measure, as it is compounded by uncertainties about which dispensed drugs are actually taken by the patients and the patients’ ability to provide an accurate account of their intake.


Exposure definitions can include simple dichotomous variables (e.g. ever exposed vs. never exposed) or be more detailed, including estimates of duration, exposure windows (e.g. current vs. past exposure) or dosage (e.g. current dosage, cumulative dosage over time). Consideration should be given to the level of detail available from the data sources on the timing of exposure, including the quantity prescribed, dispensed or ingested and the capture of dosage instructions. This will vary across data sources and exposures (e.g. estimating anticonvulsant ingestion is typically easier than estimating rescue medication for asthma attacks). Assumptions made when preparing drug exposure data for analysis have an impact on results: an unreported step in pharmacoepidemiology studies (Pharmacoepidemiol Drug Saf. 2018;27(7):781-8) demonstrates the effect of certain exposure assumptions on findings and provides a framework to report preparation of exposure data. The Methodology chapter of the book Drug Utilization Research. Methods and Applications (M. Elseviers, B. Wettermark, A.B. Almarsdottir et al. Ed. Wiley Blackwell, 2016) discusses different methods for data collection on drug utilisation.
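
As an illustration only, the sketch below shows one way dispensing records might be turned into exposure episodes and then into a current/past/never classification. The record layout, the function names and the 30-day grace period are assumptions made for this example and are not taken from the cited references; stockpiling from overlapping dispensings is deliberately ignored.

```python
from datetime import date, timedelta


def build_exposure_episodes(dispensings, grace_days=30):
    """Merge dispensings into continuous exposure episodes.

    dispensings: list of (dispensing_date, days_supply) tuples for one
    patient and one drug. grace_days is an assumed permissible gap between
    the end of one supply and the next dispensing (an analytic choice).
    Returns a list of (start_date, end_date) tuples.
    """
    episodes = []
    for start, days_supply in sorted(dispensings):
        end = start + timedelta(days=days_supply)
        if episodes and start <= episodes[-1][1] + timedelta(days=grace_days):
            # Refill within the grace period: extend the current episode.
            prev_start, prev_end = episodes[-1]
            episodes[-1] = (prev_start, max(prev_end, end))
        else:
            # Gap too long (or first record): start a new episode.
            episodes.append((start, end))
    return episodes


def exposure_status(episodes, index_date):
    """Classify exposure on index_date as 'current', 'past' or 'never'.

    'never' here means no exposure recorded up to the index date.
    """
    if any(start <= index_date <= end for start, end in episodes):
        return "current"
    if any(end < index_date for _, end in episodes):
        return "past"
    return "never"


# Hypothetical dispensing history: three 30-day supplies.
records = [(date(2020, 1, 1), 30), (date(2020, 2, 5), 30), (date(2020, 6, 1), 30)]
episodes = build_exposure_episodes(records, grace_days=30)
```

Changing `grace_days` or the assumed days of supply changes which patients count as currently exposed, which is why such assumptions should be reported.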


5.3.2. Assessment of outcomes


A case definition compatible with the data source should be developed for each outcome of a study at the design stage. This description should include how events will be identified and classified as cases, whether cases will include prevalent as well as incident cases, exacerbations and second episodes (as differentiated from repeat codes) and all other inclusion or exclusion criteria. The reason for the data collection and the nature of the healthcare system that generated the data should also be described as they can impact on the quality of the available information and the presence of potential biases. Published case definitions of outcomes, such as those developed by the Brighton Collaboration in the context of vaccination, are useful but are not necessarily compatible with the information available in the observational data sources. For example, information on the onset or duration of symptoms may not be available.


Search criteria to identify outcomes should be defined, and the list of codes and any case-finding algorithm used should be provided. Generation of code lists requires expertise in both the coding system and the disease area. Researchers should consult clinicians who are familiar with the coding practice within the studied field. Suggested methodologies are available for some coding systems, as described in Creating medical and drug code lists to identify cases in primary care databases (Pharmacoepidemiol Drug Saf. 2009;18(8):704-7). Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models (Annu Rev Biomed Data Sci. 2018;1:53-68) reports on methods for phenotyping (finding patients with specific conditions or outcomes), which are becoming more commonly used, particularly in multi-resource studies. Care should be taken when re-using a code list from another study, as code lists depend on the study objective and methods. Public repositories of codes are available, and researchers are also encouraged to make their own code lists available.


In some circumstances, chart review or free text entries in electronic format linked to coded entries can be useful for outcome identification. Such identification may involve an algorithm with use of multiple code lists (for example disease plus therapy codes) or an endpoint committee to adjudicate available information against a case definition. In some cases, initial plausibility checks or subsequent medical chart review will be necessary. When databases contain prescription data only, drug exposure may be used as a proxy for an outcome, or linkage to different databases is required.
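
A minimal sketch of a case-finding algorithm that combines two code lists (disease plus therapy codes) is given below. The codes, the 90-day window and the function name are hypothetical and serve only to illustrate the idea; real code lists require review by clinicians and coding experts.

```python
from datetime import date

# Hypothetical code lists, for illustration only.
DISEASE_CODES = {"DX001", "DX002"}
THERAPY_CODES = {"RX100"}


def find_cases(events, max_days_apart=90):
    """Return {patient_id: date of the first qualifying disease code}.

    events: iterable of (patient_id, event_date, code) tuples.
    A patient counts as a case when a disease code has a therapy code
    recorded within max_days_apart days before or after it (assumed rule).
    """
    disease, therapy = {}, {}
    for pid, when, code in events:
        if code in DISEASE_CODES:
            disease.setdefault(pid, []).append(when)
        elif code in THERAPY_CODES:
            therapy.setdefault(pid, []).append(when)

    cases = {}
    for pid, dx_dates in disease.items():
        for dx in sorted(dx_dates):
            rx_dates = therapy.get(pid, [])
            if any(abs((rx - dx).days) <= max_days_apart for rx in rx_dates):
                cases[pid] = dx  # earliest disease code supported by therapy
                break
    return cases


events = [
    ("p1", date(2021, 1, 10), "DX001"),
    ("p1", date(2021, 2, 1), "RX100"),   # therapy 22 days after the diagnosis
    ("p2", date(2021, 3, 5), "DX002"),   # diagnosis code without therapy
    ("p3", date(2021, 1, 1), "DX001"),
    ("p3", date(2021, 9, 1), "RX100"),   # therapy recorded too late
]
cases = find_cases(events)
```

Requiring a supporting therapy code will usually reduce false positives at the cost of missing untreated cases, so the rule should match the case definition of the study.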


5.3.3. Assessment of covariates


In pharmacoepidemiology studies, covariates are used for selecting and matching study subjects, comparing characteristics of the cohorts, developing propensity scores, creating stratification variables, evaluating effect modifiers and adjusting for confounders. Reliable assessment of covariates is therefore essential for the validity of results. A given database may or may not be suitable for studying a research question depending on the availability of information on these covariates.


Some patient characteristics and covariates vary with time, and accurate assessment is therefore time dependent. The timing of assessment of the covariates is an important factor for the correct classification of the subjects and should be clearly reported. Covariates can be captured at one or multiple points during the study period. In the latter scenario, the variable will be modelled as a time-dependent variable (see section 5.4.6).


Assessment of covariates can be done using different periods of time (look-back periods or run-in periods). Fixed look-back periods (for example 6 months or 1 year) can be appropriate when there are changes in coding methods or in practices, or when using the entire medical history of a patient is not feasible. Estimation using all available covariate information versus a fixed look-back window for dichotomous covariates (Pharmacoepidemiol Drug Saf. 2013;22(5):542-50) establishes that defining covariates based on all available historical data, rather than on data observed over a commonly shared fixed historical window, will result in estimates with less bias. However, this approach may not always be applicable, for example when data from paediatric and adult periods are combined, because covariates may differ significantly between paediatric and adult populations (e.g. height and weight).
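
The difference between a fixed look-back window and the use of all available history can be sketched as follows. The function name and the example dates are assumptions made for illustration.

```python
from datetime import date


def covariate_present(record_dates, index_date, lookback_days=None):
    """Return True if the covariate was recorded before index_date.

    record_dates: dates on which the covariate (e.g. a diagnosis code)
    was recorded for one patient.
    lookback_days: length of a fixed look-back window in days; None means
    that all available history before the index date is used.
    """
    for recorded in record_dates:
        if recorded >= index_date:
            continue  # only information before the index date is used
        if lookback_days is None or (index_date - recorded).days <= lookback_days:
            return True
    return False


# Hypothetical patient: a diagnosis recorded almost five years before index.
history = [date(2015, 3, 1)]
index = date(2020, 1, 1)
fixed_window = covariate_present(history, index, lookback_days=365)
all_history = covariate_present(history, index, lookback_days=None)
```

In this example the patient is classified as not having the covariate under a one-year window but as having it when all history is used, which shows how the choice of look-back period can change covariate classification.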


5.3.4. Misclassification and validation




Misclassification arises when incorrect information about exposure, outcome or any covariate is collected in the study, or when variables are incorrectly categorised. Misclassification should be detected, measured and removed or reduced to avoid information bias, i.e. a distortion of the value of the point estimate. Misclassification is non-differential when it occurs randomly across the groups being compared (e.g. exposed and non-exposed participants) and differential when it is influenced by the disease or exposure status.

Outcome misclassification occurs when a non-case is classified as a case (false positive error) or a case is classified as a non-case (false negative error). The influence of misclassification on the point estimate should be quantified or, if this is not possible, its impact on the interpretation of the results should be discussed.
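
The usual validation measures follow directly from the counts of true and false positives and negatives. The sketch below uses hypothetical counts from an assumed validation sample in which every patient has both an algorithm result and a reference-standard result.

```python
def validation_measures(tp, fp, fn, tn):
    """Compute validation measures from a 2x2 validation table.

    tp: algorithm-positive, reference-positive (true positives)
    fp: algorithm-positive, reference-negative (false positives)
    fn: algorithm-negative, reference-positive (false negatives)
    tn: algorithm-negative, reference-negative (true negatives)
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Hypothetical validation sample of 1,000 patients.
measures = validation_measures(tp=80, fp=20, fn=10, tn=890)
```

Note that the predictive values depend on the prevalence of the outcome in the validation sample, whereas sensitivity and specificity do not depend on it in the same direct way.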

Exposure misclassification should be measured in each comparison group, and the epidemiologic ‘mantra’ about non-differential misclassification of exposure producing conservative estimates should be avoided. It holds true, on average, for dichotomous exposures that have an effect, but does not necessarily apply to any given estimate (see: Proper interpretation of non-differential misclassification effects: expectations vs observations. Int J Epidemiol. 2005;34(3):680-7).
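
The expected effect of non-differential misclassification of a dichotomous exposure can be illustrated with a simple calculation. The counts, sensitivity and specificity below are hypothetical, and the result is an expectation only: as noted above, the estimate from any single study may depart from it because of random error.

```python
def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected (classified exposed, classified unexposed) counts."""
    obs_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    obs_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return obs_exposed, obs_unexposed


def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Exposure odds ratio from a case-control 2x2 table."""
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


# Hypothetical true exposure distribution in cases and controls.
true_or = odds_ratio(200, 800, 100, 900)  # 2.25

# The same sensitivity and specificity are applied to cases and controls,
# i.e. the misclassification is non-differential with respect to outcome.
cases = misclassify(200, 800, sensitivity=0.8, specificity=0.9)
controls = misclassify(100, 900, sensitivity=0.8, specificity=0.9)
expected_or = odds_ratio(cases[0], cases[1], controls[0], controls[1])  # about 1.54
```

Here the expected odds ratio moves from 2.25 towards the null, but this describes the average over many hypothetical repetitions rather than a guarantee for one observed estimate.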




Most database studies will be subject to outcome misclassification to some degree, although case adjudication against an established case definition or a reference standard can remove false positives, and false negatives can be mitigated if a broad search algorithm is used. Misclassification of exposure should be measured by validation. Validity of diagnostic coding within the General Practice Research Database: a systematic review (Br J Gen Pract. 2010;60:e128-36), the book Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 5th Edition, Wiley, 2012) and Mini-Sentinel's systematic reviews of validated methods for identifying health outcomes using administrative and claims data: methods and lessons learned (Pharmacoepidemiol Drug Saf. 2012;21(Suppl 1):82-9) provide examples of validation. External validation against chart review or physician/patient questionnaire is possible in some instances, but the questionnaires cannot always be considered a ‘gold standard’.


For databases routinely used in research, documented validation of key variables may have been done previously by the data provider or other researchers. Any extrapolation of a previous validation study should, however, consider the effect of any differences in prevalence and in inclusion and exclusion criteria, the distribution and analysis of risk factors, as well as subsequent changes to health care, procedures and coding, as illustrated in Basic Methods for Sensitivity Analysis of Biases (Int J Epidemiol. 1996;25(6):1107-16). An accurate date of onset is particularly important for studies relying upon the timing of exposure and outcome, such as self-controlled designs.


Linkage validation can be used when another database is used for the validation through linkage methods (see Using linked electronic data to validate algorithms for health outcomes in administrative databases, J Comp Eff Res 2015;4:359-66). In some situations, there is no access to a resource to provide data for comparison. In this case, indirect validation may be an option, as explained in the book Applying quantitative bias analysis to epidemiologic data (Lash T, Fox MP, Fink AK. Springer-Verlag, New York, 2009).

Structural validation of the database with internal logic checks can also be performed to verify the completeness and accuracy of variables. For example, one can investigate whether an outcome was followed by (or preceded by) appropriate exposure or procedures, or whether a certain variable has values within a known reasonable range.
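
Internal logic checks of this kind can be sketched as below. The field names, the plausibility limits for weight and the 30-day window between outcome and procedure are assumptions chosen for illustration and would need to be adapted to the database and outcome studied.

```python
from datetime import date

# Assumed plausibility limits, for illustration only.
WEIGHT_RANGE_KG = (0.3, 400.0)
MAX_DAYS_TO_PROCEDURE = 30


def logic_checks(record):
    """Return a list of failed checks for one patient record (a dict)."""
    problems = []

    # Temporal consistency: an outcome cannot precede birth.
    if record["outcome_date"] < record["birth_date"]:
        problems.append("outcome before birth")

    # Range check: value within a known reasonable range.
    low, high = WEIGHT_RANGE_KG
    if not low <= record["weight_kg"] <= high:
        problems.append("weight out of range")

    # Clinical consistency: outcome followed by an appropriate procedure.
    procedure = record["procedure_date"]
    if procedure is None:
        problems.append("no procedure after outcome")
    else:
        delay = (procedure - record["outcome_date"]).days
        if not 0 <= delay <= MAX_DAYS_TO_PROCEDURE:
            problems.append("procedure timing implausible")
    return problems


plausible = {
    "birth_date": date(1950, 5, 1),
    "outcome_date": date(2020, 3, 10),
    "procedure_date": date(2020, 3, 12),
    "weight_kg": 72.0,
}
suspect = dict(plausible, weight_kg=720.0, procedure_date=None)
```

Records flagged by such checks are candidates for further review rather than automatic exclusion, since a failed check may reflect incomplete recording instead of an error.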


While the positive predictive value is more easily measured than the negative predictive value, a low specificity is more damaging than a low sensitivity when considering bias in relative risk estimates (see A review of uses of health care utilization databases for epidemiologic research on therapeutics. J Clin Epidemiol. 2005;58(4):323-37). FDA’s Quantitative Bias Analysis Methodology Development: Sequential Bias Adjustment for Outcome Misclassification (2017) proposes a method of adjustment when validation of the variable is complete. Use of the Positive Predictive Value to Correct for Disease Misclassification in Epidemiologic Studies (Am J Epidemiol. 1993;138(11):1007-15) proposes a method based on estimates of the positive predictive value, which requires validation of a sample of patients with the outcome only and assumes that sensitivity is non-differential. This method has been used in a web application (Outcome misclassification: Impact, usual practice in pharmacoepidemiology database studies and an online aid to correct biased estimates of risk ratio or cumulative incidence. Pharmacoepidemiol Drug Saf. 2020;29(11):1450-5), which allows correction of risk ratio or cumulative incidence point estimates and confidence intervals for bias due to outcome misclassification. The article Basic methods for sensitivity analysis of biases (Int J Epidemiol. 1996;25(6):1107-16) provides different examples of methods for examining the sensitivity of study results to biases, with a focus on methods that can be implemented without computer programming. Good practices for quantitative bias analysis advocates explicit and quantitative assessment of misclassification bias, including guidance on which biases to assess in each situation, what level of sophistication to use, and how to present the results.
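
The arithmetic behind two of these points can be sketched with hypothetical numbers. The example below shows the expected risk ratio under imperfect sensitivity versus imperfect specificity of outcome ascertainment, and a correction based on the positive predictive value in each exposure group. The correction formula is derived here from the definitions under the assumption of equal sensitivity in both groups; it is not the implementation of the cited web application, and all function names and values are illustrative.

```python
def observed_cases(n, true_risk, sensitivity, specificity):
    """Expected number of recorded cases among n persons."""
    true_cases = n * true_risk
    false_positives = (n - true_cases) * (1 - specificity)
    return true_cases * sensitivity + false_positives


def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)


def ppv_corrected_rr(rr_observed, ppv_exposed, ppv_unexposed):
    """Correct a risk ratio using group-specific positive predictive values.

    Assumes that sensitivity is the same in exposed and unexposed groups.
    """
    return rr_observed * ppv_exposed / ppv_unexposed


N = 10_000                            # persons per exposure group
RISK_EXP, RISK_UNEXP = 0.02, 0.01     # hypothetical true risks, true RR = 2

# Low sensitivity (0.8), perfect specificity: expected RR stays at 2.
rr_low_sens = risk_ratio(observed_cases(N, RISK_EXP, 0.8, 1.0), N,
                         observed_cases(N, RISK_UNEXP, 0.8, 1.0), N)

# Perfect sensitivity, slightly imperfect specificity (0.98): RR about 1.33.
obs_exp = observed_cases(N, RISK_EXP, 1.0, 0.98)
obs_unexp = observed_cases(N, RISK_UNEXP, 1.0, 0.98)
rr_low_spec = risk_ratio(obs_exp, N, obs_unexp, N)

# PPV in each group: true cases divided by recorded cases.
ppv_exp = (N * RISK_EXP) / obs_exp
ppv_unexp = (N * RISK_UNEXP) / obs_unexp
rr_corrected = ppv_corrected_rr(rr_low_spec, ppv_exp, ppv_unexp)
```

In this example a 2% loss of specificity biases the expected risk ratio far more than a 20% loss of sensitivity, and applying the group-specific positive predictive values recovers the true risk ratio of 2.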

