ENCePP Guide on Methodological Standards in Pharmacoepidemiology


Chapter 4: Approaches to data collection


4. Approaches to data collection

4.1. Primary data collection

      4.1.1. Surveys

      4.1.2. Randomised clinical trials

4.2. Secondary data collection

4.3. Patient registries

      4.3.1. Definition

      4.3.2. Conceptual differences between a registry and a study

      4.3.3. Methodological guidance

      4.3.4. Registries which capture special populations

      4.3.5. Disease registries in regulatory practice and health technology assessment

4.4. Spontaneous reports

4.5. Social media

      4.5.1. Definition

      4.5.2. Use in pharmacovigilance

      4.5.3. Challenges

      4.5.4. Data protection

4.6. Research networks

      4.6.1. General considerations

      4.6.2. Models of studies using multiple data sources

      4.6.3. Challenges of different models



4. Approaches to data collection


There are two main approaches for data collection: collection of data specifically for a particular study (‘primary data collection’) or use of data already collected for another purpose, e.g. as part of administrative records of patient health care (‘secondary data collection’). The distinction between primary and secondary data collection is important for marketing authorisation holders as it implies different regulatory requirements for the collection and reporting of suspected adverse reactions, as described in Module VI of the Guideline on good pharmacovigilance practice (GVP) - Management and reporting of adverse reactions to medicinal products.


Secondary data collection has become a common approach used in pharmacoepidemiology due to the increasing availability of electronic healthcare records, administrative claims data and other already existing data sources (see Chapter 4.2 Secondary data collection) and due to its increased efficiency and lower cost. In addition, networking between centres active in pharmacoepidemiology and pharmacovigilance is rapidly changing the landscape of drug safety research in Europe, both in terms of networks of data and networks of researchers who can contribute to a particular study with a particular data source (see Chapter 4.6 Research Networks).


4.1. Primary data collection


The methodological aspects of primary data collection studies are well covered in the textbooks and guidelines referred to in the Introduction chapter. Annex 1 of Module VIII of the Good pharmacovigilance practice provides examples of different study designs based on prospective primary data collection such as cross-sectional study, prospective cohort study, active surveillance. Surveys and randomised controlled trials are presented below as examples of primary data collection.


Studies using hospital or community-based primary data collection have allowed the evaluation of drug-disease associations for rare complex conditions that require very large source populations and in-depth case assessment by clinical experts. Classic examples are Appetite-Suppressant Drugs and the Risk of Primary Pulmonary Hypertension (N Engl J Med 1996;335:609-16), The design of a study of the drug etiology of agranulocytosis and aplastic anemia (Eur J Clin Pharmacol 1983;24:833-6) and Medication Use and the Risk of Stevens–Johnson Syndrome or Toxic Epidermal Necrolysis (N Engl J Med 1995;333:1600-8). For some conditions, case-control surveillance networks have been developed and used for selected studies and for signal generation and clarification, e.g. Signal generation and clarification: use of case-control data (Pharmacoepidemiol Drug Saf 2001;10:197-203).


4.1.1. Surveys


A survey is a data collection tool used to gather information about individuals. Surveys are commonly used to collect self-reported data, either factual information about individuals or their opinions. They generally have a cross-sectional design and represent a form of primary data collection conducted through questionnaires administered by web, phone or paper.


Although long used in other areas such as social science or marketing, surveys are now increasingly used in pharmacoepidemiology, especially in the areas of epidemiology and evaluation of the effectiveness of risk minimisation measures (RMM) (see Chapter 5.9).


Questionnaires used in surveys should be validated based on accepted measures including construct, criterion and content validity, inter-rater and test-retest reliability, sensitivity and responsiveness.

Recommendations on data collection, including which medium to use, how to recruit a representative sample and how to formulate the questions in a non-directive way to avoid information bias, are described in the following textbooks: Survey Sampling (L. Kish, Wiley, 1995) and Survey Methodology (R.M. Groves, F.J. Fowler, M.P. Couper et al., 2nd Edition, Wiley 2009).


Although primarily focused on quality of life research, the book Quality of Life: the assessment, analysis and interpretation of patient-related outcomes (P.M. Fayers, D. Machin, 2nd Edition, Wiley, 2007) offers a comprehensive review of the theory and practice of developing, testing and analysing questionnaires in different settings. Health Measurement Scales: a practical guide to their development and use (D. L. Streiner, G. R. Norman, 4th Edition, Oxford University Press, 2008) is a very helpful guide to those involved in measuring subjective states and learning style in patients and healthcare providers.


Representativeness is an important element for surveys; the included sample should be representative of the target population, which must be defined with regard to the research question. For example, if the objective of the survey is to evaluate whether the RMM are distributed among the right target population, the lists which are used for the distribution of the RMM material cannot be used as the source population for sampling.


The response rate is also an important metric of a survey and should be reported in a standardised way so that different surveys can be compared. Standard Definitions. Final Dispositions of Case Codes and Outcome Rates for Surveys of the American Association for Public Opinion Research provides standard definitions which can be adapted to pharmacoepidemiological surveys. The overall response rate remains low in telephone surveys (J.M. Lepkowski, N.C. Tucker, J.M. Brick et al., Ed. Advances in telephone survey methodology, Wiley 2007, Part V), and it is important to counteract this since a low response rate leads to lack of power and reduced representativeness. One way to mitigate a low response rate is to use short or personalised questionnaires approved by professional associations.
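As an illustration of standardised reporting, the sketch below computes the minimum response rate (RR1 in the AAPOR Standard Definitions): complete responses divided by all eligible cases plus cases of unknown eligibility. The class name, field names and counts are hypothetical and chosen only for this example; a real survey would map its own final disposition codes onto these categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispositions:
    """Final case dispositions, grouped in AAPOR-style categories (hypothetical field names)."""
    complete: int             # I: complete questionnaires
    partial: int              # P: partial questionnaires
    refusal: int              # R: refusals and break-offs
    non_contact: int          # NC: eligible cases never reached
    other: int                # O: other eligible non-response
    unknown_eligibility: int  # UH + UO: cases whose eligibility is unknown


def response_rate_1(d: Dispositions) -> float:
    """Minimum response rate: completes over all eligible and unknown-eligibility cases."""
    denominator = (
        d.complete + d.partial
        + d.refusal + d.non_contact + d.other
        + d.unknown_eligibility
    )
    if denominator == 0:
        raise ValueError("no sampled cases to compute a response rate from")
    return d.complete / denominator


# Hypothetical telephone survey of 1000 sampled prescribers.
example = Dispositions(
    complete=240, partial=20, refusal=300,
    non_contact=350, other=40, unknown_eligibility=50,
)
print(f"RR1 = {response_rate_1(example):.2f}")
```

Reporting the disposition counts alongside the rate lets readers recompute other outcome rates and compare surveys that used different definitions.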


4.1.2. Randomised clinical trials


A randomised clinical trial is an experimental design that involves primary data collection. There are numerous textbooks and publications on methodological and operational aspects of clinical trials and they are not covered here. An essential guideline on clinical trials is the European Medicines Agency (EMA) Guideline for good clinical practice E6(R2), which specifies obligations for the conduct of clinical trials to ensure that the data generated in the trial are valid. From a legal perspective, Volume 10 of the Rules Governing Medicinal Products in the European Union contains all guidance and legislation relevant for the conduct of clinical trials. A number of documents are under revision.


The way clinical trials are conducted in the European Union (EU) will undergo a major change when the Clinical Trial Regulation (Regulation (EU) No 536/2014) comes into application and replaces the existing directive.


Hybrid data collection, as used in pragmatic trials, large simple trials and randomised database studies, is described in Chapter 5.6.


4.2. Secondary data collection


Secondary data collection refers to the use of data already gathered for another purpose (e.g. electronic and non-electronic healthcare data). These data can be further linked to non-medical data, such as socio-economic or lifestyle factors. The last decades have witnessed the development of key data resources, expertise and methodology that have allowed use of such data for pharmacoepidemiology. The ENCePP Inventory of Data Sources contains information on existing European databases. However, this field is continuously evolving and it is recommended to look for recently published reviews and lists of databases.


A comprehensive description of the main features and applications of frequently used electronic healthcare databases for pharmacoepidemiology research in the United States and in Europe appears in the book Pharmacoepidemiology (B. Strom, S.E. Kimmel, S. Hennessy. 5th Edition, Wiley, 2012, Chapters 11 - 18). The limitations of electronic healthcare databases should be acknowledged, as detailed in A review of uses of healthcare utilisation databases for epidemiologic research on therapeutics (J Clin Epidemiol 2005; 58: 323-337).


The primary purpose of the ISPE-endorsed Guidelines for Good Database Selection and use in Pharmacoepidemiology Research (Pharmacoepidemiol Drug Saf 2012;21:1-10) is to assist in the selection and use of data resources in pharmacoepidemiology by highlighting potential limitations and recommending correct procedures. This text mainly refers to databases of routinely collected healthcare information such as electronic medical records and claims databases and does not include spontaneous reporting databases. It is a simple, well-structured guideline that will help investigators to select databases and helps database custodians to describe their database in a useful manner. An entire section is dedicated to the use of multi-database studies. The document also contains references to data quality and validation procedures, data processing/transformation, privacy and security.


The Working Group for the Survey and Utilisation of Secondary Data (AGENS) with representatives from the German Society for Social Medicine and Prevention (DGSPM) and the German Society for Epidemiology (DGEpi) developed a Good Practice in Secondary Data Analysis Version 2 aiming to establish a standard for planning, conducting and analysing studies on the basis of secondary data. The guidance is also aimed to be used as the basis for contracts between data owners (so-called primary users) and secondary users. It is divided into 11 sections addressing, among other aspects, the study protocol, quality assurance and data protection.


The FDA’s Best Practices for Conducting and Reporting Pharmacoepidemiologic Safety Studies Using Electronic Health Care Data Sets provides criteria for best practice that apply to design, analysis, conduct and documentation. It emphasizes that investigators should understand the potential limitations of electronic healthcare data systems, make provisions for their appropriate use and refer to validation studies of safety outcomes of interest in the proposed study and captured in the database.


Guidance for conducting studies within electronic healthcare databases can also be found in the ISPE GPP, in particular sections IV-B (Study conduct, Data collection). This guidance emphasizes the importance of patient data protection.


The International Society for Pharmacoeconomics and Outcomes Research (ISPOR) established a task force to recommend good research practices for designing and analysing retrospective databases for comparative effectiveness research (CER). The Task Force has subsequently published three articles (Part I, Part II and Part III) that review methodological issues and possible solutions for CER studies based on secondary data analysis (see also Chapter 10.1 on comparative effectiveness research). Many of the principles are applicable to studies with objectives other than CER, but some aspects of pharmacoepidemiological studies based on secondary use of data, such as data quality, ethical issues, data ownership and privacy, are not covered.

Particular issues to be considered in the use of electronic healthcare data for pharmacoepidemiological research include completeness of data capture, bias in the assessment of exposure, outcome and covariates, variability between data sources and the impact of changes over time in data, access methodology and the healthcare system.


The majority of the examples and methods covered in Chapter 5 are based on studies and methodologic developments in secondary data collection, since this is the most frequent approach used in pharmacoepidemiology. 


Chapter 4.6. deals with models of studies conducted across multiple data sources.  


4.3. Patient registries


4.3.1. Definition


A registry is an organised system that uses observational methods to collect uniform data on specified outcomes in a population defined by a particular disease, condition or exposure. A register is the database deriving from the registry (such as the EU PAS Register), the two terms being often used interchangeably. These terms are sometimes used incorrectly to designate a cohort study with primary data collection or a list of all patients meeting the eligibility criteria for a study. The term ‘patient log-list’ could be used for this purpose.


A patient registry should be considered as a structure for the standardised recording of data from routine clinical practice on individual patients identified by a characteristic or an event, for example the diagnosis of a disease, the occurrence of a condition (e.g., pregnancy), the prescription of a medicinal product, a hospital encounter, or any combination of these.


In European Nordic countries where there is a comprehensive registration of data for a high proportion or all of the population, government-administered patient registries may include hospital encounters, diagnoses and procedures, such as the Norwegian Patient Registry, the Danish National Patient Registry or the Swedish National Patient Register. They may lack information on lifestyle factors, patient-related outcomes and laboratory data. A Review of 103 Swedish Healthcare Quality Registries (J Intern Med 2015; 277(1): 94–136) describes additional healthcare quality registries focusing on specific disorders initiated in Sweden mostly by physicians with data on aspects of disease management, self-reported quality of life, lifestyle, and general health status, providing an important source for research.


4.3.2. Conceptual differences between a registry and a study


As illustrated in Imposed registries within the European postmarketing surveillance system (Pharmacoepidemiol Drug Saf 2018 May 11), the conceptual differences between registries and studies need to be clearly understood.


Patient registries are often integrated into routine clinical practice with systematic and sometimes automated data capture in electronic healthcare records. Whilst the duration of a registry is normally open-ended, that of a study is dictated by the time needed to define and collect data relevant for the specific study objectives. Studies may also require introduction of specific procedures, questionnaires or data collection tools. Studies are set up and managed based on a limited number of endpoints and a specific protocol, whereas patient registries should focus on system(s) specifications in order to ensure continuous, efficient and collaborative data collection, safe data hosting and availability of retrievable, interoperable and re-usable data.


A register can be used as a source of patients for studies based on either primary data collection (where the data collected for new patients are also used for a specific study) or secondary data collection (analogously to the use of electronic healthcare records). For this purpose, registry data can be enriched with additional information on outcomes, lifestyle data, immunisation or mortality information obtained from linkage to the existing databases such as national cancer registries, prescription databases or mortality records.


4.3.3. Methodological guidance


To support better use of existing registries and facilitate the establishment of new high-quality registries, the EU regulatory network developed the Patient registries initiative. As part of this initiative, the European Medicines Agency (EMA) organised several workshops on disease-specific registries. The reports of these workshops on the EMA Patient registries website describe regulators’ expectation on common data elements to be collected and best practices on topics such as governance, data quality control, data sharing or reporting of safety data. The ENCePP Resource database of data sources is also used to support an inventory of existing disease registries.


Upon request from the European cystic fibrosis society patient registry (ECFSPR), the EMA’s Scientific Advice Working Party issued a Qualification Opinion, concluding that the current status of the registry allows its use as a data source for regulatory purposes for drug utilisation studies, drug efficacy/effectiveness studies and drug safety studies. Although it applies only to the ECFSPR, the text of this opinion provides a good indication of the key methodological components expected by regulators for using a disease registry for post-authorisation studies.


The US Agency for Health Care Research and Quality (AHRQ) published a comprehensive document on ‘good registry practices’ entitled Registries for Evaluating Patient Outcomes: A User's Guide, 3rd Edition, which provides methodological guidance on planning, design, implementation, analysis, interpretation and evaluation of the quality of a registry. There is a dedicated section for linkage of registries to other data sources. The EU PARENT Joint Action developed methodological and governance guidelines to facilitate cross-border use of registries. 


Results obtained from analyses of registry data may be affected by the same biases as those of studies described in Chapter 5.2 Bias and confounding. Registries are particularly sensitive to selection bias. The factors that influence the enlistment of patients in a registry may be numerous (including clinical, demographic and socio-economic factors) and difficult to predict and identify, potentially resulting in a biased sample of the patient population if recruitment has not been exhaustive. In addition, studies that use registry data may also introduce selection bias in the recruitment or selection of registered patients for the specific study, as well as in the differential completeness of follow-up and data collection. It is therefore important to systematically compare the characteristics of the study population with those of the source population.
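One possible way to make such a comparison systematic is to compute a standardised difference for each baseline characteristic. The sketch below does this for binary characteristics expressed as proportions. The characteristic names, the proportions and the 0.1 flagging threshold are assumptions made for illustration (0.1 is a common rule of thumb, not a requirement of this guide).

```python
import math


def standardised_difference(p_study: float, p_source: float) -> float:
    """Standardised difference between two proportions (study minus source)."""
    pooled_variance = (p_study * (1 - p_study) + p_source * (1 - p_source)) / 2
    if pooled_variance == 0:
        # Both proportions are exactly 0 or 1, so the usual formula is undefined.
        if p_study == p_source:
            return 0.0
        return math.copysign(math.inf, p_study - p_source)
    return (p_study - p_source) / math.sqrt(pooled_variance)


def flag_imbalances(study: dict, source: dict, threshold: float = 0.1) -> dict:
    """Return characteristics whose absolute standardised difference exceeds the threshold."""
    flagged = {}
    for name, p_study in study.items():
        d = standardised_difference(p_study, source[name])
        if abs(d) > threshold:
            flagged[name] = round(d, 3)
    return flagged


# Hypothetical proportions in the registry-based study and in the source population.
study = {"female": 0.60, "age_over_65": 0.31, "diabetes": 0.12}
source = {"female": 0.40, "age_over_65": 0.30, "diabetes": 0.12}
print(flag_imbalances(study, source))
```

A flagged characteristic does not prove selection bias, but it indicates where the study population departs from the source population and where sensitivity analyses may be warranted.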


The randomised registry trial is a new study design that combines the robustness of randomised studies with the higher generalisability of registry data, see Chapter 5.6.3.


4.3.4. Registries which capture special populations


In assessing both safety and effectiveness, special populations can be identified based on age (e.g., paediatric or elderly), pregnancy status, renal or hepatic function, race, or genetic differences. Some registries are focused on these particular populations. Examples of these are the birth registries in Nordic countries. 


The FDA’s Guidance for Industry-Establishing Pregnancy Exposure Registries advises on good practice for designing a pregnancy registry with a description of research methods and elements to be addressed. The Systematic overview of data sources for drug safety in pregnancy research provides an inventory of pregnancy exposure registries and alternative data sources on safety of prenatal drug exposure and discusses their strengths and limitations. Examples of population-based registers that allow assessment of the outcomes of drug exposure during pregnancy are the European network of registries for the epidemiologic surveillance of congenital anomalies EUROCAT, and the pan-Nordic registries which record drug use during pregnancy, as illustrated in Selective serotonin reuptake inhibitors and venlafaxine in early pregnancy and risk of birth defects: population based cohort study and sibling design (BMJ 2015;350:h1798).


For paediatric populations, specific and detailed information such as neonatal age (e.g. in days, not just in years), pharmacokinetic parameters and organ maturation needs to be considered and is usually missing from the classical data sources; paediatric-specific registries are therefore important. The CHMP Guideline on Conduct of Pharmacovigilance for Medicines Used by the Paediatric Population provides further relevant information. An example of a registry which focuses on paediatric patients is Pharmachild, which captures children with juvenile idiopathic arthritis undergoing treatment with methotrexate or biologic agents.

Other registries that focus on special populations (e.g., the UK Renal Registry) can be found in the ENCePP Inventory of data sources.


4.3.5. Disease registries in regulatory practice and health technology assessment


Annex 1 of Module VIII of the Good pharmacovigilance practice provides guidance on the use of patient registries for regulatory purposes. It emphasises that the choice of the registry population and the design of the registry should be driven by its objective(s) in terms of outcomes to be measured and analyses and comparisons to be performed. As existing disease registries gather insights into the natural history and clinical aspects of diseases and allow comparison of outcomes between different treatments prescribed for the same indication, they are generally preferred to product registries for regulatory purposes. Module VIII also acknowledges that, due to their observational nature, registries should not normally be used to demonstrate efficacy in a real-world setting, although in some cases (such as rare disease, rare exposure or special population), they may be the only opportunity to provide insight into effectiveness of a medicinal product. On the other hand, when efficacy has been demonstrated in randomised clinical trials (RCTs), registries may be useful to study effectiveness in heterogeneous populations and effect modifiers, such as doses that have been prescribed by physicians and that may differ from those used in RCTs, patient sub-groups defined by variables such as age, co-morbidities, use of concomitant medication or genetic factors, or factors related to a defined country or healthcare system that might influence effectiveness.


Incorporating data from clinical practice into the drug development process is of growing interest to health technology assessment (HTA) bodies and payers since reimbursement decisions can benefit from better estimation and prediction of effectiveness of treatments at the time of product launch. An example of where registries can provide clinical practice data is the building of predictive models that incorporate data from both RCTs and registries to bridge the efficacy-effectiveness gap, i.e. to generalise results observed in RCTs to a real-world setting. Collecting relevant HTA data in early development and planning post-authorisation data collection may therefore support rapid relative effectiveness assessment and decision-making on drug pricing and reimbursement. In this context, the EUnetHTA Joint Action 3 project has issued guidelines for the definition of the research questions and the choice of data sources and methodology that will support the generation of post-launch evidence.


4.4. Spontaneous reports


Spontaneous reports of adverse drug effects remain a cornerstone of pharmacovigilance and are collected from a variety of sources, including healthcare providers, national authorities, pharmaceutical companies, medical literature and more recently directly from patients. EudraVigilance is the European Union data processing network and management system for reporting and evaluation of suspected adverse drug reactions (ADRs). The Global Individual Case Safety Reports Database System (VigiBase) pools reports of suspected ADRs from the members of the WHO programme for international drug monitoring. These systems deal with the electronic exchange of Individual Case Safety Reports (ICSRs), the early detection of possible safety signals and the continuous monitoring and evaluation of potential safety issues in relation to reported ADRs. The report Characterization of databases (DB) used for signal detection (SD) of the PROTECT project shows the heterogeneity of spontaneous databases and the lack of comparability of SD methods employed. This heterogeneity is an important consideration when assessing the performance of SD algorithms.


The strength of spontaneous reporting systems is that they cover all types of legal drugs used in any setting. In addition to this, the reporting systems are built to obtain information specifically on potential adverse drug reactions and the data collection concentrates on variables relevant to this objective and directs reporters towards careful coding and communication of all aspects of an ADR. The increase in systematic collection of ICSRs in large electronic databases has allowed the application of data mining and statistical techniques for the detection of safety signals. There are known limitations of spontaneous ADR reporting systems, which include limitations embedded in the concept of voluntary reporting, whereby known or unknown external factors may influence the reporting rate and data quality. ICSRs may be limited in their utility by a lack of data for an accurate quantification of the frequency of events or the identification of possible risk factors for their occurrence. For these reasons, the concept is now well accepted that any signal from spontaneous reports needs to be verified clinically before further communication.


One challenge in spontaneous report databases is report duplication. Duplicates are separate and unlinked records that refer to one and the same case of a suspected ADR and may mislead clinical assessment or distort statistical screening. They are generally detected by individual case review of all reports or by computerised duplicate detection algorithms. In Performance of probabilistic method to detect duplicate individual case safety reports (Drug Saf 2014;37(4):249-58) a probabilistic method highlighted duplicates that had been missed by a rule-based method and also improved the accuracy of manual review. In the study, however, a demonstration of the performance of de-duplication methods to improve signal detection is lacking.
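To make the idea of computerised duplicate detection concrete, the sketch below scores pairs of reports by field agreement and flags pairs above a threshold for manual review. It is a deliberately simplified illustration and not the probabilistic method evaluated in the cited study: the field names, weights and threshold are hypothetical, whereas a real system would estimate them from data.

```python
from itertools import combinations

# Hypothetical weights: fields that rarely agree by chance count for more.
WEIGHTS = {
    "drug": 2.0,
    "reaction": 2.0,
    "onset_date": 3.0,
    "age": 1.0,
    "sex": 0.5,
    "country": 0.5,
}


def match_score(a: dict, b: dict) -> float:
    """Add the weight for each agreeing field and subtract it for each disagreeing field."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        value_a, value_b = a.get(field), b.get(field)
        if value_a is None or value_b is None:
            continue  # a missing value neither supports nor contradicts a match
        score += weight if value_a == value_b else -weight
    return score


def candidate_duplicates(reports: list, threshold: float = 6.0) -> list:
    """Return pairs of report ids whose score reaches the threshold."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(reports, 2)
        if match_score(a, b) >= threshold
    ]


# Hypothetical reports: 1 and 2 describe the same case, received from two senders.
reports = [
    {"id": 1, "drug": "X", "reaction": "rash", "onset_date": "2014-01-05",
     "age": 54, "sex": "F", "country": "SE"},
    {"id": 2, "drug": "X", "reaction": "rash", "onset_date": "2014-01-05",
     "age": 54, "sex": "F", "country": None},
    {"id": 3, "drug": "X", "reaction": "rash", "onset_date": "2014-03-01",
     "age": 30, "sex": "M", "country": "SE"},
]
print(candidate_duplicates(reports))
```

Comparing every pair of reports scales quadratically, so a production system would typically restrict comparisons to reports that share at least one key field.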


Validation of statistical signal detection procedures in EudraVigilance post-authorisation data: a retrospective evaluation of the potential for earlier signalling (Drug Saf 2010;33: 475 – 87) has shown that the statistical methods applied in EudraVigilance can provide significantly earlier warning in a large proportion of drug safety problems. Nonetheless, this approach should supplement, rather than replace, other pharmacovigilance methods.
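As an illustration of what a statistical screening method computes, the sketch below calculates the proportional reporting ratio (PRR), one widely used disproportionality measure, with an approximate 95% confidence interval from a 2x2 table of report counts. The choice of PRR and the counts are assumptions made for this example; the text above does not prescribe a particular statistic or threshold.

```python
import math


def proportional_reporting_ratio(a: int, b: int, c: int, d: int):
    """PRR and approximate 95% CI from a 2x2 table of spontaneous reports.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with all other drugs and the event of interest
    d: reports with all other drugs and any other event
    """
    if a == 0 or c == 0:
        raise ValueError("PRR is undefined when a or c is zero")
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(PRR), the usual large-sample approximation.
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(prr) - 1.96 * se_log)
    upper = math.exp(math.log(prr) + 1.96 * se_log)
    return prr, lower, upper


# Hypothetical counts from a spontaneous reporting database.
prr, lower, upper = proportional_reporting_ratio(a=20, b=980, c=100, d=98900)
print(f"PRR = {prr:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

A raised PRR only indicates disproportionate reporting; it quantifies neither incidence nor causality, which is why statistical screening supplements rather than replaces clinical assessment.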


Chapters IV and V of the Report of the CIOMS Working Group VIII ‘Practical aspects of Signal detection in Pharmacovigilance’ present sources and limitations of spontaneously-reported drug-safety information and databases that support signal detection. Appendix 3 of the report provides a list of international and national spontaneous reporting system databases.


4.5. Social media


4.5.1. Definition


Technological advances have dramatically increased the range of data sources that can be used to complement traditional ones and may provide compelling insights into effectiveness and safety of interventions. Such data include digital media that exist in a computer-readable format, such as websites, web pages, blogs, vlogs, social networking sites, internet forums, chat rooms and health portals. A recent addition to this list is the biomedical data collected through wearable technology (e.g., heart rate, physical activity, sleep pattern and dietary patterns). These data are unsolicited and generated in real time.


Social media is considered a subset of digital media. The European Commission’s Digital Single Market Glossary defines social media as “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content. It employs mobile and web-based technologies to create highly interactive platforms via which individuals and communities share, co-create, discuss, and modify user-generated content.”


4.5.2. Use in pharmacovigilance


Social media has been used to provide insights into the patient’s perception of the effectiveness of drugs and for the collection of patient reported outcomes, as discussed in Web-based patient-reported outcomes in Drug Safety and risk management: challenges and opportunities? (Drug Saf 2012;35(6):437-46).


Another possible use of social media is in the signal detection process. In this setting, it would add value only if more issues are identified or they are identified faster, but there is currently no evidence this is the case. Using Social Media Data in Routine Pharmacovigilance: A Pilot Study to Identify Safety Signals and Patient Perspectives (Pharm Med 2017;31(3): 167-74) explores whether analysis of social media data could identify new signals, known signals from routine pharmacovigilance, known signals sooner, and specific issues (i.e., quality issues and patient perspectives). This study also tried to determine the quantity of posts with resemblance to adverse events and the types and characteristics of products that would benefit from social media analysis. It concludes that, although analysis of data from social media did not identify new safety signals, it can provide unique insight into the patient perspective. Assessment was limited by numerous factors, such as data acquisition, language, and demographics. Further research is deemed necessary to determine the best uses of social media data to augment traditional pharmacovigilance surveillance.


From a regulatory perspective, social media is a source of potential reports of suspected adverse drug reactions and marketing authorisation holders are legally obliged to screen web sites under their management and assess whether reports of adverse reactions qualify for spontaneous reporting (see Good Pharmacovigilance practice Module VI (Rev. 2), Chapter VI.B.1.1.4).


4.5.3. Challenges


While offering the promise of new research models and approaches, the rapidly evolving social media environment presents many challenges including the need for strong and systematic processes for selection, validation and study implementation. Articles which detail associated challenges are: Evaluating Social Media Networks in Medicines Safety Surveillance: Two Case Studies (Drug Saf 2015; 38(10): 921-30.) and Social media and pharmacovigilance: A review of the opportunities and challenges (Br J Clin Pharmacol 2015; 80(4): 910-20).


There is currently no defined strategy or framework in place to meet the standards around data validity and generalisability for this type of data, and their regulatory acceptance may therefore be lower than for traditional sources. However, more tools and solutions for analysing unstructured data are becoming available, especially for pharmacoepidemiology and drug safety research, as in Deep learning for pharmacovigilance: recurrent neural network architectures for labeling adverse drug reactions in Twitter posts (J Am Med Inform Assoc 2017 Feb 22) and Social Media Listening for Routine Post-Marketing Safety Surveillance (Drug Saf 2016;39(5):443-54).


4.5.4. Data protection


The EU General Data Protection Regulation (GDPR) introduces EU-wide legislation on personal data and security. It specifies that the impact on data protection should be assessed at the time of the study design concept and reviewed periodically. Other technical documents may also be applicable such as Smartphone Secure Development Guidelines (2011) published by the European Network and Information Security Agency (ENISA), which advises on design and technical solutions. The principles of these security measures are found in the European Data Protection Supervisor (EDPS) opinion on mobile health (Opinion 1/2015 Mobile Health-Reconciling technological innovation with data protection).


4.6. Research networks


4.6.1. General considerations


Pooling data across different databases can increase the precision and generalisability of the results. A growing number of studies use data from networks of databases, often from different countries. Some of these networks are based on long-term contracts with selected partners and are very well structured (such as Sentinel, the Vaccine Safety Datalink (VSD) or the Canadian Network for Observational Drug Effect Studies (CNODES)), but others are looser collaborations based on an open community principle (e.g. Observational Health Data Sciences and Informatics (OHDSI)). In Europe, collaborations for multi-database studies have been strongly encouraged by the drug safety research funded by the European Commission (EC) and public-private partnerships such as the Innovative Medicines Initiative (IMI). This funding resulted in the conduct of groundwork necessary to overcome the hurdles of data sharing across countries for specific projects (e.g. PROTECT, ADVANCE, EMIF) or for specific post-authorisation studies.


Networking implies collaboration between investigators for sharing expertise and resources. The ENCePP Database of Research Resources may facilitate such networking by providing an inventory of research centres and data sources that can collaborate on specific pharmacoepidemiology and pharmacovigilance studies in Europe. It allows the identification of centres and data sets by country, type of research and other relevant fields.


From a methodological point of view, research networks have many advantages over single database studies:

  • Research networks increase the size of study populations and shorten the time needed for obtaining the desired sample size. Hence, they can facilitate research on rare events and speed up the investigation of drug safety issues.

  • The heterogeneity of treatment options across countries allows studying the effect of different drugs used for the same indication.

  • Research networks may provide additional knowledge on whether a drug safety issue exists in several countries (and thereby reveal causes of differential drug effects), on the generalisability of results, on the consistency of information and on the impact of biases on estimates.

  • Involvement of experts from various countries addressing case definitions, terminologies, coding in databases and research practices provides opportunities to increase consistency of results of observational studies.

  • Sharing of data sources facilitates harmonisation of data elaboration and transparency in analyses and benchmarking of data management.

  • The potential for pooling data or results maximises the amount of information gathered for a specific issue addressed in different databases.

Different models have been applied for combining data or results from multiple databases. A common characteristic of all models is the fact that data partners maintain physical and operational control over electronic data in their existing environment. Differences however exist in the following areas: use of a common protocol; use of a common data model; and use of common data transformation analytics. 


Use of a common data model (CDM) implies that local formats are translated into a predefined, common data structure, which allows launching a similar data transformation script across several databases. The CDM can be systematically applied on the entire database (generalised CDM) or on the subset of data needed for a specific study (study-specific CDM). In the EU, study-specific CDMs have generated results in several projects and studies. Initial steps have been taken to create generalised CDMs, but experience based on real-life studies is lacking.
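The idea of translating local formats into a common structure, so that one script can run unchanged on every database, can be sketched as follows. This is a minimal illustration only: the `Exposure` record, the two site formats and all field names are hypothetical and do not correspond to any existing CDM.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Exposure:
    """Hypothetical study-specific CDM record: one row per drug exposure."""
    patient_id: str
    atc_code: str
    start: date


def from_site_a(row: dict) -> Exposure:
    # Hypothetical site A: ISO dates, ATC code in an "atc" column.
    return Exposure(row["pid"], row["atc"], date.fromisoformat(row["rx_date"]))


def from_site_b(row: dict) -> Exposure:
    # Hypothetical site B: day/month/year strings, numeric patient identifiers.
    day, month, year = (int(part) for part in row["dispensed"].split("/"))
    return Exposure(str(row["patient"]), row["drug_code"], date(year, month, day))


def count_users(exposures: list[Exposure], atc_prefix: str) -> int:
    # The common script: distinct patients exposed to a given ATC class.
    return len({e.patient_id for e in exposures if e.atc_code.startswith(atc_prefix)})


# Made-up local records, translated once into the common structure.
site_a = [from_site_a(r) for r in [
    {"pid": "A1", "atc": "N06AB03", "rx_date": "2015-03-01"},
    {"pid": "A1", "atc": "N06AB03", "rx_date": "2015-04-01"},
    {"pid": "A2", "atc": "C10AA01", "rx_date": "2015-03-15"},
]]
site_b = [from_site_b(r) for r in [
    {"patient": 17, "drug_code": "N06AB06", "dispensed": "02/05/2015"},
    {"patient": 18, "drug_code": "N06AB03", "dispensed": "11/06/2015"},
]]

# The same function runs on both sites without knowing their local formats.
for name, data in (("site A", site_a), ("site B", site_b)):
    print(name, count_users(data, "N06AB"))
```

Only the two `from_site_*` functions depend on the local database; everything downstream is shared. In a study-specific CDM the translation covers only the variables needed for one study, whereas a generalised CDM would apply it to the entire database.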


4.6.2. Models of studies using multiple data sources


Local data extraction and analysis, separate protocols

The traditional way to combine data from multiple data sources is to perform data extraction and analysis independently at each centre, based on separate protocols. This is usually followed by meta-analysis of the different estimates obtained (see Chapter 5.7).


Local data extraction and analysis, common protocol


In this option, data are extracted and analysed locally on the basis of a common protocol. Definitions of exposure, outcomes and covariates, analytical programmes and reporting formats are standardised according to a common protocol and the results of each analysis are shared in an aggregated format and pooled together through meta-analysis. This approach allows assessment of database or population characteristics and their impact on estimates but reduces variability of results determined by differences in design. Examples of research networks that use the common protocol approach are PROTECT (as described in Improving Consistency and Understanding of Discrepancies of Findings from Pharmacoepidemiological Studies: the IMI PROTECT Project, Pharmacoepidemiol Drug Saf 2016;25(S1):1-165) and the Canadian Network for Observational Drug Effect Studies (CNODES).
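As an illustration of how aggregated results from each database might be pooled, the sketch below applies fixed-effect inverse-variance weighting to log rate ratios and reports Cochran's Q and the I² statistic as simple heterogeneity measures. The guide does not prescribe this particular method; it is one common choice, and the function name and the input numbers are hypothetical.

```python
import math


def pool_fixed_effect(estimates):
    """Pool (log_ratio, standard_error) pairs, one pair per database.

    Returns the pooled log ratio, its standard error, Cochran's Q and I^2.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, (b, _) in zip(weights, estimates))
    df = len(estimates) - 1
    # I^2: share of total variation attributed to between-database heterogeneity.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i2


# Made-up aggregated results shared by three databases (rate ratio, SE of log RR).
site_results = [
    (math.log(1.20), 0.10),
    (math.log(1.35), 0.15),
    (math.log(0.95), 0.20),
]

pooled, se, q, i2 = pool_fixed_effect(site_results)
low, high = math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)
print(f"Pooled RR {math.exp(pooled):.2f} (95% CI {low:.2f}-{high:.2f})")
print(f"Cochran's Q {q:.2f}, I^2 {100 * i2:.0f}%")
```

Because only the estimate and its standard error leave each site, no patient-level data need to be shared. A large I² would suggest that the database-specific estimates should be examined separately rather than summarised by a single pooled value.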


This approach requires very detailed common protocols and data specifications that reduce variability in interpretations by researchers. Multi-centre, multi-database studies with common protocols: lessons learnt from the IMI PROTECT project (Pharmacoepidemiol Drug Saf 2016;25(S1):156-165) states that a priori pooling of data from several databases may disguise heterogeneity that may provide useful information on the safety issue under investigation. On the other hand, parallel analysis of databases allows exploring reasons for heterogeneity through extensive sensitivity analyses. This approach eventually increases consistency in findings from observational drug effect studies or reveals causes of differential drug effects.


Local data extraction and central analysis, study-specific common data model


Data can also be extracted from local databases using a study-specific, database-tailored extraction into a CDM and pre-processed locally. The resulting data can be transmitted to a central data warehouse as patient-level data or aggregated data for further analysis. Examples of research networks that used this approach by employing a study-specific CDM with transmission of anonymised patient-level data (allowing a detailed characterisation of each database) are EU-ADR (as explained in Combining multiple healthcare databases for postmarketing drug and vaccine safety surveillance: why and how?, J Intern Med 2014;275(6):551-61), SOS, ARITMO, SAFEGUARD, GRIP, EMIF, EUROmediCAT and ADVANCE. In all these projects, a basic and simple common data model was utilised and R, SAS, STATA or Jerboa scripts have been used to create and share common analytics. Diagnosis codes for case finding can be mapped across terminologies by using CodeMapper, developed in the ADVANCE project, as explained in CodeMapper: semiautomatic coding of case definitions (Pharmacoepidemiol Drug Saf 2017;26(8):998-1005).
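To show what mapping a case definition across terminologies means in practice, the following sketch uses a small hand-written lookup table. It is not the CodeMapper tool or its output: the table, the function name and the prefix-matching rule are assumptions made for illustration, and the codes listed are examples rather than a validated case definition.

```python
# Illustrative lookup only: one clinical concept expressed in three terminologies.
# A real study would use a reviewed and validated code list per database.
CASE_DEFINITION_CODES = {
    "type_2_diabetes": {
        "ICD10": {"E11"},
        "ICPC2": {"T90"},
        "READ": {"C10F"},
    }
}


def is_case(record_code: str, terminology: str, concept: str) -> bool:
    """Return True if a locally recorded code falls under the concept.

    Assumption: a recorded code matches when it starts with one of the listed
    codes once dots are removed, so "E11.9" matches "E11".
    """
    prefixes = CASE_DEFINITION_CODES[concept].get(terminology, set())
    normalised = record_code.replace(".", "").upper()
    return any(normalised.startswith(prefix) for prefix in prefixes)


# Each database applies the same concept using its own terminology.
print(is_case("E11.9", "ICD10", "type_2_diabetes"))
print(is_case("T90", "ICPC2", "type_2_diabetes"))
print(is_case("E10.9", "ICD10", "type_2_diabetes"))
```

Keeping the concept name separate from the terminology-specific codes lets each data partner run the same case-finding step while the code lists are reviewed independently for each coding system.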


An approach to quantify the impact of different case finding algorithms, called the component strategy, was developed in the EMIF and ADVANCE projects and could also be compatible with the simple and generalised common data model (see Identifying Cases of Type 2 Diabetes in Heterogeneous Data Sources: Strategy from the EMIF Project. PLoS One 2016;11(8):e0160648).


Local data extraction and central analysis, generalised common data model


Two examples of research networks which use a generalised CDM are the Sentinel Initiative (as described in The U.S. Food and Drug Administration's Mini-Sentinel Program, Pharmacoepidemiol Drug Saf 2012;21(S1):1–303) and OHDSI. The main advantage of a generalised CDM is that it can be used for virtually any study involving that database. OHDSI is based on the Observational Medical Outcomes Partnership (OMOP) CDM, which is now used by many organisations and has been tested for its suitability for safety studies (see for example Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc 2012;19(1):54–60). Conversion into the OMOP CDM requires formal mapping of database items to standardised concepts. This is resource intensive and will need to be conducted every time the database is updated.


In A Comparative Assessment of Observational Medical Outcomes Partnership and Mini-Sentinel Common Data Models and Analytics: Implications for Active Drug Safety Surveillance (Drug Saf 2015;38(8):749-65), it is suggested that slight conceptual differences between the Sentinel and the OMOP models do not significantly affect the identification of known safety associations. Differences in risk estimations can be primarily attributed to the choices and implementation of the analytic approach.


Local data extraction and central analysis, common protocol


For some studies, it has been possible to centrally analyse patient-level data extracted on the basis of a common protocol, such as in Selective serotonin reuptake inhibitors during pregnancy and risk of persistent pulmonary hypertension in the newborn: population based cohort study from the five Nordic Countries (BMJ 2012;344:d8012). If databases are very similar in structure and content, as is the case for some Nordic registries, a CDM might not be required for data extraction. The central analysis removes an additional source of variability linked to the statistical programming and analysis.


4.6.3. Challenges of different models


The models described above present many challenges:


Related to the scientific content

  • Differences in the underlying health care systems and mechanisms of data generation and collection.

  • Mapping of differing disease coding systems (e.g., the International Classification of Diseases, 10th Revision (ICD-10), Read codes, the International Classification of Primary Care (ICPC-2)) and narrative medical information in different languages.

  • Validation of study variables and access to source documents for validation.

Related to the organisation of the network

  • Differences in culture and experience between academia, public institutions and private partners.

  • Differences in the type and quality of information contained within each mapped database.

  • Different ethical and governance requirements in each country regarding processing of anonymised or pseudonymised healthcare data.

  • Choice of data sharing model and access rights of partners.

  • Issues linked to intellectual property and authorship.

  • Sustainability and funding mechanisms.

Each model has strengths and weaknesses in facing the above challenges, as illustrated in Data Extraction and Management in Networks of Observational Health Care Databases for Scientific Research: A Comparison of EU-ADR, OMOP, Mini-Sentinel and MATRICE Strategies (EGEMS 2016 Feb). Experience has shown that many of these difficulties can be overcome by full involvement and good communication between partners, and a project agreement between network members defining roles and responsibilities and addressing issues of intellectual property and authorship. Several of the networks, such as OMOP, Sentinel and ADVANCE, have made their code, products and data models publicly available.


