Missing data (or missing values) are data values that are not stored for a variable in the observation of interest. Missing data are a common problem in most datasets and can substantially affect the conclusions drawn from the data, for the following reasons: 1) the absence of data reduces statistical power, that is, the probability that a test will reject the null hypothesis when it is false; 2) the lost data can bias the estimation of parameters; 3) missing data can reduce the representativeness of the sample; 4) missing data may complicate the analysis of the study. Each of these can lead to invalid conclusions.
Missing data are usually classified into three patterns (also called missingness mechanisms):
Missing completely at random (MCAR): there are no systematic differences between the missing values and the observed values.
Missing at random (MAR): any systematic difference between the missing values and the observed values can be explained by differences in observed data.
Missing not at random (MNAR): even after the observed data are taken into account, systematic differences remain between the missing values and the observed values.
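The practical difference between the three patterns can be illustrated with a small simulation. The sketch below is illustrative only: the function name simulate, the missingness probabilities and the data-generating model (a fully observed covariate x and a partially missing variable y) are assumptions chosen for this example, not part of any cited method. It compares the mean of y in the full data with the mean among the records where y is observed.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (full-data mean of y, mean of y among observed records)."""
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)        # fully observed covariate
        y = x + rng.gauss(0, 1)    # variable that may be missing
        if mechanism == "MCAR":
            p_missing = 0.3                      # unrelated to x and y
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1    # depends only on observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1    # depends on y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


for mechanism in ("MCAR", "MAR", "MNAR"):
    full, observed = simulate(mechanism)
    print(f"{mechanism}: full mean={full:.3f}, observed-only mean={observed:.3f}")
```

In this setup the observed-only mean should stay close to the full-data mean under MCAR and drift away from it under MAR and MNAR. Under MAR the drift is driven entirely by the observed x, so it could in principle be corrected by conditioning on x; under MNAR it could not be corrected from the observed data alone.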
Complete case analysis, which removes the records with missing data, is valid only in certain circumstances (e.g. if the missing data are MCAR). Therefore, it is advised to use statistical methods to impute missing data. The choice of method will depend on the pattern of missing data. In general, it is desirable to show that conclusions drawn from the data are not sensitive to the particular method used to handle missing values. To investigate this, it may be helpful to repeat the analysis with a variety of statistical approaches.
A concise review of methods to handle missing data is provided in the section ‘Missing data’ of the Encyclopedia of Epidemiologic Methods (Gail MH, Benichou J, Editors. Wiley 2000) and in the book Statistical analysis with missing data (Little RJA, Rubin DB. 2nd ed., Wiley 2002). The section ‘Handling of missing values’ in Modern Epidemiology, 3rd ed. (K. Rothman, S. Greenland, T. Lash. Lippincott Williams & Wilkins, 2008) is a summary of the state of the art, focused on practical issues for epidemiologists. Other useful references on handling missing data include the books Multiple Imputation for Nonresponse in Surveys (Rubin DB, Wiley, 2004) and Analysis of Incomplete Multivariate Data (Schafer JL, Chapman & Hall/CRC, 1997), and the articles Using the outcome for imputation of missing predictor values was preferred (J Clin Epidemiol. 2006;59(10):1092-101), Recovery of information from multiple imputation: a simulation study (Emerg Themes Epidemiol. 2012;9(1):3) and Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data (Stat Med. 2014;33(21):3725-37).
Another method commonly used in epidemiology is to create a category of the variable, or an indicator, for the missing values. This practice can be invalid even if the data are missing completely at random and should be avoided (see Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression. J Am Stat Assoc. 1996;91(433):222-30).
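The problem with the missing-indicator approach can be seen in a simple simulated example. The sketch below is a hypothetical illustration, not a reproduction of the cited paper: the helper names ols and exposure_effects, the sample size and the data-generating model (a confounder c missing completely at random in half of the records, an exposure x correlated with c, and a true exposure effect of 1.0) are all assumptions made for this example. It compares the exposure coefficient from a complete case regression with that from a regression in which missing confounder values are set to 0 and a missingness indicator is added.

```python
import random


def ols(rows, ys):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def exposure_effects(n=20000, seed=0):
    """Return the x coefficient from (complete case, missing-indicator) models."""
    rng = random.Random(seed)
    cc_rows, cc_y, ind_rows, ind_y = [], [], [], []
    for _ in range(n):
        c = rng.gauss(0, 1)                    # confounder
        x = c + rng.gauss(0, 1)                # exposure, correlated with c
        y = 1.0 * x + 1.0 * c + rng.gauss(0, 1)  # true effect of x is 1.0
        missing = rng.random() < 0.5           # c is MCAR in about half the records
        if not missing:
            cc_rows.append([1.0, x, c])
            cc_y.append(y)
        # Missing-indicator coding: c set to 0 when missing, plus a flag column
        ind_rows.append([1.0, x, 0.0 if missing else c, 1.0 if missing else 0.0])
        ind_y.append(y)
    return ols(cc_rows, cc_y)[1], ols(ind_rows, ind_y)[1]


complete_case, indicator = exposure_effects()
print(f"complete case: {complete_case:.3f}, missing indicator: {indicator:.3f}")
```

In this setup the complete case estimate should be close to the true value of 1.0, while the missing-indicator estimate is pulled upwards because the records with a missing confounder contribute an unadjusted, confounded association.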
A wide range of statistical software is available to impute missing data, mainly focusing on Multiple Imputation (MI) when missing data are assumed to be MAR, such as The MI Procedure of the SAS Institute, Multiple imputation of missing values (Stata J. 2004;4:227-41) and mice: Multivariate Imputation by Chained Equations in R (J Stat Soft. 2011;45(3)).
A good overview of available software packages is provided in Missing data: A statistical framework for practice (Biom J. 2021;63(5):915-47).
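To make the logic of multiple imputation more concrete, the following minimal sketch imputes a partially missing variable several times and pools the results with Rubin's rules. It is a simplified teaching example and not the algorithm implemented in the packages cited above: the function names (fit_line, impute_once, pool_mean, make_mar_data), the simulated MAR data and the use of a bootstrap refit to reflect uncertainty in the imputation model are all assumptions made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sd


def impute_once(xs, ys, rng):
    """One completed copy of ys (None = missing) from a bootstrap-refitted model."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    boot = [rng.choice(complete) for _ in complete]  # resample complete cases
    a, b, sd = fit_line([x for x, _ in boot], [y for _, y in boot])
    # Imputed values include random residual noise, not just the prediction
    return [y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)]


def pool_mean(xs, ys, m=20, seed=0):
    """Rubin's rules for the mean of y: (pooled estimate, total variance)."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(m):
        filled = impute_once(xs, ys, rng)
        estimates.append(statistics.fmean(filled))
        variances.append(statistics.variance(filled) / len(filled))
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance
    return statistics.fmean(estimates), within + (1 + 1 / m) * between


def make_mar_data(n=5000, seed=1):
    """Simulated data: y has true mean 2 and is MAR given the observed x."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 2 + x + rng.gauss(0, 1)
        missing = rng.random() < (0.6 if x > 0 else 0.1)
        xs.append(x)
        ys.append(None if missing else y)
    return xs, ys


xs, ys = make_mar_data()
complete_case_mean = statistics.fmean(y for y in ys if y is not None)
pooled, total_var = pool_mean(xs, ys)
print(f"complete case mean: {complete_case_mean:.3f}")
print(f"MI pooled mean: {pooled:.3f} (SE {total_var ** 0.5:.3f})")
```

In this simulated example the complete case mean should fall clearly below the true mean of 2, whereas the pooled estimate should be close to it. Running both analyses side by side is also a simple instance of repeating the analysis with more than one approach to check the sensitivity of the conclusions.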