We are honoured this month to have a guest blog from Nidhi Menon.

Nidhi is a Biostatistician within BDSI, supporting the Health Analytics Research Centre (HARC), a collaboration between ACT Health and its academic partners focused on health data science, research methods, and both qualitative and quantitative analytics.

Her PhD centred on the problem of missing data and the role of multiple imputation in addressing it. We hope her valuable insights will help you appreciate the complexity of the problem.

Multiple imputation (MI) has recently become an extremely popular approach to handling missing data. One big reason for this is that once missing values have been imputed and the imputed datasets generated, these can be analysed with standard statistical methods and the estimates pooled using Rubin's rules. Additionally, with MI now built into statistical software, imputation has become an easy solution for missing data! Multiple imputation can be very useful for handling missing values if done correctly; it is equally dangerous if performed incorrectly. In this blog, we touch on six key factors to bear in mind when performing multiple imputation.
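
The pooling step is simple enough to sketch in a few lines. A minimal illustration of Rubin's rules for pooling one coefficient across m imputed datasets, where the estimates and variances below are made-up numbers standing in for the output of the same analysis run on each dataset:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m per-imputation point estimates and variances via Rubin's rules."""
    m = len(estimates)
    q_bar = float(np.mean(estimates))      # pooled point estimate
    w = float(np.mean(variances))          # within-imputation variance
    b = float(np.var(estimates, ddof=1))   # between-imputation variance
    t = w + (1 + 1 / m) * b                # total variance of the pooled estimate
    return q_bar, t

# Hypothetical estimates of one coefficient from m = 5 imputed datasets
pooled_est, pooled_var = pool_rubin([1.2, 1.0, 1.1, 0.9, 1.3],
                                    [0.04, 0.05, 0.04, 0.06, 0.05])
```

The total variance combines the average uncertainty within each imputed dataset with the disagreement between them, which is what makes the pooled intervals honest about the missing information.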

**1) Mechanism of Missingness**

Broadly, missing data mechanisms can be categorised as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Data are said to be missing completely at random (MCAR) if the probability of a value being missing is unrelated to both the observed and the missing data for that unit: regardless of the observed values, the probability of missingness is the same for all units. Data are said to be missing at random (MAR) if the probability of missingness depends on the observed values but not on the missing values; the standard implementation of MI rests on this MAR assumption. Finally, if the MAR assumption is violated, the data are said to be missing not at random (MNAR).
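
These mechanisms are easy to illustrate by simulation. A hypothetical sketch, where `age` is fully observed and `income` is subject to missingness (all variable names and parameters here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.normal(50, 10, n)                      # fully observed covariate
income = 20 + 0.5 * age + rng.normal(0, 5, n)    # variable subject to missingness

# MCAR: the same missingness probability for every unit
mcar_mask = rng.random(n) < 0.3

# MAR: missingness probability depends only on the observed age
mar_prob = 1 / (1 + np.exp(-(age - 50) / 5))
mar_mask = rng.random(n) < mar_prob

mean_all = income.mean()
mean_obs_mcar = income[~mcar_mask].mean()        # close to the overall mean
mean_obs_mar = income[~mar_mask].mean()          # systematically too low
```

Under MCAR the complete cases remain representative, whereas under MAR the observed cases skew towards younger (lower-income) units, which is why a complete-case mean is biased there while MI, using the observed `age`, can recover the relationship.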

If the data are MCAR, then both MI and available-case analysis are valid methods of analysis and produce unbiased results. If the data are MAR, then MI is a better approach than available-case analysis, yielding negligible bias in estimates.

**2) Structure of the Imputation Model**

The validity of inference from multiple imputation is compromised when the analysis model is *uncongenial* to the imputation model. *Uncongeniality* simply means a lack of consistency between the analysis model and the imputation model; this inconsistency arises when the imputer and the analyst have access to different information (Meng, 1994). For MI to generate valid results, the imputations must be obtained wisely, and the most challenging step is choosing the right model to produce them, referred to as the imputation model. The imputation model should also reflect the structure of the data: for example, if the data are multilevel or longitudinal in nature, both the imputation and the analysis model should incorporate that structure.

**3) Selecting Predictors for the Imputation Model**

The target variable in the imputation model is the variable with the missing values, while the target variable in the analysis model is the outcome. The rule of thumb is that the imputation model should include all variables specified in the analysis model, including any interactions the analysis model implies, while excluding any terms derived from the target variable itself. Having the analysis model as a subset of the imputation model typically yields unbiased point estimates, though possibly wider interval estimates. Researchers should therefore ensure that the two models are congenial with each other and that the imputation model is larger than the analysis model.

**4) Imputing Derived Variables**

Obtaining plausible imputed values using MI gets tricky when the variables used in the analysis include squares, interactions, or logarithmic transformations of covariates. We refer to these transformed (linear or non-linear) variables as derived variables. There are two approaches to imputing missing values in derived variables: *transform-then-impute* and *impute-then-transform* (passive imputation). In *transform-then-impute*, we calculate the transformations first and then impute their missing values as we would for any other variable. However, one can never be certain that the relationship between an imputed variable and its imputed transformation will continue to hold. In *passive imputation*, we impute variables in their raw form and then transform the imputed values, which maintains consistency between a variable and its transformations. Von Hippel (2009) has challenged the importance of matching the shape of the distribution of observed and imputed data, arguing that as long as the imputations preserve the mean and the variance of the observed data, maintaining consistency in transformations may not be relevant. Currently, passive imputation remains the preferred method when handling derived variables (van Buuren, 2018).
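
The difference between the two bookkeeping strategies can be shown with a toy example, using simple mean imputation as a stand-in for a real imputation model (mean imputation is not recommended in practice; it just keeps the arithmetic easy to follow):

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])
x_sq = x ** 2                                 # derived variable, NaN where x is NaN

# transform-then-impute: treat x and x_sq as unrelated variables
x_tti = np.where(np.isnan(x), np.nanmean(x), x)              # imputed x = 7/3
x_sq_tti = np.where(np.isnan(x_sq), np.nanmean(x_sq), x_sq)  # imputed x_sq = 7
# consistency is broken: (7/3) ** 2 is about 5.44, not 7

# impute-then-transform (passive imputation): impute x, then re-derive x_sq
x_passive = np.where(np.isnan(x), np.nanmean(x), x)
x_sq_passive = x_passive ** 2                 # consistency holds by construction
```

With passive imputation the derived column can never contradict its source variable, since it is recomputed from the imputed values rather than imputed separately.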

**5) Number of Imputations**

The basic idea behind multiple imputation is to replace every missing observation with several, say *m*, plausible values. The choice of *m* has long been a point of contention among researchers. Rubin (1987) identified that a value of *m* between 2 and 10 is relatively efficient for a modest fraction of missing information. He also showed that the efficiency of the finite-*m* repeated-imputation estimator relative to the infinite-*m* repeated-imputation estimator is (1 + γ₀/m)^(−1/2), where γ₀ is the population fraction of missing information. Rubin (1987) illustrated that across different fractions of missing information (FMI), the large-sample coverage probability remains essentially constant for *m* between 2 and 5. This explains why the default number of imputations in most statistical packages is set to 5, and it has also produced the guideline that more than 10 imputations are rarely required.
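
Rubin's efficiency formula is easy to tabulate. A small sketch evaluating (1 + γ₀/m)^(−1/2) for a few values of the fraction of missing information:

```python
# Relative efficiency of an m-imputation estimator versus one based on
# infinitely many imputations: (1 + gamma0 / m) ** (-1/2), where gamma0
# is the population fraction of missing information.
def relative_efficiency(gamma0, m):
    return (1 + gamma0 / m) ** -0.5

for gamma0 in (0.1, 0.3, 0.5):
    row = ", ".join(f"m={m}: {relative_efficiency(gamma0, m):.3f}"
                    for m in (2, 5, 10))
    print(f"FMI={gamma0:.1f} -> {row}")
```

Even at 50% missing information, a handful of imputations already captures most of the attainable efficiency, which is the arithmetic behind the small defaults.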

**6) Method of Imputation**

Popular implementations of multiple imputation in software include Multiple Imputation by Chained Equations (MICE), also known as fully conditional specification (FCS), and joint modelling (JoMo). The MICE method imputes variables with missing values one at a time from a series of univariate conditional distributions. In contrast, the joint modelling method draws missing values simultaneously for all incomplete variables from a multivariate distribution. While both of these methods were originally proposed for cross-sectional data, numerous extensions of the original JoMo and FCS approaches have been proposed over the years for imputing in longitudinal and clustered study designs.
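
To make the chained-equations idea concrete, here is a bare-bones, hand-rolled sketch of an FCS-style cycle in NumPy. This is a deliberate simplification: a proper implementation (such as mice in R) would also draw the regression parameters from their posterior rather than reusing the least-squares fit.

```python
import numpy as np

def chained_equations(X, n_cycles=10, seed=0):
    """A very simplified FCS/MICE-style imputation of a numeric array."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.nonzero(miss)[1])  # start from a mean fill
    n, p = X.shape
    for _ in range(n_cycles):
        for j in range(p):                       # visit each incomplete column
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)     # current values of other columns
            A = np.column_stack([np.ones(n), others])
            obs = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std(ddof=1)
            # replace missing entries with predictions plus residual noise
            X[miss[:, j], j] = (A[miss[:, j]] @ beta
                                + rng.normal(0, sigma, miss[:, j].sum()))
    return X

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[:, 2] += X[:, 0]                               # induce correlation between columns
X[rng.random(X.shape) < 0.2] = np.nan            # make roughly 20% of values missing
X_imputed = chained_equations(X)
```

Running the function with different seeds yields the multiple imputed datasets that are then analysed and pooled; joint-modelling software replaces the inner column-by-column loop with a single draw from a multivariate model.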

**To Conclude:** *The idea of imputation is both seductive and dangerous* (Rubin, 1987). The inclusion of partially observed covariates through MI can lead to reduced bias and increased precision, but researchers are advised to be strategic and vigilant before proceeding with imputation. In this piece, I have outlined key aspects to consider before undertaking MI to improve imputations.

**References:**

Van Buuren, S. (2018). *Flexible Imputation of Missing Data*. CRC Press.

Meng, X. L. (1994). Multiple-imputation inferences with uncongenial sources of input. *Statistical Science*, 9(4), 538-558.

Rubin, D. B. (1987). *Multiple Imputation for Nonresponse in Surveys*. New York: John Wiley & Sons.

Von Hippel, P. T. (2009). How to impute interactions, squares, and other transformed variables. *Sociological Methodology*, 39(1), 265-291.