Original Article
Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: A clinical example

https://doi.org/10.1016/j.jclinepi.2006.01.015

Abstract

Background and Objectives

To illustrate the effects of different methods for handling missing data—complete case analysis, missing-indicator method, single imputation of unconditional and conditional mean, and multiple imputation (MI)—in the context of multivariable diagnostic research aiming to identify potential predictors (test results) that independently contribute to the prediction of disease presence or absence.

Methods

We used data from 398 subjects from a prospective study on the diagnosis of pulmonary embolism. Various diagnostic predictors or tests had (varying percentages of) missing values. Per method of handling these missing values, we fitted a diagnostic prediction model using multivariable logistic regression analysis.

Results

The receiver operating characteristic curve area for all diagnostic models was above 0.75. The predictors in the final models based on complete case analysis and on the missing-indicator method differed considerably from those in the other models. The models based on MI did not differ much from the models derived after single conditional and unconditional mean imputation.

Conclusion

In multivariable diagnostic research complete case analysis and the use of the missing-indicator method should be avoided, even when data are missing completely at random. MI methods are known to be superior to single imputation methods. For our example study, the single imputation methods performed equally well, but this was most likely because of the low overall number of missing values.

Introduction

Missing observations are frequently encountered and occur in all types of studies, no matter how strictly designed or how hard investigators try to prevent them. In diagnostic studies, as in other types of epidemiological studies including clinical trials and repeated measurement surveys, missing data often occur in a selective pattern. Patient referral for subsequent measurements, here diagnostic procedures, is commonly based on prior measurements, here prior test results, certainly when data are obtained from routine care. In diagnostic research this leads to the well-known referral (verification or work-up) bias [1]. Consider, for example, a study among children with neck stiffness. The aim was to quantify which diagnostic test results from patient history and physical examination predict the presence or absence of bacterial meningitis and which blood tests, e.g., leukocyte count or C-reactive protein level, have additional predictive value [2]. Patients who presented with severe signs, such as convulsions and high fever, were referred for additional blood testing more often and more quickly, before patient history and physical examination were fully completed. On the other hand, for patients presenting with very mild or no symptoms, additional tests were less often done because the physician had already ruled out serious disease early in the diagnostic process. Accordingly, the sample of study subjects with complete data did not represent the group as a whole, and subjects with missing data carried important information on the associations studied.

There are three types of missing data [3], [4]. When subjects with missing data form a random subset of the study sample (e.g., because a tube with blood material was accidentally broken), missing data are denoted as missing completely at random (MCAR). Whether missing data are MCAR can easily be tested in the data. When missing data occur in relation to observed covariables (such as selective work-up in diagnostic studies) or the outcome, the subjects with missing data are a selective rather than a completely random subset of the total study population. This pattern of missingness is confusingly called missing at random (MAR). When the reason for a missing value depends on unknown or unobserved information, the data are denoted as missing not at random (MNAR). Unfortunately, it is impossible to determine from the data whether missing data are MAR or MNAR; this can only be reasoned or speculated [3], [4].
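The MAR mechanism described above can be made concrete with a small simulation. The sketch below is purely illustrative and uses hypothetical variables (a severity score and a correlated blood-test result), not the study's data: the blood test is missing more often for mild patients, mirroring the meningitis example, so the complete cases over-represent severe patients and their mean is biased upward.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical observed severity score and a blood-test result correlated with it
severity = rng.normal(0.0, 1.0, n)
blood_test = 2.0 * severity + rng.normal(0.0, 1.0, n)

# MAR: the probability that blood_test is missing depends only on the
# *observed* severity (severe patients are tested more often, mild ones less)
p_missing = 1.0 / (1.0 + np.exp(2.0 * severity))  # high severity -> rarely missing
missing = rng.random(n) < p_missing

true_mean = blood_test.mean()           # mean in the full sample
observed_mean = blood_test[~missing].mean()  # complete-case mean

# Complete cases over-represent severe patients, so observed_mean > true_mean
print(round(true_mean, 2), round(observed_mean, 2))
```

The same data could not reveal an MNAR mechanism (e.g., missingness driven by the unobserved test value itself), which is why MAR versus MNAR can only be argued, not tested.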

Analysis of epidemiological data typically concerns associations between several predictors and an outcome variable using multivariable regression techniques. Most software packages by default exclude from the analysis every subject with at least one missing value on any of the analyzed predictors or the outcome. This is called complete case analysis, and it is the most common form of epidemiological analysis. When missing data are MCAR, complete case analysis obviously is inefficient but leads to unbiased associations. However, when missing data are not MCAR, which is commonly the case, it has been extensively argued and shown that complete case analysis is not only inefficient but commonly leads to biased results as well [3], [4], [5], [6], [7].
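The silent listwise deletion described above is easy to demonstrate. A minimal sketch with a hypothetical toy dataset (the variable names are invented for illustration, not taken from the study):

```python
import numpy as np
import pandas as pd

# Toy diagnostic dataset: two predictors with scattered missing values
df = pd.DataFrame({
    "d_dimer":    [1.2, np.nan, 0.8, 2.5, np.nan, 1.9],
    "heart_rate": [88, 102, np.nan, 110, 95, 120],
    "pe":         [0, 1, 0, 1, 0, 1],
})

# Complete case (listwise) analysis: drop every row with any missing value.
# This is what most regression routines do implicitly by default.
complete = df.dropna()

print(len(df), len(complete))  # 6 rows shrink to 3 usable cases
```

Note that even though each predictor is missing in only two of six subjects, half the sample is lost, because a single missing value anywhere excludes the whole row.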

Various methods have been proposed to deal with missing data. Among them is the missing-indicator method, which uses a dummy variable as an indicator for missing data [5], [8]. For multilevel and repeated-measurement analyses with missing values, maximum likelihood methods, such as the expectation-maximization (EM) algorithm, have been proposed. When predictors and outcomes are measured only once (as is common for diagnostic studies), imputation of missing values is the advocated approach. Here, missing values are replaced (filled in) by a reasonably estimated value of that variable, commonly a mean value. One may use unconditional or conditional mean imputation [3], [5], [6], [7], [9]. Unconditional imputation replaces the missing value by, for example, the overall variable mean or median from the observed data, or a random value drawn from the subjects with observed data on that variable. Conditional mean imputation replaces the missing value by the mean estimated from the specific subgroup to which the subject with the missing value belongs. Conditional mean imputation can be done once (single imputation) or more than once (multiple imputation [MI]). With MI, a random component is added to the imputed value, representing the uncertainty that arises because the imputed value was estimated rather than observed. Single imputation methods are considered to result in unbiased study results (i.e., associations between predictors and outcome) but in an overestimation of the precision (too small standard errors), whereas MI is assumed to yield unbiased results and appropriate standard errors. This notion, however, appears not to be fully recognized by researchers, because most epidemiological studies still perform complete case analysis. There are only a few studies, and certainly no (multivariable) diagnostic studies, in which the various methods to handle missing data have been applied to empirical data and the results compared [10], [11].
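The three deterministic approaches above (missing-indicator, unconditional mean, conditional mean) can be sketched in a few lines. The data and variable names below are hypothetical, and conditioning on the outcome group is only one possible choice of subgroup:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pe":      [0, 0, 0, 1, 1, 1],
    "d_dimer": [0.6, np.nan, 0.8, 2.4, np.nan, 2.0],
})

# Missing-indicator method: a dummy flags the missing value; the value
# itself is then typically set to a constant (e.g., zero) before modeling
df["d_dimer_missing"] = df["d_dimer"].isna().astype(int)

# Unconditional mean imputation: overall mean of the observed values
unconditional = df["d_dimer"].fillna(df["d_dimer"].mean())

# Conditional mean imputation: mean within a subgroup (here: by outcome)
conditional = df["d_dimer"].fillna(
    df.groupby("pe")["d_dimer"].transform("mean")
)

print(unconditional.round(2).tolist())
print(conditional.round(2).tolist())
```

MI extends the conditional approach by adding a random draw to each imputed value and repeating the imputation several times, so that the between-imputation variability propagates into the standard errors.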

Using empirical data from a study among patients suspected of pulmonary embolism (PE), we evaluated which diagnostic test results (predictors) contribute to predicting the presence or absence of PE, handling the missing values on the predictors in five different ways: complete case analysis, the missing-indicator method, unconditional and conditional single mean imputation, and MI. Our goal was not to provide a technical overview of different methods for dealing with missing data; for this we refer to the literature [3], [4], [5], [6], [7], [9], [10], [12], [13]. The goal was only to show the effects of the five "missing data methods" when applied to an empirical multivariable diagnostic study.

Design of the example study

For the present analyses we used data from a study on the diagnosis of PE whose methods and results have been described elsewhere [14], [15], [16]. In brief, the study included 398 consecutive patients of 18 years or older who were referred to a Dutch hospital because acute PE was clinically suspected. For all patients, medical history and physical examination were documented first. Additional tests included blood gas analysis, chest radiography, and compression ultrasound of the lower extremities.

Results

Of the 152 subjects with at least one missing value, 36% (n = 54) had PE, and of the 246 subjects without a missing value, 47% (n = 116) had PE. The difference in prevalence of PE between subjects with and without missing data was statistically significant (P = 0.02), indicating that the missing data were not MCAR. This was confirmed by comparing the observed values of the predictors for the subjects with at least one missing value to those for the subjects without any missing values (completely observed subjects).
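The reported comparison can be checked directly from the published counts. The sketch below reconstructs the 2×2 table of missingness status by PE status and applies a chi-square test (without continuity correction; whether the original authors used a correction is not stated):

```python
from scipy.stats import chi2_contingency

# 2x2 table reconstructed from the reported counts:
#   rows: subjects with / without at least one missing value
#   cols: PE present / PE absent
table = [[54, 152 - 54],
         [116, 246 - 116]]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(p, 3))  # close to the reported P = 0.02
```

The resulting P-value is consistent with the reported 0.02, supporting the conclusion that missingness is associated with the outcome and hence that the data are not MCAR.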

Discussion and conclusion

Missing data provide a challenge in design and analyses of (clinical) epidemiological studies. In multivariable diagnostic research the aim is often to determine the predictors that independently contribute to predicting the presence or absence of a particular disease in patients suspected of this disease. We illustrated the practical consequences of five well-known methods for handling missing data when using the popular stepwise (backwards) selection approach in multivariable prediction

Acknowledgment

We gratefully acknowledge the support by The Netherlands Organization for Scientific Research (ZON-MW 904-10-006 and 917-46-360).

References (35)

  • D.F. Ransohoff et al.

    Problems of spectrum and bias in evaluating the efficacy of diagnostic tests

    N Engl J Med

    (1978)
  • J.L. Schafer et al.

    Missing data: our view of the state of the art

    Psychol Methods

    (2002)
  • D.B. Rubin

    Inference and missing data

    Biometrika

    (1976)
  • S. Greenland et al.

    A critical look at methods for handling missing covariates in epidemiologic regression analyses

    Am J Epidemiol

    (1995)
  • W. Vach

    Some issues in estimating the effect of prognostic factors from incomplete covariate data

    Stat Med

    (1997)
  • R.J.A. Little

    Regression with missing X's: a review

    J Am Stat Assoc

    (1992)
  • O.S. Miettinen

    Theoretical epidemiology

    (1985)