Original article
Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis
Introduction
Predictive models are important tools for providing estimates of patient outcome [1]. A predictive model may well be constructed with regression analysis in a data set with information from a series of representative patients. The apparent performance of the model on this training set will be better than its performance in another data set, even if the latter test set consists of patients from the same population [1–6]. This ‘optimism’ is a well-known statistical phenomenon, and several approaches have been proposed to estimate the performance of the model in independent subjects more accurately than a naive evaluation on the training sample does [3,7–9].
A straightforward and fairly popular approach is to randomly split the training data into two parts: one to develop the model and one to measure its performance. With this split-sample approach, model performance is determined on similar, but independent, data [9].
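A minimal sketch of the split-sample procedure is given below. The data are synthetic stand-ins (not the GUSTO-I analysis), and the `fit_logistic` and `c_statistic` helpers are plain illustrative implementations, not the authors' code; the c-statistic (ROC area) is used here as one common performance measure.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, n_iter=15):
    """Logistic regression via Newton/IRLS; intercept added internally."""
    X1 = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X1 @ b, -30, 30)))
        # Newton step with a tiny ridge term for numerical stability
        H = X1.T @ (X1 * (p * (1 - p))[:, None]) + 1e-6 * np.eye(X1.shape[1])
        b += np.linalg.solve(H, X1.T @ (y - p))
    return b

def predict(b, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-np.clip(X1 @ b, -30, 30)))

def c_statistic(y, p):
    """Concordance (ROC area): P(predicted risk higher for an event); ties count 1/2."""
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Synthetic training set: 600 subjects, 8 predictors, roughly 15% events
n, k = 600, 8
X = rng.standard_normal((n, k))
true_beta = np.array([0.8, -0.5, 0.4, 0.3, -0.3, 0.2, 0.1, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.8 + X @ true_beta))))

# Split-sample validation: develop on a random half, measure on the other half
idx = rng.permutation(n)
dev, hold = idx[: n // 2], idx[n // 2:]
b = fit_logistic(X[dev], y[dev])
apparent = c_statistic(y[dev], predict(b, X[dev]))          # development half
split_estimate = c_statistic(y[hold], predict(b, X[hold]))  # held-out half
```

The held-out estimate is honest for the model fitted on half the data, but note that only half the available subjects are used for model development, and a single random split makes the estimate variable.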
A more sophisticated approach is cross-validation, which can be seen as an extension of the split-sample method. With split-half cross-validation, the model is developed on one randomly drawn half and tested on the other, and vice versa; the average is taken as the estimate of performance. Other fractions of subjects may be left out (e.g., 10% to test a model developed on the remaining 90%); with 10% left out, the procedure is repeated 10 times, such that every subject serves once to test the model. To improve the stability of the cross-validation, the whole procedure can be repeated several times with new random subsamples. The most extreme cross-validation procedure is to leave out one subject at a time, which is equivalent to the jack-knife technique [7].
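The 10%-left-out (10-fold) variant can be sketched as follows, again on synthetic data with illustrative helpers rather than the paper's actual analysis:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_logistic(X, y, n_iter=15):
    """Logistic regression via Newton/IRLS; intercept added internally."""
    X1 = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X1 @ b, -30, 30)))
        H = X1.T @ (X1 * (p * (1 - p))[:, None]) + 1e-6 * np.eye(X1.shape[1])
        b += np.linalg.solve(H, X1.T @ (y - p))
    return b

def predict(b, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-np.clip(X1 @ b, -30, 30)))

def c_statistic(y, p):
    """Concordance (ROC area); ties count 1/2."""
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Synthetic training set (stand-in for real patient data)
n, k = 600, 8
X = rng.standard_normal((n, k))
beta = np.array([0.8, -0.5, 0.4, 0.3, -0.3, 0.2, 0.1, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.8 + X @ beta))))

# 10-fold cross-validation: each subject serves exactly once to test the model
folds = np.array_split(rng.permutation(n), 10)
scores = []
for f in folds:
    train = np.setdiff1d(np.arange(n), f)
    b = fit_logistic(X[train], y[train])
    if y[f].min() != y[f].max():  # a fold needs both events and non-events for the c-statistic
        scores.append(c_statistic(y[f], predict(b, X[f])))
cv_estimate = float(np.mean(scores))

# For more stability, repeat the whole loop with new random fold assignments
# and average the resulting estimates.
```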
The most efficient validation has been claimed to be achieved by computer-intensive resampling techniques such as the bootstrap [8]. Bootstrapping replicates the process of sample generation from an underlying population by drawing samples with replacement from the original data set, each of the same size as the original [7]. Models may be developed in bootstrap samples and tested in the original sample or in those subjects not included in the bootstrap sample [3,8].
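One common variant of this idea, develop the model in each bootstrap sample, test it on the original sample, and subtract the average optimism from the apparent performance, can be sketched as follows (synthetic data and illustrative helpers, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logistic(X, y, n_iter=15):
    """Logistic regression via Newton/IRLS; intercept added internally."""
    X1 = np.column_stack([np.ones(len(X)), X])
    b = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X1 @ b, -30, 30)))
        H = X1.T @ (X1 * (p * (1 - p))[:, None]) + 1e-6 * np.eye(X1.shape[1])
        b += np.linalg.solve(H, X1.T @ (y - p))
    return b

def predict(b, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-np.clip(X1 @ b, -30, 30)))

def c_statistic(y, p):
    """Concordance (ROC area); ties count 1/2."""
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Synthetic training set
n, k = 600, 8
X = rng.standard_normal((n, k))
beta = np.array([0.8, -0.5, 0.4, 0.3, -0.3, 0.2, 0.1, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.8 + X @ beta))))

b_full = fit_logistic(X, y)
apparent = c_statistic(y, predict(b_full, X))

# Bootstrap: develop in each sample drawn with replacement (same size as the
# original), then test the bootstrap model on the original sample; the average
# difference estimates the optimism of the apparent performance.
optimism = []
for _ in range(100):
    bs = rng.integers(0, n, size=n)
    b = fit_logistic(X[bs], y[bs])
    boot_apparent = c_statistic(y[bs], predict(b, X[bs]))
    boot_test = c_statistic(y, predict(b, X))
    optimism.append(boot_apparent - boot_test)
corrected = apparent - float(np.mean(optimism))
```

Unlike the split-sample approach, the final model here is fitted on all available subjects, and only its apparent performance is corrected.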
In this study we compare the efficiency of internal validation procedures for predictive logistic regression models. Internal validation refers to performance in patients from the same underlying population as the development sample, in contrast to external validation, where various differences may exist between the populations used to develop and test the model [10]. We vary the sample size from small to large. As an indicator of sample size we use the number of events per variable (EPV); low EPV values indicate that many parameters are estimated relative to the information in the data [11,12]. We study a number of measures of predictive performance, and we will show that bootstrapping is generally superior to the other approaches for estimating internal validity.
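The EPV bookkeeping is simple arithmetic. The sketch below uses the GUSTO-I counts quoted in the Patients section (2851 deaths among 40,830 patients) and the eight-predictor model to derive the sample size implied by each EPV target; with integer (floor) division this reproduces the sample sizes reported in the Results.

```python
deaths, patients = 2851, 40830   # GUSTO-I: 30-day mortality of 7.0%
k = 8                            # candidate predictors in the model

sizes = {}
for epv in [5, 10, 20, 40, 80]:
    events_needed = epv * k                        # EPV = events / predictors
    n_needed = events_needed * patients // deaths  # subjects at the observed event rate
    sizes[epv] = (events_needed, n_needed)

# EPV 5 with 8 predictors requires 40 events, i.e. 572 subjects at a 7.0% rate
```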
Patients
We analyzed 30-day mortality in a large data set of patients with acute myocardial infarction (GUSTO-I) [13,14]. This data set has been used before to study methodological aspects of regression modeling [15–18]. In brief, this data set consists of 40,830 patients, of whom 2851 (7.0%) had died at 30 days.
Simulation study
Random samples were drawn from the GUSTO-I data set, with sample size varied according to the number of events per variable (EPV). We studied the validity of EPV as an indicator of
Optimism in apparent performance
In Fig. 1 we show the apparent and test performance of the logistic regression model with eight predictors in relation to sample size, as indicated by the number of events per variable (EPV). The apparent performance was determined on random samples from the GUSTO-I data set, with sample sizes (number of deaths) of n = 572 (40), n = 1145 (80), n = 2291 (160), n = 4582 (320), n = 9165 (640) for EPV 5, 10, 20, 40, and 80, respectively. For all performance measures, we note optimism in the
Discussion
Accurate estimation of the internal validity of a predictive regression model is especially problematic when the sample size is small: the apparent performance estimated in the sample then substantially overestimates the true performance in similar subjects. In our study, split-sample approaches underestimated performance and showed high variability. In contrast, bootstrap resampling yielded stable and nearly unbiased estimates of performance.
Methods to assess internal validity
Acknowledgements
We would like to thank Kerry L. Lee, Duke Clinical Research Institute, Duke University Medical Center, Durham NC, and the GUSTO investigators for making the GUSTO-I data available for analysis. The research of Dr Steyerberg has been made possible by a fellowship of the Royal Netherlands Academy of Arts and Sciences.
References (33)
- et al. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol (1996)
- et al. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol (1999)
- et al. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med (1996)
- Regression, prediction and shrinkage. J R Stat Soc B (1983)
- Estimating the error rate of a prediction rule: some improvements on cross-validation. JASA (1983)
- Probabilistic prediction in patient management and clinical trials. Stat Med (1986)
- et al. Predictive value of statistical models. Stat Med (1990)
- Model uncertainty, data mining and statistical inference. J R Stat Soc A (1995)
- et al. An introduction to the bootstrap. Monographs on statistics and applied probability (1993)
- et al. Improvements on cross-validation: the .632+ bootstrap method. JASA (1997)
- Data splitting. Am Statistician
- Assessing the generalizability of prognostic information. Ann Intern Med
- Regression modelling strategies for improved prognostic prediction. Stat Med
- An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction. N Engl J Med
- Predictors of 30-day mortality in the era of reperfusion for acute myocardial infarction. Results from an international trial of 41,021 patients. Circulation
- A comparison of statistical learning methods on the Gusto database. Stat Med