Test Data Reuse for the Evaluation of Continuously Evolving Classification Algorithms Using the Area under the Receiver Operating Characteristic Curve

Abstract

Performance evaluation of continuously evolving machine learning algorithms presents new challenges, especially in high-risk application domains such as medicine. In principle, to obtain performance measures that generalize to a target population, a new independent test dataset randomly drawn from the target population should be used each time a new performance evaluation is required. However, test datasets of sufficient quality are often hard to acquire, and it is tempting to reuse a previous test dataset for a new performance evaluation. With extensive experiments on simulated and real data we illustrate how such a “naive” approach to test data reuse can inadvertently overfit the algorithm to the test data, leading to a loss of generalization and overly optimistic conclusions about the algorithm performance. We investigate the use of a modified version of the reusable holdout mechanism of Dwork et al. [Science, 349 (2015), pp. 636--638], which allows for repeated reuse of the same test dataset. We extend their approach to the use of AUC, the area under the receiver operating characteristic curve, as the reported performance metric. Theoretical guarantees for our method are proven to hold in extremely data-rich scenarios; however, our empirical results indicate promising performance of the proposed technique even on small datasets. With extensive simulation studies and experiments on real medical imaging data we show that our procedure substantially reduces the problem of overfitting to the test data, even when the test dataset is small, at the cost of a mild increase in the uncertainty of the reported test performance.
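
To make the general idea concrete, the sketch below answers a single holdout query in a Thresholdout-style fashion with AUC as the reported metric, in the spirit of the reusable holdout of Dwork et al.: the training-set AUC is released whenever it stays within a noisy threshold of the holdout AUC, and a noise-perturbed holdout AUC is released otherwise. This is a minimal illustration under stated assumptions, not the exact procedure analyzed in the paper; the function name thresholdout_auc, the default noise scales, and the use of scikit-learn's roc_auc_score are choices made for the example.

import numpy as np
from sklearn.metrics import roc_auc_score


def thresholdout_auc(y_train, s_train, y_holdout, s_holdout,
                     threshold=0.02, sigma=0.01, rng=None):
    """Answer one AUC query in a Thresholdout-like manner (illustrative sketch).

    y_*: binary class labels; s_*: classifier scores for the training and
    holdout sets. The holdout AUC is only revealed (with added Laplace noise)
    when the training AUC drifts too far from it, which limits how much
    information about the holdout set leaks per query.
    """
    rng = np.random.default_rng() if rng is None else rng
    auc_train = roc_auc_score(y_train, s_train)        # empirical AUC on training data
    auc_holdout = roc_auc_score(y_holdout, s_holdout)  # empirical AUC on holdout data

    # Noisy comparison: small discrepancies are answered with the training AUC,
    # so the holdout set is consulted as little as possible.
    if abs(auc_train - auc_holdout) <= threshold + rng.laplace(0.0, 2.0 * sigma):
        return auc_train
    # Large discrepancy: release a noise-perturbed holdout AUC instead.
    return float(np.clip(auc_holdout + rng.laplace(0.0, sigma), 0.0, 1.0))

In a typical use, an algorithm developer would call such a query after each model update and select models based only on the returned values, never on the raw holdout AUC.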

Keywords

  1. adaptive data analysis
  2. data reuse
  3. reusable holdout
  4. receiver operating characteristic curve
  5. area under the ROC curve
  6. AUC

MSC codes

  1. 68T05
  2. 62-07
  3. 62P10


Supplementary Material


PLEASE NOTE: These supplementary files have not been peer-reviewed.


Index of Supplementary Materials

Title of paper: Test Data Reuse for the Evaluation of Continuously Evolving Classification Algorithms Using the Area Under the Receiver Operating Characteristic Curve

Authors: Alexej Gossmann, Aria Pezeshk, Yu-Ping Wang, Berkman Sahiner

File: supplement.pdf

Type: PDF

Contents:
  - Sections SM1 and SM2 contain highly detailed step-by-step instructions for the data generation procedures and the simulation experiments (the same information is provided in a much shorter form in Section 5 of the main text).
  - Figure SM1 visualizes the simulation data distribution.
  - Section SM3 and Table SM1 show various estimation error and variability metrics observed in the simulation studies and provide a brief discussion of these results (a subset of these results appears in Section 6 of the main text).

References

1.
R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman, Algorithmic stability for adaptive data analysis, in Proceedings of the 48th Annual ACM Symposium on Theory of Computing, ACM, 2016, pp. 1046--1059.
2.
A. Blum and M. Hardt, The ladder: A reliable leaderboard for machine learning competitions, Proc. Mach. Learn. Res., 37 (2015), pp. 1006--1014.
3.
A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30 (1997), pp. 1145--1159.
4.
L. Breiman, Random forests, Machine Learning, 45 (2001), pp. 5--32.
5.
G. C. Cawley and N. L. C. Talbot, On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res., 11 (2010), pp. 2079--2107.
6.
H. P. Chan, B. Sahiner, R. F. Wagner, and N. Petrick, Classifier design for computer-aided diagnosis: Effects of finite sample size on the mean performance of classical and neural network classifiers, Medical Phys., 26 (1999), pp. 2654--2668.
7.
C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. Intell. Syst. Technol., 2 (2011), pp. 1--27.
8.
T. Chen and C. Guestrin, XGBoost: A scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785--794.
9.
S. Chilamkurthy, R. Ghosh, S. Tanamala, M. Biviji, N. G. Campeau, V. K. Venugopal, V. Mahajan, P. Rao, and P. Warier, Deep learning algorithms for detection of critical findings in head CT scans: A retrospective study, The Lancet, 392 (2018), pp. 2388--2396.
10.
F. Ciompi, B. de Hoop, S. J. van Riel, K. Chung, E. T. Scholten, M. Oudkerk, P. A. de Jong, M. Prokop, and B. van Ginneken, Automatic classification of pulmonary peri-fissural nodules in computed tomography using an ensemble of 2D views and a convolutional neural network out-of-the-box, Medical Image Anal., 26 (2015), pp. 195--202.
11.
R. Cummings, K. Ligett, K. Nissim, A. Roth, and Z. S. Wu, Adaptive learning with robust generalization guarantees, Proc. Mach. Learn. Res., 49 (2016), pp. 772--814.
12.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248--255.
13.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, Generalization in Adaptive Data Analysis and Holdout Reuse, preprint, https://arxiv.org/abs/1506.02629, 2015.
14.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, Generalization in adaptive data analysis and holdout reuse, in Advances in Neural Information Processing Systems, MIT Press, 2015, pp. 2350--2358.
15.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, The reusable holdout: Preserving validity in adaptive data analysis, Science, 349 (2015), pp. 636--638.
16.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, Guilt-free data reuse, Commun. ACM, 60 (2017), pp. 86--93.
17.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth, Preserving statistical validity in adaptive data analysis, in Proceedings of the 47th Annual ACM Symposium on Theory of Computing, ACM, 2015, pp. 117--126.
18.
C. Dwork, F. McSherry, K. Nissim, and A. Smith, Calibrating noise to sensitivity in private data analysis, in Theory of Cryptography Conference, Springer, 2006, pp. 265--284.
19.
C. Dwork and A. Roth, The algorithmic foundations of differential privacy, Found. Trends Theoret. Comput. Sci., 9 (2014), pp. 211--407.
20.
V. Feldman, R. Frostig, and M. Hardt, The advantages of multiple classes for reducing overfitting from test set reuse, Proc. Mach. Learn. Res., 97 (2019), pp. 1892--1900.
21.
V. Feldman and T. Steinke, Generalization for adaptively-chosen estimators via stable median, Proc. Mach. Learn. Res., 65 (2017), pp. 728--757.
22.
V. Feldman and T. Steinke, Calibrating noise to variance in adaptive data analysis, Proc. Mach. Learn. Res., 75 (2018), pp. 535--544.
23.
A. E. Flanders, L. M. Prevedello, G. Shih, S. S. Halabi, J. Kalpathy-Cramer, R. Ball, J. T. Mongan, A. Stein, F. C. Kitamura, M. P. Lungren et al., Construction of a machine learning dataset through collaboration: The RSNA 2019 Brain CT Hemorrhage Challenge, Radiology Artif. Intell., 2 (2020), e190211.
24.
Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, 1996, pp. 148--156.
25.
B. D. Gallas, H.-P. Chan, C. J. D'Orsi, L. E. Dodd, M. L. Giger, D. Gur, E. A. Krupinski, C. E. Metz, K. J. Myers, N. A. Obuchowski, B. Sahiner, A. Y. Toledano, and M. L. Zuley, Evaluating imaging and computer-aided detection and diagnosis devices at the FDA, Academic Radiology, 19 (2012), pp. 463--477.
26.
A. Gelman and E. Loken, The statistical crisis in science. Data-dependent analysis---a “garden of forking paths”---explains why many statistically significant comparisons don't hold up, Amer. Scientist, 102 (2014), p. 460.
27.
J. A. Hanley and B. J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, 143 (1982), pp. 29--36.
28.
T. J. Hastie, R. J. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Series in Statistics, Springer, 2017.
29.
K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770--778.
30.
N. Kriegeskorte, W. K. Simmons, P. S. F. Bellgowan, and C. I. Baker, Circular analysis in systems neuroscience: The dangers of double dipping, Nature Neurosci., 12 (2009), pp. 535--540.
31.
M. Kuhn and K. Johnson, Applied Predictive Modeling, Springer, 2013.
32.
U. K. Lopes and J. F. Valiati, Pre-trained convolutional neural networks as feature extractors for tuberculosis detection, Comput. Biol. Medicine, 89 (2017), pp. 135--143.
33.
H. Mania, J. Miller, L. Schmidt, M. Hardt, and B. Recht, Model similarity mitigates test set overuse, in Advances in Neural Information Processing Systems, Curran Associates, 2019, pp. 9993--10002.
34.
R. B. Rao, G. Fung, and R. Rosales, On the dangers of cross-validation. An experimental evaluation, in Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, 2008, pp. 588--596.
35.
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar, Do ImageNet classifiers generalize to ImageNet?, Proc. Mach. Learn. Res., 97 (2019), pp. 5389--5400.
36.
J. Reunanen, Overfitting in making comparisons between variable selection methods, J. Mach. Learn. Res., 3 (2003), pp. 1371--1382.
37.
R. Roelofs, V. Shankar, B. Recht, S. Fridovich-Keil, M. Hardt, J. Miller, and L. Schmidt, A meta-analysis of overfitting in machine learning, in Advances in Neural Information Processing Systems, Curran Associates, 2019, pp. 9175--9185.
38.
R. Rogers, A. Roth, A. Smith, N. Srebro, O. Thakkar, and B. Woodworth, Guaranteed validity for empirical approaches to adaptive data analysis, Proc. Mach. Learn. Res., 108 (2020), pp. 2830--2840.
39.
R. Rogers, A. Roth, A. Smith, and O. Thakkar, Max-information, differential privacy, and post-selection hypothesis testing, in 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 2016, pp. 487--494.
40.
D. Russo and J. Zou, Controlling bias in adaptive data analysis using information theory, Proc. Mach. Learn. Res., 51 (2016), pp. 1232--1240.
41.
S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, Learnability, stability and uniform convergence, J. Mach. Learn. Res., 11 (2010), pp. 2635--2670.
42.
M. Shenfeld and K. Ligett, A necessary and sufficient stability notion for adaptive generalization, in Advances in Neural Information Processing Systems, Curran Associates, 2019, pp. 11481--11490.
43.
M. Skocik, J. Collins, C. Callahan-Flintoft, H. Bowman, and B. Wyble, I Tried a Bunch of Things: The Dangers of Unexpected Overfitting in Classification, preprint, https://www.biorxiv.org/content/early/2016/10/03/078816, 2016.
44.
H. Zou and T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol., 67 (2005), pp. 301--320.
45.
T. Zrnic and M. Hardt, Natural analysts in adaptive data analysis, Proc. Mach. Learn. Res., 97 (2019), pp. 7703--7711.

Published In

SIAM Journal on Mathematics of Data Science
Pages: 692 - 714
ISSN (online): 2577-0187

History

Submitted: 21 April 2020
Accepted: 1 March 2021
Published online: 3 June 2021

Funding Information

National Institutes of Health https://doi.org/10.13039/100000002 : R01GM109068, R01MH104680, R01MH107354, R01AR059781

U.S. Food and Drug Administration https://doi.org/10.13039/100000038

National Science Foundation https://doi.org/10.13039/100000001 : 1539067

U.S. Department of Energy https://doi.org/10.13039/100000015
