# A case-based interpretable deep learning model for classification of mass lesions in digital mammography

## Abstract

Interpretability in machine learning models is important in high-stakes decisions such as whether to order a biopsy based on a mammographic exam. Mammography poses important challenges that are not present in other computer vision tasks: datasets are small, confounding information is present and it can be difficult even for a radiologist to decide between watchful waiting and biopsy based on a mammogram alone. In this work we present a framework for interpretable machine learning-based mammography. In addition to predicting whether a lesion is malignant or benign, our work aims to follow the reasoning processes of radiologists in detecting clinically relevant semantic features of each image, such as the characteristics of the mass margins. The framework includes a novel interpretable neural network algorithm that uses case-based reasoning for mammography. Our algorithm can incorporate a combination of data with whole image labelling and data with pixel-wise annotations, leading to better accuracy and interpretability even with a small number of images. Our interpretable models are able to highlight the classification-relevant parts of the image, whereas other methods highlight healthy tissue and confounding information. Our models are decision aids—rather than decision makers—and aim for better overall human–machine collaboration. We do not observe a loss in mass margin classification accuracy over a black box neural network trained on the same data.

## Data availability

The imaging data are not publicly available because they contain confidential information that may compromise patient privacy as well as the ethical or regulatory policies of our institution. Data will be made available on reasonable request, for non-commercial research purposes, to those who contact J.L. (joseph.lo@duke.edu). Data usage agreements may be required. Source Data are provided with this paper.

## Code availability

Code is available on GitHub at https://github.com/alinajadebarnett/iaiabl. Two licenses are offered: an MIT license for non-commercial use and a custom license. The DOI for the initial code release is https://doi.org/10.5281/zenodo.5565592.

## References

1. Kochanek, K. D., Xu, J. & Arias, E. Mortality in the United States, 2019 Technical Report 395 (NCHS, 2020); https://www.cdc.gov/nchs/products/databriefs/db395.htm

2. Badgeley, M. A. et al. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit. Med. 2, 1–10 (2019).

3. Winkler, J. K. et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 155, 1135–1141 (2019).

4. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002683 (2018).

5. Edwards, B. FDA Guidance on clinical decision support: peering inside the black box of algorithmic intelligence. ChilmarkResearch https://www.chilmarkresearch.com/fda-guidance-clinical-decision-support/ (2017).

6. Soffer, S. et al. Convolutional neural networks for radiologic images: a radiologist’s guide. Radiology 290, 590–606 (2019).

7. Sickles, E. et al. in ACR BI-RADS Atlas, Breast Imaging Reporting and Data System 5th edn (American College of Radiology, 2013).

8. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).

9. Chen, C. et al. This looks like that: deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems 32 8930–8941 (NeurIPS, 2019).

10. Lehman, C. D. et al. Diagnostic accuracy of digital screening mammography with and without computer-aided detection. JAMA Internal Med. 175, 1828–1837 (2015).

11. Salim, M. et al. External evaluation of 3 commercial artificial intelligence algorithms for independent assessment of screening mammograms. JAMA Oncol. 6, 1581–1588 (2020).

12. Schaffter, T. et al. Evaluation of combined artificial intelligence and radiologist assessment to interpret screening mammograms. JAMA Network Open 3, e200265 (2020).

13. Wu, N. et al. Deep neural networks improve radiologists’ performance in breast cancer screening. IEEE Trans. Med. Imaging 39, 1184–1194 (2019).

14. Kim, H.-E. et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. The Lancet Digital Health 2, e138–e148 (2020).

15. Giger, M. L., Chan, H.-P. & Boone, J. Anniversary paper: history and status of CAD and quantitative image analysis: the role of medical physics and AAPM. Med. Phys. 35, 5799–5820 (2008).

16. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019).

17. Adebayo, J. et al. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 9505–9515 (NeurIPS, 2018).

18. Arun, N. et al. Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiology: Artificial Intelligence 3 (2021).

19. Wu, T. & Song, X. Towards interpretable object detection by unfolding latent structures. In Proc. IEEE International Conference on Computer Vision 6033–6043 (IEEE, 2019).

20. Chen, Z., Bei, Y. & Rudin, C. Concept whitening for interpretable image recognition. Nat. Mach. Intell. 2, 772–782 (2020).

21. Demigha, S. & Prat, N. A case-based training system in radiology-senology. In Proc. 2004 International Conference on Information and Communication Technologies: From Theory to Applications, 2004 41–42 (IEEE, 2004).

22. Macura, R. T. & Macura, K. J. Macrad: Radiology image resource with a case-based retrieval system. In International Conference on Case-Based Reasoning 43–54 (Springer, 1995).

23. Floyd Jr, C. E., Lo, J. Y. & Tourassi, G. D. Case-based reasoning computer algorithm that uses mammographic findings for breast biopsy decisions. Am. J. Roentgenol. 175, 1347–1352 (2000).

24. Kobashi, S., Kondo, K. & Hata, Y. Computer-aided diagnosis of intracranial aneurysms in MRA images with case-based reasoning. IEICE Trans. Inform. Syst. 89, 340–350 (2006).

25. Wang, H., Wu, Z. & Xing, E. P. Removing confounding factors associated weights in deep neural networks improves the prediction accuracy for healthcare applications. Pac. Symp. Biocomput. 24, 54–65 (2019).

26. Hu, S., Ma, Y., Liu, X., Wei, Y. & Bai, S. Stratified rule-aware network for abstract visual reasoning. In AAAI Conference on Artificial Intelligence (AAAI) (2021).

27. Dundar, A. & Garcia-Dorado, I. Context augmentation for convolutional neural networks. Preprint at https://arxiv.org/abs/1712.01653 (2017).

28. Xiao, K., Engstrom, L., Ilyas, A. & Madry, A. Noise or signal: The role of image backgrounds in object recognition. In International Conference on Learning Representations (2020).

29. Luo, J., Tang, J., Tjahjadi, T. & Xiao, X. Robust arbitrary view gait recognition based on parametric 3D human body reconstruction and virtual posture synthesis. Pattern Recognition 60, 361–377 (2016).

30. Charalambous, C. & Bharath, A. A data augmentation methodology for training machine/deep learning gait recognition algorithms. In Proc. British Machine Vision Conference (BMVC) (eds Richard, C. et al.) 110.1–110.12 (BMVA, 2016).

31. Tang, R., Du, M., Li, Y., Liu, Z. & Hu, X. Mitigating gender bias in captioning systems. In Proc. Web Conference 2021, 633–645 (2021).

32. Zhao, Q., Adeli, E. & Pohl, K. M. Training confounder-free deep learning models for medical applications. Nat. Commun. 11, 1–9 (2020).

33. Schramowski, P. et al. Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nat. Mach. Intell. 2, 476–486 (2020).

34. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2921–2929 (IEEE, 2016).

35. Zheng, H., Fu, J., Mei, T. & Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proc. IEEE International Conference on Computer Vision (ICCV), 5209–5217 (IEEE, 2017).

36. Fu, J., Zheng, H. & Mei, T. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4438–4446 (IEEE, 2017).

37. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988).

38. Sun, X. & Xu, W. Fast implementation of DeLong’s algorithm for comparing the areas under correlated receiver operating characteristic curves. IEEE Signal Process. Lett. 21, 1389–1393 (2014).

39. Park, C. S. et al. Observer agreement using the ACR breast imaging reporting and data system (BI-RADS)-ultrasound, (2003). Korean J. Radiol. 8, 397 (2007).

40. Abdullah, N., Mesurolle, B., El-Khoury, M. & Kao, E. Breast imaging reporting and data system lexicon for US: interobserver agreement for assessment of breast masses. Radiology 252, 665–672 (2009).

41. Baker, J. A., Kornguth, P. J. & Floyd Jr, C. Breast imaging reporting and data system standardized mammography lexicon: Observer variability in lesion description. AJR Am. J. Roentgenol. 166, 773–778 (1996).

42. Rawashdeh, M., Lewis, S., Zaitoun, M. & Brennan, P. Breast lesion shape and margin evaluation: BI-RADS based metrics understate radiologists’ actual levels of agreement. Comput. Biol. Med. 96, 294–298 (2018).

43. Lazarus, E., Mainiero, M. B., Schepps, B., Koelliker, S. L. & Livingston, L. S. BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology 239, 385–391 (2006).

44. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV) (IEEE, 2017).

45. Chattopadhay, A., Sarkar, A., Howlader, P. & Balasubramanian, V. N. Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) 839–847 (IEEE, 2018).

46. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. 3rd International Conference on Learning Representations (ICLR) (2015).

47. Landis, J. R. & Koch, G. G. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33, 363–374 (1977).

48. Kim, S. T., Lee, H., Kim, H. G. & Ro, Y. M. ICADx: interpretable computer aided diagnosis of breast masses. In Medical Imaging 2018: Computer-Aided Diagnosis Vol. 10575, 1057522 (International Society for Optics and Photonics, 2018).

49. Elter, M., Schulz-Wendtland, R. & Wittenberg, T. The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process. Med. Phys. 34, 4164–4172 (2007).

50. Benndorf, M., Burnside, E. S., Herda, C., Langer, M. & Kotter, E. External validation of a publicly available computer assisted diagnostic tool for mammographic mass lesions with two high prevalence research datasets. Med. Phys. 42, 4987–4996 (2015).

51. Burnside, E. S. et al. Probabilistic computer model developed from clinical data in national mammography database format to classify mammographic findings. Radiology 251, 663–672 (2009).

52. Park, H. J. et al. A computer-aided diagnosis system using artificial intelligence for the diagnosis and characterization of breast masses on ultrasound: added value for the inexperienced breast radiologist. Medicine 98, e14146 (2019).

53. Shimauchi, A. et al. Evaluation of clinical breast MR imaging performed with prototype computer-aided diagnosis breast MR imaging workstation: reader study. Radiology 258, 696–704 (2011).

54. Orel, S. G., Kay, N., Reynolds, C. & Sullivan, D. C. BI-RADS categorization as a predictor of malignancy. Radiology 211, 845–850 (1999).

55. Kalchbrenner, N., Grefenstette, E. & Blunsom, P. A convolutional neural network for modelling sentences. In Proc. 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 655–665 (2014).

56. Wu, J. et al. DeepMiner: discovering interpretable representations for mammogram classification and explanation. Harvard Data Science Review 3 (2021).

## Acknowledgements

We would like to acknowledge breast radiologists M. Taylor-Cho, L. Grimm, C. Kim and S. Yoon, who annotated the dataset used in this paper. This study was supported in part by NIH/NCI U01-CA214183 and U2C-CA233254 (J.L.). This study was supported in part by MIT Lincoln Laboratory (C.R.), Duke TRIPODS CCF-1934964 (C.R.) and the Duke Incubation Fund (A.J.B.).

## Author information

### Contributions

A.J.B., F.S., D.T., C.C., J.L. and C.R. conceived the idea and developed the model. D.T., A.J.B. and C.C. wrote and reviewed the code. Y.R., A.J.B., F.S. and J.L. performed data collection, and Y.R., D.T. and A.J.B. preprocessed it.

### Corresponding author

Correspondence to Alina Jade Barnett.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information: Nature Machine Intelligence thanks Fredrik Strand and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 An automatically generated explanation of mass margin classification for a circumscribed lesion.

This circumscribed lesion is correctly identified as circumscribed. The first two most activated prototypes are drawn from the same image, but are associated with different regions of that image.

### Extended Data Fig. 2 An automatically generated explanation of mass margin classification for an indistinct lesion.

This indistinct lesion is correctly identified as indistinct. The indistinct portion of the lesion margin (right side) activates the indistinct prototype and the circumscribed portion of the lesion margin (left side) activates the circumscribed prototypes.

### Extended Data Fig. 3 An automatically generated explanation of mass margin classification for a spiculated lesion.

This spiculated lesion is correctly identified as spiculated.

### Extended Data Fig. 4 An automatically generated explanation of mass margin classification for an incorrectly classified lesion.

This spiculated lesion is incorrectly identified as circumscribed. The explanation highlights only the circumscribed portion of the mass margin (top), but does not detect the spiculated portion (bottom).

### Extended Data Fig. 5 A comparison of explanations.

We compare explanations from two common saliency methods (GradCAM [44] and GradCAM++ [45]) to a class activation visualization derived from our method. The explanations from IAIA-BL are more likely to highlight the lesion and less likely to highlight the surrounding healthy tissue. This is shown quantitatively by the activation precision metric. The single image visualization is a dramatic simplification of the full explanation that is generated by IAIA-BL. The IAIA-BL and ProtoPNet class activation visualizations shown in this figure are generated by taking the average of prototype activation maps for all prototypes of the correct class.
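The averaging step described in this caption can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, array layouts and the `top_fraction` parameter are assumptions. In particular, the caption names the activation precision metric without defining it, so the version below (the share of the most strongly activated pixels that fall inside an annotated lesion mask) is only one plausible reading.

```python
import numpy as np


def class_activation_map(proto_maps, proto_classes, target_class):
    """Average the activation maps of every prototype of one class.

    proto_maps:    (P, H, W) array, one activation map per prototype
    proto_classes: (P,) integer class label of each prototype
    target_class:  class whose prototypes are averaged
    """
    maps = np.asarray(proto_maps, dtype=float)
    selected = np.asarray(proto_classes) == target_class
    if not selected.any():
        raise ValueError("no prototypes for the requested class")
    return maps[selected].mean(axis=0)


def activation_precision(act_map, lesion_mask, top_fraction=0.05):
    """Hypothetical metric: fraction of top-activated pixels inside the lesion.

    The definition and the default top_fraction are assumptions made for
    illustration; they are not taken from the text above.
    """
    act = np.asarray(act_map, dtype=float).ravel()
    mask = np.asarray(lesion_mask, dtype=bool).ravel()
    k = max(1, int(round(top_fraction * act.size)))
    top_idx = np.argsort(act)[-k:]  # indices of the k largest activations
    return float(mask[top_idx].mean())
```

A value near 1 under this reading would mean that the strongest activations sit on the annotated lesion rather than on surrounding tissue.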

### Extended Data Fig. 6 The architecture of the IAIA-BL prototype network.

Test image x feeds into convolutional layers f. Each patch of f(x)l is compared to each learned prototype $$p_i$$ by calculating the squared distance between the patch and the prototype. The similarity map shows the closest (most ‘activated’, that is, smallest L2 distance) patches in red and the furthest patches in blue, overlaid on the test image. Similarity score $$s_i$$ is calculated from the corresponding similarity map. The similarity scores s feed into fully connected layer $$h_1$$, outputting margin logits $$\hat{\mathbf{y}}^{\text{margin}}$$. Margin logits $$\hat{\mathbf{y}}^{\text{margin}}$$ feed into fully connected layer $$h_2$$, outputting malignancy logit $$y^{\text{mal}}$$.
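The data flow in this caption can be illustrated with a toy forward pass. The sketch below is a minimal NumPy version under stated assumptions, not the released IAIA-BL code: the convolutional layers are replaced by a precomputed feature array, the distance-to-similarity conversion uses a ProtoPNet-style log activation, the similarity score is taken as the maximum of each similarity map, and both fully connected layers are bias-free. All function and parameter names are hypothetical.

```python
import numpy as np


def prototype_forward(features, prototypes, w_margin, w_mal, eps=1e-4):
    """Toy forward pass of a prototype network.

    features:   (H, W, D) array standing in for the convolutional output
    prototypes: (P, D) array of learned prototypes
    w_margin:   (P, C) weights of the first fully connected layer
    w_mal:      (C,) weights of the second fully connected layer
    """
    # Squared L2 distance between every patch and every prototype: (H, W, P)
    diff = features[:, :, None, :] - prototypes[None, None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Assumption: log activation, so a small distance gives a large similarity
    sim_maps = np.log((sq_dist + 1.0) / (sq_dist + eps))

    # Assumption: one score per prototype via max pooling over its map
    scores = sim_maps.max(axis=(0, 1))      # (P,)

    margin_logits = scores @ w_margin       # (C,) margin logits
    mal_logit = float(margin_logits @ w_mal)  # scalar malignancy logit
    return sim_maps, scores, margin_logits, mal_logit
```

Because the malignancy logit is computed only from the margin logits, every malignancy prediction in this sketch can be traced back through the margin classes to specific prototype similarities, which is the property the caption describes.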

## Supplementary information

### Supplementary Information

Supplementary Sections 1–10, Tables 1 and 2, and Figs. 1–6.

## Source data

### Source Data Fig. 2

Labels and model predictions used to generate the ROC curves for Fig. 2.

## Rights and permissions

Barnett, A.J., Schwartz, F.R., Tao, C. et al. A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nat Mach Intell 3, 1061–1070 (2021). https://doi.org/10.1038/s42256-021-00423-x

• DOI: https://doi.org/10.1038/s42256-021-00423-x