Object recognition and viewpoint estimation lie at the heart of visual understanding. Recent studies have suggested that convolutional neural networks (CNNs) fail to generalize to out-of-distribution (OOD) category–viewpoint combinations, that is, combinations not seen during training. Here we investigate when and how such OOD generalization may be possible by evaluating CNNs trained to classify both object category and three-dimensional viewpoint on OOD combinations, and by identifying the neural mechanisms that facilitate such OOD generalization. We show that increasing the number of in-distribution combinations (data diversity) substantially improves generalization to OOD combinations, even with the same amount of training data. We compare learning category and viewpoint in separate and shared network architectures, and observe starkly different trends on in-distribution and OOD combinations: shared networks help on in-distribution combinations, whereas separate networks significantly outperform shared ones on OOD combinations. Finally, we demonstrate that such OOD generalization is facilitated by the neural mechanism of specialization, that is, the emergence of two types of neuron: neurons selective to category and invariant to viewpoint, and neurons selective to viewpoint and invariant to category.
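The distinction between in-distribution and OOD category–viewpoint combinations can be illustrated with a minimal Python sketch. It is not taken from the authors' released code: the function names `seen_combination_mask` and `split_indices`, the cyclic-shift rule for choosing seen combinations, and the integer label encoding are all assumptions made for illustration. The parameter `n_seen_per_category` stands in for data diversity, that is, how many viewpoints each category is paired with during training.

```python
import numpy as np


def seen_combination_mask(n_categories, n_viewpoints, n_seen_per_category):
    """Return a boolean grid that is True for in-distribution combinations.

    Hypothetical rule (an assumption, not the paper's protocol): category c
    is seen at viewpoints c, c+1, ..., c+n_seen_per_category-1, modulo the
    number of viewpoints.
    """
    if not 1 <= n_seen_per_category <= n_viewpoints:
        raise ValueError("n_seen_per_category must be in [1, n_viewpoints]")
    mask = np.zeros((n_categories, n_viewpoints), dtype=bool)
    for c in range(n_categories):
        for k in range(n_seen_per_category):
            mask[c, (c + k) % n_viewpoints] = True
    return mask


def split_indices(categories, viewpoints, mask):
    """Split sample indices into in-distribution and OOD sets.

    `categories` and `viewpoints` are integer label arrays of equal length.
    """
    categories = np.asarray(categories)
    viewpoints = np.asarray(viewpoints)
    in_dist = mask[categories, viewpoints]  # look up each sample's combination
    return np.flatnonzero(in_dist), np.flatnonzero(~in_dist)


# Example: 5 categories x 5 viewpoints, each category seen at 2 viewpoints.
mask = seen_combination_mask(5, 5, 2)
train_idx, ood_idx = split_indices([0, 0, 1, 4], [0, 2, 2, 0], mask)
```

Raising `n_seen_per_category` while holding the total number of training images fixed mirrors the data-diversity comparison described in the abstract; samples whose combination falls outside the mask would be held out for OOD evaluation.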
To access and cite the Biased-Cars dataset, please visit https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/F1NQ3R&faces-redirect=true.
Source code and demos are available on GitHub at https://github.com/Spandan-Madan/generalization_to_OOD_category_viewpoint_combinations.
We are grateful to T. Poggio and P. Sinha for their insightful advice and warm encouragement. This work has been partially supported by NSF grant IIS-1901030, a Google Faculty Research Award, the Toyota Research Institute, the Center for Brains, Minds and Machines (funded by NSF STC award CCF-1231216), Fujitsu Laboratories (contract no. 40008819) and the MIT-Sensetime Alliance on Artificial Intelligence. We also thank K. Gupta for help with the figures, and P. Sharma for insightful discussions.
This study received funding from Fujitsu Laboratories. The funder through T.S. was involved in conception of the experiment, writing this article and supervising the study. All other authors declare no competing interests.
Peer review information
Nature Machine Intelligence thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Madan, S., Henry, T., Dozier, J. et al. When and how convolutional neural networks generalize to out-of-distribution category–viewpoint combinations. Nat Mach Intell 4, 146–153 (2022). https://doi.org/10.1038/s42256-021-00437-5