Abstract
We address the problem of 3D rotation equivariance in convolutional neural networks. 3D rotations have been a challenging nuisance in 3D classification tasks, requiring higher capacity and extended data augmentation to tackle. We model 3D data with multi-valued spherical functions and propose a novel spherical convolutional network that implements exact convolutions on the sphere by realizing them in the spherical harmonic domain. The resulting filters have local symmetry and are localized by enforcing smooth spectra. We apply a novel pooling in the spectral domain, and our operations are independent of the underlying spherical resolution throughout the network. We show that networks with much lower capacity and without data augmentation can exhibit performance comparable to the state of the art on standard retrieval and classification benchmarks.
Keywords
- Convolutional Neural Networks (CNNs)
- Spherical Convolution
- Spherical Harmonic Domain
- Spherical Fourier Transform (SFT)
- Group Convolution
1 Introduction
One of the reasons for the tremendous success of convolutional neural networks (CNNs) is their equivariance to translations in Euclidean spaces and the resulting invariance to local deformations. Invariance with respect to other nuisances has traditionally been addressed with data augmentation, while non-Euclidean inputs like point clouds have been approximated by Euclidean representations like voxel grids. Only recently has equivariance been addressed with respect to other groups [1, 2], and CNNs have been proposed for manifolds or graphs [3,4,5].
Equivariant networks retain information about group actions on the input and on the feature maps throughout the layers of a network. Because of their special structure, feature transformations are directly related to spatial transformations of the input. Such equivariant structures yield a lower network capacity in terms of unknowns than alternatives like the Spatial Transformer [6] where a canonical transformation is learnt and applied to the original input.
In this paper, we are primarily interested in analyzing 3D data for alignment, retrieval or classification. Volumetric and point cloud representations have yielded translation and scale invariant approaches: Normalization of translation and scale can be achieved by setting the object’s origin to its center and constraining its extent to a fixed constant. However, 3D rotations remain a challenge to current approaches (Fig. 2 illustrates how classification performance for conventional methods suffers when arbitrary rotations are introduced).
In this paper, we model 3D data with spherical functions valued in \({\mathbb {R}}^n\) and introduce a novel equivariant convolutional neural network with spherical inputs (Fig. 1 illustrates the equivariance). We clarify the difference between convolution, which has spherical outputs, and correlation, which has outputs in the rotation group \(\mathbf {SO}(3)\), and we apply exact convolutions that yield zonal filters, i.e. filters with constant values along the same latitude. Such convolutions cannot be applied with spatially-invariant impulse responses (masks), but can be computed exactly in the spherical harmonic domain through pointwise multiplication. To obtain localized filters, we enforce a smooth spectrum by learning weights only on a few anchor frequencies and interpolating between them, yielding, as an additional advantage, a number of weights independent of the spatial resolution.
It is natural then to apply pooling in the spectral domain. Spectral pooling has the advantage that it retains equivariance while spatial pooling on the sphere is only approximately equivariant. We also propose a weighted averaging pooling where the weights are proportional to the cell area. The only reason to return to the spatial domain is the rectifying nonlinearity, which is a pointwise operator.
We perform 3D retrieval, classification, and alignment experiments. Our aim is to show that we can achieve near state of the art performance with a much lower network capacity, which we achieve for the SHREC’17 [10] contest and ModelNet40 [11] datasets.
Our main contributions can be summarized as follows:
- We propose the first neural network based on spherical convolutions.
- We introduce pooling and parameterization of filters in the spectral domain, with enforced spatial localization and capacity independent of the resolution.
- Our network has much lower capacity than non-spherical networks applied on 3D data without sacrificing performance.
We start with the related work, then introduce the mathematics of group convolutions, in particular on the sphere, and the details of our network. Finally, we perform extensive experiments on retrieval, classification, and alignment.
2 Related Work
We start by describing related work on group equivariance, in particular equivariance on the sphere, and then delve into CNN representations for 3D data.
Methods for enabling equivariance in CNNs can be divided into two groups. In the first, equivariance is obtained by constraining the filter structure, similarly to Lie-generator-based approaches [12, 13]. Worrall et al. [14] use filters derived from the complex harmonics, achieving both rotational and translational equivariance. The second group requires the use of a filter orbit which is itself equivariant to obtain group equivariance. Cohen and Welling [1] convolve with the orbit of a learned filter and prove the equivariance of group convolutions and the preservation of rotational equivariance in the presence of rectification and pooling. Dieleman et al. [15] process elements of the image orbit individually and use the set of outputs for classification. Gens and Domingos [16] produce maps of finite multi-parameter groups, Zhou et al. [17] and Marcos et al. [18] use a rotational filter orbit to produce oriented feature maps and rotationally invariant features, and Lenc and Vedaldi [19] propose a transformation layer which acts as a group convolution by first permuting then transforming by a linear filter.
Recently, a body of work on Graph Convolutional Networks (GCN) has emerged. There are two threads within this space, spectral [20,21,22] and spatial [23,24,25]. These approaches learn filters on irregular but structured graph representations. These methods differ from ours in that we are looking to explicitly learn equivariant and invariant representations for 3D-data modeled as spherical functions under rotation. While such properties are difficult to construct for general manifolds, we leverage the group action of rotations on the sphere.
Most similar to our approach, and developed in parallel (see Footnote 1), is [5], which uses spherical correlation to map spherical inputs to features on \(\mathbf {SO}\)(3), which are then processed with a series of convolutions on \(\mathbf {SO}\)(3). The main difference is that we use spherical convolutions, which are potentially one order of magnitude faster, with smaller (one fewer dimension) filters and feature maps. In addition, we enforce smoothness in the spectral domain, which results in better localization of the receptive fields on the sphere, and we perform pooling in two different ways, either as a low-pass in the spectral domain or as a weighted averaging in the spatial domain. Moreover, our method outperforms [5] on the SHREC'17 benchmark.
Spherical representations for 3D data are not novel; they were used for retrieval tasks before the deep learning era [26, 27] because of their invariance properties and the efficient implementation of spherical correlation [28]. In 3D deep learning, the most natural adaptation of 2D methods was to use a voxel-grid representation of the 3D object and amend the 2D CNN framework to use collections of 3D filters in place of conventional 2D filters for cascaded processing. Such approaches require a tremendous amount of computation to achieve even basic voxel resolution and need a much higher capacity.
Several attempts have been made to use CNNs to produce discriminative representations from volumetric data. 3D ShapeNets [11] and VoxNet [29] propose fully-volumetric networks with 3D convolutional layers followed by fully-connected layers. Qi et al. [8] observe significant overfitting when attempting to train the aforementioned networks end-to-end, and amend the technique using subvolume classification as an auxiliary task; they also propose an alternate 3D CNN which learns to project the volumetric representation to a 2D representation, which is then processed using a conventional 2D CNN architecture. Even with these adaptations, Qi et al. [8] are challenged by overfitting and suggest augmentation in the form of orientation pooling as a remedy. Qi et al. [7] also present an attempt to train a neural network that operates directly on point clouds. Currently, the most successful approaches are view-based, operating on rendered views of the 3D object [8, 9, 30, 31]. The high performance of these methods is in part due to the use of large pre-trained 2D CNNs (on ImageNet, for instance).
3 Preliminaries
3.1 Group Convolution
Consideration of symmetries, in particular rotational symmetries, naturally evokes notions of the Fourier Transform. In the context of deriving rotationally invariant representations, the Fourier Transform is particularly appealing since it exhibits invariance to rotational deformations up to phase (a truly invariant representation can be achieved through application of the modulus operator).
To leverage this property for 3D shape analysis, it is necessary to construct a rotationally equivariant representation of our 3D input. For a group G and function \(f:E\rightarrow F\), f is said to be equivariant to transformations \(g\in G\) when

$$f(g \cdot x) = g' \cdot f(x), \quad \forall x \in E, \qquad (1)$$

where g acts on elements of E and \(g'\) is the corresponding group action which transforms elements of F. If \(E=F\), \(g=g'\). A straightforward example of an equivariant representation is an orbit. For an object x, its orbit O(x) with respect to the group G is defined

$$O(x) = \{g \cdot x \mid g \in G\}. \qquad (2)$$
Through this example it is possible to develop an intuition into the equivariance of the group convolution; convolution can be viewed as the inner-products of some function f with all elements of the orbit of a “flipped” filter h. Formally, the group convolution is defined as

$$(f \star _G h)(x) = \int _{g\in G} f(g\eta )\, h(g^{-1}x)\, dg, \qquad (3)$$

where \(\eta \) is typically a canonical element in the domain of f (e.g. the origin if \(E = \mathbb {R}^n\), or \(I_n\) if \(E = \mathbf {SO}(n)\)). The familiar convolution on the plane is a special case of the group convolution with the group \(G=\mathbb {R}^2\) with addition,

$$(f \star h)(x) = \int _{g\in \mathbb {R}^2} f(g)\, h(x - g)\, dg. \qquad (4)$$

The group convolution can be shown to be equivariant. For any \(\alpha \in G\),

$$((\lambda _\alpha f) \star _G h)(x) = (\lambda _\alpha (f \star _G h))(x), \qquad (5)$$

where \(\lambda _\alpha \) denotes the action of \(\alpha \) on functions, \((\lambda _\alpha f)(x) = f(\alpha ^{-1}x)\).
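For completeness, a short derivation of (5); it uses only the definition (3) and the invariance of the measure dg under left translation:

$$\begin{aligned} ((\lambda _\alpha f) \star _G h)(x) &= \int _{g\in G} f(\alpha ^{-1} g \eta )\, h(g^{-1}x)\, dg \\ &= \int _{g'\in G} f(g'\eta )\, h\big ((\alpha g')^{-1}x\big )\, dg' \qquad (g = \alpha g',\ dg = dg') \\ &= \int _{g'\in G} f(g'\eta )\, h\big (g'^{-1}(\alpha ^{-1}x)\big )\, dg' = (f \star _G h)(\alpha ^{-1}x) = (\lambda _\alpha (f \star _G h))(x). \end{aligned}$$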
3.2 Spherical Harmonics
Following directly the preliminaries above, we can define the convolution of a spherical signal f by a spherical filter h with respect to the group of 3D rotations \(\mathbf {SO}(3)\):

$$(f \star _G h)(x) = \int _{g\in \mathbf {SO}(3)} f(g\eta )\, h(g^{-1}x)\, dg, \qquad (6)$$

where \(\eta \) is the north pole of the sphere.
To implement (6), it is desirable to sample the sphere with well-distributed and compact cells with transitivity (rotations exist which bring cells into coincidence). Unfortunately, such a discretization does not exist [32]. Neither the familiar sampling by latitude and longitude nor the uniformly distributed sampling according to Platonic solids satisfies all constraints. These issues are compounded by the eventual goal of performing cascaded convolutions on the sphere.
To circumvent these issues, we choose to evaluate the spherical convolution in the spectral domain. This is possible as the machinery of Fourier analysis has extended the well-known convolution theorem to functions on the sphere: the Spherical Fourier Transform of a convolution is the pointwise product of Spherical Fourier Transforms (see [33, 34] for further details). The Fourier transform and its inverse are defined on the sphere as follows [33]:

$$f(x) = \sum _{0 \le \ell < b}\; \sum _{|m| \le \ell } \hat{f}_m^{\ell }\, Y_m^{\ell }(x), \qquad (7)$$

$$\hat{f}_m^{\ell } = \int _{S^2} f(x)\, \overline{Y_m^{\ell }(x)}\, dx, \qquad (8)$$

where b is the bandwidth of f, and \(Y_m^{\ell }\) are the spherical harmonics of degree \(\ell \) and order m. We refer to (8) as the Spherical Fourier Transform (SFT), and to (7) as its inverse (ISFT). Revisiting (6), letting \(y = (f \star _G h)(x)\), the spherical convolution theorem [34] gives us

$$\hat{y}_m^{\ell } = 2\pi \sqrt{\frac{4\pi }{2\ell +1}}\, \hat{f}_m^{\ell }\, \hat{h}_0^{\ell }. \qquad (9)$$
To compute the convolution of a signal f with a filter h, we first expand f and h into their spherical harmonic basis (8), second compute the pointwise product (9), and finally invert the spherical harmonic expansion (7).
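The following minimal numpy sketch illustrates this recipe. It is not the paper's implementation: for clarity it approximates Eq. (8) with a naive quadrature (weighting each sample by \(\sin \theta \, \varDelta \theta \, \varDelta \phi \)) instead of the exact sampling theorem of [34], and stores coefficients in a dictionary; `make_grid`, `sft`, `isft`, and `spherical_conv` are hypothetical names.

```python
import numpy as np
from scipy.special import sph_harm

def make_grid(b):
    # Equiangular grid of Sect. 3.3: theta_j = pi*j/(2b), phi_k = pi*k/b.
    j, k = np.meshgrid(np.arange(2 * b), np.arange(2 * b), indexing="ij")
    return np.pi * j / (2 * b), np.pi * k / b  # colatitude, azimuth

def sft(f, b):
    # Naive quadrature approximation of Eq. (8); the exact sampling
    # theorem (Eq. (10)) replaces sin(theta)*dA with the weights a_j.
    theta, phi = make_grid(b)
    dA = (np.pi / (2 * b)) * (np.pi / b)  # cell size in (theta, phi)
    coeffs = {}
    for l in range(b):
        for m in range(-l, l + 1):
            Y = sph_harm(m, l, phi, theta)  # scipy order: (m, l, azimuth, polar)
            coeffs[(l, m)] = np.sum(f * np.conj(Y) * np.sin(theta) * dA)
    return coeffs

def isft(coeffs, b):
    # Eq. (7): sum of coefficients times spherical harmonics.
    theta, phi = make_grid(b)
    f = np.zeros_like(theta, dtype=complex)
    for (l, m), c in coeffs.items():
        f += c * sph_harm(m, l, phi, theta)
    return f

def spherical_conv(f, h, b):
    # Eq. (9): pointwise product in the spectrum; only the m = 0 (zonal)
    # coefficients of the filter h contribute.
    fh, hh = sft(f, b), sft(h, b)
    yh = {(l, m): 2 * np.pi * np.sqrt(4 * np.pi / (2 * l + 1)) * c * hh[(l, 0)]
          for (l, m), c in fh.items()}
    return isft(yh, b)
```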
It is important to note that this definition of spherical convolution is distinct from spherical correlation, which produces an output response on \(\mathbf {SO}\)(3). Convolution here can be seen as marginalizing the angle responsible for rotating the filter about its north pole, or equivalently, as considering zonal filters on the sphere.
3.3 Practical Considerations and Optimizations
To evaluate the SFT, we use equiangular samples on the sphere according to the sampling theorem of [34],

$$\hat{f}_m^{\ell } = \frac{\sqrt{2\pi }}{2b} \sum _{j=0}^{2b-1} \sum _{k=0}^{2b-1} a_j^{(b)}\, f(\theta _j, \phi _k)\, \overline{Y_m^{\ell }(\theta _j, \phi _k)}, \qquad (10)$$

where \(\theta _j=\pi j/2b\) and \(\phi _k=\pi k/b\) form the sampling grid, and \(a_j^{(b)}\) are the sample weights. Note that all the required operations are pointwise matrix multiplications and sums, which are differentiable and readily available in most automatic differentiation frameworks. In our direct implementation, we precompute all needed \(Y_m^{\ell }\), which are stored as constants in the computational graph.
Separation of Variables: We also implement a potentially faster SFT based on separation of variables, as shown in [34]. Expanding \(Y_m^{\ell }\) in (10), we obtain

$$\hat{f}_m^{\ell } = \frac{\sqrt{2\pi }}{2b} \sum _{j=0}^{2b-1} a_j^{(b)}\, q_m^{\ell }\, P_m^{\ell }(\cos \theta _j) \sum _{k=0}^{2b-1} f(\theta _j, \phi _k)\, e^{-im\phi _k}, \qquad (11)$$

where \(P_m^{\ell }\) is the associated Legendre polynomial, and \(q_m^{\ell }\) a normalization factor. The inner sum can be computed using a row-wise Fast Fourier Transform, and what remains is an associated Legendre transform, which we compute directly. The same idea applies to the ISFT. We found this method to be faster when \(b \ge 32\). There are faster algorithms available [34, 35], which we did not attempt.
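A sketch of this factorization, under the same naive conventions as above; the weights \(a_j^{(b)}\) of [34] are assumed given, and only \(m \ge 0\) is computed (the remaining coefficients follow from the symmetry discussed next):

```python
from math import factorial

import numpy as np
from scipy.special import lpmv

def sft_sepvar(f, b, weights):
    # Separation of variables (Eq. (11)): an FFT over longitude followed
    # by a direct associated Legendre transform over latitude.
    theta = np.pi * np.arange(2 * b) / (2 * b)
    g = np.fft.fft(f, axis=1)  # inner sum: sum_k f(theta_j, phi_k) e^{-im phi_k}
    coeffs = {}
    for l in range(b):
        for m in range(l + 1):
            q = np.sqrt((2 * l + 1) / (4 * np.pi)
                        * factorial(l - m) / factorial(l + m))
            P = lpmv(m, l, np.cos(theta))  # associated Legendre P_m^l
            coeffs[(l, m)] = (np.sqrt(2 * np.pi) / (2 * b)
                              * np.sum(weights * q * P * g[:, m]))
    return coeffs
```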
Leveraging Symmetry: For real-valued inputs, \(\hat{f}_{-m}^{\ell } = (-1)^{m}\overline{\hat{f}_{m}^{\ell }}\) (this follows from \(\overline{Y_{-m}^{\ell }} = (-1)^m Y_m^{\ell }\)). We thus need only compute half the coefficients (\(m > 0\)). Furthermore, we can rewrite the SFT and ISFT to avoid expensive complex number support or multiplication; for instance, the ISFT becomes

$$f(x) = \sum _{0 \le \ell < b} \left( \hat{f}_0^{\ell }\, Y_0^{\ell }(x) + 2 \sum _{m=1}^{\ell } \mathrm {Re}\!\left( \hat{f}_m^{\ell }\, Y_m^{\ell }(x) \right) \right), \qquad (12)$$

which involves only the coefficients with \(m \ge 0\).
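As a quick sanity check of this symmetry, using the naive `sft` sketch above (the identity holds exactly at the level of the quadrature sums, so only floating-point error remains):

```python
rng = np.random.default_rng(0)
b = 8
f = rng.standard_normal((2 * b, 2 * b))  # real-valued signal on the grid
c = sft(f, b)
for l in range(b):
    for m in range(1, l + 1):
        assert np.allclose(c[(l, -m)], (-1) ** m * np.conj(c[(l, m)]))
```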
4 Method
Figure 3 shows an overview of our method. We define a block as one spherical convolutional layer, followed by optional pooling, and nonlinearity. A weighted global average pooling is applied at the last layer to obtain an invariant descriptor. This section details the architectural design choices.
4.1 Spectral Filtering
In this section, we define the filter parameterization. One possible approach would be to define a compact support around one of the poles and learn the values for each discrete location, setting the rest to zero. The downside of this approach is that there is no guarantee that the filter will be bandlimited. If it is not, the SFT will implicitly bandlimit the signal, which causes a discrepancy between the parameters and the actual realization of the filters.
To avoid this problem, we parameterize the filters in the spectral domain. As (9) shows, only the SFT coefficients of order \(m=0\) of the filter h enter the convolution. In the spatial domain, this implies that for any h there is always a zonal filter (constant value per latitude) \(h_z\) such that \(\forall y,\, y * h = y * h_z\). Thus, it only makes sense to learn zonal filters.
The spectral parameterization is also faster because it eliminates the need to compute the filter SFT: the filters are defined in the spectral domain, which is also the domain where the convolution is computed.
Non-localized Filters: A first approach is to parameterize the filters by all SFT coefficients of order \(m=0\). For example, given \(32 \times 32\) inputs, the maximum bandwidth is \(b=16\), so there are 16 parameters to be learned (\(\hat{h}_0^0, \ldots \hat{h}_0^{15} \)). A downside is that the filters may not be local; however, locality may be learned.
Localized Filters: From Parseval’s theorem and the derivative rule from Fourier analysis we can show that spectral smoothness corresponds to spatial decay. This is used in the construction of graph-based neural networks [36], and also applies to the filters spanned by the family of spherical harmonics of order zero (\(m=0\)).
To obtain localized filters, we parameterize the spectrum with anchor points. We fix n uniformly spaced degrees \(\ell _i\) and learn the corresponding coefficients \(f_0^{\ell _i}\). The coefficients for the missing degrees are obtained by linear interpolation, which enforces smoothness. A second advantage is that the number of parameters per filter is independent of the input resolution. Figure 4 shows some filters learned by our model; the filters on the right are obtained by imposing locality.
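A sketch of this parameterization, with a plain numpy array standing in for the n learned anchor values:

```python
def localized_zonal_filter(anchors, b):
    # Interpolate n anchor values onto all degrees 0..b-1; the smooth
    # spectrum yields a spatially localized zonal filter.
    anchor_degrees = np.linspace(0, b - 1, num=len(anchors))
    return np.interp(np.arange(b), anchor_degrees, anchors)  # \hat{h}_0^l

# Usage: a bandwidth-16 filter from only 4 learned values; the parameter
# count stays the same if the input resolution (and b) grows.
h_hat = localized_zonal_filter(np.array([0.5, 1.0, -0.2, 0.1]), b=16)
```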
4.2 Pooling
The conventional spatial max pooling used in CNNs has two drawbacks for Spherical CNNs: (1) it needs an expensive ISFT to convert back to the spatial domain, and (2) equivariance is not completely preserved, especially because of the unequal cell areas of the equiangular sampling. Weighted average pooling (WAP) takes the cell areas into account to mitigate the latter, but is still affected by the former.
We introduce spectral pooling (SP) for Spherical CNNs. If the input has bandwidth b, we remove all coefficients with degree greater than or equal to b/2 (effectively, a low-pass box filter). Such an operation is known to cause ringing artifacts, which can be mitigated by smoothing beforehand, although we did not find any performance advantage in doing so. Note that spectral pooling was proposed before for conventional CNNs [37].
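With coefficients stored as in the earlier sketches, spectral pooling is a one-liner (a sketch; an actual implementation would operate on packed coefficient tensors):

```python
def spectral_pool(coeffs, b):
    # Low-pass box filter in the spectrum: drop all degrees l >= b/2,
    # halving the bandwidth without returning to the spatial domain.
    return {(l, m): c for (l, m), c in coeffs.items() if l < b // 2}
```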
We found that spectral pooling is significantly faster and reduces the equivariance error, but it also reduces classification accuracy. The choice between SP and WAP is application-dependent: our experiments show that SP is more suitable for shape alignment, while WAP is better for classification and retrieval. Table 5 shows the performance of each method.
4.3 Global Pooling
In fully convolutional networks, it is usual to apply a global average pooling at the last layer to obtain a descriptor vector, where each entry is the average of one feature map. We use the same idea; however, the equiangular spherical sampling results in cells of different areas, so we compute a weighted average instead, with each cell weighted proportionally to its area (\(\sin \theta \), for colatitude \(\theta \)). We denote this Weighted Global Average Pooling (WGAP). Note that WGAP is invariant to rotation, therefore the descriptor is also invariant. Figure 5 shows such descriptors.
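A sketch of WGAP on the equiangular grid, for a hypothetical `feature_maps` array of shape `(channels, 2b, 2b)` and the `make_grid` helper from Sect. 3.3:

```python
def wgap(feature_maps, b):
    # Each cell weighted by its area element sin(theta).
    theta, _ = make_grid(b)
    w = np.sin(theta)
    return (feature_maps * w).sum(axis=(1, 2)) / w.sum()
```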
An alternative to this approach is to use the magnitude per degree of the SFT coefficients; formally, if the last layer has bandwidth b and \(\hat{f^{\ell }} = [\hat{f}_{-\ell }^{\ell },\hat{f}_{-\ell +1}^{\ell }, \ldots , \hat{f}_{\ell }^{\ell }]\), then \([\Vert \hat{f^{0}}\Vert ,\, \Vert \hat{f^{1}}\Vert ,\, \ldots ,\, \Vert \hat{f^{b-1}}\Vert ]\) is an invariant descriptor [33]. We denote this approach MAG-L (magnitude per degree \(\ell \)). We found no difference in classification performance when using it (see Table 5).
4.4 Architecture
Our main architecture has two branches, one for distances and one for surface normals. This performs better than having two input channels, and slightly better than having two separate voting networks for distances and normals. Each branch has 8 spherical convolutional layers, with 16, 16, 32, 32, 64, 64, 128, and 128 channels per layer. Pooling and feature concatenation from one branch into the other are performed when the number of channels increases. WGAP is performed after the last layer, which is then projected into the number of classes.
5 Experiments
The greatest advantage of our model is its inherent equivariance to \(\mathbf {SO}\)(3); we focus the experiments on problems that benefit from it, namely shape classification and retrieval in arbitrary orientations, and shape alignment.
We chose problems related to 3D shapes due to the availability of large datasets and published results on them; our method would also be applicable to any kind of data that can be mapped to the sphere (e.g. panoramas).
5.1 Preliminaries
Ray-Mesh Intersection: 3D shapes are usually represented by a mesh or a voxel grid, which needs to be converted to a spherical function. Note that the conversion function itself must be equivariant to rotations; our learned representation will not be equivariant if the input is pre-processed by a non-equivariant function.
Given a mesh or voxel grid, we first find the bounding sphere and its center. Given a desired resolution n, we cast \(n \times n\) equiangular rays from the center, and obtain the intersections between each ray and the mesh/voxel grid. Let \(d_{jk}\) be the distance from the center to the farthest point of intersection, for a ray at direction \((\theta _j, \phi _k)\). The function on the sphere is given by \(f(\theta _j, \phi _k) = d_{jk},\, 1 \le j,k \le n\).
For mesh inputs, we also compute the angle \(\alpha \) between the ray and the surface normal at the intersecting face, giving a second channel \(f(\theta _j, \phi _k) = [d, \sin \alpha ]\).
Note that this representation is suitable for star-shaped objects, defined as objects that contain an interior point from which the whole boundary is visible. Moreover, the center of the bounding sphere must be one such point. In practice, we do not check whether these conditions hold; even if the representation is ambiguous or non-invertible, it is still useful.
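A sketch of the distance-channel conversion using the trimesh library (an assumed choice; any mesh library with ray casting would do, and the normal channel is omitted). `make_grid` is the hypothetical helper from Sect. 3.3:

```python
import numpy as np
import trimesh  # assumed mesh library with ray casting

def mesh_to_sphere(mesh, b):
    # Cast 2b x 2b equiangular rays from the bounding-sphere center and
    # keep the distance to the farthest intersection along each ray.
    center = mesh.bounding_sphere.primitive.center
    theta, phi = make_grid(b)
    dirs = np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1).reshape(-1, 3)
    origins = np.tile(center, (len(dirs), 1))
    locs, ray_ids, _ = mesh.ray.intersects_location(origins, dirs)
    d = np.zeros(len(dirs))  # rays that miss the mesh keep distance 0
    np.maximum.at(d, ray_ids, np.linalg.norm(locs - center, axis=1))
    return d.reshape(2 * b, 2 * b)
```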
Training: We train using ADAM for 48 epochs, with an initial learning rate of \(10^{-3}\) that is divided by 5 at epochs 32 and 40.
We make use of data augmentation for training, performing rotations, anisotropic scaling and mirroring on the meshes, and adding jitter to the bounding sphere center when constructing the spherical function. Note that, even though our learned representation is equivariant to rotations, augmenting the inputs with rotations is still beneficial due to interpolation and sampling effects.
5.2 3D Object Classification
This section shows classification performance on ModelNet40 [11]. Three modes are considered: (1) trained and tested with azimuthal rotations (z/z), (2) trained and tested with arbitrary rotations (\(\mathbf {SO}\)(3)/\(\mathbf {SO}\)(3)), and (3) trained with azimuthal and tested with arbitrary rotations (z/\(\mathbf {SO}\)(3)).
Table 1 shows the results. All competing methods suffer a sharp drop in performance when arbitrary rotations are present, even if they are seen during training. Our model is more robust, but there is a noticeable drop for mode 3, which we attribute to sampling effects. Since we use equiangular sampling, the cell area varies with latitude. Rotations around z preserve latitude, so regions at the same height are sampled at the same resolution during training, but not at test time. We believe this can be improved by using an equal-area spherical sampling.
We evaluate competing methods using the default settings of their published code. The volumetric [8] and point cloud based [7, 38] methods cannot generalize to unseen orientations (z/\(\mathbf {SO}\)(3)). The multi-view methods [9, 30] can be seen as a brute-force approach to equivariance, and MVCNN [9] generalizes to unseen orientations only up to a point. Yet, the Spherical CNN outperforms it, even with orders of magnitude fewer parameters and faster training. Interestingly, RotationNet [30], which holds the current state of the art on ModelNet40 classification, fails to generalize to unseen rotations, despite being multi-view based.
Equivariance to \(\mathbf {SO}\)(3) is unneeded when only azimuthal rotations are present (z/z); the full potential of our model is not exercised in this case.
5.3 3D Object Retrieval
We run retrieval experiments on ShapeNet Core55 [39], following the SHREC’17 3D shape retrieval rules [10], which includes random \(\mathbf {SO}\)(3) perturbations.
The network is trained for classification on the 55 core classes (we do not use the subclasses), with an extra in-batch triplet loss (from [40]) to encourage descriptors to be close for matching categories and far for non-matching.
The invariant descriptor is used with a cosine distance for retrieval. We first compute a per-class threshold that maximizes the training set F-score. For test set retrieval, we return the elements whose distances are below their class threshold, and include all elements classified as the same class as the query. Table 2 shows the results. Our model matches the state-of-the-art performance [41], with significantly fewer parameters, a smaller input size, and no pre-training.
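A sketch of this retrieval rule (all names hypothetical; `thresholds` holds the per-class values tuned on the training set, and labels are the predicted classes):

```python
def retrieve(query_desc, query_label, gallery_descs, gallery_labels, thresholds):
    # Rank gallery items by cosine distance between invariant descriptors;
    # keep those below the per-class threshold, plus every item predicted
    # to share the query's class.
    q = query_desc / np.linalg.norm(query_desc)
    g = gallery_descs / np.linalg.norm(gallery_descs, axis=1, keepdims=True)
    dist = 1.0 - g @ q
    keep = (dist < thresholds[query_label]) | (gallery_labels == query_label)
    order = np.argsort(dist)
    return order[keep[order]]
```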
5.4 Shape Alignment
Our learned equivariant feature maps can be used for shape alignment using spherical correlation. Given two shapes from the same category (not necessarily the same instance), under arbitrary orientations, we run them through the network and collect the feature maps at some layer. We compute the correlation between each pair of corresponding feature maps, and add the results. The maximum value of the correlation function (which takes inputs on \(\mathbf {SO}(3)\)) corresponds to the rotation that aligns both shapes [28].
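Concretely, with corresponding feature maps \(f_c\) and \(g_c\) (channel c) of the two shapes, the summed spherical correlation and the recovered alignment are

$$C(R) = \sum _{c} \int _{S^2} f_c(x)\, g_c(R^{-1}x)\, dx, \qquad R^{\ast } = \mathop {\mathrm {arg\,max}}_{R \in \mathbf {SO}(3)} C(R).$$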
Features from deeper layers are richer and carry semantic value, but are at lower resolution. We run an experiment to determine the performance of the shape alignment per layer, while also comparing with the spherical correlation done at the network inputs (not learned).
We select categories from ModelNet10 that do not have rotational symmetry so that the ground truth rotation is unique and the angular error is measurable. These categories are: bed, sofa, toilet, chair. Only entries from the test set are used. Results are in Table 3, while Fig. 6 shows some examples. Results show that the learned features are superior to the handcrafted spherical shape representation for this task, and best performance is achieved by using intermediate layers. The resolution at conv4 is \(32 \times 32\), which corresponds to cell dimensions up to \(11.25 \text { deg}\), so we cannot expect errors much lower than this.
5.5 Equivariance Error Analysis
Even though spherical convolutions are equivariant to \(\mathbf {SO}\)(3) for bandlimited inputs, and spectral pooling preserves bandlimit, there are other factors that may introduce equivariance errors. We quantify these effects in this section.
We feed each entry in the test set together with one random rotation to the network, then apply the same rotation to the feature maps and measure the average relative error. Table 4 shows the results. The pointwise nonlinearity does not preserve the bandlimit and causes equivariance errors (rows 1, 4). The mesh-to-sphere map is only approximately equivariant, which can be mitigated with larger input dimensions (input column for rows 1, 5). The error is smaller when the input is bandlimited (rows 1, 7). Spectral pooling is exactly equivariant, while max-pooling introduces higher frequencies and has a larger error than WAP (rows 1, 2, 3). The error for an untrained model demonstrates that the equivariance is by design and not learned (row 6); it is smaller than for the trained model because the learned filters are usually high-pass, which increases the pointwise relative error. A linear model with bandlimited inputs has zero equivariance error, as expected (row 8).
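A sketch of the measurement for one input (hypothetical `model` and `rotate` callables; the latter rotates a spherical map by resampling):

```python
def equivariance_error(model, rotate, f):
    # Relative difference between rotate-then-network and
    # network-then-rotate, for one input f and one fixed rotation.
    a = model(rotate(f))
    b = rotate(model(f))
    return np.linalg.norm(a - b) / np.linalg.norm(b)
```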
Note that even conventional planar CNNs will exhibit a degree of translational equivariance error introduced by max pooling and discretization.
5.6 Ablation Study
In this section we evaluate numerous variations of our method to determine the sensitivity to design choices. First, we are interested in assessing the effects from our contributions SP, WAP, WGAP, and localized filters. Second, we are interested in understanding how the network size affects performance. Results show that the use of WAP, WGAP, and localized filters significantly improve performance, and also that further performance improvements can be achieved with larger networks. In summary, factors that increase bandwidth (e.g. max-pooling) also increase equivariance error and may reduce accuracy. Global operations in early layers (e.g. non-local filters) escape the receptive field and reduce accuracy.
6 Conclusion
We presented Spherical CNNs, which leverage spherical convolutions to achieve equivariance to \(\mathbf {SO}\)(3) perturbations. The network is applied to 3D object classification, retrieval, and alignment, but has potential applications to spherical images such as panoramas, or to any data that can be represented as a spherical function. We showed that our model can naturally handle arbitrary input orientations while requiring relatively few parameters and small input sizes.
Footnotes

1. The first version of this work was submitted to CVPR on 11/15/2017, shortly after we became aware of the ICLR submission of Cohen et al. [5] on 10/27/2017.
References
1. Cohen, T.S., Welling, M.: Group equivariant convolutional networks (2016). arXiv preprint: arXiv:1602.07576
2. Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Harmonic networks: deep translation and rotation equivariance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (2017)
3. Bruna, J., Szlam, A., LeCun, Y.: Learning stable group invariant representations with convolutional networks (2013). arXiv preprint: arXiv:1301.3537
4. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: going beyond Euclidean data. IEEE Signal Process. Mag. 34(4), 18–42 (2017)
5. Cohen, T.S., Geiger, M., Köhler, J., Welling, M.: Spherical CNNs. In: International Conference on Learning Representations (2018)
6. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
7. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the Computer Vision and Pattern Recognition (CVPR), vol. 1(2), p. 4. IEEE (2017)
8. Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view CNNs for object classification on 3D data. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, pp. 5648–5656, 27–30 June 2016
9. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3D shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953 (2015)
10. Savva, M., et al.: SHREC'17 track: large-scale 3D shape retrieval from ShapeNet Core55. In: 10th Eurographics Workshop on 3D Object Retrieval, pp. 1–11 (2017)
11. Wu, Z., et al.: 3D ShapeNets: a deep representation for volumetric shapes. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, pp. 1912–1920, 7–12 June 2015
12. Segman, J., Rubinstein, J., Zeevi, Y.Y.: The canonical coordinates method for pattern deformation: theoretical and computational considerations. IEEE Trans. Pattern Anal. Mach. Intell. 14(12), 1171–1183 (1992)
13. Hel-Or, Y., Teo, P.C.: Canonical decomposition of steerable functions. In: Proceedings of the 1996 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 1996, pp. 809–816. IEEE (1996)
14. Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Harmonic networks: deep translation and rotation equivariance (2016). arXiv preprint: arXiv:1612.04642
15. Dieleman, S., Willett, K.W., Dambre, J.: Rotation-invariant convolutional neural networks for galaxy morphology prediction. Mon. Not. R. Astron. Soc. 450(2), 1441–1459 (2015)
16. Gens, R., Domingos, P.M.: Deep symmetry networks. In: Advances in Neural Information Processing Systems, pp. 2537–2545 (2014)
17. Zhou, Y., Ye, Q., Qiu, Q., Jiao, J.: Oriented response networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
18. Marcos, D., Volpi, M., Komodakis, N., Tuia, D.: Rotation equivariant vector field networks. CoRR (2016)
19. Lenc, K., Vedaldi, A.: Understanding image representations by measuring their equivariance and equivalence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 991–999 (2015)
20. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs (2013). arXiv preprint: arXiv:1312.6203
21. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems, pp. 3844–3852 (2016)
22. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks (2016). arXiv preprint: arXiv:1609.02907
23. Boscaini, D., Masci, J., Rodolà, E., Bronstein, M.: Learning shape correspondence with anisotropic convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 3189–3197 (2016)
24. Masci, J., Boscaini, D., Bronstein, M., Vandergheynst, P.: Geodesic convolutional neural networks on Riemannian manifolds. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 37–45 (2015)
25. Monti, F., Boscaini, D., Masci, J., Rodolà, E., Svoboda, J., Bronstein, M.M.: Geometric deep learning on graphs and manifolds using mixture model CNNs (2016). arXiv preprint: arXiv:1611.08402
26. Frome, A., Huber, D., Kolluri, R., Bülow, T., Malik, J.: Recognizing objects in range data using regional point descriptors. In: Pajdla, T., Matas, J. (eds.) ECCV 2004, Part III. LNCS, vol. 3023, pp. 224–237. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24672-5_18
27. Kazhdan, M., Funkhouser, T.: Harmonic 3D shape matching. In: ACM SIGGRAPH 2002 Conference Abstracts and Applications, p. 191. ACM (2002)
28. Makadia, A., Daniilidis, K.: Spherical correlation of visual representations for 3D model retrieval. Int. J. Comput. Vis. 89(2), 193–210 (2010)
29. Maturana, D., Scherer, S.: VoxNet: a 3D convolutional neural network for real-time object recognition. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, 28 September–2 October 2015, pp. 922–928 (2015)
30. Kanezaki, A., Matsushita, Y., Nishida, Y.: RotationNet: joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
31. Bai, S., Bai, X., Zhou, Z., Zhang, Z., Jan Latecki, L.: GIFT: a real-time and scalable 3D shape search engine. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5023–5032 (2016)
32. Thurston, W.P.: Three-Dimensional Geometry and Topology, vol. 1. Princeton University Press, Princeton (1997)
33. Arfken, G.: Mathematical Methods for Physicists, vol. 2. Academic Press, London (1966)
34. Driscoll, J.R., Healy, D.M.: Computing Fourier transforms and convolutions on the 2-sphere. Adv. Appl. Math. 15(2), 202–250 (1994)
35. Healy, D.M., Rockmore, D.N., Kostelec, P.J., Moore, S.: FFTs for the 2-sphere: improvements and variations. J. Fourier Anal. Appl. 9(4), 341–385 (2003)
36. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. CoRR (2013)
37. Rippel, O., Snoek, J., Adams, R.P.: Spectral representations for convolutional neural networks. CoRR (2015)
38. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, pp. 5105–5114 (2017)
39. Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. CoRR (2015)
40. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823 (2015)
41. Furuya, T., Ohbuchi, R.: Deep aggregation of local 3D geometric features for 3D model retrieval. In: BMVC (2016)
42. Tatsuma, A., Aono, M.: Multi-Fourier spectra descriptor and augmentation with spectral clustering for 3D shape retrieval. Vis. Comput. 25(8), 785–804 (2009)
Acknowledgments
We are grateful for support through the following grants: NSF-DGE-0966142 (IGERT), NSF-IIP-1439681 (I/UCRC), NSF-IIS-1426840, NSF-IIS-1703319, NSF MRI 1626008, ARL RCTA W911NF-10-2-0016, ONR N00014-17-1-2093, and by Honda Research Institute.