3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has rarely been explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generation with 3D scenes represented by 3D Gaussian Splatting (3DGS) is unbounded and exhibits scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding that effectively captures both the semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we find that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we demonstrate image-to-3DGS and text-to-3DGS generation as applications that showcase its ability to facilitate downstream generation tasks.
Scene-level representations exhibit varying levels of detail. Moreover, SfM point clouds and camera poses optimized by COLMAP are not guaranteed to be in a canonical scale, making it difficult for VAE models to learn a unified and smooth latent space across different scenes.
Therefore, 3DGS representations optimized from COLMAP camera poses and initial SfM point clouds (or other metric-scale estimates) must be normalized to ensure a unified latent representation.
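As a concrete illustration, the sketch below shows one plausible per-scene normalization, assuming each scene is given as an (N, 3) array of Gaussian centers plus per-Gaussian log-scales (3DGS stores scales in log space). The robust-percentile extent and the specific function names here are our own assumptions, not necessarily the exact normalization used in the paper.

```python
import numpy as np

def normalize_scene(means, log_scales, percentile=95.0):
    """Map a scene's Gaussians into a canonical bounded space.

    means:      (N, 3) Gaussian centers from a COLMAP-initialized 3DGS fit
    log_scales: (N, 3) per-Gaussian log-scales
    percentile: robust extent estimate, so a few far-away floaters
                do not dominate the scale factor
    """
    center = np.median(means, axis=0)                 # robust scene center
    radii = np.linalg.norm(means - center, axis=1)
    extent = np.percentile(radii, percentile) + 1e-8  # robust scene radius
    means_norm = (means - center) / extent            # roughly within a unit ball
    log_scales_norm = log_scales - np.log(extent)     # scales shrink by the same factor
    return means_norm, log_scales_norm, center, extent
```

Dividing positions by a single per-scene extent keeps the representation isotropic, and subtracting log(extent) from the log-scales applies the same shrinkage to each Gaussian's footprint.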
Even with this normalization, naïve convolution-based architectures struggle to generalize to unseen scene-level 3D structure (as also observed by Bolt3D), highlighting the need for improved model design. Can3Tok processes a batch of per-scene 3D Gaussians with batch size B, where each scene contains the same number of Gaussians N. The encoder maps the input Gaussians into a low-dimensional latent space, followed by VAE reparametrization, and the decoder reconstructs the embeddings back into 3D space, corresponding to the original input 3D Gaussians.
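The encode–reparametrize–decode loop described above can be sketched as follows. This is a toy stand-in, not the Can3Tok architecture: the real model is attention-based (the acknowledgments point to Perceiver IO), while the MLP encoder/decoder and all layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SceneVAE(nn.Module):
    """Toy stand-in: (B, N, C) Gaussian parameters -> latent -> reconstruction.

    C bundles each Gaussian's attributes (position, rotation, scale,
    opacity, color); the real model attends over Gaussians rather than
    flattening them.
    """
    def __init__(self, num_gaussians, channels, latent_dim=256):
        super().__init__()
        flat = num_gaussians * channels
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(flat, 1024), nn.ReLU())
        self.to_mu = nn.Linear(1024, latent_dim)
        self.to_logvar = nn.Linear(1024, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(), nn.Linear(1024, flat))
        self.num_gaussians, self.channels = num_gaussians, channels

    def forward(self, x):                                  # x: (B, N, C)
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparametrization
        recon = self.decoder(z).view(-1, self.num_gaussians, self.channels)
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar, kl_weight=1e-4):
    """Standard VAE objective: reconstruction + KL (weight is illustrative)."""
    rec = torch.mean((recon - x) ** 2)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_weight * kl
```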
To illustrate the intuition behind our model's ability to generalize to unseen scene-level 3DGS representations, we show t-SNE visualizations of the 3DGS latent space for the same scene under 36 linearly interpolated SO(3) rotations from 0 to 360 degrees. (a), (b), and (c) visualize the latents of the same scene rotated around the X-, Y-, and Z-axis, respectively, and (d) is the color map corresponding to the rotation degree. All three rotations trace closed loops, demonstrating that our model preserves spatial information in its latent representations. In (e) and (f), red dots are latent embeddings of the same scene under 200 random SO(3) rotations, and blue dots are latent embeddings of different scenes.
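A visualization of this kind can be reproduced roughly as sketched below, assuming a trained model that exposes an `encode(means, feats) -> (D,)` method (an assumed API, not part of the released code). For brevity the rotation is applied to Gaussian centers only; a full treatment would also rotate each Gaussian's orientation quaternion.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from sklearn.manifold import TSNE

def rotated_latents(model, means, feats, axis="z", steps=36):
    """Encode the same scene under `steps` rotations about one axis.

    means: (N, 3) normalized Gaussian centers; feats: (N, C) remaining attributes
    """
    zs = []
    for deg in np.linspace(0.0, 360.0, steps, endpoint=False):
        R = Rotation.from_euler(axis, deg, degrees=True).as_matrix()
        zs.append(model.encode(means @ R.T, feats))  # rotate centers, re-encode
    return np.stack(zs)                              # (steps, D)

def tsne_2d(latents, perplexity=10):
    """2D t-SNE projection; a closed loop indicates the latent tracks the rotation smoothly."""
    return TSNE(n_components=2, perplexity=perplexity).fit_transform(latents)
```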
3DGS reconstructions of general scenes often contain noise artifacts such as floaters due to insufficient visual observations (unlike objects, which are normally captured with ample view coverage). To address this issue, we apply semantic-guided filtering to the raw 3DGS input to subsample as-clean-as-possible Gaussian primitives. As shown in the figure, such semantic filtering preserves the most semantically meaningful content while removing less salient and noisy Gaussians. Please refer to the paper for more details.
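One plausible realization of such filtering is sketched below, assuming per-view binary saliency masks (e.g., from a language-grounded segmenter such as LangSAM, which the acknowledgments credit) and pinhole cameras; the voting scheme and keep ratio are our assumptions, and the paper's actual criterion may differ.

```python
import numpy as np

def semantic_filter(means, masks, Ks, w2cs, keep_ratio=0.5):
    """Keep Gaussians whose centers most often project into salient mask regions.

    means: (N, 3) Gaussian centers
    masks: list of (H, W) binary saliency masks, one per training view
    Ks:    list of (3, 3) intrinsics; w2cs: list of (4, 4) world-to-camera matrices
    """
    N = means.shape[0]
    votes = np.zeros(N)
    homog = np.concatenate([means, np.ones((N, 1))], axis=1)  # (N, 4)
    for mask, K, w2c in zip(masks, Ks, w2cs):
        cam = (w2c @ homog.T).T[:, :3]                        # camera-space points
        in_front = cam[:, 2] > 1e-6
        uv = (K @ cam.T).T
        uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)      # perspective divide
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        H, W = mask.shape
        valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        votes[valid] += mask[v[valid], u[valid]]              # salient-pixel hits
    return np.argsort(-votes)[: int(keep_ratio * N)]          # indices of kept Gaussians
```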
We thank the authors of Perceiver IO, Michelangelo, LangSAM, Point Transformer, the DL3DV-10K dataset, and 3DGS for releasing their code.
@INPROCEEDINGS{gao2025ICCV,
author = {Quankai Gao and Iliyan Georgiev and Tuanfeng Y. Wang and Krishna Kumar Singh and Ulrich Neumann and Jae Shin Yoon},
title = {Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025}
}