OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

TL;DR: OVGGT is a training-free framework enabling streaming 3D reconstruction from arbitrarily long video with constant memory and compute — achieving \(\mathcal{O}(1)\) per-frame cost while surpassing full-cache baselines in accuracy.


Left: Quantitative comparison on 7-Scenes across 200 frames. Right: Qualitative 3D reconstructions demonstrating OVGGT's stability over long sequences (50–500 frames).

Abstract

Streaming 3D reconstruction from video demands bounded memory and compute, yet existing geometric foundation models either operate offline with quadratic cost or accumulate an ever-growing KV cache that exhausts GPU memory within hundreds of frames. We present OVGGT, a training-free framework that bounds both memory and compute to a fixed budget regardless of sequence length. Our approach combines Self-Selective Caching (FFN-residual-based KV cache compression, fully compatible with FlashAttention) with Dynamic Anchor Protection (shielding coordinate-critical tokens from eviction to suppress geometric drift). Experiments on indoor, outdoor, and ultra-long benchmarks show OVGGT processes arbitrarily long videos within a constant VRAM envelope while surpassing existing methods in accuracy.

Method Overview

Overview of OVGGT. At each time step, the input frame is encoded into tokens and processed by a spatial-temporal decoder that attends to a bounded KV cache. Self-Selective Caching scores and compresses tokens, while Dynamic Anchor Protection shields coordinate-critical tokens from eviction.

Self-Selective Caching (SSC) derives per-token importance from FFN residual magnitudes — quantities already computed during the forward pass — requiring zero extra overhead. The Activation Value Rating scores each token as \( s_i^{(l)} = \bigl\| \lambda_2^{(l)} \cdot \mathrm{FFN}\bigl(\mathrm{LN}(\mathbf{h}_i^{(l)})\bigr) \bigr\|_2 \), and the KV Cache Compression module retains the top-scoring tokens within a fixed per-layer budget.
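
The scoring and retention steps above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions: `layer_scale` stands in for \(\lambda_2^{(l)}\) and is assumed here to be a per-channel scale on the FFN branch, `ffn_out` stands in for \(\mathrm{FFN}(\mathrm{LN}(\mathbf{h}_i^{(l)}))\), and the cache stores only scores rather than real key/value tensors. The function names and the toy numbers are hypothetical.

```python
import math

def activation_value_rating(ffn_out, layer_scale):
    """L2 norm of the scaled FFN residual of one token: ||lambda_2 * FFN(LN(h))||_2."""
    return math.sqrt(sum((g * x) ** 2 for g, x in zip(layer_scale, ffn_out)))

def compress_cache(cache, budget):
    """Keep the `budget` highest-scoring entries, preserving their original order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(range(len(cache)), key=lambda i: cache[i]["score"], reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]

# Toy stream: 3 frames of 2 tokens each, with a per-layer budget of 3 entries.
layer_scale = [1.0, 0.5]  # hypothetical stand-in for lambda_2 (assumption)
frames = [
    [[3.0, 8.0], [0.1, 0.2]],
    [[1.0, 0.0], [0.0, 4.0]],
    [[6.0, 0.0], [0.3, 0.0]],
]

cache = []
for t, frame in enumerate(frames):
    for j, ffn_out in enumerate(frame):
        score = activation_value_rating(ffn_out, layer_scale)
        cache.append({"frame": t, "token": j, "score": score})
    # Compress after every frame so the cache never exceeds the budget.
    cache = compress_cache(cache, budget=3)

print([(e["frame"], e["token"]) for e in cache])
```

Because compression runs after every frame, the cache length is capped at the budget no matter how many frames are streamed, which is the property behind the constant per-frame cost.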

Dynamic Anchor Protection (DAP) explicitly shields geometrically critical tokens from eviction. The Global Initial Anchor permanently protects all first-frame tokens to maintain coordinate-system consistency, while Historical Anchors are adaptively registered as the camera traverses the scene, providing long-range geometric references.
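
One way to picture anchor protection is as an eviction step that skips protected tokens. The sketch below is an assumption-laden illustration, not the authors' implementation: `anchor_frames` is a hypothetical set of frame indices (frame 0 for the Global Initial Anchor, plus any Historical Anchors), and the rule that decides when a Historical Anchor is registered is not described in this text, so it is left as an input.

```python
def evict_with_anchors(cache, budget, anchor_frames):
    """Top-score retention in which tokens from anchor frames are never evicted."""
    protected = [i for i, e in enumerate(cache) if e["frame"] in anchor_frames]
    evictable = [i for i, e in enumerate(cache) if e["frame"] not in anchor_frames]
    # Slots left for non-anchor tokens once anchors are accounted for.
    # Assumption: if anchors alone exceed the budget, they are all still kept.
    slots = max(budget - len(protected), 0)
    evictable.sort(key=lambda i: cache[i]["score"], reverse=True)
    keep = sorted(protected + evictable[:slots])
    return [cache[i] for i in keep]

cache = [
    {"frame": 0, "score": 0.1},  # first-frame tokens score low here ...
    {"frame": 0, "score": 0.2},
    {"frame": 1, "score": 0.9},
    {"frame": 2, "score": 0.5},
    {"frame": 3, "score": 0.7},
    {"frame": 3, "score": 0.3},
]

# ... but survive eviction because frame 0 is the Global Initial Anchor.
kept = evict_with_anchors(cache, budget=4, anchor_frames={0})
print([e["frame"] for e in kept])

# Registering frame 2 as a (hypothetical) Historical Anchor protects it as well.
kept_hist = evict_with_anchors(cache, budget=4, anchor_frames={0, 2})
print([e["frame"] for e in kept_hist])
```

In the first call, score-only retention would have dropped both first-frame tokens; with protection they stay, and the remaining slots go to the highest-scoring non-anchor tokens.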

Results

Qualitative Visualizations


Qualitative comparison. Ground truth vs. StreamVGGT, Evict3R, InfiniteVGGT, and OVGGT (Ours). Our method produces more complete and geometrically accurate reconstructions, particularly on long sequences where competing methods exhibit drift or missing regions.

Efficiency Comparison

FPS and VRAM vs. sequence length. OVGGT maintains constant throughput and memory usage regardless of video length, while competing methods exhibit linear or super-linear growth.
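
A back-of-the-envelope calculation shows why an unbounded KV cache grows linearly while a budgeted one stays flat. Every number below (layer count, hidden size, tokens per frame, budget, 2-byte values) is a hypothetical placeholder chosen for illustration and is not taken from the OVGGT configuration.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_dim, bytes_per_value=2):
    """Bytes to store keys and values for `num_tokens` cached tokens in every layer."""
    return 2 * num_layers * num_tokens * hidden_dim * bytes_per_value

def full_cache_tokens(num_frames, tokens_per_frame):
    """An unbounded cache keeps every token of every frame."""
    return num_frames * tokens_per_frame

def bounded_cache_tokens(num_frames, tokens_per_frame, budget):
    """A budgeted cache stops growing once the budget is reached."""
    return min(num_frames * tokens_per_frame, budget)

# Hypothetical model and budget sizes (assumptions, for illustration only).
LAYERS, HIDDEN, TOKENS_PER_FRAME, BUDGET = 24, 1024, 1000, 50_000

for frames in (100, 500, 1000):
    full = kv_cache_bytes(full_cache_tokens(frames, TOKENS_PER_FRAME), LAYERS, HIDDEN)
    bounded = kv_cache_bytes(
        bounded_cache_tokens(frames, TOKENS_PER_FRAME, BUDGET), LAYERS, HIDDEN
    )
    print(f"{frames:>5} frames: full {full / 2**30:6.1f} GiB, bounded {bounded / 2**30:6.1f} GiB")
```

Under these assumptions the full cache scales with the number of frames, whereas the bounded cache reports the same size at every sequence length once the budget is filled.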

Comparison with Full-Cache Baseline

Accuracy, Completeness, and Chamfer Distance vs. sequence length. OVGGT's bounded cache achieves comparable or better reconstruction quality than the full-cache StreamVGGT baseline, with stable performance as sequence length grows.
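
For readers unfamiliar with these point-cloud metrics, the sketch below uses their common definitions: Accuracy as the mean distance from each predicted point to its nearest ground-truth point, Completeness as the reverse, and Chamfer Distance taken here as the average of the two (conventions vary, and some works sum them or square the distances). The exact evaluation protocol used for the results reported here, such as alignment and cropping, is not specified in this text, so treat this as an assumption-based illustration with made-up points.

```python
import math

def mean_nearest_distance(source, target):
    """Average Euclidean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

def reconstruction_metrics(pred, gt):
    """Accuracy (pred -> gt), Completeness (gt -> pred), and their average as Chamfer."""
    acc = mean_nearest_distance(pred, gt)
    comp = mean_nearest_distance(gt, pred)
    return {"acc": acc, "comp": comp, "chamfer": 0.5 * (acc + comp)}

# Toy point clouds: the prediction misses the third ground-truth point.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)]

metrics = reconstruction_metrics(pred, gt)
print(metrics)
```

The missing third point leaves Accuracy untouched but raises Completeness, which is why the two are reported separately. The brute-force nearest-neighbor search is quadratic and only suitable for toy inputs.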

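Quantitative comparison on the 7-Scenes and NRGBD datasets across different sequence lengths (Len., in frames). Evict3R^† denotes Evict3R with a dynamically calibrated pruning rate; OOM denotes out of memory. Arrows mark whether lower (↓) or higher (↑) is better.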

Method	7-Scenes							NRGBD
		Acc ↓		Comp ↓		NC ↑			Acc ↓		Comp ↓		NC ↑
	Len.	Mean	Med.	Mean	Med.	Mean	Med.	Len.	Mean	Med.	Mean	Med.	Mean	Med.
Spann3R	200	0.215	0.131	0.122	0.063	0.535	0.550	100	0.111	0.069	0.045	0.015	0.636	0.733
CUT3R		0.087	0.048	0.045	0.014	0.566	0.601		0.039	0.024	0.013	0.004	0.645	0.748
Point3R		0.041	0.019	0.023	0.006	0.579	0.622		0.046	0.028	0.016	0.004	0.662	0.775
TTT3R		0.027	0.015	0.023	0.005	0.582	0.627		0.031	0.019	0.012	0.004	0.650	0.756
StreamVGGT		0.038	0.014	0.029	0.007	0.583	0.628		0.024	0.014	0.013	0.003	0.663	0.777
Evict3R		OOM							0.025	0.015	0.013	0.003	0.664	0.781
Evict3R^†		0.037	0.013	0.027	0.007	0.584	0.631		0.031	0.020	0.013	0.003	0.665	0.791
InfiniteVGGT		0.046	0.016	0.031	0.008	0.582	0.627		0.035	0.022	0.014	0.003	0.669	0.787
Ours		0.024	0.008	0.021	0.005	0.587	0.635		0.022	0.014	0.012	0.003	0.672	0.796
Spann3R	500	0.343	0.263	0.154	0.085	0.515	0.521	300	0.346	0.221	0.175	0.099	0.558	0.586
CUT3R		0.194	0.143	0.092	0.034	0.527	0.538		0.244	0.136	0.081	0.019	0.575	0.613
Point3R		0.056	0.025	0.031	0.012	0.555	0.584		0.076	0.042	0.014	0.004	0.624	0.707
TTT3R		0.065	0.037	0.030	0.006	0.552	0.578		0.102	0.043	0.026	0.005	0.610	0.678
StreamVGGT		OOM							OOM
Evict3R		OOM							OOM
Evict3R^†		0.042	0.016	0.026	0.005	0.559	0.589		0.042	0.026	0.017	0.004	0.640	0.739
InfiniteVGGT		0.040	0.015	0.024	0.005	0.561	0.593		0.053	0.031	0.024	0.005	0.646	0.751
Ours		0.031	0.011	0.020	0.003	0.561	0.593		0.037	0.022	0.015	0.003	0.642	0.740
Spann3R	1000	0.340	0.262	0.154	0.092	0.508	0.510	500	0.516	0.342	0.225	0.130	0.552	0.578
CUT3R		0.240	0.166	0.102	0.015	0.513	0.516		0.328	0.247	0.157	0.085	0.562	0.592
Point3R		0.068	0.028	0.025	0.006	0.533	0.549		0.116	0.049	0.027	0.004	0.620	0.698
TTT3R		0.126	0.080	0.050	0.010	0.525	0.535		0.169	0.082	0.096	0.015	0.594	0.647
StreamVGGT		OOM							OOM
Evict3R		OOM							OOM
Evict3R^†		0.134	0.059	0.052	0.009	0.531	0.545		0.072	0.040	0.026	0.006	0.641	0.739
InfiniteVGGT		0.061	0.031	0.035	0.014	0.537	0.554		0.070	0.046	0.037	0.008	0.642	0.743
Ours		0.039	0.014	0.020	0.003	0.537	0.554		0.054	0.032	0.026	0.006	0.637	0.732

Quantitative comparison on full sequences of the ETH3D (outdoor) and Long3D (ultra-long) datasets. Ours²⁰⁰ and Ours⁴⁰⁰ denote our method with token budgets of 200K and 400K, respectively.

Method	ETH3D (Outdoor)						Long3D (Ultra-Long)
	Acc ↓		Comp ↓		NC ↑		Acc ↓		Comp ↓		NC ↑
	Mean	Med.	Mean	Med.	Mean	Med.	Mean	Med.	Mean	Med.	Mean	Med.
CUT3R	0.940	0.607	0.709	0.374	0.718	0.812	6.189	2.405	1.921	1.535	0.501	0.497
TTT3R	0.598	0.374	0.585	0.223	0.728	0.826	7.341	2.018	1.455	1.235	0.509	0.503
StreamVGGT	0.601	0.369	0.442	0.169	0.791	0.933	OOM
Evict3R^†	0.605	0.375	0.442	0.163	0.792	0.934	4.928	2.710	0.715	0.204	0.507	0.504
InfiniteVGGT	0.603	0.371	0.444	0.169	0.792	0.933	4.344	3.668	0.974	0.205	0.517	0.525
Ours²⁰⁰	0.628	0.396	0.380	0.121	0.790	0.934	2.453	1.794	0.390	0.060	0.507	0.509
Ours⁴⁰⁰	0.535	0.317	0.394	0.107	0.793	0.934	2.449	1.675	0.542	0.151	0.507	0.509

Video depth evaluation on Bonn and KITTI across different sequence lengths. The column headers 100, 300, and 500 are sequence lengths in frames.

Method	Bonn						KITTI
	Abs Rel ↓			δ < 1.25 ↑			Abs Rel ↓			δ < 1.25 ↑
	100	300	500	100	300	500	100	300	500	100	300	500
StreamVGGT	0.055	--	--	0.974	--	--	0.166	--	--	0.740	--	--
Evict3R^†	0.063	0.072	0.072	0.963	0.951	0.957	0.192	0.213	0.198	0.693	0.700	0.705
InfiniteVGGT	0.056	0.073	0.070	0.975	0.957	0.960	0.165	0.249	0.257	0.742	0.556	0.577
Ours	0.055	0.071	0.067	0.974	0.956	0.959	0.128	0.133	0.135	0.839	0.844	0.839
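
The two depth metrics in the table above follow their standard definitions: Abs Rel is the mean of \(|d - d^*| / d^*\) over valid pixels, and δ < 1.25 is the fraction of pixels where \(\max(d / d^*, d^* / d) < 1.25\). The sketch below computes both on a handful of made-up depth values. It assumes all depths are valid and positive, and it omits any scale alignment step, since that part of the protocol is not described in this text.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels whose depth ratio max(pred/gt, gt/pred) is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

# Made-up depths in meters for four pixels (illustrative only).
gt_depth = [1.0, 2.0, 4.0, 5.0]
pred_depth = [1.1, 2.0, 3.0, 7.0]

print(abs_rel(pred_depth, gt_depth))
print(delta_accuracy(pred_depth, gt_depth))
```

Abs Rel penalizes errors in proportion to the true depth, while the δ metric only counts whether each pixel falls inside a ratio band, so the two can move independently.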

BibTeX

@article{lu2026ovggt,
  title={OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer},
  author={Si-Yu Lu and Po-Ting Chen and Hui-Che Hsu and Sin-Ye Jhong and Wen-Huang Cheng and Yung-Yao Chen},
  journal={arXiv preprint arXiv:2603.05959},
  year={2026}
}
