PLOT: Pseudo-Labeling via Object Tracking for Monocular 3D Object Detection

Fig 1. PLOT generates accurate 3D labels directly from monocular videos without requiring auxiliary sensors or training, as illustrated in (a) qualitative results across diverse scenarios. Furthermore, (b) our object tracking and aggregation pipeline produces shape-complete pseudo-LiDARs that yield BEV maps comparable to ground truth and can identify mislabeled objects (marked with a red star).

Abstract

Monocular 3D object detection (M3OD) is crucial for scalable perception across fields like autonomous driving, robotics, and surveillance. However, progress is hindered by limited 3D annotations and the inherent ambiguity of single-image geometry. Existing methods often rely on strong geometric assumptions or carefully curated datasets, which limits their applicability to real-world scenarios. In this paper, we present PLOT (Pseudo-Labeling via Object Tracking), a framework that generates 3D annotations from monocular videos without auxiliary sensors or model retraining. PLOT tracks object and background trajectories to estimate camera motion and perform object association in pose-unknown settings. These trajectories provide point correspondences that align frame-wise pseudo-LiDARs, which are then fused via simple optimization into a unified object shape robust to occlusion and viewpoint shifts. Recognizing temporal coherence as a fundamental requirement for reliable shape fusion and video perception, we design a global object memory that preserves consistent object identities across frames. PLOT achieves robust annotation quality and strong generalization on both M3OD video benchmarks and in-the-wild videos, demonstrating its effectiveness across diverse and unconstrained domains.

Method

We propose PLOT, a framework for generating 3D annotations from monocular videos without auxiliary sensors or model retraining. As illustrated in Fig. 2, PLOT leverages off-the-shelf detectors and depth estimators to extract 2D masks and metric depth maps, which are combined via dense point tracking to form temporally grounded correspondences (Sec. 3.1). These are used to estimate relative poses through point-based registration (Sec. 3.2), enabling both shape fusion and motion analysis. To maintain identity consistency and recover missed instances, we introduce a global object memory (GOM) that refines labels over time (Sec. 3.3). Finally, object pseudo-LiDARs are constructed by aggregating registered points across frames and projecting them back to each time frame for consistent attribute estimation (Sec. 3.4).
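As a rough illustration of two steps named above, the sketch below lifts a masked metric depth map into a frame-wise pseudo-LiDAR and then estimates a relative pose from tracked point correspondences. It is a minimal sketch under stated assumptions, not the PLOT implementation: the pinhole intrinsics matrix `K`, the function names `backproject` and `kabsch`, and the use of the closed-form Kabsch solution for the registration step are all choices made here for illustration.

```python
import numpy as np


def backproject(depth, mask, K):
    """Lift masked pixels of a metric depth map to camera-frame 3D points, shape (N, 3).

    Assumes a pinhole camera with intrinsics K (hypothetical input; the text does
    not say how intrinsics are obtained).
    """
    v, u = np.nonzero(mask)              # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]                      # metric depth at those pixels
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) with dst approximately src @ R.T + t.

    src and dst are (N, 3) arrays of corresponding points, e.g. the same tracked
    points back-projected in two different frames.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

In this sketch, correspondences on background points would give the camera motion, while correspondences on a single object would give that object's relative pose between frames. Real depth and tracking outputs are noisy, so a robust wrapper (for example, RANSAC around `kabsch`) would likely be needed in practice.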

Fig 2. Overall architecture of PLOT. Given monocular videos, we extract 2D detections and depth, and track points to obtain temporally grounded correspondences. These are used to estimate relative poses and camera motions across frames, enabling shape fusion and orientation estimation. A global object association module refines trajectories and recovers missing instances. Finally, completed pseudo-LiDARs are reprojected to each frame for consistent 3D attribute annotation.
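The final stage in the caption, fusing registered points and reprojecting the completed shape to each frame, can be sketched as follows. This is a simplified illustration under assumptions rather than the method itself: fusion here is plain concatenation in a reference frame (the abstract describes fusion via a simple optimization, which is not reproduced), per-frame poses `(R, t)` are assumed to map frame coordinates into the reference frame, and `axis_aligned_box` is a hypothetical stand-in for attribute estimation.

```python
import numpy as np


def fuse_to_reference(frames, poses):
    """Map each frame's object points into the reference frame and concatenate them.

    frames: list of (N_i, 3) arrays; poses: list of (R_i, t_i) that map frame-i
    coordinates into the reference frame (assumed convention).
    """
    return np.concatenate(
        [pts @ R.T + t for pts, (R, t) in zip(frames, poses)], axis=0
    )


def reproject(fused, R, t):
    """Bring fused reference-frame points back into the frame whose pose is (R, t).

    Applies the inverse rigid transform: p_frame = R.T @ (p_ref - t).
    """
    return (fused - t) @ R


def axis_aligned_box(points):
    """Center and size of the axis-aligned extent of a point set.

    A crude placeholder for 3D attribute estimation; orientation is ignored.
    """
    lo, hi = points.min(axis=0), points.max(axis=0)
    return (lo + hi) / 2.0, hi - lo
```

Because every frame receives the same fused shape, a box fitted to the reprojected points has the same extent in each frame up to the rotation, which is one way to read "consistent 3D attribute annotation".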

Benchmarks

We evaluate PLOT on standard monocular 3D object detection (M3OD) video benchmarks (KITTI, KITTI-360, and Waymo) and compare it against recent open-set detectors, weakly supervised methods, and pseudo-labeling methods that rely on driving-specific priors or pose estimates. For pseudo-labeling methods, including ours, results are obtained by training MonoDETR on the generated labels with the same default configuration across all methods to ensure fairness; additional details are provided in the supplementary material. As most M3OD methods focus on the 'Car' and 'Pedestrian' classes, our main paper reports quantitative results primarily for these categories.

Quantitative Results

Qualitative Results on Waymo Sequences

Comparison with Ground-Truth

*Green: ground truth (GT); red: pseudo-labels

In-the-Wild

While our labeler is designed for open-world settings, existing benchmarks with video inputs are constrained to driving scenes, limiting the scope of evaluation. Thus, we provide qualitative comparisons on in-the-wild videos (MOSE, MOT16, PEXELS) to assess the generalization of PLOT beyond vehicle-centric environments.

In-the-Wild Generalization

Spatial-Temporal Consistency

Conclusion

In this paper, we introduced PLOT, a framework for generating reliable 3D annotations from monocular videos through tracking-driven object association and label refinement. Beyond monocular 3D detection, this video-based formulation holds potential for broader 3D tasks that suffer from missing camera information or incomplete shapes, such as CAD model retrieval and object-level reconstruction. We hope that this paradigm provides a scalable alternative to data-intensive monocular 3D detection pipelines and opens new directions for video-based 3D understanding without explicit supervision.

BibTeX

@article{lee2025plot,
  title={PLOT: Pseudo-Labeling via Object Tracking for Monocular 3D Object Detection},
  author={Lee, Seokyeong and Aund, Sithu and Choi, Junyong and Kim, Seungryong and Kim, Ig-Jae and Cho, Junghyun},
  journal={arXiv preprint arXiv:2507.02393},
  year={2025}
}