The video-plus-depth format is widely used to represent 3D scenes, mainly because it is compatible with conventional image formats. In practice, however, inconsistency in the depth maps can lead to unsatisfactory view synthesis results. In this paper, we propose a new structure-from-motion (SfM) technique, called locally temporal bundle adjustment (LTBA), to handle dynamic scenes as well as static cameras, both of which violate the assumptions of conventional structure from motion. By temporally integrating the camera information, depth maps, and video, we develop a geometric quadrilateral filter that reduces noise in the depth maps and enhances their spatio-temporal consistency. Experiments on real video-plus-depth sequences show that the proposed algorithm improves the quality of dynamic depth maps.