Comparing LSTM and Transformer for Video Depth Estimation

Rozhin Fani, Berke Gur

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Accurate depth estimation from monocular video is critical for robotics applications such as simultaneous localization and mapping (SLAM) and navigation. Monocular depth estimation from video can be improved by incorporating temporal information across frames. The recently introduced sequence modeling techniques of recurrent long short-term memory (LSTM) networks and Transformer architectures provide two potential approaches for aggregating temporal cues. This work presents a comparative study of LSTM and Transformer modules for video depth prediction. The proposed depth pipeline extracts optical flow features between frames and passes them to either an LSTM or a Transformer encoder before decoding them into a depth map prediction. Compared to the LSTM, the Transformer's ability to capture long-range dependencies allows it to propagate information more effectively across long sequences. It is shown that the Transformer outperforms the LSTM models by a factor of five to six in depth map estimation based on standard metrics. This analysis provides insights into the advantages of the Transformer over recurrent LSTM models for aggregating temporal signals in depth estimation and similar sequence prediction tasks. The Transformer's ability to aggregate motion across sequences holds promise for more robust spatial perception.
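The contrast between the two temporal aggregators named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' model: the sequence length, feature sizes, random weights, and the function names `lstm_aggregate` and `self_attention_aggregate` are all assumptions made for illustration, and the attention function shows only the single-head self-attention core of a Transformer encoder (no positional encoding, feed-forward layers, or normalization). Random vectors stand in for per-frame optical flow features.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_aggregate(frames, W, U, b):
    """Run one LSTM layer over per-frame features and return every hidden state.

    frames: (T, D) array, one feature vector per frame (stand-in for flow features).
    W: (4H, D), U: (4H, H), b: (4H,) stacked gate parameters in the order i, f, g, o.
    """
    T = frames.shape[0]
    H = U.shape[1]
    h = np.zeros(H)  # hidden state
    c = np.zeros(H)  # cell state
    outputs = np.zeros((T, H))
    for t in range(T):
        # Information from frame 0 reaches frame t only through t recurrent updates.
        z = W @ frames[t] + U @ h + b
        i = sigmoid(z[0:H])            # input gate
        f = sigmoid(z[H:2 * H])        # forget gate
        g = np.tanh(z[2 * H:3 * H])    # candidate cell update
        o = sigmoid(z[3 * H:4 * H])    # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs[t] = h
    return outputs


def self_attention_aggregate(frames, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the frame sequence."""
    Q = frames @ Wq
    K = frames @ Wk
    V = frames @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # (T, T): every frame scores every frame
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights


# Hypothetical sizes: 8 frames, 16-dim input features, 12-dim aggregated features.
rng = np.random.default_rng(0)
T, D, H = 8, 16, 12
frames = rng.normal(size=(T, D))

lstm_out = lstm_aggregate(
    frames,
    0.1 * rng.normal(size=(4 * H, D)),
    0.1 * rng.normal(size=(4 * H, H)),
    np.zeros(4 * H),
)
attn_out, attn_weights = self_attention_aggregate(
    frames,
    0.1 * rng.normal(size=(D, H)),
    0.1 * rng.normal(size=(D, H)),
    0.1 * rng.normal(size=(D, H)),
)
```

In this sketch both aggregators return one feature vector per frame, which a decoder could then map to a depth prediction. The structural difference is that the LSTM carries earlier frames forward through a chain of state updates, while each row of `attn_weights` connects a frame directly to every other frame in the sequence.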

Original language: English
Title of host publication: 7th EAI International Conference on Robotic Sensor Networks - EAI ROSENET 2023
Editors: Ömer Melih Gül, Paolo Fiorini, Seifedine Nimer Kadry
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 89-99
Number of pages: 11
ISBN (Print): 9783031644948
DOIs
Publication status: Published - 2024
Event: 7th EAI International Conference on Robotics and Networks, ROSENET 2023 - Istanbul, Turkey
Duration: 15 Dec 2023 → 16 Dec 2023

Publication series

Name: EAI/Springer Innovations in Communication and Computing
ISSN (Print): 2522-8595
ISSN (Electronic): 2522-8609

Conference

Conference: 7th EAI International Conference on Robotics and Networks, ROSENET 2023
Country/Territory: Turkey
City: Istanbul
Period: 15/12/23 → 16/12/23

Keywords

  • Computer vision
  • Depth estimation
  • LSTM
  • Machine learning
  • Optical flow
  • Sequence modeling
  • Transformer
