TY - GEN
T1 - Spatio-Temporal Consistent Non-homogeneous Extreme Video Retargeting
AU - Imani, Hassan
AU - Islam, Md Baharul
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Due to the availability of heterogeneous display devices with varying aspect ratios, video retargeting has received considerable research attention. Inconsistent video retargeting can significantly degrade a video's spatial and temporal quality, particularly in extreme retargeting cases. Since no perfectly annotated datasets exist for video retargeting, deep learning-based techniques are rarely utilized. This paper proposes a method that learns to retarget videos by detecting salient areas and shifting them to the appropriate location. First, we segment the salient objects using a unified Transformer model. Using convolutional layers and a shifting strategy, we shift and warp objects to a suitable size and location in the frame, employing 1D convolution to shift the salient objects. We also use a frame interpolation technique to preserve temporal information. To train the network, we feed the retargeted frames to a variational auto-encoder that maps them back to the input frames. In addition, we design perceptual and wavelet-based loss functions to train our model, so the network is trained in an unsupervised manner. Extensive qualitative and quantitative experiments and ablation studies on the DAVIS dataset show the superiority of the proposed method over existing state-of-the-art methods.
AB - Due to the availability of heterogeneous display devices with varying aspect ratios, video retargeting has received considerable research attention. Inconsistent video retargeting can significantly degrade a video's spatial and temporal quality, particularly in extreme retargeting cases. Since no perfectly annotated datasets exist for video retargeting, deep learning-based techniques are rarely utilized. This paper proposes a method that learns to retarget videos by detecting salient areas and shifting them to the appropriate location. First, we segment the salient objects using a unified Transformer model. Using convolutional layers and a shifting strategy, we shift and warp objects to a suitable size and location in the frame, employing 1D convolution to shift the salient objects. We also use a frame interpolation technique to preserve temporal information. To train the network, we feed the retargeted frames to a variational auto-encoder that maps them back to the input frames. In addition, we design perceptual and wavelet-based loss functions to train our model, so the network is trained in an unsupervised manner. Extensive qualitative and quantitative experiments and ablation studies on the DAVIS dataset show the superiority of the proposed method over existing state-of-the-art methods.
KW - CNNs
KW - Salient objects
KW - Segmentation
KW - Spatial and temporal coherence
KW - Video retargeting
UR - http://www.scopus.com/inward/record.url?scp=85186968395&partnerID=8YFLogxK
U2 - 10.1109/ICCE59016.2024.10444165
DO - 10.1109/ICCE59016.2024.10444165
M3 - Conference contribution
AN - SCOPUS:85186968395
T3 - Digest of Technical Papers - IEEE International Conference on Consumer Electronics
BT - 2024 IEEE International Conference on Consumer Electronics, ICCE 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Conference on Consumer Electronics, ICCE 2024
Y2 - 6 January 2024 through 8 January 2024
ER -