Abstract
Multi-human parsing has received considerable research attention in recent years, and deep learning-based methods have demonstrated promising results. In practice, however, most methods struggle on edge devices due to their large network architectures and low inference speed. Moreover, inadequate modeling of long-range feature dependencies leads to suboptimal representations of discriminative features across semantic classes. To address these challenges and enable real-time deployment on edge devices, we design a deep yet lightweight Encoder and a Multi-Scale Self-Attention based Decoder that capture long-range dependencies and spatial relationships. Furthermore, we optimize our model through half-precision quantization, improving efficiency on edge devices. Experiments on the publicly available Crowd Instance-level Human Parsing (CIHP) and Look into Person (LIP) datasets show the efficacy of our framework in parsing multiple humans at a high inference speed of 55.6 FPS. Additionally, real-world testing on Jetson Nano edge devices demonstrates competitive performance. An extensive ablation study on the different modules validates our network design.
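The abstract mentions half-precision quantization as the optimization step for edge deployment. As a minimal illustrative sketch (not the authors' implementation), the core idea can be shown with NumPy: casting FP32 weights to FP16 halves the memory footprint at the cost of a small rounding error. The weight tensor here is a hypothetical stand-in for a network layer's parameters.

```python
import numpy as np

# Hypothetical weight tensor standing in for one layer of the parsing
# network; the actual model weights are not available here.
weights_fp32 = np.random.randn(64, 128).astype(np.float32)

# Half-precision quantization in its simplest form: cast FP32 -> FP16.
weights_fp16 = weights_fp32.astype(np.float16)

# FP16 storage is exactly half the size of FP32 storage.
print(weights_fp32.nbytes, weights_fp16.nbytes)  # 32768 16384

# The rounding error introduced by the cast stays small for
# typically-scaled weights (FP16 keeps ~11 bits of precision).
max_err = np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max()
print(max_err)
```

On hardware with native FP16 support (such as the Jetson Nano's GPU mentioned in the abstract), this reduced precision also speeds up inference, which is the motivation for applying it before edge deployment.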
Original language | English |
---|---|
Journal | Multimedia Tools and Applications |
DOIs | |
Publication status | Accepted/In press - 2024 |
Published externally | Yes |