In this paper, we propose a novel feature fusion framework of dual cross-attention transformers to model global feature interaction and capture complementary information across modalities ...