Abstract: Multimodal Large Language Models (MLLM), which integrate large language models (LLMs) with vision models, aim to overcome the text-centric limitations of traditional LLMs. While models like ...