[1]MENG Xiang,WANG Boyue,GAO Yihan,et al.Visual-language key clue discovery-based multimodal fake news detection model[J].CAAI Transactions on Intelligent Systems,2026,21(1):109-119.[doi:10.11992/tis.202505007]
CAAI Transactions on Intelligent Systems [ISSN 1673-4785/CN 23-1538/TP]
- Volume: 21
- Issue: 2026(1)
- Pages: 109-119
- Column: Academic Papers: Machine Perception and Pattern Recognition
- Publication date: 2026-03-05
- Title: Visual-language key clue discovery-based multimodal fake news detection model
- Author(s): MENG Xiang; WANG Boyue; GAO Yihan; WU Guangchao; LIU Yikun; LYU Songcheng; YIN Baocai
- Affiliation: School of Information Science and Technology, Beijing University of Technology, Beijing 100124, China
- Keywords: multimodal fake news detection; multi-scale feature interaction; key clue discovery; fine-grained representation; cross-modal attention; global feature alignment; memory-enhanced mechanism; semantic inconsistency detection
- CLC: TP391.1
- DOI: 10.11992/tis.202505007
- Abstract: Multimodal fake news detection aims to enhance the reliability of authenticity assessment by integrating diverse modalities such as text, images, videos, and audio. However, existing models often overlook discriminative local details and struggle to capture critical inconsistencies between textual and visual content. To address these challenges, this study proposes the visual-language key clue discovery-based multimodal fake news detection model (VKC-MFND). The model comprises three main components: a multi-scale feature extraction module, a key feature information extraction module, and a multi-scale feature alignment module. Specifically, the multi-scale feature extraction module captures both global features (sentence-level or description-level) and local features (word-level or object box-level) from text and images, thereby enriching the diversity of information representation. The key feature information extraction module uses attention-based interactions among fine-grained features to uncover discriminative clues and aligns them with global semantic representations, facilitating the fusion of critical cross-modal information. Meanwhile, the multi-scale feature alignment module optimizes the model with both classification and alignment losses, enhancing semantic consistency in the shared feature space. Extensive experiments on three benchmark multimodal fake news datasets—Weibo, Weibo-19, and Pheme—demonstrate that the proposed model significantly outperforms state-of-the-art approaches, and ablation studies confirm the effectiveness and necessity of each component.
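The two mechanisms the abstract names, cross-modal attention over fine-grained features and a global alignment loss, can be illustrated with a minimal pure-Python sketch. This is not the authors' VKC-MFND implementation; all function names and toy dimensions are hypothetical, and the sketch stands in only for the general idea of word-level features attending over object-box-level visual features to surface "key clues".

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_modal_attention(text_feats, image_feats):
    """Each word-level text vector (query) attends over all object-level
    image vectors (keys/values). The attended vector per word is the
    visual evidence most relevant to that word, i.e. a candidate "key clue"."""
    d = len(image_feats[0])
    attended = []
    for q in text_feats:
        scores = [dot(q, k) / math.sqrt(d) for k in image_feats]
        weights = softmax(scores)
        ctx = [sum(w * v[i] for w, v in zip(weights, image_feats))
               for i in range(d)]
        attended.append(ctx)
    return attended

def alignment_loss(text_global, image_global):
    """Squared-distance alignment between global text and image vectors,
    standing in for the alignment term that is combined with the
    classification loss in the overall objective."""
    return sum((t - v) ** 2 for t, v in zip(text_global, image_global))
```

For example, with text queries `[[1.0, 0.0], [0.0, 1.0]]` and image features `[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]`, the first word attends most strongly to the first (most similar) image region, and `alignment_loss` is zero only when the two global vectors coincide.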