
DGM4: Detecting and Grounding Multi-Modal Media Manipulation

  • Rui Shao1,2
  • Tianxing Wu2
  • Ziwei Liu2
  • 1School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)

  • 2S-Lab, Nanyang Technological University

TL;DR: Unlike existing single-modal forgery detection tasks, DGM4 performs real/fake classification on image-text pairs, and further aims to detect fine-grained manipulation types and to ground the manipulated image bounding boxes and text tokens.


[Teaser figure: overview of the DGM4 task.]

Abstract

Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are designed only for single-modality forgery based on binary classification, and cannot analyze and reason about subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM4). DGM4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content (i.e., image bounding boxes and text tokens), which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by a multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of our model; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation.

Links

Video


DGM4 Dataset

[Figure: DGM4 dataset overview.]


We present DGM4, a large-scale dataset for studying machine-generated multi-modal media manipulation. The dataset focuses specifically on human-centric news, in consideration of its broad public influence. It consists of 230k news samples in total, including 77,426 pristine image-text pairs and 152,574 manipulated pairs. The manipulated pairs contain:

  • 66,722 Face Swap Manipulations (FS)
  • 56,411 Face Attribute Manipulations (FA)
  • 43,546 Text Swap Manipulations (TS)
  • 18,588 Text Attribute Manipulations (TA)

1/3 of the manipulated images and 1/2 of the manipulated text are combined to form 32,693 mixed-manipulation pairs.

Some sample images and their annotations are shown below. For more information about the data structure, annotation details, and other properties of the dataset, please refer to our GitHub page.
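As a rough illustration, a single manipulated sample might carry an annotation along the following lines. The field names below are assumptions made only for exposition; the exact schema is documented in the GitHub repository.

    # Hypothetical annotation for one manipulated image-text pair (field names are
    # illustrative assumptions; consult the GitHub repo for the real schema).
    sample = {
        "image": "manipulation/face_swap/000123.jpg",   # path to the (possibly manipulated) image
        "text": "The prime minister praised the new policy at the summit.",
        "fake_cls": "face_swap",                         # fine-grained manipulation type(s)
        "fake_image_box": [112, 45, 208, 160],           # manipulated face bbox: [x1, y1, x2, y2]
        "fake_text_pos": [],                             # indices of manipulated text tokens (none here)
    }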


[Figures: sample image-text pairs with manipulation annotations.]

Method

Proposed HAMMER

The figure below shows the architecture of the proposed HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER). It 1) aligns image and text embeddings through manipulation-aware contrastive learning between the Image Encoder Ev and the Text Encoder Et as shallow manipulation reasoning, and 2) further aggregates multi-modal embeddings via the modality-aware cross-attention of the Multi-Modal Aggregator F as deep manipulation reasoning. Based on the interacted multi-modal embeddings at different levels, dedicated manipulation detection and grounding heads (Multi-Label Classifier Cm, Binary Classifier Cb, BBox Detector Dv, and Token Detector Dt) are integrated to perform their tasks hierarchically.
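A highly simplified PyTorch-style sketch of this hierarchy is given below. It is not the released implementation: the encoders, dimensions, and heads are placeholder modules, chosen only to illustrate how shallow contrastive alignment and deep cross-attention aggregation feed the four detection and grounding heads.

    import torch
    import torch.nn as nn

    class HammerSketch(nn.Module):
        """Minimal sketch of the HAMMER hierarchy (not the official implementation)."""

        def __init__(self, dim=256, num_manip_types=4):
            super().__init__()
            # Uni-modal encoders (placeholders for the real image/text backbones).
            self.image_proj = nn.Linear(768, dim)   # E_v: patch features -> shared space
            self.text_proj = nn.Linear(768, dim)    # E_t: token features -> shared space
            # Multi-Modal Aggregator F: modality-aware cross-attention.
            self.aggregator = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
            # Detection and grounding heads.
            self.multi_label_head = nn.Linear(dim, num_manip_types)  # C_m: manipulation types
            self.binary_head = nn.Linear(dim, 2)                     # C_b: real vs. fake
            self.bbox_head = nn.Linear(dim, 4)                       # D_v: manipulated bbox
            self.token_head = nn.Linear(dim, 2)                      # D_t: per-token real/fake

        def forward(self, patch_feats, token_feats):
            # Shallow reasoning: project both modalities into a shared space, where
            # manipulation-aware contrastive learning would align them during training.
            v = self.image_proj(patch_feats)      # (B, num_patches, dim)
            t = self.text_proj(token_feats)       # (B, num_tokens, dim)
            # Deep reasoning: text tokens attend to image patches via cross-attention.
            m = self.aggregator(tgt=t, memory=v)  # (B, num_tokens, dim)
            cls = m[:, 0]                         # [CLS]-like summary embedding
            return {
                "manip_types": self.multi_label_head(cls),   # multi-label logits
                "real_fake": self.binary_head(cls),          # binary logits
                "bbox": self.bbox_head(cls).sigmoid(),       # normalized box coordinates
                "token_labels": self.token_head(m),          # per-token grounding logits
            }

In HAMMER itself, the contrastive alignment is a training objective between the two uni-modal embedding spaces, and the detection and grounding heads are attached at their corresponding shallow or deep levels rather than all reading from a single summary vector.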


[Figure: HAMMER architecture.]

Benchmark results

Based on the DGM4 dataset, we provide the first benchmark for evaluating model performance on the proposed task. To validate the effectiveness of our HAMMER model, we adapt SOTA multi-modal learning methods to the DGM4 setting for full-modal comparison, and further adapt deepfake detection and sequence tagging methods for single-modal comparison. As shown in the tables, HAMMER outperforms all multi-modal and single-modal methods on all sub-tasks, under all evaluation metrics.
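For a concrete sense of how grounding performance can be scored, the sketch below implements two generic measures of the kind used in such a benchmark: box IoU for manipulated-bbox grounding and precision/recall/F1 over manipulated-token indices for text grounding. The exact metric set and evaluation protocol are defined in the paper and repository.

    def box_iou(pred, gt):
        """IoU between two [x1, y1, x2, y2] boxes; used to score manipulated-bbox grounding."""
        x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
        x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
        area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
        union = area_p + area_g - inter
        return inter / union if union > 0 else 0.0

    def token_prf(pred_idx, gt_idx):
        """Precision / recall / F1 over manipulated-token indices (text grounding)."""
        pred_set, gt_set = set(pred_idx), set(gt_idx)
        tp = len(pred_set & gt_set)
        precision = tp / len(pred_set) if pred_set else 0.0
        recall = tp / len(gt_set) if gt_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: a predicted box that largely overlaps the ground truth, and two of three tokens found.
    print(box_iou([110, 40, 205, 158], [112, 45, 208, 160]))   # ~0.9
    print(token_prf([3, 7], [3, 7, 9]))                        # (1.0, 0.67, 0.8)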


[Tables: benchmark results for full-modal and single-modal comparisons.]

Visualization results

Visualization of detection and grounding results. Ground-truth annotations are in red, and prediction results are in blue. The visualizations show that our method can accurately ground the manipulated bounding boxes and text tokens, and successfully detect the manipulation types.


[Figure: visualization of detection and grounding results.]


Visualization of attention maps. We plot Grad-CAM visualizations with respect to specific text tokens (in green) for all four manipulation types. For FS and FA, we visualize the attention map for key words related to the image manipulation. For TS and TA, we visualize the attention map for the manipulated text tokens. The attention maps show that our model can use text to help locate manipulated image regions and can capture subtle semantic inconsistencies between the two modalities to tackle DGM4.
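A minimal sketch of the Grad-CAM computation behind such maps is shown below, assuming access to a spatial feature map from the cross-modal interaction and a scalar score tied to the text token of interest; the actual visualization code in the repository may differ.

    import torch

    def grad_cam(feature_map, score):
        """
        Generic Grad-CAM: weight a spatial feature map by the gradient of a
        token-specific score and keep the positive evidence.
        `feature_map`: (C, H, W) tensor that participates in computing `score`
        (with gradients enabled); `score`: scalar tied to the text token of
        interest, e.g. its cross-attention or classification logit.
        """
        grads = torch.autograd.grad(score, feature_map, retain_graph=True)[0]   # (C, H, W)
        weights = grads.mean(dim=(1, 2))                                        # channel importance
        cam = torch.relu((weights[:, None, None] * feature_map).sum(dim=0))     # (H, W)
        return cam / (cam.max() + 1e-8)                                         # normalize to [0, 1]

    # Usage sketch: upsample `cam` to the image resolution and overlay it as a
    # heatmap to obtain maps like the ones shown below.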


[Figure: Grad-CAM attention maps for the four manipulation types.]

Bibtex

@inproceedings{shao2023dgm4,
  title     = {Detecting and Grounding Multi-Modal Media Manipulation},
  author    = {Shao, Rui and Wu, Tianxing and Liu, Ziwei},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}

Acknowledgement

We referred to the project page of AvatarCLIP when creating this project page.
