Multimodal Hate Speech Detection from Videos and Texts

EasyChair Preprint 10743
12 pages • Date: August 19, 2023

Abstract

Since social media posts often include videos with associated comments, and many of these videos or comments convey hate speech, detecting it in this multimodal setting is crucial. We focus on the early detection of hate speech in videos by exploiting features from an initial set of comments. We devise the Text Video Classifier (TVC), a multimodal hate classifier based on four feature modalities: character, word, sentence, and video frame. We develop a Cross Attention Fusion Mechanism (CA-FM) to learn global feature embeddings from the inter-modal features. We report the architectural details and the experiments performed. Using several sampling techniques, we train this architecture on a Vine dataset of videos and their comments. At an output probability threshold of 0.5, our proposed design improves on models previously constructed for this dataset, demonstrating the positive effect of the CA-FM and TVC.

Keyphrases: Cross Attention Fusion Mechanism, TVC, multimodal
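The abstract does not give implementation details of the CA-FM, but the general idea of cross-attention fusion between two modalities can be sketched as follows. This is a minimal single-head NumPy illustration in which the dimensions, pooling strategy, and function names are our own illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: queries from one modality
    # attend over keys/values from another modality.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # (Lq, Lk)
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ values                 # (Lq, d)

def fuse(text_feats, video_feats):
    # Hypothetical fusion step: cross-attend in both directions,
    # then mean-pool and concatenate into one global embedding.
    t2v = cross_attention(text_feats, video_feats, video_feats)
    v2t = cross_attention(video_feats, text_feats, text_feats)
    return np.concatenate([t2v.mean(axis=0), v2t.mean(axis=0)])

rng = np.random.default_rng(0)
text = rng.standard_normal((5, 16))   # 5 comment-token embeddings, dim 16
video = rng.standard_normal((8, 16))  # 8 video-frame embeddings, dim 16
global_emb = fuse(text, video)
print(global_emb.shape)  # (32,)
```

In a real multimodal classifier, a fused embedding like `global_emb` would feed a final classification head producing the hate/non-hate probability that is thresholded at 0.5.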