Title Redacted due to Anonymity Period

Published in , 2024

Online abusive content detection, particularly in low-resource settings and in the audio modality, remains underexplored. We investigate the potential of pre-trained audio representations for detecting abusive language in low-resource languages, specifically Indian languages. Leveraging representations from models such as Wav2Vec and Whisper, we explore few-shot cross-lingual abuse detection on the ADIMA dataset (Gupta et al., 2022). Our approach integrates these representations within the Model-Agnostic Meta-Learning (MAML) framework to classify abusive language in 10 languages. We experiment with shot sizes ranging from 50 to 200 to evaluate the impact of limited data on performance. We also conduct a feature visualization study to better understand model behaviour. The study highlights the generalization ability of pre-trained models in low-resource scenarios and offers insights into detecting abusive language in multilingual contexts.
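
As a rough illustration of the meta-learning loop the abstract refers to, the sketch below runs first-order MAML (a common simplification that drops the second-order terms of full MAML) with a linear classifier over fixed feature vectors. It is not the paper's implementation: random vectors stand in for Wav2Vec or Whisper embeddings, each synthetic "task" stands in for one language, and every function name and hyperparameter here is a hypothetical choice made for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def grad(w, batch):
    """Mean binary cross-entropy gradient for a linear classifier.

    The last entry of w is the bias; batch is a list of (features, label).
    """
    g = [0.0] * len(w)
    for x, y in batch:
        xb = x + [1.0]  # append constant input for the bias term
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, xb))) - y
        for i, xi in enumerate(xb):
            g[i] += err * xi / len(batch)
    return g


def adapt(w, support, inner_lr=0.5, steps=1):
    """Inner loop: a few gradient steps on one task's support set."""
    for _ in range(steps):
        g = grad(w, support)
        w = [wi - inner_lr * gi for wi, gi in zip(w, g)]
    return w


def fomaml_step(w, tasks, inner_lr=0.5, outer_lr=0.1):
    """Outer loop: update the shared initialization w.

    First-order approximation: the query-set gradient at the adapted
    weights is applied directly to w, ignoring second-order terms.
    """
    meta_g = [0.0] * len(w)
    for support, query in tasks:
        w_task = adapt(w, support, inner_lr)
        g = grad(w_task, query)
        meta_g = [m + gi / len(tasks) for m, gi in zip(meta_g, g)]
    return [wi - outer_lr * mi for wi, mi in zip(w, meta_g)]


def make_task(rng, dim=4, shots=50):
    """Hypothetical stand-in for one language.

    Random vectors replace real audio embeddings; a per-task shift
    mimics differences between languages.
    """
    shift = [rng.gauss(0.0, 0.3) for _ in range(dim)]

    def sample(k):
        data = []
        for _ in range(k):
            y = rng.randint(0, 1)
            centre = 1.0 if y else -1.0
            data.append(([rng.gauss(centre + s, 1.0) for s in shift], y))
        return data

    return sample(shots), sample(shots)  # (support set, query set)


def accuracy(w, batch):
    correct = 0
    for x, y in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x + [1.0])))
        correct += int((p >= 0.5) == bool(y))
    return correct / len(batch)
```

A typical use would be to start from `w = [0.0] * (dim + 1)`, call `fomaml_step` repeatedly on batches of tasks built with `make_task(random.Random(0))`, then call `adapt` on the support set of a held-out task and measure `accuracy` on its query set. The `shots` argument plays the role of the shot size varied in the study.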

Recommended citation: