TY - GEN
T1 - Overview of the HASOC subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages
AU - Jaiswal, Amit Kumar
AU - Mandl, Thomas
AU - Modha, Sandip
AU - Shahi, Gautam Kishore
AU - Madhu, Hiren
AU - Satapara, Shrey
AU - Majumder, Prasenjit
AU - Schäfer, Johannes
AU - Ranasinghe, Tharindu
AU - Zampieri, Marcos
AU - Nandini, Durgesh
N1 - Publisher Copyright: © 2021 Owner/Author.
PY - 2022/1/26
Y1 - 2022/1/26
N2 - The spread of offensive content online, such as hate speech, poses a growing societal problem. AI tools are necessary for supporting the moderation process on online platforms. To evaluate these identification tools, continuous experimentation with data sets in different languages is necessary. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents the HASOC subtrack for English, Hindi, and Marathi. The data set was assembled from Twitter. This subtrack has two sub-tasks. Task A is a binary classification problem (Hate and Not Offensive) offered for all three languages. Task B is a fine-grained classification problem with three classes, hate speech (HATE), OFFENSIVE, and PROFANITY, offered for English and Hindi. Overall, 652 runs were submitted by 65 teams. The best classification algorithms for Task A achieved F1 measures of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively. This overview presents the tasks and the data development as well as the detailed results. The systems submitted to the competition applied a variety of technologies. The best-performing algorithms were mainly variants of transformer architectures.
KW - Deep learning
KW - Hate Speech
KW - Multilingual Text Classification
KW - Offensive Language
KW - Social Media
KW - machine learning
KW - Multilingual Datasets
KW - Under-resourced language
UR - https://arxiv.org/ftp/arxiv/papers/2112/2112.09301.pdf
UR - https://www.scopus.com/pages/publications/85124344402
U2 - 10.1145/3503162.3503176
DO - 10.1145/3503162.3503176
M3 - Conference contribution
SN - 9781450395960
VL - 3159
T3 - ACM International Conference Proceeding Series
SP - 1
EP - 3
BT - FIRE 2021 - Proceedings of the 13th Annual Meeting of the Forum for Information Retrieval Evaluation
A2 - Ganguly, Debasis
A2 - Gangopadhyay, Surupendu
A2 - Mitra, Mandar
A2 - Majumder, Prasenjit
PB - Association for Computing Machinery
T2 - FIRE '21 : 13th Annual Meeting of the Forum for Information Retrieval Evaluation
Y2 - 13 December 2021 through 17 December 2021
ER -