BAAD: A Multipurpose Dataset for Automatic Bangla Offensive Speech Recognition

A comprehensive speech dataset of Bengali Abusive Words designed to advance automatic slang speech recognition for the Bangla language.

  • Type: Research Paper & Dataset
  • Institution: DIU NLP LAB
  • Timeline: June 2021 - October 2022
  • Published In: Data in Brief (Elsevier)
  • Paper Link: ScienceDirect
  • GitHub: AUDIO-Split-Code

Dataset Statistics

114 Slang Words
43 Non-Slang Words
6,100 Audio Clips
60+ Native Speakers
20+ Districts Covered
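
As a quick sanity check on these figures, the short sketch below derives the total vocabulary size and the mean number of clips per word. The variable names are illustrative, and the mean is only an average: the actual number of clips recorded for each word is not stated here and may vary.

```python
# Figures taken from the statistics listed above.
slang_words = 114
non_slang_words = 43
audio_clips = 6100

# Total vocabulary covered by the dataset.
total_words = slang_words + non_slang_words

# Mean clips per word (an average only; the per-word distribution is not given).
clips_per_word = audio_clips / total_words

# Share of the vocabulary that is slang.
slang_share = slang_words / total_words

print(f"Total words: {total_words}")
print(f"Mean clips per word: {clips_per_word:.1f}")
print(f"Slang share of vocabulary: {slang_share:.1%}")
```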

Abstract

Despite being the fifth most spoken native language in the world, Bangla has received very little attention in audio and speech recognition. This article presents a speech dataset of Bengali abusive words, together with some non-abusive words that closely resemble the abusive ones.

This work presents a multipurpose dataset for automatic recognition of slang speech in the Bangla language, prepared through the collection, annotation, and refinement of data. It consists of 6,100 audio clips covering 114 slang words and 43 non-slang words.

Data Collection Process

  • 60 native speakers took part in slang word collection
  • 23 native speakers took part in non-abusive word collection
  • Speakers represent various dialects from 20+ districts of Bangladesh
  • 10 university students took part in dataset evaluation, annotation, and refinement

Research Applications

  • Develop automatic Bengali slang speech recognition systems
  • Serve as a new benchmark for speech recognition-based ML models
  • Simulate real-world scenarios using background noise in the dataset
  • Option to remove background noise for clean audio processing
  • Dataset can be enriched further for extended research
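
The clips already contain natural background noise. For experiments that need a controlled noise level, one common augmentation is to mix a noise signal into a clip at a chosen signal-to-noise ratio (SNR). The following is a minimal sketch of that idea, not the authors' method: the function names, the 16 kHz sample rate, and the synthetic tone standing in for a speech clip are all assumptions made for illustration.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mix has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise signal is silent")
    # SNR_dB = 20 * log10(rms_speech / rms_noise), solved for the noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


# Hypothetical stand-ins: a 220 Hz tone for a speech clip, Gaussian noise
# for the background. Real use would load samples from audio files instead.
sample_rate = 16000  # assumed sample rate, one second of audio
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
          for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

noisy = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the added noise and confirm the SNR that was actually achieved.
added_noise = [m - s for m, s in zip(noisy, speech)]
achieved_snr_db = 20 * math.log10(rms(speech) / rms(added_noise))
print(f"Achieved SNR: {achieved_snr_db:.2f} dB")
```

Scaling by RMS keeps the speech samples untouched and changes only the noise level, so the same clip can be reused at several SNR values when comparing models.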

Key Contributions

  • First comprehensive Bangla offensive speech dataset
  • Multi-dialect audio collection from across Bangladesh
  • Carefully annotated and refined by expert evaluators
  • Open-source code available on GitHub
  • Published in peer-reviewed journal (Data in Brief)

"Researchers can use this dataset to develop an automatic Bengali Slang speech recognition system, and also it can be used as a new benchmark for creating speech recognition-based machine learning models."