BAAD: A Multipurpose Dataset for Automatic Bangla Offensive Speech Recognition
A comprehensive speech dataset of Bengali Abusive Words designed to advance automatic slang speech recognition for the Bangla language.
- Type: Research Paper & Dataset
- Institution: DIU NLP LAB
- Timeline: June 2021 - October 2022
- Published In: Data in Brief (Elsevier)
- Paper Link: ScienceDirect
- GitHub: AUDIO-Split-Code
Dataset Statistics
- 114 slang words
- 43 non-slang words
- 6,100 audio clips
Abstract
Despite being the fifth most spoken native language in the world, Bangla has received little attention in audio and speech recognition research. This article presents a speech dataset of Bengali abusive words, together with some non-abusive words that are very close to the abusive ones.
The dataset is multipurpose and is intended for automatic recognition of slang speech in Bangla. It was prepared through data collection, annotation, and refinement, and consists of 6,100 audio clips covering 114 slang words and 43 non-slang words.
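As a quick sanity check on the figures above, the short Python sketch below derives the total vocabulary size and the average number of clips per word. Only the three counts come from the abstract; the variable names are illustrative, and the average says nothing about how clips are actually distributed across words.

```python
# Counts quoted in the abstract.
SLANG_WORDS = 114
NON_SLANG_WORDS = 43
TOTAL_CLIPS = 6100

total_words = SLANG_WORDS + NON_SLANG_WORDS      # 157 distinct words
slang_share = SLANG_WORDS / total_words          # fraction of the vocabulary that is slang
mean_clips_per_word = TOTAL_CLIPS / total_words  # average only; per-word counts may vary

print(f"words: {total_words}")
print(f"slang share of vocabulary: {slang_share:.1%}")
print(f"mean clips per word: {mean_clips_per_word:.1f}")
```

The roughly 3:1 ratio of slang to non-slang words is worth keeping in mind when choosing evaluation metrics for a slang versus non-slang classifier.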
Data Collection Process
- 60 native speakers participated in collecting the slang words
- 23 native speakers participated in collecting the non-abusive words
- Speakers represent various dialects from 20+ districts of Bangladesh
- 10 university students took part in dataset evaluation, annotation, and refinement
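Because the clips come from many speakers, anyone benchmarking on this data may want to keep each speaker entirely in either the training or the test set, so that a model is not scored on voices it has already heard. The sketch below shows one way to do that in Python. It assumes each clip is described by a `(clip_path, speaker_id, label)` tuple; that layout, the function name `split_by_speaker`, and the file names are hypothetical and are not taken from the dataset's actual metadata.

```python
import random
from collections import defaultdict


def split_by_speaker(clips, test_fraction=0.2, seed=0):
    """Split clips so that no speaker appears in both train and test.

    clips: list of (clip_path, speaker_id, label) tuples (hypothetical layout).
    """
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip[1]].append(clip)

    # Sort before shuffling so the result depends only on the seed.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Hold out at least one speaker for testing.
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [c for s in speakers if s not in test_speakers for c in by_speaker[s]]
    test = [c for s in speakers if s in test_speakers for c in by_speaker[s]]
    return train, test


# Usage with made-up entries: 5 speakers, 2 clips each.
example = [(f"clip_{s}_{i}.wav", f"spk{s}", "slang") for s in range(5) for i in range(2)]
train_clips, test_clips = split_by_speaker(example, test_fraction=0.2, seed=42)
print(len(train_clips), len(test_clips))
```

Splitting at the speaker level rather than the clip level usually gives a more conservative accuracy estimate, which matters when the goal is recognition across dialects.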
Research Applications
- Develop automatic Bengali slang speech recognition systems
- Serve as a new benchmark for speech recognition-based ML models
- Simulate real-world scenarios using the background noise present in the dataset
- Optionally remove background noise for clean audio processing
- Enrich the dataset further for extended research
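The option of removing background noise can be illustrated with a very crude technique, a frame-level noise gate, which silences every frame whose energy falls below a threshold. This is only a sketch and not the authors' preprocessing method. It assumes the audio is already decoded into a list of floats in the range [-1, 1]; the function names, the frame length of 400 samples (25 ms at an assumed 16 kHz sampling rate), and the threshold are all illustrative choices.

```python
import math


def frame_rms(frame):
    """Root-mean-square amplitude of one non-empty frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def noise_gate(samples, frame_len=400, threshold=0.02):
    """Zero out every frame whose RMS is below `threshold`.

    samples: list of floats in [-1, 1] (assumed already decoded audio).
    Returns a new list of the same length.
    """
    gated = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) < threshold:
            gated.extend([0.0] * len(frame))  # treat as background noise
        else:
            gated.extend(frame)               # keep speech-level frames as-is
    return gated


# Usage with synthetic data: quiet noise, a loud burst, then quiet noise again.
signal = [0.001] * 400 + [0.5] * 400 + [0.001] * 100
cleaned = noise_gate(signal)
print(len(cleaned), max(cleaned[:400]), max(cleaned[400:800]))
```

A gate like this only suppresses noise between words. Noise that overlaps the speech itself needs a spectral method, so for experiments that aim to mimic real-world conditions it may be better to leave the recordings untouched.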
Key Contributions
- First comprehensive Bangla offensive speech dataset
- Multi-dialect audio collection from across Bangladesh
- Carefully annotated and refined by expert evaluators
- Open-source code available on GitHub
- Published in a peer-reviewed journal (Data in Brief)
"Researchers can use this dataset to develop an automatic Bengali Slang speech recognition system, and also it can be used as a new benchmark for creating speech recognition-based machine learning models."