Project Title: Hate Speech Detection using Machine Learning for Roman Urdu
Category: Machine Learning / AI
Tayyab Waqar
tayyab.waqar@vu.edu.pk
maliktayyab786_1
Project Domain / Category
Data Science / Machine Learning / Natural Language Processing (NLP)
Abstract / Introduction
With the rise of online social media platforms, the issue of hate speech has become increasingly prevalent. Hate speech can lead to social tension and harm, especially in multilingual countries like Pakistan, where Roman Urdu is commonly used online. This project aims to develop a machine learning model to detect hate speech in Roman Urdu comments. The focus is on gathering a robust dataset of Roman Urdu comments from social media, pre-processing it, extracting relevant features, and training machine learning models to classify hate speech effectively. Additionally, a web interface will be developed post-completion to allow users to test the model's performance with real-time data.
Functional Requirements:
The admin (student) will perform all of the following functional requirement tasks.
1. Data Collection
• For this project, the student will collect comments from any social media platform (such as YouTube, Facebook, Twitter, or Instagram) for hate speech detection. The dataset must contain at least 5000 comments written in Roman Urdu. A sample dataset is shared under the Dataset link below for reference.
2. Data Preparation
• Prepare the dataset by labelling it as "Hate Speech (HS)" or "Non-Hate Speech (NHS)." This step involves manually reviewing the data to assign appropriate labels, ensuring the dataset is clean and ready for use in machine learning.
3. Data Pre-Processing
• Most real-world data is incomplete and contains noisy and missing values, so the student has to apply pre-processing to the data. In this step, the student will normalize the dataset, handle stop words, missing values, noise, and outliers, and remove duplicate values.
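The pre-processing steps above can be sketched in plain Python as follows. This is a minimal illustration, not a prescribed method: the stop-word list, the character-collapsing rule, and the function names normalize and preprocess are assumptions made for this example, and a real project would curate its own Roman Urdu stop words.

```python
import re

# Hypothetical, very small Roman Urdu stop-word list (assumption for illustration).
STOP_WORDS = {"hai", "ka", "ki", "ke", "ko", "aur", "se", "ye", "yeh", "to", "tha", "thi"}

def normalize(text):
    """Lowercase, strip links/mentions/non-letters, collapse repeats, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)   # remove links and @mentions
    text = re.sub(r"[^a-z\s]", " ", text)            # keep Latin letters only
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # "bohttt" becomes "boht"
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

def preprocess(rows):
    """rows: iterable of (comment, label) pairs. Returns cleaned, de-duplicated pairs."""
    seen, cleaned = set(), []
    for comment, label in rows:
        if not comment or label not in {"HS", "NHS"}:  # missing or invalid values
            continue
        text = normalize(comment)
        if text and text not in seen:                  # skip empty and duplicate comments
            seen.add(text)
            cleaned.append((text, label))
    return cleaned
```

Note that duplicates are detected after normalization, so two comments that differ only in punctuation, casing, or stop words count as the same comment in this sketch.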
4. Feature Extraction
• After the pre-processing step, the student will apply a feature extraction method. The student can use Term Frequency - Inverse Document Frequency (TF-IDF), Uni-Grams (1-Grams), Bi-Grams (2-Grams), Tri-Grams (3-Grams), or general N-Grams as the feature extraction method.
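As a rough illustration of how N-Gram and TF-IDF features fit together, the sketch below builds word n-grams and weights them with the textbook formula tf * log(N / df). The function names ngrams and tfidf and their parameters are assumptions for this example; library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant, so their weights will differ.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_min=1, n_max=2):
    """docs: list of strings. Returns (sorted vocabulary, list of {term: weight})."""
    grams = [ngrams(d.split(), n_min, n_max) for d in docs]
    df = Counter(term for g in grams for term in set(g))   # document frequency
    n_docs = len(docs)
    vectors = []
    for g in grams:
        tf = Counter(g)
        total = len(g) or 1                                 # avoid division by zero
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return sorted(df), vectors
```

With n_min=1 and n_max=1 this yields Uni-Gram features only; raising n_max to 3 adds Bi-Grams and Tri-Grams. Terms that occur in every comment receive a weight of zero under this formula.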
5. Train & Test Data
• Split the dataset into 70% training and 30% testing data for the machine learning models.
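A 70/30 split can be done with a seeded shuffle, as in the sketch below. The function name split_dataset and the seed value are assumptions for this example; a stratified split that preserves the HS/NHS ratio in both parts may be preferable when the classes are imbalanced.

```python
import random

def split_dataset(rows, test_ratio=0.30, seed=42):
    """Shuffle a copy of rows and return (train, test) with the given test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

For a dataset of 5000 comments this gives 3500 training rows and 1500 testing rows.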
6. Machine Learning Techniques
• The student must use at least three classifiers/models (e.g., Naïve Bayes, Multinomial Naïve Bayes, SVM with a polynomial kernel, SVM with an RBF kernel, Decision Tree, Random Tree, or Random Forest) drawn from three different machine learning techniques/algorithms.
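To show what one of the listed classifiers does internally, here is a minimal from-scratch sketch of Multinomial Naïve Bayes over word counts with add-one (Laplace) smoothing. The class name NaiveBayesSketch is hypothetical and the design is simplified for illustration; in practice a library such as scikit-learn (MultinomialNB, SVC, RandomForestClassifier) would likely be used for the three required models.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSketch:
    """Multinomial Naive Bayes over word counts with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)              # documents per class
        self.word_counts = defaultdict(Counter)          # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)           # log prior
            for word in text.split():
                if word in self.vocab:                   # ignore words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A usage example on made-up toy data: NaiveBayesSketch().fit(["nafrat hai tum se", "aap boht ache ho"], ["HS", "NHS"]).predict("boht ache") returns the label whose training comments share the most words with the input.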
7. Confusion Matrix
• Generate a confusion matrix to evaluate the performance of each classification model.
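For the two labels used in this project, a confusion matrix reduces to four counts. The sketch below treats HS as the positive class; the function name and the dictionary layout are assumptions for this example.

```python
def confusion_matrix(y_true, y_pred, positive="HS"):
    """Return counts of TP, FP, FN, and TN for a binary classification task."""
    cm = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            cm["TP" if truth == positive else "FP"] += 1   # predicted hate speech
        else:
            cm["FN" if truth == positive else "TN"] += 1   # predicted non-hate speech
    return cm
```

Here FN counts hate speech comments that the model missed, and FP counts harmless comments that were wrongly flagged.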
8. Accuracy Evaluation
• Compute the accuracy of each technique and compare the results.
• This comparison will also show which machine learning technique is better at detecting toxic comments.
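Comparing the models can be as simple as computing each one's accuracy on the same test labels and sorting the results. The helper names accuracy and rank_models, and the model names used in the usage note, are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def rank_models(y_true, predictions):
    """predictions: {model_name: list of predicted labels}. Best accuracy first."""
    scores = {name: accuracy(y_true, preds) for name, preds in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, rank_models(test_labels, {"naive_bayes": nb_preds, "svm_rbf": svm_preds, "random_forest": rf_preds}) would return the three models ordered from most to least accurate, where each value is the list of labels that model predicted for the 30% test set.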
9. Web Interface Integration
• After the model development, integrate a web interface to allow users to test the model’s performance using real-time comments.
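The request flow of such an interface (show a form, receive a comment, return the predicted label) can be sketched as below. Flask or Django would normally handle this; to keep the example dependency-free it is written as a bare WSGI callable, the interface both frameworks build on. The classify function is only a placeholder keyword rule standing in for the trained model, and all names here are assumptions.

```python
from html import escape
from urllib.parse import parse_qs

def classify(comment):
    """Placeholder for the trained model; this keyword rule is purely illustrative."""
    return "HS" if "nafrat" in comment.lower() else "NHS"

FORM = ('<form method="post"><textarea name="comment"></textarea>'
        '<button type="submit">Check</button></form>')

def app(environ, start_response):
    """WSGI callable: GET shows the form, POST also shows the predicted label."""
    body = FORM
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        fields = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        comment = fields.get("comment", [""])[0]
        # Escape user input before echoing it back into the page.
        body += f"<p>{escape(comment)}: <b>{classify(comment)}</b></p>"
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body.encode("utf-8")]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("127.0.0.1", 8000, app).serve_forever() can serve the app; in the finished project, classify would load the saved model and apply the same pre-processing and feature extraction used during training.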
Tools/Techniques:
• Anaconda: Python distribution platform for development.
• Jupyter Notebook: For implementing machine learning models.
• Python: Programming language used for data pre-processing, model training, and feature extraction.
• Machine Learning Algorithms: For training and testing hate speech detection.
• Web Interface: Basic HTML/CSS, Flask, or Django.
Prerequisite:
• Knowledge of Artificial Intelligence, Machine Learning, and Natural Language Processing concepts is required. Students will cover a short course relevant to these concepts alongside the initial SRS and design documentation, or they can refer to the links below.
Helping Material:
Python:
https://www.w3schools.com/python/
https://www.tutorialspoint.com/python/index.htm
Feature Extraction Method:
https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be
https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlp-a-beginners-guide-to-understand-natural-language-processing/
http://uc-r.github.io/creating-text-features
Machine Learning Techniques:
https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0
https://towardsdatascience.com/top-10-algorithms-for-machine-learning-beginners-
https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
https://www.youtube.com/watch?v=fG4e4TUrJ3E
https://www.youtube.com/watch?v=7eh4d6sabA0
Dataset:
https://drive.google.com/file/d/1Jq62ErAQiMpWfEz9_DwSkjmyYdmwWWu6/view
Supervisor:
Name: Tayyab Waqar
Email ID: tayyab.waqar@vu.edu.pk
Skype ID: maliktayyab786_1