Cyber Abuse Detection using Machine Learning for Roman Urdu

Machine Learning / AI

Project Details

Project Information

Project Title: Cyber Abuse Detection using Machine Learning for Roman Urdu

Category: Machine Learning / AI

Semester: Spring 2025

Course: CS619

Complexity: Very Complex

Project Description

Project Domain / Category:

Data Science / Machine Learning / Natural Language Processing (NLP)

 

Abstract / Introduction

The extensive use of social media has led to a significant increase in cyber abuse, including harassment, bullying, and offensive language, particularly in Roman Urdu. The absence of effective automated detection systems allows such content to persist, negatively impacting online interactions. Identifying cyber abuse in Roman Urdu presents a unique challenge due to informal language structure, variations in spelling, and contextual meanings. This project aims to develop a machine learning-based model capable of detecting and classifying cyber abuse in Roman Urdu text. The proposed system will utilize natural language processing (NLP) techniques and will be trained on data collected from social media platforms. Furthermore, a web interface will be developed to enable users to evaluate the model’s performance in real time.

 

Functional Requirements:

The Admin (the student) will perform all of the functional requirements listed below.

1.      Data Collection

·        For this project, the student will collect data from any social media platform (such as YouTube, Facebook, Twitter, or Instagram) to detect cyber abuse. The dataset must contain at least 5,000 comments focusing on Roman Urdu.

·        The student is required to create their own dataset, and using pre-existing datasets from sources like Kaggle or other online repositories will not be accepted. Any attempt to do so will result in a deduction of marks. A sample dataset is provided in the link below for reference.

2.      Data Preparation

·        Prepare the dataset by labeling each comment as "Abusive (A)" or "Non-Abusive (NA)." This step involves manually reviewing the data to assign appropriate labels, ensuring the dataset is clean, well-structured, and suitable for machine learning.
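As a minimal sketch of the labeling step, the snippet below loads a small labeled sample and checks that every label is either A or NA. The column names (`comment`, `label`) and the sample comments are assumptions made for illustration, not part of the project specification. One practical point: by default, pandas reads the string "NA" as a missing value, so the sketch passes `keep_default_na=False` to preserve the Non-Abusive label.

```python
import io
import pandas as pd

# Hypothetical labeled sample; in the project this would be the student's own CSV file.
raw_csv = io.StringIO(
    "comment,label\n"
    "ap bohat achay insan hain,NA\n"
    "tum bilkul bewakoof ho,A\n"
    "video bohat achi thi,NA\n"
)

# keep_default_na=False stops pandas from converting the "NA" label to NaN;
# na_values=[""] still treats genuinely empty cells as missing.
df = pd.read_csv(raw_csv, keep_default_na=False, na_values=[""])

# Any label outside the two allowed classes signals a labeling mistake.
allowed_labels = {"A", "NA"}
invalid = df[~df["label"].isin(allowed_labels)]

# Class distribution, useful for spotting a heavily imbalanced dataset.
label_counts = df["label"].value_counts().to_dict()
print(label_counts)
print("Rows with invalid labels:", len(invalid))
```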

3.      Data Pre-Processing

·        Because real-world data is often incomplete and noisy and may contain missing values, the student must apply pre-processing techniques to ensure data quality. The following steps should be performed systematically:

o    Missing Values

         First, check how many missing values are present and display the output.

         Then, apply an appropriate technique to handle them (e.g., remove or fill with relevant values).

o    Duplicate Values

         First, check the number of duplicate entries and display the output.

         Then, remove the duplicates to maintain data quality.

o    Noise & Outliers

         First, identify noisy or extreme values and display the output.

         Then, clean or handle them to improve dataset reliability.

·        Additionally, the student must normalize the dataset, remove stop words, and ensure data is properly structured before feature extraction.
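The pre-processing steps above could be sketched with pandas roughly as follows. The toy comments, the column names, the stop-word list, and the rule that a comment with no remaining letters counts as noise are all assumptions chosen for illustration; a real project would need a fuller Roman Urdu stop-word list and its own definition of noise.

```python
import re
import pandas as pd

# Hypothetical toy data standing in for the collected comments.
df = pd.DataFrame({
    "comment": [
        "Ap bohat ACHAY insan hain!!!",
        "tum bilkul bewakoof ho",
        "tum bilkul bewakoof ho",                 # exact duplicate
        None,                                     # missing comment
        "video achi thi https://example.com 😀",  # URL and emoji
        "??",                                     # noise: no letters at all
    ],
    "label": ["NA", "A", "A", "NA", "NA", "NA"],
})

# 1. Missing values: display the count per column, then drop incomplete rows.
missing_counts = df.isna().sum()
print(missing_counts)
df = df.dropna(subset=["comment", "label"])

# 2. Duplicates: display the count, then keep the first copy of each comment.
duplicate_count = int(df.duplicated(subset="comment").sum())
print("Duplicates:", duplicate_count)
df = df.drop_duplicates(subset="comment")

# 3. Normalization and stop-word removal (assumed, very small stop-word list).
STOP_WORDS = {"hain", "ho", "thi", "ka", "ki", "ke"}

def normalize(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # strip digits, punctuation, emoji
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

df["clean"] = df["comment"].apply(normalize)

# 4. Noise: comments that are empty after cleaning carry no usable signal.
noisy = df[df["clean"].str.len() == 0]
print("Noisy rows:", len(noisy))
df = df[df["clean"].str.len() > 0].reset_index(drop=True)
print(df[["clean", "label"]])
```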

 

4.      Feature Extraction

·        After the pre-processing step, the student will apply feature extraction techniques to convert textual data into a structured numerical format suitable for machine learning models. Possible techniques include Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and N-Gram Models (Uni-Gram, Bi-Gram, Tri-Gram, etc.). Word embeddings (Word2Vec, FastText, GloVe) can also be applied.

·        The student must have a clear understanding of the working principles, advantages, and limitations of the chosen feature extraction method. It is essential to justify the selection by explaining why a particular technique was used and how it contributes to improving the model's performance.
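As one possible illustration of feature extraction, the sketch below combines TF-IDF with uni-grams and bi-grams using scikit-learn's `TfidfVectorizer`. The choice of scikit-learn, the `ngram_range` setting, and the sample comments are assumptions; the project leaves the technique open.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned comments (output of the pre-processing step).
comments = [
    "ap bohat achay insan",
    "tum bilkul bewakoof",
    "video bohat achi",
]

# ngram_range=(1, 2) extracts both single words and adjacent word pairs.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(comments)  # sparse matrix: one row per comment

print(X.shape)  # (number of comments, number of distinct uni-grams and bi-grams)
print(sorted(vectorizer.vocabulary_))
```

Words that appear in many comments (here "bohat") receive a lower inverse-document-frequency weight than words confined to a single comment, which is the property usually cited when justifying TF-IDF over raw counts.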

5.      Train & Test Data

·        The student will split the dataset into 75% training data and 25% testing data to evaluate the performance of the machine learning models. To ensure reliable results, the student can apply randomized splitting to avoid bias and maintain data diversity.
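A randomized 75/25 split can be sketched with scikit-learn's `train_test_split`. The placeholder data, the fixed `random_state`, and the use of `stratify` (which keeps the A/NA ratio the same in both parts) are assumptions added for reproducibility and are not required by the specification.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 20 comments with alternating labels.
comments = [f"comment {i}" for i in range(20)]
labels = ["A" if i % 2 == 0 else "NA" for i in range(20)]

X_train, X_test, y_train, y_test = train_test_split(
    comments,
    labels,
    test_size=0.25,    # 25% testing, 75% training
    random_state=42,   # fixed seed so the shuffle is reproducible
    stratify=labels,   # preserve the class ratio in both parts
)

print(len(X_train), len(X_test))
```

To avoid leaking information from the test set, any vectorizer should be fitted on `X_train` only and then applied to `X_test`.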

6.      Machine Learning Techniques

·        The student must use at least three different classifiers/models from distinct machine learning techniques/algorithms. Possible choices include Naïve Bayes (Multinomial, Bernoulli), Support Vector Machine (SVM) with different kernels (Poly, RBF), Decision Tree, Random Forest, Logistic Regression, and Ensemble Methods. The selection should be based on the suitability of the algorithm for text classification tasks.

·        Additionally, the student must have a clear understanding of each chosen model, including its algorithmic working, advantages, limitations, and practical applications. It is essential that the student can justify their selection by explaining why a particular model was chosen over others. Furthermore, the student should be proficient in the implementation and coding of the selected models and be able to analyse their performance effectively.
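The following sketch trains three classifiers from different algorithm families on the same TF-IDF features. The three models shown, their parameters, and the toy training comments are assumptions picked for illustration; any three suitable models from the list above would satisfy the requirement.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy training set; the real dataset needs at least 5,000 comments.
train_comments = [
    "ap bohat achay insan hain",
    "video bohat achi thi",
    "shukriya bhai bohat maza aya",
    "allah ap ko khush rakhe",
    "tum bilkul bewakoof ho",
    "ye admi pagal aur jahil hai",
    "chup kar bewakoof insan",
    "tum jahil aur ghatiya ho",
]
train_labels = ["NA", "NA", "NA", "NA", "A", "A", "A", "A"]

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Each pipeline fits its own vectorizer, so every model sees identical features.
fitted = {}
for name, classifier in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(train_comments, train_labels)
    fitted[name] = pipeline

# Classify one unseen comment with every model.
predictions = {
    name: str(pipeline.predict(["tum bewakoof ho"])[0])
    for name, pipeline in fitted.items()
}
print(predictions)
```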

7.      Confusion Matrix

·        The student must generate a confusion matrix for each classification model to evaluate its performance. Each confusion matrix should report the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), from which accuracy and the other evaluation metrics are derived. A separate confusion matrix must be created for each selected machine learning model, and the results should be analyzed to compare their effectiveness in detecting cyber abuse.
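A confusion matrix for a single model could be produced as sketched below with scikit-learn's `confusion_matrix`. The true and predicted labels are made-up values, and treating Abusive (A) as the positive class is an assumption.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions for eight test comments.
y_true = ["A", "A", "A", "NA", "NA", "NA", "NA", "A"]
y_pred = ["A", "NA", "A", "NA", "A", "NA", "NA", "A"]

# labels=["NA", "A"] fixes the order: index 0 = negative (NA), index 1 = positive (A).
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["NA", "A"])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Passing `labels` explicitly matters here: without it the classes are sorted alphabetically, which would place A first and swap the meaning of the four cells.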

8.      Accuracy Evaluation

·        The student must compute the accuracy of all selected machine learning techniques and compare their performance.

·        This project will also determine which machine learning technique is most effective for detecting cyber abuse.

·        In addition to accuracy, the student should evaluate precision, recall, and F1-score for a more comprehensive analysis.

·        The student must visually represent accuracy comparisons using graphs, bar charts, or other suitable visualizations to highlight differences between models.

·        A final analysis should be conducted to explain which model performed best and why, based on the evaluation metrics.
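The four metrics follow directly from the confusion-matrix counts: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). The sketch below applies these formulas to two hypothetical models; the names `model_a` and `model_b` and all counts are invented placeholders, not results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Placeholder counts for two hypothetical models on a 100-comment test set.
counts = {
    "model_a": {"tp": 40, "tn": 45, "fp": 5, "fn": 10},
    "model_b": {"tp": 45, "tn": 40, "fp": 10, "fn": 5},
}

results = {name: classification_metrics(**c) for name, c in counts.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})

# Rank by F1 rather than accuracy alone.
best = max(results, key=lambda name: results[name]["f1"])
print("Best by F1:", best)
```

In this made-up case both models have the same accuracy but different recall and F1, which is why the final analysis should not rest on accuracy alone. The `results` dictionary can be passed to a plotting library such as matplotlib to draw the required bar chart.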

9.      Web Interface Integration

·        After developing the model, the student will integrate a web interface to allow users to test the model’s performance using real-time comments.

·        The interface should provide a text input field where users can enter a comment, and the system will classify it as Abusive (A) or Non-Abusive (NA).

·        The web interface will be developed using Flask or Django, with a simple HTML/CSS frontend for user interaction.

·        The student should ensure that the interface is fully functional, correctly linked to the trained model, and capable of making real-time predictions.
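A minimal Flask sketch of such an interface is shown below. It is an illustration under stated assumptions: the tiny model is trained in place on toy comments only so that the example is self-contained, whereas a real project would load its own trained model (for example from a file saved earlier). The route, the inline HTML template, and the variable names are likewise hypothetical.

```python
from flask import Flask, render_template_string, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in model trained on toy data; replace with the project's trained model.
train_comments = [
    "ap bohat achay insan hain",
    "video bohat achi thi",
    "tum bilkul bewakoof ho",
    "ye admi pagal aur jahil hai",
]
train_labels = ["NA", "NA", "A", "A"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_comments, train_labels)

LABEL_NAMES = {"A": "Abusive (A)", "NA": "Non-Abusive (NA)"}

# Simple HTML frontend kept inline for brevity.
PAGE = """
<!doctype html>
<title>Roman Urdu Cyber Abuse Detection</title>
<form method="post">
  <input type="text" name="comment" placeholder="Enter a comment" required>
  <button type="submit">Classify</button>
</form>
{% if label %}<p>Prediction: {{ label }}</p>{% endif %}
"""

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    label = None
    if request.method == "POST":
        comment = request.form.get("comment", "").strip()
        if comment:
            # predict() expects a list of texts and returns one label per text.
            predicted = str(model.predict([comment])[0])
            label = LABEL_NAMES[predicted]
    return render_template_string(PAGE, label=label)
```

Saved as `app.py` (a hypothetical file name), the application can be started with Flask's development server, for example `flask --app app run` in recent Flask versions, and opened in a browser to submit comments.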

Tools/Techniques:

·        Anaconda: Python distribution platform for development.

·        Jupyter Notebook: For implementing machine learning models.

·        Python: Programming language used for data pre-processing, model training, and feature extraction.

·        Machine Learning Algorithms: For training and testing cyber abuse detection models.

·        Web Interface: Basic HTML/CSS, Flask, or Django.

Prerequisite:

·        Knowledge of Artificial Intelligence, Machine Learning, and Natural Language Processing concepts is required. Students should complete a short course covering these concepts, or study the links below, while preparing the initial SRS and Design documents.

Helping Material:

Python:

https://www.python.org/
https://www.w3schools.com/python/
https://www.tutorialspoint.com/python/index.htm

Feature Extraction Method:

https://towardsdatascience.com/feature-extraction-techniques-d619b56e31be
https://www.analyticsvidhya.com/blog/2021/04/guide-for-feature-extraction-techniques/
https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
https://www.analyticsvidhya.com/blog/2021/07/feature-extraction-and-embeddings-in-nlp-a-beginners-guide-to-understand-natural-language-processing/
http://uc-r.github.io/creating-text-features

Machine Learning Techniques:

https://towardsdatascience.com/machine-learning-an-introduction-23b84d51e6d0
https://towardsdatascience.com/top-10-algorithms-for-machine-learning-beginners-149374935f3c
https://towardsdatascience.com/10-machine-learning-methods-that-every-data-scientist-should-know-3cc96e0eeee9
https://towardsdatascience.com/machine-learning-classifiers-a5cc4e1b0623
https://www.youtube.com/watch?v=fG4e4TUrJ3E
https://www.youtube.com/watch?v=7eh4d6sabA0

Dataset: https://drive.google.com/file/d/1l8Mo22kVQzrucbo2LCwnP74sRZ4Eztb_/view?usp=sharing

 

Supervisor:

Name: Tayyab Waqar

Email ID: tayyab.waqar@vu.edu.pk

Skype ID: maliktayyab786_1

 

Languages

  • Python, HTML, CSS

Tools

  • Anaconda, Jupyter Notebook, Machine Learning Algorithms, Flask, Django

Project Schedules

Assignment #   Title               Start Date                              End Date
1              SRS Document        Friday, 2 May 2025, 12:00 AM            Thursday, 22 May 2025, 12:00 AM
2              Design Document     Friday, 23 May 2025, 12:00 AM           Tuesday, 29 July 2025, 12:00 AM
3              Prototype Phase     Wednesday, 30 July 2025, 12:00 AM       Friday, 12 September 2025, 12:00 AM
4              Final Deliverable   Saturday, 13 September 2025, 12:00 AM   Monday, 3 November 2025, 12:00 AM
