SET11121_1: Data Wrangling


Task Title: Data Wrangling (Part A)

Subject Code: SET11121 / SET11521

Objective: This assignment has two objectives. The first is to perform a literature review of recent approaches to abusive language detection. The second is to load the given abusive language dataset (JSON files) in Python and split it into training and test sets. The Python code must also perform simple filtering and a frequency analysis.


Overview: In the literature review, the student must use three contemporary research papers (published after 2016) that deal with abusive language detection in textual content. The document should be around 1200 words.

In the programming task, the student must use the three JSON files given in the resources segment of the task. These files represent three separate classes of tweets (neither, racism, sexist). The student must load this data in Python and split it into train and test segments. Finally, the student must perform a word-based frequency analysis on the dataset.
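The loading and splitting steps described above might be sketched as follows. This is only an illustration under stated assumptions: the brief does not give the layout of the JSON files, so the sketch assumes one JSON object per line with a "text" field, and the function names `load_tweets` and `train_test_split` are hypothetical.

```python
import json
import random
from pathlib import Path


def load_tweets(path, label):
    """Read one tweet per line from a JSON-lines file and attach a class label.

    Assumption: each non-blank line is a JSON object with a "text" field.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            records.append({"text": tweet["text"], "label": label})
    return records


def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A caller would typically run `load_tweets` once per class file (neither, racism, sexist), concatenate the three lists, and pass the result to `train_test_split`. Shuffling before the cut keeps the three classes mixed in both segments.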


University: Edinburgh Napier

Tool requirements:

  • Python 3.7: The Python programming language is required to handle the JSON dataset.
  • PyCharm: The educational edition of PyCharm is used to manage the resources of the Python program.

Implementation Details:

  • Initially, a Python class is designed to represent the dataset and its processing functions.
  • The class must have a constructor that initializes the paths to the JSON files and the train-test split percentage.
  • A text filtering function is also implemented in the class.
  • NLTK libraries are used extensively for this task.
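A class along the lines of the bullets above might look like the sketch below. It is not the original implementation: the class name `TweetDataset`, its method names, and the `train_percent` parameter are hypothetical, file loading is left out, and a regular expression with a small hard-coded stopword set stands in for NLTK's tokenizer and stopwords corpus so that the example runs without any downloads.

```python
import re
from collections import Counter

# Small stand-in stopword set (assumption). The original relies on NLTK's
# English stopwords corpus, which needs a one-time nltk.download("stopwords").
STOPWORDS = frozenset(
    {"a", "an", "and", "are", "i", "in", "is", "it", "of", "rt", "the", "to", "you"}
)


class TweetDataset:
    """Hypothetical class: stores file paths and the split ratio, filters and counts words."""

    def __init__(self, paths_by_label, train_percent=80):
        if not 0 < train_percent < 100:
            raise ValueError("train_percent must be between 0 and 100")
        self.paths_by_label = dict(paths_by_label)  # e.g. {"racism": "racism.json"}
        self.train_percent = train_percent

    @staticmethod
    def filter_text(text):
        """Lowercase, drop URLs and @mentions, keep alphabetic runs, remove stopwords."""
        text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        tokens = re.findall(r"[a-z]+", text)
        return [tok for tok in tokens if tok not in STOPWORDS]

    def word_frequencies(self, texts):
        """Count filtered tokens over an iterable of tweet texts."""
        counts = Counter()
        for text in texts:
            counts.update(self.filter_text(text))
        return counts
```

With a counter built this way, `counts.most_common(5)` returns the five most frequent words as (word, count) pairs, and `counts.most_common()[-1]` returns one of the least frequent words, which matches the shape of the sample output. NLTK's `FreqDist` is a `Counter` subclass, so the same calls should work if it is swapped in.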



Sample Output  

[nltk_data] Downloading package stopwords to

[nltk_data]     C:\Users\Krazzy\AppData\Roaming\nltk_data...

[nltk_data]   Unzipping corpora\

[nltk_data] Downloading package punkt to

[nltk_data]     C:\Users\Krazzy\AppData\Roaming\nltk_data...

[nltk_data]   Unzipping tokenizers\

Five Most common:[('sexist', 761), ('kat', 717), ('like', 713), ('women', 693), ('islam', 536)]

Least common:('maiming', 1)


Process finished with exit code 0


