Sentence Correction Using Recurrent Neural Network

Nipun Agrawal · Published in Geek Culture · 10 min read · Apr 15, 2021


Introduction:

Social media platforms let people communicate with each other by creating, exchanging, and sharing ideas in a virtual network. Most people use social media to express their feelings through text, and many ML/DL models use these texts for sentiment analysis, predicting criminal activity, and other NLP tasks. These models are usually trained on standard language, mostly English. Nowadays, however, people use short forms and abbreviations in their texts (e.g. "ppl" for people, "2" for to, "wen"/"whn" for when, and many more), which makes such text much less useful for NLP tasks.

Table of Contents

  • Business Problem
  • Deep Learning Problem
  • Dataset Overview
  • Loss
  • EDA
  • Data Preprocessing
  • Deep Learning Models
  • Model Evaluation
  • Model Prediction
  • Error Analysis
  • Final Pipeline
  • Model Inference
  • Future Work
  • References

Business Problem:

Building a model that maps corrupted text back to the distribution of standard English improves the performance of many NLP-based models. Here the input data contains random corruptions and is a superset of the target data; the task is to convert it into the target data while preserving the semantic meaning of the text.

Deep Learning Problem:

Convert randomly corrupted or SMS-style text into proper English while preserving the semantic meaning of the text.

Dataset Overview:

Based on our problem statement, the following publicly available dataset is used:

https://www.comp.nus.edu.sg/~nlp/corpora.html

This dataset contains social media (SMS) text along with its normalized English text and a Chinese translation of the normalized text. For our problem we need only the social media text and its normalized English text. The corpus contains 2,000 such sentence pairs.

Dataset Overview

Our data comes in txt format, with the SMS text on one line, the standard English version on the second line, and the Chinese translation of the standard English on the third line. We use only the SMS text and the standard English text for our problem statement. After splitting the file, it was converted into a CSV file with two columns, SMS_TEXT and ENGLISH_TEXT.
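
A minimal sketch of this conversion, assuming the three-lines-per-record layout described above; the file names are placeholders, not the project's actual paths.

```python
import pandas as pd

# Each record spans three consecutive lines: SMS text, normalized English, Chinese.
with open("sms_corpus.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

sms_text = lines[0::3]      # 1st line of every record: raw SMS text
english_text = lines[1::3]  # 2nd line: normalized English (the Chinese line is skipped)

df = pd.DataFrame({"SMS_TEXT": sms_text, "ENGLISH_TEXT": english_text})
df.to_csv("sms_corpus.csv", index=False)
```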

Loss:

As mentioned in the research paper, we will be using categorical cross-entropy. It is also known as softmax cross-entropy loss and trains the model to output a probability distribution over all the classes. It is the standard loss for multi-class classification.

img source: https://leakyrelu.com/2020/01/01/difference-between-categorical-and-sparse-categorical-cross-entropy-loss-function/
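
As a quick illustration, this is how the categorical and sparse categorical variants are typically set up in tf.keras; it is a generic sketch, not the exact training code from this project.

```python
import tensorflow as tf

# Categorical cross-entropy expects one-hot targets,
# sparse categorical cross-entropy expects integer class ids.
cce = tf.keras.losses.CategoricalCrossentropy()
scce = tf.keras.losses.SparseCategoricalCrossentropy()

# Same example with the two target encodings (3 classes)
y_true_onehot = [[0.0, 1.0, 0.0]]
y_true_sparse = [1]
y_pred = [[0.1, 0.8, 0.1]]

print(cce(y_true_onehot, y_pred).numpy())   # ~0.223
print(scce(y_true_sparse, y_pred).numpy())  # same value
```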

EDA:

Before proceeding further, we should check for any missing values.

Now we will perform EDA on each column separately.
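
The snippet below sketches roughly how these statistics can be computed with pandas and NumPy; the column names follow the CSV described above, and the plotting code is omitted.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sms_corpus.csv")  # placeholder path from the earlier sketch

# Missing values per column
print(df.isnull().sum())

# Sentence length (in characters) and number of words per sentence
df["sms_len"] = df["SMS_TEXT"].str.len()
df["sms_words"] = df["SMS_TEXT"].str.split().str.len()

# Percentile values used below to choose the maximum lengths
for q in (99.7, 99.8, 99.9, 100.0):
    print(q, np.percentile(df["sms_len"], q), np.percentile(df["sms_words"], q))
```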

1. SMS_TEXT

  • Length of Sentence in SMS_TEXT

From the above distribution plot we can see that most sentence lengths lie in the range of 20–80. Very few sentences are longer than 150 characters.

  • Number of Words

Most sentences have 5 to 20 words. Very few sentences have more than 30 words.

  • Calculating the percentile values of sentence length: the 99.8th percentile of sentence length is 161 and the 99.9th percentile is 202. There is a large gap between these two percentiles, and no sentence has a length between 161 and 202, so we take the maximum length for SMS_TEXT to be 161.
  • Calculating the percentile values of the number of words: the 99.9th percentile is 39 words and the 100th percentile is 49 words. The maximum number of words per sentence is therefore taken as 39.
  • Occurrences of Each Character in SMS_TEXT:

From the above plot, the character ‘e’ is the most frequent, followed by ‘a’, ‘t’ and ‘o’.

  • Frequency of POS_Tags

From the above plot we can see that the Person POS tag is the most frequent, occurring around 700 times in the SMS_TEXT column.

  • 25 Most frequently occurring stop words in the SMS_TEXT

From the above plot we can see that stopwords like ‘to’ and ‘i’ occur most frequently in the SMS_TEXT column.

2. English Text

  • Length of Sentence

From the above plot we can see that very few sentences are longer than 200 characters.

  • Number of Words

From the above plot, we can see that very few sentences have more than 40 words.

  • Calculating the percentile values of sentence length: the 99.7th percentile is 200 and the 99.8th percentile is 215. Very few sentences are longer than 200 characters, so the maximum length of an English sentence is taken as 200.
  • Calculating the percentile values of the number of words: the 99.9th percentile is 48 words and the 100th percentile is 59 words. The maximum number of words per sentence is therefore taken as 48.
  • Occurrences of Each Character in English Text:

Characters like ‘e’, ‘o’, ‘a’ and ‘t’ are the most frequent in the ENGLISH_TEXT column.

  • Frequency of POS_Tags

From the above plot we can see that the Person POS tag is the most frequent in ENGLISH_TEXT, occurring around 500 times.

  • 25 Most Frequently occurring Stopwords in English Text

From the above plot we can see that the stopword ‘you’ occurs most often in ENGLISH_TEXT.

Data Preprocessing:

Data preprocessing has been done in two ways:

  • Character Level: Based on the EDA, we keep only those sentences whose SMS_TEXT length is less than 170 characters and whose ENGLISH_TEXT length is less than 200 characters. We drop the ENGLISH_TEXT column and create two new columns: English input, which has the start token ‘\t’ prepended to every sentence, and English output, which has the end token ‘\n’ appended to every sentence. The end token ‘\n’ is added to both the English input and the English output so that the model can learn the end token. With this preprocessing step only 7 sentences were dropped from the dataset.
  • Word Level: Based on the EDA, we keep only those sentences whose SMS_TEXT has fewer than 39 words and whose ENGLISH_TEXT has fewer than 40 words. We drop the ENGLISH_TEXT column and create the same two columns as at character level, the only difference being that the start token is <start> and the end token is <end> (sketched below). With this preprocessing step only 12 sentences were dropped from the dataset.
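
A minimal sketch of the word-level filtering and token-adding step, under the assumptions above; the CSV path and column names follow the earlier sketch and are placeholders.

```python
import pandas as pd

df = pd.read_csv("sms_corpus.csv")  # placeholder path from the earlier sketch

# Word-level filtering: drop overly long sentence pairs (thresholds from the EDA)
mask = (df["SMS_TEXT"].str.split().str.len() < 39) & \
       (df["ENGLISH_TEXT"].str.split().str.len() < 40)
df = df[mask].copy()

# Decoder input gets a start token, decoder output gets an end token
df["ENGLISH_INPUT"] = "<start> " + df["ENGLISH_TEXT"]
df["ENGLISH_OUTPUT"] = df["ENGLISH_TEXT"] + " <end>"
df = df.drop(columns=["ENGLISH_TEXT"])
```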

The research paper also suggests checking whether all characters are strictly printable (that is, whether all characters are English characters). We skip this check in our preprocessing because some non-English characters actually represent English words.

Eg: SMS_TEXT: Yar lor… But if go later hor we muz go by ourselves… Then how… Ü still sleeping ar me eating now oredi ü still sleeping…

ENGLISH_TEXT: Yes. But if go later, we must go by ourselves. Then how? Are you still sleeping? And I am eating now and you are still sleeping.

In the above example, ‘Ü’ represents the word ‘you’; if we removed ‘Ü’, the semantic meaning of the sentence would change.

Data Preparation:

  1. Character Level: Splitting the dataset into train and test in the ratio of 99:1.

We then create a character-level tokenizer with no filters and lower case set to false. Fitting the tokenizer on the train dataset and converting the train and test datasets into sequences gives us the vocabularies for the model's input and output. We then pad the sequences to the length of the longest pre-processed sentence on both the input and output side. Finally, we create a character-level embedding matrix for both input and output sentences, which is used as the weights of the model's embedding layers. From these processing steps we get an input-sentence vocabulary of 103 characters and an output-sentence vocabulary of 92 characters.
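
A rough sketch of this character-level tokenization and padding with tf.keras; train_df and test_df are placeholder names for the 99:1 split above, and the exact parameters in the project may differ.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Character-level tokenizer: no filtering, keep the original case
char_tok = Tokenizer(char_level=True, filters="", lower=False)
char_tok.fit_on_texts(train_df["SMS_TEXT"])

train_seq = char_tok.texts_to_sequences(train_df["SMS_TEXT"])
test_seq = char_tok.texts_to_sequences(test_df["SMS_TEXT"])

# Pad every sequence to the longest pre-processed sentence
max_len = max(len(s) for s in train_seq)
train_pad = pad_sequences(train_seq, maxlen=max_len, padding="post")
test_pad = pad_sequences(test_seq, maxlen=max_len, padding="post")

vocab_size = len(char_tok.word_index) + 1  # +1 for the padding index
```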

2. Word Level: Splitting the dataset into train and test in the ratio of 99:1.

We then create a word-level tokenizer with filters, lower case set to false and an OOV token. Fitting the tokenizer on the train dataset and converting the train and test datasets into sequences gives us the vocabularies for the model's input and output. We then pad the sequences to the length of the longest pre-processed sentence on both the input and output side. Finally, we create a word-level embedding matrix using fastText for both input and output sentences, which is used as the weights of the model's embedding layers. From these preprocessing steps we get an input-sentence vocabulary of 3702 words and an output-sentence vocabulary of 3040 words.
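
A hedged sketch of building the fastText embedding matrix; the pretrained model file, the 300-dimensional vectors and the word_tok name are assumptions, and the project's fastText setup may differ.

```python
import numpy as np
import fasttext

# Pretrained English fastText vectors (the file name is a placeholder)
ft = fasttext.load_model("cc.en.300.bin")

embedding_dim = 300
vocab_size = len(word_tok.word_index) + 1  # word_tok: the word-level tokenizer from above
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_tok.word_index.items():
    # fastText can produce vectors even for out-of-vocabulary words
    embedding_matrix[idx] = ft.get_word_vector(word)

# Later plugged into the model as weights of the Embedding layer, e.g.
# tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[embedding_matrix])
```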

Deep Learning Models:

Here we use an encoder-decoder model in which both the encoder and the decoder have a single layer of either GRU or LSTM.

image source: https://www.kdnuggets.com/2019/08/deep-learning-transformers-attention-mechanism.html

1. Sequence to Sequence model with One Hot Encoded inputs:

In this model, sentences are converted into one-hot encoded vectors whose dimension equals the number of characters in the vocabulary. For each character position, the index of the current character is set to 1 and all other positions are set to 0. For example, with a vocabulary of 4 characters, a 4-character word could be represented as [[1 0 0 0] [0 1 0 0] [0 0 1 0] [0 0 0 1]].
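
A tiny illustration of this one-hot encoding with tf.keras; the character ids below are made up.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

# Padded character-id sequence for one short word (the ids are illustrative)
char_ids = np.array([3, 1, 7, 0])
num_chars = 10  # size of the character vocabulary

one_hot = to_categorical(char_ids, num_classes=num_chars)
print(one_hot.shape)  # (4, 10): one one-hot vector per character position
```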

Encoder Decoder Architecture with GRU 100 units

Here we used 100 GRU units, categorical cross-entropy as the loss and the Adam optimizer with a learning rate of 0.0001, and obtained a loss of 0.5648.

Encoder Decoder Architecture with LSTM 100 units.

Here we used 100 LSTM units, categorical cross-entropy as the loss and the Adam optimizer with a learning rate of 0.0001, and obtained a loss of 0.5587.
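
For reference, a condensed sketch of such a single-layer LSTM encoder-decoder in tf.keras; the 100 units and the learning rate follow the text above, while the variable names and vocabulary size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

num_chars = 103   # character vocabulary size (illustrative, from the tokenization step)
latent_dim = 100  # 100 LSTM units, as in the experiments above

# Encoder: consumes the one-hot encoded source sequence and returns its final states
enc_inputs = layers.Input(shape=(None, num_chars))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_inputs)

# Decoder: starts from the encoder states and predicts the next character at each step
dec_inputs = layers.Input(shape=(None, num_chars))
dec_outputs, _, _ = layers.LSTM(latent_dim, return_sequences=True,
                                return_state=True)(dec_inputs,
                                                   initial_state=[state_h, state_c])
dec_outputs = layers.Dense(num_chars, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy")
```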

2. Sequence to Sequence model with Character level tokenization:

Here, sentences are tokenized into character-level tokens and passed to the encoder-decoder model.

Encoder Decoder architecture with 100 units of GRU

Here we used 100 GRU units, sparse categorical cross-entropy as the loss and the Adam optimizer with learning rates of 0.01 and 0.001, obtaining losses of 0.8484 and 0.7583 respectively.

Encoder Decoder model with 100 units of LSTM.

Here we used 100 LSTM units, sparse categorical cross-entropy as the loss and the Adam optimizer with a learning rate of 0.01, and obtained a loss of 0.6794.

3. Sequence to Sequence model with Character Level tokenization and Bahdanau Attention:

image source: https://blog.floydhub.com/attention-mechanism/amp/

After getting the outputs from the encoder, we calculate an attention score for each encoder output with respect to the decoder input and hidden state at every time step. The score tells how much attention (weight) the decoder gives to each encoder output when generating the next decoder output. Here we used 100 LSTM units, sparse categorical cross-entropy as the loss and the Adam optimizer with a learning rate of 0.01, and obtained a loss of 0.3930.
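
A minimal sketch of a Bahdanau (additive) attention layer in tf.keras; the layer and variable names are assumptions, but the score computation follows the standard additive formulation.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects encoder outputs
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder hidden state
        self.V = tf.keras.layers.Dense(1)       # scores each encoder time step

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, units) -> (batch, 1, units) for broadcasting
        hidden_with_time = tf.expand_dims(decoder_hidden, 1)

        # Additive score: V^T * tanh(W1 * enc_out + W2 * dec_hidden)
        score = self.V(tf.nn.tanh(self.W1(encoder_outputs) +
                                  self.W2(hidden_with_time)))
        attention_weights = tf.nn.softmax(score, axis=1)

        # Weighted sum of encoder outputs = context vector for the next decoder step
        context = tf.reduce_sum(attention_weights * encoder_outputs, axis=1)
        return context, attention_weights
```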

4. Sequence to Sequence model with Word Level tokenization and Bahdanau Attention:

Here, sentences are tokenized at the word level, and the tokenized sentences are used to create an embedding matrix with fastText, which is used as the weights of the embedding layers in both the encoder and the decoder. For every learning rate we tried, this model overfit.

Model Evaluation:

Model Comparison

Among the above models, the character-level model with Bahdanau attention achieved the lowest loss, so we use the Bahdanau attention model for predicting sentences.

Model Prediction:

Bahdanau Attention Model Sentence Prediction.

From the predictions we can observe that the model does not perform well on most sentences. This is likely because the dataset used to train the model is very small.

Error Analysis:

Let's check the average BLEU score between the model's predicted sentences and the original sentences. The average BLEU score on the validation dataset is 0.6743; the lowest BLEU score is 0.5220 and the highest is 0.7496.
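
A small sketch of how a sentence-level BLEU score can be computed with NLTK; the smoothing choice and the val_references/val_predictions names are assumptions, and the project's exact BLEU setup may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu(reference: str, prediction: str) -> float:
    # sentence_bleu expects a list of tokenized references and a tokenized hypothesis
    return sentence_bleu([reference.split()], prediction.split(),
                         smoothing_function=smooth)

scores = [bleu(ref, pred) for ref, pred in zip(val_references, val_predictions)]
print(sum(scores) / len(scores))  # average BLEU on the validation set
```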

Let us plot the BLEU scores of the validation data.

From the above plot, we can observe that the majority of BLEU scores fall in the range 0.65 to 0.75. One possible explanation for the lowest BLEU scores is a higher number of rare words in those sentences compared with the highest-scoring ones.

Here, I counted rare words in the validation dataset, defined as words whose frequency in the dataset is at most 10. The BLEU score does not vary with the number of rare words present in the text: some texts with high BLEU scores contain more rare words than the text with the lowest BLEU score, in both the SMS and the English text.

Final Pipeline:

The predict function accepts multiple sentences passed in a list, and the model predicts them in parallel as a single batch; a rough sketch follows.
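
This is a rough sketch of such a batched predict function with greedy decoding; the tokenizer names, the start/end tokens and the encoder/decoder model interfaces are assumptions based on the steps described above, not the exact project code.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict(sentences, max_in_len=161, max_out_len=200):
    """Greedy character-level decoding for a whole batch of SMS sentences."""
    # Encode the whole batch at once
    seqs = pad_sequences(char_tok.texts_to_sequences(sentences),
                         maxlen=max_in_len, padding="post")
    enc_out, state_h, state_c = encoder_model.predict(seqs)

    # Every decoded sentence starts with the start token '\t'
    dec_input = np.full((len(sentences), 1), out_tok.word_index["\t"])
    results = [""] * len(sentences)
    finished = [False] * len(sentences)

    for _ in range(max_out_len):
        # Assumed decoder interface: takes the previous token, encoder outputs and states,
        # returns next-character probabilities and updated states
        probs, state_h, state_c = decoder_model.predict(
            [dec_input, enc_out, state_h, state_c])
        next_ids = np.argmax(probs[:, -1, :], axis=-1)
        for i, idx in enumerate(next_ids):
            if finished[i]:
                continue
            ch = out_tok.index_word.get(int(idx), "")
            if ch == "\n":          # '\n' is the end token
                finished[i] = True
            else:
                results[i] += ch
        dec_input = next_ids.reshape(-1, 1)
        if all(finished):
            break
    return results
```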

Model Inference:

Model Inference Video

Future Work:

  • The dataset has only 2,000 sentences, which is small for deep learning models, and no other dataset is available. We did not perform any data augmentation, so increasing the amount of data could well improve the model's performance.
  • In the encoder-decoder model we used only unidirectional LSTM and GRU layers; using a bidirectional LSTM might improve the model's performance.
  • We could also try a character-based transfer-learning model, which might give better predictions.

References:

Github Repository:

I hope you like this blog. For any details, contact me via LinkedIn.

Linkedin Profile:
