Text Processing
Contents
Text Processing#
Load Data#
All data from the preceeding web scrape is loaded.
review_text
is converted to lower case.
import pandas as pd
# load data
df = pd.read_csv('data/processed_review_data.csv')
df['review_text'] = df['review_text'].str.lower()
Clean Text#
Several functions to process review text are developed and applied. These include:
Contraction expansion
String formatting
Duplicate character and word removal
Spelling correction
Lemmatisation and tokenisation
Define Functions#
Contraction Expansion#
A dictionary of contractions and associated expnasions are defined.
The Python Regular Expression library is used to identify and replace all contractions with their expanded form.
import re
# define contractions dictionary
cList = {
# A.
"ain't": "am not","aren't": "are not",
# C.
"can't": "cannot","can't've": "cannot have","'cause": "because","could've": "could have","couldn't": "could not",
"couldnt": "could not","couldn't've": "could not have",
# D.
"didn't": "did not","doesn't": "does not","don't": "do not",
# H.
"hadn't": "had not","hadn't've": "had not have","hasn't": "has not","haven't": "have not","he'd": "he would",
"he'd've": "he would have","he'll": "he will","he'll've": "he will have","he's": "he is","how'd": "how did",
"how'd'y": "how do you","how'll": "how will","how's": "how is",
# I.
"i'd": "i would","i'd've": "i would have","i'll": "i will","i'll've": "i will have","i'm": "i am","i've": "i have",
"isn't": "is not","it'd": "it had","it'd've": "it would have","it'll": "it will","itll": "it will",
"it'll've": "it will have","it's": "it is",
# L.
"let's": "let us",
# M.
"ma'am": "madam","mayn't": "may not","might've": "might have","mightn't": "might not","mightn't've": "might not have",
"must've": "must have","mustn't": "must not","mustn't've": "must not have",
# N.
"needn't": "need not","needn't've": "need not have",
# O.
"o'clock": "of the clock","oughtn't": "ought not","oughtn't've": "ought not have",
# S.
"shan't": "shall not","sha'n't": "shall not","shan't've": "shall not have","she'd": "she would",
"she'd've": "she would have","she'll": "she will","she'll've": "she will have","she's": "she is","should've": "should have",
"shouldn't": "should not","shouldn't've": "should not have","so've": "so have","so's": "so is",
# T.
"that'd": "that would","that'd've": "that would have","that's": "that is","there'd": "there had",
"there'd've": "there would have","there's": "there is","they'd": "they would","they'd've": "they would have",
"they'll": "they will","they'll've": "they will have","they're": "they are","they've": "they have","to've": "to have",
# V.
"vr" : "virtual reality",
# W.
"wasn't": "was not","we'd": "we had","we'd've": "we would have","we'll": "we will","we'll've": "we will have",
"we're": "we are","we've": "we have","weren't": "were not","what'll": "what will","what'll've": "what will have",
"what're": "what are","what's": "what is","what've": "what have","when's": "when is","when've": "when have",
"where'd": "where did","where's": "where is","where've": "where have","who'll": "who will","who'll've": "who will have",
"who's": "who is","who've": "who have","why's": "why is","why've": "why have","will've": "will have","won't": "will not",
"won't've": "will not have","would've": "would have","wouldn't": "would not","wouldn't've": "would not have",
# Y.
"y'all": "you all","y'alls": "you alls","y'all'd": "you all would","y'all'd've": "you all would have",
"y'all're": "you all are","y'all've": "you all have","you'd": "you had","you'd've": "you would have",
"you'll": "you you will","you'll've": "you you will have","you're": "you are","you've": "you have"
}
c_re = re.compile('(%s)' % '|'.join(cList.keys()))
# define expansion function
def expandContractions(text, cList=cList):
def replace(match):
return cList[match.group(0)]
return c_re.sub(replace, text)
String Formatting#
The following string formatting is performed:
All sentences are transformed to end in a ‘.’
All floating point numbers are removed.
Non alphanumeric characters are dropped.
Repeated fullstops and additional white space are dropped.
# define str format function
def clean(text):
text = re.sub(r'[?!:]', '.', text) # make all sentence ends with '.'
text = re.sub('\d*\.\d+','', text) # remove all floats
text = re.sub("[^a-zA-Z0-9. ]", '', text) # remove all not listed chars
text = re.sub('\.\.+', '. ',text) # remove repeat fullstops
text = re.sub(' +',' ', text) # remove extra whitespace
return text
Duplicate Removal#
Consecutive duplicate words are dropped and consecutive repeat characters are limited to a maximum of 2.
# define duplicate removal function
from itertools import groupby
def consec_dup(text):
text = " ".join([x[0] for x in groupby(text.split(" "))]) # remove repeat consecutive words
text = re.sub(r'(.)\1+', r'\1\1',text) # replace >2 consecutive duplicate characters
return text
Spelling Correction#
The SymspellPy
library [Wolf, 2022] is used to perform spelling correction.
import pkg_resources
from symspellpy import SymSpell, Verbosity
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
"symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
def spell(text):
sentences = text.split('.')
corrected = " ".join([sym_spell.lookup_compound(x, max_edit_distance=2, ignore_non_words=True,ignore_term_with_digits=True)[0].term+'.' for x in sentences])
return corrected
Lemmatisation#
The spaCy
library [Honnibal et al., 2020] is utilised to lemmatise the review text and transform all words to their base root form.
import spacy
nlp = spacy.load("en_core_web_sm")
def lemma(text):
doc = nlp(text)
text = [token.lemma_ for token in doc]
text = " ".join(text)
return text
Apply Functions#
All functions are applied and the output saved to .csv for proceeding stages.
for func in [expandContractions,clean,consec_dup,spell,lemma]:
df.review_text = df.review_text.map(func,na_action='ignore')
df.dropna().to_csv('data/preTag_df1.csv',index=False)
df
review_text | |
---|---|
0 | be be pretty good . |
1 | the game have not crash on I do not know what ... |
2 | good cod since bo2 . come from a cod vet the m... |
3 | I like the game because I be a big cod fan eve... |
4 | just hit lvi 55 in 18 hour be it fun . yes be ... |
... | ... |
115947 | I have like how cod make sure to add some cont... |
115948 | I hate this game but I still play it because I... |
115949 | my been . |
115950 | too many mode cater to the young below 30 audi... |
115951 | pew game . . |
115952 rows × 1 columns