One of the topics I told myself I would learn by the end of the year was Natural Language Processing. Over this winter break, I began taking an NLP course on DataCamp with the hope of producing my own project out of it. I decided on building a classifier that determines whether a tweet was written by Donald Trump or the current president, Joe Biden. The methods I planned on covering included tokenizing strings with the NLTK package as well as various preprocessing steps that would help the models produce better results once the tweets were vectorized. The three models I decided to test were a Naive Bayes model, a Support Vector Machine, and a Logistic Regression; all performed decently well.
One of the biggest challenges I faced in this project was scraping tweets and building a robust dataset for machine learning. Since I do not have a Twitter developer account, I needed another way to collect tweets. One method I found was Twint, an advanced scraping tool written in Python that can pull tweets from Twitter profiles without using Twitter's API. The documentation for Twint can be found here: https://github.com/twintproject/twint. The main difficulty with Twint was figuring out how to use it: at first I could not scrape more than 80 tweets, and other issues kept my searches from running at all. After a couple of hours, I was finally able to scrape the 2000 most recent tweets from Joe Biden's account (POTUS). His Twitter account can be found here: https://twitter.com/POTUS. However, after scraping Biden's tweets, I realized that I could not do the same for Trump, as his account was suspended by Twitter in January of 2021. Because of this, I used a website known as the Trump Twitter Archive to create a dataset of his tweets. The Trump Twitter Archive can be found here: https://www.thetrumparchive.com/.

The main reason I chose to build a model that differentiates specifically between Trump and Biden is that I feel they have very different mannerisms in the way they tweet. Trump tends to be the more vocal of the two, clearly displaying his emotions and opinions in his tweets, while Biden's are a bit more muted. I wanted to see whether my models could pick up on these differences and ultimately predict whether a given tweet was written by Trump or Biden.
# importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# importing package necessary for twint to run
import nest_asyncio
nest_asyncio.apply()
import twint
c = twint.Config()
c.Username = 'POTUS' # user
c.Output = "biden_tweets.csv" # path to csv file
c.Store_csv = True # store tweets in a csv file
c.Limit = 2000 # limit number of tweets to 2000
c.Hide_output = True # suppress console output while scraping
c.Pandas = True # also store the scraped tweets in a pandas dataframe
# twint.run.Search(c) # commented out so I don't rescrape the data
# reading in data scraped from twitter
df = pd.read_csv('biden_tweets.csv')
df.shape
Ultimately, I decided on creating a dataset of 4000 observations: 2000 tweets each from Trump and Biden. I chose this size because it felt like a large enough dataset for the models to perform well.
df.columns
# removing all other columns besides the tweet and the username
df = df.drop(columns = ['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
'user_id', 'name', 'place', 'language', 'mentions',
'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video',
'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
'trans_dest'])
df.head()
As seen above, we have a rather clean-looking dataset with a username column and a tweet column containing the string of words written by the user. Below, I read in the data from the Trump Twitter Archive, which contained over 50,000 tweets from Trump. I needed to remove retweets, as those were not written by Trump himself, and then sort by date and take the 2000 most recent tweets.
# reading in data taken from the Trump Twitter Archive
df1 = pd.read_csv('tweets_01-08-2021 - tweets_01-08-2021.csv')
df1.head()
# removing retweets
df1 = df1[df1["text"].str.contains("RT")==False]
# sort by date, newest first, and keep the 2000 most recent tweets
df1 = df1.sort_values(['date'], ascending=[False])[:2000]
df1.columns
# removing all columns besides the text
df1 = df1.drop(columns = ['id', 'isRetweet', 'isDeleted', 'device', 'favorites',
'retweets', 'date', 'isFlagged'])
# renaming the text column to match the tweets column in Biden dataset
df1 = df1.rename(columns = {'text':'tweet'})
# creating column for username to match up with Biden dataset
df1['username'] = "realDonaldTrump"
df1
# concatenating the datasets
tweets = pd.concat([df, df1])
tweets.head()
After finalizing the two datasets, each containing the 2000 most recent tweets from Biden and Trump respectively, I concatenated them into one large dataset of 4000 tweets ready for natural language processing.
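As a quick sanity check (a small addition on my part, assuming the tweets dataframe built above), the class balance of the combined dataset can be confirmed directly:
# confirm the combined dataset has 2000 tweets per account
print(tweets['username'].value_counts()) # should show 2000 each for POTUS and realDonaldTrump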
The data preprocessing step was where I began to implement many of the topics I had been learning through my NLP course. I started by importing the TweetTokenizer from NLTK, which splits each tweet into tokens to make the processing easier. I then removed all tokens that were not purely alphabetical words, as I felt the models should only look at the words being written in the tweet rather than digits or symbols. After that, I converted all the tokens to lowercase to reduce the number of columns and improve model performance. Next, I noticed that some of the tokens were single letters such as "a" or "s", so I removed those as well. Lastly, I both lemmatized and stemmed each of the tokens, which, from my understanding, reduces plurals to their singular form and strips tenses from verbs. This also improves model performance, since there are no longer separate tokens for votes, vote, voted, and voting. I then rejoined the tokens of each tweet into a single string ready to be vectorized.

Vectorization turns each processed tweet into a numerical vector that machine learning models can work with. I chose a TF-IDF vectorizer over a plain bag-of-words (count) vectorizer because it does not only count how often a word appears; it also down-weights words that show up across many tweets, giving more weight to words that are distinctive to a particular tweet. The last preprocessing step was to remove stopwords from the tweets, which was done within the initialization of the TF-IDF vectorizer.
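To make the lemmatize-then-stem step concrete before walking through the actual pipeline below, here is a minimal standalone sketch that runs the example words votes, vote, voted, and voting through the same NLTK classes; the expected outputs noted in the comments are my own assumption of how WordNetLemmatizer and PorterStemmer behave on these particular words.
# illustrative only: lemmatize and then stem a few related word forms
from nltk.stem import WordNetLemmatizer, PorterStemmer
demo_lemmatizer = WordNetLemmatizer()
demo_stemmer = PorterStemmer()
for word in ["votes", "vote", "voted", "voting"]:
    lemma = demo_lemmatizer.lemmatize(word) # the lemmatizer mainly strips plurals, e.g. "votes" -> "vote"
    stem = demo_stemmer.stem(lemma) # the stemmer strips tenses, e.g. "voted" / "voting" -> "vote"
    print(word, "->", lemma, "->", stem) # all four forms should end up as the single token "vote"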
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
# tweet tokenizing each of the tweets in the dataset
word_tokens = [tt.tokenize(sentence) for sentence in tweets.tweet]
# keep only tokens made up entirely of letters
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]
# convert all tokens to lowercase
lower_tokens = [[word.lower() for word in item] for item in cleaned_tokens]
# removing single-character tokens
tokens = [[i for i in item if len(i) > 1] for item in lower_tokens]
# lemmatize the tokens
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatized_output = [[lemmatizer.lemmatize(w) for w in item] for item in tokens]
# stem the tokens and join them into a single string prepared for vectorization
from nltk.stem import PorterStemmer
Stemmer = PorterStemmer()
stemmed_output = [' '.join([Stemmer.stem(w) for w in item]) for item in lemmatized_output]
# adding the processed tweets back into tweets dataframe
tweets["processed"] = stemmed_output
tweets.head()
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
# initialize the TF-IDF vectorizer with English stopwords removed
tfidf_vect = TfidfVectorizer(stop_words = ENGLISH_STOP_WORDS, lowercase = True, ngram_range = (1,1))
# fit vectorizer to processed tweets
tfidf_vect.fit(tweets['processed'])
# transform the processed tweets into TF-IDF features
X2 = tfidf_vect.transform(tweets['processed'])
# create a DataFrame from the TF-IDF matrix
tfidf_df = pd.DataFrame(X2.toarray(), columns=tfidf_vect.get_feature_names())
# reset the index so the labels line up with the rows of the vectorized dataframe
labels = tweets['username'].reset_index(drop=True)
# adding in labels column (username) to vectorized dataframe
tfidf_df['label'] = labels
tfidf_df.head()
As seen above, after vectorizing the tweets, the result is a dataframe containing 4454 columns, one of which is the label column that shows whether the tweet was written by Trump or Biden. This dataframe was now ready to train models.
Now that the data collection and preprocessing were out of the way, I was able to get to the fun part: training the models and testing their performance. I first split the data into X and y, with X being all the vectorized tweet columns and y being the label column. I then ran a train-test split using a 20% test size and a random state of 13 for reproducibility.
from sklearn.model_selection import train_test_split
X = tfidf_df.drop(columns = ['label'])
y = tfidf_df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
The first model I wanted to try was a Naive Bayes model. In my NLP course, I was told that Naive Bayes and Support Vector Machine models tend to be the most effective at classifying vectorized text. I tuned the Naive Bayes model's hyperparameters through five-fold cross-validation; the hyperparameter I chose to tune was the alpha (smoothing) level. Below are the average training and validation accuracies after cross-validation for each of the ten alpha levels I decided to test.
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate
alphas = np.arange(0.1,1.1,0.1)
# create accuracy arrays
accuracyArray_train = []
accuracyArray_valid = []
# for each alpha, record the average training and validation accuracy across the five folds
for i in alphas:
    nb = MultinomialNB(alpha = i)
    nb.fit(X_train, y_train)
    scores = cross_validate(nb, X_train, y_train, cv = 5, return_train_score=True)
    accuracyArray_train.append(np.mean(scores["train_score"]))
    accuracyArray_valid.append(np.mean(scores["test_score"]))
print("Average Training Accuracies:", accuracyArray_train)
print("Average Validation Accuracies:", accuracyArray_valid)
Below is a graph of the training and validation accuracies at each alpha level; the accuracies generally decrease as the alpha level increases.
# plotting training vs validation accuracies by alpha level
plt.plot(alphas, accuracyArray_valid, label = "valid")
plt.plot(alphas, accuracyArray_train, label = "train")
plt.legend(loc="center")
plt.title("Training Performance vs Validation Performance")
plt.xlabel("Alpha Level")
plt.ylabel("Accuracy")
Ultimately, the alpha level that produced the highest validation accuracy was 0.6. I then retrained the model using this alpha level and ran it over the testing data.
# testing model with best alpha level on testing data
best_alpha_acc = max(accuracyArray_valid)
best_alpha_idx = accuracyArray_valid.index(best_alpha_acc)
best_alpha = alphas[best_alpha_idx]
final_nb = MultinomialNB(alpha = best_alpha)
final_nb.fit(X_train, y_train)
predictions = final_nb.predict(X_test)
currAccuracy_nb = accuracy_score(y_test, predictions)
print("The Best Alpha Level is:", best_alpha)
print("The Testing Accuracy for the Best Alpha Level is:", currAccuracy_nb)
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(final_nb, X_test, y_test, cmap = 'inferno')
As seen above, after running the final model on the testing data, the accuracy ended up at 95.125%. The Naive Bayes model performed very well, as only 39 of the 800 tweets in the testing set were misclassified.
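As a quick cross-check on that number, the misclassification count can also be read straight from the predictions (a small sketch assuming the predictions and y_test objects from the Naive Bayes cells above):
# count how many test tweets the Naive Bayes model got wrong
misclassified_nb = (predictions != y_test).sum()
print("Misclassified tweets:", misclassified_nb, "out of", len(y_test)) # should agree with the confusion matrix above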
The next model that I built was a Support Vector Machine. For hyperparameter tuning, I decided to tune the kernel to see which one produced the best validation accuracy. I once again used cross-validation; the average training and validation accuracies are listed below, along with a graph that shows how the accuracies change with the kernel.
from sklearn.svm import SVC
kernel = ['linear', 'poly', "rbf", 'sigmoid']
training_mse_svm = []
validation_mse_svm = []
for j in kernel:
    svm = SVC(kernel = j)
    svm.fit(X_train, y_train)
    scores = cross_validate(svm, X_train, y_train, cv = 5, return_train_score=True)
    training_mse_svm.append(np.mean(scores["train_score"]))
    validation_mse_svm.append(np.mean(scores['test_score']))
print("Average Training Accuracies:", training_mse_svm)
print("Average Validation Accuracies:", validation_mse_svm)
# plotting training vs validation accuracies by kernel
plt.plot(range(4), validation_mse_svm, label = "valid")
plt.plot(range(4), training_mse_svm, label = "train")
plt.xticks(range(4), kernel) # label the x-axis positions with the kernel names
plt.legend(loc="center")
plt.title("Training Performance vs Validation Performance")
plt.xlabel("Kernel")
plt.ylabel("Accuracy")
# testing model with best kernel on testing data
best_kernel_acc = max(validation_mse_svm)
best_kernel_idx = validation_mse_svm.index(best_kernel_acc)
best_kernel = kernel[best_kernel_idx]
final_svm = SVC(kernel = best_kernel)
final_svm.fit(X_train, y_train)
predictions = final_svm.predict(X_test)
currAccuracy_svm = accuracy_score(y_test, predictions)
print("The Best Kernel is:", best_kernel)
print("The Testing Accuracy for the Best Kernel is:", currAccuracy_svm)
plot_confusion_matrix(final_svm, X_test, y_test, cmap = 'inferno')
I determined that the best choice of kernel for the SVM was 'rbf' which is also the default choice of kernel. This final model produced a testing accuracy of 96.375% and only misclassified 29 tweets out of 800.
Lastly, I built a Logistic Regression model and tuned the solver hyperparameter. I used the same approach on this model that I did on the previous two. Once again, the training and validation accuracies are listed below, along with a graph that displays how the accuracies change with the solver.
from sklearn.linear_model import LogisticRegression
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
training_mse_lr = []
validation_mse_lr = []
for j in solver:
    lr = LogisticRegression(solver = j)
    lr.fit(X_train, y_train)
    scores = cross_validate(lr, X_train, y_train, cv = 5, return_train_score=True)
    training_mse_lr.append(np.mean(scores["train_score"]))
    validation_mse_lr.append(np.mean(scores['test_score']))
print("Average Training Accuracies:", training_mse_lr)
print("Average Validation Accuracies:", validation_mse_lr)
# plotting training vs validation accuracies by solver
plt.plot(range(5), validation_mse_lr, label = "valid")
plt.plot(range(5), training_mse_lr, label = "train")
plt.xticks(range(5), solver) # label the x-axis positions with the solver names
plt.legend(loc="center")
plt.title("Training Performance vs Validation Performance")
plt.xlabel("Solver")
plt.ylabel("Accuracy")
# testing model with best solver on testing data
best_solver_acc = max(validation_mse_lr)
best_solver_idx = validation_mse_lr.index(best_solver_acc)
best_solver = solver[best_solver_idx]
final_lr = LogisticRegression(solver = best_solver)
final_lr.fit(X_train, y_train)
predictions = final_lr.predict(X_test)
currAccuracy_lr = accuracy_score(y_test, predictions)
print("The Best Solver is:", best_solver)
print("The Testing Accuracy for the Best Solver is:", currAccuracy_lr)
plot_confusion_matrix(final_lr, X_test, y_test, cmap = 'inferno')
I determined that the best solver was 'liblinear'. After running this final model over the test data, the testing accuracy came out to 94.875%, with 41 of the 800 tweets misclassified.
print("Naive Bayes Accuracy:", currAccuracy_nb)
print("Support Vector Machine Accuracy:", currAccuracy_svm)
print("Logistic Regression Accuracy:", currAccuracy_lr)
In the end, all three models performed extremely well at determining whether Trump or Biden wrote a given tweet. The Naive Bayes model had a 95.125% accuracy, the Support Vector Machine had a 96.375% accuracy, and the Logistic Regression had a 94.875% accuracy. Both the Naive Bayes model and the Support Vector Machine outperformed the Logistic Regression, which was expected. I believe part of the reason the accuracies were so high is that the data was preprocessed thoroughly.
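As a rough illustration of how the finished pipeline could classify a brand-new tweet (a sketch only: the example tweet is made up, and the snippet simply reuses the tt, lemmatizer, Stemmer, tfidf_vect, and final_svm objects defined earlier rather than showing anything I actually ran):
# hypothetical example: classify a made-up tweet with the fitted SVM
new_tweet = "Thank you to everyone who came out today. Let's get to work!"
new_tokens = [w.lower() for w in tt.tokenize(new_tweet) if w.isalpha() and len(w) > 1]
new_processed = ' '.join([Stemmer.stem(lemmatizer.lemmatize(w)) for w in new_tokens])
new_vector = pd.DataFrame(tfidf_vect.transform([new_processed]).toarray(), columns=tfidf_vect.get_feature_names())
print(final_svm.predict(new_vector)) # outputs the predicted username, either 'POTUS' or 'realDonaldTrump'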
Overall, this project was fun to do, as I got to put what I learned to the test with my own exercise. Natural Language Processing is a very prominent topic in the field of data science, and I am glad I got to learn it by the end of 2021. Looking to the future, I would like to play around with more text preprocessing methods and attempt to build multi-class classifiers instead of a binary classifier like the one in this project. In the end, I am happy that my models performed well, and hopefully I will get to apply what I learned in the Data Science Capstone classes I am taking in the Winter and Spring before I graduate.