Trump or Biden? (NLP Tweet Classifier)

By: Tyler Chia

Date: 12/23/21

Introduction

One of the topics I told myself I would learn by the end of the year was Natural Language Processing. Over this winter break, I began taking an NLP course on DataCamp with the hope of producing my own project out of it. I decided to build a classifier that determines whether a tweet was written by Donald Trump or the current president, Joe Biden. The methods I planned to cover included tokenizing strings with the NLTK package as well as several preprocessing steps that would help the models produce better results once the tweets were vectorized. The three models I decided to test were a Naive Bayes model, a Support Vector Machine, and a Logistic Regression. All performed decently well.

Data Collection

One of the biggest challenges I faced in this project was scraping tweets and creating a robust dataset for machine learning. Since I do not have a Twitter developer account, I needed another way to collect tweets. One method I found was an advanced scraping tool written in Python known as Twint, which allows scraping tweets from Twitter profiles without using Twitter's API. The documentation for Twint can be found here: https://github.com/twintproject/twint. The main issue with Twint was figuring out how to use it, as I ran into problems scraping more than 80 tweets as well as other issues that prevented searches from running at all. After a couple of hours, I was finally able to scrape the 2000 most recent tweets from Joe Biden's account (POTUS). His Twitter account can be found here: https://twitter.com/POTUS. However, after scraping Biden's tweets, I realized that I could not do the same for Trump, as his account was deactivated by Twitter in January of 2021. Because of this, I had to use a website known as the Trump Twitter Archive to create a dataset of his tweets. The Trump Twitter Archive can be found here: https://www.thetrumparchive.com/. The main reason I chose to build a model that differentiates specifically between Trump and Biden is that I feel they have very different mannerisms in the way they tweet. Trump tends to be the more vocal of the two, clearly displaying his emotions and opinions in his tweets, while Biden's are a bit more muted. I wanted to see if my models could pick up on these differences and ultimately predict whether a tweet was written by Trump or Biden.

In [1]:
# importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [2]:
# importing package necessary for twint to run
import nest_asyncio
nest_asyncio.apply()
In [3]:
import twint

c = twint.Config()

c.Username = 'POTUS'       # user
c.Output = "biden_tweets.csv"     # path to csv file
c.Store_csv = True       # store tweets in a csv file
c.Limit = 2000           # limit number of tweets to 2000
c.Hide_output = True
c.Pandas = True

# twint.run.Search(c)   # commented out so I don't rescrape the data
In [4]:
# reading in data scraped from twitter
df = pd.read_csv('biden_tweets.csv')
In [5]:
df.shape
Out[5]:
(2000, 36)

Ultimately, I decided to create a dataset with 4000 observations: 2000 tweets each from Trump and Biden. I settled on this size because I felt it was a large enough dataset for the models to perform well.

In [6]:
df.columns
Out[6]:
Index(['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
       'user_id', 'username', 'name', 'place', 'tweet', 'language', 'mentions',
       'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
       'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video',
       'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
       'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
       'trans_dest'],
      dtype='object')
In [7]:
# removing all other columns besides the tweet and the username
df = df.drop(columns = ['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
       'user_id', 'name', 'place', 'language', 'mentions',
       'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
       'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video',
       'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
       'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
       'trans_dest'])
In [8]:
df.head()
Out[8]:
username tweet
0 potus This morning, the First Lady and I stopped by ...
1 potus I had the honor of hosting the 2021 Kennedy Ce...
2 potus Today, I signed the Accelerating Access to Cri...
3 potus Tune in as I sign the Accelerating Access to C...
4 potus Today, I signed the bipartisan Uyghur Forced L...

As seen above, we have a rather clean-looking dataset with a username column and a tweet column, which is just the string of words written by the user. Below, I read in the data from the Trump Twitter Archive, which contained over 50,000 tweets from Trump. I needed to remove retweets, as those were not written by Trump himself, and then order the tweets by date and take the most recent 2000.

In [9]:
# reading in data taken from the Trump Twitter Archive
df1 = pd.read_csv('tweets_01-08-2021 - tweets_01-08-2021.csv')
In [10]:
df1.head()
Out[10]:
id text isRetweet isDeleted device favorites retweets date isFlagged
0 9.845497e+16 Republicans and Democrats have both created ou... f f TweetDeck 49 255 2011-08-02 18:07:48 f
1 1.234653e+18 I was thrilled to be back in the Great city of... f f Twitter for iPhone 73748 17404 2020-03-03 1:34:50 f
2 1.218011e+18 RT @CBS_Herridge: READ: Letter to surveillance... t f Twitter for iPhone 0 7396 2020-01-17 3:22:47 f
3 1.304875e+18 The Unsolicited Mail In Ballot Scam is a major... f f Twitter for iPhone 80527 23502 2020-09-12 20:10:58 f
4 1.218160e+18 RT @MZHemingway: Very friendly telling of even... t f Twitter for iPhone 0 9081 2020-01-17 13:13:59 f
In [11]:
# removing retweets by dropping any tweet whose text contains "RT"
df1 = df1[df1["text"].str.contains("RT")==False]
In [12]:
# sort by date (newest first) and take the 2000 most recent tweets
df1 = df1.sort_values(['date'], ascending=[False])[:2000]
In [13]:
df1.columns
Out[13]:
Index(['id', 'text', 'isRetweet', 'isDeleted', 'device', 'favorites',
       'retweets', 'date', 'isFlagged'],
      dtype='object')
In [14]:
# removing all columns besides the text
df1 = df1.drop(columns = ['id', 'isRetweet', 'isDeleted', 'device', 'favorites',
       'retweets', 'date', 'isFlagged'])

# renaming the text column to match the tweets column in Biden dataset
df1 = df1.rename(columns = {'text':'tweet'})
In [15]:
# creating column for username to match up with Biden dataset
df1['username'] = "realDonaldTrump"
In [16]:
df1
Out[16]:
tweet username
327 To all of those who have asked, I will not be ... realDonaldTrump
323 The 75,000,000 great American Patriots who vot... realDonaldTrump
316 https://t.co/csX07ZVWGe realDonaldTrump
221 If Vice President @Mike_Pence comes through fo... realDonaldTrump
212 Get smart Republicans. FIGHT! https://t.co/3fs... realDonaldTrump
... ... ...
11304 https://t.co/LFOSsWB58M realDonaldTrump
11360 “The Fraternal Order of Police endorsed Presid... realDonaldTrump
11305 https://t.co/73kpVWrugh realDonaldTrump
9387 https://t.co/gF4VmXWFoK realDonaldTrump
11361 https://t.co/LnBpKJE9yi realDonaldTrump

2000 rows × 2 columns

In [17]:
# concatenating the datasets
tweets = pd.concat([df, df1])
In [18]:
tweets.head()
Out[18]:
username tweet
0 potus This morning, the First Lady and I stopped by ...
1 potus I had the honor of hosting the 2021 Kennedy Ce...
2 potus Today, I signed the Accelerating Access to Cri...
3 potus Tune in as I sign the Accelerating Access to C...
4 potus Today, I signed the bipartisan Uyghur Forced L...

After finalizing the two datasets, each containing the 2000 most recent tweets from Biden and Trump respectively, I concatenated them into one large dataset of 4000 tweets ready for natural language processing.

Data Preprocessing

The data preprocessing step was where I began to implement many of the topics I had been learning in my NLP course. I started by importing the TweetTokenizer from NLTK, which separates each tweet into tokens to make processing easier. I then removed all tokens that were not alphabetical words, as I felt the model should only look at the words in a tweet rather than digits or other symbols. After that, I converted all the tokens to lowercase, which reduces the number of columns and improves model performance. Next, I noticed that some of the tokens were single letters such as "a" or "s", so I removed those as well. Lastly, I both lemmatized and stemmed each of the tokens, which, from my understanding, removes plurals from nouns and tense endings from verbs. This also improves model performance, since there are no longer separate tokens for votes, vote, voted, and voting. I then rejoined the tokens of each tweet into a single string ready to be vectorized.

Vectorizing converts each tweet into a multi-dimensional numerical representation of the words it contains, which gives the models a way to learn relationships between the words in the corpus and the author. I chose a TF-IDF vectorizer over a plain bag-of-words vectorizer because it not only captures how frequently words appear but also weights words by how informative they are across the corpus. The last preprocessing step was to remove stopwords from the tweets, which was done within the initialization of the TF-IDF vectorizer.
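As a small illustration (a sketch, not a cell from the original notebook) of why lemmatizing and then stemming helps, the different forms of "vote" mentioned above all collapse onto the same token:

from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# lemmatize first, then stem; all four inflected forms reduce to "vote"
for word in ["votes", "vote", "voted", "voting"]:
    lemma = lemmatizer.lemmatize(word)
    print(word, "->", stemmer.stem(lemma))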

In [19]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
In [20]:
# tweet tokenizing each of the tweets in the dataset
word_tokens = [tt.tokenize(sentence) for sentence in tweets.tweet]
In [21]:
# filter out non letter characters
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]
In [22]:
# convert all tokens to lowercase
lower_tokens = [[word.lower() for word in item] for item in cleaned_tokens]
In [23]:
# removing all singular characters in tokens
tokens = [[i for i in item if len(i) > 1] for item in lower_tokens]
In [24]:
# lemmatize the tokens
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatized_output = [[lemmatizer.lemmatize(w) for w in item] for item in tokens]
In [25]:
# stem the tokens and join them into a single string prepared for vectorization
from nltk.stem import PorterStemmer

Stemmer = PorterStemmer()

stemmed_output = [' '.join([Stemmer.stem(w) for w in item]) for item in lemmatized_output]
In [26]:
# adding the processed tweets back into tweets dataframe
tweets["processed"] = stemmed_output

tweets.head()
Out[26]:
username tweet processed
0 potus This morning, the First Lady and I stopped by ... thi morn the first ladi and stop by the child ...
1 potus I had the honor of hosting the 2021 Kennedy Ce... had the honor of host the kennedi center honor...
2 potus Today, I signed the Accelerating Access to Cri... today sign the acceler access to critic therap...
3 potus Tune in as I sign the Accelerating Access to C... tune in a sign the acceler access to critic th...
4 potus Today, I signed the bipartisan Uyghur Forced L... today sign the bipartisan uyghur forc labor pr...
In [27]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS

# initialize tfidf vectorizer and remove stopwords
tfidf_vect = TfidfVectorizer(stop_words = ENGLISH_STOP_WORDS, lowercase = True, ngram_range = (1,1))

# fit vectorizer to processed tweets
tfidf_vect.fit(tweets['processed'])

# transform the processed tweets with the fitted vectorizer
X2 = tfidf_vect.transform(tweets['processed'])

# create a DataFrame from the TF-IDF feature matrix
tfidf_df = pd.DataFrame(X2.toarray(), columns=tfidf_vect.get_feature_names())
In [28]:
labels = tweets['username']
In [29]:
labels = labels.reset_index()
labels = labels.drop(columns = ['index'])

# adding in labels column (username) to vectorized dataframe
tfidf_df['label'] = labels
In [30]:
tfidf_df.head()
Out[30]:
aanhpi aapi abandon abbott abc abdullah abil abl abolish abort ... yourselv youth youtub yvett zaidi zaila zelenskyy zero zip zone
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 4454 columns

As seen above, after vectorizing the tweets, the result is a dataframe containing 4454 columns, one of which is the label column that shows whether the tweet was written by Trump or Biden. This dataframe was now ready to train models.

Model Selection, Tuning, and Evaluation

With data collection and preprocessing out of the way, I was able to get to the fun part: training the models and testing their performance. I first split the data into X and y, with X being all the vectorized tweet columns and y being the label column. I then ran a train-test split using a 20% test size and a random state of 13 for reproducibility.

In [31]:
from sklearn.model_selection import train_test_split

X = tfidf_df.drop(columns = ['label'])
y = tfidf_df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)

The first model I wanted to try was a Naive Bayes model. In my NLP course, I was told that Naive Bayes and Support Vector Machine models tend to be the most effective at classifying vectorized text. I tuned the NB model's alpha (smoothing) hyperparameter through five-fold cross-validation. Below are the average training and validation accuracies after cross-validation for each of the ten alpha levels I decided to test.

In [32]:
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate

alphas = np.arange(0.1,1.1,0.1)

# create accuracy arrays
accuracyArray_train = []
accuracyArray_valid = []

# for loop that gives the average train and test scores for each cross fold
for i in alphas:
    nb = MultinomialNB(alpha = i)
    nb.fit(X_train, y_train)
    scores = cross_validate(nb, X_train, y_train, cv = 5, return_train_score=True)
    accuracyArray_train.append(np.mean(scores["train_score"]))
    accuracyArray_valid.append(np.mean(scores["test_score"]))

print("Average Training Accuracies:", accuracyArray_train)
print("Average Validation Accuracies:", accuracyArray_valid)
Average Training Accuracies: [0.98859375, 0.9871874999999999, 0.9860937499999999, 0.9848437500000001, 0.9839062500000001, 0.983046875, 0.98171875, 0.98125, 0.980703125, 0.979765625]
Average Validation Accuracies: [0.9609375, 0.9606250000000001, 0.9609375, 0.9615625, 0.9615625, 0.9621875000000001, 0.961875, 0.961875, 0.9615625, 0.9612499999999999]

Below is a graph of the training and validation accuracies at each alpha level. Training accuracy decreases as the alpha level increases, while validation accuracy peaks around alpha = 0.6.

In [33]:
# plotting training vs validation accuracies by alpha level
plt.plot(alphas, accuracyArray_valid, label = "valid")
plt.plot(alphas, accuracyArray_train, label = "train")
plt.legend(loc="center")
plt.title("Training Performance vs Validation Performance")
plt.xlabel("Alpha Level")
plt.ylabel("Accuracy")
Out[33]:
Text(0, 0.5, 'Accuracy')

Ultimately, the alpha level that produced the highest validation accuracy was 0.6. I then retrained the model with this alpha level and evaluated it on the testing data.

In [34]:
# testing model with best alpha level on testing data
best_alpha_acc = max(accuracyArray_valid)
best_alpha_idx = accuracyArray_valid.index(best_alpha_acc)
best_alpha = alphas[best_alpha_idx]

final_nb = MultinomialNB(alpha = best_alpha)
final_nb.fit(X_train, y_train)

predictions = final_nb.predict(X_test)
currAccuracy_nb = accuracy_score(y_test, predictions)
print("The Best Alpha Level is:", best_alpha)
print("The Testing Accuracy for the Best Alpha Level is:", currAccuracy_nb)
The Best Alpha Level is: 0.6
The Testing Accuracy for the Best Alpha Level is: 0.95125
In [35]:
from sklearn.metrics import plot_confusion_matrix
plot_confusion_matrix(final_nb, X_test, y_test, cmap = 'inferno')
Out[35]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f8508d68b50>

As seen above, after running the final model on the testing data, the accuracy ended up at 95.125%. The Naive Bayes model performed very well, misclassifying only 39 of the 800 tweets in the testing set.
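As a quick sanity check (a minimal sketch, not a cell from the original notebook, assuming the predictions array and y_test from the cells above are still in scope), the misclassification count can be computed directly from the test predictions:

# count how many test-set predictions disagree with the true labels
misclassified = (predictions != y_test).sum()
print(misclassified, "of", len(y_test), "test tweets were misclassified")  # 39 of 800, matching the 95.125% accuracy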

The next model I built was a Support Vector Machine. For hyperparameter tuning, I decided to vary the kernel to see which produces the best validation accuracy. I once again used five-fold cross-validation; the average training and validation accuracies are listed below, along with a graph showing how the accuracies change with the kernel.

In [36]:
from sklearn.svm import SVC
kernel = ['linear', 'poly', "rbf", 'sigmoid']

# lists to store the average training and validation accuracies for each kernel
training_mse_svm = []
validation_mse_svm = []

for j in kernel:
    svm = SVC(kernel = j)
    svm.fit(X_train, y_train)
    scores = cross_validate(svm, X_train, y_train, cv = 5, return_train_score=True)
    training_mse_svm.append(np.mean(scores["train_score"]))
    validation_mse_svm.append(np.mean(scores['test_score']))
    
print("Average Training Accuracies:", training_mse_svm)
print("Average Validation Accuracies:", validation_mse_svm)
Average Training Accuracies: [0.9915625, 0.9978125, 0.9971875000000001, 0.9806250000000001]
Average Validation Accuracies: [0.951875, 0.8456249999999998, 0.961875, 0.9487500000000001]
In [37]:
# plotting training vs validation accuracies by kernel
plt.plot(range(4), validation_mse_svm, label = "valid")
plt.plot(range(4), training_mse_svm, label = "train")
plt.legend(loc="center")
plt.title("Training Performance vs Validation Performance")
plt.xlabel("Kernel")
plt.ylabel("Accuracy")
Out[37]:
Text(0, 0.5, 'Accuracy')
In [38]:
# testing model with best kernel on testing data
best_kernel_acc = max(validation_mse_svm)
best_kernel_idx = validation_mse_svm.index(best_kernel_acc)
best_kernel = kernel[best_kernel_idx]

final_svm = SVC(kernel = best_kernel)
final_svm.fit(X_train, y_train)

predictions = final_svm.predict(X_test)
currAccuracy_svm = accuracy_score(y_test, predictions)
print("The Best Kernel is:", best_kernel)
print("The Testing Accuracy for the Best Kernel is:", currAccuracy_svm)
The Best Kernel is: rbf
The Testing Accuracy for the Best Kernel is: 0.96375
In [39]:
plot_confusion_matrix(final_svm, X_test, y_test, cmap = 'inferno')
Out[39]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f8508da22b0>

I determined that the best kernel for the SVM was 'rbf', which is also the default. This final model produced a testing accuracy of 96.375% and misclassified only 29 of the 800 test tweets.

Lastly, I built a Logistic Regression model and tuned its solver. I used the same methods on this model that I did on the previous two. Once again, the average training and validation accuracies are listed below, along with a graph showing how the accuracies change with the solver.

In [40]:
from sklearn.linear_model import LogisticRegression
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# lists to store the average training and validation accuracies for each solver
training_mse_lr = []
validation_mse_lr = []

for j in solver:
    lr = LogisticRegression(solver = j)
    lr.fit(X_train, y_train)
    scores = cross_validate(lr, X_train, y_train, cv = 5, return_train_score=True)
    training_mse_lr.append(np.mean(scores["train_score"]))
    validation_mse_lr.append(np.mean(scores['test_score']))
    
print("Average Training Accuracies:", training_mse_lr)
print("Average Validation Accuracies:", validation_mse_lr)
Average Training Accuracies: [0.9765625, 0.9765625, 0.9767187500000001, 0.9765625, 0.9765625]
Average Validation Accuracies: [0.9471875000000001, 0.9471875000000001, 0.9475, 0.9471875000000001, 0.9471875000000001]
In [41]:
# plotting training vs validation accuracies by solver
plt.plot(range(5), validation_mse_lr, label = "valid")
plt.plot(range(5), training_mse_lr, label = "train")
plt.legend(loc="center")
plt.title("Training Performance vs Validation Performance")
plt.xlabel("Solver")
plt.ylabel("Accuracy")
Out[41]:
Text(0, 0.5, 'Accuracy')
In [42]:
# testing model with best solver on testing data
best_solver_acc = max(validation_mse_lr)
best_solver_idx = validation_mse_lr.index(best_solver_acc)
best_solver = solver[best_solver_idx]

final_lr = LogisticRegression(solver = best_solver)
final_lr.fit(X_train, y_train)

predictions = final_lr.predict(X_test)
currAccuracy_lr = accuracy_score(y_test, predictions)
print("The Best Solver is:", best_solver)
print("The Testing Accuracy for the Best Solver is:", currAccuracy_lr)
The Best Solver is: liblinear
The Testing Accuracy for the Best Solver is: 0.94875
In [43]:
plot_confusion_matrix(final_lr, X_test, y_test, cmap = 'inferno')
Out[43]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f84f9f2d580>

I determined that the best solver was 'liblinear'. After running this final model on the test data, the testing accuracy came out to 94.875%, with 41 of the 800 tweets misclassified.

Results and Discussion

In [44]:
print("Naive Bayes Accuracy:", currAccuracy_nb)
print("Support Vector Machine Accuracy:", currAccuracy_svm)
print("Logistic Regression Accuracy:", currAccuracy_lr)
Naive Bayes Accuracy: 0.95125
Support Vector Machine Accuracy: 0.96375
Logistic Regression Accuracy: 0.94875

In the end, all three models performed extremely well at determining whether Trump or Biden wrote a given tweet. The Naive Bayes model reached 95.125% accuracy, the Support Vector Machine 96.375%, and the Logistic Regression 94.875%. Both the Naive Bayes model and the Support Vector Machine outperformed the Logistic Regression, which was expected. I believe part of the reason the accuracies were so high is that the data was preprocessed thoroughly.
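To tie everything together, here is a minimal sketch (not part of the original notebook) of how a brand-new tweet could be classified with the best-performing model; it assumes the TweetTokenizer tt, the lemmatizer, the Stemmer, the fitted tfidf_vect, and the trained final_svm from the cells above are still in scope.

def predict_author(text):
    # apply the same preprocessing used on the training tweets
    tokens = [w.lower() for w in tt.tokenize(text) if w.isalpha()]
    tokens = [w for w in tokens if len(w) > 1]
    processed = ' '.join(Stemmer.stem(lemmatizer.lemmatize(w)) for w in tokens)
    # vectorize with the fitted TF-IDF vocabulary and predict the username
    features = pd.DataFrame(tfidf_vect.transform([processed]).toarray(),
                            columns=tfidf_vect.get_feature_names())
    return final_svm.predict(features)[0]

# example call; the model returns either 'potus' or 'realDonaldTrump'
print(predict_author("Tune in as I deliver remarks on the bipartisan infrastructure law."))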

Overall, this project was fun to do, as I got to put what I learned to the test with my own exercise. Natural Language Processing is a very prominent topic in the field of data science, and I am glad I got to learn it by the end of 2021. Looking to the future, I would like to play around with more text preprocessing methods as well as attempt to build multi-class classifiers instead of a binary classifier like the one in this project. In the end, I am happy that my models performed well, and hopefully I will get to apply what I learned in the Data Science Capstone classes I am taking in the Winter and Spring before I graduate.