The goal of this project was to build a classifier that predicts whether a given tweet was written by Donald Trump or Joe Biden. The three classifiers used were a Naive Bayes model, a Support Vector Machine, and a Logistic Regression.
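Below is a minimal sketch of how the three models could be compared, assuming scikit-learn, TF-IDF features, and a hypothetical `tweets.csv` file with "text" and "author" columns; the actual notebook's features and preprocessing may differ.

```python
# Minimal sketch of the three-model comparison. The file name and column
# names ("text", "author") are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

tweets = pd.read_csv("tweets.csv")  # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(
    tweets["text"], tweets["author"], test_size=0.2, random_state=42
)

# Convert raw tweet text into TF-IDF features.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Fit each classifier and compare held-out accuracy.
models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    print(name, accuracy_score(y_test, preds))
```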
Python notebook containing code and outputs for data preparation can be found here.
Python notebook containing code and outputs for model tuning, training, and evaluation here.
Final Report for the project can be found here.
Final Poster can be found here.
The goal of this project was to use data from the 2019 NBA draft, along with NCAA college basketball statistics, to build a classification model that predicts whether a player would be chosen in the 2021 NBA draft. The best model I tried was a logistic regression with a tuned decision threshold: it reached 93% overall accuracy and an 80% true positive rate on the 40% test split, the highest true positive rate of any model without giving up much overall accuracy. On July 29th, I will revisit this project to see how many of the 37 players my model predicted to be drafted are actually drafted to the NBA.
Edit (Post July 29, 2021): Out of the 37 players my model predicted to be drafted, only 8 were actually drafted, a 21.6% hit rate. That said, 10 of the players I predicted to be drafted went undrafted but eventually signed with NBA G-League teams or are currently on an NBA roster. Many of the players I predicted to be drafted did not even enter the draft and stayed at their respective schools. In the data table, I have highlighted the players who were officially drafted by NBA teams in the 2021 draft.
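The thresholding step described above could look roughly like the sketch below. This is an illustration only: it uses scikit-learn with synthetic placeholder data standing in for the college-stats feature matrix, and the 0.90 accuracy floor used to pick the threshold is an assumption, not the project's actual criterion.

```python
# Sketch of tuning a decision threshold for logistic regression.
# Placeholder data below stands in for the real college-stats features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, imbalanced placeholder data (most players go undrafted).
X, y = make_classification(n_samples=500, n_features=10, weights=[0.85], random_state=0)

# 40% test split, matching the write-up.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # P(drafted)

# Scan candidate thresholds instead of the default 0.5 cutoff.
results = []
for t in np.linspace(0.05, 0.95, 19):
    preds = (probs >= t).astype(int)
    results.append((t, accuracy_score(y_test, preds),
                    recall_score(y_test, preds, zero_division=0)))

# Pick the highest true positive rate among thresholds that keep overall
# accuracy above a chosen floor (0.90 here, purely as an illustration).
candidates = [r for r in results if r[1] >= 0.90] or results
best = max(candidates, key=lambda r: r[2])
print("threshold, accuracy, true positive rate:", best)
```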
Goal #1: Predict whether or not Donald Trump would win a given county
Goal #2: Use K-Means clustering to cluster the counties and see whether any distinct groups were driven by specific demographic variables (see the sketch after these goals)
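A minimal sketch of the K-Means step follows, assuming scikit-learn and a county-level table; the file name, demographic columns, and number of clusters are all hypothetical.

```python
# Sketch of the K-Means clustering step on county demographics.
# File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

counties = pd.read_csv("county_demographics.csv")  # hypothetical file
features = ["median_income", "pct_college", "median_age"]  # hypothetical columns

# Standardize so no single demographic variable dominates the distance metric.
scaled = StandardScaler().fit_transform(counties[features])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
counties["cluster"] = kmeans.fit_predict(scaled)

# Inspect cluster means to see which demographic variables drive each group.
print(counties.groupby("cluster")[features].mean())
```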
A simple data analysis project to calculate the price elasticity of demand for Disney World single-day passes over the past decade, using attendance records and average ticket prices. This allowed me to draw conclusions about Disney's pricing strategy for its theme parks.
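The core calculation is short: price elasticity of demand is the percentage change in quantity (attendance) divided by the percentage change in price. The sketch below uses made-up placeholder numbers, not the project's actual figures.

```python
# Sketch of the elasticity calculation on yearly attendance and average
# single-day ticket prices. All values below are placeholders.
import pandas as pd

disney = pd.DataFrame({
    "year":       [2010, 2011, 2012],
    "price":      [80.0, 85.0, 90.0],
    "attendance": [17.0e6, 17.2e6, 17.5e6],
})

# Price elasticity of demand: % change in quantity / % change in price.
pct_q = disney["attendance"].pct_change()
pct_p = disney["price"].pct_change()
disney["elasticity"] = pct_q / pct_p

print(disney[["year", "elasticity"]])
```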
UCSB Data Science Club project. My group and I wanted to see if we could order a Spotify playlist by its similarity to another playlist. To do this, we pulled numerical audio features for our Spotify playlists and calculated the error between each song in the second playlist and the mean, median, and mode feature values of the first. Ranking songs by this error let us predict which songs were most similar to the first playlist.
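A sketch of the mean-based version of this scoring is below, assuming per-track audio features have already been exported (for example via the Spotify API); the file names, feature columns, and "track_name" column are illustrative, and the median and mode variants would follow the same pattern.

```python
# Sketch of ranking songs in one playlist by similarity to another,
# using mean audio features as the reference. Names are placeholders.
import pandas as pd

features = ["danceability", "energy", "valence", "tempo"]

playlist_a = pd.read_csv("playlist_a_features.csv")  # hypothetical files
playlist_b = pd.read_csv("playlist_b_features.csv")

# Summarize the reference playlist with its mean feature values.
reference = playlist_a[features].mean()

# Error of each candidate song relative to the reference summary;
# smaller total absolute error = more similar to the first playlist.
errors = (playlist_b[features] - reference).abs().sum(axis=1)
ranked = playlist_b.assign(error=errors).sort_values("error")

print(ranked[["track_name", "error"]].head())  # "track_name" is hypothetical
```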
Honors Contract project. Using a combination of data science methods and research, I proposed a mitigation strategy that would require certain homeowners, based on the purchase price of their homes, to install solar panels as a way of reducing carbon emissions.
Right before Fall quarter 2020, I participated in a Datathon hosted by the UC system and ImagineScholar. The goal of this project was to analyze and visualize data about energy and load shedding in South Africa. My partner and I looked for a correlation between instances of load shedding and international/foreign investment to gauge whether load shedding was impacting the economy.
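The correlation check itself can be done in a couple of lines, as sketched below; the file and column names are hypothetical placeholders, and Pearson correlation is assumed.

```python
# Sketch of the correlation check between load shedding and foreign
# investment. File and column names are illustrative placeholders.
import pandas as pd

za = pd.read_csv("south_africa_energy.csv")  # hypothetical file

# Pearson correlation between load-shedding hours and foreign investment.
corr = za["load_shedding_hours"].corr(za["foreign_investment_usd"])
print(f"correlation: {corr:.2f}")
```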
During the summer of 2020, I took an online data science class that taught me the basics of data science, specifically in Python. The two links below lead to the projects I completed for the class. The first is based on a dataset of 911 calls in Montgomery County, Pennsylvania in 2015 and 2016. The second is based on bank stocks from 2006 to 2015. Below the links are a few bullet points summarizing what I learned through these projects.