For this project we will be analyzing some 911 call data from Kaggle. All the data is specific to Montgomery County, Pennsylvania.
# Import Pandas and Numpy
import pandas as pd
import numpy as np
# Set %matplotlib inline so plots render in the notebook (the visualization libraries are imported further down).
%matplotlib inline
# Read in csv file
df = pd.read_csv('911.csv')
# Check the info of the dataframe
df.info()
# Check the head of the dataframe
df.head()
# What are the top 5 zipcodes for 911 calls?
df['zip'].value_counts().head()
The top five zip codes for 911 calls are 19401, 19464, 19403, 19446, and 19406. The output lists each zip code with its total number of calls to the right.
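To put the raw counts in context, the same ranking can also be expressed as a share of all calls. The snippet below is a minimal sketch on a made-up Series standing in for df['zip']; the values and the names `zips` and `top` are illustrative assumptions, not figures from 911.csv.

```python
import pandas as pd

# Toy stand-in for df['zip']; the real values come from 911.csv.
zips = pd.Series([19401, 19401, 19464, 19401, 19464, 19403, None])

# value_counts() sorts by frequency and ignores missing values by default.
counts = zips.value_counts()

# normalize=True returns the fraction of non-missing rows instead of raw counts.
shares = zips.value_counts(normalize=True)

# Combine both views for the most frequent zip codes.
top = counts.head(5).to_frame('calls')
top['share'] = shares
print(top)
```

Because missing zip codes are dropped before counting, the shares are relative to calls that have a zip code, not to all rows.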
# What are the top 5 townships for 911 calls?
df['twp'].value_counts().head()
The top five townships for 911 calls are Lower Merion, Abington, Norristown, Upper Merion, and Cheltenham, with the number of 911 calls to the right of each township name.
# How many unique title codes are there?
df['title'].nunique()
There are 110 unique title codes in the dataframe for 911 calls.
# Adding a new column to the dataframe that shows the reason for the 911 call
# Creating a function that splits the title column and takes the string before the colon
# in order to extract the reason for the call
def reason(title):
    return title.split(':')[0]
df['Reason'] = df['title'].apply(reason)
df['Reason']
# What is the most common Reason for a 911 call based off of this new column?
df['Reason'].value_counts()
The most common reason for a 911 call in Montgomery County, Pennsylvania is EMS.
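The same extraction can be done without a helper function by using the pandas string accessor. This is a sketch on a few example titles chosen to resemble the `Reason: detail` pattern; the Series `titles` and the extra `detail` column are assumptions added for illustration.

```python
import pandas as pd

# Toy titles following the "Reason: detail" pattern used in the title column.
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT -',
    'EMS: CARDIAC EMERGENCY',
])

# Vectorized equivalent of title.split(':')[0] applied to every row.
reasons = titles.str.split(':').str[0]

# The part after the first colon, with surrounding whitespace removed.
detail = titles.str.split(':', n=1).str[1].str.strip()

# Most common reason in the toy data.
print(reasons.value_counts().idxmax())
```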
# Import both seaborn and matplotlib for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x = 'Reason', data=df)
Above is a countplot that shows the total number of calls based on each of the three reasons: EMS, Fire, and Traffic.
# Convert timeStamp variable from a string to DateTime objects
type(df['timeStamp'].iloc[0])
The values in the timeStamp column are strings, so they are converted to DateTime objects below.
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
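The conversion can be made more defensive by stating the expected format and turning unparseable values into NaT instead of raising an error. The sketch below uses made-up strings; the format '%Y-%m-%d %H:%M:%S' is an assumption about how timeStamp is written and should be checked against the actual file.

```python
import pandas as pd

# Toy timestamps; the last entry is deliberately malformed.
raw = pd.Series(['2015-12-10 17:40:00', '2016-01-23 08:05:00', 'not a date'])

# errors='coerce' replaces values that do not match the format with NaT.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Count how many rows failed to parse before relying on the column.
print(parsed.isna().sum())

# Once converted, the .dt accessor exposes the date parts directly.
print(parsed.dt.hour)
print(parsed.dt.dayofweek)  # Monday is 0, Sunday is 6
```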
# Create three new columns that hold the hour, month, and day of the week respectively
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day'] = df['timeStamp'].dt.dayofweek
df.head()
The Day column holds the day of the week as an integer from 0 (Monday) to 6 (Sunday). The dictionary below maps each integer to a short name for that day.
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day'] = df['Day'].map(dmap)
df.head()
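Once the days are strings, pandas sorts them alphabetically (Fri, Mon, Sat, ...) rather than in calendar order. A minimal sketch of restoring the Monday-to-Sunday order is shown below; the Series `day_numbers` is invented for illustration.

```python
import pandas as pd

dmap = {0: 'Mon', 1: 'Tue', 2: 'Wed', 3: 'Thu', 4: 'Fri', 5: 'Sat', 6: 'Sun'}

# Toy stand-in for the integer Day column.
day_numbers = pd.Series([5, 0, 3, 0, 6])
day_names = day_numbers.map(dmap)

# Calendar order taken from the dictionary itself.
order = list(dmap.values())

# Reindex the counts so they follow Mon..Sun; days with no calls get 0.
counts = day_names.value_counts().reindex(order, fill_value=0)
print(counts)
```

The same `order` list can be passed to seaborn through the `order=` argument of countplot to control the order of the bars.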
# Creating a countplot that counts the number of calls per day of the week
# and splits them based on the reason for the call
sns.countplot(x = 'Day', data = df, hue = 'Reason')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
As the graph above shows, EMS is the most common reason for 911 calls in Montgomery County, Pennsylvania on every day of the week, followed by Traffic and then Fire.
# Creating the same countplot for months
sns.countplot(x = 'Month', data = df, hue = 'Reason')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
We can see that the number of 911 calls in Montgomery County, Pennsylvania starts to drop later in the year, with the fewest calls made in December.
The plot above also shows that some months are missing from the dataset: September, October, and November do not appear at all.
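Rather than reading the gap off the plot, the missing months can be listed directly. This is a sketch on a made-up Series standing in for df['Month']; the name `months_present` is an assumption.

```python
import pandas as pd

# Toy stand-in for df['Month'] with no rows for September through November.
months_present = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 12, 12, 1])

# Compare the months that occur against the full calendar year.
missing = sorted(set(range(1, 13)) - set(months_present.unique()))
print(missing)  # [9, 10, 11]
```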
# Create a groupby object that groups the data by month and aggregate the data
byMonth = df.groupby('Month').count()
byMonth.head()
# Create a simple lineplot that shows the number of calls based on the month
byMonth['desc'].plot()
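Note that count() reports the number of non-missing values per column, so the result depends on which column is picked afterwards. The sketch below, on an invented three-row frame, contrasts it with size(), which counts rows regardless of missing values.

```python
import pandas as pd

# Toy frame with one missing description and one missing zip code.
toy = pd.DataFrame({
    'Month': [1, 1, 2],
    'desc': ['a', None, 'b'],
    'zip': [19401.0, 19464.0, None],
})

# count(): non-null values per column within each group.
by_count = toy.groupby('Month').count()

# size(): number of rows per group, independent of any column.
by_size = toy.groupby('Month').size()

print(by_count)
print(by_size)
```

If a column has no missing values, the two approaches give the same numbers; otherwise size() is the safer measure of call volume.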
# Create a lmplot() to create a linear fit on the number of calls per month
byMonth = byMonth.reset_index()
sns.lmplot(x = 'Month', y = 'desc', data = byMonth)
Although September, October, and November are missing from the dataset, the linear fit drawn by lmplot gives a rough estimate for those months based on the months before and after the gap.
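To read actual numbers off such a fit, a straight line can be fitted with NumPy and evaluated at the missing months. This is only a sketch: the call counts below are invented (and exactly linear, to keep the example easy to follow), and np.polyfit is swapped in here in place of the regression that seaborn draws.

```python
import numpy as np

# Months that are present in the data (September to November are absent).
months = np.array([1, 2, 3, 4, 5, 6, 7, 8, 12])

# Made-up monthly call counts, not values from 911.csv.
calls = 13000 - 250 * months

# Least-squares straight line: calls is approximately slope * month + intercept.
slope, intercept = np.polyfit(months, calls, deg=1)

# Evaluate the fitted line at the missing months.
missing = np.array([9, 10, 11])
estimates = slope * missing + intercept
print(dict(zip(missing.tolist(), estimates.round().tolist())))
```

With only nine points, such estimates are rough; the shaded band that lmplot draws around the line gives a sense of that uncertainty.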
# Create a new column that shows the date
df['Date'] = df['timeStamp'].dt.date
df.head()
# Perform a groupby on the data that groups together by date
byDate = df.groupby('Date').count()
byDate['desc'].plot()
plt.tight_layout()
# Create a plot that shows the number of traffic related calls based on date
byTraffic = df[df['Reason'] == 'Traffic']
byTraffic.groupby('Date').count()['desc'].plot()
plt.title('Traffic')
plt.tight_layout()
According to the graph, the number of traffic-related 911 calls in Montgomery County, Pennsylvania rose sharply in the middle of January 2016.
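The date of a spike like this can be pinned down from the daily counts instead of being estimated by eye. The sketch below uses an invented frame with Date and Reason columns; the dates are placeholders and do not indicate when the real spike occurred.

```python
import pandas as pd

# Toy calls; dates and reasons are made up for illustration.
toy = pd.DataFrame({
    'Date': pd.to_datetime([
        '2016-01-22', '2016-01-23', '2016-01-23',
        '2016-01-23', '2016-01-24', '2016-01-23',
    ]).date,
    'Reason': ['Traffic', 'Traffic', 'Traffic', 'Traffic', 'Traffic', 'EMS'],
})

# Daily number of traffic-related calls.
per_day = toy[toy['Reason'] == 'Traffic'].groupby('Date').size()

# Date with the highest count and the count itself.
peak_day = per_day.idxmax()
print(peak_day, per_day.max())
```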
# Create a plot that shows the number of fire related calls based on date
byFire = df[df['Reason'] == 'Fire']
byFire.groupby('Date').count()['desc'].plot()
plt.title('Fire')
plt.tight_layout()
# Create a plot that shows the number of EMS related calls based on date
byEMS = df[df['Reason'] == 'EMS']
byEMS.groupby('Date').count()['desc'].plot()
plt.title('EMS')
plt.tight_layout()
# Restructure the dataframe so that the columns become the Hours and the Index becomes the Day of the Week.
dayandhour = df.groupby(by=['Day','Hour']).count()['Reason'].unstack()
dayandhour.head()
# Create a heatmap using seaborn
sns.heatmap(dayandhour, cmap='coolwarm')
The heatmap suggests that 911 calls in Montgomery County, Pennsylvania peak between 4 and 5 pm.
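A reading like this can be checked numerically by locating the busiest hour and the busiest day and hour combination in the table behind the heatmap. This is a sketch on a small invented frame; the names `toy`, `cell_counts`, and `table` are assumptions.

```python
import pandas as pd

# Toy calls with the same Day and Hour columns as the notebook.
toy = pd.DataFrame({
    'Day': ['Mon', 'Mon', 'Mon', 'Tue', 'Tue', 'Fri', 'Fri'],
    'Hour': [16, 16, 9, 16, 17, 17, 8],
})

# Calls per (Day, Hour) pair, then the same data as a Day x Hour table.
cell_counts = toy.groupby(['Day', 'Hour']).size()
table = cell_counts.unstack(fill_value=0)

# Busiest hour across all days, and its share of all calls.
by_hour = table.sum(axis=0)
peak_hour = by_hour.idxmax()
peak_share = by_hour.max() / by_hour.sum()

# Busiest single cell of the heatmap.
peak_day, peak_cell_hour = cell_counts.idxmax()
print(peak_hour, round(peak_share, 2), peak_day, peak_cell_hour)
```

The share makes it clear whether the busiest hour holds most of the calls or is simply the highest point of the day.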
# Create a clustermap based on the data
sns.clustermap(dayandhour, cmap='coolwarm')
# Create a heatmap using seaborn that is based on month as opposed to hour
dayandmonth = df.groupby(by=['Day','Month']).count()['Reason'].unstack()
dayandmonth.head()
sns.heatmap(dayandmonth, cmap='coolwarm')
The heatmap above shows a sharp increase in the number of 911 calls on Saturdays in January in Montgomery County, Pennsylvania.
# Create a new clustermap
sns.clustermap(dayandmonth, cmap='coolwarm')