911 Calls Project

Tyler Chia

For this project we will be analyzing some 911 call data from Kaggle. All the data is specific to Montgomery County, Pennsylvania.

Question 1: Data and Setup

In [1]:
# Import Pandas and Numpy
In [2]:
import pandas as pd
import numpy as np
In [3]:
# Import visualization libraries and set %matplotlib inline.
In [4]:
%matplotlib inline
In [5]:
# Read in csv file
In [6]:
df = pd.read_csv('911.csv')
In [7]:
# Check the info of the dataframe
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   lat        99492 non-null  float64
 1   lng        99492 non-null  float64
 2   desc       99492 non-null  object 
 3   zip        86637 non-null  float64
 4   title      99492 non-null  object 
 5   timeStamp  99492 non-null  object 
 6   twp        99449 non-null  object 
 7   addr       98973 non-null  object 
 8   e          99492 non-null  int64  
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB
In [9]:
# Check the head of the dataframe
In [10]:
df.head()
Out[10]:
lat lng desc zip title timeStamp twp addr e
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1

Question 2: Basic Information From the Dataframe

In [11]:
# What are the top 5 zipcodes for 911 calls?
In [12]:
df['zip'].value_counts().head()
Out[12]:
19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64

The top five zip codes for 911 calls are 19401, 19464, 19403, 19446, and 19406; the number of calls for each appears to the right of the zip code.
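The zip codes print as floats (19401.0) because the column has missing values, so pandas stores it as float64. A minimal sketch, assuming a small made-up frame rather than the real 911.csv, of converting to the nullable Int64 dtype so the codes print without the trailing decimal:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values mimic the zip column but are not the real data.
sample = pd.DataFrame({'zip': [19401.0, 19464.0, np.nan, 19401.0]})

# The nullable integer dtype keeps missing values as <NA> instead of forcing floats.
zip_int = sample['zip'].astype('Int64')

# value_counts() skips missing values by default.
top = zip_int.value_counts().head()
print(top)
```

The same conversion could be applied to df['zip'] before calling value_counts().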

In [13]:
# What are the top 5 townships for 911 calls?
In [14]:
df['twp'].value_counts().head()
Out[14]:
LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64

The top five townships for 911 calls are Lower Merion, Abington, Norristown, Upper Merion, and Cheltenham; the number of calls for each appears to the right of the township name.

In [15]:
# How many unique title codes are there?
In [16]:
df['title'].nunique()
Out[16]:
110

There are 110 unique title codes in the dataframe for 911 calls.

Question 3: Creating New Features

In [17]:
# Adding a new column to the dataframe that shows the reason for the 911 call
In [18]:
# Creating a function that splits the title column and takes the string before the colon 
# in order to extract the reason for the call
In [19]:
def reason(title):
    return title.split(':')[0]
In [20]:
df['Reason'] = df['title'].apply(reason)
In [21]:
df['Reason']
Out[21]:
0            EMS
1            EMS
2           Fire
3            EMS
4            EMS
          ...   
99487    Traffic
99488    Traffic
99489        EMS
99490        EMS
99491    Traffic
Name: Reason, Length: 99492, dtype: object
In [22]:
# What is the most common Reason for a 911 call based off of this new column?
In [23]:
df['Reason'].value_counts()
Out[23]:
EMS        48877
Traffic    35695
Fire       14920
Name: Reason, dtype: int64

The most common reason for a 911 call in Montgomery County, Pennsylvania is EMS.
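The Reason column can also be built without a helper function by using the pandas string methods. A minimal sketch, assuming a few example titles in the same "Reason: detail" shape as the title column (the third title is made up for illustration):

```python
import pandas as pd

# Example titles; the first two appear in the head() output, the third is hypothetical.
titles = pd.Series([
    'EMS: BACK PAINS/INJURY',
    'Fire: GAS-ODOR/LEAK',
    'Traffic: VEHICLE ACCIDENT',
])

# Split once on the first colon and keep the part before it.
reasons = titles.str.split(':', n=1).str[0]
print(reasons.tolist())
```

On the real dataframe this would read df['title'].str.split(':', n=1).str[0].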

Question 4: Visualization of the Data

In [24]:
# Import both seaborn and matplotlib for data visualization
In [25]:
import seaborn as sns
In [26]:
import matplotlib.pyplot as plt
In [27]:
sns.countplot(x = 'Reason', data=df)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x118301eb0>

Above is a countplot that shows the total number of calls based on each of the three reasons: EMS, Fire, and Traffic.
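The bar heights are easier to compare as shares of all calls. A short sketch using the three counts from the value_counts() output above; on the dataframe itself, df['Reason'].value_counts(normalize=True) should give the same proportions:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({'EMS': 48877, 'Traffic': 35695, 'Fire': 14920})

# Share of all calls attributed to each reason.
share = counts / counts.sum()
print(share.round(3))
```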

In [28]:
# Convert timeStamp variable from a string to DateTime objects
In [29]:
type(df['timeStamp'].iloc[0])
Out[29]:
str

The objects in the timeStamp column are strings.

In [30]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
In [31]:
# Create three new columns that hold the hour, month, and day of the week respectively
In [32]:
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day'] = df['timeStamp'].dt.dayofweek
df.head()
Out[32]:
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 3
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 3
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1 Fire 17 12 3
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1 EMS 17 12 3
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1 EMS 17 12 3

The Day column stores the day of the week as an integer from 0 (Monday) to 6 (Sunday). I then created a dictionary to map each integer to a short name for that day.

In [33]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
In [34]:
df['Day'] = df['Day'].map(dmap)

df.head()
Out[34]:
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 Thu
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 Thu
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1 Fire 17 12 Thu
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1 EMS 17 12 Thu
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1 EMS 17 12 Thu
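The day names can also come straight from pandas instead of a hand-written dictionary. A minimal sketch, assuming two timestamps copied from the head() output; dt.day_name() returns full English names by default, so the slice to three letters is only there to match dmap:

```python
import pandas as pd

# Two timestamps taken from the head() output above.
ts = pd.to_datetime(pd.Series(['2015-12-10 17:40:00', '2015-12-10 17:40:01']))

# Full weekday name, shortened to the three-letter form used by dmap.
day_short = ts.dt.day_name().str[:3]
print(day_short.tolist())
```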
In [35]:
# Creating a countplot that counts the number of calls per day of the week
# and splits them based on the reason for the call
In [36]:
sns.countplot(x = 'Day', data = df, hue = 'Reason')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[36]:
<matplotlib.legend.Legend at 0x118301730>

As the graph above shows, EMS is the leading reason for 911 calls in Montgomery County, Pennsylvania on every day of the week, followed by traffic and then fire.

In [37]:
# Creating the same countplot for months
In [38]:
sns.countplot(x = 'Month', data = df, hue = 'Reason')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Out[38]:
<matplotlib.legend.Legend at 0x11824e460>

We can see that the lowest number of 911 calls in Montgomery County, Pennsylvania was made in December. Note that the first rows of the dataframe are dated 2015-12-10, so December may be only partly covered, which would lower its count.

The plot above also makes it clear that some months are missing: September, October, and November do not appear in the dataset.
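The gap can be confirmed in code rather than read off the plot. A small sketch, assuming the set of months present is the nine months this analysis reports; on the real dataframe, months_present would be df['Month'].unique():

```python
# Months that appear in the data according to this analysis
# (September, October, and November are absent).
months_present = [1, 2, 3, 4, 5, 6, 7, 8, 12]

# Calendar months with no calls recorded.
missing = sorted(set(range(1, 13)) - set(months_present))
print(missing)
```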

In [39]:
# Create a groupby object that groups the data by month and aggregate the data
In [40]:
byMonth = df.groupby('Month').count()
byMonth.head()
Out[40]:
         lat    lng   desc    zip  title  timeStamp    twp   addr      e  Reason   Hour    Day
Month
1      13205  13205  13205  11527  13205      13205  13203  13096  13205   13205  13205  13205
2      11467  11467  11467   9930  11467      11467  11465  11396  11467   11467  11467  11467
3      11101  11101  11101   9755  11101      11101  11092  11059  11101   11101  11101  11101
4      11326  11326  11326   9895  11326      11326  11323  11283  11326   11326  11326  11326
5      11423  11423  11423   9946  11423      11423  11420  11378  11423   11423  11423  11423
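The zip, twp, and addr columns show smaller numbers than the others because count() tallies non-null values per column. A minimal sketch on made-up rows, contrasting count() with size(), which counts rows per group regardless of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one zip is missing, as happens in the real zip column.
toy = pd.DataFrame({
    'Month': [1, 1, 2],
    'zip': [19401.0, np.nan, 19464.0],
    'desc': ['a', 'b', 'c'],
})

per_column = toy.groupby('Month').count()  # non-null values per column
per_row = toy.groupby('Month').size()      # rows per group, missing values included

print(per_column)
print(per_row)
```

Picking a column with no missing values, such as desc, makes count() agree with size(), which is why byMonth['desc'] works as a call count here.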
In [41]:
# Create a simple lineplot that shows the number of calls based on the month
In [42]:
byMonth['desc'].plot()
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x118981160>
In [43]:
# Create a lmplot() to create a linear fit on the number of calls per month
In [44]:
byMonth = byMonth.reset_index()
sns.lmplot(x = 'Month', y = 'desc', data = byMonth)
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x1158db040>

Even though September, October, and November are missing from the dataset, we can use an lmplot to fit a linear trend to the monthly counts and use that line to estimate the number of calls in the missing months.
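The line that lmplot draws is a linear regression fit, and the same kind of fit can be computed with numpy to get actual numbers. This is only a sketch: it uses the five monthly counts visible in byMonth.head() above, not all nine months that the lmplot used, so its estimates are illustrative and will differ from the plotted line:

```python
import numpy as np

# Call counts for months 1 to 5, copied from the desc column of byMonth.head() above.
months = np.array([1, 2, 3, 4, 5])
calls = np.array([13205, 11467, 11101, 11326, 11423])

# Least-squares straight line: calls = slope * month + intercept.
slope, intercept = np.polyfit(months, calls, deg=1)

# Evaluate the line at the missing months. This is an extrapolation,
# so the values should be treated with caution.
missing_months = np.array([9, 10, 11])
estimates = slope * missing_months + intercept
print(slope, intercept)
print(estimates)
```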

In [45]:
# Create a new column that shows the date
In [46]:
df['Date'] = df['timeStamp'].apply(lambda x: x.date())
df.head()
Out[46]:
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day Date
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 Thu 2015-12-10
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 Thu 2015-12-10
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1 Fire 17 12 Thu 2015-12-10
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1 EMS 17 12 Thu 2015-12-10
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1 EMS 17 12 Thu 2015-12-10
In [47]:
# Perform a groupby on the data that groups together by date
In [48]:
byDate = df.groupby('Date').count()
byDate['desc'].plot()
plt.tight_layout()
In [49]:
# Create a plot that shows the number of traffic related calls based on date
In [50]:
byTraffic = df[df['Reason'] == 'Traffic']
byTraffic.groupby('Date').count()['desc'].plot()
plt.title('Traffic')
plt.tight_layout()

According to the graph, there was a sharp spike in traffic-related 911 calls in Montgomery County, Pennsylvania in the middle of January 2016.

In [51]:
# Create a plot that shows the number of fire related calls based on date
In [52]:
byFire = df[df['Reason'] == 'Fire']
byFire.groupby('Date').count()['desc'].plot()
plt.title('Fire')
plt.tight_layout()
In [53]:
# Create a plot that shows the number of EMS related calls based on date
In [54]:
byEMS = df[df['Reason'] == 'EMS']
byEMS.groupby('Date').count()['desc'].plot()
plt.title('EMS')
plt.tight_layout()
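The three per-reason plots repeat the same filter and groupby. A possible shortcut, sketched on made-up rows with only the Date and Reason columns, is to group by both columns at once and unstack the reasons into columns:

```python
import datetime as dt

import pandas as pd

# Hypothetical rows; only the Date and Reason columns matter here.
toy = pd.DataFrame({
    'Date': [dt.date(2015, 12, 10)] * 3 + [dt.date(2015, 12, 11)],
    'Reason': ['EMS', 'EMS', 'Fire', 'Traffic'],
})

# One row per date, one column per reason; absent combinations become 0.
per_reason = toy.groupby(['Date', 'Reason']).size().unstack(fill_value=0)
print(per_reason)
```

Calling per_reason.plot(subplots=True) on the result should then draw one panel per reason.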
In [55]:
# Restructure the dataframe so that the columns become the Hours and the Index becomes the Day of the Week. 
In [56]:
dayandhour = df.groupby(by=['Day','Hour']).count()['Reason'].unstack()
dayandhour.head()
Out[56]:
Hour     0     1     2     3     4     5     6     7     8     9  ...    14    15    16    17    18    19    20    21    22    23
Day
Fri    275   235   191   175   201   194   372   598   742   752  ...   932   980  1039   980   820   696   667   559   514   474
Mon    282   221   201   194   204   267   397   653   819   786  ...   869   913   989   997   885   746   613   497   472   325
Sat    375   301   263   260   224   231   257   391   459   640  ...   789   796   848   757   778   696   628   572   506   467
Sun    383   306   286   268   242   240   300   402   483   620  ...   684   691   663   714   670   655   537   461   415   330
Thu    278   202   233   159   182   203   362   570   777   828  ...   876   969   935  1013   810   698   617   553   424   354

5 rows × 24 columns
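Because groupby sorts the Day labels alphabetically, the rows come out as Fri, Mon, Sat, Sun, Thu rather than in calendar order. A minimal sketch, using two columns of the table above, of putting the rows back into weekday order with reindex:

```python
import pandas as pd

# A slice of the dayandhour table above (hours 16 and 17 only).
toy = pd.DataFrame(
    {16: [1039, 989, 848, 663, 935], 17: [980, 997, 757, 714, 1013]},
    index=pd.Index(['Fri', 'Mon', 'Sat', 'Sun', 'Thu'], name='Day'),
)

# Calendar order; days absent from the slice are skipped rather than filled with NaN.
week = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ordered = toy.reindex([d for d in week if d in toy.index])
print(ordered)
```

Reordering before plotting would make the heatmap rows read Monday through Sunday.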

In [57]:
# Create a heatmap using seaborn
In [58]:
sns.heatmap(dayandhour, cmap='coolwarm')
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a2f2cd0>

It seems that 911 calls in Montgomery County, Pennsylvania peak in the late afternoon, in the hours starting at 4 and 5 pm.
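The busiest hour can be read from the table as well as from the colors. A sketch using hours 15 to 18 for the five days visible in the dayandhour.head() output; the other two days are not shown in this notebook export, so the totals here are partial:

```python
import pandas as pd

# Hours 15 to 18 for the five days shown in the dayandhour.head() output above.
shown = pd.DataFrame(
    {15: [980, 913, 796, 691, 969],
     16: [1039, 989, 848, 663, 935],
     17: [980, 997, 757, 714, 1013],
     18: [820, 885, 778, 670, 810]},
    index=['Fri', 'Mon', 'Sat', 'Sun', 'Thu'],
)

# Sum over the days, then pick the hour with the largest total.
per_hour = shown.sum(axis=0)
peak_hour = per_hour.idxmax()
print(per_hour)
print(peak_hour)
```

On the full table, dayandhour.sum().idxmax() would give the peak hour across all seven days.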

In [59]:
# Create a clustermap based on the data
In [60]:
sns.clustermap(dayandhour, cmap='coolwarm')
Out[60]:
<seaborn.matrix.ClusterGrid at 0x11b0f3ca0>
In [61]:
# Create a heatmap using seaborn that is based on month as opposed to hour
In [62]:
dayandmonth = df.groupby(by=['Day','Month']).count()['Reason'].unstack()
dayandmonth.head()
Out[62]:
Month     1     2     3     4     5     6     7     8    12
Day
Fri    1970  1581  1525  1958  1730  1649  2045  1310  1065
Mon    1727  1964  1535  1598  1779  1617  1692  1511  1257
Sat    2291  1441  1266  1734  1444  1388  1695  1099   978
Sun    1960  1229  1102  1488  1424  1333  1672  1021   907
Thu    1584  1596  1900  1601  1590  2065  1646  1230  1266
In [63]:
sns.heatmap(dayandmonth, cmap='coolwarm')
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b043640>

From the heatmap above, Saturdays in January stand out with an unusually high number of 911 calls in Montgomery County, Pennsylvania (2291 in the table).
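The hottest cell of the heatmap can be located programmatically. A sketch over the five rows visible in the dayandmonth.head() output (the remaining two days are not shown in this export, so this is not the full table):

```python
import pandas as pd

# The five rows visible in the dayandmonth.head() output above.
shown = pd.DataFrame(
    [[1970, 1581, 1525, 1958, 1730, 1649, 2045, 1310, 1065],
     [1727, 1964, 1535, 1598, 1779, 1617, 1692, 1511, 1257],
     [2291, 1441, 1266, 1734, 1444, 1388, 1695, 1099, 978],
     [1960, 1229, 1102, 1488, 1424, 1333, 1672, 1021, 907],
     [1584, 1596, 1900, 1601, 1590, 2065, 1646, 1230, 1266]],
    index=['Fri', 'Mon', 'Sat', 'Sun', 'Thu'],
    columns=[1, 2, 3, 4, 5, 6, 7, 8, 12],
)

# unstack() on a DataFrame returns a Series keyed by (column, index) pairs,
# so idxmax() gives the (month, day) of the largest count.
month, day = shown.unstack().idxmax()
print(month, day, shown.loc[day, month])
```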

In [64]:
# Create a new clustermap
In [65]:
sns.clustermap(dayandmonth, cmap='coolwarm')
Out[65]:
<seaborn.matrix.ClusterGrid at 0x11b868130>