US Traffic Accident Analysis (2017) (Practice Project 5)

Published: 2024/3/7 45 豆豆


# Extract the 2017 records
import pandas as pd

data = pd.read_csv('./US_Accidents_Dec19.csv')
datacopy = data.copy()
datacopy['Start_Time'] = pd.to_datetime(datacopy['Start_Time'])
datacopy['year'] = datacopy['Start_Time'].dt.year
data1 = datacopy[datacopy['year'] == 2017]
# The row index is written as well; it comes back as the 'Unnamed: 0' column on reload
data1.to_csv('./USaccident2017.csv')

Analysis of USaccident2017.csv
Import the required packages

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import folium
import pandas as pd
import webbrowser
from pyecharts import options as opts
from pyecharts.charts import Page, Pie, Bar, Line, Scatter
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
import xgboost as xgb

data = pd.read_csv('./USaccident2017.csv')
data.shape  # (717483, 51)
data.head()

Columns (51): Unnamed: 0, ID, Source, TMC, Severity, Start_Time, End_Time, Start_Lat, Start_Lng, End_Lat, End_Lng, Distance(mi), Description, Number, Street, Side, City, County, State, Zipcode, Country, Timezone, Airport_Code, Weather_Timestamp, Temperature(F), Wind_Chill(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Direction, Wind_Speed(mph), Precipitation(in), Weather_Condition, Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight, year

data.head() (first five rows, shown transposed; only fields whose values are legible in the listing):

Field	0	1	2	3	4
Unnamed: 0	9206	9207	9208	9209	9210
ID	A-9207	A-9208	A-9209	A-9210	A-9211
Source	MapQuest	MapQuest	MapQuest	MapQuest	MapQuest
TMC	201.0	201.0	201.0	241.0	222.0
Start_Time	2017-01-01 00:17:36	2017-01-01 00:26:08	2017-01-01 00:53:41	2017-01-01 01:18:51	2017-01-01 01:20:12
End_Time	2017-01-01 00:47:12	2017-01-01 01:16:06	2017-01-01 01:22:35	2017-01-01 01:48:01	2017-01-01 01:49:47
Start_Lat	37.925392	37.878185	38.014820	37.912056	37.925392
Start_Lng	-122.320595	-122.307175	-121.640579	-122.323982	-122.320595
End_Lat	NaN	NaN	NaN	NaN	NaN
Distance(mi)	0.01	0.01	0.00	0.01	0.01
Description	Accident on I-80 Westbound at Exit 15 Cutting ...	Accident on I-580 Southbound at Exit 12 I-80 I...	Accident on Taylor Rd Southbound at Bethel Isl...	Lane blocked and queueing traffic due to accid...	Queueing traffic due to accident on I-80 Westb...
Number	NaN	NaN	2998.0	NaN	NaN
Street	I-80 E	I-580 W	Taylor Ln	Bayview Ave	I-80 E
City	El Cerrito	Berkeley	Oakley	Richmond	El Cerrito
County	Contra Costa	Alameda	Contra Costa	Contra Costa	Contra Costa
Zipcode	94530	94710	94561	94804	94530
Timezone	US/Pacific	US/Pacific	US/Pacific	US/Pacific	US/Pacific
Airport_Code	KCCR	KOAK	KCCR	KCCR	KCCR
Weather_Timestamp	2017-01-01 00:53:00	2017-01-01 00:53:00	2017-01-01 00:53:00	2017-01-01 01:11:00	2017-01-01 01:11:00
Temperature(F)	44.1	51.1	44.1	44.1	44.1
Wind_Chill(F)	40.8	NaN	40.8	42.5	42.5
Humidity(%)	79.0	83.0	79.0	82.0	82.0
Pressure(in)	29.91	29.97	29.91	29.95	29.95
Visibility(mi)	10.0	10.0	10.0	9.0	9.0
Wind_Speed(mph)	5.8	11.5	5.8	3.5	3.5
Precipitation(in)	NaN	NaN	NaN	NaN	NaN
Weather_Condition	Partly Cloudy	Overcast	Partly Cloudy	Mostly Cloudy	Mostly Cloudy
Sunrise_Sunset	Night	Night	Night	Night	Night
year	2017	2017	2017	2017	2017

Field descriptions

https://www.jianshu.com/p/9e597dc8ae71

# Check which columns contain missing values
null_counts = data.isnull().sum()
null_counts[null_counts != 0]

# Handle missing values
# Drop columns that have no influence or are not analysed
deletelist = ['Unnamed: 0', 'ID', 'TMC', 'End_Lat', 'End_Lng', 'Airport_Code', 'Weather_Timestamp',
              'Wind_Chill(F)', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'year', 'Number']
data1 = data.drop(deletelist, axis=1)

# Drop rows that have missing values in these columns
data1 = data1.dropna(axis=0, subset=['City', 'Zipcode', 'Timezone', 'Sunrise_Sunset'])

# Fill temperature, humidity, pressure and visibility with the column mean
for col in ['Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)']:
    data1[col] = data1[col].fillna(data1[col].mean())

# Fill wind speed from the nearest valid observation (needs SciPy; 'order' has no effect for this method)
data1['Wind_Speed(mph)'] = data1['Wind_Speed(mph)'].interpolate(method='nearest')

# Fill weather condition and wind direction with the mode
# (mode() returns a Series, so take its first element to get a scalar fill value)
data1['Weather_Condition'] = data1['Weather_Condition'].fillna(data1['Weather_Condition'].mode()[0])
data1['Wind_Direction'] = data1['Wind_Direction'].fillna(data1['Wind_Direction'].mode()[0])

# Missing precipitation is treated as 0
data1['Precipitation(in)'] = data1['Precipitation(in)'].fillna(0)

# Wind direction: merge labels that mean the same thing
occupation = {"CALM": "Calm", "N": "North", "S": "South", "W": "West", "E": "East", "VAR": "Variable"}
f = lambda x: occupation.get(x, x)  # look the value up in occupation, keep it unchanged if absent
data1['Wind_Direction'] = data1['Wind_Direction'].map(f)

# Finally reset the index, because some rows were dropped
data1 = data1.reset_index(drop=True)
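One detail of the mode-based fill is easy to get wrong: `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument on index labels, so only rows whose label matches get filled. The sketch below illustrates this on made-up values (the toy column is an assumption, not data from the file).

```python
import numpy as np
import pandas as pd

# Toy column standing in for Weather_Condition (values are made up).
weather = pd.Series(["Clear", np.nan, "Clear", "Rain", np.nan])

# mode() returns a Series; here it holds one entry under index label 0.
mode_series = weather.mode()

# Filling with the Series aligns on index labels: the missing rows
# (labels 1 and 4) have no matching label, so they stay missing.
filled_by_series = weather.fillna(mode_series)

# Filling with the scalar replaces every missing value.
filled_by_scalar = weather.fillna(mode_series[0])

print("still missing after Series fill:", filled_by_series.isna().sum())
print("still missing after scalar fill:", filled_by_scalar.isna().sum())
```

Taking the first element of `mode()` also settles ties, since `mode()` can return several values when they are equally frequent.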

3. Data visualization

# States with the most accidents
state_counts = data1['State'].value_counts()
a = (
    Bar(init_opts=opts.InitOpts(width="2000px", height="400px"))
    .add_xaxis(state_counts.index.tolist())
    .add_yaxis('Accidents per state', state_counts.tolist(), color='#499C9F')
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
)
a.render_notebook()

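The five states with the most accidents are CA (California), TX (Texas), FL (Florida), NY (New York) and NC (North Carolina), all of them relatively developed states with large populations.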

# Accidents by hour of day
hour_counts = pd.DatetimeIndex(data1["Start_Time"]).hour.value_counts().sort_index()
x = [str(i) for i in hour_counts.index.tolist()]  # pyecharts needs string categories
b = (
    Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
    .add_xaxis(x)
    .add_yaxis('Accidents per hour', hour_counts.tolist(), color='#F7BA0B', is_smooth=True)
    .set_series_opts(
        label_opts=opts.LabelOpts(is_show=False),
        markarea_opts=opts.MarkAreaOpts(data=[
            opts.MarkAreaItem(name="Morning rush", x=("6", "9")),
            opts.MarkAreaItem(name="Evening rush", x=("15", "18")),
        ]),
    )
    .set_global_opts(xaxis_opts=opts.AxisOpts(name='Hour of day', name_location="center", name_gap=40))
)
b.render_notebook()

The morning and evening rush hours stand out clearly; these are the periods when traffic is heaviest and most congested.
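The rush-hour observation can also be checked numerically rather than read off the chart. A minimal sketch, assuming the same 6-9 and 15-18 windows that the chart shades; the timestamps are made up and only stand in for the `Start_Time` column.

```python
import pandas as pd

# Made-up timestamps standing in for data1["Start_Time"].
start_time = pd.Series(pd.to_datetime([
    "2017-01-02 07:15:00", "2017-01-02 08:40:00", "2017-01-02 12:05:00",
    "2017-01-02 16:30:00", "2017-01-02 17:10:00", "2017-01-02 23:50:00",
]))

hours = start_time.dt.hour

# between() is inclusive on both ends, so hours 6..9 and 15..18 count.
morning = hours.between(6, 9)
evening = hours.between(15, 18)

# Mean of a boolean mask is the fraction of rows where it is True.
rush_share = (morning | evening).mean()
print(f"share of accidents inside the rush-hour windows: {rush_share:.1%}")
```

The two windows cover 8 of 24 hours, so a share well above one third would support the reading of the chart.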

# Accidents by month
month_counts = pd.DatetimeIndex(data1["Start_Time"]).month.value_counts().sort_index()
x = [str(i) for i in month_counts.index.tolist()]  # pyecharts needs string categories
q = (
    Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
    .add_xaxis(x)
    .add_yaxis('Accidents per month', month_counts.tolist(), color='#AED54C', is_smooth=True,
               areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
    .set_global_opts(xaxis_opts=opts.AxisOpts(name='month', name_location="center", name_gap=40))
)
q.render_notebook()

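Clearly more accidents were recorded in the second half of the year than in the first, perhaps because the second half has more holidays and a heavier workload.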
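The half-year comparison can be put into numbers with a small aggregation. The sketch below uses made-up dates and hypothetical names (`half`, `half_counts`); it only shows the shape of the calculation, not results from the dataset.

```python
import pandas as pd

# Made-up dates standing in for data1["Start_Time"].
start_time = pd.Series(pd.to_datetime([
    "2017-01-15", "2017-03-02", "2017-08-20",
    "2017-09-05", "2017-11-11", "2017-12-24",
]))

months = start_time.dt.month

# Label each record by half-year: months 1-6 are H1, months 7-12 are H2.
half = months.map(lambda m: "H1" if m <= 6 else "H2")

# reindex keeps both labels present even if one half has no records.
half_counts = half.value_counts().reindex(["H1", "H2"], fill_value=0)

ratio = half_counts["H2"] / half_counts["H1"]
print(half_counts.to_dict(), f"H2/H1 ratio: {ratio:.2f}")
```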

# Accidents under each weather condition (top 10)
weather10 = data1['Weather_Condition'].value_counts().head(10)
c = (
    Bar()
    .add_xaxis(weather10.index.tolist())
    .add_yaxis('Accidents per weather condition', weather10.tolist(), color='#48A43F')
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15))  # rotate labels that are too long
    )
)
c.render_notebook()

Most accidents happened in clear weather, but the next three or four conditions are overcast or cloudy, so weather still plays some part in accidents.

# Severity of accidents that happened in clear weather
Clear_weather = data1[data1['Weather_Condition'] == 'Clear'].copy()
occupation = {1: "Minor accident", 2: "Ordinary accident", 3: "Major accident", 4: "Severe accident"}
f = lambda x: occupation.get(x, x)  # look the value up in occupation
Clear_weather['Severity'] = Clear_weather['Severity'].map(f)
severity_counts = Clear_weather['Severity'].value_counts()
# Pair each label with its own count instead of relying on a hard-coded label order
d = (
    Pie()
    .add("Severity", [list(z) for z in zip(severity_counts.index.tolist(), severity_counts.tolist())])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
)
d.render_notebook()

# Conditions at the time of the accident
# Daytime or not: 1 means daytime
m = lambda x: 1 if x == 'Day' else 0
data1['Sunrise_Sunset'] = data1['Sunrise_Sunset'].apply(m)
# Precipitation or not
m = lambda x: 1 if x > 0 else 0
data1['PrecipitationORnot'] = data1['Precipitation(in)'].apply(m)

flag_cols = ['Crossing', 'PrecipitationORnot', 'Sunrise_Sunset', 'Traffic_Signal', 'Give_Way', 'Bump']
# Cast every flag to int so all counts share the same 0/1 index before concatenating
df0 = pd.concat([data1[c].astype(int).value_counts() for c in flag_cols], axis=1, keys=flag_cols)
df0 = df0.reindex([0, 1]).fillna(0).astype(int)

h = (
    Bar()
    .add_xaxis(['Crossing', 'Precipitation', 'Daytime', 'Traffic signal', 'Give-way sign', 'Speed bump'])
    .add_yaxis("0", df0.loc[0].tolist(), stack="stack1", color='#992572')
    .add_yaxis("1", df0.loc[1].tolist(), stack="stack1", color='#4A203B')
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(title_opts=opts.TitleOpts(title="Bar - stacked data (all)"))
)
h.render_notebook()

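Most accidents happened in daytime, away from crossings, without precipitation, and at locations with no traffic signal, no give-way sign and no speed bump.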

# Visibility at the time of the accident
# Bins follow the visibility scale table on Baidu Baike
data1["Visibility_bin"] = "Poor"
data1.loc[(data1["Visibility(mi)"] > 2.5) & (data1["Visibility(mi)"] <= 6.5), "Visibility_bin"] = "Moderate"
data1.loc[(data1["Visibility(mi)"] > 6.5) & (data1["Visibility(mi)"] <= 12), "Visibility_bin"] = "Good"
data1.loc[(data1["Visibility(mi)"] > 12), "Visibility_bin"] = "Very good"
visibility_counts = data1["Visibility_bin"].value_counts()
# Pair each label with its own count instead of relying on a hard-coded label order
d = (
    Pie()
    .add("Visibility", [list(z) for z in zip(visibility_counts.index.tolist(), visibility_counts.tolist())])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    .set_global_opts(title_opts=opts.TitleOpts(title="Visibility at the time of the accident"))
)
d.render_notebook()

Most accidents happened when visibility was good, that is, between 6.5 and 12 miles (1 mile is roughly 1.6 km).
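The visibility binning above can also be expressed with `pandas.cut`, which keeps the thresholds in one place. A minimal sketch on made-up values; the English label names are illustrative, while the thresholds (2.5, 6.5 and 12 miles) are the ones used in the code above.

```python
import numpy as np
import pandas as pd

# Made-up visibility readings in miles, including values on the bin edges.
visibility = pd.Series([0.5, 2.5, 3.0, 6.5, 10.0, 12.0, 15.0])

# Intervals are right-closed by default: (-inf, 2.5], (2.5, 6.5], (6.5, 12], (12, inf)
bins = [-np.inf, 2.5, 6.5, 12, np.inf]
labels = ["poor", "moderate", "good", "very good"]

visibility_bin = pd.cut(visibility, bins=bins, labels=labels)

# sort=False keeps the categories in bin order instead of sorting by count.
print(visibility_bin.value_counts(sort=False))
```

Because the intervals are right-closed, a reading of exactly 6.5 falls in "moderate" and exactly 12 falls in "good", matching the `>` and `<=` comparisons.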

# Draw a US map with folium and mark 3000 accident locations sampled at random from data1
incidents = folium.map.FeatureGroup()
datasample = data1.sample(3000)
# Loop through the sampled accidents and add each one to the incidents feature group
for lat, lng in zip(datasample.Start_Lat, datasample.Start_Lng):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=3,  # size of the circle marker
            color='yellow',
            fill=True,
            fill_color='red',
            fill_opacity=0.4,
        )
    )
# Add the incidents to the map
US_map = folium.Map(location=[38, -100], zoom_start=4)
US_map.add_child(incidents)

The map shows that most accidents occurred in developed coastal areas.

4. Predicting severity with xgboost

4.1 Preprocessing before modeling

dataX = data1.copy()
# Extract month and hour (Start_Time is read back from the CSV as text, so parse it first)
start_time = pd.to_datetime(dataX['Start_Time'])
dataX['month'] = start_time.dt.month
dataX['hour'] = start_time.dt.hour
# Drop features that are not useful for modeling
deletelist2 = ['Source', 'Side', 'Start_Time', 'End_Time', 'Description', 'Street', 'City', 'County',
               'State', 'Zipcode', 'Country', 'Timezone', 'Wind_Direction']
dataX = dataX.drop(deletelist2, axis=1)
# The helper columns added for the charts are not in the feature list below, so drop them if present
dataX = dataX.drop(columns=['PrecipitationORnot', 'Visibility_bin'], errors='ignore')
# Convert False to 0 and True to 1
list3 = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout',
         'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
for i in list3:
    dataX[i] = dataX[i].astype(int)
# Severity 1 has far too few samples (extreme imbalance), so drop those rows
dataX = dataX[dataX['Severity'] != 1].reset_index(drop=True)
dataX.head()

Columns (27): Severity, Start_Lat, Start_Lng, Distance(mi), Temperature(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Speed(mph), Precipitation(in), Weather_Condition, Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop, Sunrise_Sunset, month, hour

dataX.head() (selected columns):

	Severity	Start_Lat	Start_Lng	Distance(mi)	Temperature(F)	Humidity(%)	Pressure(in)	Visibility(mi)	Wind_Speed(mph)	Weather_Condition	month	hour
0	3	37.925392	-122.320595	0.01	44.1	79.0	29.91	10.0	5.8	Partly Cloudy	1	0
1	3	37.878185	-122.307175	0.01	51.1	83.0	29.97	10.0	11.5	Overcast	1	0
2	2	38.014820	-121.640579	0.00	44.1	79.0	29.91	10.0	5.8	Partly Cloudy	1	0
3	3	37.912056	-122.323982	0.01	44.1	82.0	29.95	9.0	3.5	Mostly Cloudy	1	1
4	3	37.925392	-122.320595	0.01	44.1	82.0	29.95	9.0	3.5	Mostly Cloudy	1	1

y = dataX['Severity']  # label column
Xw = dataX['Weather_Condition']  # column that needs one-hot encoding
X = dataX.drop(['Severity', 'Weather_Condition'], axis=1)
# One-hot encoding
enc = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(Xw.values.reshape(-1, 1))
result = enc.transform(Xw.values.reshape(-1, 1)).toarray()
Xw1 = pd.DataFrame(result)
Xw1.shape  # (716669, 78)

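After one-hot encoding, the single Weather_Condition column becomes a (716669, 78) matrix.
That many sparse columns can encourage overfitting, so PCA is used to reduce the dimensionality.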
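When compressing one-hot columns with PCA, it helps to check how much variance the kept components retain. The sketch below is an illustration on synthetic categories, not the author's pipeline: it uses `pandas.get_dummies` for brevity instead of `OneHotEncoder`, and the category names, probabilities and component count are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic categorical column with a skewed distribution (assumed values).
rng = np.random.default_rng(0)
categories = ["Clear", "Overcast", "Mostly Cloudy", "Partly Cloudy", "Rain", "Fog"]
probs = [0.5, 0.2, 0.1, 0.1, 0.07, 0.03]
weather = pd.Series(rng.choice(categories, size=1000, p=probs))

# One indicator column per category.
one_hot = pd.get_dummies(weather).astype(float)

# Keep three principal components and project the indicators onto them.
pca = PCA(n_components=3)
reduced = pca.fit_transform(one_hot)

# Fraction of the total variance captured by the kept components.
retained = pca.explained_variance_ratio_.sum()
print(reduced.shape, f"variance retained: {retained:.1%}")
```

If the retained fraction is low, more components (or grouping rare categories before encoding) may be worth considering.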

pca = PCA(n_components=5)
pca.fit(Xw1)
col = pca.transform(Xw1)
Xw1 = pd.DataFrame(col)
Xw1.head()

	0	1	2	3	4
0	-0.365071	-0.105292	0.643741	-0.661940	-0.141977
1	-0.544094	0.715639	-0.290545	-0.018498	-0.060453
2	-0.365071	-0.105292	0.643741	-0.661940	-0.141977
3	-0.482130	-0.685853	-0.468909	-0.024822	-0.071162
4	-0.482130	-0.685853	-0.468909	-0.024822	-0.071162

# Scale the remaining features (RobustScaler centers on the median and scales by the interquartile range)
columns = X.columns.tolist()
robustS = RobustScaler()
X = pd.DataFrame(robustS.fit_transform(X), columns=columns)
X.head()

Columns (25): Start_Lat, Start_Lng, Distance(mi), Temperature(F), Humidity(%), Pressure(in), Visibility(mi), Wind_Speed(mph), Precipitation(in), Amenity, Bump, Crossing, Give_Way, Junction, No_Exit, Railway, Roundabout, Station, Stop, Traffic_Calming, Traffic_Signal, Turning_Loop, Sunrise_Sunset, month, hour

X.head() (selected columns):

	Start_Lat	Start_Lng	Distance(mi)	Temperature(F)	Humidity(%)	Pressure(in)	Visibility(mi)	Wind_Speed(mph)	Sunrise_Sunset	month	hour
0	0.287847	-0.995895	0.5	-0.849372	0.371429	-0.454545	0.0	-0.189655	-1.0	-1.166667	-1.500
1	0.281328	-0.995506	0.5	-0.556485	0.485714	-0.181818	0.0	0.793103	-1.0	-1.166667	-1.500
2	0.300197	-0.976155	0.0	-0.849372	0.371429	-0.454545	0.0	-0.189655	-1.0	-1.166667	-1.500
3	0.286006	-0.995993	0.5	-0.849372	0.457143	-0.272727	-1.0	-0.586207	-1.0	-1.166667	-1.375
4	0.287847	-0.995895	0.5	-0.849372	0.457143	-0.272727	-1.0	-0.586207	-1.0	-1.166667	-1.375
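With its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range (75th minus 25th percentile), which makes it less sensitive to outliers than mean/standard-deviation scaling. The same arithmetic can be sketched by hand with NumPy; the sample values are made up.

```python
import numpy as np

# Made-up distances in miles with two large outliers.
distance = np.array([0.0, 0.0, 0.01, 0.01, 0.01, 0.02, 1.5, 30.0])

median = np.median(distance)
q1, q3 = np.percentile(distance, [25, 75])
iqr = q3 - q1

# Robust scaling: center on the median, scale by the interquartile range.
scaled = (distance - median) / iqr

print("median:", median, "IQR:", iqr)
print("scaled:", np.round(scaled, 3))
```

After scaling, the column has median 0 and an interquartile range of 1, while the outliers remain large instead of compressing the rest of the values.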

Concatenate X and Xw1

X1 = pd.concat([X, Xw1], axis=1)
X1.shape  # (716669, 30)

# xgboost only accepts class labels from 0 to num_class - 1, so map severities 2, 3, 4 to 0, 1, 2
def f(x):
    if x == 2:
        return 0
    elif x == 3:
        return 1
    else:
        return 2

y1 = y.apply(f)
y1.value_counts()

0 461657
1 230899
2 24113
Name: Severity, dtype: int64
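The class counts above are still imbalanced after severity 1 is removed. One common option, which the original analysis does not use, is to weight samples inversely to class frequency. The sketch below only computes class shares and "balanced" weights from the counts listed above, using the formula n_samples / (n_classes * class_count); the variable names are hypothetical.

```python
import pandas as pd

# Class counts as listed above (labels 0, 1, 2 correspond to severity 2, 3, 4).
class_counts = pd.Series({0: 461657, 1: 230899, 2: 24113})

# Share of each class in the data.
class_share = class_counts / class_counts.sum()

# "Balanced" weights: rarer classes get proportionally larger weights.
balanced_weight = class_counts.sum() / (len(class_counts) * class_counts)

print(class_share.round(4).to_dict())
print(balanced_weight.round(3).to_dict())
```

Such weights could then be passed per sample when building the training matrix, although whether that improves this model would need to be tested.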

Modeling with xgboost

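param1 = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,  # number of classes, used together with multi:softmax
    'gamma': 0.1,  # controls post-pruning; larger is more conservative, typically around 0.1 or 0.2
    'max_depth': 12,  # tree depth; deeper trees overfit more easily
    'lambda': 2,  # L2 regularization on the weights; larger values make overfitting less likely
    'subsample': 0.7,  # random subsampling of training rows
    'colsample_bytree': 0.7,  # column subsampling when building each tree
    'min_child_weight': 3,
    'verbosity': 0,  # no runtime messages; stands in for the deprecated 'silent': 1
    'eta': 0.007,  # acts like a learning rate
    'seed': 1000,
    'nthread': 4,  # number of CPU threads
}

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=0)
xg_train = xgb.DMatrix(X_train, label=y_train)
xg_test = xgb.DMatrix(X_test, label=y_test)
bst1 = xgb.train(param1, xg_train)  # num_boost_round is left at its default of 10
pred1 = bst1.predict(xg_test)
print(accuracy_score(y_test, pred1))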

0.7609220422230595 (accuracy)
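An accuracy figure is easier to judge next to a trivial baseline. A minimal sketch, assuming the class proportions in the test split are close to those of the full data set: always predicting the most frequent class would already score the majority share, so the gap to the reported accuracy is what the model adds.

```python
# Class counts from the label distribution shown earlier (labels 0, 1, 2).
class_counts = {0: 461657, 1: 230899, 2: 24113}
reported_accuracy = 0.7609220422230595

total = sum(class_counts.values())

# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = max(class_counts.values()) / total

# Improvement of the reported model over that baseline.
gain = reported_accuracy - majority_baseline

print(f"majority-class baseline: {majority_baseline:.4f}")
print(f"gain over baseline:      {gain:.4f}")
```

Because the rarest class is only a small share of the data, per-class recall would say more about it than overall accuracy does.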
Inspect feature importance

from xgboost import plot_importance
plot_importance(bst1)  # bst1 is the booster trained above
plt.show()

Hyperparameter tuning (work in progress)
