
US Traffic Accident Analysis (2017) (Practice Project 5)

Published: 2024/3/7

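Contents

        • 1. Project summary
        • 2. Data processing (analysis only; modeling preprocessing comes later)
        • 3. Data visualization
        • 4. Predicting severity with XGBoost
          • 4.1 Preprocessing before modeling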

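1. Project summary

Project purpose: data analysis practice
Data source: Kaggle
Source code, dataset, and field descriptions (Baidu Cloud link):
URL: https://pan.baidu.com/s/1UD5HD69bNEsX2EkjaQ1IPg
Extraction code: 8gd8
Analysis goals:

  • Basic analysis of the data: which states have the most accidents, when accidents tend to happen, and what the weather was at the time, with visualizations that describe the overall picture of accidents in the US in 2017
  • Predict accident severity with XGBoost and see which factors are most related to severity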

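2. Data processing (analysis only; modeling preprocessing comes later)

The original dataset (US_Accidents_Dec19.csv) has 49 columns and about 3 million rows covering traffic accidents from 2016 to 2019. Given hardware and time limits, only accidents from 2017 are analyzed here (see the source files for details).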

    # Keep only the 2017 records
    import pandas as pd

    data = pd.read_csv('./US_Accidents_Dec19.csv')
    datacopy = data.copy()
    datacopy['Start_Time'] = pd.to_datetime(datacopy['Start_Time'])
    datacopy['year'] = datacopy['Start_Time'].apply(lambda x: x.year)
    data1 = datacopy[datacopy['year'] == 2017]
    data1.to_csv('./USaccident2017.csv')
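Since hardware limits are the reason for working on one year only, the same filter can also be applied while reading the file in pieces, so the full 2016 to 2019 table never has to sit in memory at once. The following is a minimal sketch, not the author's method: the helper name `filter_year_in_chunks` and the demo rows are made up for illustration, and only the `Start_Time` column name comes from the dataset.

```python
import io

import pandas as pd


def filter_year_in_chunks(csv_source, year, chunksize=100_000):
    """Read a large CSV piece by piece and keep rows whose Start_Time falls in `year`."""
    kept = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        start = pd.to_datetime(chunk["Start_Time"])
        kept.append(chunk[start.dt.year == year])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for the real file (made-up rows, for illustration only).
demo_csv = io.StringIO(
    "ID,Start_Time\n"
    "A-1,2016-12-31 23:59:00\n"
    "A-2,2017-01-01 00:17:36\n"
    "A-3,2017-06-15 08:00:00\n"
    "A-4,2018-01-01 00:00:00\n"
)
subset = filter_year_in_chunks(demo_csv, 2017, chunksize=2)
print(subset["ID"].tolist())
```

With the real file, the first argument would be the path './US_Accidents_Dec19.csv'; a suitable `chunksize` depends on available memory.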

The analysis below works on USaccident2017.csv.
Import the required packages:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    import folium
    import pandas as pd
    import webbrowser
    from pyecharts import options as opts
    from pyecharts.charts import Page, Pie, Bar, Line, Scatter
    from sklearn.preprocessing import RobustScaler
    from sklearn.metrics import accuracy_score
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.model_selection import train_test_split
    import xgboost as xgb

    data = pd.read_csv('./USaccident2017.csv')
    data.shape  # (717483, 51)
    data.head()

Output of data.head(), transposed (one line per column, values for rows 0 to 4):

    Column | 0 | 1 | 2 | 3 | 4
    Unnamed: 0 | 9206 | 9207 | 9208 | 9209 | 9210
    ID | A-9207 | A-9208 | A-9209 | A-9210 | A-9211
    Source | MapQuest | MapQuest | MapQuest | MapQuest | MapQuest
    TMC | 201.0 | 201.0 | 201.0 | 241.0 | 222.0
    Severity | 3 | 3 | 2 | 3 | 3
    Start_Time | 2017-01-01 00:17:36 | 2017-01-01 00:26:08 | 2017-01-01 00:53:41 | 2017-01-01 01:18:51 | 2017-01-01 01:20:12
    End_Time | 2017-01-01 00:47:12 | 2017-01-01 01:16:06 | 2017-01-01 01:22:35 | 2017-01-01 01:48:01 | 2017-01-01 01:49:47
    Start_Lat | 37.925392 | 37.878185 | 38.014820 | 37.912056 | 37.925392
    Start_Lng | -122.320595 | -122.307175 | -121.640579 | -122.323982 | -122.320595
    End_Lat | NaN | NaN | NaN | NaN | NaN
    End_Lng | NaN | NaN | NaN | NaN | NaN
    Distance(mi) | 0.01 | 0.01 | 0.00 | 0.01 | 0.01
    Description | Accident on I-80 Westbound at Exit 15 Cutting ... | Accident on I-580 Southbound at Exit 12 I-80 I... | Accident on Taylor Rd Southbound at Bethel Isl... | Lane blocked and queueing traffic due to accid... | Queueing traffic due to accident on I-80 Westb...
    Number | NaN | NaN | 2998.0 | NaN | NaN
    Street | I-80 E | I-580 W | Taylor Ln | Bayview Ave | I-80 E
    Side | R | R | R | R | R
    City | El Cerrito | Berkeley | Oakley | Richmond | El Cerrito
    County | Contra Costa | Alameda | Contra Costa | Contra Costa | Contra Costa
    State | CA | CA | CA | CA | CA
    Zipcode | 94530 | 94710 | 94561 | 94804 | 94530
    Country | US | US | US | US | US
    Timezone | US/Pacific | US/Pacific | US/Pacific | US/Pacific | US/Pacific
    Airport_Code | KCCR | KOAK | KCCR | KCCR | KCCR
    Weather_Timestamp | 2017-01-01 00:53:00 | 2017-01-01 00:53:00 | 2017-01-01 00:53:00 | 2017-01-01 01:11:00 | 2017-01-01 01:11:00
    Temperature(F) | 44.1 | 51.1 | 44.1 | 44.1 | 44.1
    Wind_Chill(F) | 40.8 | NaN | 40.8 | 42.5 | 42.5
    Humidity(%) | 79.0 | 83.0 | 79.0 | 82.0 | 82.0
    Pressure(in) | 29.91 | 29.97 | 29.91 | 29.95 | 29.95
    Visibility(mi) | 10.0 | 10.0 | 10.0 | 9.0 | 9.0
    Wind_Direction | WSW | West | WSW | SW | SW
    Wind_Speed(mph) | 5.8 | 11.5 | 5.8 | 3.5 | 3.5
    Precipitation(in) | NaN | NaN | NaN | NaN | NaN
    Weather_Condition | Partly Cloudy | Overcast | Partly Cloudy | Mostly Cloudy | Mostly Cloudy
    Amenity | False | False | False | False | False
    Bump | False | False | False | False | False
    Crossing | False | True | False | False | False
    Give_Way | False | False | False | False | False
    Junction | False | False | False | False | False
    No_Exit | False | False | False | False | False
    Railway | False | False | False | False | False
    Roundabout | False | False | False | False | False
    Station | False | False | False | False | False
    Stop | False | False | False | False | False
    Traffic_Calming | False | False | False | False | False
    Traffic_Signal | True | False | False | False | True
    Turning_Loop | False | False | False | False | False
    Sunrise_Sunset | Night | Night | Night | Night | Night
    Civil_Twilight | Night | Night | Night | Night | Night
    Nautical_Twilight | Night | Night | Night | Night | Night
    Astronomical_Twilight | Night | Night | Night | Night | Night
    year | 2017 | 2017 | 2017 | 2017 | 2017

Field descriptions

https://www.jianshu.com/p/9e597dc8ae71

    # Check which columns have missing values
    data.isnull().sum()[data.isnull().sum() != 0]

    # Handle missing values
    # Drop columns that have no bearing on the analysis or are not analyzed
    deletelist = ['Unnamed: 0', 'ID', 'TMC', 'End_Lat', 'End_Lng', 'Airport_Code', 'Weather_Timestamp',
                  'Wind_Chill(F)', 'Civil_Twilight', 'Nautical_Twilight', 'Astronomical_Twilight', 'year', 'Number']
    data1 = data.drop(deletelist, axis=1)
    # Drop rows with missing values in these columns
    data1 = data1.dropna(axis=0, subset=['City', 'Zipcode', 'Timezone', 'Sunrise_Sunset'])
    # Fill temperature, humidity, pressure, and visibility with the column mean
    data1['Temperature(F)'] = data1['Temperature(F)'].fillna(data1['Temperature(F)'].mean())
    data1['Humidity(%)'] = data1['Humidity(%)'].fillna(data1['Humidity(%)'].mean())
    data1['Pressure(in)'] = data1['Pressure(in)'].fillna(data1['Pressure(in)'].mean())
    data1['Visibility(mi)'] = data1['Visibility(mi)'].fillna(data1['Visibility(mi)'].mean())
    # Fill wind speed from the nearest neighbor
    data1['Wind_Speed(mph)'] = data1['Wind_Speed(mph)'].interpolate(method='nearest', order=4)
    # Fill weather condition and wind direction with the mode
    # (mode() returns a Series, so take its first value; passing the Series itself would be aligned by index)
    data1['Weather_Condition'] = data1['Weather_Condition'].fillna(data1['Weather_Condition'].mode()[0])
    data1['Wind_Direction'] = data1['Wind_Direction'].fillna(data1['Wind_Direction'].mode()[0])
    # Precipitation: treat missing values as 0
    data1['Precipitation(in)'] = data1['Precipitation(in)'].fillna(0)
    # Wind direction: merge labels that mean the same thing
    occupation = {"CALM": "Calm", "N": "North", "S": "South", "W": "West", "E": "East", "VAR": "Variable"}
    f = lambda x: occupation.get(x, x)  # look up the replacement in occupation, else keep the value
    data1['Wind_Direction'] = data1['Wind_Direction'].map(f)
    # Finally reset the index, because some rows were dropped
    data1.index = range(len(data1))
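One detail worth checking when filling a column with its mode: `Series.mode()` returns a Series rather than a single value, and `fillna` aligns a Series argument by index label. The toy series below (made-up values, not taken from the dataset) is a small sketch of the difference between passing the Series and passing its first element.

```python
import pandas as pd

# Toy column with two missing entries (illustrative values only).
s = pd.Series(["Clear", None, "Overcast", "Clear", None])

# mode() returns a Series whose index 0 holds the most frequent value.
# fillna aligns it by index, so only a gap at index 0 could ever be filled.
filled_by_series = s.fillna(s.mode())

# Taking the scalar fills every missing entry.
filled_by_scalar = s.fillna(s.mode()[0])

print(filled_by_series.isna().sum())  # missing values left when passing the Series
print(filled_by_scalar.isna().sum())  # missing values left when passing the scalar
```

If several values tie for most frequent, `mode()` returns all of them in sorted order, so `[0]` picks the first of those.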

3. Data visualization

    # States with the most accidents
    a = (
        Bar(init_opts=opts.InitOpts(width="2000px", height="400px"))
        .add_xaxis(data1['State'].value_counts().index.tolist())
        .add_yaxis('Accidents per state', data1['State'].value_counts().tolist(), color='#499C9F')
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    )
    a.render_notebook()


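The five states with the most accidents are CA (California), TX (Texas), FL (Florida), NY (New York), and NC (North Carolina).
All of them are relatively developed states with large populations.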

    # Accidents by hour of day
    x1 = pd.DatetimeIndex(data1["Start_Time"]).hour.value_counts().sort_index().index.tolist()
    x = [str(i) for i in x1]  # pyecharts needs string labels on the x axis
    from pyecharts.charts import Line
    b = (
        Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
        .add_xaxis(x)
        .add_yaxis('Accidents per hour',
                   pd.DatetimeIndex(data1["Start_Time"]).hour.value_counts().sort_index().tolist(),
                   color='#F7BA0B', is_smooth=True)
        .set_series_opts(
            label_opts=opts.LabelOpts(is_show=False),
            markarea_opts=opts.MarkAreaOpts(data=[
                opts.MarkAreaItem(name="Morning rush", x=("6", "9")),
                opts.MarkAreaItem(name="Evening rush", x=("15", "18")),
            ]),
        )
        .set_global_opts(xaxis_opts=opts.AxisOpts(name='Hour of day', name_location="center", name_gap=40))
    )
    b.render_notebook()


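The morning rush (6 to 9) and evening rush (15 to 18) stand out clearly; these are the hours when traffic is heaviest and most complex.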
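To put a number on how much the rush hours stand out, one can compute the share of accidents that fall inside each window. The sketch below uses only the standard library; the function names and the six sample timestamps are made up for illustration, and the 6 to 9 and 15 to 18 windows are the ones marked on the chart.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count events per hour of day from 'YYYY-MM-DD HH:MM:SS' strings."""
    hours = (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for ts in timestamps)
    counts = Counter(hours)
    return [counts.get(h, 0) for h in range(24)]


def window_share(counts, start_hour, end_hour):
    """Share of all events with start_hour <= hour <= end_hour."""
    total = sum(counts)
    return sum(counts[start_hour:end_hour + 1]) / total if total else 0.0


# Made-up timestamps, for illustration only.
sample = [
    "2017-01-02 07:15:00",
    "2017-01-02 08:40:00",
    "2017-01-02 12:05:00",
    "2017-01-02 16:30:00",
    "2017-01-02 17:55:00",
    "2017-01-02 23:10:00",
]
counts = hourly_counts(sample)
print(window_share(counts, 6, 9))    # morning window
print(window_share(counts, 15, 18))  # evening window
```

On the real data, the input would be the `Start_Time` values as strings, for example `data1["Start_Time"].astype(str)`.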

    # Accidents by month
    x1 = pd.DatetimeIndex(data1["Start_Time"]).month.value_counts().sort_index().index.tolist()
    x = [str(i) for i in x1]  # pyecharts needs string labels on the x axis
    from pyecharts.charts import Line
    q = (
        Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
        .add_xaxis(x)
        .add_yaxis('Accidents per month',
                   pd.DatetimeIndex(data1["Start_Time"]).month.value_counts().sort_index().tolist(),
                   color='#AED54C', is_smooth=True, areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
        .set_global_opts(xaxis_opts=opts.AxisOpts(name='month', name_location="center", name_gap=40))
    )
    q.render_notebook()


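The second half of the year has clearly more accidents than the first half, perhaps because it has more holidays and a heavier workload.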

    # Accidents by weather condition (top 10)
    weather10 = data1['Weather_Condition'].value_counts().head(10)
    c = (
        Bar()
        .add_xaxis(weather10.index.tolist())
        .add_yaxis('Accidents per weather condition', weather10.tolist(), color='#48A43F')
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15))  # rotate labels that are too long
        )
    )
    c.render_notebook()


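Most accidents happened in Clear weather, but the next three or four conditions are overcast or cloudy, so weather still plays a part in some accidents.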

    # Accident severity in clear weather
    Clear_wearther = data1[data1['Weather_Condition'] == 'Clear'].copy()  # copy so the mapping below does not touch data1
    occupation = {1: "Minor", 2: "Moderate", 3: "Serious", 4: "Severe"}
    f = lambda x: occupation.get(x, x)  # look up the label in occupation, else keep the value
    Clear_wearther['Severity'] = Clear_wearther['Severity'].map(f)
    severity_counts = Clear_wearther['Severity'].value_counts()
    # Take the labels from the counts themselves so names and values cannot get out of order
    d = (
        Pie()
        .add("Severity", [list(z) for z in zip(severity_counts.index.tolist(), severity_counts.tolist())])
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
    )
    d.render_notebook()

    # Conditions at the time of the accident
    # Daytime flag: 1 means day
    m = lambda x: 1 if x == 'Day' else 0
    data1['Sunrise_Sunset'] = data1['Sunrise_Sunset'].apply(m)
    # Precipitation flag: 1 means some precipitation
    m = lambda x: 1 if x > 0 else 0
    data1['PrecipitationORnot'] = data1['Precipitation(in)'].apply(m)
    # Cast the two 0/1 columns to bool so every count shares the same False/True index
    df0 = pd.concat([
        data1['Crossing'].value_counts(),
        data1['PrecipitationORnot'].astype(bool).value_counts(),
        data1['Sunrise_Sunset'].astype(bool).value_counts(),
        data1['Traffic_Signal'].value_counts(),
        data1['Give_Way'].value_counts(),
        data1['Bump'].value_counts(),
    ], axis=1)
    h = (
        Bar()
        .add_xaxis(['Crossing', 'Precipitation', 'Daytime', 'Traffic signal', 'Give-way sign', 'Speed bump'])
        .add_yaxis("0", df0.loc[False].tolist(), stack="stack1", color='#992572')
        .add_yaxis("1", df0.loc[True].tolist(), stack="stack1", color='#4A203B')
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(title_opts=opts.TitleOpts(title="Bar - stacked data (all)"))
    )
    h.render_notebook()


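Most accidents happened in the daytime, away from crossings, with no precipitation, no traffic signal, no give-way sign, and no speed bump.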

    # Visibility at the time of the accident
    # Binned following the visibility scale on Baidu Baike
    data1["Visibility_bin"] = "Poor"
    data1.loc[(data1["Visibility(mi)"] > 2.5) & (data1["Visibility(mi)"] <= 6.5), "Visibility_bin"] = "Moderate"
    data1.loc[(data1["Visibility(mi)"] > 6.5) & (data1["Visibility(mi)"] <= 12), "Visibility_bin"] = "Good"
    data1.loc[(data1["Visibility(mi)"] > 12), "Visibility_bin"] = "Very good"
    visibility_counts = data1["Visibility_bin"].value_counts()
    # Take the labels from the counts themselves so names and values cannot get out of order
    d = (
        Pie()
        .add("Visibility", [list(z) for z in zip(visibility_counts.index.tolist(), visibility_counts.tolist())])
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
        .set_global_opts(title_opts=opts.TitleOpts(title="Visibility at the time of the accident"))
    )
    d.render_notebook()


Most accidents happened when visibility was good, that is, between 6.5 and 12 miles (1 mile is about 1.6 km).
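The binning rule can also be written as one small function, which makes the boundary behaviour easy to check. This is a sketch using only the standard library: the thresholds 2.5, 6.5, and 12 miles are the ones used in the article, while the function name and the English bin names are chosen here for illustration.

```python
from bisect import bisect_left

# Upper bounds of the first three bins, in miles (thresholds from the article).
BIN_EDGES = [2.5, 6.5, 12]
BIN_LABELS = ["Poor", "Moderate", "Good", "Very good"]


def visibility_bin(miles):
    """Return the bin for a visibility value; each upper bound belongs to the lower bin."""
    return BIN_LABELS[bisect_left(BIN_EDGES, miles)]


for value in (2.5, 2.6, 6.5, 10.0, 12, 12.5):
    print(value, visibility_bin(value))
```

`bisect_left` places a value equal to an edge in the lower bin, which matches the `> 2.5` and `<= 6.5` style comparisons of the original rule.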

    # Draw a map of the US with folium and plot 3000 accident locations sampled at random from data1
    incidents = folium.map.FeatureGroup()
    datasample = data1.sample(3000)
    # Loop through the sampled accidents and add each one to the incidents feature group
    for lat, lng in zip(datasample.Start_Lat, datasample.Start_Lng):
        incidents.add_child(
            folium.CircleMarker(
                [lat, lng],
                radius=3,  # size of the circle markers
                color='yellow',
                fill=True,
                fill_color='red',
                fill_opacity=0.4,
            )
        )
    # Add incidents to map
    US_map = folium.Map(location=[38, -100], zoom_start=4)
    US_map.add_child(incidents)


The map shows that most accidents occur in the developed coastal regions.

4. Predicting severity with XGBoost

4.1 Preprocessing before modeling
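    dataX = data1.copy()
    # Extract month and hour (Start_Time is read from the CSV as text, so parse it first)
    dataX['Start_Time'] = pd.to_datetime(dataX['Start_Time'])
    dataX['month'] = dataX['Start_Time'].apply(lambda x: x.month)
    dataX['hour'] = dataX['Start_Time'].apply(lambda x: x.hour)
    # Drop features that are not useful for modeling,
    # including the two helper columns created for the charts in section 3
    deletelist2 = ['Source', 'Side', 'Start_Time', 'End_Time', 'Description', 'Street', 'City', 'County',
                   'State', 'Zipcode', 'Country', 'Timezone', 'Wind_Direction',
                   'PrecipitationORnot', 'Visibility_bin']
    dataX = dataX.drop(deletelist2, axis=1)
    # Replace False with 0 and True with 1
    list3 = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout',
             'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
    m = lambda x: 1 if x else 0
    for i in list3:
        dataX[i] = dataX[i].apply(m)
    # Severity 1 has too few rows and is extremely imbalanced, so drop it
    dataX = dataX.drop(index=(dataX.loc[(dataX['Severity'] == 1)].index))
    dataX.index = range(len(dataX))
    dataX.head()

Output of dataX.head(), transposed (one line per column, values for rows 0 to 4):

    Column | 0 | 1 | 2 | 3 | 4
    Severity | 3 | 3 | 2 | 3 | 3
    Start_Lat | 37.925392 | 37.878185 | 38.014820 | 37.912056 | 37.925392
    Start_Lng | -122.320595 | -122.307175 | -121.640579 | -122.323982 | -122.320595
    Distance(mi) | 0.01 | 0.01 | 0.00 | 0.01 | 0.01
    Temperature(F) | 44.1 | 51.1 | 44.1 | 44.1 | 44.1
    Humidity(%) | 79.0 | 83.0 | 79.0 | 82.0 | 82.0
    Pressure(in) | 29.91 | 29.97 | 29.91 | 29.95 | 29.95
    Visibility(mi) | 10.0 | 10.0 | 10.0 | 9.0 | 9.0
    Wind_Speed(mph) | 5.8 | 11.5 | 5.8 | 3.5 | 3.5
    Precipitation(in) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Weather_Condition | Partly Cloudy | Overcast | Partly Cloudy | Mostly Cloudy | Mostly Cloudy
    Amenity | 0 | 0 | 0 | 0 | 0
    Bump | 0 | 0 | 0 | 0 | 0
    Crossing | 0 | 1 | 0 | 0 | 0
    Give_Way | 0 | 0 | 0 | 0 | 0
    Junction | 0 | 0 | 0 | 0 | 0
    No_Exit | 0 | 0 | 0 | 0 | 0
    Railway | 0 | 0 | 0 | 0 | 0
    Roundabout | 0 | 0 | 0 | 0 | 0
    Station | 0 | 0 | 0 | 0 | 0
    Stop | 0 | 0 | 0 | 0 | 0
    Traffic_Calming | 0 | 0 | 0 | 0 | 0
    Traffic_Signal | 1 | 0 | 0 | 0 | 1
    Turning_Loop | 0 | 0 | 0 | 0 | 0
    Sunrise_Sunset | 0 | 0 | 0 | 0 | 0
    month | 1 | 1 | 1 | 1 | 1
    hour | 0 | 0 | 0 | 1 | 1
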
    y = dataX['Severity']  # label column
    Xw = dataX['Weather_Condition']  # column to one-hot encode
    X = dataX.drop(['Severity', 'Weather_Condition'], axis=1)
    # One-hot encoding
    enc = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(Xw.values.reshape(-1, 1))
    result = enc.transform(Xw.values.reshape(-1, 1)).toarray()
    Xw1 = pd.DataFrame(result)
    Xw1.shape  # (716669, 78)

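After one-hot encoding, the single Weather_Condition column becomes a (716669, 78) matrix.
That many sparse columns can lead to overfitting, so PCA is used to reduce the dimensionality.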

    pca = PCA(n_components=5)
    pca.fit(Xw1)
    col = pca.transform(Xw1)
    Xw1 = pd.DataFrame(col)
    Xw1.head()

Output of Xw1.head():

               0          1          2          3          4
    0  -0.365071  -0.105292   0.643741  -0.661940  -0.141977
    1  -0.544094   0.715639  -0.290545  -0.018498  -0.060453
    2  -0.365071  -0.105292   0.643741  -0.661940  -0.141977
    3  -0.482130  -0.685853  -0.468909  -0.024822  -0.071162
    4  -0.482130  -0.685853  -0.468909  -0.024822  -0.071162

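When one-hot columns are compressed with PCA, it is worth checking how much of the variance the kept components retain. The sketch below does that with plain NumPy on a made-up list of weather labels; it is an illustration of the idea, not the article's pipeline, and says nothing about how much variance the 5 components above keep on the real data. The helper names are hypothetical.

```python
import numpy as np


def one_hot(labels):
    """Return a (n_samples, n_categories) 0/1 matrix and the sorted category list."""
    cats = sorted(set(labels))
    index = {c: i for i, c in enumerate(cats)}
    mat = np.zeros((len(labels), len(cats)))
    mat[np.arange(len(labels)), [index[label] for label in labels]] = 1.0
    return mat, cats


def pca_explained_variance_ratio(matrix):
    """Variance share of each principal component, largest first (via SVD of the centered data)."""
    centered = matrix - matrix.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Made-up labels, for illustration only.
weather = ["Clear", "Clear", "Overcast", "Partly Cloudy",
           "Mostly Cloudy", "Clear", "Overcast", "Clear"]
encoded, categories = one_hot(weather)
ratios = pca_explained_variance_ratio(encoded)
print(encoded.shape)
print(np.round(np.cumsum(ratios), 3))  # cumulative share kept by the first k components
```

Because every one-hot row sums to 1, the columns are linearly dependent and the last component carries essentially no variance. With scikit-learn, the fitted `PCA` object exposes the same quantity as `explained_variance_ratio_`.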
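    # Scale the remaining features with RobustScaler
    columns = X.columns.tolist()
    robustS = RobustScaler()
    X = pd.DataFrame(robustS.fit_transform(X), columns=columns)
    X.head()

Output of X.head(), transposed (one line per column, values for rows 0 to 4):

    Column | 0 | 1 | 2 | 3 | 4
    Start_Lat | 0.287847 | 0.281328 | 0.300197 | 0.286006 | 0.287847
    Start_Lng | -0.995895 | -0.995506 | -0.976155 | -0.995993 | -0.995895
    Distance(mi) | 0.5 | 0.5 | 0.0 | 0.5 | 0.5
    Temperature(F) | -0.849372 | -0.556485 | -0.849372 | -0.849372 | -0.849372
    Humidity(%) | 0.371429 | 0.485714 | 0.371429 | 0.457143 | 0.457143
    Pressure(in) | -0.454545 | -0.181818 | -0.454545 | -0.272727 | -0.272727
    Visibility(mi) | 0.0 | 0.0 | 0.0 | -1.0 | -1.0
    Wind_Speed(mph) | -0.189655 | 0.793103 | -0.189655 | -0.586207 | -0.586207
    Precipitation(in) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Amenity | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Bump | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Crossing | 0.0 | 1.0 | 0.0 | 0.0 | 0.0
    Give_Way | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Junction | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    No_Exit | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Railway | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Roundabout | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Station | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Stop | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Traffic_Calming | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Traffic_Signal | 1.0 | 0.0 | 0.0 | 0.0 | 1.0
    Turning_Loop | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
    Sunrise_Sunset | -1.0 | -1.0 | -1.0 | -1.0 | -1.0
    month | -1.166667 | -1.166667 | -1.166667 | -1.166667 | -1.166667
    hour | -1.500 | -1.500 | -1.500 | -1.375 | -1.375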
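RobustScaler, with its default settings, subtracts each column's median and divides by its interquartile range (75th minus 25th percentile), which makes it less sensitive to outliers than mean and standard deviation scaling. The following NumPy sketch reproduces that formula on a made-up toy column; the function name is hypothetical. The last lines check it against the scaled `hour` values shown above, where the median of 12 and IQR of 8 are inferred from the displayed output rather than stated in the article.

```python
import numpy as np


def robust_scale(values):
    """Scale as (x - median) / IQR, the default RobustScaler formula."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    q1, q3 = np.percentile(values, [25, 75])
    return (values - median) / (q3 - q1)


# Toy column with one outlier (illustrative values only).
toy = [1.0, 2.0, 3.0, 4.0, 100.0]
print(robust_scale(toy))

# Consistency check against the displayed hour column:
# hour 0 maps to -1.500 and hour 1 to -1.375, which fits median 12 and IQR 8 (inferred values).
hour_median, hour_iqr = 12, 8
print((0 - hour_median) / hour_iqr, (1 - hour_median) / hour_iqr)
```

The outlier 100.0 barely affects how the other four values are scaled, which is the reason for preferring this scaler on skewed columns.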

Concatenate X and Xw1

    X1 = pd.concat([X, Xw1], axis=1)
    X1.shape  # (716669, 30)

    # XGBoost class labels must run from 0 to num_class - 1 (here 0 to 2), so map 2, 3, 4 to 0, 1, 2
    def f(x):
        if x == 2:
            return 0
        elif x == 3:
            return 1
        else:
            return 2

    y1 = y.apply(f)
    y1.value_counts()

0 461657
1 230899
2 24113
Name: Severity, dtype: int64
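The counts above are still quite imbalanced after dropping severity 1: class 2 is roughly one thirtieth of the data. One common heuristic, which the article does not use, is to weight each class inversely to its frequency. The sketch below computes such "balanced" weights from the counts printed above; the function name is hypothetical and the approach is offered as an option to try, not as part of the original workflow.

```python
# Class counts taken from the value_counts() output in the article.
counts = {0: 461657, 1: 230899, 2: 24113}


def balanced_class_weights(class_counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * n) for label, n in class_counts.items()}


weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(label, round(w, 3))
```

With these weights every class contributes the same total weight. A per-row weight vector built from them could be passed through the `weight` argument of `xgb.DMatrix`; whether that helps here would need to be tested.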

XGBoost modeling

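    param1 = {
        'booster': 'gbtree',
        'objective': 'multi:softmax',  # multi-class classification
        'num_class': 3,                # number of classes, used together with multi:softmax
        'gamma': 0.1,                  # controls post-pruning; larger is more conservative, typically around 0.1 or 0.2
        'max_depth': 12,               # tree depth; deeper trees overfit more easily
        'lambda': 2,                   # L2 regularization on the weights; larger values make overfitting less likely
        'subsample': 0.7,              # row subsampling of the training data
        'colsample_bytree': 0.7,       # column subsampling when building each tree
        'min_child_weight': 3,
        'silent': 1,                   # 1 suppresses runtime messages, 0 is usually the better setting (newer XGBoost releases use 'verbosity' instead)
        'eta': 0.007,                  # acts like a learning rate
        'seed': 1000,
        'nthread': 4,                  # number of CPU threads
    }
    X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=0)
    xg_train = xgb.DMatrix(X_train, label=y_train)
    xg_test = xgb.DMatrix(X_test, label=y_test)
    bst1 = xgb.train(param1, xg_train)
    pred1 = bst1.predict(xg_test)
    print(accuracy_score(y_test, pred1))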

0.7609220422230595 (accuracy)
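An accuracy figure is easier to judge next to a trivial baseline. The sketch below compares the reported accuracy with what a model would score by always predicting the most common class. It uses the class counts printed earlier, which cover the whole dataset; the share in the 30% test split may differ slightly, so treat the result as approximate.

```python
# Class counts from the value_counts() output and the accuracy reported in the article.
counts = {0: 461657, 1: 230899, 2: 24113}
reported_accuracy = 0.7609220422230595

# Accuracy of always predicting the most frequent class.
majority_baseline = max(counts.values()) / sum(counts.values())
gain = reported_accuracy - majority_baseline

print(round(majority_baseline, 3))
print(round(gain, 3))
```

The baseline comes out at about 0.644, so the model's gain over always guessing class 0 is roughly 12 percentage points.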
Feature importance

    from xgboost import plot_importance

    plot_importance(bst1)  # bst1 is the booster trained above
    plt.show()

Hyperparameter tuning (to be updated)
