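US Traffic Accident Analysis (2017) (Practice Project 5)
Contents
- 1. Project summary
- 2. Data processing (for analysis only; preprocessing for modeling comes later)
- 3. Data visualization
- 4. Predicting severity with xgboost
- 4.1 Preprocessing before modeling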
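1. Project summary
Purpose: data analysis practice
Data source: Kaggle
Source code, dataset, and field descriptions (Baidu Cloud link):
Link: https://pan.baidu.com/s/1UD5HD69bNEsX2EkjaQ1IPg
Extraction code: 8gd8
Goals of this analysis:
- Basic analysis and visualization: which states have the most accidents, when accidents are most likely to happen, and what the weather was at the time, giving an overall picture of US accidents in 2017
- Use xgboost to predict accident severity and see which factors are most related to it
2. Data processing (for analysis only; preprocessing for modeling comes later)
The original dataset (US_Accidents_Dec19.csv) has 49 columns and about 3 million rows covering traffic accidents from 2016 to 2019. Given hardware and time constraints, only accidents from 2017 are analyzed (see the source files for details).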
    # Extract the 2017 records
    import pandas as pd

    data = pd.read_csv('./US_Accidents_Dec19.csv')
    datacopy = data.copy()
    datacopy['Start_Time'] = pd.to_datetime(datacopy['Start_Time'])
    datacopy['year'] = datacopy['Start_Time'].dt.year
    data1 = datacopy[datacopy['year'] == 2017]
    # The index is written out as well, which is why an 'Unnamed: 0' column
    # appears when the file is read back
    data1.to_csv('./USaccident2017.csv')

The analysis from here on uses USaccident2017.csv.
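Because the full file has about 3 million rows, the 2017 subset could also be extracted without loading everything into memory at once. A minimal sketch, assuming pandas' `chunksize` option for `read_csv`; the helper name `filter_year_in_chunks` and the sample rows are illustrative and not part of the original project:

```python
import io

import pandas as pd


def filter_year_in_chunks(csv_source, year, chunksize=100_000):
    """Read a CSV in chunks and keep only rows whose Start_Time falls in `year`."""
    kept = []
    for chunk in pd.read_csv(csv_source, parse_dates=['Start_Time'], chunksize=chunksize):
        kept.append(chunk[chunk['Start_Time'].dt.year == year])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for US_Accidents_Dec19.csv (made-up rows)
sample_csv = io.StringIO(
    "ID,Start_Time,State\n"
    "A-1,2016-12-31 23:50:00,CA\n"
    "A-2,2017-01-01 00:17:36,CA\n"
    "A-3,2017-06-15 08:05:00,TX\n"
    "A-4,2018-02-01 09:00:00,FL\n"
)
accidents_2017 = filter_year_in_chunks(sample_csv, 2017, chunksize=2)
print(accidents_2017)
```

With the real file, `csv_source` would be the path to the CSV, and a larger `chunksize` trades memory for fewer passes.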
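Import the required packages and read the data. The first five rows:
| Unnamed: 0 | ID | Source | TMC | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | Description | Number | Street | Side | City | County | State | Zipcode | Country | Timezone | Airport_Code | Weather_Timestamp | Temperature(F) | Wind_Chill(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Direction | Wind_Speed(mph) | Precipitation(in) | Weather_Condition | Amenity | Bump | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight | year |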
| 9206 | A-9207 | MapQuest | 201.0 | 3 | 2017-01-01 00:17:36 | 2017-01-01 00:47:12 | 37.925392 | -122.320595 | NaN | NaN | 0.01 | Accident on I-80 Westbound at Exit 15 Cutting ... | NaN | I-80 E | R | El Cerrito | Contra Costa | CA | 94530 | US | US/Pacific | KCCR | 2017-01-01 00:53:00 | 44.1 | 40.8 | 79.0 | 29.91 | 10.0 | WSW | 5.8 | NaN | Partly Cloudy | False | False | False | False | False | False | False | False | False | False | False | True | False | Night | Night | Night | Night | 2017 |
| 9207 | A-9208 | MapQuest | 201.0 | 3 | 2017-01-01 00:26:08 | 2017-01-01 01:16:06 | 37.878185 | -122.307175 | NaN | NaN | 0.01 | Accident on I-580 Southbound at Exit 12 I-80 I... | NaN | I-580 W | R | Berkeley | Alameda | CA | 94710 | US | US/Pacific | KOAK | 2017-01-01 00:53:00 | 51.1 | NaN | 83.0 | 29.97 | 10.0 | West | 11.5 | NaN | Overcast | False | False | True | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
| 9208 | A-9209 | MapQuest | 201.0 | 2 | 2017-01-01 00:53:41 | 2017-01-01 01:22:35 | 38.014820 | -121.640579 | NaN | NaN | 0.00 | Accident on Taylor Rd Southbound at Bethel Isl... | 2998.0 | Taylor Ln | R | Oakley | Contra Costa | CA | 94561 | US | US/Pacific | KCCR | 2017-01-01 00:53:00 | 44.1 | 40.8 | 79.0 | 29.91 | 10.0 | WSW | 5.8 | NaN | Partly Cloudy | False | False | False | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
| 9209 | A-9210 | MapQuest | 241.0 | 3 | 2017-01-01 01:18:51 | 2017-01-01 01:48:01 | 37.912056 | -122.323982 | NaN | NaN | 0.01 | Lane blocked and queueing traffic due to accid... | NaN | Bayview Ave | R | Richmond | Contra Costa | CA | 94804 | US | US/Pacific | KCCR | 2017-01-01 01:11:00 | 44.1 | 42.5 | 82.0 | 29.95 | 9.0 | SW | 3.5 | NaN | Mostly Cloudy | False | False | False | False | False | False | False | False | False | False | False | False | False | Night | Night | Night | Night | 2017 |
| 9210 | A-9211 | MapQuest | 222.0 | 3 | 2017-01-01 01:20:12 | 2017-01-01 01:49:47 | 37.925392 | -122.320595 | NaN | NaN | 0.01 | Queueing traffic due to accident on I-80 Westb... | NaN | I-80 E | R | El Cerrito | Contra Costa | CA | 94530 | US | US/Pacific | KCCR | 2017-01-01 01:11:00 | 44.1 | 42.5 | 82.0 | 29.95 | 9.0 | SW | 3.5 | NaN | Mostly Cloudy | False | False | False | False | False | False | False | False | False | False | False | True | False | Night | Night | Night | Night | 2017 |
Field descriptions:
https://www.jianshu.com/p/9e597dc8ae71
    # Check which columns have missing values
    data.isnull().sum()[data.isnull().sum() != 0]

    # Handle missing values
    # Drop columns that are irrelevant or not analyzed
    deletelist = ['Unnamed: 0', 'ID', 'TMC', 'End_Lat', 'End_Lng', 'Airport_Code',
                  'Weather_Timestamp', 'Wind_Chill(F)', 'Civil_Twilight',
                  'Nautical_Twilight', 'Astronomical_Twilight', 'year', 'Number']
    data1 = data.drop(deletelist, axis=1)
    # Drop rows with missing values in these columns
    data1 = data1.dropna(axis=0, subset=['City', 'Zipcode', 'Timezone', 'Sunrise_Sunset'])
    # Fill temperature, humidity, pressure, and visibility with the column mean
    for col in ['Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)']:
        data1[col] = data1[col].fillna(data1[col].mean())
    # Fill wind speed from the nearest valid value (method='nearest' needs SciPy;
    # the original order=4 only applies to polynomial/spline methods, so it is dropped)
    data1['Wind_Speed(mph)'] = data1['Wind_Speed(mph)'].interpolate(method='nearest')
    # Fill weather condition and wind direction with the mode.
    # mode() returns a Series, so take its first element; passing the Series itself
    # to fillna would only fill rows whose index label matches
    data1['Weather_Condition'] = data1['Weather_Condition'].fillna(data1['Weather_Condition'].mode()[0])
    data1['Wind_Direction'] = data1['Wind_Direction'].fillna(data1['Wind_Direction'].mode()[0])
    # Treat missing precipitation as 0
    data1['Precipitation(in)'] = data1['Precipitation(in)'].fillna(0)
    # Wind direction: merge labels that mean the same thing
    occupation = {"CALM": "Calm", "N": "North", "S": "South", "W": "West", "E": "East", "VAR": "Variable"}
    f = lambda x: occupation.get(x, x)  # look up the label in occupation, else keep it
    data1['Wind_Direction'] = data1['Wind_Direction'].map(f)
    # Reset the index because some rows were dropped
    data1 = data1.reset_index(drop=True)

3. Data visualization
    # States with the most accidents
    from pyecharts import options as opts
    from pyecharts.charts import Bar

    state_counts = data1['State'].value_counts()
    a = (
        Bar(init_opts=opts.InitOpts(width="2000px", height="400px"))
        .add_xaxis(state_counts.index.tolist())
        .add_yaxis('Number of accidents per state', state_counts.tolist(), color='#499C9F')
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    )
    a.render_notebook()
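The five states with the most accidents are CA (California), TX (Texas), FL (Florida), NY (New York), and NC (North Carolina).
All are relatively developed states with large populations.
The morning and evening rush hours stand out clearly; these are the times when traffic is heaviest and most congested.
Accidents in the second half of the year clearly outnumber those in the first half, perhaps because the second half has more holidays and heavier workloads.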
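The hourly and monthly patterns described above come down to counting accidents per hour and per month of `Start_Time`. A minimal sketch on made-up timestamps, assuming `Start_Time` has already been converted to datetime; the variable names here are illustrative:

```python
import pandas as pd

# Made-up timestamps standing in for data1['Start_Time']
start_time = pd.to_datetime(pd.Series([
    '2017-01-03 08:10:00', '2017-03-14 08:45:00', '2017-07-21 17:30:00',
    '2017-09-05 17:05:00', '2017-11-23 08:20:00', '2017-12-24 13:00:00',
]))

# Count accidents per hour of day and per calendar month
accidents_per_hour = start_time.dt.hour.value_counts().sort_index()
accidents_per_month = start_time.dt.month.value_counts().sort_index()

# Compare the two halves of the year
first_half = accidents_per_month[accidents_per_month.index <= 6].sum()
second_half = accidents_per_month[accidents_per_month.index > 6].sum()

print(accidents_per_hour)
print(first_half, second_half)
```

The resulting Series can be passed to a bar or line chart in the same way as the per-state counts.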
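Most accidents happened in clear weather, but the next three or four most common conditions are overcast or cloudy, so weather still has some influence on accidents.
Most accidents happened during the day, away from junctions, with no precipitation, no traffic signal, no give-way sign, and no speed bump.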
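Statements like these can be checked by computing the share of accidents for each condition. A minimal sketch on a made-up sample that reuses the dataset's column names; the values are illustrative only:

```python
import pandas as pd

# Made-up rows standing in for data1
sample = pd.DataFrame({
    'Sunrise_Sunset': ['Day', 'Day', 'Night', 'Day'],
    'Junction': [False, False, True, False],
    'Traffic_Signal': [False, True, False, False],
    'Bump': [False, False, False, False],
    'Precipitation(in)': [0.0, 0.0, 0.12, 0.0],
})

# For boolean columns, the mean is the fraction of rows that are True
flag_cols = ['Junction', 'Traffic_Signal', 'Bump']
flag_share = sample[flag_cols].mean()

# Share of daytime accidents and of accidents with no precipitation
day_share = (sample['Sunrise_Sunset'] == 'Day').mean()
dry_share = (sample['Precipitation(in)'] == 0).mean()

print(flag_share)
print(day_share, dry_share)
```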
Visibility was good for most accidents, i.e. between 6.5 and 12 miles (1 mile is about 1.6 km).
Most accidents occurred in developed coastal regions.
4. Predicting severity with xgboost
4.1 Preprocessing before modeling
    dataX = data1.copy()
    # Extract month and hour
    dataX['month'] = dataX['Start_Time'].apply(lambda x: x.month)
    dataX['hour'] = dataX['Start_Time'].apply(lambda x: x.hour)
    # Drop features that are not useful for modeling
    deletelist2 = ['Source', 'Side', 'Start_Time', 'End_Time', 'Description', 'Street',
                   'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone', 'Wind_Direction']
    dataX = dataX.drop(deletelist2, axis=1)
    # Convert False/True to 0/1
    list3 = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
             'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
    for col in list3:
        dataX[col] = dataX[col].astype(int)
    # Severity 1 has very few samples (extreme class imbalance), so drop it
    dataX = dataX[dataX['Severity'] != 1].reset_index(drop=True)
    dataX.head()

| Severity | Start_Lat | Start_Lng | Distance(mi) | Temperature(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Speed(mph) | Precipitation(in) | Weather_Condition | Amenity | Bump | Crossing | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | month | hour |
| 3 | 37.925392 | -122.320595 | 0.01 | 44.1 | 79.0 | 29.91 | 10.0 | 5.8 | 0.0 | Partly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 37.878185 | -122.307175 | 0.01 | 51.1 | 83.0 | 29.97 | 10.0 | 11.5 | 0.0 | Overcast | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 38.014820 | -121.640579 | 0.00 | 44.1 | 79.0 | 29.91 | 10.0 | 5.8 | 0.0 | Partly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 37.912056 | -122.323982 | 0.01 | 44.1 | 82.0 | 29.95 | 9.0 | 3.5 | 0.0 | Mostly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 37.925392 | -122.320595 | 0.01 | 44.1 | 82.0 | 29.95 | 9.0 | 3.5 | 0.0 | Mostly Cloudy | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
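After one-hot encoding, the Weather_Condition column becomes a matrix of shape (716669, 78).
That many sparse columns makes overfitting more likely, so PCA is used to reduce the dimensionality.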
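The encoding and reduction code is not shown in the original, so the following is only a minimal sketch of the idea, assuming `pandas.get_dummies` for the one-hot step and scikit-learn's `PCA` for the reduction. The toy labels, the column names `weather_pc0`, etc., and the choice of three components are illustrative; the output below has five columns, which suggests five components were kept in the project.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Made-up weather labels standing in for dataX['Weather_Condition']
weather = pd.Series(
    ['Clear', 'Overcast', 'Clear', 'Mostly Cloudy',
     'Partly Cloudy', 'Clear', 'Light Rain', 'Overcast'],
    name='Weather_Condition',
)

# One-hot encode: one indicator column per distinct label
weather_onehot = pd.get_dummies(weather, dtype=float)

# Project the sparse indicator columns onto a few principal components
n_components = 3  # illustrative choice for the toy data
pca = PCA(n_components=n_components, random_state=0)
Xw1 = pd.DataFrame(
    pca.fit_transform(weather_onehot),
    columns=[f'weather_pc{i}' for i in range(n_components)],
)

print(weather_onehot.shape, Xw1.shape)
print(pca.explained_variance_ratio_)
```

Checking `explained_variance_ratio_` is one way to judge how many components are worth keeping.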
| -0.365071 | -0.105292 | 0.643741 | -0.661940 | -0.141977 |
| -0.544094 | 0.715639 | -0.290545 | -0.018498 | -0.060453 |
| -0.365071 | -0.105292 | 0.643741 | -0.661940 | -0.141977 |
| -0.482130 | -0.685853 | -0.468909 | -0.024822 | -0.071162 |
| -0.482130 | -0.685853 | -0.468909 | -0.024822 | -0.071162 |
| 0.287847 | -0.995895 | 0.5 | -0.849372 | 0.371429 | -0.454545 | 0.0 | -0.189655 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.166667 | -1.500 |
| 0.281328 | -0.995506 | 0.5 | -0.556485 | 0.485714 | -0.181818 | 0.0 | 0.793103 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.166667 | -1.500 |
| 0.300197 | -0.976155 | 0.0 | -0.849372 | 0.371429 | -0.454545 | 0.0 | -0.189655 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.166667 | -1.500 |
| 0.286006 | -0.995993 | 0.5 | -0.849372 | 0.457143 | -0.272727 | -1.0 | -0.586207 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.166667 | -1.375 |
| 0.287847 | -0.995895 | 0.5 | -0.849372 | 0.457143 | -0.272727 | -1.0 | -0.586207 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.0 | -1.166667 | -1.375 |
Concatenate X and Xw1
    X1 = pd.concat([X, Xw1], axis=1)
    X1.shape  # (716669, 30)

    # xgboost expects class labels in the range 0..num_class-1,
    # so severities 2, 3, 4 are mapped to 0, 1, 2
    def f(x):
        if x == 2:
            return 0
        elif x == 3:
            return 1
        else:
            return 2

    y1 = y.apply(f)
    y1.value_counts()

0    461657
1    230899
2     24113
Name: Severity, dtype: int64
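Modeling with xgboost
    import xgboost as xgb
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    param1 = {
        'booster': 'gbtree',
        'objective': 'multi:softmax',  # multi-class classification
        'num_class': 3,  # number of classes, used together with multi:softmax
        'gamma': 0.1,  # controls post-pruning; larger is more conservative, typically around 0.1 or 0.2
        'max_depth': 12,  # tree depth; deeper trees overfit more easily
        'lambda': 2,  # L2 regularization on weights; larger values make overfitting less likely
        'subsample': 0.7,  # random subsampling of training rows
        'colsample_bytree': 0.7,  # column subsampling when building each tree
        'min_child_weight': 3,
        'verbosity': 0,  # 0 = silent; replaces the original 'silent': 1, which newer xgboost versions deprecate
        'eta': 0.007,  # acts like a learning rate
        'seed': 1000,
        'nthread': 4,  # number of CPU threads
    }
    X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=0)
    xg_train = xgb.DMatrix(X_train, label=y_train)
    xg_test = xgb.DMatrix(X_test, label=y_test)
    # num_boost_round is not set, so xgb.train uses its default of 10 rounds
    bst1 = xgb.train(param1, xg_train)
    pred1 = bst1.predict(xg_test)
    print(accuracy_score(y_test, pred1))

0.7609220422230595 (accuracy)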
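To put the accuracy in context, it helps to compare it with a baseline that always predicts the most common class. A minimal sketch using the class counts reported earlier for the full dataset; the 30% test split may have a slightly different class mix, so this is an approximation:

```python
# Class counts for the mapped labels (0, 1, 2) over the full dataset
class_counts = {0: 461657, 1: 230899, 2: 24113}

total = sum(class_counts.values())
# Accuracy of a model that always predicts the majority class
majority_share = max(class_counts.values()) / total

reported_accuracy = 0.7609220422230595
improvement = reported_accuracy - majority_share

print(f"majority-class baseline: {majority_share:.4f}")
print(f"improvement over baseline: {improvement:.4f}")
```

Because the classes are imbalanced, per-class metrics such as recall for the rarest class would say more about the model than overall accuracy alone.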
View feature importance
Hyperparameter tuning (in progress)