Udacity Baseball Data Analysis Project

Published: 2023/12/15

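Height and weight characteristics of baseball players

I obtained a data set of physical measurements for baseball players born between 1820 and 1995. I am interested in how players' height and weight vary by birthplace, how they change over time, and how they relate to players' lifespan. The analysis below addresses these points.

Questions:

1. How are players' birthplaces distributed?
2. How do players' height and weight change with birth year?
3. How does players' lifespan relate to their height and weight?

Here, height and weight are the dependent variables; birth year and birthplace are the independent variables.

# -*- coding: utf-8 -*-
# Import libraries
from __future__ import division  # a __future__ import must precede all other imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline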

Load the data

# Load a CSV file into a DataFrame
def read_csv(filename):
    return pd.read_csv(filename)

player_df = read_csv('Master.csv')
# stars_df = read_csv('AllstarFull.csv')

Let's first look at the structure of the loaded data.

player_df.head()

    playerID  birthYear  birthMonth  birthDay birthCountry birthState   birthCity  deathYear  deathMonth  deathDay  ...
0  aardsda01     1981.0        12.0      27.0          USA         CO      Denver        NaN         NaN       NaN  ...
1  aaronha01     1934.0         2.0       5.0          USA         AL      Mobile        NaN         NaN       NaN  ...
2  aaronto01     1939.0         8.0       5.0          USA         AL      Mobile     1984.0         8.0      16.0  ...
3   aasedo01     1954.0         9.0       8.0          USA         CA      Orange        NaN         NaN       NaN  ...
4   abadan01     1972.0         8.0      25.0          USA         FL  Palm Beach        NaN         NaN       NaN  ...

   ...  nameLast       nameGiven  weight  height bats throws      debut  finalGame   retroID    bbrefID
0  ...   Aardsma     David Allan   220.0    75.0    R      R   2004/4/6  2015/8/23  aardd001  aardsda01
1  ...     Aaron     Henry Louis   180.0    72.0    R      R  1954/4/13  1976/10/3  aaroh101  aaronha01
2  ...     Aaron      Tommie Lee   190.0    75.0    R      R  1962/4/10  1971/9/26  aarot101  aaronto01
3  ...      Aase  Donald William   190.0    75.0    R      R  1977/7/26  1990/10/3  aased001   aasedo01
4  ...      Abad   Fausto Andres   184.0    73.0    L      L  2001/9/10  2006/4/13  abada001   abadan01

5 rows × 24 columns

The column headers have the following meanings:

1. playerID: A unique code assigned to each player. The playerID links the data in this file with records in the other files.
2. birthYear: Year player was born
3. birthMonth: Month player was born
4. birthDay: Day player was born
5. birthCountry: Country where player was born
6. birthState: State where player was born
7. birthCity: City where player was born
8. deathYear: Year player died
9. deathMonth: Month player died
10. deathDay: Day player died
11. deathCountry: Country where player died
12. deathState: State where player died
13. deathCity: City where player died
14. nameFirst: Player's first name
15. nameLast: Player's last name
16. nameGiven: Player's given name (typically first and middle)
17. weight: Player's weight in pounds
18. height: Player's height in inches
19. bats: Player's batting hand (left, right, or both)
20. throws: Player's throwing hand (left or right)
21. debut: Date that player made first major league appearance

The file has many fields, but we only need the player ID, birth year, death year, birthplace (country, state, city), weight, and height, so we extract just those columns.

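# .copy() makes data1_df independent of player_df, so later column assignments do not trigger SettingWithCopyWarning
data1_df = player_df[['playerID', 'birthYear', 'deathYear', 'birthCountry', 'birthState', 'birthCity', 'weight', 'height']].copy()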

Let's look at the structure of the new data.

data1_df.head()

    playerID  birthYear  deathYear birthCountry birthState   birthCity  weight  height
0  aardsda01     1981.0        NaN          USA         CO      Denver   220.0    75.0
1  aaronha01     1934.0        NaN          USA         AL      Mobile   180.0    72.0
2  aaronto01     1939.0     1984.0          USA         AL      Mobile   190.0    75.0
3   aasedo01     1954.0        NaN          USA         CA      Orange   190.0    75.0
4   abadan01     1972.0        NaN          USA         FL  Palm Beach   184.0    73.0

Next, let's look at the summary statistics.

data1_df.describe()

          birthYear    deathYear        weight        height
count  18703.000000  9336.000000  17975.000000  18041.000000
mean    1930.664118  1963.850364    185.980862     72.255640
std       41.229079    31.506369     21.226988      2.598983
min     1820.000000  1872.000000     65.000000     43.000000
25%     1894.000000  1942.000000    170.000000     71.000000
50%     1936.000000  1966.000000    185.000000     72.000000
75%     1968.000000  1989.000000    200.000000     74.000000
max     1995.000000  2016.000000    320.000000     83.000000

The summary shows that the players' mean height is 72.255 inches, ranging from 43 to 83 inches. Weight ranges from 65 to 320 pounds, with a mean of 185.98 pounds.

Let's check whether any data is missing.
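For readers more used to metric units, the averages above can be converted with the exact definitions 1 inch = 2.54 cm and 1 pound = 0.45359237 kg. The following is a minimal sketch; the helper names are illustrative and not part of the original notebook.

```python
# Exact conversion factors (international inch and avoirdupois pound).
CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237


def inches_to_cm(inches):
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH


def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_POUND


# Mean height and weight taken from the describe() output above.
mean_height_cm = inches_to_cm(72.255640)
mean_weight_kg = pounds_to_kg(185.980862)

print("mean height: %.1f cm" % mean_height_cm)
print("mean weight: %.1f kg" % mean_weight_kg)
```

This puts the average player at roughly 183.5 cm and 84.4 kg.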

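data1_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18846 entries, 0 to 18845
Data columns (total 8 columns):
playerID        18846 non-null object
birthYear       18703 non-null float64
deathYear       9336 non-null float64
birthCountry    18773 non-null object
birthState      18220 non-null object
birthCity       18647 non-null object
weight          17975 non-null float64
height          18041 non-null float64
dtypes: float64(4), object(4)
memory usage: 1.2+ MB

The weight, height, birth year, and death year columns are incomplete. Missing height and weight values will be filled with the previous value (forward fill), and rows with a missing birth year will be dropped.

# Forward-fill a column; the result must be assigned back, otherwise the fill has no effect
def enfull_ave(letter):
    data1_df[letter] = data1_df[letter].ffill()

# Fill in weight
enfull_ave('weight')
# Fill in height
enfull_ave('height')
# Drop rows with a missing birth year
# (dropna(how='all') would only drop rows in which every column is missing)
# Note: the outputs shown in the rest of this post were produced on the data before filling and dropping
data1_df = data1_df.dropna(subset=['birthYear'])

Now let's analyze how the players are distributed by country and by state.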
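Two pandas details matter for the cleaning step described above: `ffill()` returns a new Series rather than modifying the column in place, and `dropna(how='all')` removes a row only when every column in it is missing. The sketch below demonstrates both on a small made-up frame (`toy` and its values are illustrative, not taken from Master.csv).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data1_df; the values are made up for illustration.
toy = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01', 'd01'],
    'birthYear': [1900.0, np.nan, 1950.0, 1960.0],
    'weight': [180.0, np.nan, 200.0, np.nan],
})

# ffill() returns a new Series; without assignment the frame is unchanged.
toy['weight'].ffill()
missing_before = int(toy['weight'].isna().sum())

# Assigning the result back is what actually fills the gaps.
toy['weight'] = toy['weight'].ffill()
missing_after = int(toy['weight'].isna().sum())

# how='all' drops a row only if every column is NaN, so nothing is removed here.
rows_how_all = len(toy.dropna(how='all'))

# subset=[...] drops the rows that are missing the named column(s).
rows_subset = len(toy.dropna(subset=['birthYear']))

print(missing_before, missing_after, rows_how_all, rows_subset)
```

Forward fill copies the previous row's value, which here is simply the preceding player in alphabetical order of playerID, so it is a convenience rather than an estimate of the missing player's size.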

# Helper functions used below

# Group players by `name` and count the players in each group
def player_count(data, name):
    return data.groupby(name)['playerID'].count()

# Share of all players that falls in each group
def player_count_rate(data, name):
    b = player_count(data, name)
    a = data['playerID'].count()
    return b / a

# Draw a pie chart
def print_pie(group_data, title):
    group_data.plot.pie(title=title, figsize=(12, 12), autopct='%3.1f%%', startangle=90, legend=True)

# Draw a bar chart with each bar labelled as a percentage
def print_bar(data, title):
    bar = data.plot.bar(title=title, width=10)
    for p in bar.patches:
        bar.annotate('%3.1f%%' % (p.get_height() * 100), (p.get_x() * 1.005, p.get_height() * 1.005))

# Plot column `name1` against the index as red circle markers
def print_plot(data, name1, title):
    x = data.index
    y = data[name1]
    plt.figure(figsize=(12, 6))   # create the figure
    plt.plot(x, y, 'ro')          # x values, y values, red circle markers
    plt.xlabel("year")
    plt.ylabel(name1)
    plt.title(title)              # figure title
    plt.savefig("line.jpg")       # save before show(); saving afterwards can write an empty image
    plt.show()                    # display the figure

Next, let's look at the share of players born in each country.

player_count_rate(data1_df, 'birthCountry').sort_values(ascending=False)

birthCountry
USA               0.875730
D.R.              0.034119
Venezuela         0.018094
P.R.              0.013425
CAN               0.012947
Cuba              0.010506
Mexico            0.006261
Japan             0.003290
Panama            0.002918
Ireland           0.002653
United Kingdom    0.002600
Germany           0.002441
Australia         0.001486
South Korea       0.000902
Colombia          0.000902
Nicaragua         0.000743
Curacao           0.000743
V.I.              0.000637
Netherlands       0.000637
Taiwan            0.000584
Russia            0.000424
France            0.000424
Italy             0.000371
Bahamas           0.000318
Aruba             0.000265
Poland            0.000265
Austria           0.000212
Sweden            0.000212
Spain             0.000212
Czech Republic    0.000212
Jamaica           0.000212
Brazil            0.000159
Norway            0.000159
Saudi Arabia      0.000106
At Sea            0.000053
American Samoa    0.000053
Belgium           0.000053
Belize            0.000053
China             0.000053
Viet Nam          0.000053
Denmark           0.000053
Finland           0.000053
Greece            0.000053
Guam              0.000053
Honduras          0.000053
Indonesia         0.000053
Lithuania         0.000053
Philippines       0.000053
Singapore         0.000053
Slovakia          0.000053
Switzerland       0.000053
Afghanistan       0.000053
Name: playerID, dtype: float64

The players come from more than 50 countries and regions. The vast majority were born in the USA (87.6%). The next largest groups are D.R. (Dominican Republic), Venezuela, P.R. (Puerto Rico), CAN (Canada), and Cuba, each above 1%. Next, let's look at how the US-born players are distributed by state.

# Select the players born in the USA
data_usa = data1_df[data1_df['birthCountry'] == 'USA']
# Draw a pie chart
print_pie(player_count_rate(data_usa, 'birthState'), 'The player rate about States')

The pie chart shows that CA has the most players (13%), followed by PA (8.5%). The top five states are CA, PA, NY, IL, and OH; together they account for more than 44% of US-born players.

Now let's look at players' height and weight by birthplace.
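One detail of `player_count_rate` is worth spelling out: it divides each group's size by the total number of players, including players whose `birthCountry` is missing (the `info()` output shows 18773 non-null values out of 18846), so the shares sum to slightly less than 1. The sketch below compares this with `value_counts(normalize=True)`, which divides by the number of known values only. The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample: four players, one with an unknown birth country.
toy = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01', 'd01'],
    'birthCountry': ['USA', 'USA', 'CAN', np.nan],
})

# Same calculation as player_count_rate(): group sizes over all players.
share_of_all = toy.groupby('birthCountry')['playerID'].count() / toy['playerID'].count()

# Alternative: shares among players whose birth country is known.
share_of_known = toy['birthCountry'].value_counts(normalize=True)

print(share_of_all.sort_values(ascending=False))
print(share_of_known)
```

Either denominator is defensible; what matters is stating which one a reported percentage uses.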
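The combined share of the top states can be computed rather than read off the chart. Below is a plain-Python sketch on made-up data; `states` and `top_n` are illustrative names. With pandas, calling `nlargest(5).sum()` on the per-state shares would serve the same purpose.

```python
from collections import Counter

# Made-up birth states, for illustration only.
states = ['CA', 'CA', 'CA', 'PA', 'PA', 'NY', 'NY', 'IL', 'OH', 'TX']

# Count players per state and take the top_n most common states.
counts = Counter(states)
top_n = 3
top = counts.most_common(top_n)

# Combined share of the top_n states among all players in the sample.
top_share = sum(count for _, count in top) / float(len(states))

print(top)
print("top %d share: %.1f%%" % (top_n, top_share * 100))
```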

data2 = data1_df[['birthCountry', 'birthState', 'height', 'weight']]
# Mean height and weight per country, sorted by mean height
# (the numeric columns are selected explicitly so the non-numeric birthState column is not averaged)
data3 = data2.groupby('birthCountry')[['height', 'weight']].mean().sort_values(by='height', ascending=False)
print('%d countries are at or above the overall average' % (data3['height'][data3['height'] >= data1_df['height'].mean()].count()))
data3

26 countries are at or above the overall average

                   height      weight
birthCountry
Indonesia       78.000000  220.000000
Belgium         77.000000  205.000000
Jamaica         75.250000  201.250000
Afghanistan     75.000000  215.000000
Brazil          74.333333  205.000000
Singapore       74.000000  205.000000
Honduras        74.000000  185.000000
Guam            74.000000  210.000000
Australia       73.500000  200.500000
Netherlands     73.454545  183.333333
South Korea     73.411765  198.294118
Curacao         73.357143  207.857143
Spain           73.250000  189.666667
Switzerland     73.000000  170.000000
Lithuania       73.000000  185.000000
Norway          73.000000  180.000000
China           73.000000  165.000000
Philippines     73.000000  188.000000
Aruba           73.000000  200.000000
Panama          72.890909  186.018182
D.R.            72.819596  192.916019
Taiwan          72.727273  194.454545
Sweden          72.666667  185.000000
Nicaragua       72.571429  189.785714
Germany         72.375000  182.871795
USA             72.257213  185.427646
Venezuela       72.225806  197.222874
Japan           72.209677  192.354839
Mexico          72.127119  189.118644
Saudi Arabia    72.000000  200.000000
Greece          72.000000  185.000000
American Samoa  72.000000  210.000000
Bahamas         72.000000  180.833333
Slovakia        72.000000  196.000000
CAN             71.979167  185.212500
P.R.            71.881423  185.818182
France          71.833333  184.666667
Austria         71.750000  190.250000
Cuba            71.682051  185.451282
Colombia        71.647059  199.125000
Poland          71.600000  179.800000
V.I.            71.333333  186.250000
Italy           71.142857  180.428571
Czech Republic  71.000000  184.000000
At Sea          71.000000  170.000000
Viet Nam        71.000000  200.000000
United Kingdom  70.377778  174.500000
Belize          70.000000  180.000000
Russia          69.857143  167.428571
Ireland         69.552632  170.131579
Finland         69.000000  165.000000
Denmark         67.000000  158.000000

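The country with the highest mean height is Indonesia at 78 inches, followed by Belgium at 77 inches. No country's mean height is below 67 inches, and 26 countries are at or above the overall average. Next, let's look at weight.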

# Mean height and weight per country, sorted by mean weight
c = data2.groupby('birthCountry')[['height', 'weight']].mean().sort_values(by='weight', ascending=False)
# Count the countries at or above the overall average weight
print('%d countries are at or above the overall average' % (c['weight'][c['weight'] >= data1_df['weight'].mean()].count()))
c

27 countries are at or above the overall average

                   height      weight
birthCountry
Indonesia       78.000000  220.000000
Afghanistan     75.000000  215.000000
American Samoa  72.000000  210.000000
Guam            74.000000  210.000000
Curacao         73.357143  207.857143
Singapore       74.000000  205.000000
Belgium         77.000000  205.000000
Brazil          74.333333  205.000000
Jamaica         75.250000  201.250000
Australia       73.500000  200.500000
Saudi Arabia    72.000000  200.000000
Viet Nam        71.000000  200.000000
Aruba           73.000000  200.000000
Colombia        71.647059  199.125000
South Korea     73.411765  198.294118
Venezuela       72.225806  197.222874
Slovakia        72.000000  196.000000
Taiwan          72.727273  194.454545
D.R.            72.819596  192.916019
Japan           72.209677  192.354839
Austria         71.750000  190.250000
Nicaragua       72.571429  189.785714
Spain           73.250000  189.666667
Mexico          72.127119  189.118644
Philippines     73.000000  188.000000
V.I.            71.333333  186.250000
Panama          72.890909  186.018182
P.R.            71.881423  185.818182
Cuba            71.682051  185.451282
USA             72.257213  185.427646
CAN             71.979167  185.212500
Lithuania       73.000000  185.000000
Greece          72.000000  185.000000
Honduras        74.000000  185.000000
Sweden          72.666667  185.000000
France          71.833333  184.666667
Czech Republic  71.000000  184.000000
Netherlands     73.454545  183.333333
Germany         72.375000  182.871795
Bahamas         72.000000  180.833333
Italy           71.142857  180.428571
Norway          73.000000  180.000000
Belize          70.000000  180.000000
Poland          71.600000  179.800000
United Kingdom  70.377778  174.500000
Ireland         69.552632  170.131579
At Sea          71.000000  170.000000
Switzerland     73.000000  170.000000
Russia          69.857143  167.428571
Finland         69.000000  165.000000
China           73.000000  165.000000
Denmark         67.000000  158.000000

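The country with the highest mean weight is again Indonesia at 220 pounds, followed by Afghanistan at 215 pounds. Players from 27 countries are at or above the overall average.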
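Several countries near the top of both rankings (Indonesia, Afghanistan, Belgium, Guam, and others) have a share of 0.000053 in the country distribution above, which corresponds to a single player out of 18846, so their "average" is one person's measurement. One way to keep this visible is to report the group size next to the mean. The sketch below uses made-up data, and `min_players` is a hypothetical threshold, not something the original analysis applies.

```python
import pandas as pd

# Made-up sample: country 'X' has a single, unusually tall player.
toy = pd.DataFrame({
    'birthCountry': ['USA', 'USA', 'USA', 'X', 'CAN', 'CAN'],
    'height': [72.0, 73.0, 71.0, 78.0, 70.0, 72.0],
})

# Mean height per country together with the number of players behind it.
summary = toy.groupby('birthCountry')['height'].agg(['mean', 'count'])

# Keep only countries with at least min_players players (hypothetical threshold).
min_players = 2
reliable = summary[summary['count'] >= min_players].sort_values(by='mean', ascending=False)

print(summary.sort_values(by='mean', ascending=False))
print(reliable)
```

In the unfiltered summary the single-player country ranks first; after filtering, only the countries with enough players are compared.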

Next, let's look at how mean height and mean weight change with birth year.

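# Average the numeric columns by birth year; dropna() then removes any year whose mean is missing
b = data1_df.groupby('birthYear')[['deathYear', 'weight', 'height']].mean()
d = b.dropna()
# Plot mean weight against birth year
print_plot(d, 'weight', 'The weight change about birthyears')

[Figure: "The weight change about birthyears", mean weight (y axis) against birth year (x axis)]

# Plot mean height against birth year
print_plot(d, 'height', 'The height change about birthYear')

[Figure: "The height change about birthYear", mean height (y axis) against birth year (x axis)]

The plots show that players' mean height and weight both increase with birth year. How strong is the correlation? Let's check.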

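# Build a frame with birth year as a regular column
e = pd.DataFrame(d, columns=['birthyear', 'weight', 'height'])
e['birthyear'] = e.index
# Compute the correlation of each column with birth year
e.corrwith(e['birthyear'])

birthyear    1.000000
weight       0.929546
height       0.947681
dtype: float64

The correlation coefficient between birth year and mean height is 0.948, and between birth year and mean weight it is 0.930, so players' mean height and weight are strongly correlated with birth year. Without further data, however, the cause of this trend cannot be determined.

Next, let's look at players' lifespan in relation to height and weight.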
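`corrwith` computes the Pearson correlation coefficient by default, which is the sum of products of deviations from the mean divided by the square root of the product of the two sums of squared deviations. The sketch below implements that formula in plain Python on made-up yearly averages; `pearson_r` is an illustrative helper, not part of the original notebook.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up yearly averages, for illustration only.
years = [1900, 1920, 1940, 1960, 1980]
mean_heights = [70.5, 71.0, 71.8, 72.6, 73.4]

r = pearson_r(years, mean_heights)
print("r = %.3f" % r)
```

Because the coefficients above are computed on per-year averages, which smooth out the variation between individual players, they are likely higher than a correlation computed player by player would be.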

# Select the columns needed (players without a death year are removed by dropna() in the next step)
data_age = data1_df[['playerID', 'birthYear', 'deathYear', 'weight', 'height']]
# Compute each player's lifespan in years
data_age = pd.DataFrame(data_age, columns=['playerID', 'birthYear', 'deathYear', 'Age', 'weight', 'height'])
data_age['Age'] = data_age['deathYear'] - data_age['birthYear']

Remove any rows that still contain missing values.
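Subtracting the birth year from the death year can overstate a lifespan by one year when the player died before reaching that year's birthday. Master.csv also carries the birth and death month and day, so a more exact age in completed years could be computed. The sketch below is one way to do that; `age_at_death` is a hypothetical helper, not something the original analysis uses.

```python
from datetime import date


def age_at_death(birth, death):
    """Completed years between a birth date and a death date."""
    years = death.year - birth.year
    # Subtract one year if the death date falls before that year's birthday.
    if (death.month, death.day) < (birth.month, birth.day):
        years -= 1
    return years


# Tommie Aaron, from the head() output above: born 1939-08-05, died 1984-08-16.
aaron_age = age_at_death(date(1939, 8, 5), date(1984, 8, 16))

# Made-up dates showing how the year difference can overstate the age by one.
edge_age = age_at_death(date(1900, 12, 31), date(1950, 1, 1))

print(aaron_age, edge_age)
```

For an analysis grouped by whole years of age, the difference is small, but it shifts some players into the neighbouring age group.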

# Drop rows with missing values (this removes players without a death year)
data_age = data_age.dropna()
# Mean of each numeric column for every age at death
f = data_age.groupby('Age')[['birthYear', 'deathYear', 'weight', 'height']].mean()
f

         birthYear    deathYear      weight     height
Age
20.0   1907.500000  1927.500000  176.500000  70.500000
21.0   1867.000000  1888.000000  181.500000  72.500000
22.0   1925.800000  1947.800000  179.000000  71.400000
23.0   1915.000000  1938.000000  169.600000  72.000000
24.0   1916.200000  1940.200000  177.400000  71.300000
25.0   1898.307692  1923.307692  176.153846  72.461538
26.0   1903.400000  1929.400000  177.533333  71.733333
27.0   1887.769231  1914.769231  172.884615  70.884615
28.0   1894.500000  1922.500000  178.500000  71.500000
29.0   1907.432432  1936.432432  176.297297  71.486486
30.0   1888.709677  1918.709677  172.774194  71.064516
31.0   1881.666667  1912.666667  169.259259  70.777778
32.0   1889.393939  1921.393939  173.333333  70.727273
33.0   1894.258065  1927.258065  167.290323  70.516129
34.0   1898.900000  1932.900000  177.040000  71.820000
35.0   1899.135135  1934.135135  183.405405  71.756757
36.0   1891.051282  1927.051282  176.717949  70.128205
37.0   1886.538462  1923.538462  171.461538  70.333333
38.0   1892.083333  1930.083333  178.250000  71.354167
39.0   1897.589744  1936.589744  179.435897  71.641026
40.0   1892.311111  1932.311111  178.555556  71.133333
41.0   1893.500000  1934.500000  177.704545  70.727273
42.0   1893.225000  1935.225000  179.225000  71.275000
43.0   1891.204082  1934.204082  175.673469  70.816327
44.0   1885.344262  1929.344262  173.016393  70.377049
45.0   1898.121212  1943.121212  178.848485  71.136364
46.0   1893.938776  1939.938776  179.040816  71.061224
47.0   1893.441558  1940.441558  175.012987  70.805195
48.0   1894.000000  1942.000000  174.164557  70.949367
49.0   1894.213115  1943.213115  175.590164  70.868852
...            ...          ...         ...        ...
75.0   1900.285024  1975.285024  174.782609  71.164251
76.0   1897.894977  1973.894977  175.808219  71.118721
77.0   1897.607143  1974.607143  173.991071  71.004464
78.0   1897.606635  1975.606635  176.327014  71.033175
79.0   1898.990991  1977.990991  175.644144  71.157658
80.0   1899.351240  1979.351240  177.000000  71.190083
81.0   1899.879630  1980.879630  176.351852  70.925926
82.0   1900.754464  1982.754464  176.075893  71.281250
83.0   1901.454128  1984.454128  175.665138  71.243119
84.0   1898.257895  1982.257895  175.415789  70.915789
85.0   1900.005263  1985.005263  172.215789  70.968421
86.0   1903.913978  1989.913978  175.811828  71.209677
87.0   1897.798611  1984.798611  175.402778  71.090278
88.0   1904.540541  1992.540541  177.425676  71.533784
89.0   1900.299213  1989.299213  174.866142  71.228346
90.0   1901.486726  1991.486726  173.495575  70.858407
91.0   1899.068182  1990.068182  173.750000  70.681818
92.0   1901.673684  1993.673684  175.831579  71.157895
93.0   1901.513158  1994.513158  173.828947  71.000000
94.0   1898.088889  1992.088889  173.533333  71.311111
95.0   1899.461538  1994.461538  172.576923  70.826923
96.0   1902.222222  1998.222222  176.500000  71.111111
97.0   1893.647059  1990.647059  171.823529  70.352941
98.0   1900.882353  1998.882353  174.705882  70.705882
99.0   1897.222222  1996.222222  163.444444  69.666667
100.0  1899.700000  1999.700000  168.600000  70.100000
101.0  1900.400000  2001.400000  167.000000  70.400000
102.0  1900.000000  2002.000000  165.000000  71.000000
103.0  1911.000000  2014.000000  158.000000  65.000000
107.0  1891.000000  1998.000000  162.000000  69.000000

85 rows × 4 columns

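# Build a frame with age as a regular column
age_df = pd.DataFrame(f, columns=['age', 'weight', 'height'])
age_df['age'] = f.index
# Plot mean weight and mean height against age at death
print_plot(age_df, 'weight', 'weight-age')
print_plot(age_df, 'height', 'height-age')

[Figure: "weight-age", mean weight (y axis) against age at death (x axis)]

[Figure: "height-age", mean height (y axis) against age at death (x axis)]

# Compute the correlation matrix
age_df.corr()

             age    weight    height
age     1.000000 -0.430298 -0.371683
weight -0.430298  1.000000  0.724237
height -0.371683  0.724237  1.000000

Lifespan is negatively correlated with both mean weight (-0.43) and mean height (-0.37). These correlations are weak to moderate and far weaker than those with birth year. They suggest that height and weight may be related to lifespan to some extent, although a correlation alone does not show that they affect it.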
