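Clustering Attempt with k-means, Step 1: Data Preprocessing
Starting from previously scraped listings of second-hand homes in Shanghai, this post calls the AMap (Gaode) geocoding API to obtain the longitude and latitude of each home and of People's Square (人民广场). It then adds a column holding the distance from each home to People's Square, so that the data can later be clustered with k-means. Only the data preprocessing is recorded here.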
REF:
"Python: scraping Lianjia residential community data as material for data analysis", Cyber's blog on CSDN. https://blog.csdn.net/weixin_42029733/article/details/93064205?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_title~default-1.queryctr&spm=1001.2101.3001.4242.2&utm_relevant_index=4
1.1 Data import
import pickle
import pandas as pd
import re
import numpy as np
import requests
import json
import math
from tqdm import tqdm, trange

# Load the scraped data: a dict of pages, each page a dict of listings
with open('./shanghai_ershou_v2.pkl', 'rb') as f:
    shanghai_ershou = pickle.load(f)
# hangzhou_new = pickle.load(open('./hangzhou_new.pkl', 'rb'))

# Count the listings on every page
l = []
for key in shanghai_ershou.keys():
    l.append(len(shanghai_ershou[key].keys()))
print('Total number of scraped homes:', sum(l))

# Convert to a DataFrame: one row per listing, one column per field
shanghai_ershou_df = pd.DataFrame(columns=pd.DataFrame(shanghai_ershou['locationbeicaipg1']).index)
for i in shanghai_ershou.keys():
    temp = pd.DataFrame(shanghai_ershou[i]).T
    shanghai_ershou_df = pd.concat([shanghai_ershou_df, temp], ignore_index=True)
1.2 Preprocessing the existing data
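The regular expressions used in this section can be checked on a few strings before they are applied to the whole table. The sketch below is only an illustration: the sample strings are made up to resemble the listing fields, and `parse_unit_price` and `parse_area` are hypothetical helper names that do not appear in the original code.

```python
import re

def parse_unit_price(text):
    """Return the first run of digits and commas in `text` as a float."""
    match = re.search(r'([\d,]+)', text)
    if match is None:
        return None
    return float(match.group(1).replace(',', ''))

def parse_area(text):
    """Return the number that directly precedes '平米' as a float."""
    match = re.search(r'([\d.]+)平米', text)
    if match is None:
        return None
    return float(match.group(1))

# Made-up sample strings (assumed field format, not real scraped data)
sample_unit_price = '单价52,000元/平米'
sample_info = '2室1厅 | 65.3平米 | 南'

print(parse_unit_price(sample_unit_price))
print(parse_area(sample_info))
```

Keeping the patterns in small functions like these makes it easy to spot a field that does not match before the `astype('float')` conversion fails on the full column.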
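# Every cell of the DataFrame turns out to be a list (possibly empty)
def extract_0(x):
    """Return the first element of the list, or None if the list is empty."""
    try:
        return x[0]
    except (IndexError, KeyError, TypeError):
        return None

df_shanghai = shanghai_ershou_df.copy()
# Pull the value out of the list in every column
for col in df_shanghai.columns:
    df_shanghai[col] = df_shanghai[col].apply(extract_0)

# Total price
df_shanghai.total_price = df_shanghai.total_price.astype('float')

# Unit price: keep the digits, drop the thousands separators
df_shanghai.unit_price = (
    df_shanghai.unit_price.str.extract(r'([\d,]+)', expand=False)
    .str.replace(',', '', regex=False)
    .astype('float')
)

# Area: the number in front of "平米" (square metres); expand=False returns a Series
df_shanghai['area'] = df_shanghai['info'].str.extract(r'([\d.]+)平米', expand=False).astype('float')
1.3 Calling the API in batch and parsing longitude and latitude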
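The code in this section reads the `location` field of the first entry in `geocodes`, a string of the form `longitude,latitude`. As a minimal sketch of that parsing step, the example below works on hand-written response bodies rather than real API output. `split_location` is a hypothetical name, and the sample JSON contains only the fields the post itself relies on.

```python
import json

def split_location(res_text):
    """Return (longitude, latitude) as floats, or None if nothing was geocoded."""
    if res_text is None:
        return None
    geocodes = json.loads(res_text).get('geocodes')
    if not geocodes:  # missing key or empty list
        return None
    location = geocodes[0].get('location')
    if not location:
        return None
    lng, lat = location.split(',')
    return float(lng), float(lat)

# Hand-written sample bodies (assumed shape, for illustration only)
sample_ok = json.dumps({'geocodes': [{'location': '121.475,31.228'}]})
sample_empty = json.dumps({'geocodes': []})

print(split_location(sample_ok))
print(split_location(sample_empty))
```

Checking for an empty `geocodes` list matters because indexing `[0]` on it raises an `IndexError`, which would stop a batch run over thousands of addresses.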
res_dict = {}
url = 'https://restapi.amap.com/v3/geocode/geo?key=c00a9fc63a97c64fe63bf1ff051a285e&address=上海市{}&city=上海市'
for i in trange(df_shanghai.shape[0]):
# for i in tqdm(range(5)):
    try:
        # The address is built from columns 5 and 0 of the row
        location = df_shanghai.iloc[i, 5] + df_shanghai.iloc[i, 0]
        res = requests.get(url.format(location.rstrip()), timeout=10).text
    except Exception:
        # Missing address or failed request: record None and carry on
        res = None
    res_dict[i] = res
df_shanghai['api'] = pd.Series(res_dict)

def parse_location(res):
    """Return the 'longitude,latitude' string of a response, or None."""
    if res is None:
        return None
    geocodes = json.loads(res).get('geocodes')
    if not geocodes:
        # The API found no match for this address
        return None
    return geocodes[0].get('location')

# Longitude and latitude as one string
df_shanghai['location'] = df_shanghai.api.map(parse_location)

# Longitude and latitude as separate float columns
df_shanghai['longitude'] = df_shanghai.location.str.extract(r'([\d.]+),', expand=False).astype('float')
df_shanghai['latitude'] = df_shanghai.location.str.extract(r',([\d.]+)', expand=False).astype('float')
1.4 Computing the distance
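The distance code in this section goes through 3D Cartesian coordinates and a chord length. To cross-check its output, the haversine formula computes the same great-circle distance directly from the angles. The sketch below is not part of the original post: `haversine_km` is a hypothetical name, and the Earth radius of 6371 km is the same value the post uses for `R`.

```python
import math

EARTH_RADIUS_KM = 6371  # same radius as R in this post

def haversine_km(lng1, lat1, lng2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lng1, lat1, lng2, lat2 = map(math.radians, (lng1, lat1, lng2, lat2))
    dlng = lng2 - lng1
    dlat = lat2 - lat1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

# A quarter of the equator should be a quarter of the circumference
print(haversine_km(0, 0, 90, 0))
print(math.pi / 2 * EARTH_RADIUS_KM)
```

If the chord-based `get_distance` and this function disagree for the same pair of points, the coordinate conversion is the first place to look.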
# Get the longitude and latitude of People's Square
url = 'https://restapi.amap.com/v3/geocode/geo?key=c00a9fc63a97c64fe63bf1ff051a285e&address=上海市{}&city=上海市'
location = '人民广场'
res = requests.get(url.format(location.rstrip())).text
rg_location = json.loads(res).get('geocodes')[0].get('location')
matchObj = re.search(r'([\d.]+),', rg_location)
rg_longitude = float(matchObj.group(1))  # longitude
matchObj = re.search(r',([\d.]+)', rg_location)
rg_latitude = float(matchObj.group(1))  # latitude

# Functions used to compute the distance
def angle2radian(x):
    "Convert degrees to radians"
    return x * math.pi / 180

def rec2sphere(lng1, lat1):
    "Spherical coordinates (in radians) -> Cartesian coordinates"
    R = 6371
    x1 = R * math.cos(lat1) * math.cos(lng1)
    # Fixed: y depends on sin(lng), not sin(lat)
    y1 = R * math.cos(lat1) * math.sin(lng1)
    z1 = R * math.sin(lat1)
    return x1, y1, z1

def get_chord_length(x1, y1, z1, x2, y2, z2):
    "Straight-line distance in Cartesian coordinates"
    dx = x1 - x2
    dy = y1 - y2
    dz = z1 - z2
    length = np.sqrt(dx**2 + dy**2 + dz**2)
    return length

def get_distance(lng1, lat1, lng2, lat2):
    "Take two longitude/latitude pairs in degrees and return the distance in km"
    R = 6371
    # Degrees to radians
    lng1 = angle2radian(lng1)
    lat1 = angle2radian(lat1)
    lng2 = angle2radian(lng2)
    lat2 = angle2radian(lat2)
    # Spherical to Cartesian coordinates
    x1, y1, z1 = rec2sphere(lng1, lat1)
    x2, y2, z2 = rec2sphere(lng2, lat2)
    # Distance in 3D space (the chord of the great circle)
    length = get_chord_length(x1, y1, z1, x2, y2, z2)
    # Arc length along the great circle
    alpha = math.asin(length / 2 / R) * 2
    r = alpha * R
    return r

# Compute the distance for every row of the DataFrame
distance_dict = {}
for i in trange(df_shanghai.shape[0]):
    lng1 = df_shanghai.iloc[i].longitude
    lat1 = df_shanghai.iloc[i].latitude
    distance = get_distance(lng1, lat1, rg_longitude, rg_latitude)
    distance_dict[i] = distance
df_shanghai['distance_rg'] = pd.Series(distance_dict)
1.5 Data storage
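# Drop the raw API response and the combined location string before saving
df_shanghai.drop(columns=['api', 'location'], inplace=True)
df_shanghai.to_csv('sh_ershou_clean_v2.csv')
Step 2:
"Clustering Attempt with k-means, Step 2: Training the Clustering Model and Visualizing the Results", nikita_zj's blog on CSDN. https://blog.csdn.net/nikita_zj/article/details/122343615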