

Python3 Crawler Project Collection: Douban Movie Top 250

Published: 2024/8/1 python 40 豆豆


Contents

- Preface
- Crawler overview
- Parsing
- Code example
- Data storage

GitHub repository: https://github.com/pasca520/Python3SpiderSet

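Preface

This series collects small crawler exercises from my day-to-day practice; they are intended as learning material.

Because the projects are written for learning, they try out as many modules as possible rather than aiming for the optimal solution.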

Crawler overview

Python libraries used in this example

Fetching module	requests
Parsing module	BeautifulSoup
Storage type	list (convenient for writing to a database)

Parsing

I summarized the BeautifulSoup parameters in a separate article: https://blog.csdn.net/qinglianchen0851/article/details/102860741
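To make the two lookups used in this project concrete, here is a minimal sketch of `find` and `find_all` run against a tiny inline HTML snippet. The snippet is hypothetical and heavily simplified; it only imitates the class names the crawler relies on (`grid_view`, `title`, `inq`), and it uses the built-in `html.parser` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that imitates two entries of the list page.
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <em>1</em>
    <span class="title">肖申克的救赎</span>
    <span class="inq">希望让人自由。</span>
  </li>
  <li>
    <em>2</em>
    <span class="title">霸王别姬</span>
  </li>
</ol>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
# name selects the tag name, attrs filters on attribute values.
grid = soup.find(name="ol", attrs={"class": "grid_view"})

# find_all() returns every matching tag as a list.
items = grid.find_all("li")

rows = []
for item in items:
    rank = item.find(name="em").get_text()
    title = item.find(name="span", attrs={"class": "title"}).get_text()
    inq = item.find(name="span", attrs={"class": "inq"})  # None for the second entry
    rows.append((rank, title, inq.get_text() if inq else None))

print(rows)
```

The second entry has no `inq` span, which is why the result of `find` has to be checked for `None` before calling `get_text()` on it.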

Code example

# -*- coding: utf-8 -*-
import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException
from bs4 import BeautifulSoup


# Crawler body: download one page and return its HTML, or None on failure
def get_page(url):
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Referer': 'https://movie.douban.com/',
    }
    try:
        # Without a timeout, requests waits indefinitely and ReadTimeout is never raised
        response = requests.get(url=url, headers=headers, timeout=10)
        return response.text
    except ReadTimeout:  # the request timed out
        print('Timeout')
    except ConnectionError:  # the network connection failed
        print('Connect error')
    except RequestException:  # base class of all requests errors
        print('Error')
    return None


# Parse one page of HTML into a list of dictionaries, one per movie
def parse_page(html):
    movies = []
    soup = BeautifulSoup(html, 'lxml')
    grid = soup.find(name="ol", attrs={"class": "grid_view"})
    if grid is None:  # unexpected page, for example an anti-crawler response
        return movies
    movie_list = grid.find_all("li")
    for movie in movie_list:
        rank = movie.find(name="em").getText()
        name = movie.find(name="span", attrs={"class": "title"}).getText()
        rating_num = movie.find(name="span", attrs={"class": "rating_num"}).getText()
        # Painful string splitting: this exercise deliberately avoids re, and plain
        # string methods turn out to be clumsy. The goal is to separate director,
        # lead actors, release time, country and movie types.
        # Assumption: Douban separates these fields with non-breaking spaces (\xa0).
        bd = (movie.find(name="p").getText().strip()
              .replace('\xa0\xa0\xa0', '\n')
              .replace('\xa0/\xa0', '\n')
              .split('\n'))
        bd = [part.strip() for part in bd]  # some parts carry stray whitespace
        # Some entries list no lead actor; insert a placeholder so the indexes line up
        if len(bd) == 4:
            bd.insert(1, 'not found')
        inq = movie.find(name="span", attrs={"class": "inq"})
        # Handle entries that have no one-line quote
        inq = inq.getText() if inq else 'none'
        # Build a new dictionary for every movie (reusing one shared dictionary
        # would overwrite earlier entries); dictionaries are easy to store in a database
        movies.append({
            'rank': rank,
            'name': name,
            'director': bd[0],
            'actor': bd[1],
            'release_time': bd[2],
            'country': bd[3],
            'movie_types': bd[4],
            'rating_num': rating_num,
            'inq': inq,
        })
    return movies


if __name__ == '__main__':
    douBanList = []
    for start in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={}&filter='.format(start)
        html = get_page(url)
        if html is None:  # skip pages that failed to download
            continue
        douBanList.extend(parse_page(html))
    print(douBanList)
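The string handling inside `parse_page` is the fiddliest part of the crawler, so it may help to look at it in isolation. The sketch below is not the author's code: `split_info` is a hypothetical helper, and it assumes the info paragraph has two lines, with three non-breaking spaces (`\xa0`) between director and actors and `\xa0/\xa0` between release time, country and movie types. The sample strings are hand-written to match that assumed layout.

```python
NBSP = "\xa0"  # non-breaking space, assumed to be the separator used on the page


def split_info(text, missing="N/A"):
    """Split a movie's info paragraph into its five fields (hypothetical helper)."""
    # Drop empty lines and surrounding whitespace; two lines are expected.
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    people = lines[0] if lines else ""
    facts = lines[1] if len(lines) > 1 else ""

    # partition() returns an empty actor string when the separator is absent.
    director, _, actor = people.partition(NBSP * 3)

    parts = [part.strip() for part in facts.split(NBSP + "/" + NBSP)]
    parts += [""] * (3 - len(parts))  # pad so unpacking never fails
    release_time, country, movie_types = parts[:3]

    return {
        "director": director.strip() or missing,
        "actor": actor.strip() or missing,
        "release_time": release_time or missing,
        "country": country or missing,
        "movie_types": movie_types or missing,
    }


# Hand-written sample in the assumed layout.
SAMPLE = (
    "\n    导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...\n"
    "    1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n"
)
# Hypothetical entry whose first line was cut off before the actors.
NO_ACTOR = "导演: Example Director /...\n    2000\xa0/\xa0Exampleland\xa0/\xa0Drama"

info = split_info(SAMPLE)
info_no_actor = split_info(NO_ACTOR)
print(info)
print(info_no_actor)
```

Compared with chaining `replace` calls and fixing up the list length afterwards, returning named fields with a placeholder for anything missing avoids depending on the exact number of parts.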

Data storage

The result is a plain list in which each entry holds one movie's information as a dictionary.

Done!
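Since the list format was chosen to make database storage easy, the following sketch shows one way such a list could be written to SQLite with the standard-library `sqlite3` module. The table name `top250`, the column names, the helper `save_movies` and the two sample entries are all made up for illustration; only a few of the crawler's fields are included to keep the example short.

```python
import sqlite3

# Made-up entries in the same shape as the crawler's output (all values are strings).
movies = [
    {"rank": "1", "name": "Example Movie A", "rating_num": "9.0", "inq": "none"},
    {"rank": "2", "name": "Example Movie B", "rating_num": "8.5", "inq": "A sample quote."},
]


def save_movies(conn, movies):
    """Create the table if needed and insert one row per movie dictionary."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS top250 ("
        "movie_rank INTEGER PRIMARY KEY, name TEXT, rating_num REAL, inq TEXT)"
    )
    # Convert the scraped strings to numbers before inserting.
    rows = [
        {**m, "rank": int(m["rank"]), "rating_num": float(m["rating_num"])}
        for m in movies
    ]
    # Named placeholders (:key) let each dictionary be passed directly as parameters.
    conn.executemany(
        "INSERT OR REPLACE INTO top250 (movie_rank, name, rating_num, inq) "
        "VALUES (:rank, :name, :rating_num, :inq)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
save_movies(conn, movies)
stored = conn.execute(
    "SELECT movie_rank, name, rating_num FROM top250 ORDER BY movie_rank"
).fetchall()
print(stored)
```

`INSERT OR REPLACE` keyed on the rank means that re-running the crawler updates existing rows instead of failing on duplicates.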
