Scraping and Saving Information from a Website: Jumping to a Specified Page
Requirement: for a given website, accept a user-specified page number, jump to that page, crawl the listings from there onward, and save the results to a text file.
Solution: combine Python's BeautifulSoup (for parsing the page source) with Selenium's Keys class (for typing the page number into the jump box and submitting it).
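The one non-obvious parsing step is that each table row's link carries no href; the detail-page address is hidden in an onclick attribute of the form detail('id','pw','tag');. A minimal sketch of that conversion in isolation (the helper name onclick_to_link is my own; the URL pattern and string surgery are taken from the full script below):

```python
def onclick_to_link(onclick):
    # Strip the "detail(" wrapper, the single quotes, and the trailing ");"
    # leaving a comma-separated  id,passwordFlag,tag  triple.
    paras = onclick.replace('detail(', '').replace("'", '')[0:-2]
    listparas = paras.split(',')
    return ('http://www.shenl.com.cn/public/mhwz/todetail'
            '?id=' + listparas[0]
            + '&isSearchPassWord=' + listparas[1]
            + '&tag=' + listparas[2])

print(onclick_to_link("detail('1001','0','2');"))
# http://www.shenl.com.cn/public/mhwz/todetail?id=1001&isSearchPassWord=0&tag=2
```

This is brittle by design, matching the original script: it assumes the onclick string always has exactly that shape.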
The code is as follows:
import sys
import time
import random
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

sys.setrecursionlimit(1000000)  # the crawler recurses once per page


def getQuestionsTotalLinks(driver):
    bs = BeautifulSoup(driver.page_source, 'lxml')
    AllInfo = bs.findAll('tr', {'class': 'bgcol'})
    for info in AllInfo:
        if info.find('a', {'class': 'xjxd_nr'}) is None:
            print("No useful info")
        else:
            # The detail link is hidden in an onclick of the form detail('id','pw','tag');
            paras = info.find('a', {'class': 'xjxd_nr'}).get('onclick').replace('detail(', '').replace("'", '')[0:-2]
            listparas = paras.split(',')
            innerlink = ('http://www.shenl.com.cn/public/mhwz/todetail?id=' + listparas[0]
                         + '&isSearchPassWord=' + listparas[1] + '&tag=' + listparas[2])
            # Collapse the row's visible text into tab-separated fields
            innerDetail = info.get_text().replace('\t', '').replace('\n', '|').split('|')
            while '' in innerDetail:
                innerDetail.remove('')
            f.write('\t'.join(innerDetail) + '\t' + innerlink + '\n')
    try:
        # Click the "next page" link; stop when it no longer exists
        driver.find_element_by_xpath("//a[contains(text(),'下一页')]").click()
    except NoSuchElementException:
        time.sleep(1)
        print("No more pages found")
        return
    time.sleep(random.randint(5, 20))  # random delay to throttle requests
    getQuestionsTotalLinks(driver)


if __name__ == '__main__':
    # Full crawl starting from the specified page number
    IsoTimeFormat = '%Y_%m_%d'
    f = open('G:\\temp\\total\\HefeiQuestion_Incr_' + time.strftime(IsoTimeFormat) + '.txt',
             'w', encoding='utf-8')
    driver = webdriver.Chrome("C:\\Program Files (x86)\\Google\\Chrome\\Application\\chromedriver.exe")
    driver.get("http://www.shenl.com.cn/public/mhwz/xjxdList")
    # Type the target page number into the jump box and submit it
    inputElement = driver.find_element_by_xpath("//input[@name='curPage']")
    inputElement.send_keys(Keys.BACKSPACE)
    inputElement.send_keys("2846")
    ele = driver.find_element_by_xpath("//input[@value='go']")
    ele.send_keys(Keys.ENTER)
    getQuestionsTotalLinks(driver)
    driver.close()
    f.close()

Summary
That is everything 生活随笔 has collected on scraping and saving information from a website with a specified jump page; I hope it helps you solve the problems you run into.