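HTML content search methods in the bs4 library
I. Information extraction example
Extract all URL links from an HTML page
Approach: 1) find all <a> tags;
2) parse each <a> tag and extract the link stored in its href attribute.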
>>> import requests
>>> r= requests.get("https://python123.io/ws/demo.html")
>>> demo=r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p><b>The demo python introduces several python courses.</b></p>\r\n<p>Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" >Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p>
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p>
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" >
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
>>> for link in soup.find_all('a'):
...     print(link.get("href"))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
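The two steps above can be wrapped in a small helper. This is a minimal sketch: `extract_links` and `SAMPLE_HTML` are hypothetical names, and the markup is made-up sample data rather than the real demo page, so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real demo page).
SAMPLE_HTML = """<html><body>
<p>Courses:
<a name="top">Top</a>
<a href="http://example.com/basic">Basic</a> and
<a href="http://example.com/advanced">Advanced</a>.</p>
</body></html>"""


def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: find all <a> tags; href=True skips anchors without an href.
    # Step 2: read the href attribute of each match.
    return [a["href"] for a in soup.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML)
print(links)
```

Filtering with `href=True` avoids the `None` values that `link.get("href")` would return for anchors such as `<a name="top">`.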
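II. HTML content search methods in the bs4 library
<>.find_all(name, attrs, recursive, string, **kwargs) searches the soup object (or any tag inside it) for matching content.
It returns a list (a bs4 ResultSet) that stores the matches.
1. name: the tag name to search for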
>>> soup.find_all('a')
[<a href="http://www.icourse163.org/course/BIT-268001" >Advanced Python</a>]
>>> for tag in soup.find_all(True): # passing True as the tag name matches every tag in the soup
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
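>>> for tag in soup.find_all(re.compile('b')): # the pattern is applied with search(), so this matches every tag whose name contains "b"; use '^b' to require names that start with "b"
...     print(tag.name)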
...
body
b
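The difference between a pattern that matches anywhere in the tag name and one anchored at the start can be seen in a short sketch. The markup below is hypothetical sample data chosen so that one tag name (`sub`) contains "b" without starting with it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration).
html = "<html><body><p><b>bold</b> and <sub>low</sub></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: any tag whose name contains "b".
contains_b = [tag.name for tag in soup.find_all(re.compile("b"))]

# Anchored pattern: only tags whose name starts with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# A list of names matches any tag named in the list.
several = [tag.name for tag in soup.find_all(["b", "sub"])]

print(contains_b, starts_with_b, several)
```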
2. attrs: a value to match against tag attributes; a specific attribute can also be named as a keyword argument
>>> soup.find_all('p','course')
[<p>Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" >Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a href="http://www.icourse163.org/course/BIT-268001" >Advanced Python</a>]
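A sketch of the common attribute filters follows. The markup, class names and ids are hypothetical sample data. Note that a plain string must match the attribute value exactly, while a compiled regular expression allows partial matches.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration).
html = """<p class="title">Title</p>
<p class="course">Courses:
<a href="/basic" id="link1">Basic</a>
<a href="/advanced" id="link2">Advanced</a></p>"""
soup = BeautifulSoup(html, "html.parser")

# A second positional string is matched against the CSS class.
by_class = soup.find_all("p", "course")
# The same search written with the class_ keyword.
by_class_kw = soup.find_all("p", class_="course")

# Keyword arguments filter on the attribute of that name.
exact_id = soup.find_all(id="link1")
# A plain string is an exact match, so "link" alone finds nothing.
partial_id = soup.find_all(id="link")
# A compiled pattern matches any id containing "link".
regex_id = soup.find_all(id=re.compile("link"))

# attrs accepts a dict, useful for names that are not valid keywords.
by_dict = soup.find_all(attrs={"id": "link2"})

print([t["id"] for t in regex_id])
```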
3. recursive: whether to search all descendants; defaults to True
>>> soup.find_all('a')
[<a href="http://www.icourse163.org/course/BIT-268001" >Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
This shows that the soup root has no <a> tag among its direct children; the <a> tags sit deeper in the tree, among its descendants.
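The effect of `recursive=False` is easier to see on a nested structure. This is a minimal sketch on hypothetical markup in which one `<p>` is a direct child of a `<div>` and another is nested one level deeper.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration).
html = ("<html><body><div><p>outer</p>"
        "<section><p>inner</p></section></div></body></html>")
soup = BeautifulSoup(html, "html.parser")

# The only direct child of the soup root is <html>.
top_level = [t.name for t in soup.find_all(True, recursive=False)]

div = soup.div
# Default (recursive=True): every <p> below the div, at any depth.
all_p = [p.get_text() for p in div.find_all("p")]
# recursive=False: only <p> tags that are direct children of the div.
child_p = [p.get_text() for p in div.find_all("p", recursive=False)]

print(top_level, all_p, child_p)
```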
4. string: a search string matched against the text between <>...</>
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p><b>The demo python introduces several python courses.</b></p>
<p>Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" >Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
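A short sketch of how the string argument behaves, on hypothetical markup: a plain string must equal the whole text node, while a compiled pattern matches any text node that contains it (case-sensitively unless a flag says otherwise).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration).
html = ("<html><head><title>A python demo</title></head>"
        "<body><p>Learn Python</p><p>python basics</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

# Exact match on the full text of a node.
exact = [str(s) for s in soup.find_all(string="Learn Python")]
# A plain string is not a substring search, so this finds nothing.
partial = [str(s) for s in soup.find_all(string="python")]
# A compiled pattern matches text nodes containing lowercase "python".
regex = [str(s) for s in soup.find_all(string=re.compile("python"))]
# re.I makes the pattern case-insensitive.
any_case = [str(s) for s in soup.find_all(string=re.compile("python", re.I))]

print(exact, partial, regex, any_case)
```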
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
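The call shorthand can be checked directly. A minimal sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration).
soup = BeautifulSoup('<p><a href="/x">x</a><a href="/y">y</a></p>',
                     "html.parser")

# Calling the soup object is shorthand for soup.find_all(...).
via_call = soup("a")
via_method = soup.find_all("a")

# The same shorthand works on any tag, here the <p> element.
via_tag_call = soup.p("a")

print([a["href"] for a in via_call])
```

The explicit `find_all` spelling is usually easier to read; the shorthand mainly saves typing in interactive sessions.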
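Seven extension methods
<>.find(): searches like find_all() but returns only the first match, or None if nothing matches
<>.find_parents(): searches the ancestors and returns a list
<>.find_parent(): searches the ancestors and returns the first match
<>.find_next_siblings(): searches the following siblings and returns a list
<>.find_next_sibling(): searches the following siblings and returns the first match
<>.find_previous_siblings(): searches the preceding siblings and returns a list
<>.find_previous_sibling(): searches the preceding siblings and returns the first match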
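The seven methods can be tried on a small tree. This is a sketch on hypothetical markup with two nested `<div>` elements and four sibling `<p>` tags; the ids are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption for illustration).
html = ('<div id="outer"><div id="inner">'
        '<p id="first">1</p><p id="second">2</p>'
        '<p id="third">3</p><p id="fourth">4</p>'
        '</div></div>')
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")             # first match only
missing = soup.find("table")         # None when nothing matches
third = soup.find(id="third")

# Ancestors are returned nearest first.
nearest_div = third.find_parent("div")["id"]
all_divs = [d["id"] for d in third.find_parents("div")]

# Following siblings, in document order.
next_one = third.find_next_sibling("p")["id"]
next_all = [p["id"] for p in third.find_next_siblings("p")]

# Preceding siblings, nearest first.
prev_one = third.find_previous_sibling("p")["id"]
prev_all = [p["id"] for p in third.find_previous_siblings("p")]

print(nearest_div, all_divs, next_all, prev_all)
```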