HTML Content Search Methods Based on the bs4 Library

Published: 2022/11/16 | Category: HTML | Author: 老码农

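I. Information Extraction Example

Extract all URL links from an HTML page

Approach: 1) Find all <a> tags

   2) Parse each <a> tag and extract the link stored in its href attribute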

>>> import requests
>>> r = requests.get("https://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, 'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">
    Basic Python
   </a>
   and
   <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

>>> for link in soup.find_all('a'):
...     print(link.get("href"))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
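
The two-step approach above can also be wrapped in a small function. This is only a sketch: `extract_links` is a hypothetical helper name, and `HTML` is an inline stand-in modeled on the demo page so that the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page, so no download is needed.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)


def extract_links(html):
    """Return the href value of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: find all <a> tags; href=True skips anchors without an href attribute.
    anchors = soup.find_all("a", href=True)
    # Step 2: read the href attribute of each tag.
    return [a["href"] for a in anchors]


links = extract_links(HTML)
for url in links:
    print(url)
```

Filtering with `href=True` is a design choice: with a plain `find_all("a")`, `link.get("href")` would yield `None` for anchors that carry no href.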

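II. HTML Content Search Methods Based on the bs4 Library

<>.find_all(name, attrs, recursive, string, **kwargs) searches a soup variable for the information it contains.

It returns a list that stores the search results.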
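
As a quick check of the return value, the sketch below inspects what find_all() gives back for a match and for a miss. The `HTML` constant is an assumed inline stand-in for the demo page. In the bs4 releases I am aware of, the concrete type is `ResultSet`, a subclass of `list`, so normal list operations apply.

```python
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="course">Python courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

found = soup.find_all("a")        # two <a> tags match
missing = soup.find_all("table")  # nothing matches

# The result supports len(), indexing and iteration like any list.
print(type(found).__name__, len(found))
print(found[0].get_text())

# A miss is an empty result, not None and not an exception.
print(len(missing))
```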

1. name: a filter on tag names

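>>> soup.find_all('a')
[<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>, <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]
>>> for tag in soup.find_all(True):  # passing True as the tag name matches every tag in the current soup
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> for tag in soup.find_all(re.compile('b')):  # the regex is applied with search(), so this matches every tag whose name contains "b"; use '^b' to require names that start with "b"
...     print(tag.name)
...
body
b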
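
The forms the name argument can take can be compared side by side. This is a sketch over an assumed inline copy of the demo markup (`HTML`), and `names` is a hypothetical helper that just collects tag names.

```python
import re
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")


def names(tags):
    """Collect the tag names of a result list (hypothetical helper)."""
    return [tag.name for tag in tags]


by_string = names(soup.find_all("p"))               # exact tag name
by_list = names(soup.find_all(["title", "b"]))      # any name in the list
by_prefix = names(soup.find_all(re.compile("^b")))  # names that start with "b"
by_substr = names(soup.find_all(re.compile("t")))   # names that contain "t"

print(by_string, by_list, by_prefix, by_substr)
```

The last two lines show why the anchor matters: without `^`, the pattern matches anywhere in the tag name, so `re.compile("t")` also picks up `html`.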

2. attrs: a filter on tag attribute values; a specific attribute can be named in the search

>>> soup.find_all('p', 'course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>]

>>> soup.find_all(id='link1')
[<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>]
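
Attribute filters can be written in several ways. The sketch below, again over an assumed inline copy of the demo markup, compares the positional class shortcut, the `class_` keyword, exact and regex matching on `id`, and the `attrs` dictionary.

```python
import re
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# A string passed as the second positional argument is treated as a CSS class.
by_class_positional = soup.find_all("p", "course")
by_class_keyword = soup.find_all("p", class_="course")  # class is a Python keyword, hence class_

exact_id = soup.find_all(id="link1")                # a string must equal the attribute value
partial_id = soup.find_all(id="link")               # so a partial value matches nothing
regex_id = soup.find_all(id=re.compile("link"))     # a regex can match part of the value

by_attrs_dict = soup.find_all(attrs={"class": "py2"})  # explicit attribute dictionary

print(len(by_class_positional), len(exact_id), len(partial_id), len(regex_id))
```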

3. recursive: whether to search all descendants; defaults to True

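>>> soup.find_all('a')
[<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>, <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)
[]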

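This shows that, starting from the soup root node, there are no <a> tags among its direct children; the <a> tags sit further down the tree, among its descendants.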
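
To make the effect of recursive=False concrete, the sketch below runs the same search from three starting points in an assumed inline copy of the demo markup. Only the direct children of the starting node are examined each time.

```python
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# The only direct child of the root is <html>, so no <a> is found there.
from_root = soup.find_all("a", recursive=False)
root_children = soup.find_all("html", recursive=False)

# <body> has the two <p> tags as direct children, but the <a> tags are one level deeper.
from_body = soup.body.find_all("a", recursive=False)
body_paragraphs = soup.body.find_all("p", recursive=False)

# Starting at the paragraph that contains the links, they are direct children.
course_p = soup.find("p", class_="course")
from_paragraph = course_p.find_all("a", recursive=False)

print(len(from_root), len(from_body), len(from_paragraph))
```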

4. string: a filter on the string content between <>...</>

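>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']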
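
The string argument can be sketched as follows, using an assumed inline copy of the demo markup. A plain string must match a text node exactly, a regular expression can match part of it, and combining `string` with a tag name returns the tags rather than the text. In Beautiful Soup versions before 4.4.0 this argument was called `text`.

```python
import re
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python is a wonderful general-purpose programming language. '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

# Exact match on a text node: returns the matching strings, not tags.
exact = [str(s) for s in soup.find_all(string="Basic Python")]

# Regex match is case-sensitive: lowercase "python" skips "Basic Python" and "Advanced Python".
lowercase = [str(s) for s in soup.find_all(string=re.compile("python"))]

# With a tag name, the result is the tags whose text matches.
link_tags = soup.find_all("a", string=re.compile("Python"))

print(exact)
print(lowercase)
print([tag["id"] for tag in link_tags])
```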

<tag>(..) is equivalent to <tag>.find_all(..)

soup(..) is equivalent to soup.find_all(..)
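
The call shorthand can be verified directly. This sketch, over an assumed inline copy of the demo markup, calls the soup object and a tag as if they were functions and compares the results with find_all().

```python
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

via_call = soup("a")             # shorthand
via_method = soup.find_all("a")  # explicit form

# The same shorthand works on any tag, with the same arguments as find_all().
titles = soup.body("p", class_="title")

print(via_call == via_method)
print([tag["id"] for tag in via_call])
print(len(titles))
```

The shorthand is compact in interactive sessions; in longer scripts the explicit find_all() call is usually easier to read.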

Seven extension methods

<>.find()

<>.find_parents()

<>.find_parent()

<>.find_next_siblings()

<>.find_next_sibling()

<>.find_previous_siblings()

<>.find_previous_sibling()
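
A short sketch of the seven methods listed above, run against an assumed inline copy of the demo markup. The singular forms return one result (or None when nothing matches), and the plural forms return a list.

```python
from bs4 import BeautifulSoup

# Assumption: an inline stand-in for the demo page.
HTML = (
    '<html><head><title>This is a python demo page</title></head><body>'
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Python courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
    ' and <a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>'
    '</body></html>'
)

soup = BeautifulSoup(HTML, "html.parser")

first = soup.find("a")              # first match only
second = soup.find("a", id="link2")
nothing = soup.find("table")        # None when there is no match

# Ancestors: nearest match, then all matches ordered from nearest to farthest.
parent_class = first.find_parent("p")["class"]
ancestor_names = [t.name for t in first.find_parents(["p", "body"])]

# Following siblings: later nodes that share the same parent.
next_id = first.find_next_sibling("a")["id"]
next_ids = [t["id"] for t in first.find_next_siblings("a")]

# Preceding siblings: earlier nodes that share the same parent.
prev_id = second.find_previous_sibling("a")["id"]
prev_ids = [t["id"] for t in second.find_previous_siblings("a")]

print(parent_class, ancestor_names, next_id, next_ids, prev_id, prev_ids)
```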

 
