Python offers several ways to parse HTML documents. The most commonly used ones are described below.
1. Using the BeautifulSoup library
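BeautifulSoup is a Python library for parsing HTML and XML documents. It is commonly used in web scrapers, where it makes extracting the information you need from a page straightforward. To use BeautifulSoup, install it first: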
pip install beautifulsoup4
Next, the following code parses an HTML document:
from bs4 import BeautifulSoup

html_doc = """
<html>
<head>
<title>Page title</title>
</head>
<body>
<p class="title"><b>Article title</b></p>
<p class="content">This is a simple example HTML document.</p>
<a href="http://example.com/link1" class="link">Link 1</a>
<a href="http://example.com/link2" class="link">Link 2</a>
</body>
</html>
"""

# Create a BeautifulSoup object, passing the HTML document as the argument
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the page title
title = soup.title.string
print("Page title:", title)

# Get the article title
article_title = soup.find('p', class_='title').b.string
print("Article title:", article_title)

# Get all links
links = soup.find_all('a', class_='link')
for link in links:
    print("Link:", link['href'], "text:", link.string)
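Real pages often lack the elements a script expects, and `find()` returns `None` in that case, so a chained access such as `.b.string` would raise `AttributeError`. The following is a minimal sketch of a more defensive style, using CSS selectors through `select()` and `select_one()`. The `div.footer` element is a hypothetical one, chosen only because it is absent from the sample document.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>Page title</title></head>
  <body>
    <p class="title"><b>Article title</b></p>
    <a href="http://example.com/link1" class="link">Link 1</a>
    <a href="http://example.com/link2" class="link">Link 2</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() takes a CSS selector and returns a list (empty when nothing matches).
hrefs = [a["href"] for a in soup.select("a.link")]

# find() returns None when there is no match, so check before dereferencing.
footer = soup.find("div", class_="footer")  # hypothetical element, absent here
footer_text = footer.get_text(strip=True) if footer is not None else None

# select_one() returns the first match, or None when there is none.
heading = soup.select_one("p.title b")
heading_text = heading.get_text(strip=True) if heading is not None else None

print(hrefs)
print(footer_text)
print(heading_text)
```

Checking for `None` explicitly keeps a missing element from aborting the whole extraction, which tends to matter when the same script runs over many differently structured pages.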
2. Using the lxml library
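lxml is a high-performance Python library for processing XML and HTML documents. It is built on the C libraries libxml2 and libxslt, which makes it very fast. To use lxml, install it first: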
pip install lxml
Next, the following code parses an HTML document:
from lxml import etree

html_doc = """
<html>
<head>
<title>Page title</title>
</head>
<body>
<p class="title"><b>Article title</b></p>
<p class="content">This is a simple example HTML document.</p>
<a href="http://example.com/link1" class="link">Link 1</a>
<a href="http://example.com/link2" class="link">Link 2</a>
</body>
</html>
"""

# Parse the HTML document; fromstring() returns the root <html> element
root = etree.fromstring(html_doc, parser=etree.HTMLParser())

# Get the page title (<title> is nested inside <head>, so search all descendants)
title = root.find('.//title').text
print("Page title:", title)

# Get the article title
article_title = root.find('.//p[@class="title"]/b').text
print("Article title:", article_title)

# Get all links
links = root.xpath('//a[@class="link"]')
for link in links:
    print("Link:", link.get('href'), "text:", link.text)
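Two details of the lxml API are easy to miss: `find()` with a bare tag name only looks at direct children, and XPath can return attribute values directly. The sketch below illustrates both on a shortened copy of the sample document. The `h1` lookup is a hypothetical query for an element that does not exist, included to show the `default` argument of `findtext()`.

```python
from lxml import etree

html_doc = """
<html>
  <head><title>Page title</title></head>
  <body>
    <p class="title"><b>Article title</b></p>
    <a href="http://example.com/link1" class="link">Link 1</a>
    <a href="http://example.com/link2" class="link">Link 2</a>
  </body>
</html>
"""

root = etree.fromstring(html_doc, parser=etree.HTMLParser())

# A bare tag name matches direct children only; <title> sits under <head>,
# so this lookup on the <html> root finds nothing.
direct_child = root.find("title")

# ".//" searches all descendants; findtext() returns the default instead of
# raising when the element is absent.
title = root.findtext(".//title", default="")
missing = root.findtext(".//h1", default="(no h1)")  # hypothetical query

# An XPath ending in /@href yields the attribute values themselves.
hrefs = root.xpath('//a[@class="link"]/@href')

print(direct_child, title, missing, hrefs)
```

Using `findtext()` with a default avoids the `AttributeError` that `root.find(...).text` raises when the element is missing.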
3. Using regular expressions (not recommended)
Regular expressions can be used to parse HTML, but this is not a recommended approach: HTML structure is complex, and a regular expression can rarely handle every case. If you really do need regular expressions, use Python's built-in re module.
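As a minimal sketch of the regular-expression approach and of why it is fragile, the example below extracts links with Python's `re` module. The pattern `LINK_RE` and the `variant` markup are illustrative assumptions, not part of any library: the pattern assumes `href` is the first attribute, the value is double-quoted, and the link text contains no nested `<a>` tags.

```python
import re

html_doc = """
<html>
  <body>
    <a href="http://example.com/link1" class="link">Link 1</a>
    <a href="http://example.com/link2" class="link">Link 2</a>
  </body>
</html>
"""

# Naive pattern: href must come first and must use double quotes.
LINK_RE = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)

# Works on markup that happens to match the assumptions above.
links = LINK_RE.findall(html_doc)
for href, text in links:
    print("Link:", href, "text:", text)

# Equivalent HTML with a different attribute order and single quotes
# is silently missed by the same pattern.
variant = "<a class='link' href='http://example.com/link3'>Link 3</a>"
missed = LINK_RE.findall(variant)
print("Matches in variant:", missed)
```

The second call returns no matches even though the markup is valid HTML. A parser such as BeautifulSoup or lxml handles both forms, which is the main reason regular expressions are discouraged for this task.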