Fetching a website's HTML with the requests library
import requests

# Send a GET request; the response object carries the status code, headers, and body.
response = requests.get("https://jingyan.baidu.com/article/17bd8e52c76b2bc5ab2bb8a2.html#:~:text=1.%E6%89%93%E5%BC%80%E6%B5%8F%E8%A7%88%E5%99%A8F12%202.%E6%89%BE%E5%88%B0headers%E9%87%8C%E9%9D%A2%E7%9A%84cookie,3.%E5%A6%82%E6%9E%9C%E8%A6%81%E6%89%BE%E5%88%B0%E5%AF%B9%E5%BA%94%E7%9A%84%E7%82%B9%E5%87%BBcookie%204.%E8%BF%9E%E7%BB%AD%E4%B8%89%E6%AC%A1%E7%82%B9%E5%87%BB%E5%8F%B3%E9%94%AE%E5%A4%8D%E5%88%B6")
print(response)              # e.g. <Response [200]>
print(response.status_code)  # numeric HTTP status code
if 200 <= response.status_code < 400:
    ...
elif 400 <= response.status_code < 500:
    print("request failed: client error")
elif response.status_code >= 500:
    print("request failed: server error")
# response.ok is True when the status code is below 400.
if response.ok:
    print(response.text)
    ...
else:
    print("request failed")
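The status-code branches above can be folded into one small helper. This is a minimal sketch: the function name classify_status and the label strings are illustrative choices, not part of requests.

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse outcome label."""
    if 200 <= status_code < 400:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if status_code >= 500:
        return "server error"
    # 1xx codes fall outside the three branches used in these notes.
    return "informational"


for code in (200, 404, 503):
    print(code, classify_status(code))
```

In a real script the argument would be response.status_code, so the same branching can be reused for every request.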
import requests

# No custom headers here, so requests sends its default User-Agent; some sites may refuse it.
response = requests.get("https://movie.douban.com/top250")
print(response)
print(response.text)
import requests

# Send a browser-like User-Agent header along with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}
response = requests.get("https://movie.douban.com/top250", headers=headers)
print(response)
print(response.text)
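The difference between the two douban requests above is the User-Agent header. The sketch below prepares both requests without sending them, so the outgoing header can be inspected offline. The helper name build_request is hypothetical; Session.prepare_request is the requests call that merges the session's default headers with the ones passed in.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Prepare (but do not send) a GET request, optionally overriding User-Agent."""
    session = requests.Session()
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = requests.Request("GET", url, headers=headers)
    # prepare_request merges the session's default headers with ours.
    return session.prepare_request(request)


default_prepared = build_request("https://movie.douban.com/top250")
browser_prepared = build_request("https://movie.douban.com/top250", BROWSER_UA)

print(default_prepared.headers["User-Agent"])  # the library's own identifier
print(browser_prepared.headers["User-Agent"])  # the browser-like string
```

A site that filters on User-Agent sees the first value as a script and the second as a browser, which is presumably why the second request in the notes adds the header.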
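A brief introduction to HTML structure
HTML defines a web page's structure and content; the file is named xxx.html and is opened in a browser
CSS defines the page's styling
JavaScript defines the interaction logic between the user and the page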
<!DOCTYPE html>
<html>
<body>
<h1>title</h1>
<p>some texts</p>
</body>
</html>
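To see the nesting in the minimal page above, the sketch below walks it with Python's standard-library html.parser and prints each opening tag indented by its depth. The class name OutlineParser is an illustrative choice.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<body>
<h1>title</h1>
<p>some texts</p>
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag is closed; void tags such as <br> would
        # need special handling in this simple depth counter.
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_HTML)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The indented output shows html containing body, which in turn contains h1 and p side by side.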
Headings
<h1></h1>
<h2></h2>
<h3></h3>
<h4></h4>
<h5></h5>
<h6></h6>
Text paragraphs
<p></p>
Forced line break: <br>
Bold: <b></b>
Italic: <i></i>
Underline: <u></u>
Image: <img src="..." width="" height="">
Link: <a href="https://..." target="_self">text</a> (target sets how the link opens, e.g. in the current page or in a new page)
Containers: div is a block-level element that occupies its own line; span is an inline element
<div>
...
</div>
<span>
...
</span>
Lists: ol is an ordered list, ul is an unordered list
<ol>
  <li>chinese</li>
  <li>math</li>
</ol>
<ul>
  <li>chinese</li>
  <li>math</li>
</ul>
Tables: td holds one data cell
<table border="1"> border is one of the table attributes; it displays the borders
<table border="1">
  <thead>
    <tr>
      <td>tableheader1</td>
      <td>tableheader2</td>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>111</td>
      <td>2222</td>
    </tr>
    <tr>
      <td>333</td>
      <td>444</td>
    </tr>
  </tbody>
</table>
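When scraping, a table like the one above is usually turned back into rows. The sketch below does that with BeautifulSoup on an inline copy of the table; treating the first row as column names is an assumption made for this example.

```python
from bs4 import BeautifulSoup

TABLE_HTML = """
<table border="1">
  <thead>
    <tr><td>tableheader1</td><td>tableheader2</td></tr>
  </thead>
  <tbody>
    <tr><td>111</td><td>2222</td></tr>
    <tr><td>333</td><td>444</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(TABLE_HTML, "html.parser")

# One inner list per <tr>, one string per <td>.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

# Assumption for this sketch: the first row holds the column names.
header, *data = rows
records = [dict(zip(header, row)) for row in data]
print(records)
```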
The class attribute: helps group elements into categories
<p class="content">Give civilization to the years</p>
<p class="content">rather than years to civilization</p>
<p class="review">Five-star review!</p>
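The class attribute is what makes elements easy to pick out in a scraper. A minimal sketch, using three sample paragraphs with the same class names as above (the English text is only sample content):

```python
from bs4 import BeautifulSoup

CLASS_HTML = """
<p class="content">Give civilization to the years</p>
<p class="content">rather than years to civilization</p>
<p class="review">Five-star review!</p>
"""

soup = BeautifulSoup(CLASS_HTML, "html.parser")

# class_ (with a trailing underscore) filters on the HTML class attribute.
contents = [p.get_text() for p in soup.find_all("p", class_="content")]

# The attrs dictionary form is equivalent.
reviews = [p.get_text() for p in soup.find_all("p", attrs={"class": "review"})]

print(contents)
print(reviews)
```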
Scraping book prices and titles from a web page
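from bs4 import BeautifulSoup
import requests

# .text is an attribute holding the decoded body, not a method.
content = requests.get("http://books.toscrape.com/").text
soup = BeautifulSoup(content, "html.parser")

# Each price sits in a <p class="price_color"> element.
all_prices = soup.find_all("p", attrs={"class": "price_color"})
for price in all_prices:
    # Skip the first two characters, intended to drop the currency prefix.
    print(price.string[2:])

# Each title is the text of the <a> inside an <h3>.
all_titles = soup.find_all("h3")
for title in all_titles:
    all_links = title.find_all("a")
    for link in all_links:
        print(link.string)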
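The script above prints prices and titles in two separate passes. The sketch below pairs them per book instead, and runs on an inline sample so it works offline. The sample markup, the product_pod wrapper, the title attribute, and the helper name extract_books are assumptions for illustration, and the real page may differ. A regular expression replaces the [2:] slice so the result does not depend on how the currency symbol was decoded.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified listing markup used only for this sketch.
SAMPLE_LISTING = """
<article class="product_pod">
  <h3><a href="book-1.html" title="Sample Book One">Sample Book ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="book-2.html" title="Sample Book Two">Sample Book ...</a></h3>
  <p class="price_color">£53.74</p>
</article>
"""


def extract_books(html):
    """Return (title, price) pairs, one per <article> in the listing."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.find_all("article", attrs={"class": "product_pod"}):
        link = article.find("h3").find("a")
        price = article.find("p", attrs={"class": "price_color"})
        # Prefer the title attribute; fall back to the visible link text.
        title = link.get("title", link.string)
        # Keep only digits and the decimal point of the price text.
        amount = re.sub(r"[^\d.]", "", price.string)
        books.append((title, amount))
    return books


for title, amount in extract_books(SAMPLE_LISTING):
    print(title, amount)
```

To try it against the live site, the same function could be called with requests.get("http://books.toscrape.com/").text, provided the page's markup matches the assumptions above.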