pip install bs4
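Requirements analysis
Scrape the posts on the first 50 pages of the main board of Hupu's 步行街 (Buxingjie) forum. First, use requests to fetch the response for each page, then use beautifulsoup to parse the response body, response.text.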
import time

import requests
from bs4 import BeautifulSoup as bs

url = "https://bbs.hupu.com/bxj-postdate"
useragent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"
# Request headers; replace 'your cookie' with your own cookie value
header = {
    'user-agent': useragent,
    'cookie': 'your cookie'
}

for page in range(50):
    # Pages are numbered from 1: bxj-postdate-1 ... bxj-postdate-50
    page_url = url + '-' + str(page+1)
    print(f'------------------ 第{page+1}页内容 {page_url}-------------------')
    response = requests.get(page_url, headers=header)
    bs_info = bs(response.text, 'html.parser')
    # The post list is a <ul class="for-list">; each <li> is one post
    ul = bs_info.find('ul', attrs={'class': 'for-list'})
    for li in ul.find_all('li'):
        title_div = li.find('div', attrs={'class': 'titlelink box'})
        a_tag = title_div.find('a', attrs={'class': 'truetit'})
        author_div = li.find('div', attrs={'class': 'author box'})
        author_link = author_div.find('a', attrs={'class': 'aulink'})
        # The second <a> inside the author box holds the post date
        pub_date = author_div.find_all('a')[1].text
        # href already starts with '/', so the printed URL contains '//'
        print('https://bbs.hupu.com/'+a_tag.get('href'), a_tag.text.strip(), author_link.text, pub_date)
    # Pause between pages to avoid hammering the server
    time.sleep(1)
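The parsing step can be tried without touching the network. The sketch below is not part of the original script: the sample markup is hypothetical, modeled only on the class names the script looks up (for-list, truetit, author box), and parse_posts is a helper name made up for illustration. It also shows one way to skip list items that lack the expected tags instead of failing on a None result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names used in the script above;
# the real page may be structured differently.
SAMPLE_HTML = """
<ul class="for-list">
  <li>
    <div class="titlelink box"><a class="truetit" href="/1.html"> First post </a></div>
    <div class="author box"><a class="aulink">alice</a><a>2020-07-07</a></div>
  </li>
  <li><div class="titlelink box"></div></li>
</ul>
"""


def parse_posts(html):
    """Return (href, title, author, date) tuples, skipping incomplete items."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", class_="for-list")
    if ul is None:
        # No post list found: the layout changed or the request was blocked
        return []
    posts = []
    for li in ul.find_all("li"):
        a_tag = li.find("a", class_="truetit")
        author_div = li.find("div", class_="author")
        if a_tag is None or author_div is None:
            continue  # not a regular post entry
        links = author_div.find_all("a")
        if len(links) < 2:
            continue  # author or date link is missing
        posts.append((a_tag.get("href"), a_tag.text.strip(),
                      links[0].text, links[1].text))
    return posts


print(parse_posts(SAMPLE_HTML))
```

Separating parsing from fetching like this makes it possible to check the selectors against a saved page before running all 50 requests.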
Results
------------------ 第1页内容 https://bbs.hupu.com/bxj-postdate-1-------------------
https://bbs.hupu.com//36407012.html 一线城市基本工资很低 呼噗呼噗我来啦 2020-07-07
https://bbs.hupu.com//36407009.html 抽象人是这样参加高考的 虎扑JR0132279583 2020-07-07
耗时0.19072270393371582
------------------ 第50页内容 https://bbs.hupu.com/bxj-postdate-50-------------------
https://bbs.hupu.com//36390518.html 平板看斗鱼直播很卡怎么办 听风看雨卧亭中 2020-07-06
https://bbs.hupu.com//36390517.html 男生夏天洗澡不到十分钟不是很正常吗? 胡飞飞1013 2020-07-06
耗时0.25310277938842773
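The URLs in the output contain a double slash (https://bbs.hupu.com//36407012.html) because each href already begins with '/' and the script concatenates it onto a base that ends with '/'. One possible fix, shown here as a small sketch, is to let the standard library's urljoin combine the two parts. BASE_URL and post_url are names introduced for illustration only.

```python
from urllib.parse import urljoin

# Base address of the forum, as used in the script's print statement
BASE_URL = "https://bbs.hupu.com/"


def post_url(href):
    """Join a post's href onto the base URL without duplicating slashes."""
    return urljoin(BASE_URL, href)


print(post_url("/36407012.html"))
```

urljoin also handles an href without the leading slash, so the result is the same either way.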
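The 耗时 (elapsed time) lines in the output are not printed by the script as listed, so how they were measured is unknown. A minimal sketch of one way to time a single step, assuming a hypothetical helper named timed, could look like this:

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the crawler this could wrap the per-page fetch and parse
result, elapsed = timed(sum, range(1000))
print(f'耗时{elapsed}')
```

In the crawler, the same pattern could wrap the body of the page loop, with the measurement taken before the time.sleep(1) call so the pause is not counted.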
References
requests official documentation
BeautifulSoup official documentation
Next article:
Python Crawler Basics (2): Scraping a forum post list with requests and xpath