Preface

The final group project for my big data elective included a web-crawling task, and I volunteered for it, hoping it would push me to finally learn the crawling techniques I had long heard about but never tried. Here I share what I learned, along with some experience and code from the project, in the hope that it helps anyone who needs to crawl data from Bilibili, Zhihu, or Weibo.

I referred to several articles along the way; the following three links are where the source code came from:
Zhihu crawler
Bilibili crawler
Weibo crawler

All of the complete source code is in my GitHub repository:
Full crawler code on GitHub

Basics

If you have studied computer networks, you should already have a rough idea of how a browser works. (What follows is my own understanding and may not be perfectly accurate.) When the browser visits a URL, it sends a request; the server receives it and returns the site's HTML, and what we finally see is that HTML after rendering. The information we want is, of course, contained somewhere in that HTML.

A crawler simply imitates the browser: it rapidly sends requests to the pages we care about, receives the HTML, extracts the needed information directly from it, and then moves on to the next page. So everything a crawler can collect is information a human could also see by hand. But because a crawler is far faster than a real person, it puts load on the site, which is why most sites do not welcome crawling; during crawling we therefore disguise our requests so the site cannot easily tell they come from a crawler. Also remember to stay within the law and not do anything illegal.
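To make that concrete, the core of a crawler is only a few lines; here is a minimal sketch (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Ask for the page exactly the way a browser would, just without rendering it
resp = requests.get("https://example.com", timeout=10)   # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

# Pull the piece of information we want out of the returned HTML
print(soup.title.text if soup.title else "no <title> found")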

About Cookies

Most sites require you to supply a Cookie before they can be crawled. As I understand it, a Cookie is the credential you get after logging into a site: if you attach it to a request, then from the site's point of view the request comes from a logged-in user. Many sites now only show full content to logged-in users, so without a Cookie you simply cannot crawl them; to some extent, requiring a Cookie also acts as an anti-crawling measure.
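In code this just means copying the Cookie string from the browser into the request headers; a rough sketch (both values are placeholders):

import requests

cookie = "SESSIONID=xxxx; other_field=yyyy"   # placeholder: paste your own Cookie here

headers = {
    "cookie": cookie,
    "user-agent": "Mozilla/5.0 ...",          # placeholder UA, more on this below
}
# To the site, this request now looks like it comes from a logged-in user
resp = requests.get("https://www.zhihu.com", headers=headers)
print(resp.status_code)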

Finding your Cookie

So how do you find your own Cookie? Open the target site in a browser (I use Edge as the example), press F12 to open the developer tools, switch to the Network tab, and make sure "Disable cache" is checked.

The browser now captures traffic much like Wireshark. Press F5 to reload the page and you will see it capture a large number of requests. Select Fetch/XHR in the filter bar, click on a few of the requests, and scroll down: in the request headers you will find the Cookie field.

The long mosaicked string in the screenshot above is the Zhihu Cookie. (Cookies are private: having someone else's Cookie is essentially the same as being logged into their account, so protect yours and never expose it online.)
If you scroll down a little further, you will also see the UA (User-Agent) field.

This field describes your machine and browser, so a crawler usually sends it as well to look more like a real browser. To blend in even better, you can prepare several UAs and pick one at random for each request.
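A tiny sketch of that rotation (the UA strings here are truncated placeholders; the full scripts below carry a proper pool):

import random
import requests

# keep a pool of User-Agent strings and pick a different one for every request
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]

headers = {"user-agent": random.choice(user_agents)}
resp = requests.get("https://www.bilibili.com", headers=headers)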

Bilibili

Features

You supply the list of links you want to crawl; for each linked video the script collects the title, tags, uploader, view count, likes, coins, favorites, and so on. No Cookie is needed.

Overview

For crawling Bilibili videos I used the source code from the repository below:
Bilibili crawler
It expects you to first prepare a file containing the Bilibili links to crawl; running the code then collects a wide range of information for each link, and it also records which links failed.

As for how to prepare the list of links: my approach was to find a GitHub repository where someone archives Bilibili's daily popular rankings, which record the links of each day's trending videos, extract those links into a file, and then start crawling.
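For reference, idlist.txt is simply one video per line, and the script accepts either a bare BV id or a full link (the ids below are made up):

BV1xxxxxxxxx
BV1yyyyyyyyy
https://www.bilibili.com/video/BV1zzzzzzzzz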

Code

import re
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
import random
import time

def write_error_log(message):
    with open("video_errorlist.txt", "a") as file:
        file.write(message + "\n")

def is_url(video_id_or_url):
    return video_id_or_url.startswith("http") or video_id_or_url.startswith("https")

def get_video_url(video_id_or_url):
    if is_url(video_id_or_url):
        return video_id_or_url
    else:
        return f"https://www.bilibili.com/video/{video_id_or_url}"

# User-Agent pool used to disguise the crawler
agent = [
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Mobile Safari/537.36'
]
ua = random.choice(agent)


headers = {
    'user-agent': ua,  # UserAgent().chrome
    'Referer': 'https://www.bilibili.com/',
    'x-requested-with': 'XMLHttpRequest'
}

input_file = "idlist.txt" #待爬取链接存在这个文件里
output_file = "output.xlsx"

new_wb = Workbook()
new_ws = new_wb.active
new_ws.append(
["标题", "链接", "up主", "up主id", "精确播放数", "历史累计弹幕数", "点赞数", "投硬币枚数", "收藏人数", "转发人数",
"发布时间", "视频时长(秒)", "视频简介", "作者简介", "标签", "视频aid"])

with open(input_file, "r") as file:
    id_list = file.readlines()

i = 0
for video_id_or_url in id_list:
    i += 1
    url = get_video_url(video_id_or_url.strip())
    try:
        # time.sleep(1)
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")

        # video aid, duration and author id live in window.__INITIAL_STATE__
        initial_state_script = soup.find("script", text=re.compile("window.__INITIAL_STATE__"))
        initial_state_text = initial_state_script.string

        author_id_pattern = re.compile(r'"mid":(\d+)')
        video_aid_pattern = re.compile(r'"aid":(\d+)')
        video_duration_pattern = re.compile(r'"duration":(\d+)')

        author_id = author_id_pattern.search(initial_state_text).group(1)
        video_aid = video_aid_pattern.search(initial_state_text).group(1)
        video_duration_raw = int(video_duration_pattern.search(initial_state_text).group(1))
        video_duration = video_duration_raw - 2

        # title
        title_raw = soup.find("title").text
        title = re.sub(r"_哔哩哔哩_bilibili", "", title_raw).strip()

        # tags
        keywords_content = soup.find("meta", itemprop="keywords")["content"]
        content_without_title = keywords_content.replace(title + ',', '')
        keywords_list = content_without_title.split(',')
        tags = ",".join(keywords_list[:-4])

        meta_description = soup.find("meta", itemprop="description")["content"]
        numbers = re.findall(
            r'[\s\S]*?视频播放量 (\d+)、弹幕量 (\d+)、点赞数 (\d+)、投硬币枚数 (\d+)、收藏人数 (\d+)、转发人数 (\d+)',
            meta_description)

        # author name
        author_search = re.search(r"视频作者\s*([^,]+)", meta_description)
        if author_search:
            author = author_search.group(1).strip()
        else:
            author = "未找到作者"

        # author bio
        author_desc_pattern = re.compile(r'作者简介 (.+?),')
        author_desc_match = author_desc_pattern.search(meta_description)
        if author_desc_match:
            author_desc = author_desc_match.group(1)
        else:
            author_desc = "未找到作者简介"

        # video description
        meta_parts = re.split(r',\s*', meta_description)
        if meta_parts:
            video_desc = meta_parts[0].strip()
        else:
            video_desc = "未找到视频简介"

        if numbers:
            views, danmaku, likes, coins, favorites, shares = [int(n) for n in numbers[0]]
            publish_date = soup.find("meta", itemprop="uploadDate")["content"]
            new_ws.append([title, url, author, author_id, views, danmaku, likes, coins, favorites, shares, publish_date, video_duration, video_desc, author_desc, tags, video_aid])
            print(f"第{i}行视频{url}已完成爬取")
        else:
            print(f"第{i}行视频 {url}未找到相关数据,可能为分集视频")

    except Exception as e:
        write_error_log(f"第{i}行视频发生错误:{e}")
        print(f"第{i}行发生错误,已记录到错误日志:出错数据为{video_id_or_url}")

new_wb.save(output_file)

Zhihu

Features

Crawls the answers under a question, together with each answer's creation time.

Overview

The Zhihu crawler follows the tutorial in the article linked below:
Zhihu crawler
The rough idea is that each batch of answers Zhihu returns carries a pointer to the id of the next batch, much like a linked list. Starting from an initial URL, you keep saving the next ids until every answer id under the question has been recorded, and then you crawl the answer contents from those ids.
Concretely there are two files: crawler.py records the answer_ids, and crawler2.py crawls the content behind each answer_id. Below I describe what to change in the code to crawl what you need.
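Before diving into the files, here is the traversal pattern in isolation; the real URL, Cookie and headers come from crawler.py below, so treat every value here as a placeholder:

import requests

url = "https://www.zhihu.com/api/v4/questions/<question_id>/feeds?offset=0&limit=3"   # placeholder
headers = {"cookie": "<your cookie>", "user-agent": "<your UA>"}                      # placeholders

answer_ids = []
for _ in range(10):                                   # walk a handful of pages
    payload = requests.get(url, headers=headers).json()
    for item in payload["data"]:
        answer_ids.append(item["target"]["id"])       # one answer id per item
    url = payload["paging"]["next"]                   # the API points at the next page, like a linked list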

What you need to change

First, the template in crawler.py.

On the page of the question you want to crawl, find the request whose name starts with feeds?... (the same way you found the Cookie earlier), and replace the template in crawler.py with that request's URL.

Remember that after offset= in the URL you need to put {offset}, imitating the placeholder that was originally written there.
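In other words, the line should end up looking roughly like this (the long include=... part is whatever your own copied URL contains):

template = 'https://www.zhihu.com/api/v4/questions/<your_question_id>/feeds?include=...&offset={offset}&limit=3&order=default&platform=desktop'

# crawler.py then fills the placeholder in, e.g. for the first request:
url0 = template.format(offset=0)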

Then replace the Cookie in the code with your own Zhihu Cookie, pasted between the single quotes.

Finally, swap out the Referer in headers as well (it seems to work without changing it, but I changed it anyway); the Referer field sits just below the Cookie.

headers = {
    'cookie': cookie,
    'user-agent': ua,  # UserAgent().chrome
    'Referer': 'https://www.zhihu.com/question/654859896/answer/3487198664',
    'x-requested-with': 'XMLHttpRequest'
}

After that you can start crawling. As for saving the results: in practice Zhihu's anti-crawling is fairly aggressive; after two hundred-odd pages I would get "connection forcibly closed by the remote host", so it is worth having the code save its progress every so many pages.

# save progress every 100 pages
if page % 100 == 0:
    df = pd.DataFrame({'answer_id': answer_ids})  # rebuild df from the full list
    df.to_csv('answer_id.csv', index=True)

Then you can use crawler2.py to crawl the answers behind those answer_ids. In crawler2.py you need to change the parts marked (with red underlines) in the screenshot: the Cookie and Referer as before, and in the url below them remember to replace the question id with the id of the question you want to crawl, which is plainly visible in the question's web address.
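Concretely, it is this line in crawler2.py, where 654859896 is the question id to swap for your own:

# replace 654859896 with the id of your question (the number visible in the question's address)
url = 'https://www.zhihu.com/question/654859896/answer/{answer_id}'.format(answer_id=answer_id)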

After that, just run it. Again, remember to save every so often; the code writes each batch to a separate file, and I used clean.py to merge and deduplicate them. A quick look at clean.py should make clear what to adjust; basically you only need to change the file paths.
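clean.py is not pasted here, but what it does boils down to roughly the following (the output file name is just an example):

import glob
import pandas as pd

# collect every batch file written by crawler2.py (text_0.csv, text_1.csv, ...)
frames = [pd.read_csv(path) for path in sorted(glob.glob('text_*.csv'))]

merged = pd.concat(frames, ignore_index=True)
merged = merged.drop_duplicates(subset='answer_id')   # keep one row per answer
merged.to_csv('answers_clean.csv', index=False)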

Full code

crawler.py

import requests
import pandas as pd
import time
import random

template = 'https://www.zhihu.com/api/v4/questions/656533798/feeds?include=data%5B*%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Creaction_instruction%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%3Bdata%5B*%5D.author.follower_count%2Cvip_info%2Ckvip_info%2Cbadge%5B*%5D.topics%3Bdata%5B*%5D.settings.table_of_content.enabled&offset={offset}&limit=3&order=default&ws_qiangzhisafe=1&platform=desktop'



df = pd.DataFrame()
# df has three columns: answer_id, content and created_time
df['answer_id'] = []
df['content'] = []
df['created_time'] = []

answer_ids = []

agent = [
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Mobile Safari/537.36'
]
ua = random.choice(agent)

cookies = [
''
]
cookie = random.choice(cookies)
headers = {
    'cookie': cookie,
    'user-agent': ua,  # UserAgent().chrome
    'Referer': 'https://www.zhihu.com/question/654859896/answer/3487198664',
    'x-requested-with': 'XMLHttpRequest'
}


# the first request uses the template; every later page comes from paging['next']
url0 = template.format(offset=0)
resp0 = requests.get(url0, headers=headers)
for data in resp0.json()['data']:
    answer_id = data['target']['id']
    answer_ids.append(answer_id)
next = resp0.json()['paging']['next']

for page in range(1, 5001):  # estimate the range yourself; each page holds a handful of answers
    # fetch page `page`
    headers1 = {
        'cookie': random.choice(cookies),
        'user-agent': random.choice(agent),  # UserAgent().chrome
        'Referer': 'https://www.zhihu.com/question/654859896/answer/3487198664',
        'x-requested-with': 'XMLHttpRequest'
    }
    resp = requests.get(next, headers=headers1)
    print('正在爬取第' + str(page) + '页')

    for data in resp.json()['data']:
        answer_id = data['target']['id']
        # collect the answer_id
        answer_ids.append(answer_id)
    next = resp.json()['paging']['next']
    time.sleep(random.randint(1, 4))
    # save progress every 100 pages
    if page % 100 == 0:
        df = pd.DataFrame({'answer_id': answer_ids})  # rebuild df from the full list
        df.to_csv('answer_id.csv', index=True)

crawler2.py

from bs4 import BeautifulSoup
import pandas as pd
import random
import requests
import time

contents = []
created_times = []
agent = [
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Mobile Safari/537.36'
]
ua = random.choice(agent)



headers = {
    'cookie': '',
    'user-agent': ua,  # UserAgent().chrome
    'Referer': 'https://www.zhihu.com/question/654859896/answer/3487198664',
    'x-requested-with': 'XMLHttpRequest'
}

# read answer_id.csv
df = pd.read_csv('answer_id.csv')

# pull out the answer_id column
answer_ids = df['answer_id'].tolist()
cookies = [
''
]
batch = 0
for answer_id in answer_ids:
    print('正在爬取answer_id为{answer_id}的数据'.format(answer_id=answer_id))
    url = 'https://www.zhihu.com/question/654859896/answer/{answer_id}'.format(answer_id=answer_id)
    try:
        headers1 = {
            'cookie': random.choice(cookies),
            'user-agent': random.choice(agent),  # UserAgent().chrome
            'Referer': 'https://www.zhihu.com/question/654859896/answer/3487198664',
            'x-requested-with': 'XMLHttpRequest'
        }
        resp = requests.get(url, headers=headers1)
        soup = BeautifulSoup(resp.text, 'html.parser')
        # answer body
        content = soup.find('div', class_='RichContent-inner').text
        contents.append(content)
        time_element = soup.find('span', {'data-tooltip': True})
        if time_element:
            created_time = time_element['data-tooltip'].replace('发布于 ', '')
            created_times.append(created_time)
        else:
            created_times.append('-')
    except Exception as e:
        print(f'爬取answer_id为{answer_id}的数据时出现异常:{e}')
        break

    time.sleep(random.randint(1, 2))

    # every 500 answers, save the data collected so far to a new file
    if len(contents) % 500 == 0:
        new_data = {'answer_id': answer_ids[:len(contents)], 'content': contents, 'created_time': created_times[:len(contents)]}
        new_df = pd.DataFrame(new_data)
        new_df.to_csv(f'text_{batch}.csv', index=True)
        batch += 1

Weibo

Features

For a given topic, the crawler collects: topic link, poster ID, post content, poster nickname, poster gender, whether verified, verification type, whether gold-V verified, number of posts, following count, follower count, Weibo level, publish date, publish time, repost count, comment count, and like count.

Overview

The Weibo crawler code comes from the link below. The original also crawls comments, but my assignment did not need them, so I removed that part; if you need comments, have a look at the original.
Weibo crawler

How to use it

The Weibo crawler is the most troublesome. Weibo seems to limit you to at most 50 result pages at a time, so you have to filter by time manually: crawl 50 pages, check the timestamp of the last post, move the end of the time filter to that point, then crawl another 50 pages, adjusting by hand every time. It is tedious.

What you need to change:

The Cookie and Referer need no further explanation by now. The main thing is the baseurl at the very bottom: it is the URL you get from Weibo's advanced search after filtering by time.


That URL is the baseurl; replace it and you can run the script straight away. One thing to note: the code contains a statement that writes the CSV header, and since (as mentioned) you will be crawling in several runs, only the first run should write the header; in later runs comment that statement out.
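As a sketch of how the repeated runs fit together: only the timescope part of baseurl changes between batches, and instead of commenting the header line in and out by hand you could write the header only when the CSV does not exist yet (the dates and path below are just the ones from my example):

import csv
import os

# tighten end_day to the timestamp of the last post from the previous 50-page run,
# then run the script again with the new window
start_day = '2015-01-01'
end_day = '2016-05-12-20'
baseurl = ('https://s.weibo.com/weibo?q=%23%E9%AB%98%E8%80%83%23&xsort=hot&suball=1'
           f'&timescope=custom%3A{start_day}%3A{end_day}&Refer=g')

path1 = './weibo_content_gaokao.csv'
write_header = not os.path.exists(path1)          # only true on the very first run

csvfile1 = open(path1, 'a', newline='', encoding='utf-8-sig')
writer_1 = csv.writer(csvfile1)
if write_header:
    writer_1.writerow(('话题链接', '楼主ID', '话题内容', '...'))   # abbreviated header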

Full code

# -*- coding: utf-8 -*-

'''
Weibo crawler: for one topic, crawl the post contents, poster information and other details
'''

import requests, random, re
import time
import os
import csv
import sys
import json
import importlib
# from fake_useragent import UserAgent
from lxml import etree
import datetime
import pandas as pd
from selenium import webdriver
import urllib.request

# record the start time
importlib.reload(sys)
startTime = time.time()
'''
Path of the output file
'''
path1 = './weibo_content_gaokao.csv'  # stores the post information (comments excluded)
csvfile1 = open(path1, 'a', newline='', encoding='utf-8-sig')  # 'a' appends, 'w' overwrites
writer_1 = csv.writer(csvfile1)

# CSV header (write it only on the first run)
writer_1.writerow(('话题链接', '楼主ID', '话题内容', '楼主昵称', '楼主性别', '是否认证', '认证类型',
                   '是否认证金v', '发博数量', '关注人数', '粉丝数', '微博等级', '发布日期',
                   '发布时间', '转发量', '评论量', '点赞量'))  # post information (comments excluded)

# -------------------------------------------- request headers --------------------------------------------
ip_list = [
{'http': 'http://118.193.47.193:8118'},  # Changsha, Hunan
{'http': 'http://58.20.234.243:9091'},  # Xiangtan, Hunan
{'http': 'http://58.20.235.180:9091'},  # Xiangtan, Hunan
{"http": "http://112.115.57.20:3128"},
{'http': 'http://121.41.171.223:3128'},
{"http": "http://124.88.67.54:80"},
{"http": "http://61.135.217.7:80"},
{"http": "http://42.231.165.132:8118"},
{"http": "http://10.10.1.10:3128"},
{"https": "http://10.10.1.10:1080"}
]
ip = random.choice(ip_list)

# User-Agent pool used to disguise the crawler
agent = [
'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Mobile Safari/537.36'
]
ua = random.choice(agent)

cookie_list = [
# account registered with a phone number
{'cookie':''}
]

cookie = random.choice(cookie_list)['cookie']

headers = {
    'cookie': cookie,
    'user-agent': ua,  # UserAgent().chrome
    'Referer': 'https://m.weibo.cn/search?containerid=100103type%3D1%26q%3D%E6%98%A5%E8%8A%82',
    'x-requested-with': 'XMLHttpRequest'
}

# ----------------------------------- collect the id of every post on the topic's search pages -----------------------------------
'''
Find the post ids and store them in a list; they are used to build each post's detail-page URL
'''
comments_ID = []
baseurl = 'https://s.weibo.com/weibo?q=%23%E9%AB%98%E8%80%83%23&xsort=hot&suball=1&timescope=custom%3A2015-01-01%3A2016-05-12-20&Refer=g'
def get_title_id():
    for page in range(1, 51):  # each result page holds roughly 9 posts
        headers = {
            'cookie': cookie,
            'user-agent': ua,
            'Referer': f'{baseurl}&page={page}',
            'x-requested-with': 'XMLHttpRequest'
        }
        time.sleep(1)
        api_url = f"{baseurl}&page={page}"

        rep1 = requests.get(url=api_url, headers=headers)
        try:
            rep = rep1.text  # extract the mid values and collect them in comments_ID
            comment_ID = re.findall(r'(?<=mid=")\d{16}', rep)
            comments_ID.extend(comment_ID)
            print(page, "页id获取成功!", comment_ID)
        except:
            print(page, "页id获取有误!")


# ----------------------------------- crawl the detail page of every post under the topic -----------------------------------

'''
Details of each post under the topic (post text, poster id, nickname, gender,
publish date and time, repost / comment / like counts),
all extracted with regular expressions
'''
is_continue='y'
start_date = pd.to_datetime('2015/01/01')
end_date = pd.to_datetime('2024/12/31')
def spider_title(comment_ID):
    global is_continue
    article_url = 'https://m.weibo.cn/detail/' + comment_ID
    print("article_url = ", article_url)
    time.sleep(1)

    try:
        html_text = requests.get(url=article_url, headers=headers).text
        # publish date
        created_title_time = re.findall('.*?"created_at": "(.*?)".*?', html_text)[0].split(' ')
        # print(created_title_time)
        # map the month abbreviation to a number
        if 'Jan' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '01', created_title_time[2])
        elif 'Feb' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '02', created_title_time[2])
        elif 'Mar' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '03', created_title_time[2])
        elif 'Apr' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '04', created_title_time[2])
        elif 'May' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '05', created_title_time[2])
        elif 'Jun' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '06', created_title_time[2])
        elif 'Jul' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '07', created_title_time[2])
        elif 'Aug' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '08', created_title_time[2])
        elif 'Sep' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '09', created_title_time[2])
        elif 'Oct' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '10', created_title_time[2])
        elif 'Nov' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '11', created_title_time[2])
        elif 'Dec' in created_title_time:
            title_created_YMD = "{}/{}/{}".format(created_title_time[-1], '12', created_title_time[2])
        # print("title_created_YMD = ", title_created_YMD)

        print('发布日期:', title_created_YMD)
        time2 = pd.to_datetime(title_created_YMD)

        if start_date <= time2 <= end_date:
            # post text
            find_title = re.findall('.*?"text": "(.*?)",.*?', html_text)[0]
            title_text = re.sub('<(S*?)[^>]*>.*?|<.*? />', '', find_title)  # strip the html tags

            # poster ID
            title_user_id = re.findall('.*?"id": (.*?),.*?', html_text)[1]

            # poster nickname
            title_user_NicName = re.findall('.*?"screen_name": "(.*?)",.*?', html_text)[0]

            # poster gender
            title_user_gender = re.findall('.*?"gender": "(.*?)",.*?', html_text)[0]

            verified = re.findall('.*?"verified": (.*?),.*?', html_text)[0]  # whether the poster is verified
            if verified == 'true':
                verified_type_ext = re.findall('.*?"verified_type_ext": (.*?),.*?', html_text)[0]  # whether gold-V verified
            else:
                verified_type_ext = 0
            # print(verified_type_ext)
            content_num = re.findall('.*?"statuses_count": (.*?),.*?', html_text)[0]  # number of posts
            verified_type = re.findall('.*?"verified_type": (.*?),.*?', html_text)[0]  # verification type
            urank = re.findall('.*?"urank": (.*?),.*?', html_text)[0]  # Weibo level
            guanzhu = re.findall('.*?"follow_count": (.*?),.*?', html_text)[0]  # following count
            fensi = eval(re.findall('.*?"followers_count": (.*?),.*?', html_text)[0])  # follower count

            # publish time
            add_title_time = created_title_time[3]
            print("add_title_time = ", add_title_time)
            # A repost carries both the original post's repost/comment/like counts and its own;
            # in that case the second match is the one we want
            if len(re.findall('.*?"reposts_count": (.*?),.*?', html_text)) > 1:
                # reposts
                reposts_count = re.findall('.*?"reposts_count": (.*?),.*?', html_text)[1]
                # comments
                comments_count = re.findall('.*?"comments_count": (.*?),.*?', html_text)[1]
                print("comments_count = ", comments_count)
                # likes
                attitudes_count = re.findall('.*?"attitudes_count": (.*?),.*?', html_text)[1]
                # each ajax call loads 20 comments
                comment_count = int(int(comments_count) / 20)
            else:
                # reposts
                reposts_count = re.findall('.*?"reposts_count": (.*?),.*?', html_text)[0]
                # print("reposts_count = ", reposts_count)

                # comments
                comments_count = re.findall('.*?"comments_count": (.*?),.*?', html_text)[0]
                print("comments_count = ", comments_count)

                # likes
                attitudes_count = re.findall('.*?"attitudes_count": (.*?),.*?', html_text)[0]
                # print("attitudes_count = ", attitudes_count)

                # each ajax call loads 20 comments
                comment_count = int(int(comments_count) / 20)

            # one record per post
            position11 = (article_url, title_user_id, title_text, title_user_NicName, title_user_gender, verified, verified_type,
                          verified_type_ext, content_num, guanzhu, fensi, urank, title_created_YMD, add_title_time, reposts_count, comments_count, attitudes_count)

            # write the row
            writer_1.writerow(position11)
            print('写入博文信息数据成功!')
            return comment_count, title_user_id, title_created_YMD, title_text

        else:
            is_continue = input('日期超出范围,是否继续爬取博文信息?(y/n, 默认: y) ——> ')  # ask whether to keep crawling
            if is_continue == 'y' or is_continue == 'yes' or not is_continue:
                pass
            else:
                print('日期超出范围,停止爬取博文信息!')
                # report the elapsed time
                endTime = time.time()
                useTime = (endTime - startTime) / 60
                print("该次所获的信息一共使用%s分钟" % useTime)
                sys.exit(0)
            return is_continue
    except:
        print('博文网页解析错误,或微博不存在或暂无查看权限!')
        pass



# ------------------------------------------------- main -------------------------------------------------
def main():
    count_title = len(comments_ID)
    for count, comment_ID in enumerate(comments_ID):
        print("正在爬取第%s条微博,一共找到个%s条微博需要爬取" % (count + 1, count_title))

        try:
            maxPage, title_user_id, title_created_YMD, title_text = spider_title(comment_ID)
        except:
            if is_continue == 'y' or is_continue == 'yes' or not is_continue:
                print("--------------------------分隔符---------------------------")
                pass
            else:
                sys.exit(0)
        print("--------------------------分隔符---------------------------")
    csvfile1.close()

if __name__ == '__main__':
    # collect the post ids
    get_title_id()
    # crawl each post
    main()
    # report the elapsed time
    endTime = time.time()
    useTime = (endTime - startTime) / 60
    print("该次所获的信息一共使用%s分钟" % useTime)
    # print('错误页面:', error_page_list)