Scraping People's Daily Online (人民网) News Comments with Python and Building a Word Cloud

comments

Target URL

紧紧抓住大有可为的历史机遇期 ("Firmly seize the period of historic opportunity with great promise")

Full source code

import requests
import re
import wordcloud
import jieba

content = []  # all comments collected across pages
for i in range(1, 5):  # the comments span pages 1 to 4
    url = 'http://bbs1.people.com.cn/post/129/0/0/166006014_' + str(i) + '.html#replyList'
    headers = {
        'Host': 'bbs1.people.com.cn',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.7 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    text = response.text
    # capture the text of each reply link; the pattern depends on the page's exact whitespace
    txts = re.findall(r'<a class="treeReply" target="_blank".*?>\r\n\t\t\t\t\t\t(.*?)\t\t\t\t\t</a>', text, re.S)
    content.extend(txts)

txt = ','.join(content)
# width, font and height; msyh.ttc (Microsoft YaHei) supplies the Chinese glyphs
w = wordcloud.WordCloud(width=1500, font_path="msyh.ttc", height=1050)
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("pywcloud.png")

Code walkthrough

  • Import the open-source libraries we need.

    import requests
    import re
    import wordcloud
    import jieba

  • Create a list, content, to store all the comments.

    content = []

Inspecting the site shows that the comments span 4 pages, so only the page number in the URL needs to change:

for i in range(1, 5):
    url = 'http://bbs1.people.com.cn/post/129/0/0/166006014_' + str(i) + '.html#replyList'
    headers = {
        'Host': 'bbs1.people.com.cn',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.7 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    text = response.text
    # capture the text of each reply link; the pattern depends on the page's exact whitespace
    txts = re.findall(r'<a class="treeReply" target="_blank".*?>\r\n\t\t\t\t\t\t(.*?)\t\t\t\t\t</a>', text, re.S)
    content.extend(txts)
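
The extraction step can be tried without touching the network. The following is a minimal sketch that reuses the article's URL scheme and regular expression; the helper names page_urls and extract_comments, and the sample HTML fragment, are assumptions made for illustration and are not taken from the real page.

```python
import re

# URL scheme from the article; only the page number changes.
BASE = 'http://bbs1.people.com.cn/post/129/0/0/166006014_{}.html'

# Same pattern as in the article. It relies on the exact tabs and
# line breaks around each reply, so it breaks if the markup changes.
REPLY_RE = re.compile(
    r'<a class="treeReply" target="_blank".*?>\r\n\t\t\t\t\t\t(.*?)\t\t\t\t\t</a>',
    re.S,
)


def page_urls(first=1, last=4):
    """Build the URL of every comment page (hypothetical helper)."""
    return [BASE.format(i) for i in range(first, last + 1)]


def extract_comments(html):
    """Return the text of each reply found in one page of HTML."""
    return [m.strip() for m in REPLY_RE.findall(html)]


# Made-up fragment that imitates the whitespace the pattern expects.
sample = (
    '<a class="treeReply" target="_blank" href="/r/1">\r\n\t\t\t\t\t\t'
    '说得好\t\t\t\t\t</a>'
    '<p>other markup</p>'
    '<a class="treeReply" target="_blank" href="/r/2">\r\n\t\t\t\t\t\t'
    '支持\t\t\t\t\t</a>'
)

comments = extract_comments(sample)
print(page_urls())
print(comments)
```

Keeping the parsing in its own function means the pattern can be checked against a saved copy of a page before running the full loop against the site.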

Join the entries of content with commas to get a single string, txt, which is convenient for building the word cloud.

txt = ','.join(content)

Set the font and the image width and height, segment txt into words with jieba's lcut() function, and write the image to pywcloud.png.

# width, font and height; msyh.ttc (Microsoft YaHei) supplies the Chinese glyphs
w = wordcloud.WordCloud(width=1500, font_path="msyh.ttc", height=1050)
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("pywcloud.png")
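
jieba.lcut() returns every token, including punctuation and single-character function words, and these can crowd out the interesting terms in the image. Below is a minimal sketch of one way to filter the tokens first. The clean_tokens helper, the tiny STOPWORDS set, and the sample token list are all assumptions for illustration; the original code passes the raw tokens straight through.

```python
from collections import Counter

# Hypothetical stopword set; a real one would be much longer.
STOPWORDS = {'我们', '这个', '一个'}


def clean_tokens(tokens, stopwords=STOPWORDS, min_len=2):
    """Drop short tokens, stopwords, and pure punctuation."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if len(tok) < min_len:
            continue  # single characters such as '的' or ','
        if tok in stopwords:
            continue
        if not any(ch.isalnum() for ch in tok):
            continue  # tokens made only of punctuation, e.g. '!!'
        kept.append(tok)
    return kept


# Stand-in for the output of jieba.lcut(txt).
tokens = ['我们', '抓住', '机遇', ',', '的', '机遇', '!!']
kept = clean_tokens(tokens)
freqs = Counter(kept)
print(kept)
print(freqs.most_common())
```

With jieba and wordcloud installed, the filtered list could replace the raw one, for example w.generate(" ".join(clean_tokens(jieba.lcut(txt)))), or the counts could be passed to w.generate_from_frequencies(freqs).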

Word cloud

[Image: word cloud]

Finished!
