Target URL
Simple source code
import requests
Code and brief walkthrough
Import the open-source libraries to be used.
import requests
import re
import wordcloud
import jieba
Create a new list, content, to store all the comments.
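content = []
Looking at the site shows that the comments span 4 pages in total, so only a small change to the URL is needed to reach each page: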
for i in range(1, 5):
    # Pages 1 to 4 differ only in the number just before ".html"
    url = 'http://bbs1.people.com.cn/post/129/0/0/166006014_' + str(i) + '.html#replyList'
    headers = {
        'Host': 'bbs1.people.com.cn',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.7 Safari/537.36'
    }
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    text = response.text
    # Capture the text inside each reply link; re.S lets "." match newlines too
    txts = re.findall(r'<a class="treeReply" target="_blank".*?>\r\n\t\t\t\t\t\t(.*?)\t\t\t\t\t</a>', text, re.S)
    # Add this page's comments to the overall list
    content.extend(txts)
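The URL pattern and the regular expression can be tried offline before sending any requests. The following is a minimal sketch, assuming the reply markup looks like the hand-written sample_html below (an illustration, not a copy of the real page). The helper names build_page_url and extract_comments are hypothetical and not part of the original script.

```python
import re

BASE = 'http://bbs1.people.com.cn/post/129/0/0/166006014_'

# Same pattern as in the loop above, compiled once for reuse.
REPLY_PATTERN = re.compile(
    r'<a class="treeReply" target="_blank".*?>\r\n\t\t\t\t\t\t(.*?)\t\t\t\t\t</a>',
    re.S,
)


def build_page_url(page):
    """Return the URL of one comment page (hypothetical helper)."""
    return BASE + str(page) + '.html#replyList'


def extract_comments(html):
    """Return the text of every reply link found in html (hypothetical helper)."""
    return REPLY_PATTERN.findall(html)


# Hand-written sample that imitates the whitespace the pattern expects.
sample_html = (
    '<a class="treeReply" target="_blank" href="#">'
    '\r\n\t\t\t\t\t\tfirst comment\t\t\t\t\t</a>'
    '<a class="treeReply" target="_blank" href="#">'
    '\r\n\t\t\t\t\t\tsecond comment\t\t\t\t\t</a>'
)

urls = [build_page_url(i) for i in range(1, 5)]
comments = extract_comments(sample_html)
print(urls[0])
print(comments)
```

Because the pattern depends on an exact run of tabs, it is likely to stop matching if the site changes its markup, so checking it against a saved page first is a cheap safeguard.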
Join the comments in the content list with commas to get txt, which makes it easier to build the word cloud.
txt = ','.join(content)
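Scraped comments can be empty or padded with whitespace, which would leave stray commas in the joined string. The following is a minimal sketch of cleaning the list before joining, assuming the hypothetical sample data below. The cleaning step is an addition and not part of the original script.

```python
# Hypothetical sample data standing in for the scraped comments.
content = ['  good post ', '', '\t', 'agree']

# Strip surrounding whitespace and drop entries that end up empty.
cleaned = [c.strip() for c in content if c.strip()]

txt = ','.join(cleaned)
print(txt)
```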
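Set the font and the image width and height, use the lcut() function from the jieba library to segment txt into words, and write the image to pywcloud.png.
w = wordcloud.WordCloud(width=1500, font_path="msyh.ttc", height=1050)  # width, font, and height
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("pywcloud.png")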
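A word cloud draws more frequent words larger, so counting token frequencies by hand gives a quick preview of which words would dominate the image. The following is a minimal sketch, assuming the hand-written tokens list stands in for the output of jieba.lcut(txt) (the real segmentation depends on jieba's dictionary). Dropping tokens shorter than two characters is an added assumption, not something the original script does.

```python
from collections import Counter

# Hypothetical token list standing in for jieba.lcut(txt).
tokens = ['词云', ',', '很', '好看', ',', '词云', '的', '制作', '词云']

# Keep only tokens of at least two characters, which removes the
# punctuation and single-character particles in this sample.
words = [t for t in tokens if len(t.strip()) >= 2]

freq = Counter(words)
top = freq.most_common(1)
print(top)
```

A mapping like freq could also be handed to WordCloud.generate_from_frequencies instead of calling generate on joined text, which gives direct control over what is counted.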
Word cloud image
Finished!