Python BeautifulSoup second-level tags under a tag; Python tag cloud

Programming Languages 2024-05-09 22:45:02

Python: 3.7
Modules: wordcloud 1.6.0, matplotlib 3.1.2

Installing wordcloud

It should now be possible to install wordcloud directly with

pip install wordcloud

If that does not work, you can download a wheel (.whl) file from here

[Image 1]

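When choosing a file, cp37 means the Python version is 3.7, win_amd64 means 64-bit Windows, and win32 means 32-bit Windows. Then install wordcloud by running

pip install <full path to the downloaded wordcloud file>

For example:

pip install C:\Users\Administrator\Desktop\wordcloud-1.6.0-cp37-cp37m-win_amd64.whl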
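If you are not sure which file matches your setup, you can ask the interpreter itself. The following is a minimal sketch that is not part of the original post: it assumes CPython (hence the cp prefix), and wheel_tags is a hypothetical helper name chosen for illustration.

```python
import struct
import sys


def wheel_tags():
    """Return (python_tag, bits) for the running interpreter.

    Assumes CPython, whose wheels use the "cp" prefix (e.g. cp37).
    """
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # The size of a pointer tells us whether this is a 32-bit or 64-bit build.
    bits = struct.calcsize("P") * 8
    return python_tag, bits


if __name__ == "__main__":
    tag, bits = wheel_tags()
    print(tag, "{}-bit".format(bits))
```

On a 64-bit Python 3.7 install this should point you to a cp37 / win_amd64 file; a 32-bit build would need the win32 file instead.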

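A minimal example

# Import the wordcloud and matplotlib modules
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read a text file (UTF-8 is assumed here; adjust to match your file)
with open('articles.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Create a WordCloud object and call generate(), passing the source text
wordcloud = WordCloud().generate(text)

# Display the word cloud image
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# to_file() saves the word cloud to an image file
wordcloud.to_file('test.jpg')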

The result:

[Image 2: the generated word cloud]

Custom word list and word frequencies

Define the words and their frequencies as a dictionary:

text_dict = {
    'you': 2993,
    'and': 6625,
    'in': 2767,
    'was': 2525,
    'the': 7845,
}

Then pass text_dict to generate_from_frequencies(), the method that builds a word cloud from frequencies:

WordCloud().generate_from_frequencies(text_dict)

The complete code:

# Import the wordcloud and matplotlib modules
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Custom word-frequency dictionary
text_dict = {
    'you': 2993,
    'and': 6625,
    'in': 2767,
    'was': 2525,
    'the': 7845,
}

# Pass text_dict in to generate the word cloud
wordcloud = WordCloud().generate_from_frequencies(text_dict)

# Display the word cloud image
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# Save the image
wordcloud.to_file('test.jpg')
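In practice you rarely type the frequency dictionary by hand. Here is a rough sketch of building one from raw English text using only the standard library. It is not from the original post: word_frequencies, the STOPWORDS set and the sample sentence are all made up for illustration, and the regular expression is a deliberately crude tokenizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; extend it as needed.
STOPWORDS = {"the", "and", "in", "was", "you"}


def word_frequencies(text, stopwords=STOPWORDS, top_n=100):
    """Count lowercase words in text, skipping stop words.

    Returns a dict of the top_n most common words and their counts.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return dict(counts.most_common(top_n))


sample = "The cat and the dog. The dog was in the garden; the cat was not."
freqs = word_frequencies(sample)
print(freqs)
```

The resulting dictionary has the same shape as text_dict above, so it can be passed to WordCloud().generate_from_frequencies(freqs) in the same way.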

Chinese word clouds

1. Fonts

For Chinese text you have to specify a font yourself, because the wordcloud module does not bundle a Chinese font. You can use a system font; on Windows the fonts are located under C:\Windows\Fonts

[Image 3: the C:\Windows\Fonts folder]

Right-click a font to see its name

[Image 4: viewing a font's name]

When creating the WordCloud, specify the font location with the font_path = 'C:/Windows/Fonts/STKAITI.TTF' parameter. You can also download a font yourself and point the same parameter at its location.
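A wrong font path is an easy mistake to make here, so it can be worth checking the path before building the word cloud. The sketch below is an addition, not part of the original post: first_existing_font is a hypothetical helper, and every path in CANDIDATE_FONTS except the STKAITI.TTF one is an assumption that may not exist on your machine.

```python
import os

# Candidate font files, tried in order. Only the first path comes from
# this article; the others are assumptions and may not exist on your system.
CANDIDATE_FONTS = [
    "C:/Windows/Fonts/STKAITI.TTF",
    "C:/Windows/Fonts/simhei.ttf",  # assumption
    "C:/Windows/Fonts/msyh.ttc",    # assumption
]


def first_existing_font(candidates):
    """Return the first path in candidates that is an existing file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


font_path = first_existing_font(CANDIDATE_FONTS)
if font_path is None:
    print("No candidate font found; download a Chinese font and add its path.")
else:
    print("Using font:", font_path)
```

The value found this way can then be passed as WordCloud(font_path=font_path).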

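2. Word segmentation

English text is already segmented: the words are separated by spaces. Chinese sentences contain no spaces, so the words have to be split out some other way. jieba is a module for Chinese word segmentation.

import jieba

text = open('4test.txt', encoding='utf-8').read()
list_text = list(jieba.cut(text))

This splits the article into its individual words and collects them in a list.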

A complete program

import matplotlib.pyplot as plt
from wordcloud import WordCloud
import jieba
from collections import Counter

# Read the source text file
with open('4test.txt', encoding='utf-8', errors='ignore') as f:
    text = f.read()

# Segment the text with jieba
text_jieba = list(jieba.cut(text))

# Count frequencies with Counter and keep the 100 most common tokens
c = Counter(text_jieba)
common_c = c.most_common(100)

# Configure the word cloud
wc = WordCloud(
    # Font used to render Chinese characters
    font_path='C:/Windows/Fonts/STKAITI.TTF'
)
# Generate the word cloud
wc.generate_from_frequencies(dict(common_c))
# Render and display the image
plt.figure()
plt.imshow(wc)
plt.axis('off')
plt.show()
# Save the image
wc.to_file('test.jpg')

The result:

[Image 5: the generated Chinese word cloud]

As you can see, the punctuation marks were split out as tokens too, so they are counted and drawn just like words.
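One way to deal with this is to drop tokens that contain no letters or digits before counting. The following is a minimal sketch of that idea, not the original author's code: is_word and count_words are hypothetical helper names, and the token list is a hand-written stand-in for the output of jieba.cut.

```python
import unicodedata
from collections import Counter


def is_word(token):
    """True if the token contains at least one letter or digit.

    Unicode categories starting with "L" (letters, including Chinese
    characters) or "N" (numbers) count; punctuation and whitespace do not.
    """
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in token)


def count_words(tokens, top_n=100):
    """Count only real words and return the top_n as a dict."""
    counts = Counter(t for t in tokens if is_word(t))
    return dict(counts.most_common(top_n))


# Hand-written stand-in for list(jieba.cut(text))
tokens = ["今天", "，", "天气", "很", "好", "。", "\n", "今天", " ", "！"]
freqs = count_words(tokens)
print(freqs)
```

In the complete program above, this would mean calling wc.generate_from_frequencies(count_words(text_jieba)) instead of counting every token.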
