Word Cloud with jieba Word Segmentation

Published: 2024/9/27 34 豆豆

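This post presents the code for building a word cloud from Chinese text segmented with jieba. The details are as follows: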

    # -*- coding: utf-8 -*-
    import re

    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    combine_dict = {}  # maps each synonym to its canonical word
    stopwords = []     # filled in under __main__


    def stopwordslist(stopWord):
        """Load the stop words to filter out, one per line."""
        with open(stopWord, encoding='UTF-8') as f:
            return [line.strip() for line in f]


    def synonymwordslist(synonymWord):
        """Load the synonym dictionary; each line is tab-separated, canonical word first."""
        with open(synonymWord, "r", encoding='UTF-8') as f:
            for line in f:
                seperate_word = line.strip().split("\t")
                for i in range(1, len(seperate_word)):
                    combine_dict[seperate_word[i]] = seperate_word[0]


    # refer https://blog.csdn.net/jlulxg/article/details/84650683
    # https://www.cnblogs.com/crawer-1/p/8341762.html
    # http://lzw.me/pages/unicode/
    def cleanChinese():
        """Demo: keep only the Chinese characters of a sample string."""
        s = r"\n\r\t@#$%^&*这样一本书大卖,hello,,12。!《。有点意外,据说已经印了四五十万,排行榜仅次于《希拉里自传》。大概是大众抛弃了一位表演过火的“文化大师”后,。\n\s\r\t"
        # Variant that also keeps common Chinese punctuation:
        # t = re.findall('[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]', s)
        t = re.findall('[\u4e00-\u9fa5]', s)  # keep Chinese characters only
        print(''.join(t))


    def wordClould(inputText, splitText, outPic):
        """Segment the input text, filter stop words, and render the word cloud."""

        def replace_all_blank(value):
            """Remove ASCII letters, digits, punctuation, whitespace, and control characters."""
            result = re.sub('[a-zA-Z0-9’!"#$%&\'()()。;,:“”()、?《》*+,-./:;<=>?@,。?★、…【】《》?“”‘’![\\]^_`{|}~\\s]+', "", value)
            result = re.sub('[\x01-\x1a]+', '', result)
            return result

        def seg_depart(sentence):
            sentence_depart = jieba.cut(sentence)
            outstr = ''
            for word in sentence_depart:
                if word not in stopwords:
                    if word in combine_dict:  # replace a synonym with its canonical word
                        word = combine_dict[word]
                    outstr += replace_all_blank(word)
                    outstr += " "
            return outstr

        # Concatenate the segmented lines into one text and save it.
        cut_text = ''
        with open(inputText, 'r', encoding='UTF-8') as fRead:
            for line in fRead:
                cut_text = cut_text + seg_depart(line)
        with open(splitText, 'w', encoding='UTF-8') as fWrite:
            fWrite.write(cut_text)

        wordcloud = WordCloud(
            # Set a font that covers Chinese; otherwise the words render as empty boxes.
            # This is a typical Windows font path and can be replaced with another font.
            font_path="C:/Windows/Fonts/彩虹粗仿宋.TTF",
            background_color="white",
            width=2000,
            height=1760,
            max_words=2000,
        ).generate(cut_text)
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis("off")
        # plt.show()
        wordcloud.to_file(outPic)


    if __name__ == '__main__':
        # cleanChinese()
        jieba.load_userdict('../input/nlp/userDic.txt')
        synonymwordslist(r'..\input\nlp\synonymWord.txt')
        stopwords = stopwordslist(r'../input/nlp/stopWords.txt')
        wordClould(r'D:\bidingDemo.txt', r'D:\splitSingle.txt', r'D:\bidingDemo.png')
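The filtering step inside seg_depart can be tried on its own, without jieba or wordcloud installed. The sketch below is only an illustration: the token list, the stop words, and the synonym pair are made-up sample data, filter_tokens and keep_chinese are hypothetical helper names, and it uses the simpler Chinese-only filter from cleanChinese rather than the longer pattern in replace_all_blank. Unlike the script, it also skips tokens that become empty after cleaning.

```python
import re

# Hypothetical sample data; the script above loads these from files instead.
stopwords = {"的", "了"}
combine_dict = {"电脑": "计算机"}  # synonym -> canonical word


def keep_chinese(word):
    """Keep only characters in the range U+4E00 to U+9FA5, as cleanChinese does."""
    return "".join(re.findall("[\u4e00-\u9fa5]", word))


def filter_tokens(tokens):
    """Drop stop words, map synonyms to their canonical word, strip non-Chinese characters."""
    kept = []
    for word in tokens:
        if word in stopwords:
            continue
        word = combine_dict.get(word, word)
        word = keep_chinese(word)
        if word:  # skip tokens that are empty after cleaning
            kept.append(word)
    return " ".join(kept)


# Stand-in for the output of jieba.cut on one sentence.
tokens = ["我", "的", "电脑", "坏", "了", "hello", "!"]
print(filter_tokens(tokens))  # 我 计算机 坏
```

The resulting space-separated string has the same shape as the cut_text that the script passes to WordCloud.generate.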

The required files and a screenshot of the result are shown below:
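As a rough guide to the input files, the formats below are inferred from how the script reads them: stopWords.txt holds one stop word per line, and synonymWord.txt holds tab-separated words with the canonical word first. The user dictionary passed to jieba.load_userdict takes one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. The sketch writes small sample files to a temporary directory and parses them; the file contents and the helper names load_stopwords and load_synonyms are illustrative assumptions, not the author's actual data.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stop word per line, skipping blank lines."""
    with open(path, encoding="UTF-8") as f:
        return [line.strip() for line in f if line.strip()]


def load_synonyms(path):
    """Read tab-separated synonym lines; the first word on each line is canonical."""
    mapping = {}
    with open(path, encoding="UTF-8") as f:
        for line in f:
            words = line.strip().split("\t")
            for variant in words[1:]:
                mapping[variant] = words[0]
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    stop_path = os.path.join(tmp, "stopWords.txt")
    syn_path = os.path.join(tmp, "synonymWord.txt")

    # Made-up sample contents for illustration only.
    with open(stop_path, "w", encoding="UTF-8") as f:
        f.write("的\n了\n")
    with open(syn_path, "w", encoding="UTF-8") as f:
        f.write("计算机\t电脑\t微机\n")

    sample_stopwords = load_stopwords(stop_path)
    sample_synonyms = load_synonyms(syn_path)

print(sample_stopwords)  # ['的', '了']
print(sample_synonyms)   # {'电脑': '计算机', '微机': '计算机'}
```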
