Topic Modeling in R with the tidytext and textmineR Packages (Latent Dirichlet Allocation)

In this article, we will learn to build topic models using the tidytext and textmineR packages with the Latent Dirichlet Allocation (LDA) algorithm.

Natural Language Processing covers a wide area of knowledge and implementation; one part of it is the topic model. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. For example, “dog”, “bone”, and “obedient” will appear more often in documents about dogs, while “cute”, “evil”, and “home owner” will appear in documents about cats. The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.

Background

What is topic modeling? Topic modeling is how the machine collects groups of words within a document to build “topics” that contain groups of words with similar dependencies. With topic model methods we can organize, understand, and summarize large collections of textual information. It helps in:

  • Discovering hidden topical patterns that are present across the collection
  • Annotating documents according to these topics
  • Using these annotations to organize, search and summarize texts

In a business setting, topic modeling’s power to discover hidden topics can help an organization better understand its customer feedback, so it can concentrate on the issues customers are facing. It can also summarize text from company meetings; a high-quality meeting document enables users to recall the meeting content efficiently. Topic tracking and detection can also be used to build a recommender system.

There are many techniques used to obtain topic models, namely Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Correlated Topic Models (CTM), and TextRank. In this study, we will focus on implementing the LDA algorithm to build topic models with the tidytext and textmineR packages. Beyond building the models, we will also evaluate their goodness of fit using metrics like R-squared and log-likelihood, and measure the quality of the topics with metrics like coherence and prevalence.

Load these libraries on your working machine:

# data wrangling
library(dplyr)
library(tidyr)
library(lubridate)
# visualization
library(ggplot2)
# dealing with text
library(textclean)
library(tm)
library(SnowballC)
library(stringr)
# topic model
library(tidytext)
library(topicmodels)
library(textmineR)

Topic Model

From the introduction above we know that there are several ways to build a topic model. In this study, we will use the LDA algorithm. LDA is a mathematical model that is used to find the mixture of words belonging to each topic and to determine the mixture of topics that describes each document. LDA answers the following principles of topic modeling:

  • Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.” This can also be symbolized as θ (theta)

  • Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally. This can also be symbolized as φ (phi). (A toy sketch of θ and φ follows this list.)
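To make θ and φ concrete, here is a toy sketch in R; the documents, topics, words, and numbers below are invented for illustration and are not fitted values from any model:

# theta: per-document topic proportions (documents x topics), rows sum to 1
theta <- matrix(c(0.9, 0.1,   # Document 1: 90% topic A, 10% topic B
                  0.3, 0.7),  # Document 2: 30% topic A, 70% topic B
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc_1", "doc_2"), c("topic_A", "topic_B")))
# phi: per-topic word probabilities (topics x vocabulary), rows sum to 1
phi <- matrix(c(0.40, 0.35, 0.15, 0.10,   # "politics" leans on political words
                0.05, 0.05, 0.45, 0.45),  # "entertainment" leans on media words
              nrow = 2, byrow = TRUE,
              dimnames = list(c("politics", "entertainment"),
                              c("president", "congress", "movies", "actor")))
rowSums(theta); rowSums(phi)  # each row is a probability distribution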

We will use two packages: tidytext (together with the topicmodels package) and textmineR. The tidytext workflow builds a topic model easily and provides a method for extracting the per-topic-per-word probabilities, called β (“beta”), from the model. But it doesn’t provide metrics for assessing the goodness of the model the way textmineR does.

Latent Dirichlet Allocation (LDA)

LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. Plate notation (below) is a concise way of visually representing the dependencies among the model parameters.

[Figure: LDA plate notation]
  • The plate M denotes the number of documents
  • N is the number of words in a given document
  • α is the parameter of the Dirichlet prior on the per-document topic distributions. High α indicates that each document is likely to contain a mixture of most of the topics (not just one or two); low α indicates each document will likely contain just a few topics
  • β is the parameter of the Dirichlet prior on the per-topic word distribution. High β indicates that each topic will contain a mixture of most of the words; low β indicates the topic has a low mixture of words
  • θm is the topic distribution for document m
  • zmn is the topic for the n-th word in document m
  • wmn is the specific word

LDA is a generative process. LDA assumes that new documents are created in the following way (a toy simulation follows below):

1. Determine the number of words in the document.
2. Choose a topic mixture for the document over a fixed set of topics (example: 20% topic A, 50% topic B, 30% topic C).
3. Generate the words in the document by:
   - picking a topic based on the document’s multinomial distribution (zm,n ~ Multinomial(θm))
   - picking a word based on that topic’s multinomial distribution (wm,n ~ Multinomial(φzm,n)), where φzm,n is the word distribution for topic z
4. Repeat the process for n iterations until the distribution of words over topics meets the criteria set in step 2.
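As a toy illustration of this generative story (my own sketch, not part of the article’s analysis; the vocabulary, topics, and probabilities are invented), the following samples one short document from hand-picked θ and φ:

set.seed(1502)
vocab <- c("president", "congress", "government", "movies", "television", "actor")
# phi: one word distribution per topic (rows sum to 1)
phi_toy <- rbind(politics      = c(0.35, 0.30, 0.25, 0.04, 0.03, 0.03),
                 entertainment = c(0.03, 0.03, 0.04, 0.30, 0.30, 0.30))
colnames(phi_toy) <- vocab
theta_doc <- c(politics = 0.2, entertainment = 0.8)  # step 2: topic mixture for this document
n_words <- 10                                        # step 1: number of words in the document
# step 3a: pick a topic for each word slot from the document's topic mixture
topics_drawn <- sample(rownames(phi_toy), n_words, replace = TRUE, prob = theta_doc)
# step 3b: pick each word from the chosen topic's word distribution
words_drawn <- sapply(topics_drawn, function(z) sample(vocab, 1, prob = phi_toy[z, ]))
paste(words_drawn, collapse = " ")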

Data Import & Objectives

The data is from this Kaggle dataset. It’s about customers’ feedback on Amazon musical instruments. Every row represents one piece of feedback from one user. There are several columns, but we only need reviewText, which contains the text of the review; overall, the 1-5 product rating given by the user; and reviewTime, which contains the time the review was given.

# data import and preparation
data <- read.csv("Musical_instruments_reviews.csv")
data <- data %>%
  mutate(overall = as.factor(overall),
         reviewTime = str_replace_all(reviewTime, pattern = " ", replacement = "-"),
         reviewTime = str_replace(reviewTime, pattern = ",", replacement = ""),
         reviewTime = mdy(reviewTime)) %>%
  select(reviewText, overall, reviewTime)
head(data)

So the objective of this project is to discover what users are talking about at each rating. This will help the organization better understand its customer feedback, so it can concentrate on the issues customers are facing.

Tidytext

Text cleaning process

Before we feed the text to the LDA model, we need to clean it. We are going to build a textcleaner function using several functions from the tm, textclean, and stringr packages. We also need to convert the text to Document-Term Matrix (DTM) format, because the LDA() function (from the topicmodels package) needs a dtm as input.

# build textcleaner function
textcleaner <- function(x){
  x <- as.character(x)

  x <- x %>%
    str_to_lower() %>%                 # convert the whole string to lower case
    replace_contraction() %>%          # replace contractions with their multi-word forms
    replace_internet_slang() %>%       # replace internet slang with normal words
    replace_emoji() %>%                # replace emoji with words
    replace_emoticon() %>%             # replace emoticons with words
    replace_hash(replacement = "") %>% # remove hashtags
    replace_word_elongation() %>%      # replace informal elongations with known semantic replacements
    replace_number(remove = T) %>%     # remove numbers
    replace_date(replacement = "") %>% # remove dates
    replace_time(replacement = "") %>% # remove times
    str_remove_all(pattern = "[[:punct:]]") %>%         # remove punctuation
    str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove strings mixed with numbers
    str_squish() %>%                   # reduce repeated whitespace inside a string
    str_trim()                         # remove whitespace from start and end of string

  xdtm <- VCorpus(VectorSource(x)) %>%
    tm_map(removeWords, stopwords("en"))

  # convert corpus to document term matrix
  return(DocumentTermMatrix(xdtm))
}

Because we want to know the topics for each rating, we should split/subset the data by rating.

data_1 <- data %>% filter(overall == 1)
data_2 <- data %>% filter(overall == 2)
data_3 <- data %>% filter(overall == 3)
data_4 <- data %>% filter(overall == 4)
data_5 <- data %>% filter(overall == 5)
table(data$overall)
##
##   1   2   3   4   5
##  14  21  77 245 735

From the table above we know that most of the feedback has the highest rating. Because the distributions differ, each rating will get a different treatment, especially in choosing the minimum term frequency. I’ll make sure we use at least 700-1000 words to be analyzed for each rating (a quick way to check this is sketched below).
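A quick, hypothetical helper for that check (vocab_size is my own name, not from the article): it reports how many terms survive several candidate thresholds, so you can pick one that keeps roughly 700-1000 terms. Note that findFreqTerms() filters on total term frequency across the corpus:

# count surviving vocabulary at several candidate frequency thresholds
vocab_size <- function(dtm, thresholds){
  sapply(thresholds, function(t) length(findFreqTerms(dtm, t)))
}
# e.g. vocab_size(dtm_5, c(20, 50, 100)) once dtm_5 is built below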

Topic Modeling rating 5

# apply textcleaner function to the review text
dtm_5 <- textcleaner(data_5$reviewText)
# find the most frequent terms. i choose terms that occur at least 50 times
freqterm_5 <- findFreqTerms(dtm_5, 50)
# we have 981 words. subset the dtm to keep only those selected words
dtm_5 <- dtm_5[, freqterm_5]
# keep only documents that contain at least one selected term
rownum_5 <- apply(dtm_5, 1, sum)
dtm_5 <- dtm_5[rownum_5 > 0, ]
# apply the LDA function. set k = 6, meaning we want to build 6 topics
lda_5 <- LDA(dtm_5, k = 6, control = list(seed = 1502))
# tidy the model, using beta as the per-topic-per-word probabilities
topic_5 <- tidy(lda_5, matrix = "beta")
# choose the 15 words with the highest beta from each topic
top_terms_5 <- topic_5 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
# plot the topics and words for easy interpretation
plot_topic_5 <- top_terms_5 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_5

[Figure: rating-5 topic model, top terms per topic (tidytext)]

Topic Modeling rating 4

dtm_4 <- textcleaner(data_4$reviewText)
freqterm_4 <- findFreqTerms(dtm_4, 20)
dtm_4 <- dtm_4[, freqterm_4]
rownum_4 <- apply(dtm_4, 1, sum)
dtm_4 <- dtm_4[rownum_4 > 0, ]
lda_4 <- LDA(dtm_4, k = 6, control = list(seed = 1502))
topic_4 <- tidy(lda_4, matrix = "beta")
top_terms_4 <- topic_4 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_4 <- top_terms_4 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_4

[Figure: rating-4 topic model, top terms per topic (tidytext)]

Topic Modeling rating 3

dtm_3 <- textcleaner(data_3$reviewText)
freqterm_3 <- findFreqTerms(dtm_3, 10)
dtm_3 <- dtm_3[, freqterm_3]
rownum_3 <- apply(dtm_3, 1, sum)
dtm_3 <- dtm_3[rownum_3 > 0, ]
lda_3 <- LDA(dtm_3, k = 6, control = list(seed = 1502))
topic_3 <- tidy(lda_3, matrix = "beta")
top_terms_3 <- topic_3 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_3 <- top_terms_3 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_3

[Figure: rating-3 topic model, top terms per topic (tidytext)]

Topic Modeling rating 2

dtm_2 <- textcleaner(data_2$reviewText)
freqterm_2 <- findFreqTerms(dtm_2, 5)
dtm_2 <- dtm_2[, freqterm_2]
rownum_2 <- apply(dtm_2, 1, sum)
dtm_2 <- dtm_2[rownum_2 > 0, ]
lda_2 <- LDA(dtm_2, k = 6, control = list(seed = 1502))
topic_2 <- tidy(lda_2, matrix = "beta")
top_terms_2 <- topic_2 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_2 <- top_terms_2 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_2

[Figure: rating-2 topic model, top terms per topic (tidytext)]

Topic Modeling rating 1

dtm_1 <- textcleaner(data_1$reviewText)
freqterm_1 <- findFreqTerms(dtm_1, 5)
dtm_1 <- dtm_1[, freqterm_1]
rownum_1 <- apply(dtm_1, 1, sum)
dtm_1 <- dtm_1[rownum_1 > 0, ]
lda_1 <- LDA(dtm_1, k = 6, control = list(seed = 1502))
topic_1 <- tidy(lda_1, matrix = "beta")
top_terms_1 <- topic_1 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_1 <- top_terms_1 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_1

[Figure: rating-1 topic model, top terms per topic (tidytext)]

textmineR

Text cleaning process

Just like the previous text-cleaning step, we will build a text cleaner function to automate the cleaning process. The difference is that we don’t need to convert the text to dtm format ourselves: the textmineR package has its own dtm converter, CreateDtm(). Fitting an LDA model with textmineR needs a dtm made by the CreateDtm() function. There we can also set the n-gram size, remove punctuation and stopwords, and apply other simple text-cleaning steps.

textcleaner_2 <- function(x){
  x <- as.character(x)

  x <- x %>%
    str_to_lower() %>%                 # convert the whole string to lower case
    replace_contraction() %>%          # replace contractions with their multi-word forms
    replace_internet_slang() %>%       # replace internet slang with normal words
    replace_emoji() %>%                # replace emoji with words
    replace_emoticon() %>%             # replace emoticons with words
    replace_hash(replacement = "") %>% # remove hashtags
    replace_word_elongation() %>%      # replace informal elongations with known semantic replacements
    replace_number(remove = T) %>%     # remove numbers
    replace_date(replacement = "") %>% # remove dates
    replace_time(replacement = "") %>% # remove times
    str_remove_all(pattern = "[[:punct:]]") %>%         # remove punctuation
    str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove strings mixed with numbers
    str_squish() %>%                   # reduce repeated whitespace inside a string
    str_trim()                         # remove whitespace from start and end of string

  return(as.data.frame(x))
}

Topic Modeling rating 5

# apply the textcleaner_2 function. note: we only clean the text without converting it to dtm
clean_5 <- textcleaner_2(data_5$reviewText)
clean_5 <- clean_5 %>% mutate(id = rownames(clean_5))
# create dtm
set.seed(1502)
dtm_r_5 <- CreateDtm(doc_vec = clean_5$x,
                     doc_names = clean_5$id,
                     ngram_window = c(1, 2),
                     stopword_vec = stopwords("en"),
                     verbose = F)
# keep terms that appear more than twice overall
dtm_r_5 <- dtm_r_5[, colSums(dtm_r_5) > 2]

Create the LDA model using textmineR. Here we are going to make 20 topics. The reason we build so many is that textmineR has metrics to measure the quality of topics; we will later choose the topics with the best quality.

set.seed(1502)
mod_lda_5 <- FitLdaModel(dtm = dtm_r_5,
                         k = 20,  # number of topics
                         iterations = 500,
                         burnin = 180,
                         alpha = 0.1, beta = 0.05,
                         optimize_alpha = T,
                         calc_likelihood = T,
                         calc_coherence = T,
                         calc_r2 = T)

Once we have created a model, we need to evaluate it. For overall goodness of fit, textmineR has R-squared and log-likelihood. R-squared is interpretable as the proportion of variability in the data explained by the model, as with linear regression.


mod_lda_5$r2
## [1] 0.2183867

The primary goodness-of-fit measure in topic modeling is likelihood. Likelihoods, generally the log-likelihood, are naturally obtained from probabilistic topic models. Here the log_likelihood is P(tokens|topics) at each iteration.

plot(mod_lda_5$log_likelihood, type = "l")

[Figure: log-likelihood at every iteration, rating 5]

Get the 15 top terms with the highest phi. Phi represents the distribution of words over topics; words with a high phi are the most frequent in a topic.

mod_lda_5$top_terms <- GetTopTerms(phi = mod_lda_5$phi, M = 15)
data.frame(mod_lda_5$top_terms)

[Table: top terms per topic, rating 5]

Let’s see the coherence value for each topic. Topic coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in that topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. For each pair of words {a, b}, probabilistic coherence calculates P(b|a) − P(b), where {a} is more probable than {b} in the topic. In simple words, coherence tells us how associated the words in a topic are (a rough sketch of this calculation follows the output below).

mod_lda_5$coherence
##        t_1        t_2        t_3        t_4        t_5        t_6        t_7
## 0.12140404 0.08349523 0.05510456 0.11607445 0.16397834 0.05472121 0.09739406
##        t_8        t_9       t_10       t_11       t_12       t_13       t_14
## 0.14221823 0.24856426 0.79310008 0.28175270 0.10231907 0.58667185 0.05449207
##       t_15       t_16       t_17       t_18       t_19       t_20
## 0.09204392 0.10147505 0.07949897 0.04519463 0.13664781 0.21586105
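For intuition, here is a rough sketch of that P(b|a) − P(b) calculation for one topic’s top words, using a binary (presence/absence) view of the dtm. textmineR’s CalcProbCoherence() is the canonical implementation; prob_coherence below is my own illustrative version:

prob_coherence <- function(dtm, top_words){
  m <- as.matrix(dtm[, top_words]) > 0  # word presence per document
  scores <- c()
  for(i in 1:(ncol(m) - 1)){            # word a: ranked above word b in the topic
    for(j in (i + 1):ncol(m)){
      p_b_given_a <- sum(m[, i] & m[, j]) / sum(m[, i])  # P(b|a)
      p_b <- mean(m[, j])                                # P(b)
      scores <- c(scores, p_b_given_a - p_b)
    }
  }
  mean(scores)                          # average over all ordered pairs
}
# e.g. prob_coherence(dtm_r_5, mod_lda_5$top_terms[1:5, "t_10"])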

We also want to look at the prevalence value. Prevalence tells us the most frequent topics in the corpus: it is the probability of each topic’s distribution across the whole set of documents.

mod_lda_5$prevalence <- colSums(mod_lda_5$theta)/sum(mod_lda_5$theta)*100
mod_lda_5$prevalence
##      t_1      t_2      t_3      t_4      t_5      t_6      t_7      t_8
## 5.514614 5.296280 4.868778 7.484032 9.360072 2.748069 4.269445 4.195638
##      t_9     t_10     t_11     t_12     t_13     t_14     t_15     t_16
## 5.380414 3.541380 5.807442 5.305865 3.243890 4.657203 5.488087 2.738993
##     t_17     t_18     t_19     t_20
## 4.821128 4.035630 7.385820 3.857221

Now we have the top terms for each topic, the goodness of the model via r2 and log_likelihood, and the quality of the topics via coherence and prevalence. Let’s compile them into a summary.

mod_lda_5$summary <- data.frame(topic = rownames(mod_lda_5$phi),
                                coherence = round(mod_lda_5$coherence, 3),
                                prevalence = round(mod_lda_5$prevalence, 3),
                                top_terms = apply(mod_lda_5$top_terms, 2,
                                                  function(x){paste(x, collapse = ", ")}))
modsum_5 <- mod_lda_5$summary %>%
  `rownames<-`(NULL)

We know that the quality of the model can be described with the coherence and prevalence values. Let’s build a plot to identify which topics have the best quality.

modsum_5 %>% pivot_longer(cols = c(coherence, prevalence)) %>%
  ggplot(aes(x = factor(topic, levels = unique(topic)), y = value, group = 1)) +
  geom_point() + geom_line() +
  facet_wrap(~name, scales = "free_y", nrow = 2) +
  theme_minimal() +
  labs(title = "Best topics by coherence and prevalence score",
       subtitle = "Text review with 5 rating",
       x = "Topics", y = "Value")

[Figure: coherence and prevalence scores per topic, rating 5]

From the graph above we know that topic 10 has the highest quality, which means the words in that topic are strongly associated with each other. But in terms of the probability of the topic’s distribution across the whole set of documents (prevalence), topic 10 has a low score. This means a review is unlikely to use the combination of words in topic 10, even though the words inside that topic support each other.

We can see whether topics can be grouped together using a dendrogram. The dendrogram uses Hellinger distance (the distance between two probability vectors) to decide whether topics are closely related. For instance, the dendrogram below suggests that there is greater similarity between topics 10 and 13.

mod_lda_5$linguistic <- CalcHellingerDist(mod_lda_5$phi)
mod_lda_5$hclust <- hclust(as.dist(mod_lda_5$linguistic), "ward.D")
mod_lda_5$hclust$labels <- paste(mod_lda_5$hclust$labels, mod_lda_5$labels[,1])
plot(mod_lda_5$hclust)

[Figure: cluster dendrogram of topics, rating 5]
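For intuition, the Hellinger distance between two probability vectors p and q can be written in one line; the sketch below (my own, not from the article) mirrors what CalcHellingerDist() computes for every pair of rows of phi:

# Hellinger distance between two probability vectors; ranges from 0 to 1
hellinger <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2)) / sqrt(2)
# e.g. hellinger(mod_lda_5$phi["t_10", ], mod_lda_5$phi["t_13", ])
# a small distance means similar word distributions, so those topics merge early in the dendrogram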

Now that we have finished building and interpreting the topic model for rating 5, let’s apply the same steps to every rating and see the differences in what people are talking about.

I won’t copy and paste the process for every rating because it is exactly the same and would waste space. But if you really want to look at it, please visit my publications on my RPubs.

Conclusion

We’ve gone through the whole topic-model process, from cleaning text to interpretation and analysis. Finally, let’s see what people are talking about at each rating. We will choose the 5 topics with the highest quality (coherence). Each topic will show the 15 words with the highest value of phi (the distribution of words over topics).

Rating 5

modsum_5 %>%
  arrange(desc(coherence)) %>%
  slice(1:5)

[Table: top terms in topics ordered by highest coherence, rating 5]

With the highest coherence scores, topic 10 and topic 13 contain lots of “sticking” and “tongue” words. Maybe it’s just a phrase for a specific instrument: these topics have similar words that push their coherence scores up, but their low prevalence means the words are rarely used in other reviews, which is why I suspect they come from one specific instrument. In topic 11 and others, people are talking about how good the product is; for example, words like “good”, “accurate”, “clean”, “easy”, “recommend”, and “great” indicate positive sentiment.

Rating 4

modsum_4 %>%
  arrange(desc(coherence)) %>%
  slice(1:5)

[Table: top terms in topics ordered by highest coherence, rating 4]

Same as before, the topic with the highest coherence score is filled with sticking and tongue words. At this rating people are still praising the product, though not as much as at rating 5. Keep in mind that the dtm is built using bigrams: two-word terms like solid_state or e_tongue are captured and counted just like single words. With that information, we know that all the words shown here have their own phi value and genuinely represent the reviews.

Rating 3

modsum_3 %>%
  arrange(desc(coherence)) %>%
  slice(1:5)

[Table: top terms in topics ordered by highest coherence, rating 3]

It looks like stick and tongue words are everywhere. Topic 15 has high coherence and prevalence values at rating 3, meaning lots of reviews at this rating talk about them. On the other hand, positive words are barely seen at this rating; most of the topics are filled with guitar- or string-related words.

Rating 2

modsum_2 %>%
  arrange(desc(coherence)) %>%
  slice(1:5)

[Table: top terms in topics ordered by highest coherence, rating 2]

Rating 1

modsum_1 %>%
  arrange(desc(coherence)) %>%
  slice(1:5)

[Table: top terms in topics ordered by highest coherence, rating 1]

At the worst rating, people complain heavily. Words like “junk”, “cheap”, “just”, and “back” are everywhere. There’s a lot of difference compared with rating 5.

Overall, keep in mind that this dataset is a combination of products, so it is no surprise if some topics look like nonsense. But for every rating we were able to build topics around different instruments, most of them talking about a particular instrument with its positive or negative reviews. In this project we managed to build topic models separated by instrument, which shows that LDA is able to build topics from semantically related words. It would be even better to run a topic model on a single product and discover the problems to fix or the strengths to keep. That would surely help an organization better understand its customer feedback, so it can concentrate on the issues customers are facing, especially when there are lots of reviews to analyze.

Translated from: https://medium.com/@joenathanchristian/topic-modeling-in-r-with-tidytext-and-textminer-package-latent-dirichlet-allocation-764f4483be73
