CN105975497A - Automatic microblog topic recommendation method and device
- Publication number
- CN105975497A CN105975497A CN201610268830.1A CN201610268830A CN105975497A CN 105975497 A CN105975497 A CN 105975497A CN 201610268830 A CN201610268830 A CN 201610268830A CN 105975497 A CN105975497 A CN 105975497A
- Authority
- CN
- China
- Prior art keywords
- text
- topic
- microblog
- neural network
- content
- Prior art date
- 2016-04-27
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and device for automatically recommending microblog topics. The method includes: processing the text content of a microblog based on a neural network model to obtain a feature vector; classifying the text content of the microblog according to the feature vector with a softmax classifier to obtain a topic category; and automatically recommending topics for microblogs that do not contain a topic according to the topic category. The method can help users and the microblog platform manage massive amounts of microblog content. The invention also provides a device for automatically recommending microblog topics.
Description
Technical Field
The present invention relates to the fields of computer application technology and social networks, and in particular to a method and device for automatically recommending microblog topics.
Background
Text representation is a crucial step in tasks such as web page retrieval, information filtering, and sentiment analysis. In traditional machine learning methods, it usually takes the form of feature representation. The most common feature representation in text learning is the bag-of-words model, where the typical features are words, bigrams, n-grams, and some manually extracted template features. After representing text as features, traditional models often use word frequency, mutual information, PLSA (Probabilistic Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and similar methods to select the most effective features. However, when representing text, these traditional methods ignore contextual information and lose word-order information.
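For illustration, here is a minimal sketch of the bag-of-words n-gram features described above; scikit-learn's CountVectorizer is just one convenient implementation, not something the patent prescribes, and the toy documents are ours:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words with unigram and bigram features; word order beyond
# the n-gram window is discarded, which is the limitation noted above.
docs = ["the cat sat on the mat", "the mat sat on the cat"]
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(docs)
# The two sentences share all unigram counts, so a pure unigram model
# cannot tell them apart; only the bigram columns differ.
print(vectorizer.get_feature_names_out())
print(features.toarray())
```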
In recent years, pre-trained word vectors and deep neural network models have brought new ideas to natural language processing. With the help of word vectors, several compositional-semantics methods have been proposed to represent the meaning of text. A recurrent neural network can construct the semantics of a text in O(n) time: the model processes the document word by word and stores all preceding semantics in a fixed-size hidden layer. Its advantage is that it captures contextual information well and can model long-distance dependencies. However, a recurrent neural network is a biased model: in a forward recurrent network, later words in the text dominate earlier ones. Because of this semantic bias, the network's representation of the whole text contains more information from its later parts. Since not all texts place their key content at the end, this can reduce the precision of the resulting semantic representation.
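The recurrence described in this paragraph can be written out as follows; the notation is ours rather than the patent's:

```latex
h_t = f\left(W x_t + H h_{t-1} + b\right), \qquad t = 1, \dots, n
```

Here $x_t$ is the word vector at position $t$, $h_t$ is the fixed-size hidden layer, and $h_n$ represents the whole text. One update per word gives the $O(n)$ cost, and because the influence of early words reaches $h_n$ only through repeated transformations, later words weigh more heavily, which is the bias discussed above.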
To address the semantic-bias problem, convolutional neural networks have been proposed for constructing text semantics. A convolutional network uses max pooling to pick out the most useful text fragments, and its complexity is also O(n), so it has greater potential for building text semantics. However, existing convolutional models always use relatively simple convolution kernels, such as fixed windows, and choosing the window size is then a key issue: a window that is too small retains too little contextual information to characterize words precisely, while a window that is too large introduces too many parameters and makes optimization harder. The question is therefore how to build a model that captures context well while reducing the difficulty of choosing the window size.
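In the same (assumed) notation, a fixed-window convolution over $k$ consecutive words followed by max pooling computes

```latex
c_i = f\left(W_c \left[x_i; x_{i+1}; \dots; x_{i+k-1}\right] + b_c\right), \qquad
s_j = \max_i \,(c_i)_j ,
```

so each coordinate of the sentence vector $s$ keeps only the strongest local response anywhere in the text; the choice of $k$ is exactly the window-size trade-off just described.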
Research on the topics of short texts has attracted sustained attention, and how to partition text information into topics effectively and accurately is a problem that remains to be solved.
Summary of the Invention
The object of the present invention is to solve at least one of the above technical problems to some extent.
To this end, the first object of the present invention is to propose a method for automatically recommending microblog topics. The method can help users and the microblog platform manage massive amounts of microblog content.
The second object of the present invention is to propose a device for automatically recommending microblog topics.
To achieve the above objects, the microblog topic recommendation method of the first-aspect embodiment of the present invention includes: processing the text content of a microblog based on a neural network model to obtain a feature vector; classifying the text content of the microblog according to the feature vector with a softmax classifier to obtain a topic category; and automatically recommending topics for microblogs that do not contain a topic according to the topic category.
In the microblog topic recommendation method of the embodiment of the present invention, the text content of a microblog is first processed based on a neural network model to obtain a feature vector; the text content is then classified according to the feature vector with a softmax classifier to obtain a topic category; finally, topics are automatically recommended for microblogs that do not contain a topic according to the topic category. The method can help users and the microblog platform manage massive amounts of microblog content.
In some examples, the neural network model includes a convolutional neural network model and a recurrent neural network model.
In some examples, processing the text content of the microblog based on the neural network model to obtain the feature vector specifically includes: removing cluttered information from the text content of the microblog and removing useless stop words according to a stop-word list to obtain new text content; performing a convolution operation on each sentence of the new text content to obtain local features for each basic unit of the sentence, and applying a max operation over the local features to obtain the feature vector of the sentence; and finally processing the sentence feature vectors with a recurrent neural network to obtain the feature vector of the microblog's text content.
In some examples, the cluttered information includes @ mentions, URLs, and picture information.
To achieve the above objects, the microblog topic recommendation device of the second-aspect embodiment of the present invention includes: a processing module, configured to process the text content of a microblog based on a neural network model to obtain a feature vector; a classification module, configured to classify the text content of the microblog according to the feature vector with a softmax classifier to obtain a topic category; and an automatic recommendation module, configured to automatically recommend topics for microblogs that do not contain a topic according to the topic category.
In the microblog topic recommendation device of the embodiment of the present invention, the processing module first processes the text content of a microblog based on a neural network model to obtain a feature vector; the classification module then classifies the text content according to the feature vector with a softmax classifier to obtain a topic category; finally, the automatic recommendation module automatically recommends topics for microblogs that do not contain a topic according to the topic category. The device can help users and the microblog platform manage massive amounts of microblog content.
In some examples, the neural network model includes a convolutional neural network model and a recurrent neural network model.
In some examples, the processing module is specifically configured to: remove cluttered information from the text content of the microblog and remove useless stop words according to a stop-word list to obtain new text content; perform a convolution operation on each sentence of the new text content to obtain local features for each basic unit of the sentence, and apply a max operation over the local features to obtain the feature vector of the sentence; and finally process the sentence feature vectors with a recurrent neural network to obtain the feature vector of the microblog's text content.
In some examples, the cluttered information includes @ mentions, URLs, and picture information.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.
Brief Description of the Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of a method for automatically recommending microblog topics according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a convolutional layer of a convolutional neural network model according to an embodiment of the present invention;
Fig. 3 is a flow chart of a recurrent neural network model constructing text semantics according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the feature vector of a microblog's text content according to an embodiment of the present invention;
Fig. 5 is a flow chart of a method for automatically recommending microblog topics according to a specific embodiment of the present invention; and
Fig. 6 is a schematic diagram of a device for automatically recommending microblog topics according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, intended to explain the present invention, and should not be construed as limiting it.
Fig. 1 is a flow chart of a method for automatically recommending microblog topics according to an embodiment of the present invention.
As shown in Fig. 1, the method may include the following steps.
S101: process the text content of a microblog based on a neural network model to obtain a feature vector.
It should be noted that, in some examples, the neural network model may include a convolutional neural network model and a recurrent neural network model.
It can be understood that the high fault tolerance and strong nonlinear modeling capability of neural networks have led to their wide study and application, and among the various neural network models the convolutional neural network stands out. A convolutional neural network is a multi-layer network in which each layer consists of several two-dimensional planes, and each plane consists of multiple independent neurons. The network contains computation layers and feature extraction layers. In general, the input of each neuron is connected to a local region of the previous layer, from which it extracts local features; once a local feature has been extracted, its positional relationship to other features is fixed. Each computation layer consists of multiple feature maps; each feature map is a plane, and the neurons on one plane share the same weights. Because the neurons on each map share weights, the number of free parameters of the network is reduced, which in turn reduces the time complexity of selecting the network parameters. The output connections of the neurons follow the "maximum-detection hypothesis": among the neurons in a small region, only the neuron with the largest output reinforces its output value, so only one neuron in the region is reinforced; the reinforced unit of the convolutional network is the maximum-output unit, which also controls the reinforcement of its neighbors. Besides the input and output layers, a convolutional neural network contains convolutional layers, subsampling layers, and fully connected layers. The convolutional and subsampling layers contain several feature maps, each layer has multiple planes, and the neuron weights are continually corrected during training. Neurons in the same plane share weights, which gives the same degree of invariance to displacement and rotation. Because weights are shared, the mapping from one plane to the next can be regarded as a convolution operation. The spatial resolution decreases from one hidden layer to the next while the number of planes per layer increases, allowing more feature information to be detected. In a convolutional layer, the feature maps of the previous layer are convolved with a learnable kernel, and the result, passed through an activation function, forms the neurons of this layer and thus its feature maps; an example convolutional layer is shown in Fig. 2. A convolutional neural network achieves invariance to displacement, scaling, and distortion through three mechanisms: local receptive fields, weight sharing, and subsampling. A local receptive field means that each neuron is connected only to the neural units in a small neighborhood of the previous layer, through which it extracts primary features; weight sharing gives the convolutional network fewer parameters, so it needs relatively little training data and time; and subsampling reduces the resolution of the features, achieving invariance to displacement, scaling, and other forms of distortion.
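As a rough illustration of weight sharing, the sketch below applies one shared kernel across every window of a sequence; it uses only numpy, the sizes are arbitrary, and it is a toy rather than the patent's model:

```python
import numpy as np

def conv1d(x, kernel, bias):
    """Slide one shared kernel over all windows of x.

    x:      (n, d)  sequence of n d-dimensional inputs
    kernel: (k, d)  a single shared weight set, reused at every position
    bias:   scalar
    """
    n, d = x.shape
    k = kernel.shape[0]
    # The same (k*d + 1) parameters produce every output position:
    # this is the weight sharing that keeps the parameter count small.
    return np.array([np.tanh(np.sum(x[i:i + k] * kernel) + bias)
                     for i in range(n - k + 1)])

x = np.random.randn(10, 4)        # 10 positions, 4 features each
out = conv1d(x, np.random.randn(3, 4), 0.0)
print(out.shape)                  # (8,): one response per window
```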
In addition, a recurrent neural network can construct the semantics of a text in O(n) time. The model processes the document word by word and stores all preceding semantics in a fixed-size hidden layer; its advantage is that it captures contextual information well and can model long-distance dependencies. However, it is a biased model: in a forward recurrent network, later words in the text dominate earlier ones, so when the network constructs the semantics of the whole text, its representation contains more information from the later parts. When seeking semantic representations of sentences and documents, it is tempting to apply the distributional hypothesis for words directly and model the documents accordingly. However, generating vector representations of sentences or documents directly from the distributional hypothesis runs into severe data sparsity: if a sentence is treated as a whole and a word-vector model is trained on sentence occurrences, the result is statistically meaningless, because the vast majority of sentences occur only once. Moreover, the distributional hypothesis is a hypothesis about word meaning, and whether acquiring semantics through context in this way is valid for sentences and documents remains open to discussion. New approaches to modeling sentences and documents are therefore needed, and the recurrent neural network is a promising one. The recurrent neural network was first proposed by Elman et al. in 1990. Its core idea is to feed the words of a text one by one in a cyclic manner while maintaining a hidden layer that retains all preceding information. A recurrent network is a special case of a recursive network: it can be viewed as corresponding to a tree in which the right subtree of every non-leaf node is a leaf. This special structure gives the recurrent network two properties. First, because the network structure is fixed, the model constructs the semantics of a text in only O(n) time, so it models text semantics efficiently. Second, viewed as a network, a recurrent neural network is very deep: the network has as many layers as the sentence has words. Training a recurrent network with conventional methods therefore runs into vanishing or exploding gradients, and the model needs more specialized optimization techniques. Fig. 3 shows how a recurrent network builds text semantics: each word is combined with the hidden layer that represents everything before it to form a new hidden layer, computed cyclically from the first word of the text to the last; once the model has consumed every word, the hidden layer corresponding to the last word represents the semantics of the whole text. The optimization of a recurrent network also differs slightly from that of other architectures. In an ordinary neural network, backpropagation follows directly from the chain rule of derivatives; in a recurrent network, because the hidden-to-hidden weight matrix H is reused at every step, differentiating with respect to it directly is very difficult. The simplest optimization method for recurrent networks is backpropagation through time (BPTT): the network is unrolled, and for each labeled sample the model updates the hidden layers one by one with ordinary backpropagation, repeatedly updating the weight matrix H. Because of vanishing gradients, BPTT propagates through only a fixed number of layers. To address the vanishing-gradient problem, Hochreiter and Schmidhuber proposed the LSTM model in 1997; it introduces memory cells that preserve long-distance information and is a commonly used optimization of the recurrent neural network.
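A minimal sketch of the Elman-style forward pass described above; it uses numpy, and the dimensions and tanh nonlinearity are our assumptions rather than anything specified by the patent:

```python
import numpy as np

def rnn_encode(words, W, H, b):
    """Fold a sequence of word vectors into one fixed-size hidden state.

    words: (n, d) word vectors; W: (h, d); H: (h, h); b: (h,)
    Returns the last hidden state, which stands for the whole text.
    """
    hidden = np.zeros(H.shape[0])
    for x in words:                # one O(1) update per word, so O(n) total
        hidden = np.tanh(W @ x + H @ hidden + b)
    return hidden                  # later words influence this state most

d, h = 50, 32
words = np.random.randn(12, d)     # a 12-word text
doc_vec = rnn_encode(words, np.random.randn(h, d) * 0.1,
                     np.random.randn(h, h) * 0.1, np.zeros(h))
print(doc_vec.shape)               # (32,)
```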
Specifically, in some examples, processing the text content of the microblog based on the neural network model to obtain the feature vector includes: removing cluttered information from the text content of the microblog and removing useless stop words according to a stop-word list to obtain new text content; performing a convolution operation on each sentence of the new text content to obtain local features for each basic unit of the sentence, and applying a max operation over the local features to obtain the feature vector of the sentence; and finally processing the sentence feature vectors with a recurrent neural network to obtain the feature vector of the microblog's text content. The feature vector of the microblog's text content is shown in Fig. 4.
More specifically, in some examples the cluttered information includes @ mentions, URLs, and picture information. That is, cluttered information such as @ mentions, URLs, and picture information is removed from the microblog text, the Chinese microblog content is then segmented into words, and useless stop words are removed according to a Chinese stop-word list.
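A plausible preprocessing step along these lines is sketched below; the regular expressions, the tiny stop-word list, and the use of the jieba segmenter are all our assumptions, since the patent does not name a tool:

```python
import re
import jieba  # a common Chinese word-segmentation library; an assumption here

STOP_WORDS = {"的", "了", "是", "在"}  # a tiny placeholder stop-word list

def preprocess(text):
    """Strip @ mentions and URLs, segment, and drop stop words.

    Picture information is typically carried as URLs in the text data,
    so the URL rule covers it in this sketch.
    """
    text = re.sub(r"@\S+", "", text)            # @ mentions
    text = re.sub(r"https?://\S+", "", text)    # URLs (and picture links)
    words = jieba.lcut(text)
    return [w for w in words if w.strip() and w not in STOP_WORDS]

print(preprocess("@小明 今天的天气真好 http://t.cn/xyz"))
```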
It should be noted that the useful information in microblog text data is the microblog content itself.
Here, the local features of each basic unit in a sentence are obtained by a convolution operation over the sentences of the new text content, and a max operation over the local features yields the sentence's feature vector. In other words, sentence-level vector representations of the microblog content are learned. Given a sentence x containing N basic units (r1, r2, ..., rN), the basic unit of a character-level sentence is a single character, and the basic unit of a word-level sentence is a word after segmentation. Two main problems arise when computing sentence-level features: sentences have different lengths, and important information can appear anywhere in a sentence. Modeling the sentence with a convolutional layer and computing sentence-level features solves both problems: the convolution yields a local feature for each basic unit of the sentence, and the max operation over these local features yields a fixed-length sentence feature vector. For a sentence x with N basic units (r1, r2, ..., rN), the convolutional layer performs a matrix-vector operation on each contiguous window of size k. Different window sizes k capture different local information; a suitable k is chosen in the preliminary stage of experimentation. The sentence feature vectors produced by all convolutional layers are concatenated to give the final feature vector of the sentence.
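A sketch of this sentence encoder with several window sizes follows; the window sizes {2, 3}, the random kernels, and the output dimension are illustrative assumptions:

```python
import numpy as np

def sentence_vector(units, kernels):
    """Convolve each window size over the sentence and max-pool.

    units:   (N, d) embeddings of the sentence's basic units
    kernels: dict mapping window size k -> weights of shape (m, k * d)
    Returns the concatenation of one max-pooled m-vector per window size,
    so the output length is fixed regardless of sentence length N.
    """
    pooled = []
    for k, W in kernels.items():
        windows = [units[i:i + k].reshape(-1)        # stack k units
                   for i in range(len(units) - k + 1)]
        local = np.tanh(np.stack(windows) @ W.T)     # local features
        pooled.append(local.max(axis=0))             # max over positions
    return np.concatenate(pooled)

d, m = 50, 16
sent = np.random.randn(9, d)                         # a 9-unit sentence
kernels = {k: np.random.randn(m, k * d) * 0.1 for k in (2, 3)}
print(sentence_vector(sent, kernels).shape)          # (32,)
```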
S102: classify the text content of the microblog according to the feature vector with a softmax classifier to obtain a topic category.
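The classification step can be sketched as follows; the weight shapes and topic labels are placeholders, not the patent's trained model:

```python
import numpy as np

def softmax_classify(doc_vec, W, b, topics):
    """Map the microblog's feature vector to a topic distribution."""
    logits = W @ doc_vec + b
    probs = np.exp(logits - logits.max())   # subtract max for stability
    probs /= probs.sum()
    return topics[int(np.argmax(probs))], probs

topics = ["sports", "entertainment", "technology"]   # placeholder labels
doc_vec = np.random.randn(32)
W, b = np.random.randn(3, 32) * 0.1, np.zeros(3)
print(softmax_classify(doc_vec, W, b, topics)[0])
```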
It should be noted that the classifier may be, but is not limited to, a softmax classifier.
S103: automatically recommend topics for microblogs that do not contain a topic according to the topic category.
In the microblog topic recommendation method of the embodiment of the present invention, the text content of a microblog is first processed based on a neural network model to obtain a feature vector; the text content is then classified according to the feature vector with a softmax classifier to obtain a topic category; finally, topics are automatically recommended for microblogs that do not contain a topic according to the topic category. The method can help users and the microblog platform manage massive amounts of microblog content.
For example, as shown in Fig. 5: based on the recurrent-neural-network-based microblog topic recommendation method of the present invention, a topic recommendation system for Sina Weibo was developed. The system recommends topics for new microblog content published by users in two stages. First, in the automatic preprocessing stage, the original microblog content is cleaned, and a microblog-level vector representation is obtained with the convolutional and recurrent neural networks. Second, in the topic recommendation stage, the system calls the trained softmax classification model, uses the microblog vector representation as features for topic classification, and recommends the topic category to the user. The system's recommendations can help users and the microblog platform manage massive microblog data effectively.
To make the microblog topic recommendation method clearer to those skilled in the art, it is described in detail below with reference to Fig. 5: for a new microblog, a convolutional neural network first produces sentence-level vector representations, a recurrent neural network then learns the microblog-level vector representation, and finally the trained model performs topic classification and automatically recommends the topic category to the user.
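Putting the stages together, the end-to-end flow might be sketched as follows. This composes the illustrative functions from the earlier snippets (`preprocess`, `sentence_vector`, `rnn_encode`, `softmax_classify`); the `embed` lookup and all parameter shapes are placeholders, not the patent's implementation:

```python
import numpy as np

def recommend_topic(text, embed, kernels, W_r, H_r, b_r, W_s, b_s, topics):
    """Sketch of the two-stage flow: clean -> encode -> classify -> recommend."""
    words = preprocess(text)                        # stage 1: data cleaning
    units = np.stack([embed(w) for w in words])     # look up word vectors
    sent_vec = sentence_vector(units, kernels)      # CNN sentence representation
    doc_vec = rnn_encode(sent_vec[None, :], W_r, H_r, b_r)   # RNN microblog vector
    topic, _ = softmax_classify(doc_vec, W_s, b_s, topics)   # stage 2: softmax
    return topic                                    # recommended topic category
```

With a single sentence the recurrent step is trivial; for a multi-sentence microblog, each sentence vector would be fed into the recurrent network in turn.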
In the microblog topic recommendation method of the embodiment of the present invention, the text content of a microblog is first processed based on a neural network model to obtain a feature vector; the text content is then classified according to the feature vector with a softmax classifier to obtain a topic category; finally, topics are automatically recommended for microblogs that do not contain a topic according to the topic category. The method can help users and the microblog platform manage massive amounts of microblog content.
Corresponding to the microblog topic recommendation method provided by the above embodiments, an embodiment of the present invention further provides a microblog topic recommendation device. Since the device provided by the embodiment of the present invention has the same or similar technical features as the method provided by the above embodiments, the foregoing implementation of the method also applies to the device provided by this embodiment and is not described in detail here. As shown in Fig. 6, the device may include a processing module 110, a classification module 120, and an automatic recommendation module 130.
The processing module 110 is configured to process the text content of a microblog based on a neural network model to obtain a feature vector.
The classification module 120 is configured to classify the text content of the microblog according to the feature vector with a softmax classifier to obtain a topic category.
The automatic recommendation module 130 is configured to automatically recommend topics for microblogs that do not contain a topic according to the topic category.
In some examples, the neural network model includes a convolutional neural network model and a recurrent neural network model.
In some examples, the processing module 110 is specifically configured to: remove cluttered information from the text content of the microblog and remove useless stop words according to a stop-word list to obtain new text content; perform a convolution operation on each sentence of the new text content to obtain local features for each basic unit of the sentence and apply a max operation over the local features to obtain the feature vector of the sentence; and finally process the sentence feature vectors with a recurrent neural network to obtain the feature vector of the microblog's text content.
In some examples, the cluttered information includes @ mentions, URLs, and picture information.
In the microblog topic recommendation device of the embodiment of the present invention, the processing module first processes the text content of a microblog based on a neural network model to obtain a feature vector; the classification module then classifies the text content according to the feature vector with a softmax classifier to obtain a topic category; finally, the automatic recommendation module automatically recommends topics for microblogs that do not contain a topic according to the topic category. The device can help users and the microblog platform manage massive amounts of microblog content.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance, or as implicitly indicating the number of the technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples described in this specification, and features thereof, provided they do not contradict each other.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
Those of ordinary skill in the art will understand that all or part of the steps carried by the methods of the above embodiments can be completed by instructing the relevant hardware through a program; the program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610268830.1A CN105975497A (en) | 2016-04-27 | 2016-04-27 | Automatic microblog topic recommendation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610268830.1A CN105975497A (en) | 2016-04-27 | 2016-04-27 | Automatic microblog topic recommendation method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105975497A true CN105975497A (en) | 2016-09-28 |
Family
ID=56993169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610268830.1A Pending CN105975497A (en) | 2016-04-27 | 2016-04-27 | Automatic microblog topic recommendation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105975497A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101887443A (en) * | 2009-05-13 | 2010-11-17 | 华为技术有限公司 | Method and device for classifying texts |
CN102082619A (en) * | 2010-12-27 | 2011-06-01 | 中国人民解放军理工大学通信工程学院 | Transmission adaptive method based on double credible evaluations |
CN102402621A (en) * | 2011-12-27 | 2012-04-04 | 浙江大学 | Image retrieval method based on image classification |
CN103617230A (en) * | 2013-11-26 | 2014-03-05 | 中国科学院深圳先进技术研究院 | Method and system for advertisement recommendation based microblog |
CN104572892A (en) * | 2014-12-24 | 2015-04-29 | 中国科学院自动化研究所 | Text classification method based on cyclic convolution network |
CN104834747A (en) * | 2015-05-25 | 2015-08-12 | 中国科学院自动化研究所 | Short text classification method based on convolution neutral network |
CN104899298A (en) * | 2015-06-09 | 2015-09-09 | 华东师范大学 | Microblog sentiment analysis method based on large-scale corpus characteristic learning |
CN105447179A (en) * | 2015-12-14 | 2016-03-30 | 清华大学 | Microblog social network based topic automated recommendation method and system |
Non-Patent Citations (3)
Title |
---|
- Puyang Xu, Ruhi Sarikaya: "Contextual domain classification in spoken language understanding systems using recurrent neural network", 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
- 刘龙飞, 杨亮: "基于卷积神经网络的微博情感倾向性分析" ("Sentiment orientation analysis of microblogs based on convolutional neural networks"), Journal of Chinese Information Processing (《中文信息学报》) |
- 张剑, 屈丹: "基于词向量特征的循环神经网络语言模型" ("Recurrent neural network language model based on word vector features"), Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106874410A (en) * | 2017-01-22 | 2017-06-20 | 清华大学 | Chinese microblogging text mood sorting technique and its system based on convolutional neural networks |
CN106844765A (en) * | 2017-02-22 | 2017-06-13 | 中国科学院自动化研究所 | Notable information detecting method and device based on convolutional neural networks |
CN106844765B (en) * | 2017-02-22 | 2019-12-20 | 中国科学院自动化研究所 | Significant information detection method and device based on convolutional neural network |
CN108694202A (en) * | 2017-04-10 | 2018-10-23 | 上海交通大学 | Configurable Spam Filtering System based on sorting algorithm and filter method |
CN107273348A (en) * | 2017-05-02 | 2017-10-20 | 深圳大学 | The topic and emotion associated detecting method and device of a kind of text |
CN107273348B (en) * | 2017-05-02 | 2020-12-18 | 深圳大学 | A method and device for joint detection of topic and emotion in text |
CN108038414A (en) * | 2017-11-02 | 2018-05-15 | 平安科技(深圳)有限公司 | Character personality analysis method, device and storage medium based on Recognition with Recurrent Neural Network |
WO2019085329A1 (en) * | 2017-11-02 | 2019-05-09 | 平安科技(深圳)有限公司 | Recurrent neural network-based personal character analysis method, device, and storage medium |
CN108021934A (en) * | 2017-11-23 | 2018-05-11 | 阿里巴巴集团控股有限公司 | The method and device of more key element identifications |
CN108021934B (en) * | 2017-11-23 | 2022-03-04 | 创新先进技术有限公司 | Method and device for recognizing multiple elements |
CN107832047A (en) * | 2017-11-27 | 2018-03-23 | 北京理工大学 | A kind of non-api function argument based on LSTM recommends method |
CN107832047B (en) * | 2017-11-27 | 2018-11-27 | 北京理工大学 | A kind of non-api function argument recommended method based on LSTM |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110609897B (en) | Multi-category Chinese text classification method integrating global and local features | |
CN106570148B (en) | A kind of attribute extraction method based on convolutional neural networks | |
CN108804512B (en) | Text classification model generation device and method and computer readable storage medium | |
CN109255118B (en) | Keyword extraction method and device | |
CN112699247B (en) | A knowledge representation learning method based on multi-class cross-entropy contrastive completion coding | |
CN105975497A (en) | Automatic microblog topic recommendation method and device | |
CN108427670A (en) | A kind of sentiment analysis method based on context word vector sum deep learning | |
CN108763216A (en) | A kind of text emotion analysis method based on Chinese data collection | |
CN105279495A (en) | Video description method based on deep learning and text summarization | |
CN107301246A (en) | Chinese Text Categorization based on ultra-deep convolutional neural networks structural model | |
CN110516074B (en) | Website theme classification method and device based on deep learning | |
CN107944014A (en) | A kind of Chinese text sentiment analysis method based on deep learning | |
CN110059191A (en) | A kind of text sentiment classification method and device | |
CN107273348B (en) | A method and device for joint detection of topic and emotion in text | |
US20230281400A1 (en) | Systems and Methods for Pretraining Image Processing Models | |
CN113516198B (en) | Cultural resource text classification method based on memory network and graphic neural network | |
CN111767725A (en) | Data processing method and device based on emotion polarity analysis model | |
CN110046353B (en) | Aspect level emotion analysis method based on multi-language level mechanism | |
CN108287911A (en) | A kind of Relation extraction method based on about fasciculation remote supervisory | |
CN112183056A (en) | Context-dependent multi-class sentiment analysis method and system based on CNN-BiLSTM framework | |
CN105893362A (en) | A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points | |
CN114528374A (en) | Movie comment emotion classification method and device based on graph neural network | |
CN115544252A (en) | A Text Sentiment Classification Method Based on Attention Static Routing Capsule Network | |
CN106599824A (en) | GIF cartoon emotion identification method based on emotion pairs | |
Nikhila et al. | Text imbalance handling and classification for cross-platform cyber-crime detection using deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20160928 |