CN106708929B - Video program searching method and device - Google Patents

Video program searching method and device

Info

Publication number
CN106708929B
Authority
CN
China
Prior art keywords
matrix
index
video
description
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611019485.4A
Other languages
Chinese (zh)
Other versions
CN106708929A (en)
Inventor
李贤�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Technology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Technology Co Ltd
Priority to CN201611019485.4A (CN106708929B)
Priority to PCT/CN2016/113642 (WO2018090468A1)
Publication of CN106708929A
Application granted
Publication of CN106708929B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video program searching method, which comprises the following steps: receiving a description entry for describing a video program and a video category to which the video program belongs, both input by a user; selecting a latent semantic index model corresponding to the video category, and constructing a query vector of the description entry in the same way the index matrix of the latent semantic index model was constructed; calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model; and sorting the calculated cosine similarities from large to small, and providing the user with the video programs corresponding to the column vectors whose cosine similarity has a sorting number within the sorting interval. Correspondingly, the invention also discloses a video program searching device. By adopting the embodiment of the invention, the latent semantics of the documents can be mined, and the accuracy and efficiency of searching for video programs are improved.

Description

Video program searching method and device
Technical Field
The present invention relates to the field of computers, and in particular, to a method and an apparatus for searching for a video program.
Background
When variety shows are recommended, the content-based method is an important strategy. It makes clustering recommendations based on the similarity of the variety show content descriptions, clustering texts with similar content. The existing Rocchio algorithm, which is mainly based on TF-IDF, is derived from vector space model theory. The basic idea of the vector space model is to represent each text as a vector, so that subsequent processing can be converted into operations on vectors in that space. Training the Rocchio algorithm consists of establishing a feature vector for each category; to classify an unknown text, a vector is generated for the text, its similarity to each category feature vector is calculated, and the text is assigned to the most similar category.
However, this algorithm has the following defects. First, the Rocchio algorithm cannot mine the latent semantics of a document. Second, it assumes that the training data is absolutely correct: it has no mechanism to quantitatively measure whether a sample contains noise, and is therefore not robust to erroneous data.
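The Rocchio procedure described in the background can be sketched as a nearest-centroid classifier. This is only an illustration and is not taken from the patent: the function names, the toy vectors, and the use of raw term counts instead of TF-IDF weights are all assumptions.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def train_rocchio(samples):
    """Build one feature vector (centroid) per category from (vector, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for vec, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}

def classify(centroids, vec):
    """Assign the text vector to the category whose centroid is most similar."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))

# Hypothetical training data: (term-count vector, category label).
training = [([3, 0, 1], "music"), ([2, 0, 0], "music"), ([0, 4, 1], "comedy")]
centroids = train_rocchio(training)
predicted = classify(centroids, [1, 0, 0])
```

Every training sample contributes equally to its category centroid, so a mislabeled sample shifts the category vector and nothing in the procedure flags it. This is the second defect named above.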
Disclosure of Invention
The video program searching method and device provided by the embodiments of the invention can mine the latent semantics of the documents and improve the accuracy and efficiency of searching for video programs.
The method for searching the video program provided by the embodiment of the invention comprises the following steps:
receiving a description entry for describing a video program and a video category to which the video program belongs, which are input by a user;
selecting a latent semantic index model corresponding to the video category, and constructing a query vector of the description entry in the same way the index matrix of the latent semantic index model was constructed; the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from the description documents that describe video programs of the same video category;
calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model;
and sorting the calculated cosine similarities from large to small, and providing the user with the video programs corresponding to the column vectors whose cosine similarity has a sorting number within the sorting interval.
Further, the process of constructing the index matrix from the description documents describing the video programs includes: taking the word frequency of the ith keyword appearing in the description document of the jth video program as the numerical value of the ith element of the jth column of the index matrix;
the process of constructing the query vector describing the entry comprises: setting a keyword represented by an ith element of the query vector to be the same as a keyword represented by an ith row element of the index matrix, and taking a word frequency of the keyword corresponding to the ith element appearing in the description entry as a numerical value of the ith element of the query vector; wherein the query vector is a column vector.
Further, a process of constructing an index matrix from the description documents describing the video programs of the same video category specifically includes:
for all description documents which are stored in a database and describe video programs of the same video category, carrying out format adjustment on terms contained in all the description documents according to a standard term format; the database stores description documents of various video categories, one description document describes one video program, and the video programs described by different description documents are different from each other;
calling a word segmentation tool;
utilizing the word segmentation tool to segment the entries of all the description documents after format adjustment to obtain a first word set;
extracting keywords from the first set of words according to a TF-IDF algorithm;
constructing an index matrix according to the word frequency of each extracted keyword in each description document; the rows of the index matrix are ordered from high to low by the total word frequency of the keywords across all the description documents, and the columns of the index matrix are ordered from high to low by the word frequency of the keywords appearing in each description document.
Further, the constructing the query vector describing the entry specifically includes:
according to the standard entry format, carrying out format adjustment on the description entries;
calling a word segmentation tool;
utilizing the word segmentation tool to segment the description entries after the format adjustment to obtain a second word set;
extracting keywords from the second set of words according to a TF-IDF algorithm;
and constructing a query vector of the description entries according to the word frequency of each extracted keyword appearing in the description entries.
Further, if the index matrix is H, the latent semantic index model obtained by performing singular value decomposition on the index matrix is: $H = T S D^{T}$; wherein T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H; S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H; D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H; the query vector is Q;
calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model specifically comprises:
selecting the matrices $T_K$, $S_K$ and $D_K$, and revising the latent semantic index model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first K columns of the matrix T, $S_K$ is the diagonal matrix formed by the first K diagonal elements of the matrix S, and $D_K$ is the matrix formed by the first K columns of the matrix D; the value of K is larger than the maximum sorting number contained in the sorting interval;
for the index matrix $H_K$ of the revised latent semantic index model, calculating the cosine similarity between the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$ and the jth row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and taking this cosine similarity as the cosine similarity between the jth column vector of the index matrix $H_K$ and the query vector Q.
Further, the search method further comprises:
when a description document describing a new video program is added to the database, updating the latent semantic index model corresponding to the video category to which the new video program belongs.
Accordingly, an embodiment of the present invention provides a video program search apparatus, including:
the user information receiving module is used for receiving a description entry which is input by a user and used for describing a video program and a video category to which the video program belongs;
the query vector construction module is used for selecting a latent semantic index model corresponding to the video category and constructing the query vector of the description entry in the same way the index matrix of the latent semantic index model was constructed; the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from the description documents that describe video programs of the same video category;
the similarity calculation module is used for calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model;
and the video program selecting module is used for sorting the calculated cosine similarities from large to small, and providing the user with the video programs corresponding to the column vectors whose cosine similarity has a sorting number within the sorting interval.
Further, the query vector construction module includes a unit configured to construct an index matrix according to the description document describing the video program, and is specifically configured to: taking the word frequency of the ith keyword appearing in the description document of the jth video program as the numerical value of the ith element of the jth column of the index matrix;
the unit for constructing a query vector describing the entry, which is included in the query vector construction module, is specifically configured to: setting a keyword represented by an ith element of the query vector to be the same as a keyword represented by an ith row element of the index matrix, and taking a word frequency of the keyword corresponding to the ith element appearing in the description entry as a numerical value of the ith element of the query vector; wherein the query vector is a column vector.
Further, the query vector construction module includes a unit configured to construct an index matrix according to description documents describing video programs of the same video category, specifically:
the first format adjusting unit is used for adjusting the formats of all the entries contained in all the description documents which are stored in the database and describe the video programs of the same video category according to the standard entry formats; the database stores description documents of various video categories, one description document describes one video program, and the video programs described by different description documents are different from each other;
the first tool calling unit is used for calling the word segmentation tool;
the first word segmentation unit is used for performing word segmentation on the entries of all the description documents after format adjustment by using the word segmentation tool to obtain a first word set;
a first keyword extraction unit for extracting keywords from the first word set according to a TF-IDF algorithm;
the index matrix construction unit is used for constructing an index matrix according to the word frequency of each extracted keyword in each description document; the row sequence of the index matrix is arranged from high to low according to the total word frequency of the keywords appearing in all the description documents, and the column sequence of the index matrix is arranged from high to low according to the word frequency of the keywords appearing in each description document.
Further, the query vector construction module further includes a unit configured to construct the query vector describing the entry, specifically:
the second format adjusting unit is used for carrying out format adjustment on the description entries according to the standard entry format;
the second tool calling unit is used for calling the word segmentation tool;
the second word segmentation unit is used for segmenting the description entries with the adjusted formats by using the word segmentation tool to obtain a second word set;
a second keyword extraction unit for extracting keywords from the second word set according to a TF-IDF algorithm;
and the query vector construction unit is used for constructing the query vector of the description entries according to the word frequency of each extracted keyword appearing in the description entries.
Further, if the index matrix is H, the latent semantic index model obtained by performing singular value decomposition on the index matrix is: $H = T S D^{T}$; wherein T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H; S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H; D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H; the query vector is Q;
the similarity calculation module specifically includes:
a model revision unit for selecting the matrices $T_K$, $S_K$ and $D_K$, and revising the latent semantic index model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first K columns of the matrix T, $S_K$ is the diagonal matrix formed by the first K diagonal elements of the matrix S, and $D_K$ is the matrix formed by the first K columns of the matrix D; the value of K is larger than the maximum sorting number contained in the sorting interval;
a computing unit for, with respect to the index matrix $H_K$ of the revised latent semantic index model, calculating the cosine similarity between the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$ and the jth row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and taking this cosine similarity as the cosine similarity between the jth column vector of the index matrix $H_K$ and the query vector Q.
Further, the search device further includes:
and the model updating module is used for updating the latent semantic index model corresponding to the video category to which a new video program belongs when a description document describing the new video program is added to the database.
The embodiment of the invention has the following beneficial effects:
According to the video program searching method and device provided by the embodiments of the invention, calculating the cosine similarity between the query vector of the video to be searched and each column vector of the index matrix of the latent semantic index model gives the degree of correlation between the description entry of the video to be searched and the description document represented by each column vector: the higher the value, the higher the correlation. The video programs whose description documents are highly correlated with the description entry are then recommended to the user. In addition, because the user inputs the video category to which the video program belongs and the latent semantic index model corresponding to that category is selected for calculation, the efficiency of searching for video programs can be further improved.
Drawings
Fig. 1 is a schematic flowchart of an embodiment of a video program searching method provided by the present invention;
fig. 2 is a schematic structural diagram of an embodiment of a video program search apparatus provided in the present invention;
fig. 3 is a schematic structural diagram of an embodiment of a query vector construction module of the video program search apparatus provided in the present invention;
fig. 4 is a schematic structural diagram of a similarity calculation module of a video program search apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an embodiment of a video program searching method provided by the present invention; the searching method comprises steps S1-S4, and specifically comprises the following steps:
S1, receiving a description entry for describing a video program and a video category to which the video program belongs, both input by a user;
S2, selecting a latent semantic index model corresponding to the video category, and constructing the query vector of the description entry in the same way the index matrix of the latent semantic index model was constructed; the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from the description documents that describe video programs of the same video category; the value of the ith element in the jth column of the index matrix represents the word frequency of the ith keyword in the description document of the jth video program; the query vector is a column vector, the keyword represented by the ith element of the query vector is the same as the keyword represented by the elements of the ith row of the index matrix, and the value of the ith element of the query vector represents the word frequency, in the description entry, of the keyword corresponding to the ith element;
S3, calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model;
S4, sorting the calculated cosine similarities from large to small, and providing the user with the video programs corresponding to the column vectors whose cosine similarity has a sorting number within the sorting interval.
It should be noted that calculating the cosine similarity between the query vector of the video to be searched and each column vector of the index matrix of the latent semantic index model gives the degree of correlation between the description entry of the video to be searched and the description document represented by each column vector: the higher the value, the higher the correlation. The video programs whose description documents are highly correlated with the description entry are then recommended to the user. Because the latent semantic index model is constructed (trained) from the description documents describing the video programs, the latent semantics of the documents can be mined and the accuracy of searching for video programs is improved. In addition, because the user inputs the video category to which the video program belongs and the latent semantic index model corresponding to that category is selected for calculation, the efficiency of searching for video programs can be further improved. In general, the sorting interval is preferably set to the top 10 sorting numbers.
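Step S4 can be sketched as follows. The function name `select_by_rank` and its parameters `first_rank` and `last_rank` are hypothetical; the sketch assumes the sorting interval is a contiguous range of 1-based ranks, with the top 10 as the default.

```python
def select_by_rank(similarities, programs, first_rank=1, last_rank=10):
    """Return the programs whose similarity rank falls in [first_rank, last_rank].

    Ranks are 1-based and assigned after sorting similarities from large to small.
    """
    order = sorted(range(len(similarities)), key=lambda j: similarities[j], reverse=True)
    return [programs[j] for j in order[first_rank - 1:last_rank]]

# Hypothetical similarities for four programs (one per column of the index matrix).
sims = [0.12, 0.87, 0.45, 0.91]
names = ["program A", "program B", "program C", "program D"]
top_two = select_by_rank(sims, names, first_rank=1, last_rank=2)
```

With fewer programs than `last_rank`, the slice simply returns every program in ranked order.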
Further, the process of constructing the index matrix according to the description document describing the video program of the same video category in step S2 includes:
for all description documents which are stored in a database and describe video programs of the same video category, carrying out format adjustment on the terms contained in all the description documents according to a standard term format; the database stores description documents of various video categories, one description document describes one video program, and the video programs described by different description documents are different from each other. Format adjustment of the terms includes, but is not limited to, converting lowercase letters to uppercase, deleting redundant spaces, unifying punctuation marks, and converting full-width and half-width characters to a single form.
Calling a word segmentation tool; preferably, the word segmentation tool is a jieba word segmentation tool, but is not limited to this word segmentation tool.
Utilizing the word segmentation tool to segment the terms of all the format-adjusted description documents to obtain a first word set; the word segmentation tool offers several segmentation modes: besides the normal segmentation mode, it can further split long words, which improves recall. For short texts in particular, this yields more words than normal segmentation and helps improve the accuracy of the video programs subsequently output.
Extracting keywords from the first set of words according to a TF-IDF algorithm;
constructing an index matrix according to the word frequency of each extracted keyword in each description document; the row sequence of the index matrix is arranged from high to low according to the total word frequency of the keywords appearing in all the description documents, and the column sequence of the index matrix is arranged from high to low according to the word frequency of the keywords appearing in each description document.
It should be noted that the index matrix is constructed in advance from the description documents stored in the database, and its construction follows this rule: the value of the ith element in the jth column of the index matrix represents the word frequency of the ith keyword in the description document of the jth video program. All elements in the ith row of the index matrix represent the same keyword, and the keywords represented by different rows are different. For example, assuming that all elements in row 1 of the index matrix represent keyword A and the elements in column 1 of the index matrix represent description document B, the value of the element in row 1 and column 1 of the index matrix is the word frequency of keyword A in description document B.
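The index matrix construction steps above can be sketched in plain Python. Several parts are assumptions made for illustration: the regular-expression tokenizer stands in for a real segmentation tool such as jieba, keywords are scored by their best TF-IDF value in any document, and columns are left in input order because the text does not fully specify the column ordering. All function names are hypothetical.

```python
import math
import re
import unicodedata
from collections import Counter

def normalize(text):
    """Apply the format adjustments: full-width to half-width, upper case, single spaces."""
    text = unicodedata.normalize("NFKC", text)   # folds full-width forms to half-width
    return re.sub(r"\s+", " ", text.upper()).strip()

def tokenize(text):
    """Stand-in tokenizer (assumption); the patent prefers a tool such as jieba."""
    return re.findall(r"\w+", text)

def extract_keywords(token_lists, top_n):
    """Keep the top_n words by their best TF-IDF score in any document."""
    n_docs = len(token_lists)
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    best = {}
    for tokens in token_lists:
        for word, count in Counter(tokens).items():
            score = (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            best[word] = max(best.get(word, 0.0), score)
    return sorted(best, key=lambda w: (-best[w], w))[:top_n]

def build_index_matrix(documents, top_n=6):
    """Return (keywords, H) where H[i][j] is the count of keyword i in document j."""
    token_lists = [tokenize(normalize(doc)) for doc in documents]
    counts = [Counter(tokens) for tokens in token_lists]
    keywords = extract_keywords(token_lists, top_n)
    # Rows ordered from high to low total word frequency across all documents.
    keywords.sort(key=lambda w: (-sum(c[w] for c in counts), w))
    matrix = [[c[w] for c in counts] for w in keywords]
    return keywords, matrix

# Hypothetical description documents; the third starts with a full-width letter.
docs = ["singing contest  singing stars", "cooking contest", "\uff53inging travel show"]
keywords, H = build_index_matrix(docs, top_n=6)
```

Each row of `H` corresponds to one keyword and each column to one description document, so `H[0]` holds the counts of the most frequent keyword across the three documents.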
Further, the constructing the query vector describing the entry in step S2 specifically includes:
according to the standard term format, carrying out format adjustment on the description entry; for example, converting lowercase letters to uppercase, deleting redundant spaces, unifying punctuation marks, and converting full-width and half-width characters to a single form.
Calling a word segmentation tool; preferably, the word segmentation tool is a jieba word segmentation tool, but is not limited to this word segmentation tool.
Utilizing the word segmentation tool to segment the format-adjusted description entry to obtain a second word set; the word segmentation tool offers several segmentation modes: besides the normal segmentation mode, it can further split long words, which improves recall. For short texts in particular, this yields more words than normal segmentation and helps improve the accuracy of the video programs subsequently output.
Extracting keywords from the second set of words according to a TF-IDF algorithm;
and constructing a query vector of the description entries according to the word frequency of each extracted keyword appearing in the description entries.
It should be noted that, when constructing the query vector of the description entry, the keyword represented by the ith element of the query vector must be the same as the keyword represented by the elements of the ith row of the index matrix of the latent semantic index model; only then is comparing the cosine similarity between the query vector and each column vector of the index matrix meaningful.
In addition, constructing the query vector follows this principle: the value of the ith element of the query vector represents the word frequency, in the description entry, of the keyword corresponding to the ith element. For example, assuming that all elements in row 1 of the index matrix represent keyword A, the 1st element of the query vector also represents keyword A, and its value is the word frequency of keyword A in the description entry.
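The alignment rule for the query vector can be sketched as below. The function name `build_query_vector` is hypothetical, and the sketch assumes the description entry has already been format-adjusted and segmented into tokens.

```python
from collections import Counter

def build_query_vector(keywords, query_tokens):
    """Element i is the count of keywords[i] in the query; unknown words are ignored."""
    counts = Counter(query_tokens)
    return [counts[word] for word in keywords]

# keywords must be in the same order as the rows of the index matrix.
keywords = ["SINGING", "CONTEST", "COOKING"]
query = build_query_vector(keywords, ["COOKING", "SHOW", "COOKING"])
```

Words in the query that are not keywords of the index matrix (here "SHOW") have no row to map to and are dropped.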
Further, if the index matrix is H, the latent semantic index model obtained by performing singular value decomposition on the index matrix is: $H = T S D^{T}$; wherein T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H; S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H; D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H; the query vector is Q;
the specific implementation process of step S3 is:
selecting the matrices $T_K$, $S_K$ and $D_K$, and revising the latent semantic index model to $H_K = T_K S_K D_K^{T}$; wherein $T_K$ is the matrix formed by the first K columns of the matrix T, $S_K$ is the diagonal matrix formed by the first K diagonal elements of the matrix S, and $D_K$ is the matrix formed by the first K columns of the matrix D; the value of K is larger than the maximum sorting number contained in the sorting interval;
for the index matrix $H_K$ of the revised latent semantic index model, calculating the cosine similarity between the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$ and the jth row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$, and taking this cosine similarity as the cosine similarity between the jth column vector of the index matrix $H_K$ and the query vector Q.
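The calculation in step S3 can be sketched with NumPy. This is an illustration under stated assumptions rather than the patent's implementation: the function name `lsi_similarities`, the toy matrix, and the zero-norm guard are all hypothetical, and `numpy.linalg.svd` is used to obtain T, the singular values, and the transpose of D.

```python
import numpy as np

def lsi_similarities(H, q, k):
    """Cosine similarity between the query and each document in the rank-k LSI space."""
    T, s, Dt = np.linalg.svd(H, full_matrices=False)   # H = T @ diag(s) @ Dt
    T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T
    q_k = q @ T_k                 # row vector Q^T T_K, shape (k,)
    docs_k = D_k @ S_k            # row j represents document j, shape (n_docs, k)
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    # Guard against division by zero for empty documents or an all-zero query.
    return np.divide(docs_k @ q_k, norms, out=np.zeros(len(docs_k)), where=norms > 0)

# Toy index matrix: rows are keywords, columns are description documents.
H = np.array([[2.0, 0.0, 1.0],    # SINGING
              [1.0, 1.0, 0.0],    # CONTEST
              [0.0, 1.0, 0.0]])   # COOKING
q = np.array([0.0, 0.0, 2.0])     # query mentions COOKING twice
sims = lsi_similarities(H, q, k=2)
ranking = np.argsort(-sims)       # document indices, most similar first
```

The comparison takes place between K-dimensional vectors, so its cost per document depends on K rather than on the number of keywords.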
It should be noted that K here is a threshold that can be selected according to actual conditions. The decomposition uses a rank-K approximation of H, in which all singular values after the K largest singular values of the index matrix H are set to zero. Revising the latent semantic index model in this way can improve retrieval efficiency.
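A short derivation shows why comparing K-dimensional vectors gives the same ranking as comparing the query with the columns of the rank-K matrix. It assumes only that the columns of the truncated left singular matrix are orthonormal, which follows from T being orthogonal. Here h_j denotes column j of the rank-K matrix and d_j the jth row of the truncated right singular matrix written as a column; both symbols are introduced for this derivation.

```latex
\begin{aligned}
h_j &= T_K S_K d_j, \qquad d_j^{T} = \text{row } j \text{ of } D_K,\\
Q^{T} h_j &= (Q^{T} T_K)(S_K d_j),\\
\lVert h_j \rVert &= \lVert S_K d_j \rVert \quad (\text{since } T_K^{T} T_K = I_K),\\
\cos(Q, h_j) &= \frac{(Q^{T} T_K)(S_K d_j)}{\lVert Q \rVert\,\lVert S_K d_j \rVert}
= \frac{\lVert Q^{T} T_K \rVert}{\lVert Q \rVert}\,\cos\!\left(Q^{T} T_K,\; d_j^{T} S_K\right).
\end{aligned}
```

The leading factor does not depend on j, so sorting the documents by the K-dimensional cosine produces the same order as sorting them by the cosine against the columns of the rank-K matrix.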
Further, the search method further comprises:
when a description document describing a new video program is added to the database, updating the latent semantic index model corresponding to the video category to which the new video program belongs.
It should be noted that, as video programs are continuously added, description documents describing the newly added video programs are also continuously added to the database, so the latent semantic index model needs to be updated.
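One way to organize the update is to keep a separate model per video category and rebuild only the affected one. The sketch below is an assumption: the text does not say whether the update is a full rebuild or an incremental one, and the class name `ModelRegistry` and its methods are hypothetical.

```python
class ModelRegistry:
    """Keeps one index model per video category and rebuilds it when documents arrive."""

    def __init__(self, build_model):
        self._build_model = build_model      # callable: list of documents -> model
        self._documents = {}                 # category -> list of description documents
        self._models = {}                    # category -> model built from those documents

    def add_document(self, category, document):
        """Store the new description document and refresh only that category's model."""
        self._documents.setdefault(category, []).append(document)
        self._models[category] = self._build_model(self._documents[category])

    def model_for(self, category):
        """Return the current model for the category chosen by the user."""
        return self._models[category]

# len() stands in for building the index matrix and running the SVD (assumption).
registry = ModelRegistry(build_model=len)
registry.add_document("variety", "singing contest")
registry.add_document("variety", "cooking contest")
registry.add_document("drama", "family story")
```

Because models are keyed by category, adding a document to one category leaves the models of the other categories untouched.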
According to the video program searching method provided by the embodiment of the invention, calculating the cosine similarity between the query vector of the video to be searched and each column vector of the index matrix of the latent semantic index model gives the degree of correlation between the description entry of the video to be searched and the description document represented by each column vector: the higher the value, the higher the correlation. The video programs whose description documents are highly correlated with the description entry are then recommended to the user. Because the latent semantic index model is constructed (trained) from the description documents describing the video programs, the latent semantics of the documents can be mined and the accuracy of searching for video programs is improved. In addition, because the user inputs the video category to which the video program belongs and the latent semantic index model corresponding to that category is selected for calculation, the efficiency of searching for video programs can be further improved.
Fig. 2 is a schematic structural diagram of an embodiment of a video program search apparatus according to the present invention. The search apparatus can execute all the processes of the video program search method provided by the above embodiment, and the search apparatus includes:
a user information receiving module 10, configured to receive a description entry describing a video program and a video category to which the video program belongs, where the description entry is input by a user;
a query vector construction module 20, configured to select a latent semantic index model corresponding to the video category, and construct a query vector of the description entry according to the construction manner of the index matrix of the latent semantic index model; the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from description documents describing video programs of the same video category;
a similarity calculation module 30, configured to calculate a cosine similarity between each column of vectors of the index matrix and the query vector according to the latent semantic index model;
and the video program selecting module 40 is configured to sort the cosine similarity obtained through calculation from large to small, and select a video program corresponding to the column vector of the cosine similarity whose ranking number belongs to the sorting interval to provide to the user.
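The selection step performed by module 40 can be sketched as follows. The function name `select_by_rank` is hypothetical, and treating the sorting interval as a 1-based, inclusive range of ranks is an assumption; the patent does not fix these details.

```python
def select_by_rank(similarities, program_ids, first_rank, last_rank):
    """Return the programs whose rank (1 = most similar) lies in [first_rank, last_rank]."""
    # Sort column indices by cosine similarity, largest first.
    order = sorted(range(len(similarities)),
                   key=lambda j: similarities[j],
                   reverse=True)
    # Ranks are 1-based and the interval is inclusive (an assumption).
    return [program_ids[j] for j in order[first_rank - 1:last_rank]]

# Illustrative values only: four documents, keep ranks 1 to 2.
top = select_by_rank([0.1, 0.9, 0.5, 0.7], ["a", "b", "c", "d"], 1, 2)
```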
Further, the query vector construction module includes a unit configured to construct an index matrix according to the description document describing the video program, and is specifically configured to: taking the word frequency of the ith keyword appearing in the description document of the jth video program as the numerical value of the ith element of the jth column of the index matrix;
the unit for constructing a query vector describing the entry, which is included in the query vector construction module, is specifically configured to: setting a keyword represented by an ith element of the query vector to be the same as a keyword represented by an ith row element of the index matrix, and taking a word frequency of the keyword corresponding to the ith element appearing in the description entry as a numerical value of the ith element of the query vector; wherein the query vector is a column vector.
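The two constructions above share one keyword order: element (i, j) of the index matrix is the count of keyword i in document j, and element i of the query vector is the count of the same keyword i in the description entry. A minimal sketch, assuming documents and the entry are already tokenized into word lists (the function names are hypothetical):

```python
import numpy as np

def build_index_matrix(keywords, documents):
    """H[i, j] = number of times keyword i occurs in tokenized document j."""
    H = np.zeros((len(keywords), len(documents)))
    for j, tokens in enumerate(documents):
        for i, kw in enumerate(keywords):
            H[i, j] = tokens.count(kw)
    return H

def build_query_vector(keywords, query_tokens):
    """Column vector whose i-th element is the count of keyword i in the entry."""
    return np.array([[query_tokens.count(kw)] for kw in keywords], dtype=float)

# Illustrative data only.
keywords = ["space", "war", "love"]
documents = [["space", "war", "space"], ["love", "war"]]
H = build_index_matrix(keywords, documents)
Q = build_query_vector(keywords, ["space", "love", "love"])
```

Because the query vector reuses the row order of the index matrix, it can be compared directly with any column of that matrix.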
Further, referring to fig. 3, it is a schematic structural diagram of an embodiment of a query vector constructing module of a video program search apparatus provided in the present invention, where the query vector constructing module 20 includes a unit for constructing an index matrix according to description documents describing video programs of the same video category, specifically:
a first format adjusting unit 21, configured to perform format adjustment on entries included in all description documents describing video programs of the same video category, which are stored in a database, according to a standard entry format; the database stores description documents of various video categories, one description document describes one video program, and the video programs described by different description documents are different from each other;
a first tool calling unit 22 for calling a word segmentation tool;
the first word segmentation unit 23 is configured to perform word segmentation on the entries of all the description documents after format adjustment by using the word segmentation tool to obtain a first word set;
a first keyword extraction unit 24 for extracting keywords from the first word set according to a TF-IDF algorithm;
an index matrix constructing unit 25, configured to construct an index matrix according to the word frequency of each extracted keyword appearing in each description document; the row sequence of the index matrix is arranged from high to low according to the total word frequency of the keywords appearing in all the description documents, and the column sequence of the index matrix is arranged from high to low according to the word frequency of the keywords appearing in each description document.
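The keyword extraction and row ordering handled by these units can be sketched as below. The patent names TF-IDF but not the exact scoring or cut-off rule, so scoring each term by its best TF-IDF value over all documents and keeping the `top_n` highest-scoring terms are assumptions, as is the function name `extract_keywords`. Word segmentation is assumed to have been done already, and every document is assumed to be non-empty.

```python
import math
from collections import Counter

def extract_keywords(documents, top_n):
    """Pick top_n terms by TF-IDF, then order them by total word frequency (high to low)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in documents for term in set(tokens))
    best = {}
    for tokens in documents:
        counts = Counter(tokens)
        for term, c in counts.items():
            score = (c / len(tokens)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), score)
    # Keep the top_n terms by their best score (ties broken alphabetically).
    chosen = sorted(best, key=lambda t: (-best[t], t))[:top_n]
    # Row order of the index matrix: total frequency over all documents, descending.
    total = Counter(term for tokens in documents for term in tokens)
    return sorted(chosen, key=lambda t: (-total[t], t))

# Illustrative data only: "the" occurs in every document, so its IDF is zero.
docs = [["space", "war", "the", "space"],
        ["love", "war", "the"],
        ["space", "the", "love"]]
rows = extract_keywords(docs, top_n=3)
```

Terms that appear in every document receive a score of zero and therefore drop out first, which is the usual motivation for using TF-IDF rather than raw frequency at this step.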
Further, the query vector construction module 20 further includes a unit for constructing the query vector describing the entry, specifically:
a second format adjusting unit 26, configured to perform format adjustment on the description entries according to a standard entry format;
a second tool calling unit 27 for calling a word segmentation tool;
a second word segmentation unit 28, configured to perform word segmentation on the description entry with the adjusted format by using the word segmentation tool, so as to obtain a second word set;
a second keyword extraction unit 29 for extracting keywords from the second word set according to a TF-IDF algorithm;
a query vector construction unit 31, configured to construct a query vector of the description entries according to the word frequency of each extracted keyword appearing in the description entries.
Further, referring to Fig. 4, which is a schematic structural diagram of an embodiment of a similarity calculation module of a video program search apparatus provided by the present invention, where the index matrix is H, the latent semantic index model obtained by performing singular value decomposition on the index matrix is: $H = T \cdot S \cdot D^{T}$; wherein T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H; S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H; D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H; the query vector is Q;
the similarity calculation module 30 specifically includes:
a model revision unit 32 for selecting the matrices $T_K$, $S_K$ and $D_K$ and revising the latent semantic index model to $H_K = T_K \cdot S_K \cdot D_K^{T}$; wherein $T_K$ is a matrix formed by the first K columns of the matrix T, $S_K$ is a diagonal matrix formed by the first K diagonal elements of the matrix S, and $D_K$ is a matrix formed by the first K columns of the matrix D; the numerical value of K is larger than the maximum sorting number contained in the sorting interval;
a computing unit 33 for, with respect to the index matrix $H_K$ of the revised latent semantic index model, computing the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and taking the cosine similarity between this row vector and the jth row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$ as the cosine similarity between the jth column vector of the index matrix $H_K$ and the query vector Q.
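The computation performed by units 32 and 33 can be sketched in NumPy as follows. The function name `lsi_similarities` and the toy data are assumptions; only the formulas (the row vector Q^T · T_K compared with row j of D_K · S_K) come from the text above.

```python
import numpy as np

def lsi_similarities(H, Q, K):
    """Cosine similarity between Q^T T_K and each row of D_K S_K."""
    T, s, Dt = np.linalg.svd(H, full_matrices=False)
    T_K, S_K, D_K = T[:, :K], np.diag(s[:K]), Dt[:K, :].T
    q_hat = (Q.T @ T_K).ravel()   # the 1 x K row vector Q^T T_K
    docs = D_K @ S_K              # row j represents document j in K dimensions
    num = docs @ q_hat
    den = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    # Guard against zero-length vectors: similarity is defined as 0 there.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Toy index matrix (4 keywords x 3 documents) and a query equal to document 0.
H = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Q = H[:, [0]]
sims = lsi_similarities(H, Q, K=2)
```

Since the columns of T_K are orthonormal, column j of H_K equals T_K applied to row j of D_K · S_K, so the value computed here is the cosine between that column and the projection of Q onto the span of T_K. It differs from the cosine with the unprojected Q only by a factor that is the same for every document, so the ranking is unchanged.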
Further, the search device further includes:
and the model updating module 50 is used for updating the latent semantic index model corresponding to the video category to which the new video program belongs when a description document describing the new video program is added to the database.
The video program searching device provided by the embodiment of the invention calculates the cosine similarity between the query vector of the video to be searched and each column vector of the index matrix of the latent semantic index model. This yields the degree of correlation between the description entry of the video to be searched and the description document represented by each column vector; the higher the value, the higher the degree of correlation. The video programs corresponding to the description documents that are highly correlated with the description entry are then recommended to the user. In addition, the video category to which the video program belongs is input by the user, and the latent semantic index model corresponding to that video category is selected for calculation, so that the efficiency of searching for the video program can be further improved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for searching for a video program, comprising:
receiving a description entry for describing a video program and a video category to which the video program belongs, which are input by a user;
selecting a latent semantic index model corresponding to the video category, and constructing a query vector of the description entry according to a construction mode of an index matrix of the latent semantic index model; the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from description documents describing video programs of the same video category;
calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model;
sorting the cosine similarity obtained by calculation from big to small, and selecting the video program corresponding to the column vector of the cosine similarity with the sorting sequence number belonging to the sorting interval to provide for the user;
the process of constructing the index matrix by the description document describing the video program comprises the following steps: taking the word frequency of the ith keyword appearing in the description document of the jth video program as the numerical value of the ith element of the jth column of the index matrix;
the process of constructing the query vector describing the entry comprises: setting a keyword represented by an ith element of the query vector to be the same as a keyword represented by an ith row element of the index matrix, and taking a word frequency of the keyword corresponding to the ith element appearing in the description entry as a numerical value of the ith element of the query vector; wherein the query vector is a column vector.
2. The method for searching for video programs according to claim 1, wherein the process of constructing the index matrix from the description documents describing the video programs of the same video category comprises:
for all description documents which are stored in a database and describe video programs of the same video category, carrying out format adjustment on terms contained in all the description documents according to a standard term format; the database stores description documents of various video categories, one description document describes one video program, and the video programs described by different description documents are different from each other;
calling a word segmentation tool;
utilizing the word segmentation tool to segment the entries of all the description documents after format adjustment to obtain a first word set;
extracting keywords from the first set of words according to a TF-IDF algorithm;
constructing an index matrix according to the word frequency of each extracted keyword in each description document; the row sequence of the index matrix is arranged from high to low according to the total word frequency of the keywords appearing in all the description documents, and the column sequence of the index matrix is arranged from high to low according to the word frequency of the keywords appearing in each description document.
3. The method for searching for a video program according to claim 1, wherein the constructing of the query vector describing the entry specifically comprises:
according to the standard entry format, carrying out format adjustment on the description entries;
calling a word segmentation tool;
utilizing the word segmentation tool to segment the description entries after the format adjustment to obtain a second word set;
extracting keywords from the second set of words according to a TF-IDF algorithm;
and constructing a query vector of the description entries according to the word frequency of each extracted keyword appearing in the description entries.
4. The method for searching for video programs according to claim 2, wherein if the index matrix is H, the latent semantic index model obtained by singular value decomposition of the index matrix is: $H = T \cdot S \cdot D^{T}$; wherein T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H; S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H; D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H; the query vector is Q;
calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model, specifically:
selecting the matrices $T_K$, $S_K$ and $D_K$ and revising the latent semantic index model to $H_K = T_K \cdot S_K \cdot D_K^{T}$; wherein $T_K$ is a matrix formed by the first K columns of the matrix T, $S_K$ is a diagonal matrix formed by the first K diagonal elements of the matrix S, and $D_K$ is a matrix formed by the first K columns of the matrix D; the numerical value of K is larger than the maximum sorting number contained in the sorting interval;
with respect to the index matrix $H_K$ of the revised latent semantic index model, computing the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and taking the cosine similarity between this row vector and the jth row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$ as the cosine similarity between the jth column vector of the index matrix $H_K$ and the query vector Q.
5. The method for searching for a video program according to claim 1, wherein the method for searching for a video program further comprises:
when a description document describing a new video program is added to the database, updating the latent semantic index model corresponding to the video category to which the new video program belongs.
6. An apparatus for searching a video program, comprising:
the user information receiving module is used for receiving a description entry which is input by a user and used for describing a video program and a video category to which the video program belongs;
the query vector construction module is used for selecting a latent semantic index model corresponding to the video category and constructing the query vector of the description entry according to the construction mode of an index matrix of the latent semantic index model; the latent semantic index model is obtained by performing singular value decomposition on an index matrix constructed from description documents describing video programs of the same video category;
the similarity calculation module is used for calculating the cosine similarity between each column vector of the index matrix and the query vector according to the latent semantic index model;
the video program selecting module is used for sorting the cosine similarity obtained by calculation from large to small and selecting the video program corresponding to the column vector of the cosine similarity with the sorting number belonging to the sorting interval to provide for the user;
the query vector construction module includes a unit configured to construct an index matrix according to a description document describing a video program, and is specifically configured to: taking the word frequency of the ith keyword appearing in the description document of the jth video program as the numerical value of the ith element of the jth column of the index matrix;
the unit for constructing a query vector describing the entry, which is included in the query vector construction module, is specifically configured to: setting a keyword represented by an ith element of the query vector to be the same as a keyword represented by an ith row element of the index matrix, and taking a word frequency of the keyword corresponding to the ith element appearing in the description entry as a numerical value of the ith element of the query vector; wherein the query vector is a column vector.
7. The apparatus for searching for video programs according to claim 6, wherein the query vector construction module comprises a unit configured to construct an index matrix according to the description documents describing the video programs of the same video category, specifically:
the first format adjusting unit is used for adjusting the formats of all the entries contained in all the description documents which are stored in the database and describe the video programs of the same video category according to the standard entry formats; the database stores description documents of various video categories, one description document describes one video program, and the video programs described by different description documents are different from each other;
the first tool calling unit is used for calling the word segmentation tool;
the first word segmentation unit is used for performing word segmentation on the entries of all the description documents after format adjustment by using the word segmentation tool to obtain a first word set;
a first keyword extraction unit for extracting keywords from the first word set according to a TF-IDF algorithm;
the index matrix construction unit is used for constructing an index matrix according to the word frequency of each extracted keyword in each description document; the row sequence of the index matrix is arranged from high to low according to the total word frequency of the keywords appearing in all the description documents, and the column sequence of the index matrix is arranged from high to low according to the word frequency of the keywords appearing in each description document.
8. The apparatus for searching for a video program according to claim 6, wherein the query vector construction module further comprises a unit for constructing the query vector describing the entry, specifically:
the second format adjusting unit is used for carrying out format adjustment on the description entries according to the standard entry format;
the second tool calling unit is used for calling the word segmentation tool;
the second word segmentation unit is used for segmenting the description entries with the adjusted formats by using the word segmentation tool to obtain a second word set;
a second keyword extraction unit for extracting keywords from the second word set according to a TF-IDF algorithm;
and the query vector construction unit is used for constructing the query vector of the description entries according to the word frequency of each extracted keyword appearing in the description entries.
9. The apparatus for searching for video programs according to claim 7, wherein if the index matrix is H, the latent semantic index model obtained by singular value decomposition of the index matrix is: $H = T \cdot S \cdot D^{T}$; wherein T is an orthogonal matrix, and each column of the matrix T is a left singular vector of the index matrix H; S is a diagonal matrix, and the diagonal elements of the matrix S are the singular values of the index matrix H; D is an orthogonal matrix, and each column of the matrix D is a right singular vector of the index matrix H; the query vector is Q;
the similarity calculation module specifically includes:
a model revision unit for selecting the matrices $T_K$, $S_K$ and $D_K$ and revising the latent semantic index model to $H_K = T_K \cdot S_K \cdot D_K^{T}$; wherein $T_K$ is a matrix formed by the first K columns of the matrix T, $S_K$ is a diagonal matrix formed by the first K diagonal elements of the matrix S, and $D_K$ is a matrix formed by the first K columns of the matrix D; the numerical value of K is larger than the maximum sorting number contained in the sorting interval;
a computing unit for, with respect to the index matrix $H_K$ of the revised latent semantic index model, computing the row vector obtained by multiplying the transpose $Q^{T}$ of the query vector by the matrix $T_K$, and taking the cosine similarity between this row vector and the jth row vector of the matrix obtained by multiplying the matrix $D_K$ by the matrix $S_K$ as the cosine similarity between the jth column vector of the index matrix $H_K$ and the query vector Q.
10. The apparatus for searching for a video program according to claim 6, wherein said searching means further comprises:
and the model updating module is used for updating the latent semantic index model corresponding to the video category to which the new video program belongs when a description document describing the new video program is added to the database.
CN201611019485.4A 2016-11-18 2016-11-18 Video program searching method and device Active CN106708929B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201611019485.4A CN106708929B (en) 2016-11-18 2016-11-18 Video program searching method and device
PCT/CN2016/113642 WO2018090468A1 (en) 2016-11-18 2016-12-30 Method and device for searching for video program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611019485.4A CN106708929B (en) 2016-11-18 2016-11-18 Video program searching method and device

Publications (2)

Publication Number Publication Date
CN106708929A CN106708929A (en) 2017-05-24
CN106708929B true CN106708929B (en) 2020-06-26

Family

ID=58939942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611019485.4A Active CN106708929B (en) 2016-11-18 2016-11-18 Video program searching method and device

Country Status (2)

Country Link
CN (1) CN106708929B (en)
WO (1) WO2018090468A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416026B (en) * 2018-03-09 2023-04-18 腾讯科技(深圳)有限公司 Index generation method, content search method, device and equipment
CN110555127A (en) * 2018-03-30 2019-12-10 优酷网络技术(北京)有限公司 Multimedia content generation method and device
CN109918616B (en) * 2019-01-23 2020-01-31 中国人民解放军32801部队 visual media processing method based on semantic index precision enhancement
CN111177512A (en) * 2019-12-24 2020-05-19 绍兴市上虞区理工高等研究院 Scientific and technological achievement missing processing method and device based on big data
CN111651635B (en) * 2020-05-28 2023-04-28 拾音智能科技有限公司 Video retrieval method based on natural language description
CN111984851B (en) * 2020-09-03 2023-11-14 深圳平安智慧医健科技有限公司 Medical data searching method, device, electronic device and storage medium
CN113094703B (en) * 2021-03-11 2024-06-21 北京六方云信息技术有限公司 Output content filtering method and system for web intrusion detection
CN114564496B (en) * 2022-03-01 2023-09-19 北京有竹居网络技术有限公司 Content recommendation method and device
CN118364090B (en) * 2024-06-19 2024-08-27 西安羚控电子科技有限公司 Rapid generation method and device for designed scheme

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6189002B1 (en) * 1998-12-14 2001-02-13 Dolphin Search Process and system for retrieval of documents using context-relevant semantic profiles
JP2009213067A (en) * 2008-03-06 2009-09-17 Toshiba Corp Apparatus and method for program recommendation
CN103152618B (en) * 2011-12-07 2017-11-17 北京四达时代软件技术股份有限公司 Value added service of digital television content recommendation method and device
CN103559196B (en) * 2013-09-23 2017-02-22 浙江大学 Video retrieval method based on multi-core canonical correlation analysis
CN104657376B (en) * 2013-11-20 2018-09-18 航天信息股份有限公司 The searching method and device of video frequency program based on program relationship
CN104199933B (en) * 2014-09-04 2017-07-07 华中科技大学 The football video event detection and semanteme marking method of a kind of multimodal information fusion
CN105653690B (en) * 2015-12-30 2018-11-23 武汉大学 The video big data method for quickly retrieving and system of abnormal behaviour warning information constraint

Also Published As

Publication number Publication date
WO2018090468A1 (en) 2018-05-24
CN106708929A (en) 2017-05-24

Similar Documents

Publication Publication Date Title
CN106708929B (en) Video program searching method and device
CN110502621B (en) Question answering method, question answering device, computer equipment and storage medium
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN110019732B (en) Intelligent question answering method and related device
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111753060A (en) Information retrieval method, device, equipment and computer readable storage medium
Sarawagi et al. Open-domain quantity queries on web tables: annotation, response, and consensus models
CN111797214A (en) FAQ database-based problem screening method and device, computer equipment and medium
CN111291188B (en) Intelligent information extraction method and system
CN111324771B (en) Video tag determination method and device, electronic equipment and storage medium
CN111753167B (en) Search processing method, device, computer equipment and medium
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN102663129A (en) Medical field deep question and answer method and medical retrieval system
CN102831184A (en) Method and system for predicating social emotions in accordance with word description on social event
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
CN106570196B (en) Video program searching method and device
AU2018226420B2 (en) Voice assisted intelligent searching in mobile documents
CN110879834A (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN112182145A (en) Text similarity determination method, device, equipment and storage medium
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN110866102A (en) Search processing method
CN113988057A (en) Title generation method, device, equipment and medium based on concept extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant