
CN111639230A - Similar video screening method, device, equipment and storage medium - Google Patents

Similar video screening method, device, equipment and storage medium

Info

Publication number
CN111639230A
CN111639230A
Authority
CN
China
Prior art keywords
video
similar
library
frame
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010478656.XA
Other languages
Chinese (zh)
Other versions
CN111639230B (en)
Inventor
罗雄文
刘振强
卢江虎
项伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bigo Technology Pte Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd filed Critical Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202010478656.XA priority Critical patent/CN111639230B/en
Publication of CN111639230A publication Critical patent/CN111639230A/en
Application granted granted Critical
Publication of CN111639230B publication Critical patent/CN111639230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/71 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73 Querying
    • G06F 16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a similar video screening method, device, equipment and storage medium. The method comprises the following steps: constructing, according to the video-level features of each video to be selected in a video library to be selected, a similar video library of each video to be selected at a specified similarity scale; searching the similar video library of each video to be selected for similar videos whose similarity to that video exceeds a preset similarity threshold, to obtain a corresponding video pair candidate library; and screening corresponding similar video pairs out of the video pair candidate library based on the frame-level features of the two videos of each candidate video pair, extracted under the corresponding similarity weights. In the technical scheme provided by the embodiment of the invention, the double screening improves the comprehensiveness and accuracy of similar video screening, while the similarity weights of the two videos of each candidate pair control the amount of computation spent extracting their frame-level features, reducing the frame-level feature extraction workload and improving the efficiency of similar video screening.

Description

Similar video screening method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of video processing, in particular to a method, a device, equipment and a storage medium for screening similar videos.
Background
With the rapid development of the short video and live broadcast industries, the volume of online video keeps growing, and the daily increment of short videos can reach the tens of millions. Newly added videos with repeated content create a large amount of redundancy for audit work and pose a huge risk to the copyright protection of video originators. Similar videos with repeated content therefore need to be screened out and eliminated from the newly added videos, so as to ensure the efficiency of video audit and the copyright safety of video originators.
At present, the similarity between different videos is usually calculated by extracting features from each video at video-level or picture-level granularity and then comparing those features across videos. For features at video-level granularity, only one feature extraction step needs to be executed per video: the features of a few sparsely sampled video frames are fused and reduced in dimension to obtain the video's feature at video-level granularity. The resulting feature dimension is low, so the similarity between different videos cannot be represented accurately. Features at picture-level granularity do capture the spatial details of the video in its individual frames, but the feature extraction step must be executed for a large number of video frames in each video, so the computation cost of feature extraction is excessive, the temporal association between different frames of a video is split apart, and the efficiency of similarity calculation between different videos is greatly reduced.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for screening similar videos, which realize similarity judgment of the videos to be selected in a video library to be selected by combining video-level features and frame-level features, and improve the comprehensiveness and accuracy of similar video screening.
In a first aspect, an embodiment of the present invention provides a method for screening similar videos, where the method includes:
constructing a similar video library of each video to be selected in a specified similar scale according to the video level characteristics of each video to be selected in the video library to be selected;
respectively searching out similar videos with the similarity exceeding a preset similar threshold from a similar video library of each video to be selected to obtain a corresponding video pair candidate library;
and screening out corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
In a second aspect, an embodiment of the present invention provides a similar video screening apparatus, where the apparatus includes:
the similar library construction module is used for constructing a similar video library of each video to be selected in the video library to be selected under the appointed similar scale according to the video level characteristics of each video to be selected;
the candidate base generation module is used for respectively searching out similar videos of which the similarity with the video to be selected exceeds a preset similar threshold from the similar video base of each video to be selected to obtain corresponding video pair candidate bases;
and the similar video screening module is used for screening out corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of the video parties of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
In a third aspect, an embodiment of the present invention provides an apparatus, where the apparatus includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the method for screening similar videos according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for filtering similar videos according to any embodiment of the present invention.
In the screening method, device, equipment and storage medium for similar videos provided by the embodiments of the invention, a similar video library at a specified similarity scale is first constructed for each video to be selected, according to the video-level features of each video to be selected in the video library to be selected, so that a first screening of similar videos is performed at the specified similarity scale. Then, similar videos whose similarity to the video to be selected exceeds a preset similarity threshold are searched out from the similar video library of each video to be selected, and the video to be selected and the found similar videos form a corresponding video pair candidate library, so that a second screening of similar videos is performed in each similar video library using the preset similarity threshold; this double screening reduces the number of candidate video pairs in the video pair candidate library. Finally, based on the frame-level features of the two videos of each candidate video pair, extracted under the corresponding similarity weights, the corresponding similar video pairs are screened out from the video pair candidate library. Similarity judgment of the videos to be selected is thus realized by combining video-level features and frame-level features, and the double screening improves the comprehensiveness and accuracy of similar video screening. Meanwhile, the similarity weights of the two videos of each candidate video pair control the amount of computation spent extracting their frame-level features: frame-level features do not need to be extracted from every video in the video library to be selected, so the frame-level feature extraction workload is reduced and the screening efficiency of similar videos is improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
fig. 1A is a flowchart of a method for screening similar videos according to an embodiment of the present invention;
fig. 1B is a schematic diagram of a screening process of similar videos according to an embodiment of the present invention;
fig. 2A is a flowchart of a method for screening similar videos according to a second embodiment of the present invention;
fig. 2B is a schematic diagram illustrating a video-level feature extraction process in the method according to the second embodiment of the present invention;
fig. 2C is a schematic structural diagram of a deep separation residual error network in the method according to the second embodiment of the present invention;
fig. 2D is a schematic diagram illustrating a principle that a deep separation residual error network performs convolution operation in the method according to the second embodiment of the present invention;
fig. 2E is a schematic structural diagram of a spatio-temporal separation residual error network in the method according to the second embodiment of the present invention;
FIG. 2F is a schematic diagram of the structure of each spatio-temporal convolution layer in the spatio-temporal separation residual error network in the method according to the second embodiment of the present invention;
fig. 3A is a flowchart of a method for screening similar videos according to a third embodiment of the present invention;
fig. 3B is a schematic diagram of a frame-level feature extraction process in the method according to the third embodiment of the present invention;
fig. 3C is a schematic structural diagram of a multi-scale attention residual error network in the method according to the third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a similar video screening apparatus according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
Example one
Fig. 1A is a flowchart of a method for screening similar videos according to an embodiment of the present invention, which is applicable to any situation where similar videos are screened from a large amount of video data. The screening method for similar videos provided in this embodiment may be executed by the screening apparatus for similar videos provided in the embodiment of the present invention, where the apparatus may be implemented in a software and/or hardware manner, and is integrated in a device that executes the method, where the device may be a background server that is specially responsible for uploading and storing video data.
Specifically, referring to fig. 1A, the method may include the steps of:
and S110, constructing a similar video library of each video to be selected in the designated similar scale according to the video level characteristics of each video to be selected in the video library to be selected.
In this embodiment, the video library to be selected refers to a video set that contains a large amount of video data and requires similarity analysis between its videos, for example the set of online videos newly added at a video server every day. Auditors need to audit illegal videos in this set and take them offline; to reduce redundancy in the audit work, similar videos with repeated content in the set only need to be audited once, so similar videos must be screened out of the set. The video-level feature of a video to be selected is a feature, extracted at video granularity, that represents the picture content of that video. The feature extraction process only needs to be executed once per video, and the feature dimension is low; consequently the video-level feature can only represent local picture characteristics of the video and cannot comprehensively represent the picture details of its individual video frames.
Specifically, the video library to be selected that requires similar video screening is obtained first, and the video-level feature of each video to be selected in it is extracted at video granularity; any existing video-level feature extraction network may be used. The video-level features are then used to analyze the similarity between every two videos to be selected in the library. A specified similarity scale is preset; it can be obtained through empirical analysis of the history of similar video screening and should cover the upper bound on the number of truly similar videos any video in the library may have. For each video to be selected, the other videos are sorted by their similarity to it; the other videos that fall within the specified similarity scale, i.e., the closest ones, are kept, while those outside it are directly filtered out as non-similar videos. This constitutes the first screening of similar videos in the library to be selected. The retained videos form the similar video library of that video at the specified similarity scale, which roughly filters out videos that are not similar to it. Performing this first screening for every video to be selected generates a similar video library for each of them; other videos with higher similarity are subsequently searched within these libraries, improving the accuracy of similar video screening.
For example, after the video-level features of all videos to be selected have been extracted, the similar video library construction based on them serves to rapidly filter out non-similar videos. A hash index is established for the video-level features of each video to be selected, and the hash index together with a set distance metric representing the similarity between videos is used to rapidly build the set of pairwise feature distances in the library; the distance metric in this embodiment may be the cosine distance. An existing K-nearest-neighbor search algorithm is then used to find, for each video to be selected, the topK other videos with the highest similarity to it, where topK is the specified similarity scale of this embodiment. The other videos found for each video to be selected form its similar video library at the specified similarity scale, realizing a first filtering of its low-similarity non-similar videos; similar videos with higher similarity are searched out of these libraries subsequently, ensuring the accuracy of similar video screening.
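As an illustration of this first screening, the following sketch builds a cosine-similarity top-K index; it assumes the faiss library and unit-normalized video-level feature vectors, and the function name, variable names and top_k value are illustrative rather than taken from the patent:

```python
import faiss
import numpy as np

def build_similar_video_libraries(features: np.ndarray, top_k: int = 50):
    """First screening: for every video, keep only its top_k nearest
    neighbors by cosine similarity and discard everything else."""
    feats = features.astype(np.float32)
    faiss.normalize_L2(feats)               # cosine similarity == inner product
    index = faiss.IndexFlatIP(feats.shape[1])
    index.add(feats)
    # Each query also retrieves itself, so ask for top_k + 1 results.
    sims, ids = index.search(feats, top_k + 1)
    similar_library = {}
    for v, (row_ids, row_sims) in enumerate(zip(ids, sims)):
        similar_library[v] = [(int(j), float(s))
                              for j, s in zip(row_ids, row_sims) if j != v]
    return similar_library
```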
And S120, respectively searching out similar videos of which the similarity with the video to be selected exceeds a preset similar threshold from the similar video library of each video to be selected, and obtaining corresponding video pair candidate libraries.
Optionally, after the similar video library of each video to be selected has been constructed at the specified similarity scale, videos that are not actually similar to a given video may still be present in its similar video library, because video-level features cannot comprehensively represent the picture details of individual frames; nevertheless, these libraries can cover all truly similar videos in the library to be selected, which improves the comprehensiveness of similar video screening. To further ensure accuracy, this embodiment performs an additional screening on the similar video library of each video to be selected, i.e., a second screening of similar videos, as shown in fig. 1B. A preset similarity threshold is configured for this second screening; it is obtained from an existing video similarity evaluation service based on historical experience of judging video similarity with video-level features, so that it can reliably distinguish similar from non-similar videos given their video-level features. The similarity between each video to be selected and every other video in its similar video library is first computed, and the videos whose similarity exceeds the preset similarity threshold are taken as its similar videos under the second screening. Each video to be selected then forms a corresponding video pair with each of these similar videos, and the pairs are added to a pre-established video pair candidate library; the similarity of each candidate video pair is subsequently judged further using frame-level features, ensuring the accuracy of similar video screening.
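Continuing the sketch above, the second screening keeps only the neighbors whose video-level similarity exceeds the preset threshold; the 0.8 default is a placeholder, not a value from the patent:

```python
def build_pair_candidate_library(similar_library, sim_threshold: float = 0.8):
    """Second screening: keep only (video, neighbor) pairs whose
    video-level similarity exceeds the preset similarity threshold."""
    candidate_pairs = set()
    for v, neighbors in similar_library.items():
        for j, s in neighbors:
            if s > sim_threshold:
                candidate_pairs.add((min(v, j), max(v, j)))  # dedupe (a,b)/(b,a)
    return candidate_pairs
```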
S130, based on the frame level characteristics of the video pair of each candidate video pair in the video pair candidate library under the corresponding similarity weight, screening out the corresponding similar video pair from the video pair candidate library.
Optionally, each candidate video pair in the video pair candidate library was judged similar using the video-level features of the videos to be selected, and since video-level features cannot comprehensively represent frame-by-frame picture details, pairs whose actual picture content is dissimilar may remain in the candidate library. Meanwhile, the candidate library is obtained through double screening of similar videos in the library to be selected, so the number of candidate pairs it contains is small; extracting frame-level features for the two videos of each candidate pair at video-frame granularity therefore does not excessively increase the feature extraction workload. Accordingly, for each candidate video pair in the candidate library, an existing frame-level feature extraction network is used to extract the frame-level features of its two videos, and these features are used to further judge the similarity between the two videos, improving the comprehensiveness and accuracy of similar video screening while preserving its efficiency.
Specifically, different candidate video pairs in the candidate library have different similarities. If one video of a candidate pair also appears in several other candidate pairs, that video has a similarity with several other videos in the candidate library. If those similarities are all high, only a small number of video frames needs to be extracted from the video to compute frame-level features for judging its similarity with the other videos; if those similarities are all low, a larger number of video frames must be extracted, so that the frame-level features can show the detailed pictures of the video and its similarity with the other videos can be judged carefully. Therefore, in this embodiment, a corresponding similarity weight is set for the two videos of each candidate video pair; the similarity weight indicates how many video frames should be extracted when computing frame-level features, reducing feature extraction on unnecessary video frames and improving the feature extraction efficiency of the similar video screening process.
Further, for each candidate video pair in the candidate library, the similarity weights of its two videos are first used to extract a corresponding number of video frames from each video; corresponding frame-level features are then extracted from those frames, and the frame-level features of the two videos are used to recompute the similarity of the candidate pair, ensuring its accuracy. According to these frame-level similarities, the candidate pairs whose similarity exceeds a specified similarity threshold are screened out of the candidate library as the similar video pairs finally obtained from the library to be selected in this embodiment; see the sketch below.
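One plausible shape for this final screening step, assuming a hypothetical helper frame_level_similarity that samples frames from both videos according to their similarity weights and compares frame-level features; the threshold value is illustrative:

```python
def screen_similar_pairs(candidate_pairs, weight, frame_level_similarity,
                         final_threshold: float = 0.85):
    """Final screening: verify each candidate pair with frame-level
    features; the similarity weight controls how many frames are sampled."""
    similar_pairs = []
    for a, b in candidate_pairs:
        if frame_level_similarity(a, b, weight[a], weight[b]) > final_threshold:
            similar_pairs.append((a, b))
    return similar_pairs
```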
In addition, in this embodiment a similar video pair library is pre-established for storing the similar video pairs screened from the large video library to be selected. To ensure the comprehensiveness of similar video screening, after constructing the similar video library of each video to be selected at the specified similarity scale, the method may further include: submitting, to the pre-created similar video pair library, each video pair consisting of a video to be selected and a similar video in its similar video library whose similarity to that video exceeds a preset submission threshold.
Specifically, the similar video library of each video to be selected is obtained by computation on the video-level features of the videos in the library to be selected. A preset submission threshold is configured for these video-level similarities; it is higher than the preset similarity threshold used in the second screening performed after the similar video libraries are constructed. If the similarity between a video to be selected and some similar video in its similar video library exceeds the submission threshold, the two are very likely to be truly similar, so this embodiment forms the corresponding video pair and pre-submits it to the created similar video pair library for storage. The video pair candidate library, built from pairs whose similarity exceeds the preset similarity threshold, also contains the pairs submitted this time. Therefore, after the corresponding similar video pairs have been screened out of the candidate library, the similar video pair library is updated accordingly: the similar pairs screened from the candidate library are stored into it, the pairs that were pre-submitted on the basis of video-level features are re-judged for true similarity, and pairs stored in the library that turn out dissimilar under frame-level features are deleted, ensuring the accuracy of the similar video pairs.
Meanwhile, after the similar video pairs of the library to be selected have been submitted to the similar video pair library, the library to be selected can be deduplicated according to the similar video pairs in that library, and the deduplicated library pushed to the corresponding audit server, so that the audit server only needs to audit once among different videos with repeated content, reducing the redundancy of video audit. In addition, this embodiment can push the updated similar video pair library to a copyright protection server to handle copyright claims on similar videos, ensuring the copyright safety of video originators.
In the technical solution provided by this embodiment, a similar video library at a specified similarity scale is first constructed for each video to be selected according to the video-level features of the videos in the library to be selected, performing a first screening of similar videos at the specified scale. Similar videos whose similarity to the video to be selected exceeds a preset similarity threshold are then searched out of each similar video library, and the videos to be selected together with the found similar videos form the corresponding video pair candidate library, performing a second screening at the preset threshold; this double screening reduces the number of candidate video pairs. Finally, the corresponding similar video pairs are screened out of the candidate library based on the frame-level features of the two videos of each candidate pair, extracted under the corresponding similarity weights. Similarity judgment of the videos to be selected thus combines video-level and frame-level features, and the double screening improves the comprehensiveness and accuracy of similar video screening. Meanwhile, the similarity weights of the two videos of each candidate pair control their frame-level feature extraction workload, so frame-level features need not be extracted from every video in the library to be selected, reducing the extraction workload and improving the screening efficiency of similar videos.
Example two
Fig. 2A is a flowchart of a similar video screening method according to a second embodiment of the present invention, and fig. 2B is a schematic diagram of a principle of a video-level feature extraction process according to the second embodiment of the present invention. The embodiment is optimized on the basis of the embodiment. Specifically, as shown in fig. 2A, the present embodiment mainly explains in detail the specific extraction process of the video-level feature of each video to be selected in the video library to be selected.
Optionally, as shown in fig. 2A, the present embodiment may include the following steps:
and S210, extracting corresponding sparse video frames from each video to be selected in the video library to be selected.
Optionally, when extracting the video-level feature of each video to be selected in the library, a sparse sampling density is first set, corresponding sparse video frames are extracted from each video at that density, the frame features of the sparse video frames are extracted, and the frame features of the sparse video frames of each video are fused to obtain its video-level feature.
S220, according to the time sequence information of the sparse video frames in the video to be selected, the depth separation characteristics of each sparse video frame in the video to be selected are fused to obtain the two-dimensional space characteristics of the video to be selected.
Optionally, in this embodiment, a depth-separation convolution operation may be used to extract the depth separation feature of each sparse video frame in the video to be selected; depth-separation feature extraction effectively avoids the redundant feature-map computation of ordinary convolution and greatly reduces convolution overhead. Meanwhile, since the sparse video frames have a temporal order within the video, the depth separation features of the sparse frames can be ordered according to their timing information, and then fused across each feature dimension in combination with that timing information to obtain the two-dimensional spatial feature of the video to be selected.
For example, as shown in fig. 2B, in this embodiment, according to the time sequence information of the sparse video frames in the video to be selected, the depth separation feature of each sparse video frame in the video to be selected is fused to obtain the two-dimensional spatial feature of the video to be selected, which may specifically include: and extracting the depth separation characteristics of each sparse video frame in the video to be selected by adopting a pre-constructed depth separation residual error network, and fusing the depth separation characteristics of each sparse video frame by adopting a preset pseudo time sequence convolution kernel under different characteristic dimensions to obtain the two-dimensional space characteristics of the video to be selected.
Specifically, the depth separation residual network in this embodiment is composed of cascaded single-channel convolution kernels and full-channel convolution kernels; its overall structure is shown in fig. 2C, where a single-channel convolution kernel is an M × M kernel applied within a single channel and a full-channel convolution kernel is a 1 × 1 kernel spanning all channels. After the sparse video frames of the video to be selected are input into the network in temporal order, convolution is performed on them successively by the cascaded spatial convolution layers (stages). As shown in fig. 2D, each stage splits the conventional M × M convolution into two successive operations: first, C independent single-channel M × M kernels each perform a convolution on the feature map of one channel output by the previous stage (a channel can be regarded as a feature dimension), yielding a group of feature maps with the same number of channels; second, N full-channel 1 × 1 kernels convolve the feature map groups across channels, producing feature outputs in N channels, which are fed to the next spatial convolution layer. The number of output channels, i.e., the dimension of the extracted feature, is thus controlled by the number of full-channel kernels in each stage, so the parameter count of the depth-separation kernels is far smaller than that of conventional kernels. The network also uses an inverted residual mechanism (channel expansion followed by channel compression) and linear transformation outputs to reduce feature loss and offset after convolution mapping, and removes the bypass convolution of conventional residual networks, so that the reduced parameter count does not excessively degrade feature extraction accuracy.
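A sketch of one such depth-separation block in PyTorch, following the description above: an inverted residual that expands channels with full-channel 1 × 1 kernels, applies a per-channel M × M kernel, then compresses with a linear 1 × 1 projection and has no bypass convolution. Layer sizes, the expansion factor and activation choice are illustrative assumptions, not the patent's exact design:

```python
import torch
import torch.nn as nn

class DepthSeparationBlock(nn.Module):
    """Inverted residual: 1x1 expand -> depthwise MxM -> linear 1x1 compress."""
    def __init__(self, c_in, c_out, expand=4, m=3, stride=1):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),            # full-channel 1x1 (expand)
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_mid, m, stride=stride, padding=m // 2,
                      groups=c_mid, bias=False),               # single-channel MxM (depthwise)
            nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),
            nn.Conv2d(c_mid, c_out, 1, bias=False),            # 1x1 compress
            nn.BatchNorm2d(c_out),                             # no activation: linear output
        )
        self.use_skip = stride == 1 and c_in == c_out          # plain skip, no bypass conv

    def forward(self, x):
        return x + self.block(x) if self.use_skip else self.block(x)
```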
In this embodiment, after the sparse video frames of the video to be selected are input into the depth separation residual network in temporal order, each cascaded spatial convolution layer (stage) performs the depth separation operation described above with its single-channel and full-channel kernels, and after the successive convolutions the network outputs a multi-channel feature map group for each sparse video frame. Channel-by-channel global average pooling (GAP) then summarizes the feature map group of each channel to obtain the depth separation feature of the sparse frame. A preset pseudo-timing convolution kernel performs a weighted summation of the depth separation feature values of the sparse frames in each feature dimension, reducing the interference of channel noise on the feature values; this pseudo-timing kernel may be an F × 1 convolution kernel, where F is the number of sparse video frames in the video to be selected. The depth separation features of all sparse frames are thereby fused into the two-dimensional spatial feature of the video to be selected, and the depth-separation feature extraction operation ensures the accuracy of that feature.
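The fusion step can be sketched as follows: global average pooling turns each frame's feature maps into a C-dimensional vector, and a single learned F × 1 "pseudo-timing" kernel computes a weighted sum over the F frames, shared across feature dimensions. This is a minimal reading of the description, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class PseudoTemporalFusion(nn.Module):
    """Fuse F per-frame depth separation features into one video-level vector."""
    def __init__(self, num_frames):
        super().__init__()
        # One F x 1 kernel: a learned weighted sum over the frame axis,
        # shared across all feature dimensions.
        self.fuse = nn.Conv1d(1, 1, kernel_size=num_frames, bias=False)

    def forward(self, frame_maps):           # frame_maps: (F, C, H, W)
        f, c = frame_maps.shape[:2]
        vecs = frame_maps.mean(dim=(2, 3))   # channel-wise GAP -> (F, C)
        x = vecs.t().reshape(c, 1, f)        # treat the frame axis as the conv axis
        return self.fuse(x).reshape(c)       # (C,) two-dimensional spatial feature
```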
And S230, determining the three-dimensional space-time separation characteristic of the video to be selected according to the time sequence information of the sparse video frames in the video to be selected and the depth separation characteristic of each sparse video frame.
Optionally, after the depth separation feature of each sparse video frame in the video to be selected has been extracted, each such feature has a corresponding feature map. The feature maps corresponding to the depth separation features of the sparse frames can be combined into a three-dimensional feature map group according to the frames' timing information in the video, so that the group carries a temporal dimension on top of the spatial features. A space-time separated convolution operation then extracts the three-dimensional spatio-temporal separation feature of this group for the video to be selected; separating the spatial convolution from the temporal convolution avoids mutual interference between temporal and spatial features.
For example, as shown in fig. 2B, in this embodiment, determining a three-dimensional space-time separation characteristic of the video to be selected according to the timing information of the sparse video frames in the video to be selected and the depth separation characteristic of each sparse video frame may specifically include: determining a middle convolution characteristic map of each sparse video frame in the video to be selected according to the depth separation characteristic of each sparse video frame in the video to be selected, and combining the middle convolution characteristic maps of the sparse video frames into a three-dimensional reference characteristic map group of the video to be selected according to the time sequence information of the sparse video frames in the video to be selected; and extracting the three-dimensional space-time separation characteristics of the three-dimensional reference characteristic graph group through a pre-constructed space-time separation residual error network.
The main residual branch of each space-time convolution layer in the space-time separation residual network is composed of a space convolution kernel and a time convolution kernel which are cascaded, the bypass residual branch of the space-time convolution layer is composed of a space convolution kernel and a time convolution kernel which are connected in parallel, and the overall structure of the space-time separation residual network is shown in fig. 2E.
Specifically, the cascaded spatial convolution kernels of the depth separation residual network continuously convolve each sparse video frame of the video to be selected; when the depth separation features are extracted, the intermediate convolution feature map of each sparse frame is taken from the output of the last spatial convolution kernel of that network. According to the frames' timing information, the intermediate convolution feature maps are combined into the three-dimensional reference feature map group of the video to be selected, which carries both temporal and spatial characteristics. This group is the input of the space-time separation residual network of this embodiment, and each of its spatio-temporal convolution layers separates temporal from spatial features: for example, a 3 × 3 × 3 spatio-temporal kernel is split into a 1 × 3 × 3 spatial kernel and a 3 × 1 × 1 temporal kernel, which first map the spatial detail information of each intermediate feature map and then learn the temporal relation mapping between the maps. This reduces redundant feature extraction computation, avoids mutual interference between temporal and spatial features, and removes roughly one third of the parameters of the space-time separation residual network, further accelerating the extraction of the three-dimensional spatio-temporal separation feature.
In addition, the space-time separation residual network also places, at the entrance of each spatio-temporal convolution layer, a projection residual structure in which a spatial kernel and a temporal kernel are used both in cascade and in parallel: on the main residual branch of each layer, the cascaded spatial and temporal kernels learn the spatio-temporal characteristics of the three-dimensional reference feature map group, while on the bypass residual branch, a spatial kernel and a temporal kernel in parallel strengthen the local expression of the group's temporal and spatial features. The kernels on both branches can down-sample the feature map group, ensuring that features are summarized progressively; finally, channel-by-channel GAP performs a global feature summary per channel, yielding the three-dimensional spatio-temporal separation feature of the three-dimensional reference feature map group.
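A sketch of one spatio-temporal convolution layer as described: the main branch cascades a 1 × 3 × 3 spatial kernel and a 3 × 1 × 1 temporal kernel, while the bypass branch applies a spatial and a temporal kernel in parallel and sums them. This is a (2+1)D-style factorization sketch; exact channel counts, strides and normalization are assumptions:

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Main branch: 1x3x3 spatial conv then 3x1x1 temporal conv (cascade).
    Bypass branch: one spatial and one temporal conv in parallel, summed."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        full = (stride, stride, stride)                  # net down-sampling of the block
        self.main = nn.Sequential(
            nn.Conv3d(c_in, c_out, (1, 3, 3), stride=(1, stride, stride),
                      padding=(0, 1, 1), bias=False),    # spatial detail mapping
            nn.BatchNorm3d(c_out), nn.ReLU(inplace=True),
            nn.Conv3d(c_out, c_out, (3, 1, 1), stride=(stride, 1, 1),
                      padding=(1, 0, 0), bias=False),    # temporal relation mapping
            nn.BatchNorm3d(c_out))
        self.bypass_spatial = nn.Conv3d(c_in, c_out, (1, 3, 3), stride=full,
                                        padding=(0, 1, 1), bias=False)
        self.bypass_temporal = nn.Conv3d(c_in, c_out, (3, 1, 1), stride=full,
                                         padding=(1, 0, 0), bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                                # x: (N, C, T, H, W)
        bypass = self.bypass_spatial(x) + self.bypass_temporal(x)
        return self.relu(self.main(x) + bypass)
```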
S240, splicing the two-dimensional space characteristic and the three-dimensional space-time separation characteristic of the video to be selected to obtain the video-level characteristic of the video to be selected.
Optionally, after the two-dimensional space feature and the three-dimensional space-time separation feature of the video to be selected are obtained, as shown in fig. 2B, the two-dimensional space feature and the three-dimensional space-time separation feature of each video to be selected are spliced to obtain the video-level feature of the video to be selected, so that the two-dimensional space feature and the three-dimensional space-time separation feature are mixed in the video-level feature, and the accuracy of the video-level feature of the video to be selected is greatly improved.
It should be noted that, given the different emphases of video-level and frame-level feature extraction, different types of training labels and loss functions are used for the two. The depth separation residual network and the space-time separation residual network for video-level feature extraction can be trained under the guidance of cluster-level video labels: every video in the training data set is augmented multiple times, all videos obtained by augmenting the same source video (including the original) form a cluster, and the videos within a cluster share the same numeric label; the videos obtained by data enhancement are also used for training. Since training guided by cluster-level labels involves a very large number of classes, this embodiment uses the ArcFace loss combined with cross-entropy, adding an appropriate penalty boundary near the decision surface to increase tolerance of the large class count. When the depth separation residual network and the space-time separation residual network are used for video-level feature extraction, the corresponding loss computation layer is removed, producing a standalone feature extraction network and making the extracted video-level features more accurate.
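A condensed sketch of the ArcFace-with-cross-entropy loss described above, assuming integer cluster-level labels; the margin and scale values are common defaults, not values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceLoss(nn.Module):
    """Additive angular margin on the target logit, then cross-entropy."""
    def __init__(self, feat_dim, num_classes, margin=0.5, scale=64.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.margin, self.scale = margin, scale

    def forward(self, feats, labels):
        # Cosine between normalized features and normalized class centers.
        cos = F.linear(F.normalize(feats),
                       F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = F.one_hot(labels, cos.size(1)).bool()
        # Penalty boundary: add the angular margin only on the true class.
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)
```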
And S250, constructing a similar video library of each video to be selected in the designated similar scale according to the video level characteristics of each video to be selected in the video library to be selected.
And S260, respectively searching out similar videos of which the similarity with the video to be selected exceeds a preset similar threshold from the similar video library of each video to be selected, and obtaining corresponding video pair candidate libraries.
And S270, screening out corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of the video pairs of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
In the technical scheme provided by this embodiment, feature extraction under depth separation followed by fusion of the sparse video frames extracted from the video to be selected yields its two-dimensional spatial feature, improving the extraction efficiency of that feature; feature extraction under space-time separation applied to the depth separation features of the sparse frames yields the three-dimensional spatio-temporal separation feature, improving its extraction efficiency as well. The two-dimensional spatial feature and the three-dimensional spatio-temporal separation feature of the video to be selected are then spliced into its video-level feature, so that both kinds of information are mixed in the video-level feature; on the basis of efficient feature extraction, the accuracy of the video-level feature of the video to be selected is thus improved.
EXAMPLE III
Fig. 3A is a flowchart of a similar video screening method according to a third embodiment of the present invention, and fig. 3B is a schematic diagram of a principle of a frame-level feature extraction process in the method according to the third embodiment of the present invention. The embodiment is optimized on the basis of the embodiment. Specifically, as shown in fig. 3A, the present embodiment mainly explains in detail the specific process of extracting the frame-level features of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weights.
Optionally, as shown in fig. 3A, the present embodiment may include the following steps:
and S310, constructing a similar video library of each video to be selected in the designated similar scale according to the video level characteristics of each video to be selected in the video library to be selected.
S320, respectively searching out similar videos of which the similarity with the video to be selected exceeds a preset similar threshold from the similar video library of each video to be selected, and obtaining corresponding video pair candidate libraries.
S330, aiming at any one video in each candidate video pair in the video pair candidate library, determining the similarity weight of the video according to the similarity between the video and other videos in the video pair candidate library.
Optionally, each candidate video pair in the candidate library was judged similar using the video-level features of the videos to be selected, so its similarity is not precise; moreover, the candidate library is obtained through double screening of the library to be selected and contains few candidate pairs, so the frame-level features of the two videos of each pair must be used to further judge the similarity between them. Since different candidate pairs in the candidate library have different similarities, a video that appears in several candidate pairs has a similarity with several other videos in the candidate library: if those similarities are all high, only a few frames need to be extracted from the video to compute frame-level features for judging its similarity with the other videos, whereas if they are all low, a larger number of frames must be extracted so that the frame-level features can show the video's detailed pictures and support a careful similarity judgment. Therefore, before extracting the frame-level features of the two videos of each candidate pair, their similarity weights must be determined. In this embodiment, for any video of any candidate pair, every candidate pair containing that video is found in the candidate library, the similarity of the video with the other video of each such pair is determined, and the average of the similarities over all candidate pairs containing the video is taken as its similarity weight; video frames are then extracted from the video according to its similarity weight for frame-level feature extraction.
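A sketch of the similarity-weight computation: each video's weight is the average of the video-level similarities of all candidate pairs it appears in (the dictionary layout and names are illustrative):

```python
from collections import defaultdict

def similarity_weights(pair_sims):
    """pair_sims: {(a, b): video-level similarity of the candidate pair}.
    Returns the mean similarity over every pair each video participates in."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (a, b), s in pair_sims.items():
        for v in (a, b):
            sums[v] += s
            counts[v] += 1
    return {v: sums[v] / counts[v] for v in sums}
```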
And S340, determining the frame sampling rates of the two video parties according to the similarity weight of the two video parties of each candidate video pair in the video pair candidate library, and respectively extracting the target video frames of the two video parties under the corresponding frame sampling rates.
Optionally, after the similarity weights of the two videos of each candidate pair have been determined, a maximum frame sampling rate and a minimum frame sampling rate are preset, and the most suitable frame sampling rate for each of the two videos is selected between them according to the videos' similarity weights. The target video frames of the two videos are then extracted at their respective frame sampling rates; these target frames are used subsequently to extract the corresponding frame-level features. Because the two videos' frame sampling rates differ, the numbers of target frames extracted from them differ as well. A suitable sampling rate is thus chosen for the actual similarity situation of the two videos of each candidate pair, rather than setting the same rate for every video, which reduces unnecessary feature extraction computation on target frames and improves the accuracy of the frame-level features while keeping their extraction as efficient as possible.
For example, the frame sampling rate may be calculated by linear interpolation between the preset bounds:

F_w = F_max - w · (F_max - F_min)

wherein F_w is the frame sampling rate of a video of each candidate video pair under its corresponding similarity weight w, F_min is the preset minimum frame sampling rate, and F_max is the preset maximum frame sampling rate; a higher similarity weight thus yields a lower frame sampling rate, consistent with extracting fewer frames from videos that are already confidently similar.
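For illustration only, a minimal Python sketch of this interpolation follows; the clamping of the weight to [0, 1] is an added safeguard rather than part of the patent:

def frame_sampling_rate(weight, f_min, f_max):
    """Linearly interpolates between f_max and f_min: a higher similarity
    weight (more confidently similar pair) yields a lower frame sampling rate."""
    weight = min(max(weight, 0.0), 1.0)  # clamp to [0, 1]; an added safeguard
    return f_max - weight * (f_max - f_min)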
And S350, extracting attention features of each target video frame of the two videos at the corresponding frame sampling rate under different spatial scales through a pre-constructed multi-scale attention residual network, and splicing the attention features of each target video frame under each spatial scale to obtain the multi-scale attention features of the target video frame.
Optionally, after the target video frames of the two videos of each candidate video pair are extracted at the corresponding frame sampling rates, the extracted target video frames are directly input into a pre-constructed multi-scale attention residual network, as shown in fig. 3B. The multi-scale attention residual network extracts, through an attention mechanism, the corresponding attention features of each target video frame at the different spatial scales set in the network, and the attention features of each target video frame at each spatial scale are spliced to obtain the multi-scale attention features of that target video frame.
Illustratively, as shown in fig. 3C, the multi-scale attention residual network in this embodiment is configured, at each spatial scale, with attention weights satisfying a specific spatial probability distribution. In this case, extracting the attention features of each target video frame of the two videos at the corresponding frame sampling rate under different spatial scales through the pre-constructed multi-scale attention residual network may specifically include: extracting frame-level sub-features of each target video frame at different spatial scales, and adjusting the frame-level sub-features at each spatial scale with the attention weights at that scale to obtain the attention features of the target video frame at the different spatial scales.
Specifically, the multi-scale attention residual network in this embodiment takes the common ResNet-50 (Res50) as its backbone, with 50 layers and 16 bottleneck residual blocks. These bottleneck residual blocks are mainly of two types: a projection residual block with a bypass convolution is used at the entrance of each convolution stage, so that features of the target video frame can be summed stage by stage, while a residual block containing only the main-branch convolutions is used at the other positions to reduce the number of feature-extraction parameters in the network. Meanwhile, to reduce unnecessary redundancy in feature extraction, the multi-scale attention residual network also greatly reduces the number of channels of each convolution layer by halving the number of convolution kernels, which likewise greatly reduces the feature-extraction computation of the whole multi-scale attention residual network.
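For illustration only, a minimal PyTorch sketch of the two bottleneck variants described above follows; the channel widths, batch normalization, and activation placement follow standard ResNet practice and are assumptions here:

import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block. With project=True a bypass 1x1 convolution
    (projection shortcut) is used, as at the entrance of each convolution
    stage; with project=False the shortcut is the identity, which requires
    in_ch == out_ch and stride == 1. Channel widths are assumed halved
    relative to standard ResNet-50."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, project=False):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = (nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch)) if project else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Feature summation of the main branch and the (projected or identity) shortcut.
        return self.relu(self.main(x) + self.shortcut(x))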
Further, after the target video frames of the two videos of each candidate video pair are input into the multi-scale attention residual network of this embodiment, as shown in fig. 3C, each convolution layer of the network is taken as a corresponding spatial scale, and each convolution layer outputs a feature map group at its spatial scale to represent the meta-frame features at that scale. 1×1 convolution kernels are then used to compress the number of channels of the feature map group at each spatial scale, so as to model the deep connections of the feature map group at the current scale and reduce the feature dimensions of the frame-level sub-features of the target video frame at the different spatial scales; to avoid feature loss, a linear transformation is used here instead of the conventional nonlinear ReLU activation. Next, a channel-by-channel global average pooling (GAP) operation summarizes the feature map group at each spatial scale over the spatial dimensions, yielding the frame-level sub-features of the target video frame at the different spatial scales. The attention weights at each spatial scale are then fitted from the frame-level sub-features at that scale in a fully connected manner, by feature-by-feature weighted summation. Because the attention weights at different spatial scales are correlated with each other, their probability distributions should be unified in the same space; a softmax function is therefore applied to the fitted attention weights so that the attention weights at the different spatial scales satisfy a specific spatial probability distribution, avoiding inconsistent feature distributions across scales. Finally, the attention weights at each spatial scale are applied to the frame-level sub-features at that scale by element-by-element multiplication, adjusting the frame-level sub-features at the corresponding scale to obtain the attention features of the target video frame at the different spatial scales, and the attention features of each target video frame at each spatial scale are spliced to obtain the multi-scale attention features of the target video frame. The accuracy of the multi-scale attention features of the target video frame is thereby improved through multi-scale spatial information and the attention mechanism.
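The per-scale attention processing just described might look roughly as follows in PyTorch; this is a hedged sketch in which the compressed channel count, the exact placement of the softmax, and all names are assumptions:

import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    """One attention head per spatial scale (one per convolution stage of the
    backbone). The compressed channel count of 128 is an assumption."""
    def __init__(self, scale_channels, compressed=128):
        super().__init__()
        # 1x1 convolutions compress each scale's feature map group; they act
        # as a linear transformation (no ReLU) to avoid feature loss.
        self.compress = nn.ModuleList(
            [nn.Conv2d(c, compressed, kernel_size=1) for c in scale_channels])
        # Fully connected layers fit the attention weights per scale.
        self.fc = nn.ModuleList(
            [nn.Linear(compressed, compressed) for _ in scale_channels])
        self.pool = nn.AdaptiveAvgPool2d(1)  # channel-by-channel GAP

    def forward(self, feature_maps):
        # feature_maps: one (N, C_i, H_i, W_i) tensor per spatial scale.
        feats = []
        for maps, conv, fc in zip(feature_maps, self.compress, self.fc):
            sub = self.pool(conv(maps)).flatten(1)   # frame-level sub-feature
            attn = torch.softmax(fc(sub), dim=1)     # weights as a probability distribution
            feats.append(sub * attn)                 # element-by-element re-weighting
        return torch.cat(feats, dim=1)               # spliced multi-scale attention feature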
And S360, fusing the multi-scale attention features of the target video frames of the two video parties under the corresponding frame sampling rate to obtain the frame-level features of the two video parties under the corresponding similarity weight.
Optionally, after the multi-scale attention features of each target video frame extracted from the two videos of each candidate video pair are obtained, the multi-scale attention features of all target video frames of each video are fused in each feature dimension, thereby obtaining the frame-level features of the two videos under the corresponding similarity weights.
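The embodiment does not name the fusion operator; averaging per feature dimension is one plausible choice, sketched below:

import torch

def fuse_frame_features(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (num_frames, feat_dim) multi-scale attention features of
    one video's target frames, fused per feature dimension by averaging."""
    return frame_feats.mean(dim=0)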
It should be noted that, since a given video may appear in multiple candidate video pairs, when extracting the frame-level features of the two videos of each candidate video pair, if the frame-level features of a video in the current pair were already extracted while processing a previous candidate video pair, there is no need to repeat the above feature extraction process for that video.
And S370, inputting the frame-level characteristics of both the videos of each candidate video pair in the video pair candidate library under the corresponding similarity weight into a pre-constructed three-layer perceptron network to obtain the similarity score of the candidate video pair.
Optionally, after the frame-level features of the two videos of each candidate video pair in the video pair candidate library under the corresponding similarity weights are extracted, the frame-level features of the two videos of each candidate video pair are directly input into a pre-constructed three-layer perceptron network, as shown in fig. 3B. The three-layer perceptron network analyzes the similarity between the frame-level features of the two videos of each candidate video pair to obtain a similarity score for the pair, and whether the two videos of the pair are similar is then judged from that similarity score.
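For illustration only, a three-layer perceptron of the kind described can be sketched in PyTorch as follows; the hidden width and the two-class softmax head are assumptions:

import torch
import torch.nn as nn

class SimilarityMLP(nn.Module):
    """Three-layer perceptron scoring a candidate pair from the concatenated
    frame-level features of its two videos; the hidden width is an assumption."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),  # logits for dissimilar / similar
        )

    def forward(self, feat_a, feat_b):
        # Returns (N, 2) class logits for each candidate pair in the batch.
        return self.net(torch.cat([feat_a, feat_b], dim=-1))

    def score(self, feat_a, feat_b):
        # Similarity score = probability of the "similar" class.
        return torch.softmax(self(feat_a, feat_b), dim=-1)[..., 1]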
It should be noted that, in this embodiment, the multi-scale attention residual network for frame-level feature extraction and the three-layer perceptron network are trained together in an end-to-end manner, and the final output is a binary result of whether a candidate video pair is similar. Training is supervised by labels at the video-pair level, each candidate video pair being annotated as similar or dissimilar; since the number of classes is small, the common cross-entropy is used as the loss function.
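A hedged sketch of one such end-to-end training step follows, reusing the SimilarityMLP sketch above; the optimizer and batch layout are assumptions:

import torch
import torch.nn.functional as F

def train_step(model, optimizer, feat_a, feat_b, labels):
    """One training step. labels: (N,) long tensor of video-pair-level
    annotations, 1 = similar, 0 = dissimilar. In the full pipeline the loss
    would also backpropagate through the multi-scale attention residual
    network that produced feat_a and feat_b."""
    logits = model(feat_a, feat_b)           # (N, 2) logits from the perceptron
    loss = F.cross_entropy(logits, labels)   # the common cross-entropy loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()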
And S380, screening corresponding similar video pairs according to the similarity score of each candidate video pair in the video pair candidate library.
Optionally, candidate video pairs with similarity scores exceeding a preset similarity threshold are screened from the video pair candidate library to serve as similar video pairs in the embodiment, and are subsequently submitted to the similar video pair library for updating.
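Assembled end to end, this screening step reduces to a simple filter; in the sketch below the tuple layout and the default threshold value are assumptions:

def screen_similar_pairs(scored_pairs, threshold=0.8):
    """scored_pairs: iterable of (video_id_a, video_id_b, similarity_score);
    the default threshold of 0.8 is illustrative, not from the patent."""
    return [(a, b) for a, b, score in scored_pairs if score > threshold]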
According to the technical scheme provided by this embodiment, the multi-scale attention features of the target video frames extracted by the multi-scale attention residual network are fused to obtain the frame-level features of the two videos under the corresponding similarity weights, which reduces the computation needed to extract the frame-level features while ensuring their accuracy; the three-layer perceptron network is then used to judge the similarity between the two videos of each candidate video pair, so that more accurate similar video pairs are screened out of the video pair candidate library and the screening efficiency of similar videos is improved.
Example four
Fig. 4 is a schematic structural diagram of a similar video screening apparatus according to a fourth embodiment of the present invention, specifically, as shown in fig. 4, the apparatus may include:
the similar library construction module 410 is used for constructing a similar video library of each video to be selected in the video library to be selected in a specified similar scale according to the video level characteristics of each video to be selected;
the candidate library generating module 420 is configured to respectively search, from the similar video library of each video to be selected, a similar video whose similarity with the video to be selected exceeds a preset similar threshold, and obtain a corresponding video pair candidate library;
the similar video screening module 430 is configured to screen out a corresponding similar video pair from the video pair candidate library based on frame-level features of both video parties of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
In the technical solution provided by this embodiment, a similar video library of each video to be selected at a specified similar scale is first constructed according to the video-level features of each video to be selected in the video library to be selected, performing a first screening of similar videos at the specified similar scale. Similar videos whose similarity with each video to be selected exceeds a preset similar threshold are then searched out of that video's similar video library, and the video to be selected and the found similar videos form the corresponding video pair candidate library, performing a second screening of similar videos within each similar video library under the preset similar threshold; this double screening reduces the number of candidate video pairs in the video pair candidate library. Corresponding similar video pairs are then screened out of the video pair candidate library based on the frame-level features of the two videos of each candidate video pair under the corresponding similarity weights. Similarity judgment of different videos to be selected under the combination of video-level and frame-level features is thereby realized, and the double screening improves the comprehensiveness and accuracy of similar video screening. Meanwhile, the similarity weights of the two videos of each candidate video pair control the computation spent extracting their frame-level features, so frame-level features need not be extracted from every video to be selected in the video library to be selected, which reduces the frame-level feature extraction computation and improves the screening efficiency of similar videos.
The similar video screening device provided by the embodiment can be applied to the similar video screening method provided by any embodiment, and has corresponding functions and beneficial effects.
EXAMPLE five
Fig. 5 is a schematic structural diagram of an apparatus according to a fifth embodiment of the present invention, as shown in fig. 5, the apparatus includes a processor 50, a storage device 51, and a communication device 52; the number of processors 50 in the device may be one or more, and one processor 50 is taken as an example in fig. 5; the processor 50, the storage means 51 and the communication means 52 in the device may be connected by a bus or other means, which is exemplified in fig. 5.
The device provided by the embodiment can be used for executing the similar video screening method provided by any embodiment, and has corresponding functions and beneficial effects.
EXAMPLE six
The sixth embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the method for screening similar videos in any of the above embodiments.
The method specifically comprises the following steps:
constructing a similar video library of each video to be selected in a specified similar scale according to the video level characteristics of each video to be selected in the video library to be selected;
respectively searching out similar videos with the similarity exceeding a preset similar threshold from a similar video library of each video to be selected to obtain a corresponding video pair candidate library;
and screening out corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
Of course, the storage medium provided by the embodiment of the present invention contains computer-executable instructions, and the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the similar video screening method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the similar video screening apparatus, the included units and modules are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A method for screening similar videos, comprising:
constructing a similar video library of each video to be selected in a specified similar scale according to the video level characteristics of each video to be selected in the video library to be selected;
respectively searching out similar videos with the similarity exceeding a preset similar threshold from a similar video library of each video to be selected to obtain a corresponding video pair candidate library;
and screening out corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
2. The method of claim 1, further comprising, prior to screening the candidate library of video pairs for corresponding similar video pairs:
determining the frame sampling rates of the two video parties according to the similarity weight of the two video parties of each candidate video pair in the video pair candidate library, and respectively extracting target video frames of the two video parties under the corresponding frame sampling rates;
extracting attention characteristics of each target video frame of the two video parties under the corresponding frame sampling rate under different spatial scales through a pre-constructed multi-scale attention residual network, and splicing the attention characteristics of the target video frame under each spatial scale to obtain the multi-scale attention characteristics of the target video frame;
and fusing the multi-scale attention features of the target video frames of the two video parties under the corresponding frame sampling rate to obtain the frame-level features of the two video parties under the corresponding similarity weight.
3. The method of claim 2, wherein the multi-scale attention residual network is configured with attention weights at each spatial scale that satisfy a particular spatial probability distribution;
the extracting, through a pre-constructed multi-scale attention residual network, attention characteristics of each target video frame of the two video parties under the corresponding frame sampling rate under different spatial scales comprises:
and extracting frame-level sub-features of each target video frame of the two video parties under the corresponding frame sampling rate under different spatial scales, and adjusting the frame-level sub-features of the target video frame under the corresponding spatial scales by adopting the attention weight under each spatial scale to obtain the attention features of the target video frame under different spatial scales.
4. The method of claim 1, further comprising, prior to screening the candidate library of video pairs for corresponding similar video pairs:
and aiming at any one video in each candidate video pair in the video pair candidate library, determining the similarity weight of the video according to the similarity between the video and other videos in the video pair candidate library.
5. The method of claim 1, wherein screening out corresponding similar video pairs from the candidate video pair library based on frame-level features of both video parties of each candidate video pair in the candidate video pair library under corresponding similarity weights comprises:
inputting the frame-level characteristics of both video sides of each candidate video pair in the video pair candidate library under the corresponding similarity weight into a pre-constructed three-layer perceptron network to obtain the similarity score of the candidate video pair;
and screening corresponding similar video pairs according to the similarity score of each candidate video pair in the video pair candidate library.
6. The method of claim 1, further comprising, before constructing a library of similar videos for each video to be selected at a specified similar scale:
extracting corresponding sparse video frames from each video to be selected in the video library to be selected;
according to the time sequence information of the sparse video frames in the video to be selected, fusing the depth separation characteristics of each sparse video frame in the video to be selected to obtain the two-dimensional space characteristics of the video to be selected;
determining three-dimensional space-time separation characteristics of the video to be selected according to the time sequence information of the sparse video frames in the video to be selected and the depth separation characteristics of each sparse video frame;
and splicing the two-dimensional space characteristic and the three-dimensional space-time separation characteristic of the video to be selected to obtain the video-level characteristic of the video to be selected.
7. The method according to claim 6, wherein the step of fusing the depth separation features of each sparse video frame in the video to be selected according to the time sequence information of the sparse video frame in the video to be selected to obtain the two-dimensional spatial features of the video to be selected comprises:
extracting the depth separation characteristics of each sparse video frame in the video to be selected by adopting a pre-constructed depth separation residual network, and fusing the depth separation characteristics of each sparse video frame by adopting a preset pseudo-timing sequence convolution kernel under different characteristic dimensions to obtain the two-dimensional space characteristics of the video to be selected, wherein each space convolution layer in the depth separation residual network is composed of a single-channel convolution kernel and a full-channel convolution kernel which are cascaded.
8. The method of claim 6, wherein determining the three-dimensional spatio-temporal separation characteristic of the video to be selected according to the timing information of the sparse video frames in the video to be selected and the depth separation characteristic of each sparse video frame comprises:
determining a middle convolution characteristic map of each sparse video frame in the video to be selected according to the depth separation characteristic of each sparse video frame in the video to be selected, and combining the middle convolution characteristic maps of the sparse video frames into a three-dimensional reference characteristic map group of the video to be selected according to the time sequence information of the sparse video frames in the video to be selected;
and extracting three-dimensional space-time separation characteristics of the three-dimensional reference characteristic graph group through a pre-constructed space-time separation residual network, wherein a main residual branch of each space-time convolution layer in the space-time separation residual network is composed of a space convolution kernel and a time convolution kernel which are cascaded, and a bypass residual branch of each space-time convolution layer is composed of a space convolution kernel and a time convolution kernel which are connected in parallel.
9. The method according to any one of claims 1 to 8, further comprising, after constructing a library of similar videos for each video to be selected at a specified similar scale:
submitting, to a pre-established similar video pair library, video pairs consisting of each video to be selected and those similar videos in its similar video library whose similarity with the video to be selected exceeds a preset submission threshold;
correspondingly, after the corresponding similar video pairs are screened from the video pair candidate library, the method further comprises the following steps:
and updating the similar video pair library according to the screened similar video pairs.
10. The method of claim 9, further comprising, after updating the library of similar video pairs based on the filtered similar video pairs:
and deduplicating the video library to be selected according to the similar video pairs in the similar video pair library, and pushing the deduplicated video library to be selected to a corresponding auditing server.
11. A similar video screening apparatus, comprising:
the similar library construction module is used for constructing a similar video library of each video to be selected in the video library to be selected under the appointed similar scale according to the video level characteristics of each video to be selected;
the candidate base generation module is used for respectively searching out similar videos of which the similarity with the video to be selected exceeds a preset similar threshold from the similar video base of each video to be selected to obtain corresponding video pair candidate bases;
and the similar video screening module is used for screening out corresponding similar video pairs from the video pair candidate library based on the frame-level characteristics of the video parties of each candidate video pair in the video pair candidate library under the corresponding similarity weight.
12. An apparatus, characterized in that the apparatus comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of screening for similar videos of any one of claims 1-10.
13. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method of filtering similar videos according to any one of claims 1 to 10.
CN202010478656.XA 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium Active CN111639230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010478656.XA CN111639230B (en) 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010478656.XA CN111639230B (en) 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111639230A true CN111639230A (en) 2020-09-08
CN111639230B CN111639230B (en) 2023-05-30

Family

ID=72331316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010478656.XA Active CN111639230B (en) 2020-05-29 2020-05-29 Similar video screening method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111639230B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796088A (en) * 2019-10-30 2020-02-14 行吟信息科技(上海)有限公司 Video similarity determination method and device
CN111046227A (en) * 2019-11-29 2020-04-21 腾讯科技(深圳)有限公司 Video duplicate checking method and device
CN110996123A (en) * 2019-12-18 2020-04-10 广州市百果园信息技术有限公司 Video processing method, device, equipment and medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112559784A (en) * 2020-11-02 2021-03-26 浙江智慧视频安防创新中心有限公司 Image classification method and system based on incremental learning
CN112559784B (en) * 2020-11-02 2023-07-04 浙江智慧视频安防创新中心有限公司 Image classification method and system based on incremental learning
CN112364850A (en) * 2021-01-13 2021-02-12 北京远鉴信息技术有限公司 Video quality inspection method and device, electronic equipment and storage medium
CN112364850B (en) * 2021-01-13 2021-04-06 北京远鉴信息技术有限公司 Video quality inspection method and device, electronic equipment and storage medium
CN115331154A (en) * 2022-10-12 2022-11-11 成都西交智汇大数据科技有限公司 Method, device and equipment for scoring experimental steps and readable storage medium

Also Published As

Publication number Publication date
CN111639230B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN108920720B (en) Large-scale image retrieval method based on depth hash and GPU acceleration
CN113516012A (en) Pedestrian re-identification method and system based on multi-level feature fusion
WO2021129145A1 (en) Image feature point filtering method and terminal
CN110909182A (en) Multimedia resource searching method and device, computer equipment and storage medium
CN111639230A (en) Similar video screening method, device, equipment and storage medium
CN111815432B (en) Financial service risk prediction method and device
CN104573130A (en) Entity resolution method based on group calculation and entity resolution device based on group calculation
CN114283350A (en) Visual model training and video processing method, device, equipment and storage medium
CN117312681B (en) Meta universe oriented user preference product recommendation method and system
CN110321492A (en) A kind of item recommendation method and system based on community information
CN109086830B (en) Typical correlation analysis near-duplicate video detection method based on sample punishment
CN115309996A (en) Information recommendation method and system based on multi-way recall
CN108470251B (en) Community division quality evaluation method and system based on average mutual information
CN111737371B (en) Data flow detection classification method and device capable of dynamically predicting
CN111444364B (en) Image detection method and device
CN117194778A (en) Prediction rule generation method, device, equipment and medium based on attribute map data
CN114332745B (en) Near-repetitive video big data cleaning method based on deep neural network
CN112613533B (en) Image segmentation quality evaluation network system and method based on ordering constraint
CN117688390A (en) Content matching method, apparatus, computer device, storage medium, and program product
CN114077681B (en) Image data processing method and device, computer equipment and storage medium
CN115577765A (en) Network model pruning method, electronic device and storage medium
CN114359649A (en) Image processing method, apparatus, device, storage medium, and program product
CN115982634A (en) Application program classification method and device, electronic equipment and computer program product
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
Jin Network Data Detection for Information Security Using CNN-LSTM Model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231008

Address after: 31a, 15th floor, building 30, maple commercial city, bangrang Road, Brazil

Patentee after: Baiguoyuan Technology (Singapore) Co.,Ltd.

Address before: 5-13 / F, West Tower, building C, 274 Xingtai Road, Shiqiao street, Panyu District, Guangzhou, Guangdong 510000

Patentee before: GUANGZHOU BAIGUOYUAN INFORMATION TECHNOLOGY Co.,Ltd.