
CN110839173A - Music matching method, device, terminal and storage medium - Google Patents

Music matching method, device, terminal and storage medium

Info

Publication number
CN110839173A
CN110839173A (application number CN201911128158.6A)
Authority
CN
China
Prior art keywords
audio
video
matched
features
music
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911128158.6A
Other languages
Chinese (zh)
Inventor
潘一汉
金明
董慧智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jilian Network Technology Co Ltd
Original Assignee
Shanghai Jilian Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jilian Network Technology Co Ltd filed Critical Shanghai Jilian Network Technology Co Ltd
Priority to CN201911128158.6A
Publication of CN110839173A
Legal status: Pending (current)

Classifications

    • H: Electricity
    • H04: Electric communication technique
    • H04N: Pictorial communication, e.g. television
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/44008: Processing of video elementary streams, involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/4394: Processing of audio elementary streams, involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/47205: End-user interface for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
    • H04N 21/8106: Monomedia components thereof involving special audio data, e.g. different tracks for different languages

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose a music matching method, device, terminal and storage medium. The method comprises the following steps: acquiring a target video, and respectively acquiring the audio features to be matched of a plurality of pieces of music to be matched; extracting the video visual features and the original audio features of the target video, and generating a target video feature according to the video visual features and the original audio features; and screening out at least one matched audio feature from the plurality of audio features to be matched according to the degree of matching between the target video feature and the audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music. With the technical scheme of the embodiments, background music can be matched automatically and directly according to the video content, without the user having to listen to all of the background music in advance, and the background music genuinely suited to the target video is screened out of the candidate pieces through an objective, quantitative matching process, which improves both the efficiency and the effect of background-music matching.

Description

Music matching method, device, terminal and storage medium
Technical Field
The embodiments of the invention relate to the technical field of computer applications, and in particular to a music matching method, device, terminal and storage medium.
Background
At present, short-video applications have become a popular category in the mobile internet: a user can shoot a short video anytime and anywhere, upload it to the internet, and share it with other users.
In a short video, besides the performance of the people on screen, the background music is often an important factor in attracting viewers. If suitable background music can be matched to a short video, more users will be attracted to watch it, which increases its play count.
For this reason, short-video production software usually provides a large library of background music from which the video producer manually selects a suitable track. This manual matching is inefficient, however; there is no guarantee that the producer's taste will be shared by the public, and the practical results are often poor.
Disclosure of Invention
The embodiments of the invention provide a music matching method, device, terminal and storage medium, so that suitable background music can be matched automatically according to the video content.
In a first aspect, an embodiment of the present invention provides a music matching method, which may include:
acquiring a target video, and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
extracting video visual features and original audio features of a target video, and generating target video features according to the video visual features and the original audio features;
and screening out at least one matched audio characteristic from the plurality of audio characteristics to be matched according to the matching degree between the target video characteristic and the plurality of audio characteristics to be matched, and taking the music to be matched corresponding to the matched audio characteristic as the matched music.
Optionally, the extracting the video visual features of the target video may include:
inputting the target video into a trained video visual extraction model, and extracting the video visual features of the target video, wherein the video visual extraction model comprises a video analysis module, a first convolutional neural network module and a recurrent neural network module, and the video analysis module is used for extracting the target video data in the target video and parsing the target video data into multiple frames of target images.
Optionally, on the basis of the above method, the method may further include:
acquiring a historical video and a first historical classification result of historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples;
the method comprises the steps of training a first original neural network model based on a plurality of first training samples to obtain a video vision extraction model, wherein the first original neural network model comprises a video analysis module, a first convolution neural network module, a recurrent neural network module and a first classification module, and the first classification module is used for processing historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
Optionally, extracting the original audio feature of the target video may include:
inputting the target video into the trained audio feature extraction model, and extracting the original audio features of the target video, wherein the audio feature extraction model comprises an audio conversion module and a second convolutional neural network module, and the audio conversion module is used for extracting target audio data in the target video and converting the target audio data into a spectrogram.
Optionally, on the basis of the above method, the method may further include:
acquiring historical audio and a second historical classification result of the historical audio, and taking the historical audio and the second historical classification result as a group of second training samples;
and training a second original neural network model based on a plurality of second training samples to obtain an audio feature extraction model, wherein the second original neural network model comprises an audio conversion module, a second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio features output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio features.
Optionally, generating the target video feature according to the video visual feature and the original audio feature may include:
splicing the video visual features and the original audio features to obtain target splicing features;
and inputting the target splicing characteristics into the trained multilayer perceptron to obtain target video characteristics.
Optionally, on the basis of the above method, the method may further include:
and acquiring the historical splicing features and the audio features to be recommended corresponding to the historical splicing features, taking the historical splicing features and the audio features to be recommended as a group of third training samples, and training a third primitive neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
In a second aspect, an embodiment of the present invention further provides a music matching apparatus, where the apparatus may include:
the acquisition module is used for acquiring a target video and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
the generating module is used for extracting video visual characteristics and original audio characteristics of the target video and generating target video characteristics according to the video visual characteristics and the original audio characteristics;
and the matching module is used for screening out at least one matched audio characteristic from the plurality of audio characteristics to be matched according to the matching degree between the target video characteristic and the plurality of audio characteristics to be matched, and taking the music to be matched corresponding to the matched audio characteristic as the matched music.
In a third aspect, an embodiment of the present invention further provides a terminal, where the terminal may include:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the music matching method provided by any of the embodiments of the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the music matching method provided in any embodiment of the present invention.
According to the technical scheme of the embodiments of the invention, for an obtainable target video, a target video feature that represents the semantic information of the target video as a whole is generated from the extracted video visual features and original audio features; considering the visual features and the original audio features together captures the semantic information in the target video more comprehensively, and generating the target video feature in advance simplifies the subsequent feature matching. Once this quantized representation of the target video is available, at least one piece of matched music can be screened out of the plurality of pieces of music to be matched by combining it with the directly obtainable quantized representations of the music to be matched; this quantitative approach can match background music accurately and quickly in any application scenario and therefore has a wide range of application. With this technical scheme, background music can be matched automatically and directly according to the video content, without the user having to listen to all of the background-music material in advance, and the background music genuinely suited to the target video is screened out of the plurality of pieces of music to be matched through an objective, quantitative matching process, so that both the efficiency and the effect of background-music matching are significantly improved.
Drawings
Fig. 1 is a flowchart of a music matching method according to a first embodiment of the present invention;
FIG. 2a is a schematic structural diagram of a video visual extraction model according to a first embodiment of the present invention;
FIG. 2b is a schematic structural diagram of a first original neural network model according to a first embodiment of the present invention;
FIG. 3a is a schematic structural diagram of an audio feature extraction model according to a first embodiment of the present invention;
FIG. 3b is a schematic structural diagram of a second original neural network model according to a first embodiment of the present invention;
FIG. 4 is a flowchart of a music matching method according to a second embodiment of the present invention;
FIG. 5 is a diagram illustrating feature splicing of a music matching method according to a second embodiment of the present invention;
FIG. 6 is a diagram of a preferred embodiment of a music matching method according to a second embodiment of the present invention;
fig. 7 is a block diagram of a music matching apparatus according to a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of a terminal in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a music matching method according to an embodiment of the present invention. The present embodiment is applicable to the case of automatically matching background music according to video content. The method can be executed by the music matching device provided by the embodiment of the invention, the device can be realized by software and/or hardware, and the device can be integrated on various user terminals or servers.
Referring to fig. 1, the method of the embodiment of the present invention specifically includes the following steps:
s110, obtaining a target video, and respectively obtaining audio features to be matched of a plurality of pieces of music to be matched.
The target video may be a pre-recorded video, such as a short video, a TV series, a movie or an animation, in which case the music matching method of the embodiment of the present invention matches background music for the recorded video. The target video may also be a video captured in real time from a real environment, such as a shopping-mall video, an evening-gala video or a stage-performance video, in which case the method matches background music to be played live in that environment. Of course, the target video may also be a video from any other scenario that needs background music, which is not specifically limited here.
The plurality of pieces of music to be matched may be music stored in advance in a background-music library, and the audio feature to be matched of each piece of music to be matched may likewise be an audio feature extracted from that piece in advance and stored.
And S120, extracting the video visual characteristics and the original audio characteristics of the target video, and generating the target video characteristics according to the video visual characteristics and the original audio characteristics.
Each target video may include target video data and target audio data, and after video visual features are extracted from the target video data and original audio features are extracted from the target audio data, target video features may be generated according to the video visual features and the original audio features, and the target video features may be features used to represent semantic information of the target video as a whole.
It should be noted that generating the target video feature has two advantages. On the one hand, the target video feature can be matched directly against the stored, pre-extracted audio features to be matched, so that no further merging or other processing of the video visual features, the original audio features and the audio features to be matched is needed, which greatly improves matching performance. On the other hand, compared with referring only to the video visual features of the target video, combining the video visual features with the original audio features captures the semantic information in the target video more comprehensively, which improves the accuracy of the subsequent music matching.
Specifically, for extracting the video visual features of the target video, the target video data can be regarded as target images arranged frame by frame in time order, so the extraction scheme involves image-recognition techniques such as image feature extraction as well as analysis of how the target images change over the time sequence. For example, the images may be recognized based on Histograms of Oriented Gradients (HOG), the Scale-Invariant Feature Transform (SIFT) or convolutional neural networks, and the change of the target images over the time sequence may be analyzed based on a Hidden Markov Model (HMM) or a recurrent neural network such as the Long Short-Term Memory (LSTM) network.
Similarly, extracting the original audio features of the target video amounts to extracting the original audio features of the target audio data in the target video. The target audio data can be regarded as pitches arranged over a time sequence, so the extraction scheme may first convert the target audio data into two-dimensional vector data similar to a target image through spectrogram conversion, and then perform the original audio feature extraction on that representation. In fact, the audio features to be matched of the music to be matched can also be extracted in advance based on a similar scheme, so that for each target video only the target video itself needs to be processed to obtain the target video feature; the music to be matched does not need to be processed again, which is more efficient.
That is, after frame extraction, image feature extraction and sequence feature processing, the target video data in the target video can be processed into a one-dimensional feature vector. Similarly, after information conversion and feature processing, the target audio data in the target video can also be processed into a one-dimensional feature vector. On this basis, the target video feature can be generated from these two one-dimensional feature vectors, and the target video feature is the quantized representation of the target video.
S130, according to the matching degree between the target video characteristic and the multiple audio characteristics to be matched, at least one matched audio characteristic is screened out from the multiple audio characteristics to be matched, and the music to be matched corresponding to the matched audio characteristic is used as matched music.
The target video feature obtained in the steps above represents the semantic information of the target video as a whole. On this basis, at least one matched audio feature can be screened out of the plurality of audio features to be matched according to the degree of matching between the target video feature and each audio feature to be matched, and the music to be matched corresponding to a matched audio feature is background music with a high degree of matching to the target video. It should be noted that the matching process between the target video feature and the audio features to be matched can be understood as a similarity computation; for example, the similarity between the target video feature and each audio feature to be matched may be computed as a cosine similarity or in another way.
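As an illustration only (not part of the patent text), the following is a minimal sketch of this similarity-based screening step, assuming the target video feature and the candidate audio features have already been reduced to vectors of the same length; the 128-dimensional feature size and the function names are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def screen_matched_music(target_feature, candidate_features, top_k=3):
    """Return the indices of the top_k audio features to be matched whose
    cosine similarity to the target video feature is highest."""
    scores = [cosine_similarity(target_feature, f) for f in candidate_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

# hypothetical usage with stand-in 128-dimensional features
target = np.random.rand(128)
candidates = [np.random.rand(128) for _ in range(100)]
print(screen_matched_music(target, candidates, top_k=3))
```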
On the basis, if the number of the matched music is one, the matched music can be directly recommended to the user as the background music of the target video; if the number of the matched music is at least two, the at least two matched music can be recommended to the user, so that the user can screen out the matched music which can be used as the background music of the target video from the at least two matched music.
According to the technical scheme of the embodiment of the invention, for an obtainable target video, a target video feature that represents the semantic information of the target video as a whole is generated from the extracted video visual features and original audio features; considering the visual features and the original audio features together captures the semantic information in the target video more comprehensively, and generating the target video feature in advance simplifies the subsequent feature matching. Once this quantized representation of the target video is available, at least one piece of matched music can be screened out of the plurality of pieces of music to be matched by combining it with the directly obtainable quantized representations of the music to be matched; this quantitative approach can match background music accurately and quickly in any application scenario and therefore has a wide range of application. With this technical scheme, background music can be matched automatically and directly according to the video content, without the user having to listen to all of the background-music material in advance, and the background music genuinely suited to the target video is screened out of the plurality of pieces of music to be matched through an objective, quantitative matching process, so that both the efficiency and the effect of background-music matching are significantly improved.
In an optional technical solution, extracting the video visual features of the target video, in other words the video visual features of the target video data in the target video, may specifically include: inputting the target video into a trained video visual extraction model, and extracting the video visual features of the target video, wherein the video visual extraction model may comprise a video analysis module, a first convolutional neural network module and a recurrent neural network module, and the video analysis module may be used for extracting the target video data in the target video and parsing it into multiple frames of target images.
The target video data can be regarded as target images arranged frame by frame in time order, and after frame extraction, image feature extraction, sequence feature processing and similar operations, it can be processed into a one-dimensional feature vector. Accordingly, the trained video visual extraction model may include a video parsing module for parsing the target video data in the target video into multiple frames of target images, a first convolutional neural network module for extracting image features from the target images, and a recurrent neural network module for analyzing the image features over the time sequence.
To better understand how the video visual extraction model works, refer to fig. 2a. For a target video that has been input into the model: first, the target video is input to the video parsing module to obtain multiple frames of target images, such as target image 1, target image 2 and target image 3, presented as an image sequence. Second, each target image in the sequence is input in turn to the same first Convolutional Neural Network (CNN) module to obtain the feature vector of that image; for example, target image 1 is input to the first CNN module to obtain feature vector 1, target image 2 to obtain feature vector 2, and target image 3 to obtain feature vector 3. Third, the feature vectors are input to a Recurrent Neural Network (RNN) module in time order to obtain the video visual features that represent the semantic information of the target video as a whole; for example, feature vector 1 is input to the RNN module to obtain feature vector 11, feature vector 2 together with feature vector 11 is input to obtain feature vector 21, and feature vector 3 together with feature vector 21 is input to obtain the video visual features. In other words, the first CNN module and the RNN module are reused across frames, and the RNN module is a sequential module suited to semantic analysis: it analyzes the semantics of each frame of target image in time order and thereby obtains the semantics of the target video composed of the multiple frames of target images.
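As an illustration only, the following is a minimal PyTorch-style sketch of this structure: a first CNN module shared across frames followed by a recurrent module over the frame sequence. The layer sizes, the choice of a GRU, and all names are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class VideoVisualExtractor(nn.Module):
    """Per-frame CNN features aggregated by a recurrent network (sketch)."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # first CNN module: reused for every frame of the parsed image sequence
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # recurrent module: consumes the per-frame feature vectors in time order
        self.rnn = nn.GRU(input_size=feat_dim, hidden_size=feat_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, height, width)
        b, t, c, h, w = frames.shape
        per_frame = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, last_hidden = self.rnn(per_frame)   # final hidden state summarises the sequence
        return last_hidden.squeeze(0)          # (batch, feat_dim) video visual feature

# hypothetical usage: a batch of two clips, each with 8 frames of 64x64 RGB images
video = torch.randn(2, 8, 3, 64, 64)
print(VideoVisualExtractor()(video).shape)     # torch.Size([2, 128])
```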
On this basis, optionally, the video visual extraction model may be trained in advance through the following steps: acquiring a historical video and a first historical classification result of the historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples; then training a first original neural network model based on a plurality of first training samples to obtain the video visual extraction model, wherein the first original neural network model comprises a video analysis module, a first convolutional neural network module, a recurrent neural network module and a first classification module, and the first classification module is used for processing the historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
As shown in fig. 2b, for a historical video that has been input to the first original neural network model: first, after the historical video passes through the video parsing module, the first CNN module and the RNN module in sequence, the historical visual features are obtained; second, in order to verify the accuracy of the historical visual features extracted from the historical video, the historical visual features are input to the first classification module to obtain a first prediction classification result, which is a classification label of the historical visual features; third, the first prediction classification result is compared with the first historical classification result of the historical video data in the historical video, thereby verifying the accuracy of the historical visual features.
It should be noted that: 1) the first classification module may be any classifier, and since it only serves to verify the accuracy of the historical visual features, it does not need to be retained in the video visual extraction model after model training is finished; 2) the number of classes and the choice of class content in the first classification module directly influence the historical visual features output by the recurrent neural network module; for example, if the first prediction classification result is a TV series, a movie, an MV or a documentary, the historical visual features correlate strongly with the video type, whereas if the first prediction classification result is an emotion such as happy, sad or sorrowful, the historical visual features correlate strongly with the video emotion; 3) using the first prediction classification result as a classification label has the advantage that its dimensionality is low, i.e. the data volume is small, so training is fast; moreover, the classification labels are set manually, which is closer to human judgment and helps ensure accuracy.
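As an illustration only, the following sketches the training arrangement just described: a first classification module (here a single linear layer over four assumed classes) is attached to the backbone during training and discarded afterwards. It reuses the VideoVisualExtractor sketch above; the loss, optimizer and class labels are assumptions.

```python
import torch
import torch.nn as nn

backbone = VideoVisualExtractor(feat_dim=128)          # from the sketch above
first_classifier = nn.Linear(128, 4)                   # e.g. TV series / movie / MV / documentary
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(first_classifier.parameters()))
criterion = nn.CrossEntropyLoss()

def train_step(historical_frames, first_historical_labels):
    """One step: classify the historical visual features and compare the
    prediction with the first historical classification result."""
    features = backbone(historical_frames)              # historical visual features
    logits = first_classifier(features)                 # first prediction classification result
    loss = criterion(logits, first_historical_labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# after training, only the backbone is kept as the video visual extraction model;
# the first classification module is discarded.
loss = train_step(torch.randn(4, 8, 3, 64, 64), torch.randint(0, 4, (4,)))
print(loss)
```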
In another optional technical solution, extracting the original audio features of the target video, in other words the original audio features of the target audio data in the target video, may specifically include: inputting the target video into a trained audio feature extraction model, and extracting the original audio features of the target video, wherein the audio feature extraction model comprises an audio conversion module and a second convolutional neural network module, and the audio conversion module is used for extracting the target audio data in the target video and converting it into a spectrogram.
The target audio data can be regarded as pitches arranged over a time sequence, and after information conversion, feature processing and similar operations, it can be processed into a one-dimensional feature vector. Accordingly, the trained audio feature extraction model may include an audio conversion module for converting the target audio data into a spectrogram, i.e. two-dimensional vector data similar to a target image, and a second convolutional neural network module for extracting image features from the spectrogram.
To better understand how the audio feature extraction model works, refer to fig. 3a. For a target video that has been input into the model: first, the target video is input to the audio conversion module to obtain a spectrogram, for example by converting the target audio data in the target video into a spectrogram based on the Fourier transform; second, the spectrogram is input to the second convolutional neural network module to obtain the original audio features.
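As an illustration only, the following is a minimal sketch of this audio path: a short-time Fourier transform converts the waveform into a log-magnitude spectrogram, which a small CNN then encodes. The window sizes, layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    """Spectrogram conversion followed by a CNN (sketch)."""
    def __init__(self, feat_dim: int = 128, n_fft: int = 512, hop: int = 256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # second CNN module, treating the spectrogram as a one-channel image
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, num_samples); audio conversion: STFT magnitude spectrogram
        spec = torch.stft(waveform, n_fft=self.n_fft, hop_length=self.hop,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True).abs()
        spec = torch.log1p(spec).unsqueeze(1)      # (batch, 1, freq_bins, time_frames)
        return self.cnn(spec)                      # (batch, feat_dim) original audio feature

# hypothetical usage: two clips of one second of 16 kHz audio
audio = torch.randn(2, 16000)
print(AudioFeatureExtractor()(audio).shape)        # torch.Size([2, 128])
```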
On this basis, optionally, the audio feature extraction model may be obtained by training in advance through the following steps: acquiring historical audio and a second historical classification result of the historical audio, and taking the historical audio and the second historical classification result as a group of second training samples; and training a second original neural network model based on a plurality of second training samples to obtain an audio feature extraction model, wherein the second original neural network model comprises an audio conversion module, a second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio features output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio features.
As shown in fig. 3b, for the historical audio input to the second original neural network model: first, if the input to the audio conversion module is already audio data, the module only needs to convert it into a spectrogram; otherwise, the module must first extract the audio data from the input and then convert it, for example extracting the historical audio data from a historical video and then converting it into a spectrogram. Second, after the spectrogram has passed through the second CNN module to obtain the historical audio features, the historical audio features are input to the second classification module to obtain a second prediction classification result, which is a classification label of the historical audio features, in order to verify the accuracy of the features extracted from the historical audio. Third, the second prediction classification result is compared with the second historical classification result of the historical audio, thereby verifying the accuracy of the historical audio features.
It should be noted that: 1) the second classification module may be any classifier, and since it only serves to verify the accuracy of the historical audio features, it does not need to be retained in the audio feature extraction model after model training is finished; 2) the number of classes and the choice of class content in the second classification module directly influence the historical audio features output by the second convolutional neural network module; for example, if the second prediction classification result is English, Cantonese or Mandarin, the historical audio features correlate strongly with the language of the audio, whereas if the second prediction classification result is an emotion such as happy, sad or sorrowful, the historical audio features correlate strongly with the emotion of the audio; 3) using the second prediction classification result as a classification label has the advantage that its dimensionality is low, i.e. the data volume is small, so training is fast; moreover, the classification labels are set manually, which is closer to human judgment and helps ensure accuracy.
Example two
Fig. 4 is a flowchart of a music matching method according to a second embodiment of the present invention. The present embodiment is optimized based on the above technical solutions. In this embodiment, optionally, the generating the target video feature according to the video visual feature and the original audio feature may specifically include: splicing the video visual features and the original audio features to obtain target splicing features; and inputting the target splicing characteristics into the trained multilayer perceptron to obtain target video characteristics. The same or corresponding terms as those in the above embodiments are not explained in detail herein.
Referring to fig. 4, the method of this embodiment may specifically include the following steps:
s210, obtaining a target video, and respectively obtaining audio features to be matched of a plurality of pieces of music to be matched.
S220, extracting video visual features and original audio features of the target video, splicing the video visual features and the original audio features to obtain target splicing features, and inputting the target splicing features into the trained multilayer perceptron to obtain the target video features.
In order to represent the semantic information of the target video as a whole, the video visual features and the original audio features may be spliced to obtain the target splicing feature, which contains both the picture information and the sound information of the target video, as shown in fig. 5. Since the target splicing feature is the result of splicing the video visual feature and the original audio feature, its vector length is necessarily greater than the vector length of the original audio feature.
The subsequent step screens at least one matched audio feature from the plurality of audio features to be matched according to the degree of matching between the target video feature (i.e. the processed target splicing feature) and the audio features to be matched, and the vector length of each audio feature to be matched is the same as that of the original audio feature. The target splicing feature therefore needs further processing: it is input to the trained multilayer perceptron to obtain the target video feature, whose vector length is the same as that of the audio features to be matched, which makes similarity matching between the target video feature and the audio features to be matched straightforward.
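As an illustration only, the following is a minimal sketch of the splicing step and of a multilayer perceptron that projects the spliced vector back down to the length of the audio features to be matched; the 128-dimensional feature lengths and the hidden size are assumptions.

```python
import torch
import torch.nn as nn

visual_dim, audio_dim = 128, 128              # assumed lengths of the two input features

# multilayer perceptron: maps the target splicing feature to the audio-feature length
mlp = nn.Sequential(
    nn.Linear(visual_dim + audio_dim, 256), nn.ReLU(),
    nn.Linear(256, audio_dim),
)

visual_feature = torch.randn(1, visual_dim)    # video visual feature
audio_feature = torch.randn(1, audio_dim)      # original audio feature

splice = torch.cat([visual_feature, audio_feature], dim=1)   # target splicing feature
target_video_feature = mlp(splice)             # same length as the audio features to be matched
print(target_video_feature.shape)              # torch.Size([1, 128])
```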
It should be noted that the multilayer perceptron serves to reduce the dimensionality of the spliced target splicing feature, so that it outputs a target video feature comparable with the corresponding matched audio features. On this basis, optionally, the multilayer perceptron may be trained in advance through the following steps: acquiring historical splicing features and the audio features to be recommended corresponding to the historical splicing features, taking a historical splicing feature and its audio feature to be recommended as a group of third training samples, and training a third original neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
An audio feature to be recommended may be the audio feature of the background music that best matches the historical video, selected manually, and a historical splicing feature may consist of historical visual features and historical audio features extracted from the historical video. Taking short videos as an example, if the unprocessed historical video and the background music of a short video can both be obtained, then for short videos with high popularity, historical visual features and historical audio features can be extracted from the unprocessed historical video to obtain the historical splicing feature, and the audio feature to be recommended can be extracted from the unprocessed background music, giving a positive third training sample for training the third original neural network model. Similarly, for short videos with low popularity, the corresponding feature vectors can be extracted as negative third training samples. In this way, the target video feature produced by the multilayer perceptron has a high degree of matching with the manually selected features to be recommended.
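As an illustration only, the following sketches how such positive and negative third training samples could be used to train the multilayer perceptron. The cosine embedding loss is an assumption about the training objective; the patent does not specify the loss.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))
optimizer = torch.optim.Adam(mlp.parameters())
# target +1 pulls the projected splice towards the recommended audio feature
# (positive sample); target -1 pushes it away (negative sample)
criterion = nn.CosineEmbeddingLoss(margin=0.2)

def train_step(historical_splice, recommended_audio_feature, is_positive):
    sign = 1.0 if is_positive else -1.0
    target = sign * torch.ones(historical_splice.size(0))
    projected = mlp(historical_splice)
    loss = criterion(projected, recommended_audio_feature, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# hypothetical usage with stand-in tensors (batch of 8 positive samples)
print(train_step(torch.randn(8, 256), torch.randn(8, 128), is_positive=True))
```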
And S230, screening out at least one matched audio feature from the multiple audio features to be matched according to the matching degree between the target video feature and the multiple audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music.
According to the technical scheme of this embodiment, the video visual features and the original audio features are spliced, and the spliced target splicing feature is input to the trained multilayer perceptron. The multilayer perceptron reduces the dimensionality of the target splicing feature and improves its similarity to the matched audio features, and since a matched audio feature is the audio feature of the matched music suited to the target video, the accuracy of the automatically obtained matched music is improved.
It should be noted that the above-mentioned "first", "second" and "third" are only used for distinguishing the respective noun concepts, and are not specific limitations to the respective noun concepts. For example, taking the original neural network model as an example, "first", "second", and "third" of "first original neural network model", "second original neural network model", and "third original neural network model" are used only to distinguish the original neural network models, and the content of each original neural network model is not specifically limited.
In order to better understand the concrete implementation process of the above steps, the concrete implementation process of the above music matching method can be as shown in fig. 6. For example, after the video visual extraction model, the audio feature extraction model and the multi-layer perceptron training are completed, all music to be matched in the background music library can be input into the audio feature extraction model, the audio features to be matched of each music to be matched are respectively obtained, and the audio features to be matched and the corresponding relationship between the audio features to be matched and the music to be matched are stored.
In practical application, each target video is input to the video visual extraction model and the audio feature extraction model respectively, to obtain the video visual features and the original audio features; the video visual features and the original audio features are spliced to obtain the target splicing feature, which is input to the multilayer perceptron to obtain a target video feature that represents the semantic information of the target video as a whole; the similarity between the target video feature and each stored audio feature to be matched is then computed to obtain the degrees of matching, the audio features to be matched with the highest degree of matching are taken as the matched audio features, and the music to be matched corresponding to them is taken as the matched music, i.e. the background music recommended for the target video. In this way, background music that matches the target video and appeals to the public can be recommended automatically, which improves the popularity of the target video.
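As an illustration only, the following ties the sketches above together into the end-to-end flow of fig. 6. It reuses the VideoVisualExtractor, AudioFeatureExtractor and mlp sketches introduced earlier; the shapes and the 100-item background-music library are stand-ins, not details from the patent.

```python
import torch
import torch.nn.functional as F

def match_music(frames, waveform, library_features, top_k=3):
    """frames: (1, T, 3, H, W) video frames; waveform: (1, num_samples) audio track;
    library_features: (N, 128) pre-extracted audio features of the music to be matched."""
    visual = VideoVisualExtractor()(frames)        # video visual feature
    audio = AudioFeatureExtractor()(waveform)      # original audio feature
    splice = torch.cat([visual, audio], dim=1)     # target splicing feature
    target = mlp(splice)                           # target video feature
    scores = F.cosine_similarity(target, library_features)   # degrees of matching
    return scores.topk(top_k).indices              # indices of the matched music

library = torch.randn(100, 128)                    # stand-in background-music library
print(match_music(torch.randn(1, 8, 3, 64, 64), torch.randn(1, 16000), library))
```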
Example three
Fig. 7 is a block diagram of a music matching apparatus according to a third embodiment of the present invention, which is configured to execute the music matching method according to any of the embodiments. The device and the music matching method of the embodiments belong to the same inventive concept, and details which are not described in detail in the embodiment of the music matching device can refer to the embodiment of the music matching method. Referring to fig. 7, the apparatus may specifically include: an acquisition module 310, a generation module 320, and a matching module 330.
The acquiring module 310 is configured to acquire a target video and acquire to-be-matched audio features of a plurality of to-be-matched music respectively;
the generating module 320 is configured to extract video visual features and original audio features of the target video, and generate target video features according to the video visual features and the original audio features;
the matching module 330 is configured to screen at least one matched audio feature from the multiple audio features to be matched according to the matching degree between the target video feature and the multiple audio features to be matched, and use the music to be matched corresponding to the matched audio feature as the matched music.
Optionally, the generating module 320 may specifically include:
the video visual characteristic extraction unit is used for inputting a target video to a trained video visual extraction model and extracting the video visual characteristics of the target video, wherein the video visual extraction model comprises a video analysis module, a first convolution neural network module and a recurrent neural network module, and the video analysis module is used for extracting target video data in the target video and analyzing the target video data into multi-frame target images.
Optionally, on the basis of the above apparatus, the apparatus may further include:
the first training sample acquisition module is used for acquiring a historical video and a first historical classification result of historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples;
the video visual extraction model training module is used for training a first original neural network model based on a plurality of first training samples to obtain a video visual extraction model, wherein the first original neural network model comprises a video analysis module, a first convolution neural network module, a recurrent neural network module and a first classification module, and the first classification module is used for processing historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
Optionally, the generating module 320 may specifically include:
and the original audio characteristic extraction unit is used for inputting the target video into the trained audio characteristic extraction model and extracting the original audio characteristics of the target video, wherein the audio characteristic extraction model comprises an audio conversion module and a second convolution neural network module, and the audio conversion module is used for extracting target audio data in the target video and converting the target audio data into a spectrogram.
Optionally, on the basis of the above apparatus, the apparatus may further include:
the second training sample acquisition module is used for acquiring a historical audio and a second historical classification result of the historical audio, and taking the historical audio and the second historical classification result as a group of second training samples;
and the audio characteristic extraction model training module is used for training a second original neural network model based on a plurality of second training samples to obtain an audio characteristic extraction model, wherein the second original neural network model comprises an audio conversion module, a second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio characteristics output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio characteristics.
Optionally, the generating module 320 may include:
the characteristic splicing unit is used for splicing the video visual characteristic and the original audio characteristic to obtain a target splicing characteristic;
and the feature generation unit is used for inputting the target splicing features into the trained multilayer perceptron to obtain the target video features.
Optionally, on the basis of the above apparatus, the apparatus may further include:
and the multilayer perceptron training module is used for acquiring the historical splicing characteristics and the audio characteristics to be recommended corresponding to the historical splicing characteristics, using the historical splicing characteristics and the audio characteristics to be recommended as a group of third training samples, and training a third primitive neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
With the music matching apparatus provided by the third embodiment of the invention, the acquisition module and the generation module cooperate: for an obtainable target video, a target video feature that represents the semantic information of the target video as a whole is generated from the extracted video visual features and original audio features; considering the visual features and the original audio features together captures the semantic information in the target video more comprehensively, and generating the target video feature in advance simplifies the subsequent feature matching. Once the quantized representation of the target video is available, the matching module can screen at least one piece of matched music out of the plurality of pieces of music to be matched by combining it with the quantized representations of the music to be matched; this quantitative approach can match background music accurately and quickly in any application scenario and has a wide range of application. With this apparatus, background music can be matched automatically and directly according to the video content, without the user having to listen to all of the background-music material in advance, and the background music genuinely suited to the target video is screened out of the plurality of pieces of music to be matched through an objective, quantitative matching process, so that both the efficiency and the effect of background-music matching are significantly improved.
The music matching device provided by the embodiment of the invention can execute the music matching method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the music matching apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example four
Fig. 8 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention, as shown in fig. 8, the terminal includes a memory 410, a processor 420, an input device 430, and an output device 440. The number of the processors 420 in the terminal may be one or more, and one processor 420 is taken as an example in fig. 8; the memory 410, processor 420, input device 430 and output device 440 in the terminal may be connected by a bus or other means, such as by bus 450 in fig. 8.
The memory 410, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the music matching method in the embodiment of the present invention (e.g., the acquisition module 310, the generation module 320, and the matching module 330 in the music matching apparatus). The processor 420 executes various functional applications of the terminal and data processing, i.e., implements the above-described music matching method, by executing software programs, instructions, and modules stored in the memory 410.
The memory 410 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 410 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 410 may further include memory located remotely from processor 420, which may be connected to devices through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the device. The output device 440 may include a display device such as a display screen.
Example five
An embodiment of the present invention provides a storage medium containing computer-executable instructions, which when executed by a computer processor, are configured to perform a music matching method, the method including:
acquiring a target video, and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
extracting video visual features and original audio features of the target video, and generating target video features according to the video visual features and the original audio features;
and screening out at least one matched audio feature from the plurality of audio features to be matched according to the matching degree between the target video feature and the plurality of audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music.
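As an illustration of these steps, the following toy sketch runs them end to end; the numeric feature vectors, the track names, and the choice of cosine similarity as the matching degree are assumptions introduced for the example.

# Toy end-to-end run of the three steps above; all numbers are illustrative.
import numpy as np

def matching_degree(target_video_feature, audio_feature):
    # One plausible matching degree: cosine similarity of the two vectors.
    return float(
        np.dot(target_video_feature, audio_feature)
        / (np.linalg.norm(target_video_feature) * np.linalg.norm(audio_feature))
    )

# Step 1: audio features to be matched for three pieces of music (toy vectors).
audio_features_to_match = {
    "track_a": np.array([0.9, 0.1, 0.0]),
    "track_b": np.array([0.2, 0.8, 0.1]),
    "track_c": np.array([0.1, 0.1, 0.9]),
}

# Step 2: target video feature generated from the video visual features and
# original audio features (a toy vector standing in for the model output).
target_video_feature = np.array([0.85, 0.2, 0.05])

# Step 3: screen out the matched audio feature with the highest matching degree
# and take the corresponding music to be matched as the matched music.
scores = {name: matching_degree(target_video_feature, feature)
          for name, feature in audio_features_to_match.items()}
matched_music = max(scores, key=scores.get)
print(matched_music, round(scores[matched_music], 3))  # track_a 0.991

A full implementation would, of course, derive these feature vectors from the trained models described in the embodiments rather than from hand-written numbers.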
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and may also perform related operations in the music matching method provided by any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by hardware, although the former is the preferred implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A music matching method, comprising:
acquiring a target video, and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
extracting video visual features and original audio features of the target video, and generating target video features according to the video visual features and the original audio features;
and screening out at least one matched audio feature from the plurality of audio features to be matched according to the matching degree between the target video feature and the plurality of audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music.
2. The method according to claim 1, wherein the extracting of the video visual features of the target video comprises:
inputting the target video into a trained video visual extraction model, and extracting the video visual features of the target video, wherein the video visual extraction model comprises a video parsing module, a first convolutional neural network module, and a recurrent neural network module, and the video parsing module is used for extracting target video data in the target video and parsing the target video data into multiple frames of target images.
3. The method of claim 2, further comprising:
acquiring a historical video and a first historical classification result of historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples;
training a first original neural network model based on a plurality of first training samples to obtain the video visual extraction model, wherein the first original neural network model comprises the video parsing module, the first convolutional neural network module, the recurrent neural network module, and a first classification module, and the first classification module is used for processing the historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
4. The method of claim 1, wherein the extracting of the original audio features of the target video comprises:
inputting the target video into a trained audio feature extraction model, and extracting original audio features of the target video, wherein the audio feature extraction model comprises an audio conversion module and a second convolutional neural network module, and the audio conversion module is used for extracting target audio data in the target video and converting the target audio data into a spectrogram.
5. The method of claim 4, further comprising:
acquiring historical audio and a second historical classification result of the historical audio, and using the historical audio and the second historical classification result as a group of second training samples;
and training a second original neural network model based on a plurality of second training samples to obtain the audio feature extraction model, wherein the second original neural network model comprises the audio conversion module, the second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio features output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio features.
6. The method of claim 1, wherein generating target video features from the video visual features and the original audio features comprises:
splicing the video visual features and the original audio features to obtain target splicing features;
and inputting the target splicing features into a trained multilayer perceptron to obtain the target video features.
7. The method of claim 6, further comprising:
acquiring a historical splicing feature and an audio feature to be recommended corresponding to the historical splicing feature, taking the historical splicing feature and the audio feature to be recommended as a group of third training samples, and training a third original neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
8. A music matching apparatus, comprising:
an acquisition module, used for acquiring a target video and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
a generation module, used for extracting video visual features and original audio features of the target video, and generating target video features according to the video visual features and the original audio features; and
a matching module, used for screening out at least one matched audio feature from the plurality of audio features to be matched according to the matching degree between the target video feature and the plurality of audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music.
9. A terminal, characterized in that the terminal comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the music matching method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the music matching method according to any one of claims 1 to 7.
CN201911128158.6A 2019-11-18 2019-11-18 Music matching method, device, terminal and storage medium Pending CN110839173A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911128158.6A CN110839173A (en) 2019-11-18 2019-11-18 Music matching method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
CN110839173A 2020-02-25

Family ID=69576612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911128158.6A Pending CN110839173A (en) 2019-11-18 2019-11-18 Music matching method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN110839173A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060122842A1 (en) * 2004-12-03 2006-06-08 Magix Ag System and method of automatically creating an emotional controlled soundtrack
CN109063163A (en) * 2018-08-14 2018-12-21 腾讯科技(深圳)有限公司 A kind of method, apparatus, terminal device and medium that music is recommended
CN109587554A (en) * 2018-10-29 2019-04-05 百度在线网络技术(北京)有限公司 Processing method, device and the readable storage medium storing program for executing of video data
CN109446990A (en) * 2018-10-30 2019-03-08 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111246124B (en) * 2020-03-09 2021-05-25 三亚至途科技有限公司 Multimedia digital fusion method and device
CN111246124A (en) * 2020-03-09 2020-06-05 三亚至途科技有限公司 Multimedia digital fusion method and device
CN111314771A (en) * 2020-03-13 2020-06-19 腾讯科技(深圳)有限公司 Video playing method and related equipment
CN111314771B (en) * 2020-03-13 2021-08-27 腾讯科技(深圳)有限公司 Video playing method and related equipment
CN113496243B (en) * 2020-04-07 2024-09-20 北京达佳互联信息技术有限公司 Background music acquisition method and related products
CN113496243A (en) * 2020-04-07 2021-10-12 北京达佳互联信息技术有限公司 Background music obtaining method and related product
CN111488485A (en) * 2020-04-16 2020-08-04 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN111488485B (en) * 2020-04-16 2023-11-17 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN111681679B (en) * 2020-06-09 2023-08-25 杭州星合尚世影视传媒有限公司 Video object sound effect searching and matching method, system, device and readable storage medium
CN111681680B (en) * 2020-06-09 2023-08-25 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN111681678B (en) * 2020-06-09 2023-08-22 杭州星合尚世影视传媒有限公司 Method, system, device and storage medium for automatically generating sound effects and matching videos
CN111681678A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system, device and storage medium for automatically generating sound effect and matching video
CN111681680A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system and device for acquiring audio by video recognition object and readable storage medium
CN111681679A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Video object sound effect searching and matching method, system and device and readable storage medium
CN111918094A (en) * 2020-06-29 2020-11-10 北京百度网讯科技有限公司 Video processing method and device, electronic equipment and storage medium
CN114513615A (en) * 2020-11-16 2022-05-17 阿里巴巴集团控股有限公司 Video processing method and device, processor and electronic equipment
EP4207746A4 (en) * 2020-12-10 2024-01-31 Shanghai Hode Information Technology Co., Ltd. Method and apparatus for constructing background audio
CN112584062B (en) * 2020-12-10 2023-08-08 上海幻电信息科技有限公司 Background audio construction method and device
CN112584062A (en) * 2020-12-10 2021-03-30 上海哔哩哔哩科技有限公司 Background audio construction method and device
CN113572981A (en) * 2021-01-19 2021-10-29 腾讯科技(深圳)有限公司 Video dubbing method and device, electronic equipment and storage medium
CN113032616B (en) * 2021-03-19 2024-02-20 腾讯音乐娱乐科技(深圳)有限公司 Audio recommendation method, device, computer equipment and storage medium
CN113032616A (en) * 2021-03-19 2021-06-25 腾讯音乐娱乐科技(深圳)有限公司 Audio recommendation method and device, computer equipment and storage medium
CN113190709A (en) * 2021-03-31 2021-07-30 浙江大学 Background music recommendation method and device based on short video key frame
CN113486214A (en) * 2021-07-23 2021-10-08 广州酷狗计算机科技有限公司 Music matching method and device, computer equipment and storage medium
WO2023046040A1 (en) * 2021-09-24 2023-03-30 海尔智家股份有限公司 Inventory management system in refrigeration appliance
CN115760020B (en) * 2023-01-05 2023-04-14 北京北测数字技术有限公司 Video integration media content analysis management system based on Internet
CN115760020A (en) * 2023-01-05 2023-03-07 北京北测数字技术有限公司 Video fusion media content analysis management system based on Internet

Similar Documents

Publication Publication Date Title
CN110839173A (en) Music matching method, device, terminal and storage medium
KR102416558B1 (en) Video data processing method, device and readable storage medium
US10824874B2 (en) Method and apparatus for processing video
CN108986186B (en) Method and system for converting text into video
CN108419094B (en) Video processing method, video retrieval method, device, medium and server
WO2019242222A1 (en) Method and device for use in generating information
CN109862397B (en) Video analysis method, device, equipment and storage medium
CN110751030A (en) Video classification method, device and system
WO2022188644A1 (en) Word weight generation method and apparatus, and device and medium
Le et al. NII-HITACHI-UIT at TRECVID 2016.
CN109582825B (en) Method and apparatus for generating information
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
CN111737516A (en) Interactive music generation method and device, intelligent sound box and storage medium
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113742524A (en) Video quick retrieval method and system and video quick recommendation method
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN114510564A (en) Video knowledge graph generation method and device
JP2020173776A (en) Method and device for generating video
Tran et al. Predicting Media Memorability Using Deep Features with Attention and Recurrent Network.
Tran-Van et al. Predicting Media Memorability Using Deep Features and Recurrent Network.
CN113268635B (en) Video processing method, device, server and computer readable storage medium
CN116028669A (en) Video searching method, device and system based on short video and storage medium
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
CN115272660A (en) Lip language identification method and system based on double-flow neural network
CN110489592B (en) Video classification method, apparatus, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200225