CN110839173A - Music matching method, device, terminal and storage medium - Google Patents
- Publication number
- CN110839173A CN110839173A CN201911128158.6A CN201911128158A CN110839173A CN 110839173 A CN110839173 A CN 110839173A CN 201911128158 A CN201911128158 A CN 201911128158A CN 110839173 A CN110839173 A CN 110839173A
- Authority
- CN
- China
- Prior art keywords
- audio
- video
- matched
- features
- music
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4394—Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
- H04N21/47205—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
Abstract
The embodiment of the invention discloses a music matching method, device, terminal and storage medium. The method comprises the following steps: acquiring a target video, and respectively acquiring the audio features to be matched of a plurality of pieces of music to be matched; extracting video visual features and original audio features of the target video, and generating target video features according to the video visual features and the original audio features; and screening out at least one matched audio feature from the plurality of audio features to be matched according to the matching degree between the target video features and the plurality of audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music. According to the technical scheme of the embodiment of the invention, background music can be matched automatically and directly according to the video content, without requiring the user to listen to all the background music in advance, and the background music that genuinely suits the target video is screened out from the plurality of pieces of music to be matched in an objective, quantitative manner, so that both the efficiency and the effect of background music matching are improved.
Description
Technical Field
The embodiment of the invention relates to the technical field of computer applications, and in particular to a music matching method, device, terminal and storage medium.
Background
At present, short video applications have become a popular category of mobile internet applications: users can shoot short videos anytime and anywhere and upload them to the internet to share with other users.
In a short video, besides the performance of its protagonist, the background music is often an important factor in attracting users to watch. Therefore, if appropriate background music can be matched to a short video, more users can be attracted to watch it, thereby increasing its play count.
In view of this, short video production software often provides a large library of background music so that a video producer can manually select a suitable piece from it. However, manually matching background music is inefficient, there is no guarantee that the producer's taste will appeal to the general audience, and the practical effect is therefore poor.
Disclosure of Invention
The embodiment of the invention provides a music matching method, a music matching device, a music matching terminal and a music matching storage medium, which are used for realizing the effect of automatically matching proper background music according to video content.
In a first aspect, an embodiment of the present invention provides a music matching method, which may include:
acquiring a target video, and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
extracting video visual features and original audio features of a target video, and generating target video features according to the video visual features and the original audio features;
and screening out at least one matched audio characteristic from the plurality of audio characteristics to be matched according to the matching degree between the target video characteristic and the plurality of audio characteristics to be matched, and taking the music to be matched corresponding to the matched audio characteristic as the matched music.
Optionally, the extracting the video visual features of the target video may include:
inputting a target video to a trained video visual extraction model, and extracting video visual features of the target video, wherein the video visual extraction model comprises a video analysis module, a first convolution neural network module and a recurrent neural network module, and the video analysis module is used for extracting target video data in the target video and analyzing the target video data into a plurality of frames of target images.
Optionally, on the basis of the above method, the method may further include:
acquiring a historical video and a first historical classification result of historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples;
the method comprises the steps of training a first original neural network model based on a plurality of first training samples to obtain a video vision extraction model, wherein the first original neural network model comprises a video analysis module, a first convolution neural network module, a recurrent neural network module and a first classification module, and the first classification module is used for processing historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
Optionally, extracting the original audio feature of the target video may include:
inputting the target video into the trained audio feature extraction model, and extracting the original audio features of the target video, wherein the audio feature extraction model comprises an audio conversion module and a second convolutional neural network module, and the audio conversion module is used for extracting target audio data in the target video and converting the target audio data into a spectrogram.
Optionally, on the basis of the above method, the method may further include:
acquiring historical audio and a second historical classification result of the historical audio, and taking the historical audio and the second historical classification result as a group of second training samples;
and training a second original neural network model based on a plurality of second training samples to obtain an audio feature extraction model, wherein the second original neural network model comprises an audio conversion module, a second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio features output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio features.
Optionally, generating the target video feature according to the video visual feature and the original audio feature may include:
splicing the video visual features and the original audio features to obtain target splicing features;
and inputting the target splicing characteristics into the trained multilayer perceptron to obtain target video characteristics.
Optionally, on the basis of the above method, the method may further include:
acquiring historical splicing features and the audio features to be recommended corresponding to the historical splicing features, taking the historical splicing features and the audio features to be recommended as a group of third training samples, and training a third original neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
In a second aspect, an embodiment of the present invention further provides a music matching apparatus, where the apparatus may include:
the acquisition module is used for acquiring a target video and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
the generating module is used for extracting video visual characteristics and original audio characteristics of the target video and generating target video characteristics according to the video visual characteristics and the original audio characteristics;
and the matching module is used for screening out at least one matched audio characteristic from the plurality of audio characteristics to be matched according to the matching degree between the target video characteristic and the plurality of audio characteristics to be matched, and taking the music to be matched corresponding to the matched audio characteristic as the matched music.
In a third aspect, an embodiment of the present invention further provides a terminal, where the terminal may include:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the music matching method provided by any of the embodiments of the present invention.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the music matching method provided in any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, for an acquired target video, a target video feature that represents the semantic information of the target video as a whole is generated from the extracted video visual features and original audio features. By jointly considering the video visual features and the original audio features, the semantic information in the target video can be captured more comprehensively, and generating the target video feature in advance simplifies the subsequent feature matching process. After this quantized representation of the target video is obtained, it can be combined with the directly available quantized representations of the music to be matched to screen out at least one piece of matched music from the plurality of pieces of music to be matched; this quantitative approach can match background music accurately and quickly in any application scenario and therefore has a wide range of applications. With the above technical scheme, background music can be matched automatically and directly according to the video content, without requiring the user to listen to all the background music material in advance, and the background music that genuinely suits the target video is screened out from the plurality of pieces of music to be matched in an objective, quantitative manner, so that both the efficiency and the effect of background music matching are significantly improved.
Drawings
Fig. 1 is a flowchart of a music matching method according to a first embodiment of the present invention;
FIG. 2a is a schematic structural diagram of a video visual extraction model according to a first embodiment of the present invention;
FIG. 2b is a schematic structural diagram of a first original neural network model according to a first embodiment of the present invention;
FIG. 3a is a schematic structural diagram of an audio feature extraction model according to a first embodiment of the present invention;
FIG. 3b is a schematic structural diagram of a second original neural network model according to a first embodiment of the present invention;
FIG. 4 is a flowchart of a music matching method according to a second embodiment of the present invention;
FIG. 5 is a diagram illustrating feature splicing of a music matching method according to a second embodiment of the present invention;
FIG. 6 is a diagram of a preferred embodiment of a music matching method according to a second embodiment of the present invention;
fig. 7 is a block diagram of a music matching apparatus according to a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of a terminal in the fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a music matching method according to an embodiment of the present invention. The present embodiment is applicable to the case of automatically matching background music according to video content. The method can be executed by the music matching device provided by the embodiment of the invention, the device can be realized by software and/or hardware, and the device can be integrated on various user terminals or servers.
Referring to fig. 1, the method of the embodiment of the present invention specifically includes the following steps:
s110, obtaining a target video, and respectively obtaining audio features to be matched of a plurality of pieces of music to be matched.
The target video may be an already-shot video, such as a short video, a TV series, a movie, an animation, and the like, in which case the music matching method according to the embodiment of the present invention matches background music to the shot video; the target video may also be a video captured in real time from a real environment, such as a shopping mall video, an evening party video, a theatrical performance video, and the like, in which case the music matching method of the embodiment of the present invention matches background music to be played on site for that real environment; of course, the target video may also be a video from any other scenario in which background music needs to be matched, which is not specifically limited herein.
The plurality of pieces of music to be matched can be music stored in a background music library in advance, and the audio features to be matched of each piece of music to be matched can also be stored audio features extracted from the music to be matched in advance.
And S120, extracting the video visual characteristics and the original audio characteristics of the target video, and generating the target video characteristics according to the video visual characteristics and the original audio characteristics.
Each target video may include target video data and target audio data, and after video visual features are extracted from the target video data and original audio features are extracted from the target audio data, target video features may be generated according to the video visual features and the original audio features, and the target video features may be features used to represent semantic information of the target video as a whole.
It should be noted that, on the one hand, an advantage of generating the target video feature is that it can be matched directly against the pre-extracted and stored audio features to be matched, without any further merging or other processing of the video visual features, the original audio features and the audio features to be matched, so that matching performance is greatly improved. On the other hand, compared with referring only to the video visual features of the target video, jointly considering the video visual features and the original audio features of the target video captures the semantic information in the target video more comprehensively, which improves the accuracy of the subsequent music matching.
Specifically, for extracting the video visual features of the target video, the target video data may be regarded as a sequence of target images arranged frame by frame in time order, so the extraction scheme for the video visual features may involve image recognition techniques such as image feature extraction, as well as analysis of how the target images change over the time sequence. For example, images may be recognized based on Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), convolutional neural networks and the like, and the change of the target images over the time sequence may be analyzed based on a Hidden Markov Model (HMM), recurrent neural networks such as a Long Short-Term Memory (LSTM) network, and the like.
Similarly, extracting the original audio features of the target video amounts to extracting the original audio features of the target audio data in the target video. The target audio data can be regarded as being composed of tones arranged over time, so the original audio feature extraction scheme can convert the target audio data into two-dimensional data similar to a target image through spectrogram conversion, and then perform the original audio feature extraction on this basis using a scheme similar to image feature extraction. In fact, the audio features to be matched of the music to be matched can also be extracted in advance based on a similar scheme, so that for each target video only the target video itself needs to be processed to obtain the target video feature, the music to be matched does not need to be processed again, and the efficiency is higher.
That is, after frame extraction, image feature extraction and sequence feature processing are performed on the target video data in a target video, the target video data can be processed into a one-dimensional feature vector. Similarly, after information conversion and feature processing are performed on the target audio data in the target video, the target audio data can also be processed into a one-dimensional feature vector. On this basis, the target video feature can be generated from these two one-dimensional feature vectors, and the target video feature is the quantized representation of the target video.
S130, according to the matching degree between the target video characteristic and the multiple audio characteristics to be matched, at least one matched audio characteristic is screened out from the multiple audio characteristics to be matched, and the music to be matched corresponding to the matched audio characteristic is used as matched music.
The target video features obtained in the above steps can represent semantic information of the target video as a whole, on this basis, at least one matched audio feature can be screened out from the multiple audio features to be matched according to the matching degree between the target video features and the multiple audio features to be matched, and the matched music corresponding to the matched audio feature is background music with higher matching degree with the target video. It should be noted that the matching process between the target video feature and the plurality of audio features to be matched may be understood as a similarity calculation process, for example, the similarity between the target video feature and each of the audio features to be matched may be calculated based on cosine similarity or in other manners.
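As an illustration only, the matching-degree calculation described above might be sketched as follows in Python; the function and variable names are assumptions made for this sketch, and cosine similarity is merely one of the possible similarity measures mentioned above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Matching degree between two 1-D feature vectors; 1e-8 avoids division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def screen_matched_features(target_video_feature, audio_features_to_match, top_k=3):
    # Rank all audio features to be matched by their matching degree with the
    # target video feature and keep the indices of the top_k matched audio features.
    scores = [cosine_similarity(target_video_feature, f) for f in audio_features_to_match]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]
```

The returned indices identify the matched audio features; the music to be matched corresponding to those features is then taken as the matched music, recommended directly when top_k is 1 or offered to the user as candidates otherwise.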
On the basis, if the number of the matched music is one, the matched music can be directly recommended to the user as the background music of the target video; if the number of the matched music is at least two, the at least two matched music can be recommended to the user, so that the user can screen out the matched music which can be used as the background music of the target video from the at least two matched music.
According to the technical scheme of the embodiment of the invention, for an acquired target video, a target video feature that represents the semantic information of the target video as a whole is generated from the extracted video visual features and original audio features. By jointly considering the video visual features and the original audio features, the semantic information in the target video can be captured more comprehensively, and generating the target video feature in advance simplifies the subsequent feature matching process. After this quantized representation of the target video is obtained, it can be combined with the directly available quantized representations of the music to be matched to screen out at least one piece of matched music from the plurality of pieces of music to be matched; this quantitative approach can match background music accurately and quickly in any application scenario and therefore has a wide range of applications. With the above technical scheme, background music can be matched automatically and directly according to the video content, without requiring the user to listen to all the background music material in advance, and the background music that genuinely suits the target video is screened out from the plurality of pieces of music to be matched in an objective, quantitative manner, so that both the efficiency and the effect of background music matching are significantly improved.
An optional technical solution is to extract video visual features of a target video, or to say, video visual features of target video data in the target video, and specifically may include: inputting a target video into a trained video visual extraction model, and extracting video visual features of the target video, wherein the video visual extraction model can comprise a video analysis module, a first convolution neural network module and a recurrent neural network module, and the video analysis module can be used for extracting target video data in the target video and analyzing the target video data into a multi-frame target image.
The target video data can be regarded as target images arranged frame by frame in time order, and after operations such as frame extraction, image feature extraction and sequence feature processing, the target video data can be processed into a one-dimensional feature vector. Therefore, the trained video visual extraction model may include a video parsing module for parsing the target video data in the target video into multiple frames of target images, a first convolutional neural network module for extracting image features from the target images, and a recurrent neural network module for performing time-sequence analysis on the image features.
In order to better understand the specific working process of the video visual extraction model, for example, referring to fig. 2a, for a target video that has been input into the video visual extraction model: first, the target video is input into the video parsing module to obtain multiple frames of target images, such as target image 1, target image 2 and target image 3, where the multiple frames of target images are presented in the form of an image sequence; secondly, each frame of the serialized target images is input in turn into the same first Convolutional Neural Network (CNN) module to obtain the feature vector of that target image, for example, target image 1 is input into the first CNN module to obtain feature vector 1, target image 2 is input into the first CNN module to obtain feature vector 2, and target image 3 is input into the first CNN module to obtain feature vector 3; thirdly, the feature vectors are input in turn into a Recurrent Neural Network (RNN) module according to their time-sequence relationship to obtain a video visual feature capable of representing the semantic information of the target video as a whole, for example, feature vector 1 is input into the RNN module to obtain feature vector 11, feature vector 2 and feature vector 11 are input together into the RNN module to obtain feature vector 21, and feature vector 3 and feature vector 21 are input together into the RNN module to obtain the video visual feature. That is to say, the same first CNN module and RNN module are reused across frames, and the RNN module is a time-sequential module usable for semantic analysis, which can analyze the semantics of each frame of target image in turn according to the time-sequence relationship, thereby obtaining the semantics of the target video composed of the multiple frames of target images.
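A minimal PyTorch-style sketch of such a video visual extraction model is given below purely for illustration; the layer sizes, feature dimensions and class names are assumptions and are not taken from the embodiment itself.

```python
import torch
import torch.nn as nn

class VideoVisualExtractor(nn.Module):
    def __init__(self, frame_feat_dim=512, visual_dim=256):
        super().__init__()
        # First CNN module, shared by (reused for) every frame of target image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, frame_feat_dim),
        )
        # RNN module that aggregates the per-frame feature vectors in time order.
        self.rnn = nn.GRU(frame_feat_dim, visual_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W), i.e. the image sequence produced
        # by the video parsing module from the target video data.
        b, t, c, h, w = frames.shape
        per_frame = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, last_hidden = self.rnn(per_frame)
        return last_hidden[-1]  # (batch, visual_dim): the video visual feature
```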
On this basis, optionally, the video visual extraction model may be obtained by training in advance through the following steps: acquiring a historical video and a first historical classification result of historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples; the method comprises the steps of training a first original neural network model based on a plurality of first training samples to obtain a video vision extraction model, wherein the first original neural network model comprises a video analysis module, a first convolution neural network module, a recurrent neural network module and a first classification module, and the first classification module is used for processing historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
As shown in fig. 2b, for a historical video that has been input to the first original neural network model, first, after the historical video passes through the video parsing module, the first CNN module, and the RNN module in sequence, a historical visual feature may be obtained; secondly, in order to verify the accuracy of the historical visual features extracted from the historical video, inputting the historical visual features into a first classification module to obtain a first prediction classification result of the historical visual features, wherein the first prediction classification result is a classification label of the historical visual features; again, the first prediction classification result is compared with a first historical classification result of historical video data in the historical video, thereby verifying the accuracy of the historical visual features.
It should be noted that 1) the first classification module may be any classifier, and since the first classification module is only used to verify the accuracy of the historical visual features, it does not need to be retained in the video visual extraction model after model training is finished. 2) The choice of the number and content of the classes of the first classification module directly influences the historical visual features output by the recurrent neural network module: for example, if the first prediction classification results are TV series, movie, MV and documentary, the historical visual features will correlate strongly with the video type; if the first prediction classification results are, for example, happy, sad and painful, the historical visual features will correlate strongly with the video emotion. 3) The benefit of using the first prediction classification result as the classification label is that its dimensionality is low, i.e. the data volume is small, so training is fast; moreover, since the first prediction classification result is a manually defined classification label, it better matches the human way of thinking, which ensures accuracy.
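As a sketch only, the temporary first classification module could be attached to the extractor above roughly as follows during training and discarded afterwards; the class count and the cross-entropy loss are assumptions for illustration.

```python
import torch.nn as nn

class VisualPretrainModel(nn.Module):
    def __init__(self, extractor: VideoVisualExtractor, num_classes: int, visual_dim=256):
        super().__init__()
        self.extractor = extractor                             # first CNN + RNN over parsed frames
        self.classifier = nn.Linear(visual_dim, num_classes)   # first classification module

    def forward(self, frames):
        historical_visual_feature = self.extractor(frames)
        # First prediction classification result of the historical visual feature.
        return self.classifier(historical_visual_feature)

# Trained against the first historical classification results, e.g. with
# nn.CrossEntropyLoss(); after training only model.extractor is kept as the
# video visual extraction model, and the classifier head is not retained.
```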
An optional technical solution is to extract an original audio feature of a target video, or to say, an original audio feature of target audio data in a target video, and specifically may include: inputting the target video into the trained audio feature extraction model, and extracting the original audio features of the target video, wherein the audio feature extraction model comprises an audio conversion module and a second convolutional neural network module, and the audio conversion module is used for extracting target audio data in the target video and converting the target audio data into a spectrogram.
The target audio data can be regarded as data composed of tones arranged over time, and after information conversion, feature processing and other operations, the target audio data can be processed into a one-dimensional feature vector. Therefore, the trained audio feature extraction model may include an audio conversion module for converting the target audio data into a spectrogram, which is two-dimensional data similar to a target image, and a second convolutional neural network module for extracting image features from the spectrogram.
In order to better understand the specific working process of the audio feature extraction model, for example, referring to fig. 3a, for a target video that has been input to the audio feature extraction model, first, the target video is input to an audio conversion module to obtain a spectrogram, for example, target audio data in the target video is converted into the spectrogram based on fourier transform; and secondly, inputting the spectrogram into a second convolutional neural network module to obtain the original audio features of the spectrogram.
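Below is a minimal illustrative sketch of such an audio feature extraction model, assuming PyTorch and an STFT magnitude spectrogram as the audio conversion step; the sizes and names are assumptions, not values taken from the embodiment.

```python
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    def __init__(self, audio_dim=256, n_fft=512, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Second CNN module, treating the spectrogram as a one-channel image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, audio_dim),
        )

    def to_spectrogram(self, waveform: torch.Tensor) -> torch.Tensor:
        # Audio conversion module: (batch, samples) waveform -> magnitude
        # spectrogram of shape (batch, 1, freq_bins, time_frames).
        spec = torch.stft(waveform, n_fft=self.n_fft, hop_length=self.hop,
                          return_complex=True).abs()
        return spec.unsqueeze(1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # Output: (batch, audio_dim) original audio feature, or audio feature
        # to be matched when the input is a piece of music to be matched.
        return self.cnn(self.to_spectrogram(waveform))
```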
On this basis, optionally, the audio feature extraction model may be obtained by training in advance through the following steps: acquiring historical audio and a second historical classification result of the historical audio, and taking the historical audio and the second historical classification result as a group of second training samples; and training a second original neural network model based on a plurality of second training samples to obtain an audio feature extraction model, wherein the second original neural network model comprises an audio conversion module, a second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio features output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio features.
As shown in fig. 3b, for the historical audio input to the second original neural network model: first, if the input data of the audio conversion module is already audio data, the audio conversion module only needs to convert it into a spectrogram; otherwise, the audio conversion module needs to extract the audio data from the input data and then convert it into a spectrogram, for example extracting the historical audio data from a historical video and then converting it into a spectrogram; secondly, after the historical audio features are obtained from the spectrogram through the second CNN module, in order to verify the accuracy of the historical audio features extracted from the historical audio, the historical audio features are input into the second classification module to obtain a second prediction classification result of the historical audio features, and the second prediction classification result is the classification label of the historical audio features; again, the second prediction classification result is compared with the second historical classification result of the historical audio, thereby verifying the accuracy of the historical audio features.
It should be noted that 1) the second classification module may be any classifier, and since the second classification module is only used to verify the accuracy of the historical audio features, it does not need to be retained in the audio feature extraction model after model training is finished. 2) The choice of the number and content of the classes of the second classification module directly influences the historical audio features output by the second convolutional neural network module: for example, if the second prediction classification results are English, Cantonese and Mandarin, the historical audio features will correlate strongly with the language of the audio; if the second prediction classification results are, for example, happy, sad and melancholic, the historical audio features will correlate strongly with the audio emotion. 3) The benefit of using the second prediction classification result as the classification label is that its dimensionality is low, i.e. the data volume is small, so training is fast; moreover, since the second prediction classification result is a manually defined classification label, it better matches the human way of thinking, which ensures accuracy.
Example two
Fig. 4 is a flowchart of a music matching method according to a second embodiment of the present invention. The present embodiment is optimized based on the above technical solutions. In this embodiment, optionally, the generating the target video feature according to the video visual feature and the original audio feature may specifically include: splicing the video visual features and the original audio features to obtain target splicing features; and inputting the target splicing characteristics into the trained multilayer perceptron to obtain target video characteristics. The same or corresponding terms as those in the above embodiments are not explained in detail herein.
Referring to fig. 4, the method of this embodiment may specifically include the following steps:
s210, obtaining a target video, and respectively obtaining audio features to be matched of a plurality of pieces of music to be matched.
S220, extracting video visual features and original audio features of the target video, splicing the video visual features and the original audio features to obtain target splicing features, and inputting the target splicing features into the trained multilayer perceptron to obtain the target video features.
In order to represent the semantic information of the target video as a whole, the video visual features and the original audio features may be spliced to obtain a target splicing feature, which simultaneously contains the picture information and the audio information in the target video, as shown in fig. 5. As can be seen from the above, the target splicing feature is the result of splicing the video visual features and the original audio features, so its vector length is necessarily greater than the vector length of the original audio features.
Considering that at least one matched audio feature is subsequently to be screened out from the multiple audio features to be matched according to the matching degree between the target video feature (i.e. the further-processed target splicing feature) and the multiple audio features to be matched, and that the vector length of the audio features to be matched is consistent with the vector length of the original audio features, the target splicing feature needs to be processed further. For example, the target splicing feature can be input into a trained multilayer perceptron to obtain the target video feature, whose vector length is consistent with the vector length of the audio features to be matched, thereby facilitating similarity matching between the target video feature and the audio features to be matched.
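A minimal sketch of the splicing step and the multilayer perceptron is given below for illustration, assuming PyTorch; the hidden size is an assumption, and the output length is simply set equal to the audio feature length so that similarity matching can be performed directly.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    def __init__(self, visual_dim=256, audio_dim=256, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden_dim), nn.ReLU(),
            # Reduce the spliced feature back to the audio feature length.
            nn.Linear(hidden_dim, audio_dim),
        )

    def forward(self, video_visual_feature, original_audio_feature):
        # Target splicing feature: concatenation of the two one-dimensional features.
        spliced = torch.cat([video_visual_feature, original_audio_feature], dim=-1)
        return self.mlp(spliced)  # target video feature, same length as the audio features
```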
It should be noted that the multilayer perceptron performs dimensionality reduction on the target splicing feature obtained by the splicing processing, so as to output a target video feature that can be matched against the corresponding matched audio features. On this basis, optionally, the multilayer perceptron can be obtained by training in advance through the following steps: acquiring historical splicing features and the audio features to be recommended corresponding to the historical splicing features, taking the historical splicing features and the audio features to be recommended as a group of third training samples, and training a third original neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
The audio features to be recommended may be the audio features of the background music manually selected as best matching a historical video, and the historical splicing feature may be composed of historical visual features and historical audio features extracted from that historical video. For example, taking short videos as an example, if both the unprocessed historical video and the background music of a short video can be obtained, then for short videos with relatively high popularity, historical visual features and historical audio features can be extracted from the unprocessed historical video to obtain the historical splicing feature, and the audio features to be recommended can be extracted from the corresponding background music, which together form a positive third training sample for training the third original neural network model. Similarly, for short videos with low popularity, the corresponding feature vectors can be extracted as negative third training samples to train the third original neural network model. In this way, the target video feature output by the multilayer perceptron has a high matching degree with the manually selected audio features to be recommended.
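Purely as an illustrative assumption (the embodiment does not name a particular loss function), the third training samples could be used to train the multilayer perceptron with a cosine-embedding objective, pulling positive pairs together and pushing negative pairs apart:

```python
import torch
import torch.nn as nn

def train_fusion_mlp(fusion: FusionMLP, third_training_samples, epochs=10, lr=1e-3):
    # third_training_samples: iterable of (historical_splicing_feature,
    # audio_feature_to_recommend, label) tensors, with label +1 for positive
    # samples (popular videos with their original background music) and -1
    # for negative samples.
    loss_fn = nn.CosineEmbeddingLoss()
    optimizer = torch.optim.Adam(fusion.parameters(), lr=lr)
    for _ in range(epochs):
        for spliced, audio_feat, label in third_training_samples:
            # The historical splicing feature is already concatenated, so it is
            # fed directly to the inner perceptron layers.
            pred = fusion.mlp(spliced.unsqueeze(0))
            loss = loss_fn(pred, audio_feat.unsqueeze(0), torch.tensor([float(label)]))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return fusion
```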
And S230, screening out at least one matched audio feature from the multiple audio features to be matched according to the matching degree between the target video feature and the multiple audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music.
According to the technical scheme of the embodiment of the invention, the video visual features and the original audio features are spliced, and the spliced target splicing features are input into the trained multilayer perceptron, so that the multilayer perceptron can perform dimension reduction processing on the target splicing features and improve the similarity between the target splicing features and the matched audio features, and the matched audio features are the audio features of the matched music matched with the target video, so that the accuracy of the automatically obtained matched music is improved.
It should be noted that the above-mentioned "first", "second" and "third" are only used for distinguishing the respective noun concepts, and are not specific limitations to the respective noun concepts. For example, taking the original neural network model as an example, "first", "second", and "third" of "first original neural network model", "second original neural network model", and "third original neural network model" are used only to distinguish the original neural network models, and the content of each original neural network model is not specifically limited.
In order to better understand the concrete implementation process of the above steps, the concrete implementation process of the above music matching method can be as shown in fig. 6. For example, after the video visual extraction model, the audio feature extraction model and the multi-layer perceptron training are completed, all music to be matched in the background music library can be input into the audio feature extraction model, the audio features to be matched of each music to be matched are respectively obtained, and the audio features to be matched and the corresponding relationship between the audio features to be matched and the music to be matched are stored.
In practical application, aiming at each target video, the target video is respectively input into a video visual extraction model and an audio characteristic extraction model to obtain video visual characteristics and original audio characteristics; splicing the video visual features and the original audio features to obtain target splicing features, and inputting the target splicing features into a multi-layer perceptron to obtain target video features capable of representing semantic information of a target video on the whole; similarity calculation is carried out on the target video characteristics and the stored multiple audio characteristics to be matched to obtain the matching degree between the target video characteristics and the stored multiple audio characteristics to be matched, the audio characteristics to be matched with higher matching degree are taken as matched audio characteristics, the music to be matched corresponding to the matched audio characteristics is taken as matched music, and the matched music is background music recommended to the target video. Therefore, the method can automatically recommend the background music matched with the target video and liked by the public for the target video, thereby improving the popularity of the target video.
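For illustration only, the overall flow of fig. 6 might be sketched end to end as follows, reusing the illustrative modules defined above; the music library is assumed to hold audio features to be matched that were computed offline and stored together with their music identifiers.

```python
import torch
import torch.nn.functional as F

def match_music(frames, target_waveform, music_library,
                visual_model, audio_model, fusion, top_k=1):
    # music_library: list of (music_id, audio_feature_to_match) pairs obtained by
    # running every piece of music to be matched through the audio feature
    # extraction model in advance and storing the results.
    with torch.no_grad():
        video_visual_feature = visual_model(frames)
        original_audio_feature = audio_model(target_waveform)
        target_video_feature = fusion(video_visual_feature, original_audio_feature)
        scores = [(music_id,
                   F.cosine_similarity(target_video_feature, feat, dim=-1).item())
                  for music_id, feat in music_library]
    scores.sort(key=lambda s: s[1], reverse=True)
    return scores[:top_k]  # matched music ids with their matching degree
```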
EXAMPLE III
Fig. 7 is a block diagram of a music matching apparatus according to a third embodiment of the present invention, which is configured to execute the music matching method according to any of the embodiments. The device and the music matching method of the embodiments belong to the same inventive concept, and details which are not described in detail in the embodiment of the music matching device can refer to the embodiment of the music matching method. Referring to fig. 7, the apparatus may specifically include: an acquisition module 310, a generation module 320, and a matching module 330.
The acquiring module 310 is configured to acquire a target video and acquire to-be-matched audio features of a plurality of to-be-matched music respectively;
the generating module 320 is configured to extract video visual features and original audio features of the target video, and generate target video features according to the video visual features and the original audio features;
the matching module 330 is configured to screen at least one matched audio feature from the multiple audio features to be matched according to the matching degree between the target video feature and the multiple audio features to be matched, and use the music to be matched corresponding to the matched audio feature as the matched music.
Optionally, the generating module 320 may specifically include:
the video visual characteristic extraction unit is used for inputting a target video to a trained video visual extraction model and extracting the video visual characteristics of the target video, wherein the video visual extraction model comprises a video analysis module, a first convolution neural network module and a recurrent neural network module, and the video analysis module is used for extracting target video data in the target video and analyzing the target video data into multi-frame target images.
Optionally, on the basis of the above apparatus, the apparatus may further include:
the first training sample acquisition module is used for acquiring a historical video and a first historical classification result of historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples;
the video vision extraction model training module is used for training a first original neural network model based on a plurality of first training samples to obtain a video vision extraction model, wherein the first original neural network model comprises a video analysis module, a first convolution neural network module, a recurrent neural network module and a first classification module, and the first classification module is used for processing historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
Optionally, the generating module 320 may specifically include:
and the original audio characteristic extraction unit is used for inputting the target video into the trained audio characteristic extraction model and extracting the original audio characteristics of the target video, wherein the audio characteristic extraction model comprises an audio conversion module and a second convolution neural network module, and the audio conversion module is used for extracting target audio data in the target video and converting the target audio data into a spectrogram.
Optionally, on the basis of the above apparatus, the apparatus may further include:
the second training sample acquisition module is used for acquiring a historical audio and a second historical classification result of the historical audio, and taking the historical audio and the second historical classification result as a group of second training samples;
and the audio characteristic extraction model training module is used for training a second original neural network model based on a plurality of second training samples to obtain an audio characteristic extraction model, wherein the second original neural network model comprises an audio conversion module, a second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio characteristics output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio characteristics.
Optionally, the generating module 320 may include:
the characteristic splicing unit is used for splicing the video visual characteristic and the original audio characteristic to obtain a target splicing characteristic;
and the feature generation unit is used for inputting the target splicing features into the trained multilayer perceptron to obtain the target video features.
Optionally, on the basis of the above apparatus, the apparatus may further include:
and the multilayer perceptron training module is used for acquiring the historical splicing features and the audio features to be recommended corresponding to the historical splicing features, using the historical splicing features and the audio features to be recommended as a group of third training samples, and training a third original neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
In the music matching apparatus provided by the third embodiment of the present invention, the acquisition module and the generation module cooperate with each other: for an acquired target video, a target video feature that represents the semantic information of the target video as a whole is generated from the extracted video visual features and original audio features; by jointly considering the video visual features and the original audio features, the semantic information in the target video can be captured more comprehensively, and generating the target video feature in advance simplifies the subsequent feature matching process. After this quantized representation of the target video is obtained, the matching module can combine it with the quantized representations of the music to be matched to screen out at least one piece of matched music from the plurality of pieces of music to be matched; this quantitative approach can match background music accurately and quickly in any application scenario and therefore has a wide range of applications. With this apparatus, background music can be matched automatically and directly according to the video content, without requiring the user to listen to all the background music material in advance, and the background music that genuinely suits the target video is screened out from the plurality of pieces of music to be matched in an objective, quantitative manner, so that both the efficiency and the effect of background music matching are significantly improved.
The music matching device provided by the embodiment of the invention can execute the music matching method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the music matching apparatus, the included units and modules are merely divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
Example four
Fig. 8 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention, as shown in fig. 8, the terminal includes a memory 410, a processor 420, an input device 430, and an output device 440. The number of the processors 420 in the terminal may be one or more, and one processor 420 is taken as an example in fig. 8; the memory 410, processor 420, input device 430 and output device 440 in the terminal may be connected by a bus or other means, such as by bus 450 in fig. 8.
The memory 410, which is a computer-readable storage medium, may be used to store software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the music matching method in the embodiment of the present invention (e.g., the acquisition module 310, the generation module 320, and the matching module 330 in the music matching apparatus). The processor 420 executes various functional applications of the terminal and data processing, i.e., implements the above-described music matching method, by executing software programs, instructions, and modules stored in the memory 410.
The memory 410 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 410 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 410 may further include memory located remotely from processor 420, which may be connected to devices through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal. The output device 440 may include a display device such as a display screen.
Example Five
An embodiment of the present invention provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a music matching method, the method comprising:
acquiring a target video, and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
extracting video visual features and original audio features of the target video, and generating target video features according to the video visual features and the original audio features;
and screening out at least one matched audio feature from the plurality of audio features to be matched according to the matching degree between the target video feature and the plurality of audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music (an illustrative sketch of these steps is given below).
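The following is a minimal, self-contained sketch of how these three steps could fit together in Python. It is not taken from this disclosure: the trained feature extractors are replaced by random stand-ins so the example runs on its own, cosine similarity is used as one possible matching degree, and all function names, dimensions, and file names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_audio_feature_to_match(music_path: str) -> np.ndarray:
    """Stand-in for the trained audio branch that yields an audio feature to be matched."""
    return rng.standard_normal(128)

def extract_target_video_feature(video_path: str) -> np.ndarray:
    """Stand-in for the fused video visual + original audio feature of the target video."""
    return rng.standard_normal(128)

def matching_degree(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity as one possible quantitative matching degree."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

music_library = ["track_a.mp3", "track_b.mp3", "track_c.mp3"]
audio_features = [extract_audio_feature_to_match(m) for m in music_library]   # step 1
target_feature = extract_target_video_feature("target_video.mp4")             # step 2
best = max(range(len(music_library)),
           key=lambda i: matching_degree(target_feature, audio_features[i]))  # step 3
print("matched music:", music_library[best])
```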
Of course, in the storage medium containing computer-executable instructions provided by the embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above and may also perform related operations in the music matching method provided by any embodiment of the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software together with the necessary general-purpose hardware, or, of course, by hardware alone, although the former is the preferred implementation in many cases. Based on this understanding, the technical solution of the present invention may essentially be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments of the present invention.
It should be noted that the foregoing describes only the preferred embodiments of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; other equivalent embodiments may be included without departing from the spirit of the present invention, and the scope of the present invention is determined by the appended claims.
Claims (10)
1. A music matching method, comprising:
acquiring a target video, and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
extracting video visual features and original audio features of the target video, and generating target video features according to the video visual features and the original audio features;
and screening out at least one matched audio feature from the plurality of audio features to be matched according to the matching degree between the target video feature and the plurality of audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music.
2. The method according to claim 1, wherein extracting the video visual features of the target video comprises:
inputting the target video into a trained video visual extraction model, and extracting the video visual features of the target video, wherein the video visual extraction model comprises a video parsing module, a first convolutional neural network module, and a recurrent neural network module, and the video parsing module is used for extracting target video data in the target video and parsing the target video data into multiple frames of target images.
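As a rough illustration of such a video visual extraction model, the sketch below (PyTorch) encodes each frame produced by the video parsing module with a convolutional network and aggregates the frame sequence with a recurrent network; the layer sizes and the choice of a GRU are assumptions for the sketch, not details taken from this disclosure.

```python
import torch
import torch.nn as nn

class VideoVisualExtractor(nn.Module):
    """Per-frame CNN followed by a recurrent network, as a sketch of the claimed structure."""
    def __init__(self, feature_dim=128):
        super().__init__()
        # First convolutional neural network module: encodes each parsed frame.
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Recurrent neural network module: aggregates the frame sequence over time.
        self.rnn = nn.GRU(input_size=32, hidden_size=feature_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W), produced by the video parsing module.
        b, t, c, h, w = frames.shape
        per_frame = self.frame_cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, hidden = self.rnn(per_frame)
        return hidden[-1]            # (batch, feature_dim) video visual features
```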
3. The method of claim 2, further comprising:
acquiring a historical video and a first historical classification result of historical video data in the historical video, and taking the historical video and the first historical classification result as a group of first training samples;
training a first original neural network model based on a plurality of first training samples to obtain the video visual extraction model, wherein the first original neural network model comprises the video parsing module, the first convolutional neural network module, the recurrent neural network module, and a first classification module, and the first classification module is used for processing the historical visual features output by the recurrent neural network module to obtain a first prediction classification result of the historical visual features.
4. The method of claim 1, wherein extracting the original audio features of the target video comprises:
inputting the target video into a trained audio feature extraction model, and extracting original audio features of the target video, wherein the audio feature extraction model comprises an audio conversion module and a second convolutional neural network module, and the audio conversion module is used for extracting target audio data in the target video and converting the target audio data into a spectrogram.
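A correspondingly minimal sketch of the audio branch is given below; the spectrogram is assumed to have been produced upstream by the audio conversion module (for example via a short-time Fourier transform), and the layer sizes and dimensions are illustrative assumptions rather than specifics of this disclosure.

```python
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    """Sketch of the claimed audio branch: a spectrogram treated as an image by a CNN."""
    def __init__(self, feature_dim=128):
        super().__init__()
        # Second convolutional neural network module operating on the spectrogram.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feature_dim),
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq_bins, time_steps) from the audio conversion module.
        return self.cnn(spectrogram)   # (batch, feature_dim) original audio features
```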
5. The method of claim 4, further comprising:
acquiring historical audio and a second historical classification result of the historical audio, and using the historical audio and the second historical classification result as a group of second training samples;
and training a second original neural network model based on a plurality of second training samples to obtain the audio feature extraction model, wherein the second original neural network model comprises the audio conversion module, the second convolutional neural network module and a second classification module, and the second classification module is used for processing the historical audio features output by the second convolutional neural network module to obtain a second prediction classification result of the historical audio features.
6. The method of claim 1, wherein generating target video features from the video visual features and the original audio features comprises:
splicing the video visual features and the original audio features to obtain target splicing features;
and inputting the target splicing features into a trained multilayer perceptron to obtain the target video features.
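One plausible shape for this fusion step is sketched below: the two feature vectors are concatenated (spliced) and a small multilayer perceptron maps the result into the same space as the audio features to be matched. All dimensions are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Concatenate video visual and original audio features, then fuse them with an MLP."""
    def __init__(self, visual_dim=128, audio_dim=128, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        spliced = torch.cat([visual_feat, audio_feat], dim=-1)   # target splicing features
        return self.mlp(spliced)                                 # target video features
```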
7. The method of claim 6, further comprising:
acquiring a historical splicing feature and an audio feature to be recommended corresponding to the historical splicing feature, taking the historical splicing feature and the audio feature to be recommended as a group of third training samples, and training a third original neural network model based on a plurality of third training samples to obtain the multilayer perceptron.
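A minimal training sketch under one possible reading of this claim is shown below: each third training sample pairs a historical splicing feature with the audio feature of the music to be recommended, and a cosine embedding loss pulls the perceptron output toward that audio feature. The loss choice, dimensions, and the randomly generated data are assumptions made only to keep the sketch runnable.

```python
import torch
import torch.nn as nn

# Fusion perceptron: 256-dimensional splicing feature in, 128-dimensional video feature out.
mlp = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loss_fn = nn.CosineEmbeddingLoss()   # pull the fused feature toward the recommended audio feature

# Dummy third training samples; in practice these come from historical videos and their music.
history_splice = torch.randn(64, 256)       # historical splicing features
recommended_audio = torch.randn(64, 128)    # audio features to be recommended

for epoch in range(10):
    optimizer.zero_grad()
    predicted = mlp(history_splice)
    loss = loss_fn(predicted, recommended_audio, torch.ones(len(predicted)))
    loss.backward()
    optimizer.step()
```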
8. A music matching apparatus, comprising:
the acquisition module is used for acquiring a target video and respectively acquiring audio features to be matched of a plurality of pieces of music to be matched;
the generating module is used for extracting video visual characteristics and original audio characteristics of the target video and generating target video characteristics according to the video visual characteristics and the original audio characteristics;
and the matching module is used for screening out at least one matched audio feature from the plurality of audio features to be matched according to the matching degree between the target video feature and the plurality of audio features to be matched, and taking the music to be matched corresponding to the matched audio feature as the matched music.
9. A terminal, characterized in that the terminal comprises:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the music matching method according to any one of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a music matching method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911128158.6A CN110839173A (en) | 2019-11-18 | 2019-11-18 | Music matching method, device, terminal and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110839173A true CN110839173A (en) | 2020-02-25 |
Family
ID=69576612
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911128158.6A Pending CN110839173A (en) | 2019-11-18 | 2019-11-18 | Music matching method, device, terminal and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110839173A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060122842A1 (en) * | 2004-12-03 | 2006-06-08 | Magix Ag | System and method of automatically creating an emotional controlled soundtrack |
CN109063163A (en) * | 2018-08-14 | 2018-12-21 | 腾讯科技(深圳)有限公司 | A kind of method, apparatus, terminal device and medium that music is recommended |
CN109587554A (en) * | 2018-10-29 | 2019-04-05 | 百度在线网络技术(北京)有限公司 | Processing method, device and the readable storage medium storing program for executing of video data |
CN109446990A (en) * | 2018-10-30 | 2019-03-08 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating information |
CN109359636A (en) * | 2018-12-14 | 2019-02-19 | 腾讯科技(深圳)有限公司 | Video classification methods, device and server |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111246124B (en) * | 2020-03-09 | 2021-05-25 | 三亚至途科技有限公司 | Multimedia digital fusion method and device |
CN111246124A (en) * | 2020-03-09 | 2020-06-05 | 三亚至途科技有限公司 | Multimedia digital fusion method and device |
CN111314771A (en) * | 2020-03-13 | 2020-06-19 | 腾讯科技(深圳)有限公司 | Video playing method and related equipment |
CN111314771B (en) * | 2020-03-13 | 2021-08-27 | 腾讯科技(深圳)有限公司 | Video playing method and related equipment |
CN113496243B (en) * | 2020-04-07 | 2024-09-20 | 北京达佳互联信息技术有限公司 | Background music acquisition method and related products |
CN113496243A (en) * | 2020-04-07 | 2021-10-12 | 北京达佳互联信息技术有限公司 | Background music obtaining method and related product |
CN111488485A (en) * | 2020-04-16 | 2020-08-04 | 北京雷石天地电子技术有限公司 | Music recommendation method based on convolutional neural network, storage medium and electronic device |
CN111488485B (en) * | 2020-04-16 | 2023-11-17 | 北京雷石天地电子技术有限公司 | Music recommendation method based on convolutional neural network, storage medium and electronic device |
CN111681679B (en) * | 2020-06-09 | 2023-08-25 | 杭州星合尚世影视传媒有限公司 | Video object sound effect searching and matching method, system, device and readable storage medium |
CN111681680B (en) * | 2020-06-09 | 2023-08-25 | 杭州星合尚世影视传媒有限公司 | Method, system, device and readable storage medium for acquiring audio frequency by video recognition object |
CN111681678B (en) * | 2020-06-09 | 2023-08-22 | 杭州星合尚世影视传媒有限公司 | Method, system, device and storage medium for automatically generating sound effects and matching videos |
CN111681678A (en) * | 2020-06-09 | 2020-09-18 | 杭州星合尚世影视传媒有限公司 | Method, system, device and storage medium for automatically generating sound effect and matching video |
CN111681680A (en) * | 2020-06-09 | 2020-09-18 | 杭州星合尚世影视传媒有限公司 | Method, system and device for acquiring audio by video recognition object and readable storage medium |
CN111681679A (en) * | 2020-06-09 | 2020-09-18 | 杭州星合尚世影视传媒有限公司 | Video object sound effect searching and matching method, system and device and readable storage medium |
CN111918094A (en) * | 2020-06-29 | 2020-11-10 | 北京百度网讯科技有限公司 | Video processing method and device, electronic equipment and storage medium |
CN114513615A (en) * | 2020-11-16 | 2022-05-17 | 阿里巴巴集团控股有限公司 | Video processing method and device, processor and electronic equipment |
EP4207746A4 (en) * | 2020-12-10 | 2024-01-31 | Shanghai Hode Information Technology Co., Ltd. | Method and apparatus for constructing background audio |
CN112584062B (en) * | 2020-12-10 | 2023-08-08 | 上海幻电信息科技有限公司 | Background audio construction method and device |
CN112584062A (en) * | 2020-12-10 | 2021-03-30 | 上海哔哩哔哩科技有限公司 | Background audio construction method and device |
CN113572981A (en) * | 2021-01-19 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Video dubbing method and device, electronic equipment and storage medium |
CN113032616B (en) * | 2021-03-19 | 2024-02-20 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio recommendation method, device, computer equipment and storage medium |
CN113032616A (en) * | 2021-03-19 | 2021-06-25 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio recommendation method and device, computer equipment and storage medium |
CN113190709A (en) * | 2021-03-31 | 2021-07-30 | 浙江大学 | Background music recommendation method and device based on short video key frame |
CN113486214A (en) * | 2021-07-23 | 2021-10-08 | 广州酷狗计算机科技有限公司 | Music matching method and device, computer equipment and storage medium |
WO2023046040A1 (en) * | 2021-09-24 | 2023-03-30 | 海尔智家股份有限公司 | Inventory management system in refrigeration appliance |
CN115760020B (en) * | 2023-01-05 | 2023-04-14 | 北京北测数字技术有限公司 | Video integration media content analysis management system based on Internet |
CN115760020A (en) * | 2023-01-05 | 2023-03-07 | 北京北测数字技术有限公司 | Video fusion media content analysis management system based on Internet |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110839173A (en) | Music matching method, device, terminal and storage medium | |
KR102416558B1 (en) | Video data processing method, device and readable storage medium | |
US10824874B2 (en) | Method and apparatus for processing video | |
CN108986186B (en) | Method and system for converting text into video | |
CN108419094B (en) | Video processing method, video retrieval method, device, medium and server | |
WO2019242222A1 (en) | Method and device for use in generating information | |
CN109862397B (en) | Video analysis method, device, equipment and storage medium | |
CN110751030A (en) | Video classification method, device and system | |
WO2022188644A1 (en) | Word weight generation method and apparatus, and device and medium | |
Le et al. | NII-HITACHI-UIT at TRECVID 2016. | |
CN109582825B (en) | Method and apparatus for generating information | |
CN114282047A (en) | Small sample action recognition model training method and device, electronic equipment and storage medium | |
CN111737516A (en) | Interactive music generation method and device, intelligent sound box and storage medium | |
CN116665083A (en) | Video classification method and device, electronic equipment and storage medium | |
CN113742524A (en) | Video quick retrieval method and system and video quick recommendation method | |
US20230326369A1 (en) | Method and apparatus for generating sign language video, computer device, and storage medium | |
CN114510564A (en) | Video knowledge graph generation method and device | |
JP2020173776A (en) | Method and device for generating video | |
Tran et al. | Predicting Media Memorability Using Deep Features with Attention and Recurrent Network. | |
Tran-Van et al. | Predicting Media Memorability Using Deep Features and Recurrent Network. | |
CN113268635B (en) | Video processing method, device, server and computer readable storage medium | |
CN116028669A (en) | Video searching method, device and system based on short video and storage medium | |
CN117009577A (en) | Video data processing method, device, equipment and readable storage medium | |
CN115272660A (en) | Lip language identification method and system based on double-flow neural network | |
CN110489592B (en) | Video classification method, apparatus, computer device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200225 |