CN108831423B

CN108831423B - Method, device, terminal and storage medium for extracting main melody tracks from audio data

Info

Publication number: CN108831423B
Application number: CN201810537265.3A
Authority: CN
Inventors: 孔令城
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2018-05-30
Filing date: 2018-05-30
Publication date: 2023-06-06
Anticipated expiration: 2038-05-30
Also published as: CN108831423A

Abstract

The application discloses a method, a device, a terminal and a storage medium for extracting a main melody track from audio data, belonging to the field of audio processing. The method comprises the following steps: extracting a plurality of tracks from target audio data, and determining time period information of the voiced time periods in each track to obtain a time period information set corresponding to each track; determining time period information of each line of lyrics in lyric information corresponding to the target audio data to obtain a time period information set corresponding to the lyric information; determining the matching degree between the time period information set corresponding to each track and the time period information set corresponding to the lyric information; and determining the track with the highest matching degree as the main melody track of the target audio data. This solves the problem that the existing approach of excluding tracks one by one is unsuitable for audio with unconventional arrangement styles such as alternative pop, where a non-main-melody track is easily determined as the main melody track, and thereby improves the generality and accuracy of identifying the main melody track in audio.

Description

Method, device, terminal and storage medium for extracting main melody tracks from audio data

Technical Field

The embodiment of the application relates to the field of audio processing, in particular to a method, a device, a terminal and a storage medium for extracting main melody tracks in audio data.

Background

The Musical Instrument Digital Interface (MIDI) is an interface used to generate musical audio. Each piece of MIDI audio may comprise multiple tracks, each containing the music of a different instrument. In MIDI audio, one track typically stores the main melody and the other tracks store the accompaniment.

A server may provide services such as music analysis, music retrieval, music recognition and similar-music recommendation based on the main melody of audio. In the related art, the tracks in MIDI audio are excluded one by one, and the single remaining track is determined as the main melody of the MIDI audio.

For MIDI audio with unconventional arrangement styles such as alternative pop, however, excluding tracks one by one easily causes a non-main-melody track to be determined as the main melody track of the MIDI audio. Therefore, how to effectively determine the main melody track of a song is a major problem to be solved.

Disclosure of Invention

In order to solve the problems in the prior art, embodiments of the present application provide a method, an apparatus, a terminal, and a storage medium for extracting a main melody track from audio data. The technical solutions are as follows:

According to a first aspect of embodiments of the present application, there is provided a method of extracting a main melody track from audio data, the method including:

extracting a plurality of tracks from the target audio data, and determining time period information of the voiced time periods in each track to obtain a time period information set corresponding to each track;

determining time period information of each line of lyrics in lyric information corresponding to the target audio data to obtain a time period information set corresponding to the lyric information;

determining the matching degree between the time period information set corresponding to each track and the time period information set corresponding to the lyric information;

and determining the track with the highest matching degree as the main melody track of the target audio data.

According to a second aspect of embodiments of the present application, there is provided an apparatus for extracting a main melody track from audio data, the apparatus including:

a first determining module, configured to extract a plurality of tracks from the target audio data, and determine time period information of the voiced time periods in each track to obtain a time period information set corresponding to each track;

a second determining module, configured to determine time period information of each line of lyrics in lyric information corresponding to the target audio data to obtain a time period information set corresponding to the lyric information;

a third determining module, configured to determine the matching degree between the time period information set corresponding to each track and the time period information set corresponding to the lyric information;

and a fourth determining module, configured to determine the track with the highest matching degree as the main melody track of the target audio data.

According to a third aspect of embodiments of the present application, there is provided a terminal, the terminal comprising a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the method of extracting a main melody track from audio data according to the first aspect.

According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement a method of extracting a main melody track from audio data according to the first aspect.

The technical solutions provided by the embodiments of the present application bring the following beneficial effects:

The time period information sets corresponding to the plurality of tracks in the target audio data are each matched against the time period information set corresponding to the lyric information of the target audio data, and the track with the highest matching degree is determined as the main melody track of the target audio data. This solves the problem that the existing approach of excluding tracks one by one is unsuitable for audio with unconventional arrangement styles such as alternative pop, where a non-main-melody track is easily determined as the main melody track, and thereby improves the generality and accuracy of identifying the main melody track in audio.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1A is a flow chart of a method for extracting a main melody track from audio data according to one embodiment of the present application;

FIG. 1B is a comparison of a set of time period information corresponding to each track with a set of time period information corresponding to lyric information provided in one embodiment of the present application;

FIG. 2 is a flow chart of a method for extracting a main melody track from audio data according to another embodiment of the present application;

FIG. 3 is a block diagram showing an apparatus for extracting a main melody track from audio data according to one embodiment of the present application;

FIG. 4 is a block diagram of a terminal 400 according to an exemplary embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Fig. 1A is a flowchart of a method for extracting a main melody track from audio data according to an embodiment of the present application, and the method for extracting a main melody track from audio data includes the following steps as shown in fig. 1A.

Step 101, extracting a plurality of audio tracks in the target audio data, and determining time period information of a voice time period in each audio track to obtain a time period information set corresponding to each audio track.

In this embodiment, the target audio data includes, but is not limited to, songs, music, and hummed songs, and the target audio data may be obtained locally or from a server.

In this embodiment, the format of the target audio data is MIDI format.

In audio data in MIDI format, there is usually one track that stores the main melody and a plurality of tracks that store accompaniment melodies, and each track usually contains voiced time periods and silent time periods. Because the silent time periods have no reference value in the subsequent process of determining the main melody track, after the plurality of tracks are extracted from the target audio data, only the time period information of the voiced time periods in each track needs to be determined, yielding one time period information set per track. This spares the terminal unnecessary processing.

Audio data in MIDI format is usually an instruction file with the .mid extension, and the file includes at least the start time and the end time of all voiced time periods of each track. Each track in the target audio data can therefore be extracted from the instruction file corresponding to the target audio data, and the time period information set corresponding to each track can be obtained. The time period information set corresponding to a track includes the start time and the end time of all voiced time periods of that track in the audio data.
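The following is a minimal sketch of how voiced time periods might be assembled once note events have been read from the .mid file. The source does not specify how individual notes are grouped into periods, so the function name `voiced_periods`, the `(start_ms, end_ms)` note tuples, and the `max_gap_ms` merge tolerance are all assumptions made for illustration.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def voiced_periods(notes: List[Period], max_gap_ms: int = 0) -> List[Period]:
    """Merge the note intervals of one track into voiced (non-silent) periods.

    Notes that overlap, or that are separated by at most max_gap_ms of
    silence, are merged into one period. max_gap_ms is a hypothetical
    parameter, not something defined in the source.
    """
    merged: List[Period] = []
    for start, end in sorted(notes):
        if merged and start - merged[-1][1] <= max_gap_ms:
            # Extend the current period instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical note events for two tracks, in milliseconds.
tracks = {
    1: [(600, 1200), (1200, 2000), (6300, 9600)],
    2: [(501, 1580), (6000, 7000)],
}
period_sets = {k: voiced_periods(notes) for k, notes in tracks.items()}
```

With the default tolerance of 0 ms, back-to-back notes are joined and any silence starts a new period; a larger tolerance would produce fewer, longer periods.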

Optionally, the start times and the end times of all voiced time periods of each track in the audio data are respectively represented by two-dimensional arrays, denoted as:

$$\mathrm{midi\_st}^{k} = [t_1^{k}, t_2^{k}, \ldots, t_{k_m}^{k}]$$

$$\mathrm{midi\_et}^{k} = [t_1^{k}, t_2^{k}, \ldots, t_{k_m}^{k}]$$

where $k$ denotes the track number, $\mathrm{midi\_st}^{k}$ records the start times of all voiced time periods of the $k$-th track, $\mathrm{midi\_et}^{k}$ records the end times of all voiced time periods of the $k$-th track, $t_i^{k}$ denotes the specific start/end moment of the $i$-th voiced time period in the $k$-th track, in milliseconds, and $k_m$ denotes the number of voiced time periods in the $k$-th track.

Step 102, determining the time period information of each sentence of lyrics in the lyrics information corresponding to the target audio data, and obtaining a time period information set corresponding to the lyrics information.

In this embodiment, the lyric information corresponding to the target audio data describes the content performed in the target audio data.

Taking lyric information abc of target audio data ABC as an example, the lyric information abc is as follows:

[628,1980]a1a2a3a4a5a6，

[6301,9523]b1b2b3b4b5b6，

[12002,54301]c1c2c3c4c5c6，

……

In the lyric information abc, "a1a2a3a4a5a6", "b1b2b3b4b5b6" and "c1c2c3c4c5c6" are lines of lyrics. The "[ ]" before each line is the time attribute description text of that line, and its content describes the time attribute of the line, usually in ms. The time attribute of a line of lyrics includes its start time and its end time. For example, [628,1980] is the time attribute description text of the line "a1a2a3a4a5a6", where "628" is the start time of the line and "1980" is its end time. From this text it is known that the line "a1a2a3a4a5a6" is played from 628 ms to 1980 ms, that is, it starts playing at 628 ms and finishes playing at 1980 ms.

Lyric information is usually a lyric file with the .qrc extension, and the file includes at least the lyrics and the start time and end time corresponding to each line of lyrics. The time period information of each line of lyrics can therefore be extracted from the lyric file corresponding to the target audio data, and the time period information set corresponding to the lyric information can be obtained.

Optionally, the start time and the end time corresponding to each line of lyrics are respectively represented by two-dimensional arrays, denoted as:

$$\mathrm{qrc\_st} = [t_1, t_2, \ldots, t_n]$$

$$\mathrm{qrc\_et} = [t_1, t_2, \ldots, t_n]$$

where $\mathrm{qrc\_st}$ records the sequence of start times of all lines of lyrics included in the lyric information, $\mathrm{qrc\_et}$ records the sequence of end times of all lines of lyrics included in the lyric information, $t_i$ denotes the specific start/end time of the $i$-th line of lyrics, in milliseconds, and $n$ denotes the number of lines of lyrics.
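As an illustration, the two sequences can be built by parsing lines in the `[start,end]text` form used in the lyric information abc sample. This is a sketch under that assumption only: real .qrc files may be laid out differently, and the names `LINE_RE` and `parse_lyric_times` are hypothetical.

```python
import re
from typing import List, Tuple

# Matches the time tag of a line such as "[628,1980]a1a2a3a4a5a6".
LINE_RE = re.compile(r"^\[(\d+),(\d+)\]")


def parse_lyric_times(text: str) -> Tuple[List[int], List[int]]:
    """Return (qrc_st, qrc_et): start and end times in ms of each lyric line."""
    qrc_st: List[int] = []
    qrc_et: List[int] = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip blank lines and lines without a time tag
        qrc_st.append(int(match.group(1)))
        qrc_et.append(int(match.group(2)))
    return qrc_st, qrc_et


sample = """[628,1980]a1a2a3a4a5a6
[6301,9523]b1b2b3b4b5b6
[12002,54301]c1c2c3c4c5c6
"""
qrc_st, qrc_et = parse_lyric_times(sample)
print(qrc_st)  # [628, 6301, 12002]
print(qrc_et)  # [1980, 9523, 54301]
```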

Step 103, determining the matching degree of the time period information set corresponding to each music track and the time period information set corresponding to the lyric information.

Specifically, for each piece of time period information A_i in the time period information set corresponding to the lyric information, the time period information set corresponding to each track is searched for time period information B_j that satisfies a preset matching condition with A_i. The ratio of the number of A_i for which a corresponding B_j can be found to the total number of pieces of time period information in the set corresponding to the lyric information is determined as the matching degree between the time period information set corresponding to that track and the time period information set corresponding to the lyric information.

Wherein i is an integer between 1 and n, and j is an integer between 1 and m.
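The ratio described above can be written compactly as follows. The symbols $M_k$ (the matching degree of the $k$-th track) and $\mathrm{match}(\cdot,\cdot)$ (the preset matching condition) are notation introduced here for illustration and do not appear in the source.

$$M_k = \frac{\bigl|\{\, i \in \{1,\ldots,n\} : \exists\, j \in \{1,\ldots,m\},\ \mathrm{match}(A_i, B_j) \,\}\bigr|}{n}$$

Here $n$ is the number of pieces of time period information in the lyric set and $m$ is the number in the set of the track under consideration, so $0 \le M_k \le 1$. Each $A_i$ is counted at most once, however many $B_j$ satisfy the condition.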

Optionally, the preset matching conditions at least include the following two cases:

In the first case, the preset matching condition is that the time difference between the start time of A_i and the start time of B_j is within a preset first threshold, and the time difference between the end time of A_i and the end time of B_j is within the first threshold.

Take as an example a preset first threshold of 500 ms, where the time period information set corresponding to the lyric information includes A_1 = [628,1980], A_2 = [6301,9523] and A_3 = [12002,54301], and the time period information sets corresponding to the tracks include B_1 = [600,2000], B_2 = [6300,9600] and B_3 = [12000,54400] for the first track, and C_1 = [501,1580], C_2 = [6000,7000] and C_3 = [10000,53000] for the second track. For A_1, the terminal searches the time period information sets corresponding to the tracks and finds B_1 and C_1, whose start times are within 500 ms of the start time of A_1 and whose end times are within 500 ms of the end time of A_1. Likewise, it finds B_2 and C_2 for A_2, and B_3 for A_3. Because 3 pieces of time period information satisfying the preset matching condition are found in the set corresponding to the first track and 2 are found in the set corresponding to the second track, the matching degree between the time period information set corresponding to the first track and the time period information set corresponding to the lyric information is 1, and the matching degree between the time period information set corresponding to the second track and the time period information set corresponding to the lyric information is 2/3.
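The first matching condition can be sketched as below using the numbers from this example. The function names are hypothetical, and treating "within the threshold" as an inclusive comparison (`<=`) is an assumption, since the source does not say whether the bound is strict.

```python
from fractions import Fraction
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def matches_case1(a: Period, b: Period, threshold_ms: int) -> bool:
    """First condition: start and end times each differ by at most threshold_ms."""
    return abs(a[0] - b[0]) <= threshold_ms and abs(a[1] - b[1]) <= threshold_ms


def matching_degree(lyrics: List[Period], track: List[Period],
                    threshold_ms: int) -> Fraction:
    """Fraction of lyric periods A_i that have at least one matching B_j."""
    if not lyrics:
        return Fraction(0)  # assumption: no lyric periods means no match
    found = sum(
        1 for a in lyrics
        if any(matches_case1(a, b, threshold_ms) for b in track)
    )
    return Fraction(found, len(lyrics))


lyrics = [(628, 1980), (6301, 9523), (12002, 54301)]        # A1, A2, A3
first_track = [(600, 2000), (6300, 9600), (12000, 54400)]   # B1, B2, B3
second_track = [(501, 1580), (6000, 7000), (10000, 53000)]  # C1, C2, C3

degree_first = matching_degree(lyrics, first_track, 500)
degree_second = matching_degree(lyrics, second_track, 500)
```

Evaluated literally on these numbers, the sketch gives 1 for the first track and 1/3 for the second, because the end time of C2 (7000) differs from the end time of A2 (9523) by 2523 ms. The text reports 2/3 for the second track. Under either figure the first track has the highest matching degree.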

In the second case, the preset matching condition is that the sum of the time difference between the start time of A_i and the start time of B_j and the time difference between the end time of A_i and the end time of B_j is within a preset second threshold.

Take as an example a preset second threshold of 500 ms, where the time period information set corresponding to the lyric information includes A_1 = [628,1980], A_2 = [6301,9523] and A_3 = [12002,54301], and the time period information sets corresponding to the tracks include B_1 = [600,2000], B_2 = [6300,9600] and B_3 = [12000,54400] for the first track, and C_1 = [501,1580], C_2 = [6000,7000] and C_3 = [10000,53000] for the second track. For A_1, the terminal searches the time period information sets corresponding to the tracks and finds B_1, for which the sum of the start-time difference and the end-time difference with respect to A_1 is within 500 ms. Likewise, it finds B_2 for A_2 and B_3 for A_3. Because 3 pieces of time period information satisfying the preset matching condition are found in the set corresponding to the first track and 0 are found in the set corresponding to the second track, the matching degree between the time period information set corresponding to the first track and the time period information set corresponding to the lyric information is 1, and the matching degree between the time period information set corresponding to the second track and the time period information set corresponding to the lyric information is 0.
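A sketch of the second condition on the same numbers follows. The names `deviation` and `matching_degree_sum` are hypothetical, and the inclusive comparison is again an assumption.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def deviation(a: Period, b: Period) -> int:
    """Start-time difference plus end-time difference, in milliseconds."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def matching_degree_sum(lyrics: List[Period], track: List[Period],
                        threshold_ms: int) -> float:
    """Share of lyric periods with some track period whose deviation fits."""
    if not lyrics:
        return 0.0  # assumption: no lyric periods means no match
    found = sum(
        1 for a in lyrics
        if any(deviation(a, b) <= threshold_ms for b in track)
    )
    return found / len(lyrics)


lyrics = [(628, 1980), (6301, 9523), (12002, 54301)]        # A1, A2, A3
first_track = [(600, 2000), (6300, 9600), (12000, 54400)]   # B1, B2, B3
second_track = [(501, 1580), (6000, 7000), (10000, 53000)]  # C1, C2, C3

# Deviations of the aligned pairs: A1/B1, A2/B2, A3/B3.
pair_sums = [deviation(a, b) for a, b in zip(lyrics, first_track)]
print(pair_sums)                                       # [48, 78, 101]
print(deviation(lyrics[0], second_track[0]))           # 527, above 500
print(matching_degree_sum(lyrics, first_track, 500))   # 1.0
print(matching_degree_sum(lyrics, second_track, 500))  # 0.0
```

The summed condition is stricter than the first one for the same threshold value: C1 passes the per-endpoint check (127 ms and 400 ms) but fails here because the two differences add up to 527 ms.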

Note that, the specific values and setting manners of the preset first threshold value and the preset second threshold value are not limited in this embodiment.

Step 104, determining the track with the highest matching degree as the main melody track of the target audio data.

Continuing with the example in step 103:

In the first case, the terminal determines the first track, which has the highest matching degree (1 > 2/3), as the main melody track of the target audio data.

In the second case, the terminal determines the first track, which has the highest matching degree (1 > 0), as the main melody track of the target audio data.

Fig. 1B is a comparison chart of the time period information set corresponding to each track and the time period information set corresponding to the lyric information, provided in an embodiment of the present application. As shown in Fig. 1B, the horizontal axis represents the playing duration, position 0 on the vertical axis represents the time period information set corresponding to the lyric information, and positions 1 to 12 on the vertical axis respectively represent the time period information sets corresponding to the individual tracks. It can be seen intuitively from Fig. 1B that the time period information set of the track at position 1 has the highest matching degree with the time period information set corresponding to the lyric information, so the track at position 1 is determined as the main melody track of the target audio data.

In summary, in the method for extracting a main melody track from audio data provided in this embodiment, the time period information sets corresponding to the plurality of tracks in the target audio data are each matched against the time period information set corresponding to the lyric information of the target audio data, and the track with the highest matching degree is determined as the main melody track of the target audio data. This works because, among all tracks of the target audio data, the time period information set of the main melody track generally matches the time period information of the lyric information most closely. The method solves the problem that the existing approach of excluding tracks one by one is unsuitable for audio with unconventional arrangement styles such as alternative pop, where a non-main-melody track is easily determined as the main melody track, and thereby improves the generality and accuracy of identifying the main melody track in audio.

If, within a preset limited time range, the time period information in the time period information set corresponding to a track satisfies the preset matching condition, then with high probability it also satisfies the preset matching condition in the time periods outside that range. Therefore, to reduce the processing load on the processor, the terminal only needs to perform the subsequent calculation on a segment of the target audio data.

Fig. 2 is a flowchart of a method for extracting a main melody track from audio data according to another embodiment of the present application, and as shown in fig. 2, the method for extracting a main melody track from audio data includes the following steps.

Step 201, extracting a plurality of audio tracks in the target audio data, and determining time period information of a voice time period in each audio track within a preset limited time range to obtain a time period information set corresponding to each audio track.

If the earliest time in the time slot information set corresponding to the lyric information of the target audio data is 25000 and the latest time is 225000, the selectable limited time range in the target audio data is [25000,225000].

Taking a preset limited time range of [40000,100000] as an example, after extracting the plurality of tracks from the target audio data, the terminal determines the time period information of the voiced time periods of each track within the limited time range [40000,100000], and obtains the time period information set ([60000,200000], [630000,960000]) corresponding to the first track and the time period information set ([50100,158000], [600000,700000]) corresponding to the second track.

It should be noted that this embodiment does not limit the value range or the setting manner of the preset limited time range.

Step 202, determining time period information of each sentence of lyrics in a limited time range in lyric information corresponding to target audio data, and obtaining a time period information set corresponding to the lyric information.

Continuing with the illustration in step 201, when the preset limited time range is [40000,100000], the terminal determines the time period information of each sentence of lyrics in the limited time range [40000,100000] in the lyric information corresponding to the target audio data, and obtains the time period information set corresponding to the lyric information.
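Restricting both the track periods and the lyric periods to a limited time range might look like the sketch below. The source does not say how a period that straddles the boundary is handled; keeping only periods that lie entirely inside the range is an assumption, as are the function names and the sample data.

```python
from typing import List, Tuple

Period = Tuple[int, int]  # (start_ms, end_ms)


def lyric_span(lyrics: List[Period]) -> Period:
    """Widest selectable range: earliest start to latest end of the lyrics."""
    return min(s for s, _ in lyrics), max(e for _, e in lyrics)


def within_range(periods: List[Period], time_range: Period) -> List[Period]:
    """Keep only the periods lying entirely inside time_range (assumption)."""
    low, high = time_range
    return [(s, e) for s, e in periods if low <= s and e <= high]


# Hypothetical data, in milliseconds.
lyrics = [(25000, 30000), (41000, 52000), (200000, 225000)]
track = [(30000, 45000), (41200, 51800), (90000, 120000)]

limited = (40000, 100000)
print(lyric_span(lyrics))             # (25000, 225000)
print(within_range(lyrics, limited))  # [(41000, 52000)]
print(within_range(track, limited))   # [(41200, 51800)]
```

The same range is applied to the lyric set and to every track set, so that the matching degree is computed over comparable segments.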

Step 203, determining the matching degree of the time period information set corresponding to each track and the time period information set corresponding to the lyric information.

Step 204, among the tracks whose matching degree reaches a preset matching degree threshold, determining the track with the highest matching degree as the main melody track of the target audio data.

In the case that the target audio data does not contain the main melody, in order to avoid that the track with the highest matching degree is misjudged as the main melody track of the audio data by the terminal, a matching degree threshold value is preset, and the track with the corresponding matching degree reaching the preset matching degree threshold value is determined as the candidate track of the main melody track.

After obtaining the matching degree between the time period information set corresponding to each track and the time period information set corresponding to the lyric information, the terminal first eliminates the tracks whose matching degree does not reach the preset matching degree threshold, and determines the main melody track only among the tracks whose matching degree reaches the threshold. If no tracks remain after this elimination, the terminal determines that the target audio data does not contain a main melody track.
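The threshold filter followed by the selection can be sketched as follows. The function name `pick_main_melody`, the use of integer track identifiers, and the threshold value 0.8 are hypothetical; the source leaves the threshold unspecified.

```python
from typing import Dict, Optional


def pick_main_melody(degrees: Dict[int, float],
                     min_degree: float) -> Optional[int]:
    """Return the best track among those reaching min_degree, or None.

    degrees maps a track identifier to its matching degree. None signals
    that the audio is judged to contain no main melody track.
    """
    candidates = {k: d for k, d in degrees.items() if d >= min_degree}
    if not candidates:
        return None
    # On a tie, max() keeps the first candidate encountered (assumption).
    return max(candidates, key=candidates.get)


print(pick_main_melody({1: 1.0, 2: 0.0}, min_degree=0.8))  # 1
print(pick_main_melody({1: 0.2, 2: 0.1}, min_degree=0.8))  # None
```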

It should be noted that, since step 203 is similar to step 103 in the present embodiment, the description of step 203 is not repeated in the present embodiment.

In this embodiment, if the time period information in the time period information set corresponding to a track satisfies the preset matching condition within a preset limited time range, then with high probability it also satisfies the preset matching condition in the time periods outside that range. Therefore, to reduce the processing load on the processor, the terminal only needs to perform the subsequent calculation on a segment of the target audio data.

In this embodiment, in order to avoid that the terminal misjudges the track with the highest matching degree as the main melody track of the audio data when the target audio data does not include the main melody, a matching degree threshold is preset, and the track with the matching degree reaching the preset matching degree threshold is determined as the candidate track of the main melody track.

The following are device embodiments of the present application. For details not described in the device embodiments, reference may be made to the corresponding method embodiments above.

Referring to fig. 3, a block diagram of an apparatus for extracting a main melody track from audio data according to an embodiment of the present application is shown. The device comprises: the first determination module 301, the second determination module 302, the third determination module 303, and the fourth determination module 304.

A first determining module 301, configured to extract a plurality of audio tracks in the target audio data, determine time period information of a voice time period in each audio track, and obtain a time period information set corresponding to each audio track;

a second determining module 302, configured to determine, in lyric information corresponding to the target audio data, time period information of each lyric, and obtain a time period information set corresponding to the lyric information;

a third determining module 303, configured to determine a matching degree of a time period information set corresponding to each track and a time period information set corresponding to lyric information;

the fourth determining module 304 is configured to determine the corresponding track with the highest matching degree as the main melody track of the target audio data.

In summary, in the device for extracting a main melody track from audio data provided in this embodiment, the time period information sets corresponding to the plurality of tracks in the target audio data are each matched against the time period information set corresponding to the lyric information of the target audio data, and the track with the highest matching degree is determined as the main melody track of the target audio data. This works because, among all tracks of the target audio data, the time period information set of the main melody track generally matches the time period information of the lyric information most closely. The device solves the problem that the existing approach of excluding tracks one by one is unsuitable for audio with unconventional arrangement styles such as alternative pop, where a non-main-melody track is easily determined as the main melody track, and thereby improves the generality and accuracy of identifying the main melody track in audio.

Based on the apparatus for extracting a main melody track in audio data provided in the foregoing embodiment, optionally, the first determining module is further configured to determine time period information of a voice time period in each track within a preset limited time range;

the second determining module is further configured to determine time period information of each sentence of lyrics within a limited time range, and obtain a time period information set corresponding to the lyric information.

Optionally, the third determining module includes:

a searching unit, configured to, for each piece of time period information A_i in the time period information set corresponding to the lyric information, compare A_i in turn with the start time and the end time of each piece of time period information in the time period information set corresponding to the track, and search for time period information B_j that satisfies a preset matching condition with A_i, where i is an integer between 1 and n, and j is an integer between 1 and m;

a determining unit, configured to determine the ratio of the number of A_i for which a corresponding B_j is found to the total number of pieces of time period information in the time period information set corresponding to the lyric information as the matching degree between the time period information set corresponding to the track and the time period information set corresponding to the lyric information.

Optionally, the preset matching condition is that the time difference between the start time of A_i and the start time of B_j is within a preset first threshold, and the time difference between the end time of A_i and the end time of B_j is within the first threshold; or,

the preset matching condition is that the sum of the time difference between the start time of A_i and the start time of B_j and the time difference between the end time of A_i and the end time of B_j is within a preset second threshold.

Optionally, the fourth determining module is further configured to determine, as the main melody track of the audio data, a track with the highest matching degree among tracks with the matching degree reaching a preset matching degree threshold.

Optionally, the format of the target audio data is a MIDI format.

It should be noted that: the device for extracting the main melody track from the audio data provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the server is divided into different functional modules to perform all or part of the functions described above. In addition, the device for extracting the main melody tracks in the audio data and the method for extracting the main melody tracks in the audio data provided in the above embodiments belong to the same concept, and detailed implementation processes of the device are shown in the method embodiments, which are not repeated here.

An embodiment of the application also provides a computer-readable storage medium, which may be a computer-readable storage medium contained in a memory, or may be a standalone computer-readable storage medium that is not incorporated into the intelligent terminal. The computer-readable storage medium stores at least one instruction that is used by one or more processors to perform the method of extracting a main melody track from audio data described above.

Fig. 4 shows a block diagram of a terminal 400 according to an exemplary embodiment of the present application. The terminal 400 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 400 may also be referred to by other names, such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the terminal 400 includes: a processor 401 and a memory 402.

Processor 401 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 401 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 401 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 401 may integrate a GPU (Graphics Processing Unit), which renders and draws the content to be displayed on the display screen. In some embodiments, the processor 401 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 402 is configured to store at least one instruction for execution by the processor 401 to implement the method of extracting a main melody track in audio data provided by the method embodiments in the present application.

In some embodiments, the terminal 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402, and peripheral interface 403 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 403 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, a touch display 405, a camera 406, audio circuitry 407, a positioning component 408, and a power supply 409.

Peripheral interface 403 may be used to connect at least one Input/Output (I/O) related peripheral to processor 401 and memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, memory 402, and peripheral interface 403 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.

The radio frequency circuit 404 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 404 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 404 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 404 may also include NFC (Near Field Communication) related circuitry, which is not limited in this application.

The display screen 405 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch display screen, the display screen 405 also has the ability to collect touch signals at or above its surface. The touch signal may be input as a control signal to the processor 401 for processing. In this case, the display screen 405 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 405, disposed on the front panel of the terminal 400; in other embodiments, there may be at least two display screens 405, disposed on different surfaces of the terminal 400 or in a folded design; in still other embodiments, the display screen 405 may be a flexible display disposed on a curved surface or a folded surface of the terminal 400. Furthermore, the display screen 405 may be arranged in a non-rectangular irregular shape, that is, a specially shaped screen. The display screen 405 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or a display made of other materials.

The camera assembly 406 is used to capture images or video. Optionally, camera assembly 406 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth camera can be fused to implement a background blurring function, and the main camera and the wide-angle camera can be fused to implement panoramic shooting and Virtual Reality (VR) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 406 may also include a flash. The flash may be a single color temperature flash or a dual color temperature flash. A dual color temperature flash is a combination of a warm light flash and a cold light flash, and can be used for light compensation at different color temperatures.

The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 400. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuit 407 may also include a headphone jack.

The positioning component 408 is used to determine the current geographic location of the terminal 400 to enable navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 409 is used to power the various components in the terminal 400. The power supply 409 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When power supply 409 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the terminal 400 further includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyroscope sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.

The acceleration sensor 411 may detect the magnitudes of accelerations on the three coordinate axes of the coordinate system established with the terminal 400. For example, the acceleration sensor 411 may be used to detect the components of gravitational acceleration on the three coordinate axes. The processor 401 may control the touch display screen 405 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 411. The acceleration sensor 411 may also be used to acquire motion data of a game or a user.

The gyro sensor 412 may detect a body direction and a rotation angle of the terminal 400, and the gyro sensor 412 may collect a 3D motion of the user to the terminal 400 in cooperation with the acceleration sensor 411. The processor 401 may implement the following functions according to the data collected by the gyro sensor 412: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.

The pressure sensor 413 may be disposed at a side frame of the terminal 400 and/or at a lower layer of the touch display 405. When the pressure sensor 413 is disposed at a side frame of the terminal 400, a grip signal of the terminal 400 by a user may be detected, and the processor 401 performs a left-right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed at the lower layer of the touch display screen 405, the processor 401 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 405. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.

The fingerprint sensor 414 is used to collect a fingerprint of the user, and the processor 401 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 401 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings. The fingerprint sensor 414 may be provided on the front, back, or side of the terminal 400. When a physical key or vendor logo is provided on the terminal 400, the fingerprint sensor 414 may be integrated with the physical key or vendor logo.

The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 according to the ambient light intensity collected by the optical sensor 415. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 405 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 405 is turned down. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.

A proximity sensor 416, also referred to as a distance sensor, is typically provided on the front panel of the terminal 400. The proximity sensor 416 is used to collect the distance between the user and the front of the terminal 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front of the terminal 400 gradually decreases, the processor 401 controls the touch display screen 405 to switch from the screen-on state to the screen-off state; when the proximity sensor 416 detects that the distance between the user and the front of the terminal 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the screen-off state to the screen-on state.

Those skilled in the art will appreciate that the structure shown in fig. 4 is not limiting of the terminal 400 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.

It should be understood that, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.

The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The foregoing describes merely preferred embodiments of the present application and is not intended to limit the present application to these particular embodiments.

Claims

1. A method of extracting a main melody track from audio data, the method comprising:

extracting a plurality of sound tracks in target audio data, and determining time period information of a voice time period in each sound track within a preset limited time range to obtain a time period information set corresponding to each sound track;

determining the time period information of each sentence of lyrics in the limited time range in lyric information corresponding to the target audio data, and obtaining a time period information set corresponding to the lyric information;

and determining the corresponding track with the highest matching degree as the main melody track of the target audio data.

2. The method of claim 1, wherein the determining a degree of matching of the set of time period information corresponding to each track and the set of time period information corresponding to the lyrics information comprises:

comparing, in sequence, the start time and the end time of each piece of time period information A_i in the time period information set corresponding to the lyric information with those of each piece of time period information in the time period information set corresponding to the track, and searching for time period information B_j that satisfies a preset matching condition with A_i, wherein i is an integer between 1 and n, and j is an integer between 1 and m;

determining the ratio of the number of A_i for which a corresponding B_j can be found to the number of all pieces of time period information in the time period information set corresponding to the lyric information as the matching degree between the time period information set corresponding to the track and the time period information set corresponding to the lyric information.

3. The method according to claim 2, wherein the preset matching condition is that the time difference between the start time of A_i and the start time of B_j is within a preset first threshold, and the time difference between the end time of A_i and the end time of B_j is within the first threshold; or,

the preset matching condition is that the sum of the time difference between the start time of A_i and the start time of B_j and the time difference between the end time of A_i and the end time of B_j is within a preset second threshold.

4. The method of claim 1, wherein determining the corresponding track with the highest matching degree as the main melody track of the target audio data comprises:

and determining the track with the highest matching degree as the main melody track of the target audio data from the tracks with the matching degree reaching the preset matching degree threshold.

5. The method of any of claims 1-4, wherein the format of the target audio data is a MIDI format.

6. An apparatus for extracting a main melody track from audio data, the apparatus comprising:

the first determining module is used for extracting a plurality of sound tracks in the target audio data, determining time period information of a voice time period in each sound track within a preset limited time range and obtaining a time period information set corresponding to each sound track;

the second determining module is used for determining the time period information of each sentence of lyrics in the limited time range in the lyric information corresponding to the target audio data, and obtaining a time period information set corresponding to the lyric information;

7. The apparatus of claim 6, wherein the third determination module comprises:

a searching unit, configured to compare, in sequence, the start time and the end time of each piece of time period information A_i in the time period information set corresponding to the lyric information with those of each piece of time period information in the time period information set corresponding to the track, and to search for time period information B_j that satisfies a preset matching condition with A_i, wherein i is an integer between 1 and n, and j is an integer between 1 and m;

a determining unit, configured to determine the ratio of the number of A_i for which a corresponding B_j can be found to the number of all pieces of time period information in the time period information set corresponding to the lyric information as the matching degree between the time period information set corresponding to the track and the time period information set corresponding to the lyric information.

8. The apparatus of claim 7, wherein the preset matching condition is that the time difference between the start time of A_i and the start time of B_j is within a preset first threshold, and the time difference between the end time of A_i and the end time of B_j is within the first threshold; or,

9. The apparatus of claim 6, wherein the fourth determining module is further configured to determine, as the main melody track of the target audio data, a track with a highest matching degree among tracks with matching degrees that reach a preset matching degree threshold.

10. The apparatus according to any one of claims 6-9, wherein the format of the target audio data is MIDI format.

11. A terminal comprising a processor and a memory, wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the method of extracting a main melody track from audio data according to any one of claims 1 to 5.

12. A computer readable storage medium having stored therein at least one instruction loaded and executed by a processor to implement the method of extracting a main melody track from audio data according to any one of claims 1 to 5.