WO2020173488A1 - Audio starting point detection method and apparatus - Google Patents
Audio starting point detection method and apparatus
- Publication number
- WO2020173488A1 (PCT/CN2020/077024)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frequency band
- audio
- spectrum parameter
- voice
- parameter
- Prior art date
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 69
- 238000001228 spectrum Methods 0.000 claims abstract description 197
- 230000005236 sound signal Effects 0.000 claims abstract description 61
- 238000000034 method Methods 0.000 abstract description 12
- 238000010586 diagram Methods 0.000 description 19
- 230000033764 rhythmic process Effects 0.000 description 11
- 230000006870 function Effects 0.000 description 9
- 238000004590 computer program Methods 0.000 description 7
- 238000004891 communication Methods 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 238000011897 real-time detection Methods 0.000 description 4
- 230000003595 spectral effect Effects 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000000605 extraction Methods 0.000 description 2
- 239000000203 mixture Substances 0.000 description 2
- 230000000644 propagated effect Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 239000013307 optical fiber Substances 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000000737 periodic effect Effects 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/81—Detection of presence or absence of voice signals for discriminating voice from music
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/36—Accompaniment arrangements
- G10H1/40—Rhythm
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/051—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or detection of onsets of musical sounds or notes, i.e. note attack timings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
- G10H2210/071—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for rhythm pattern analysis or rhythm style recognition
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/315—Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
- G10H2250/455—Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
Definitions
- Audio starting point detection is an information extraction algorithm applied to audio signals. The goal is to accurately detect the starting point positions of notes and syllables: note refers specifically to music signals, and syllable (phone) refers specifically to voice signals. Audio starting point detection has many important uses and application prospects in the field of signal processing, for example: automatic segmentation and automatic labeling of human voice and music audio, information extraction, segmentation and compression, and interactive entertainment.
- Figure 1a and Figure 1b show the starting point detection, where Figure 1a is the audio signal, and Figure 1b is the detected starting point positions.
- In the related art, the voice spectrum parameter curve corresponding to the audio signal is usually calculated, the local maximum points of the curve are determined according to the voice spectrum parameter curve, and the voice spectrum parameter corresponding to each such point is compared with a set threshold. If it is greater than the threshold, the position corresponding to that point is determined as a starting point position.
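The thresholded local-maximum picking described above can be sketched as follows. This is a minimal illustration, not code from the patent; the function name, the threshold value, and the 32 ms frame spacing are illustrative assumptions.

```python
def pick_onsets(curve, threshold, frame_ms=32.0):
    """Thresholded local-maximum picking on a per-frame parameter curve.

    A point is reported as an onset when it is a local maximum of the
    curve AND its value exceeds `threshold`. Frame indices are converted
    to seconds assuming one curve value per `frame_ms` frame (an
    illustrative assumption, not specified by the patent).
    """
    onsets = []
    for i in range(1, len(curve) - 1):
        is_peak = curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]
        if is_peak and curve[i] > threshold:
            onsets.append(i * frame_ms / 1000.0)  # frame index -> seconds
    return onsets

onsets = pick_onsets([0, 1, 0, 5, 0, 2, 0], threshold=1.5)
```

With the toy curve above, only the peaks with value 5 and 2 exceed the threshold, so two onset times are returned.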
- The above algorithm is mainly suitable for audio signals with clear boundaries and a relatively simple rhythm (such as fast-paced music with clear note boundaries). For more complex audio signals with a weak sense of rhythm (such as music mixed from multiple instruments, slower-tempo music, and human voices), the aforementioned detection algorithm cannot accurately detect the boundaries, and frequent false detections and missed detections occur.
Summary of the invention
- In a first aspect, an embodiment of the present disclosure provides an audio starting point detection method, including: determining the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio; for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence; and determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.
- Further, the determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence includes: for each frequency band, determining the average value of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence, and using the average value as the second voice spectrum parameter of the current frequency band.
- Further, the determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence includes: for each frequency band, determining the mean value of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence; and determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean value.
- Further, the determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean value includes: calculating the difference between the first voice spectrum parameter of the current frequency band and the mean value; and determining the second voice spectrum parameter of the current frequency band according to the difference. Further, the determining the second voice spectrum parameter of the current frequency band according to the difference includes: determining the average of the differences according to the difference corresponding to the current frequency band and the differences corresponding to each frequency band before the current frequency band in time sequence, and using the average of the differences as the second voice spectrum parameter of the current frequency band.
- Further, the determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands includes: drawing a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands; determining local highest points according to the voice spectrum parameter curve; and determining one or more starting point positions of notes and syllables in the audio according to the second voice spectrum parameters corresponding to the local highest points.
- Further, the determining the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal includes: dividing the audio signal of the audio into a plurality of sub-audio signals and converting each sub-audio signal into a frequency domain signal, each sub-audio signal corresponding to a frequency band; and determining the first voice spectrum parameter corresponding to each frequency band.
- In a second aspect, an embodiment of the present disclosure provides an audio starting point detection device, including: a first parameter determination module, configured to determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio; a second parameter determination module, configured to determine, for each frequency band, the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence; and a starting point determination module, configured to determine one or more starting point positions of notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.
- Further, the second parameter determining module is specifically configured to: for each frequency band, determine the average value of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence, and use the average value as the second voice spectrum parameter of the current frequency band.
- Further, the second parameter determining module includes: an average value determining unit, configured to determine, for each frequency band, the average value of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence; and a second parameter determining unit, configured to determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the average value. Further, the second parameter determining unit is specifically configured to: calculate the difference between the first voice spectrum parameter of the current frequency band and the average value; and determine the second voice spectrum parameter of the current frequency band according to the difference.
- Further, the second parameter determining unit is specifically configured to: determine the average of the differences according to the difference corresponding to the current frequency band and the differences corresponding to each frequency band before the current frequency band in time sequence, and use the average of the differences as the second voice spectrum parameter of the current frequency band.
- Further, the starting point determination module is specifically configured to: draw a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands; determine local highest points according to the voice spectrum parameter curve; and determine one or more starting point positions of notes and syllables in the audio according to the second voice spectrum parameters corresponding to the local highest points.
- Further, the first parameter determination module of the device is specifically configured to: divide the audio signal of the audio into a plurality of sub-audio signals and convert each sub-audio signal into a frequency domain signal, each sub-audio signal corresponding to a frequency band; and determine the first voice spectrum parameter corresponding to each frequency band.
- Further, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute any of the audio starting point detection methods described in the first aspect.
- Further, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute any of the audio starting point detection methods described in the first aspect.
- In the embodiments of the present disclosure, the first voice spectrum parameter corresponding to each frequency band is determined according to the frequency domain signal corresponding to the audio signal of the audio; for each frequency band, the second voice spectrum parameter of the current frequency band is determined according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence; and one or more starting point positions of the notes and syllables in the audio are determined according to the second voice spectrum parameters corresponding to the respective frequency bands. Because the second voice spectrum parameter refers to the first voice spectrum parameters of multiple frequency bands, the determined second voice spectrum parameters are more accurate, so the starting points of the notes and syllables in the audio can be accurately detected and errors are reduced.
- FIG. 1a is a schematic diagram of an audio signal provided in the prior art; FIG. 1b is a schematic diagram of an audio starting point detection result provided in the prior art
- Fig. 2a is a flowchart of an audio start point detection method provided in Embodiment 1 of the present disclosure
- Fig. 2b is a schematic diagram of dividing the audio signal into sub-audio signals in the audio start point detection method provided in Embodiment 1 of the present disclosure
- Figure 2c is a voice frequency spectrum diagram of the audio signal in the audio starting point detection method provided in the first embodiment of the disclosure
- Fig. 3 is a flowchart of the audio start point detection method provided in the second embodiment of the disclosure
- Fig. 4a is a flowchart of the audio start point detection method provided in the third embodiment of the disclosure
- Fig. 4b is a graph, composed of voice spectrum parameters, containing a glitch signal, provided in the third embodiment of the disclosure
- Fig. 4c is a graph of the voice spectrum parameter composition in the audio start point detection method provided in the third embodiment of the disclosure
- Fig. 4d is a schematic diagram of an audio signal in the audio start point detection method provided in the third embodiment of the disclosure
- FIG. 4e is a schematic diagram of the audio signal detection result shown in FIG. 4d obtained by using the existing starting point detection method provided by the third embodiment of the disclosure
- FIG. 4f is a schematic diagram of the detection result of the audio signal shown in Fig. 4d obtained by using the audio starting point detection method provided by the third embodiment of the disclosure
- Fig. 5 is a schematic structural diagram of an audio starting point detection apparatus provided in Embodiment 4 of the present disclosure
- Fig. 6 is a schematic structural diagram of an electronic device provided according to Embodiment 5 of the present disclosure.
Detailed description
- FIG. 2a is a flowchart of the audio start point detection method provided in Embodiment 1 of the present disclosure.
- The audio start point detection method provided in this embodiment can be executed by an audio start point detection device, which can be implemented as software or as a combination of software and hardware.
- the audio starting point detection device can be integrated in a certain device in the audio starting point detection system, such as an audio starting point detection server or an audio starting point detection terminal device.
- This embodiment can be applied to scenarios involving more complex audio with a weak sense of rhythm (for example, music mixed from multiple instruments, slower-tempo music, and human voices).
- The method includes the following steps: Step S21: Determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.
- the audio signal may be a piece of music or speech, and the corresponding frequency domain signal is obtained by converting the audio signal in the time domain into the frequency domain.
- Regarding the first voice spectrum parameter: in order to distinguish the different voice spectrum parameters appearing in this document, they are referred to as the first voice spectrum parameter and the second voice spectrum parameter according to their order of appearance.
- the first speech frequency spectrum parameter can be determined according to the frequency spectrum amplitude and phase.
- Step S21 specifically includes: Step S211: Divide the audio signal of the audio into a plurality of sub-audio signals and convert each sub-audio signal into a frequency domain signal, each sub-audio signal corresponding to a frequency band.
- Step S212: Determine the first voice spectrum parameter corresponding to each frequency band.
- Although the audio signal changes aperiodically with time, in the short-term range (usually defined as 10-40 ms) it exhibits approximately stable (approximately periodic) characteristics, so the audio signal can be divided into short-term speech segments of equal length, namely sub-audio signals, for analysis. For example, as shown in Figure 2b, for an audio signal with a sampling rate of 16000 Hz, 512 sample points can be selected as a sub-audio signal, which corresponds to a speech length of 32 ms.
- The Fourier transform can be used to convert the audio signal in the time domain into a signal in the frequency domain. The frequency band information that changes with time is called the spectrogram.
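The framing and transform steps above can be sketched as follows. This is a minimal illustration of the 512-sample/16 kHz example from the text; the function name is an assumption, and real systems usually add a window function and overlapping frames, which the patent does not specify.

```python
import numpy as np

def frame_and_transform(signal, sr=16000, frame_len=512):
    """Split a waveform into fixed-length sub-audio signals and take the
    magnitude spectrum of each via the FFT (one spectrum per frame).
    Windowing and overlap are omitted for brevity."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrogram
    frame_ms = 1000.0 * frame_len / sr             # 512 samples @ 16 kHz = 32 ms
    return spectra, frame_ms
```

For a one-second 16 kHz signal this yields 31 full frames of 32 ms each, with 257 frequency bins per frame.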
- From the spectrogram in Figure 2c, the differences between the sub-audio signals can be clearly seen. From the energy change of each frequency band, it can be seen that at a starting point the frequency spectrum exhibits an obvious step change.
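The text notes that the spectrum steps up at a starting point, and earlier states that the first voice spectrum parameter can be derived from spectrum amplitude and phase. One common magnitude-only choice, used here purely as an illustrative assumption (the patent does not name it), is half-wave rectified spectral flux:

```python
import numpy as np

def first_parameter(spectra):
    """One plausible 'first voice spectrum parameter': half-wave rectified
    spectral flux, which grows when the magnitude spectrum steps up between
    consecutive frames. This specific choice is an assumption; the patent
    only says the parameter is determined from spectrum amplitude and phase."""
    diff = np.diff(spectra, axis=0)               # frame-to-frame spectral change
    flux = np.maximum(diff, 0.0).sum(axis=1)      # keep only energy increases
    return np.concatenate(([0.0], flux))          # pad so one value per frame
```

An upward step in the spectrum between two frames produces a large flux value at the second frame, matching the step-change observation above.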
- Step S22: For each frequency band, determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence.
- the preset number can be customized.
- Step S23: Determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to each frequency band.
- Step S23 includes: Step S231: Draw a voice spectrum parameter curve according to the second voice spectrum parameter corresponding to each frequency band.
- In this embodiment, the first voice spectrum parameter corresponding to each frequency band is determined according to the frequency domain signal corresponding to the audio signal of the audio, and for each frequency band the second voice spectrum parameter of the current frequency band is determined according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence. The determined second voice spectrum parameters are therefore more accurate, so the starting points of the notes and syllables in the audio can be accurately detected and false detections and missed detections are reduced. Moreover, because only the first voice spectrum parameters of frequency bands before the current frequency band in time sequence are referenced, real-time detection of the starting point is ensured.
- FIG. 3 is a flowchart of the audio starting point detection method provided in Embodiment 2 of the present disclosure.
- This embodiment is based on the above-mentioned embodiment and further optimizes the step of determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence. This embodiment can be applied to scenarios involving more complex audio with a weak sense of rhythm (such as music mixed from multiple instruments, slower-tempo music, and human voices). As shown in Figure 3, the method specifically includes:
- Step S31: Determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.
- Step S32: For each frequency band, determine the average value of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence, and use the average value as the second voice spectrum parameter of the current frequency band.
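Step S32 can be sketched as a causal running mean, computed only from the current and earlier values so that detection stays real-time. This is a minimal illustration with an assumed function name; the text elsewhere mentions that a preset number of preceding values can be used instead of all of them.

```python
import numpy as np

def second_parameter_mean(first_params):
    """Second parameter of each frame = mean of the first parameters of
    that frame and all earlier frames (a causal running mean). A fixed
    preset window length could be substituted, as the text allows."""
    cumsum = np.cumsum(first_params, dtype=float)
    counts = np.arange(1, len(first_params) + 1)
    return cumsum / counts
```

For an input of `[2, 4, 6]` the running means are `[2, 3, 4]`, each value depending only on the past.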
- Step S33 Determine one or more starting point positions of the notes and syllables in the audio according to the second speech spectrum parameters corresponding to each frequency band.
- In this way, the glitch phenomenon in the curve composed of the average values can be improved and the starting point detection accuracy further increased. Moreover, because only the first voice spectrum parameters of frequency bands before the current frequency band in time sequence are referenced, real-time detection of the starting point is ensured.
- FIG. 4a is a flowchart of a method for detecting an audio start point provided in Embodiment 3 of the disclosure.
- In this embodiment, the step of determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence is further optimized. This embodiment can be applied to scenarios involving more complex audio with a weak sense of rhythm (for example, music mixed from multiple instruments, slower-tempo music, and human voices).
- Step S41: Determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.
- Step S42: For each frequency band, determine the average value of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band before the current frequency band in time sequence.
- Step S43: Determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the average value.
- Step S43 includes: Step S431: Calculate the difference between the first voice spectrum parameter of the current frequency band and the average value.
- Step S432: Determine the second voice spectrum parameter of the current frequency band according to the difference. Further, step S432 includes: determining the average of the differences according to the difference corresponding to the current frequency band and the differences corresponding to each frequency band before the current frequency band in time sequence, and using the average of the differences as the second voice spectrum parameter of the current frequency band.
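Steps S42, S431, and S432 can be sketched together as follows. This is a minimal illustration with an assumed function name; both averages are taken causally over the current and all earlier values, matching the real-time constraint stated elsewhere in the text.

```python
import numpy as np

def second_parameter_diff(first_params):
    """Step S42: causal running mean of the first parameters.
    Step S431: difference of each first parameter from that running mean.
    Step S432: causal running mean of those differences, which smooths
    out glitches in the resulting parameter curve."""
    first = np.asarray(first_params, dtype=float)
    counts = np.arange(1, len(first) + 1)
    running_mean = np.cumsum(first) / counts   # S42
    diff = first - running_mean                # S431
    return np.cumsum(diff) / counts            # S432
```

For an input of `[2, 4]` the running means are `[2, 3]`, the differences `[0, 1]`, and the smoothed second parameters `[0, 0.5]`.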
- Figure 4b shows a curve, composed of voice spectrum parameters, that contains a glitch signal, while Figure 4c shows the curve composed of the voice spectrum parameters in this solution.
- FIG. 4d is a schematic diagram of an audio signal
- FIG. 4e is a schematic diagram of the starting point detection result obtained by detecting the audio signal of FIG. 4d with the existing starting point detection method.
- FIG. 4f is a schematic diagram of the starting point detection result obtained by detecting the audio signal of FIG. 4d with the method of this embodiment.
- Step S44 Determine one or more starting point positions of the notes and syllables in the audio according to the second speech frequency spectrum parameters corresponding to each frequency band.
- The embodiments of the present disclosure refer to the first voice spectrum parameters corresponding to multiple frequency bands when determining the second voice spectrum parameter, so that the determined second voice spectrum parameters are more accurate and the starting points of notes and syllables in the audio can be accurately detected. The glitch phenomenon in the existing curves is improved, which further increases the accuracy of starting point detection. Moreover, because only the first voice spectrum parameters of frequency bands before the current frequency band in time sequence are referenced, real-time detection of the starting point is ensured.
- FIG. 5 is a schematic structural diagram of an audio starting point detection device provided in Embodiment 4 of the present disclosure.
- the audio starting point detection device can be implemented as software or a combination of software and hardware.
- The audio starting point detection device can be integrated in a certain device in the audio starting point detection system, such as an audio starting point detection server or an audio starting point detection terminal device.
- This embodiment can be applied to scenarios involving more complex audio with a weak sense of rhythm (for example, music mixed from multiple instruments, slower-tempo music, and human voices). As shown in FIG. 5, the device includes: a first parameter determining module 51, a second parameter determining module 52, and a starting point determining module 53; wherein the first parameter determining module 51 is configured to determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal.
- the second parameter determining module 52 is configured to, for each frequency band, determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band preceding the current frequency band in time order;
- the starting point determination module 53 is configured to determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.
- the second parameter determining module 52 is specifically configured to: for each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band preceding the current frequency band in time order, and use the mean as the second voice spectrum parameter of the current frequency band.
- the second parameter determining module 52 includes: a mean determining unit 521 and a second parameter determining unit 522; wherein, the mean determining unit 521 is configured to, for each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band preceding the current frequency band in time order; the second parameter determining unit 522 is configured to determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean.
- the second parameter determining unit 522 is specifically configured to: calculate the difference between the first voice spectrum parameter of the current frequency band and the mean; and determine the second voice spectrum parameter of the current frequency band according to the difference.
- the second parameter determining unit 522 is specifically configured to: determine the mean of the differences according to the difference corresponding to the current frequency band and the differences corresponding to each frequency band preceding the current frequency band in time order, and use the mean of the differences as the second voice spectrum parameter of the current frequency band.
- the starting point determination module 53 is specifically configured to: draw a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands; and determine a local highest point from the voice spectrum parameter curve, and determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameter corresponding to the local highest point.
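A minimal sketch of the peak-picking step just described — locating local highest points of the parameter curve and keeping those above a threshold as starting point positions — might look like the following. The threshold value and the simple three-point local-maximum test are illustrative assumptions, not details taken from the patent.

```python
def pick_onsets(curve, threshold):
    """Return the indices of local highest points of the voice spectrum
    parameter curve whose value exceeds the threshold; each index is
    reported as one starting point position."""
    onsets = []
    for n in range(1, len(curve) - 1):
        is_local_peak = curve[n] > curve[n - 1] and curve[n] >= curve[n + 1]
        if is_local_peak and curve[n] > threshold:
            onsets.append(n)
    return onsets

curve = [0.1, 0.2, 1.5, 0.3, 0.2, 0.9, 2.4, 0.4]
print(pick_onsets(curve, threshold=1.0))  # -> [2, 6]
```

In practice the curve would first be smoothed (as in the second parameter determining module) so that glitch signals do not produce spurious local peaks.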
- the first parameter determining module 51 is specifically configured to: split the audio signal of the audio into multiple sub audio signals and convert each sub audio signal into a frequency domain signal, each sub audio signal corresponding to one frequency band; and determine the first voice spectrum parameter corresponding to each frequency band.
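The frame-splitting step handled by the first parameter determining module — cutting the audio signal into equal-length sub audio signals and converting each to the frequency domain — can be sketched as follows. This is an illustrative sketch rather than the patented implementation; the 16000 Hz sampling rate and 512-sample frame length are the example values given in Embodiment 1.

```python
import numpy as np

def frame_spectra(x, frame_len=512):
    """Split a 1-D audio signal into equal-length frames (sub audio
    signals) and convert each frame to the frequency domain via an FFT.
    Returns complex spectra with shape (num_frames, frame_len // 2 + 1)."""
    num_frames = len(x) // frame_len          # drop the trailing partial frame
    frames = x[:num_frames * frame_len].reshape(num_frames, frame_len)
    return np.fft.rfft(frames, axis=1)        # one spectrum per sub audio signal

# One second of audio at a 16000 Hz sampling rate; each 512-sample
# frame then corresponds to 32 ms of signal, as in Embodiment 1.
x = np.random.default_rng(0).standard_normal(16000)
spectra = frame_spectra(x)                    # 31 frames x 257 frequency bins
```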
- Embodiment 5. Referring now to FIG. 6, which shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.
- the electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
- FIG. 6 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
- the electronic device may include a processing device (such as a central processing unit or a graphics processor) 601, which may perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 602 or a program loaded from a storage device 608 into random access memory (RAM) 603.
- the RAM 603 also stores various programs and data required for the operation of the electronic device.
- the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
- an input/output (I/O) interface 605 is also connected to the bus 604.
- the following devices may be connected to the I/O interface 605: input devices 606 such as a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, and gyroscope; output devices 607 such as a liquid crystal display (LCD), speaker, and vibrator; storage devices 608 such as a magnetic tape and hard disk; and a communication device 609.
- the communication device 609 may allow the electronic device to perform wireless or wired communication with other devices to exchange data.
- the process described above with reference to the flowchart can be implemented as a computer software program.
- the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program includes program code for executing the method shown in the flowchart.
- the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602.
- the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
- the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
- Computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
- a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein.
- This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
- the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium; it may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
- the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wire, optical cable, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
- the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or it may exist alone without being assembled into the electronic device.
- the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: determines the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal; determines the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of each frequency band preceding the current frequency band in time order; and determines the starting point positions according to the second voice spectrum parameters corresponding to the respective frequency bands.
- the computer program code used to perform the operations of the present disclosure may be written in one or more programming languages or a combination thereof.
- the above-mentioned programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
- the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, using an Internet service provider to connect to the user's computer).
- the flowcharts and block diagrams in the accompanying drawings illustrate the possible implementation architecture, functions, and operations of the system, method, and computer program product according to various embodiments of the present disclosure.
- each block in the flowchart or block diagram may represent a module, program segment, or part of code, and the module, program segment, or part of code contains one or more executable instructions for realizing the prescribed logical functions.
- each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
- the units described in the embodiments of the present disclosure may be implemented in software or in hardware; the name of a unit does not in some cases constitute a limitation on the unit itself.
- the drag point determination module can also be described as "a module for determining a drag point on a template image".
- the above description is only a preferred embodiment of the present disclosure and an explanation of the applied technical principles.
- Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover, without departing from the above disclosed concept, other technical solutions formed by any combination of the above technical features or by replacing them with technical features disclosed in the present disclosure (but not limited thereto) that have similar functions.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Telephone Function (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
An audio starting point detection method and apparatus, an electronic device, and a computer-readable storage medium. The audio starting point detection method includes: determining the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio (S21); for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order (S22); and determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands (S23). Because the method refers to the first voice spectrum parameters of multiple frequency bands when determining the second voice spectrum parameter, the determined second voice spectrum parameter is more accurate, so the starting points of notes and syllables in the audio can be detected accurately and false detections and missed detections are reduced; and because only the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order are referenced, real-time starting point detection is ensured.
Description
Audio starting point detection method and apparatus
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 201910151015.0, filed on February 28, 2019 and entitled "Audio starting point detection method and apparatus", the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of image processing, and in particular to an audio starting point detection method and apparatus, an electronic device, and a computer-readable storage medium.

Background
Audio starting point detection is an information extraction algorithm applied to audio signals, whose goal is to accurately detect the starting point positions of notes and syllables, where a note specifically refers to a music signal and a syllable (phone) specifically refers to a human voice signal. Audio starting point detection has many important uses and application prospects in the field of signal processing, for example: automatic segmentation and automatic labeling of human voice and music audio, information extraction, segmented compression, and interactive entertainment gameplay. FIG. 1a and FIG. 1b illustrate starting point detection, where FIG. 1a is an audio signal and FIG. 1b shows the detected starting point positions.

In the prior art, a voice spectrum parameter curve corresponding to the audio signal is usually calculated, local maximum points of the curve are determined from the voice spectrum parameter curve, and the voice spectrum parameter corresponding to such a point is compared with a set threshold; if it is greater than the threshold, the position corresponding to the point is determined to be a starting point position.

However, the above algorithm is mainly suitable for audio signals with clear boundaries and a relatively simple rhythm (for example, fast-paced music with clear note boundaries). For some more complex audio with a weak sense of rhythm (for example, music mixed with multiple instruments, music with a slower rhythm, and human voices), the above detection algorithm cannot accurately detect the boundaries, and frequent false detections and missed detections occur.
Summary
In a first aspect, an embodiment of the present disclosure provides an audio starting point detection method, including:

determining the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio; for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.

Further, for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order includes: for each frequency band, determining the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order, and using the mean as the second voice spectrum parameter of the current frequency band.

Further, for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order includes: for each frequency band, determining the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean.

Further, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean includes: calculating the difference between the first voice spectrum parameter of the current frequency band and the mean; and determining the second voice spectrum parameter of the current frequency band according to the difference.

Further, determining the second voice spectrum parameter of the current frequency band according to the difference includes: determining the mean of the differences according to the difference corresponding to the current frequency band and the differences corresponding to the frequency bands that precede the current frequency band in time order, and using the mean of the differences as the second voice spectrum parameter of the current frequency band.

Further, determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands includes: drawing a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands; and determining a local highest point from the voice spectrum parameter curve, and determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameter corresponding to the local highest point.

Further, determining the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio includes: splitting the audio signal of the audio into multiple sub audio signals and converting each sub audio signal into a frequency domain signal, each sub audio signal corresponding to one frequency band; and determining the first voice spectrum parameter corresponding to each frequency band.

In a second aspect, an embodiment of the present disclosure provides an audio starting point detection apparatus, including: a first parameter determining module configured to determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio; a second parameter determining module configured to, for each frequency band, determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and a starting point determining module configured to determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.

Further, the second parameter determining module is specifically configured to: for each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order, and use the mean as the second voice spectrum parameter of the current frequency band.

Further, the second parameter determining module includes: a mean determining unit configured to, for each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and a second parameter determining unit configured to determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean.

Further, the second parameter determining unit is specifically configured to: calculate the difference between the first voice spectrum parameter of the current frequency band and the mean; and determine the second voice spectrum parameter of the current frequency band according to the difference.

Further, the second parameter determining unit is specifically configured to: determine the mean of the differences according to the difference corresponding to the current frequency band and the differences corresponding to the frequency bands that precede the current frequency band in time order, and use the mean of the differences as the second voice spectrum parameter of the current frequency band.

Further, the starting point determining module is specifically configured to: draw a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands; and determine a local highest point from the voice spectrum parameter curve, and determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameter corresponding to the local highest point.

Further, the first parameter determining module is specifically configured to: split the audio signal of the audio into multiple sub audio signals and convert each sub audio signal into a frequency domain signal, each sub audio signal corresponding to one frequency band; and determine the first voice spectrum parameter corresponding to each frequency band.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform any of the audio starting point detection methods of the first aspect.

In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any of the audio starting point detection methods of the first aspect.

In the embodiments of the present disclosure, the first voice spectrum parameter corresponding to each frequency band is determined according to the frequency domain signal corresponding to the audio signal of the audio; for each frequency band, the second voice spectrum parameter of the current frequency band is determined according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and one or more starting point positions of the notes and syllables in the audio are determined according to the second voice spectrum parameters corresponding to the respective frequency bands. Because the first voice spectrum parameters corresponding to multiple frequency bands are referenced when determining the second voice spectrum parameter, the determined second voice spectrum parameter is more accurate, so the starting points of notes and syllables in the audio can be detected accurately and false detections and missed detections are reduced; and because only the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order are referenced, real-time starting point detection is ensured.

The above description is only an overview of the technical solutions of the present disclosure. To make the technical means of the present disclosure clearer so that they can be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of the present disclosure more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief description of the drawings
To explain the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1a is a schematic diagram of an audio signal provided by the prior art;

FIG. 1b is a schematic diagram of an audio starting point detection result provided by the prior art;

FIG. 2a is a flowchart of the audio starting point detection method provided in Embodiment 1 of the present disclosure;

FIG. 2b is a schematic diagram of an audio signal in the audio starting point detection method provided in Embodiment 1 of the present disclosure;

FIG. 2c is a speech spectrogram of the audio signal in the audio starting point detection method provided in Embodiment 1 of the present disclosure;

FIG. 3 is a flowchart of the audio starting point detection method provided in Embodiment 2 of the present disclosure;

FIG. 4a is a flowchart of the audio starting point detection method provided in Embodiment 3 of the present disclosure;

FIG. 4b is a curve, composed of voice spectrum parameters, that contains glitch signals, provided in Embodiment 3 of the present disclosure;

FIG. 4c is a curve composed of the voice spectrum parameters in the audio starting point detection method provided in Embodiment 3 of the present disclosure;

FIG. 4d is a schematic diagram of an audio signal in the audio starting point detection method provided in Embodiment 3 of the present disclosure;

FIG. 4e is a schematic diagram of the detection result of the audio signal shown in FIG. 4d obtained by an existing starting point detection method, provided in Embodiment 3 of the present disclosure;

FIG. 4f is a schematic diagram of the detection result of the audio signal shown in FIG. 4d obtained by the audio starting point detection method provided in Embodiment 3 of the present disclosure;

FIG. 5 is a schematic structural diagram of the audio starting point detection apparatus provided in Embodiment 4 of the present disclosure;

FIG. 6 is a schematic structural diagram of the electronic device provided in Embodiment 5 of the present disclosure.

Detailed description
The embodiments of the present disclosure are described below through specific examples, and those skilled in the art can easily understand other advantages and effects of the present disclosure from the contents disclosed in this specification. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. The present disclosure can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present disclosure. It should be noted that, where there is no conflict, the following embodiments and the features in the embodiments can be combined with each other. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.

It should be noted that various aspects of the embodiments within the scope of the appended claims are described below. It should be apparent that the aspects described herein may be embodied in a wide variety of forms, and any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, those skilled in the art should understand that an aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number of the aspects set forth herein can be used to implement a device and/or practice a method. In addition, such a device can be implemented and/or such a method can be practiced using other structures and/or functionality in addition to one or more of the aspects set forth herein.

It should also be noted that the illustrations provided in the following embodiments only illustrate the basic idea of the present disclosure in a schematic manner. The drawings only show the components related to the present disclosure rather than being drawn according to the number, shape, and size of the components in actual implementation; in actual implementation, the form, quantity, and proportion of each component can change freely, and the component layout may also be more complex.

In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the aspects can be practiced without these specific details.

Embodiment 1

FIG. 2a is a flowchart of the audio starting point detection method provided in Embodiment 1 of the present disclosure. The audio starting point detection method provided in this embodiment may be executed by an audio starting point detection apparatus, which may be implemented as software, or as a combination of software and hardware, and may be integrated into a device in an audio starting point detection system, such as an audio starting point detection server or an audio starting point detection terminal device. This embodiment is applicable to scenarios with some more complex audio with a weak sense of rhythm (for example, music mixed with multiple instruments, music with a slower rhythm, and human voices). As shown in FIG. 2a, the method includes the following steps:

Step S21: Determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.

The audio signal may be a piece of music or speech, and the corresponding frequency domain signal is obtained by converting the time-domain audio signal into the frequency domain.

Here, to distinguish the different voice spectrum parameters appearing herein, they are called the first voice spectrum parameter and the second voice spectrum parameter according to the order in which they appear. The first voice spectrum parameter may be determined from the spectrum magnitude and phase.

In an optional embodiment, step S21 specifically includes:

Step S211: Split the audio signal of the audio into multiple sub audio signals and convert each sub audio signal into a frequency domain signal, each sub audio signal corresponding to one frequency band.

Step S212: Determine the first voice spectrum parameter corresponding to each frequency band.

Specifically, the audio signal is a one-dimensional discrete time sequence, which can be expressed as X = {x_0, x_1, ..., x_{N-1}}, where N is the total number of discrete sample points. Although the audio signal varies aperiodically over time, within a short-time range (usually defined as 10-40 ms) the audio signal can approximately exhibit stationary (approximately periodic) characteristics, so the audio signal can be split into short-time speech segments of equal length, i.e., sub audio signals, for analysis. For example, as shown in FIG. 2b, for an audio signal with a 16000 Hz sampling rate, 512 sample points can be selected as one sub audio signal, corresponding to a speech length of 32 ms.

Here, the Fourier transform can be used to convert the time-domain audio signal into a frequency-domain audio signal. The frequency band information varying with time is called a spectrogram. As shown in FIG. 2c, the energy changes of the sub audio signals in different frequency bands can be seen clearly; at a starting point position, the spectrum shows an obvious step change.

The corresponding frequency domain signal can be expressed as X_n(k) = Σ_{m=0}^{L-1} x(n·L + m) · e^{-j2πmk/L}, where n denotes the n-th sub audio signal, L denotes the length of a sub audio signal, and k denotes the k-th frequency band.

Correspondingly, when the audio signal is divided into multiple sub audio signals, the first voice spectrum parameter may specifically be a comprehensive weighting of the spectrum magnitudes and phases of the different sub audio signals, for example calculated by the formula cpx(n) = Σ_{k=0}^{L-1} sqrt( (|X_n(k)| − |X_{n−1}(k)|)² + (|X_n(k)| · sin(φ̈_n(k)))² ), where |X_n(k)| is the magnitude of the k-th frequency band, φ̈_n(k) is the second-order phase difference of the k-th frequency band, φ̈_n(k) = φ̇_n(k) − φ̇_{n−1}(k), φ̇_n(k) is the first-order phase difference of the k-th frequency band, φ̇_n(k) = φ_n(k) − φ_{n−1}(k), and φ_n(k) is the phase of the k-th frequency band. Using the second-order phase difference in this embodiment can better represent the starting point information.

Step S22: For each frequency band, determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order.

The preset number can be set in a customized way.

Specifically, in determining the second voice spectrum parameter of each frequency band, any frequency band is first selected as the current frequency band; then the second voice spectrum parameter of the current frequency band is determined according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; then any frequency band is selected from the remaining frequency bands as the current frequency band, and the above operation is repeated until the second voice spectrum parameters of all frequency bands are determined.

Step S23: Determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.

In an optional embodiment, step S23 includes:

S231: Draw a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands.

S232: Determine a local highest point from the voice spectrum parameter curve, and determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameter corresponding to the local highest point.

In the embodiments of the present disclosure, because the first voice spectrum parameters corresponding to multiple frequency bands are referenced when determining the second voice spectrum parameter, the determined second voice spectrum parameter is more accurate, so the starting points of notes and syllables in the audio can be detected accurately and false detections and missed detections are reduced; and because only the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order are referenced, real-time starting point detection is ensured.
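The first voice spectrum parameter cpx(n) described above — combining each frequency band's magnitude change with its second-order phase difference — can be sketched numerically as follows. This is a plain reading of the formula, under the assumption that phase differences are taken frame to frame without unwrapping; a production implementation might additionally wrap the phase differences into [-π, π].

```python
import numpy as np

def first_spectrum_parameter(spectra):
    """cpx(n): sum over frequency bands k of
    sqrt((|X_n(k)| - |X_{n-1}(k)|)^2 + (|X_n(k)| * sin(phi''_n(k)))^2),
    where phi'' is the second-order phase difference.
    spectra: complex array of shape (num_frames, num_bins)."""
    mag = np.abs(spectra)
    phase = np.angle(spectra)
    d_phase = np.diff(phase, axis=0)       # first-order phase difference
    dd_phase = np.diff(d_phase, axis=0)    # second-order phase difference
    d_mag = np.diff(mag, axis=0)           # |X_n(k)| - |X_{n-1}(k)|
    # dd_phase is defined from the third frame on; align the other terms.
    term = np.sqrt(d_mag[1:] ** 2 + (mag[2:] * np.sin(dd_phase)) ** 2)
    return term.sum(axis=1)                # one parameter value per frame

spectra = np.fft.rfft(
    np.random.default_rng(1).standard_normal((10, 512)), axis=1)
cpx = first_spectrum_parameter(spectra)    # 8 values for 10 frames
```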
Embodiment 2

FIG. 3 is a flowchart of the audio starting point detection method provided in Embodiment 2 of the present disclosure. On the basis of the above embodiment, this embodiment further optimizes the step of determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order. This embodiment is applicable to scenarios with some more complex audio with a weak sense of rhythm (for example, music mixed with multiple instruments, music with a slower rhythm, and human voices). As shown in FIG. 3, the method specifically includes:

Step S31: Determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.

Step S32: For each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order, and use the mean as the second voice spectrum parameter of the current frequency band.

Step S33: Determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.

In the embodiments of the present disclosure, by determining the mean for each frequency band and determining the starting point positions of the notes and syllables in the audio according to the means corresponding to the respective frequency bands, the glitch phenomenon in the curve composed of the means can be improved and the starting point detection accuracy is further increased; and because only the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order are referenced, real-time starting point detection is ensured.
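The smoothing rule of Embodiment 2 — taking, for each position in time, the mean of the current first voice spectrum parameter and all first parameters that precede it — reduces to a running (cumulative) mean. A minimal sketch:

```python
import numpy as np

def running_mean(first_params):
    """Second voice spectrum parameter of Embodiment 2: for each position n,
    the mean of first_params[0..n] (the current value plus all values that
    precede it in time order)."""
    counts = np.arange(1, len(first_params) + 1)
    return np.cumsum(first_params) / counts

first = np.array([2.0, 4.0, 6.0])
second = running_mean(first)  # -> [2.0, 3.0, 4.0]
```

Because each output uses only the current and preceding values, the computation is causal, which is what allows real-time detection.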
Embodiment 3

FIG. 4a is a flowchart of the audio starting point detection method provided in Embodiment 3 of the present disclosure. On the basis of the above embodiments, this embodiment further optimizes the step of determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order. This embodiment is applicable to scenarios with some more complex audio with a weak sense of rhythm (for example, music mixed with multiple instruments, music with a slower rhythm, and human voices). As shown in FIG. 4a, the method specifically includes:

Step S41: Determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio.

Step S42: For each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order.

Step S43: Determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean.

In an optional embodiment, step S43 includes:

Step S431: Calculate the difference between the first voice spectrum parameter of the current frequency band and the mean.

Step S432: Determine the second voice spectrum parameter of the current frequency band according to the difference.

Further, step S432 may be implemented as follows: determine the mean of the differences according to the difference corresponding to the current frequency band and the differences corresponding to the frequency bands that precede the current frequency band in time order, and use the mean of the differences as the second voice spectrum parameter of the current frequency band. When the starting points are subsequently determined from the second voice spectrum parameters, the glitch signals in the curve can be removed. For example, FIG. 4b shows a curve, composed of voice spectrum parameters, that contains glitch signals, and FIG. 4c shows the curve composed of the voice spectrum parameters in this solution. As a further example, FIG. 4d is a schematic diagram of an audio signal, FIG. 4e is a schematic diagram of the starting point detection result obtained by detecting the audio signal of FIG. 4d with a prior-art method, and FIG. 4f is a schematic diagram of the starting point detection result obtained by detecting the audio signal of FIG. 4d with the method of this embodiment.

Step S44: Determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.

In the embodiments of the present disclosure, because the first voice spectrum parameters corresponding to multiple frequency bands are referenced when determining the second voice spectrum parameter, the determined second voice spectrum parameter is more accurate, so the starting points of notes and syllables in the audio can be detected accurately and false detections and missed detections are reduced. In addition, by determining the mean of the differences for each frequency band and determining the starting point positions of the notes and syllables in the audio according to the means of the differences corresponding to the respective frequency bands, the glitch phenomenon in the existing curves can be improved and the starting point detection accuracy is further increased; and because only the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order are referenced, real-time starting point detection is ensured.
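The two-stage rule of Embodiment 3 — subtracting the running mean from each first parameter (steps S42/S431), then smoothing the resulting differences with their own running mean (step S432) — can be sketched as follows. Interpreting both means as causal running means over the current and all preceding positions is an assumption consistent with the steps above.

```python
import numpy as np

def second_spectrum_parameter(first_params):
    """Embodiment-3 style second voice spectrum parameter:
    diff[n]   = first[n] - mean(first[0..n])   (steps S42/S431)
    second[n] = mean(diff[0..n])               (step S432)"""
    counts = np.arange(1, len(first_params) + 1)
    mean_first = np.cumsum(first_params) / counts  # running mean of first params
    diff = first_params - mean_first               # difference against the mean
    return np.cumsum(diff) / counts                # running mean of differences

first = np.array([1.0, 1.0, 4.0])
# running means: [1, 1, 2]; differences: [0, 0, 2]; second: [0, 0, 2/3]
second = second_spectrum_parameter(first)
```

Subtracting the running mean removes the slowly varying baseline, and averaging the differences suppresses isolated glitch spikes before peak picking.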
Embodiment 4

FIG. 5 is a schematic structural diagram of the audio starting point detection apparatus provided in Embodiment 4 of the present disclosure. The audio starting point detection apparatus may be implemented as software, or as a combination of software and hardware, and may be integrated into a device in an audio starting point detection system, such as an audio starting point detection server or an audio starting point detection terminal device. This embodiment is applicable to scenarios with some more complex audio with a weak sense of rhythm (for example, music mixed with multiple instruments, music with a slower rhythm, and human voices). As shown in FIG. 5, the apparatus includes a first parameter determining module 51, a second parameter determining module 52, and a starting point determining module 53, where:

the first parameter determining module 51 is configured to determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio;

the second parameter determining module 52 is configured to, for each frequency band, determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order;

the starting point determining module 53 is configured to determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.

Further, the second parameter determining module 52 is specifically configured to: for each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order, and use the mean as the second voice spectrum parameter of the current frequency band.

Further, the second parameter determining module 52 includes a mean determining unit 521 and a second parameter determining unit 522, where: the mean determining unit 521 is configured to, for each frequency band, determine the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; the second parameter determining unit 522 is configured to determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean.

Further, the second parameter determining unit 522 is specifically configured to: calculate the difference between the first voice spectrum parameter of the current frequency band and the mean; and determine the second voice spectrum parameter of the current frequency band according to the difference.

Further, the second parameter determining unit 522 is specifically configured to: determine the mean of the differences according to the difference corresponding to the current frequency band and the differences corresponding to the frequency bands that precede the current frequency band in time order, and use the mean of the differences as the second voice spectrum parameter of the current frequency band.

Further, the starting point determining module 53 is specifically configured to: draw a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands; and determine a local highest point from the voice spectrum parameter curve, and determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameter corresponding to the local highest point.

Further, the first parameter determining module 51 is specifically configured to: split the audio signal of the audio into multiple sub audio signals and convert each sub audio signal into a frequency domain signal, each sub audio signal corresponding to one frequency band; and determine the first voice spectrum parameter corresponding to each frequency band.

For detailed descriptions of the working principle and technical effects of the audio starting point detection apparatus embodiment, reference may be made to the relevant descriptions in the foregoing audio starting point detection method embodiments, and details are not repeated here.

Embodiment 5
Referring now to FIG. 6, it shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure. The electronic device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 6, the electronic device may include a processing device (such as a central processing unit or a graphics processor) 601, which can perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 602 or a program loaded from a storage device 608 into random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, image sensor, microphone, accelerometer, and gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 608 including, for example, a magnetic tape and hard disk; and a communication device 609. The communication device 609 may allow the electronic device to perform wireless or wired communication with other devices to exchange data. Although FIG. 6 shows an electronic device with various devices, it should be understood that it is not required to implement or have all of the devices shown; more or fewer devices may alternatively be implemented or provided.

In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the above computer-readable medium of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium; the computer-readable signal medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to a wire, an optical cable, RF (radio frequency), or any suitable combination of the above.

The above computer-readable medium may be included in the above electronic device, or it may exist alone without being assembled into the electronic device.

The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: determines the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal; determines the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and determines the starting point positions according to the second voice spectrum parameters corresponding to the respective frequency bands.

The computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or part of code that contains one or more executable instructions for realizing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in a different order from that noted in the drawings; for example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented in software or in hardware, where the name of a unit does not in some cases constitute a limitation on the unit itself; for example, the drag point determination module may also be described as "a module for determining a drag point on a template image".

The above description is only a preferred embodiment of the present disclosure and an explanation of the applied technical principles. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover, without departing from the above disclosed concept, other technical solutions formed by any combination of the above technical features or by replacing them with technical features disclosed in the present disclosure (but not limited thereto) that have similar functions.
Claims
1. An audio starting point detection method, comprising:

determining the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio; for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.
2. The audio starting point detection method according to claim 1, wherein, for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order comprises: for each frequency band, determining the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order, and using the mean as the second voice spectrum parameter of the current frequency band.
3. The audio starting point detection method according to claim 1, wherein, for each frequency band, determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order comprises: for each frequency band, determining the mean of the first voice spectrum parameters according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean.
4. The audio starting point detection method according to claim 3, wherein determining the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the mean comprises: calculating the difference between the first voice spectrum parameter of the current frequency band and the mean; and determining the second voice spectrum parameter of the current frequency band according to the difference.
5. The audio starting point detection method according to claim 4, wherein determining the second voice spectrum parameter of the current frequency band according to the difference comprises: determining the mean of the differences according to the difference corresponding to the current frequency band and the differences corresponding to the frequency bands that precede the current frequency band in time order, and using the mean of the differences as the second voice spectrum parameter of the current frequency band.
6. The audio starting point detection method according to any one of claims 1-5, wherein determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands comprises: drawing a voice spectrum parameter curve according to the second voice spectrum parameters corresponding to the respective frequency bands; and determining a local highest point from the voice spectrum parameter curve, and determining one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameter corresponding to the local highest point.
7. The audio starting point detection method according to any one of claims 1-5, wherein determining the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio comprises: splitting the audio signal of the audio into multiple sub audio signals and converting each sub audio signal into a frequency domain signal, each sub audio signal corresponding to one frequency band; and determining the first voice spectrum parameter corresponding to each frequency band.
8. An audio starting point detection apparatus, comprising: a first parameter determining module configured to determine the first voice spectrum parameter corresponding to each frequency band according to the frequency domain signal corresponding to the audio signal of the audio; a second parameter determining module configured to, for each frequency band, determine the second voice spectrum parameter of the current frequency band according to the first voice spectrum parameter of the current frequency band and the first voice spectrum parameters of the frequency bands that precede the current frequency band in time order; and a starting point determining module configured to determine one or more starting point positions of the notes and syllables in the audio according to the second voice spectrum parameters corresponding to the respective frequency bands.
9. An electronic device, comprising: a memory configured to store non-transitory computer-readable instructions; and a processor configured to run the computer-readable instructions such that, when they are executed by the processor, the audio starting point detection method according to any one of claims 1-7 is implemented.
10. A computer-readable storage medium configured to store non-transitory computer-readable instructions which, when executed by a computer, cause the computer to perform the audio starting point detection method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/434,628 US12119023B2 (en) | 2019-02-28 | 2020-02-27 | Audio onset detection method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910151015.0 | 2019-02-28 | ||
CN201910151015.0A CN110070884B (zh) | 2019-02-28 | 2019-02-28 | 音频起始点检测方法和装置 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020173488A1 true WO2020173488A1 (zh) | 2020-09-03 |
Family
ID=67366002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/077024 WO2020173488A1 (zh) | 2019-02-28 | 2020-02-27 | 音频起始点检测方法和装置 |
Country Status (3)
Country | Link |
---|---|
US (1) | US12119023B2 (zh) |
CN (1) | CN110070884B (zh) |
WO (1) | WO2020173488A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112509601A (zh) * | 2020-11-18 | 2021-03-16 | 中电海康集团有限公司 | 一种音符起始点检测方法及系统 |
CN114678037A (zh) * | 2022-04-13 | 2022-06-28 | 北京远鉴信息技术有限公司 | 一种重叠语音的检测方法、装置、电子设备及存储介质 |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110070884B (zh) * | 2019-02-28 | 2022-03-15 | 北京字节跳动网络技术有限公司 | 音频起始点检测方法和装置 |
CN111145779B (zh) * | 2019-12-26 | 2021-08-24 | 腾讯科技(深圳)有限公司 | 一种音频文件的目标检测方法及相关设备 |
CN111091849B (zh) * | 2020-03-03 | 2020-12-22 | 龙马智芯(珠海横琴)科技有限公司 | 鼾声识别的方法及装置、存储介质止鼾设备和处理器 |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4833713A (en) * | 1985-09-06 | 1989-05-23 | Ricoh Company, Ltd. | Voice recognition system |
JPH02230296A (ja) * | 1989-03-03 | 1990-09-12 | Seiko Instr Inc | 音声信号における周期検出方法 |
CN1773605A (zh) * | 2004-11-12 | 2006-05-17 | 中国科学院声学研究所 | 一种应用于语音识别系统的语音端点检测方法 |
CN101996628A (zh) * | 2009-08-21 | 2011-03-30 | 索尼株式会社 | 提取语音信号的韵律特征的方法和装置 |
CN105280196A (zh) * | 2015-11-19 | 2016-01-27 | 科大讯飞股份有限公司 | 副歌检测方法及系统 |
CN108198547A (zh) * | 2018-01-18 | 2018-06-22 | 深圳市北科瑞声科技股份有限公司 | 语音端点检测方法、装置、计算机设备和存储介质 |
CN108320730A (zh) * | 2018-01-09 | 2018-07-24 | 广州市百果园信息技术有限公司 | 音乐分类方法及节拍点检测方法、存储设备及计算机设备 |
CN108962226A (zh) * | 2018-07-18 | 2018-12-07 | 百度在线网络技术(北京)有限公司 | 用于检测语音的端点的方法和装置 |
CN109256146A (zh) * | 2018-10-30 | 2019-01-22 | 腾讯音乐娱乐科技(深圳)有限公司 | 音频检测方法、装置及存储介质 |
CN110070885A (zh) * | 2019-02-28 | 2019-07-30 | 北京字节跳动网络技术有限公司 | 音频起始点检测方法和装置 |
CN110070884A (zh) * | 2019-02-28 | 2019-07-30 | 北京字节跳动网络技术有限公司 | 音频起始点检测方法和装置 |
CN110085214A (zh) * | 2019-02-28 | 2019-08-02 | 北京字节跳动网络技术有限公司 | 音频起始点检测方法和装置 |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8170875B2 (en) * | 2005-06-15 | 2012-05-01 | Qnx Software Systems Limited | Speech end-pointer |
JP4966048B2 (ja) * | 2007-02-20 | 2012-07-04 | 株式会社東芝 | 声質変換装置及び音声合成装置 |
WO2010146711A1 (ja) * | 2009-06-19 | 2010-12-23 | 富士通株式会社 | 音声信号処理装置及び音声信号処理方法 |
JP2011053565A (ja) * | 2009-09-03 | 2011-03-17 | Nippon Telegr & Teleph Corp <Ntt> | 信号分析装置、信号分析方法、プログラム、及び記録媒体 |
CN104681038B (zh) * | 2013-11-29 | 2018-03-09 | 清华大学 | 音频信号质量检测方法及装置 |
CN105304073B (zh) * | 2014-07-09 | 2019-03-12 | 中国科学院声学研究所 | 一种敲击弦乐器的音乐多音符估计方法及系统 |
CN104143324B (zh) * | 2014-07-14 | 2018-01-12 | 电子科技大学 | 一种乐音音符识别方法 |
JP2016038435A (ja) * | 2014-08-06 | 2016-03-22 | ソニー株式会社 | 符号化装置および方法、復号装置および方法、並びにプログラム |
KR102217292B1 (ko) * | 2015-02-26 | 2021-02-18 | 네이버 주식회사 | 적어도 하나의 의미론적 유닛의 집합을 음성을 이용하여 개선하기 위한 방법, 장치 및 컴퓨터 판독 가능한 기록 매체 |
EP3182413B1 (en) * | 2015-12-16 | 2018-08-29 | Ruhr-Universität Bochum | Adaptive line enhancer based method |
EP3392882A1 (en) * | 2017-04-20 | 2018-10-24 | Thomson Licensing | Method for processing an input audio signal and corresponding electronic device, non-transitory computer readable program product and computer readable storage medium |
CN107704447A (zh) * | 2017-08-23 | 2018-02-16 | 海信集团有限公司 | 一种中文分词方法、中文分词装置和终端 |
JP6891736B2 (ja) * | 2017-08-29 | 2021-06-18 | 富士通株式会社 | 音声処理プログラム、音声処理方法および音声処理装置 |
CN108256307B (zh) * | 2018-01-12 | 2021-04-02 | 重庆邮电大学 | 一种智能商务旅居房车的混合增强智能认知方法 |
CN108510987B (zh) * | 2018-03-26 | 2020-10-23 | 北京小米移动软件有限公司 | 语音处理方法及装置 |
- 2019-02-28: CN CN201910151015.0A patent/CN110070884B/zh active Active
- 2020-02-27: WO PCT/CN2020/077024 patent/WO2020173488A1/zh active Application Filing
- 2020-02-27: US US17/434,628 patent/US12119023B2/en active Active
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4833713A (en) * | 1985-09-06 | 1989-05-23 | Ricoh Company, Ltd. | Voice recognition system |
JPH02230296A (ja) * | 1989-03-03 | 1990-09-12 | Seiko Instr Inc | Period detection method for speech signals |
CN1773605A (zh) * | 2004-11-12 | 2006-05-17 | Institute of Acoustics, Chinese Academy of Sciences | Speech endpoint detection method for speech recognition systems |
CN101996628A (zh) * | 2009-08-21 | 2011-03-30 | Sony Corporation | Method and device for extracting prosodic features of a speech signal |
CN105280196A (zh) * | 2015-11-19 | 2016-01-27 | iFLYTEK Co., Ltd. | Chorus detection method and system |
CN108320730A (zh) * | 2018-01-09 | 2018-07-24 | Guangzhou Baiguoyuan Information Technology Co., Ltd. | Music classification method, beat point detection method, storage device, and computer device |
CN108198547A (zh) * | 2018-01-18 | 2018-06-22 | Shenzhen Raisound Technology Co., Ltd. | Speech endpoint detection method and device, computer device, and storage medium |
CN108962226A (zh) * | 2018-07-18 | 2018-12-07 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and device for detecting speech endpoints |
CN109256146A (zh) * | 2018-10-30 | 2019-01-22 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Audio detection method, device, and storage medium |
CN110070885A (zh) * | 2019-02-28 | 2019-07-30 | Beijing ByteDance Network Technology Co., Ltd. | Audio onset detection method and apparatus |
CN110070884A (zh) * | 2019-02-28 | 2019-07-30 | Beijing ByteDance Network Technology Co., Ltd. | Audio onset detection method and apparatus |
CN110085214A (zh) * | 2019-02-28 | 2019-08-02 | Beijing ByteDance Network Technology Co., Ltd. | Audio onset detection method and apparatus |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112509601A (zh) * | 2020-11-18 | 2021-03-16 | CETHIK Group Co., Ltd. | Note onset detection method and system |
CN114678037A (zh) * | 2022-04-13 | 2022-06-28 | Beijing Yuanjian Information Technology Co., Ltd. | Overlapping speech detection method and device, electronic device, and storage medium |
CN114678037B (zh) * | 2022-04-13 | 2022-10-25 | Beijing Yuanjian Information Technology Co., Ltd. | Overlapping speech detection method and device, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110070884A (zh) | 2019-07-30 |
US12119023B2 (en) | 2024-10-15 |
US20220358956A1 (en) | 2022-11-10 |
CN110070884B (zh) | 2022-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020173488A1 (zh) | Audio onset detection method and apparatus | |
WO2020119150A1 (zh) | Rhythm point recognition method and device, electronic device, and storage medium | |
CN107731223B (zh) | Voice activity detection method, related device, and equipment | |
WO2022105545A1 (zh) | Speech synthesis method and device, readable medium, and electronic device | |
CN111369971B (zh) | Speech synthesis method and device, storage medium, and electronic device | |
CN111583903A (zh) | Speech synthesis method, vocoder training method, device, medium, and electronic device | |
CN111798821B (zh) | Voice conversion method and device, readable storage medium, and electronic device | |
CN110070885B (zh) | Audio onset detection method and apparatus | |
CN109979418B (zh) | Audio processing method and device, electronic device, and storage medium | |
CN112309414B (zh) | Active noise reduction method based on audio encoding and decoding, earphone, and electronic device | |
WO2020173211A1 (zh) | Image special-effect triggering method and device, and hardware device | |
CN112562633B (zh) | Singing synthesis method and device, electronic device, and storage medium | |
CN111739544A (zh) | Speech processing method and device, electronic device, and storage medium | |
CN112309409A (zh) | Audio correction method and related device | |
CN110085214B (zh) | Audio onset detection method and apparatus | |
WO2020224294A1 (zh) | System, method, and device for processing information | |
CN111540344A (zh) | Acoustic network model training method and device, and electronic device | |
CN113496706B (zh) | Audio processing method and device, electronic device, and storage medium | |
WO2023061496A1 (zh) | Audio signal alignment method and device, storage medium, and electronic device | |
WO2023051651A1 (zh) | Music generation method, device, equipment, storage medium, and program | |
CN111444384B (zh) | Audio key point determination method, device, equipment, and storage medium | |
CN109495786B (zh) | Pre-configuration method and device for video processing parameter information, and electronic device | |
KR20180103639A (ko) | Similarity analysis of music sequences based on relative similarity | |
US20240282329A1 (en) | Method and apparatus for separating audio signal, device, storage medium, and program | |
CN112671966B (zh) | In-ear-monitor latency detection device and method, electronic device, and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20762373 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 26.01.2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20762373 Country of ref document: EP Kind code of ref document: A1 |