CN113794963B - Speech enhancement system based on low-cost wearable sensor
- Publication number: CN113794963B
- Authority: CN (China)
- Prior art keywords: signal, sampling rate, sound, low, wearable
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H04R1/1016: Earpieces of the intra-aural type (H04R1/10 Earpieces; attachments therefor; earphones; monophonic headphones)
- G01H11/08: Measuring mechanical vibrations or ultrasonic, sonic or infrasonic waves by detecting changes in electric or magnetic properties, by electric means, using piezoelectric devices
- G10L21/0224: Speech enhancement, e.g. noise reduction or echo cancellation; noise filtering characterised by the method used for estimating noise; processing in the time domain
- G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the analysis technique
- H04R1/1083: Earpieces; reduction of ambient noise
- H04R17/00: Piezoelectric transducers; electrostrictive transducers
- H04W4/80: Services using short range communication, e.g. near-field communication [NFC], radio-frequency identification [RFID] or low energy communication
- H04R2201/10: Details of earpieces, attachments therefor, earphones or monophonic headphones covered by H04R1/10 but not provided for in any of its subgroups
Abstract
The invention discloses a speech enhancement system based on low-cost wearable sensors. The system comprises a wearable device and a smart device. The wearable device comprises a micro control unit, an audio processing unit, a filter, an in-ear earphone and a piezoelectric ceramic sheet; under the control of the micro control unit, the in-ear earphone and the piezoelectric ceramic sheet respectively collect the sound signal inside the ear canal and the neck vibration signal while the user speaks, and transmit them to the smart device. The smart device aligns the received neck vibration signal and in-ear-canal sound signal, extracts a corresponding time-frequency diagram or time series, and inputs it into a pre-trained deep learning model to obtain a target-quality speech signal whose resolution and perceived listening quality are superior to those of the signals collected by the wearable device. The invention can convert low-cost wearable sensor signals into high-quality signals while protecting user privacy, and is suitable for daily use.
Description
Technical Field
The invention relates to the technical field of wearable devices, and in particular to a speech enhancement system based on low-cost wearable sensors.
Background
Human-computer interaction is one of the most important functions of a wearable device, yet it is constrained by the device's form factor and strongly shapes the user experience. Among the interaction modes available to wearable devices, voice interaction is natural and has a low learning cost. Smart earphones hold most of the wearable market share, and neck-worn devices (smart necklaces, neck-band headsets) are an emerging category that users may accept. A wearable device needs only a single microphone to support voice input. However, microphones are sensitive to ambient noise, so the recorded data contain considerable noise and are of poor quality. Furthermore, because wearable devices must be miniaturized, their processing power is limited and they cannot sample at high rates, while everyday recording scenarios demand real-time transmission. As a result, low-cost wearable devices, constrained by their own hardware, can typically acquire the user's voice only at a low sampling rate and low quality, and users can clearly perceive the difference between audio recorded by high-cost and low-cost microphones. Improving the received speech quality when a user relies on low-cost hardware, and preserving high-quality speech data as far as possible, therefore improves the user experience in both local recording and call scenarios.
With the continuous development of deep learning, related applications have appeared in many fields. However, how to convert a low-quality sensor signal into a high-quality sensor signal has not been specifically explored. Existing related research includes:
1) Audio super-resolution: up-sampling a low-sampling-rate sound signal into a high-sampling-rate one with deep learning to improve audio quality.
2) Cross-sensor or cross-modality mapping: for example, recovering an audio signal from an accelerometer signal (cross-sensor), or recovering speech from text and text from video (cross-modality).
3) Speech enhancement: for example, speech denoising and multi-modal speech enhancement.
4) Wearable interaction: for example, gesture recognition and voice input.
However, existing voice interaction schemes are hard to deploy because of the complexity of their deep learning models, or are hard to apply to the voice interaction of smart wearable devices in practice because the interaction is inconvenient or is easily disturbed by external noise.
Disclosure of Invention
It is an object of the present invention to overcome the above drawbacks of the prior art and to provide a speech enhancement system based on low-cost wearable sensors.
According to a first aspect of the invention, a speech enhancement system based on low-cost wearable sensors is provided. The system comprises a wearable device and a smart device. The wearable device comprises a micro control unit, an audio processing unit, a filter, an in-ear earphone and a piezoelectric ceramic sheet; under the control of the micro control unit, the in-ear earphone and the piezoelectric ceramic sheet respectively collect the sound signal inside the ear canal and the neck vibration signal while the user speaks, and transmit them to the smart device. The smart device aligns the received neck vibration signal and in-ear-canal sound signal, extracts a corresponding time-frequency diagram or time series, and inputs it into a pre-trained deep learning model to obtain a target-quality speech signal whose resolution and perceived listening quality are superior to those of the signals collected by the wearable device.
According to a second aspect of the invention, a speech enhancement method based on low-cost wearable sensors is provided. The method comprises the following steps:
collecting the neck vibration signal and the sound signal inside the ear canal while the user speaks;
aligning the neck vibration signal and the in-ear-canal sound signal, extracting a corresponding time-frequency diagram or time series, and inputting it into a pre-trained deep learning model to obtain a target-quality speech signal whose resolution and perceived listening quality are superior to those of the signals collected by the wearable device.
Compared with the prior art, the speech enhancement system based on low-cost wearable sensors has the following advantages: it fulfils the functions of an ordinary earphone while also collecting the sound propagating inside the ear canal and the signal of the piezoelectric ceramic sheet, and owing to the collection positions and the characteristics of the sensors it captures better raw signals. The invention can convert low-cost wearable sensor signals into high-quality signals while protecting user privacy, achieves a speech quality equal or superior to that of high-cost sensors, and is suitable for daily use.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a block diagram of a low cost wearable sensor based speech enhancement system according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a user wearing a speech enhancement system according to one embodiment of the present invention;
FIG. 3 is a flow diagram of a low cost wearable sensor based speech enhancement method according to one embodiment of the present invention;
FIG. 4 is a schematic diagram of a low cost wearable sensor based speech enhancement process according to one embodiment of the present invention;
FIG. 5 is a flow diagram of detecting a voice event and running a deep learning model at a terminal according to one embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The speech enhancement system based on low-cost wearable sensors provided by the invention enhances the signals acquired by low-cost sensors and restores them into high-quality signals that are superior to the originals. Referring to fig. 1 and 2, fig. 1(a) is a front view of the system and fig. 1(b) a rear view. The system as a whole comprises a wearable device and a smart device, the wearable device comprising a micro control unit 1, an audio processing unit 2, a filter 3, a storage unit 4, an in-ear earphone 5, a piezoelectric ceramic sheet 6 and a power supply module 7. The wearable device may be a smart necklace, a neck-worn headset, a neck band, or the like. The smart device may be a smart terminal, another wearable device, or another type of electronic device such as a smartphone, tablet, desktop computer or vehicle-mounted device; fig. 1 takes a smartphone as an example.
The micro control unit (MCU) 1 may use a high-performance chip as the central controller that coordinates the other modules or units and handles communication between the wearable device and the smartphone. For example, the MCU 1 uses an ESP32, which can run applications as a standalone system or act as a slave to a host MCU, and which provides communication functions such as Wi-Fi and Bluetooth through SPI/SDIO or I2C/UART interfaces.
The audio processing unit 2 may employ an integrated chip such as the VS1053, an audio codec module accessed over SPI that supports decoding/playback and encoding/storage of audio files.
The filter 3 provides filtering and amplification; an LM358, which contains a dual operational amplifier, can be used.
The storage unit 4 is, for example, an SD card for storing the acquired signals or audio files.
The microphone inside the in-ear earphone 5 collects the sound signal inside the ear canal while the user speaks; collecting the signal inside the ear canal shields it from ambient noise to some extent.
The piezoelectric ceramic sheet 6 collects the neck vibration signal while the user speaks. The sheet is very sensitive to vibration and produces voltage changes whose amplitude follows the vibration amplitude. To make the acquired neck vibration signal more accurate and cleaner, a filtering and amplifying circuit can moderately amplify and filter the useful signal. When the user wears the device, the piezoelectric ceramic sheet rests near the user's vocal cords, which helps acquire an accurate neck vibration signal and keeps the device easy to carry.
The power supply module 7 powers the wearable device and may be a common battery type such as LiPo.
Hereinafter, with reference to fig. 3 and fig. 4, the speech enhancement process is described taking an Android phone as an example. It comprises the following steps:
and step S1, respectively collecting neck vibration signals and sound signals in the auditory canal when the user vocalizes through the piezoelectric ceramic piece and the microphone on the in-ear earphone.
For example, besides the power key and the microphone and piezoelectric ceramic sheet used in normal operation, the wearable device housing carries a button that starts the speech enhancement function and can be switched on manually by the user. In one embodiment, this button is located on the side of the housing and the piezoelectric ceramic sheet on the back of the system.
In step S1, the filter 3 filters out noise, including power-line (mains) interference, and amplifies and retains the useful signal. The ESP32 samples the neck vibration signal at 10 kHz through its high-speed ADC and buffers the data on the SD card, which it accesses over the SDIO (Secure Digital Input Output) bus. When the user wears the device, the piezoelectric ceramic sensor rests against the side of the neck at the vocal cord position so as to maximize the contact area.
Preferably, the integrated audio processing chip VS1053 collects the sound signal. Compared with sampling the sound directly through an audio amplifier, the VS1053 contains a digital filter and a set of built-in filtering algorithms, which effectively reduce the processing load for the ear canal signal. The sound signal is likewise sampled at 10 kHz; the VS1053 and the ESP32 exchange data over the SPI communication protocol, and the collected sound is converted into WAV format and stored on the SD card.
Step S2: store the collected data and transmit it to the smart terminal for processing.
For example, in step S2 the data are sent to the phone over the widely supported Bluetooth protocol. To minimize the transmission volume, the piezoelectric ceramic data and the ear canal data are compressed, for example with a Huffman algorithm. All data are transmitted as raw binary, and a check code plus a frame header and frame tail are added so that the data reach the phone reliably for processing, as sketched below.
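By way of illustration and not limitation, the framing just described (compression, check code, frame header and frame tail) might look like the following Python sketch; the byte layout, the two-byte delimiters and the use of zlib compression in place of a Huffman coder are assumptions made for the example, not part of the disclosure:

```python
import struct
import zlib

FRAME_HEAD = b"\xAA\x55"  # illustrative 2-byte frame header
FRAME_TAIL = b"\x55\xAA"  # illustrative 2-byte frame tail

def pack_frame(sensor_id: int, samples: bytes) -> bytes:
    """Compress raw sensor bytes and wrap them with a header, a length
    field, a CRC32 check code and a tail for the Bluetooth link."""
    payload = zlib.compress(samples)          # stand-in for Huffman coding
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    # head | sensor id (1 B) | payload length (4 B) | payload | crc (4 B) | tail
    return (FRAME_HEAD + struct.pack("<BI", sensor_id, len(payload))
            + payload + struct.pack("<I", crc) + FRAME_TAIL)

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    """Inverse of pack_frame; raises ValueError on a corrupted frame."""
    if frame[:2] != FRAME_HEAD or frame[-2:] != FRAME_TAIL:
        raise ValueError("bad frame delimiters")
    sensor_id, length = struct.unpack_from("<BI", frame, 2)
    payload = frame[7:7 + length]             # header (2 B) + id/length (5 B)
    (crc,) = struct.unpack_from("<I", frame, 7 + length)
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("checksum mismatch")
    return sensor_id, zlib.decompress(payload)
```

A receiver would scan the Bluetooth stream for the head delimiter, read the declared length, and discard any frame whose CRC fails, which matches the role of the check code described above.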
Step S3: filter noise from the signals and detect voice events with voice activity detection.
Referring to fig. 5, detecting a voice event comprises the following steps:
s31, recording a segment of data without voice after the user wears the equipment; and performing data framing on the original signal, and performing noise filtering processing.
Specifically, data recorded while the current user is silent serve as the reference for the noise signal. The time-frequency spectrum of the noise is then computed by a time-frequency transform such as the short-time Fourier transform (STFT), and a noise threshold is derived from the mean and variance of that spectrum. The time-frequency spectrum of the raw signal is computed in the same way, and every component below the noise threshold is removed as noise, yielding denoised data. This denoising is applied to both the in-ear-canal sound signal and the neck vibration signal.
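A minimal sketch of this threshold-based noise filtering, assuming an STFT front end and a mean-plus-k-standard-deviations threshold rule (the factor k and the window length are illustrative choices, not specified by the disclosure):

```python
import numpy as np
from scipy.signal import stft, istft

def noise_gate(signal, noise_ref, fs=10_000, nperseg=256, k=1.5):
    """Remove time-frequency components that fall below a threshold
    learned from a voice-free reference recording (per-frequency
    threshold = mean + k * std of the noise magnitudes)."""
    _, _, noise_spec = stft(noise_ref, fs=fs, nperseg=nperseg)
    noise_mag = np.abs(noise_spec)
    thresh = noise_mag.mean(axis=1) + k * noise_mag.std(axis=1)
    _, _, spec = stft(signal, fs=fs, nperseg=nperseg)
    mask = np.abs(spec) >= thresh[:, None]   # keep bins above the noise floor
    _, clean = istft(spec * mask, fs=fs, nperseg=nperseg)
    return clean
```

The same gate would be run once on the in-ear-canal sound signal and once on the neck vibration signal, each with its own silent reference.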
S32: for each frame of the signal, detect whether voice activity is present.
Specifically, voice activity detection (VAD) is run on each frame produced in S31 to decide whether a voice activity or voice event is occurring; when one is, the neck vibration signal and the sound signal (collectively, the voice data) are sent to the smart terminal.
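The disclosure does not fix a particular VAD algorithm; a simple frame-energy detector such as the following sketch is one common choice (the dB threshold and frame sizes are assumptions):

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def voice_activity(x, thresh_db=-40.0, frame_len=256, hop=128):
    """Energy-based VAD: a frame is flagged as voiced when its RMS
    level exceeds a fixed dB threshold. Returns one boolean per frame."""
    frames = frame_signal(np.asarray(x, float), frame_len, hop)
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    return 20.0 * np.log10(rms) > thresh_db
```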
Step S4: the terminal processes the received voice data and feeds the quality-improved audio signal back to the user.
Still referring to fig. 5, step S4 includes:
s41, the terminal synchronizes the time of the received data of the two sensors and aligns the signals.
First, for the voice activity detected in step S32, the terminal receives the data of the two sensors, performs time synchronization, and aligns signals, specifically: 1) the data of the two sensors are first converted into a sequence of window energies, the window size being n. Then, the cross-correlation values of the two sensors are calculated to obtain the cross-correlation value of coarse granularity, and coarse granularity time synchronization is carried out. 2) And acquiring a section of original data (comprising two sensor signals) before and after the position of coarse grain alignment, calculating the cross correlation value of the original data, acquiring the cross correlation value of fine grain, and completing the time synchronization of fine grain on the basis of coarse grain time synchronization to acquire two sensor signals with better time synchronization.
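The two-stage synchronization may be sketched as follows; the window size n, the excerpt span and the use of np.roll for the coarse shift are illustrative assumptions:

```python
import numpy as np

def xcorr_lag(a, b):
    """Lag of b relative to a (in samples) that maximises the
    cross-correlation of the two zero-mean sequences."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    xc = np.correlate(a, b, mode="full")
    return int(np.argmax(xc)) - (len(b) - 1)

def window_energy(x, n):
    """Collapse a signal into a sequence of window energies (window size n)."""
    m = len(x) // n
    return (np.asarray(x[:m * n], float) ** 2).reshape(m, n).sum(axis=1)

def align(a, b, n=256, span=2048):
    """Coarse lag from the window-energy sequences, refined by a fine lag
    from raw samples around the coarsely aligned position. Assumes both
    signals are longer than 2 * span."""
    coarse = xcorr_lag(window_energy(a, n), window_energy(b, n)) * n
    b_coarse = np.roll(b, coarse)          # coarsely aligned copy of b
    c = len(a) // 2                        # centre of the fine excerpt
    fine = xcorr_lag(a[c - span:c + span], b_coarse[c - span:c + span])
    return coarse + fine                   # total lag of b relative to a
```

np.roll wraps samples around the ends; for the short central excerpt used here that is harmless, but a production implementation would pad with zeros instead. The coarse stage correlates the much shorter energy sequences, which keeps the full-length correlation cheap.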
S42: run the deep learning models on the terminal, enhance the signal, and feed it back to the user.
Specifically, a first deep learning model converts the low-cost, low-sampling-rate sensor signal into a high-cost, low-sampling-rate sensor signal. A second deep learning model then converts the high-cost, low-sampling-rate signal into a high-cost, high-sampling-rate signal, and the enhanced speech is fed back to the user, improving the subjective experience. The deep learning models embedded in the terminal are pre-trained on a data set; since the terminal's computing power is relatively limited, training can be done offline on a cloud platform or server.
Using two deep learning models reduces model complexity and improves training efficiency. The first model converts the signal of a low-cost sensor into the signal of a high-cost sensor (possibly at the same resolution), improving the sensor response and content quality and hence the perceived listening quality; the second model reconstructs a low-sampling-rate signal of the same sensor into a high-sampling-rate signal, i.e. super-resolution, chiefly raising the sampling rate. Splitting the reconstruction into two steps also makes the effect of the first model observable, produces an interpretable intermediate result, and helps identify which model needs improvement later. It should be understood that a single model could also reconstruct the low-cost sensor signal directly to a signal of the target resolution.
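As an illustration of the two-step pipeline (and not of the actual architectures or weights, which the disclosure leaves open), a toy PyTorch sketch might look like this; all layer sizes and the 4x upsampling factor are assumptions:

```python
import torch
import torch.nn as nn

class Stage1(nn.Module):
    """Illustrative 'low-cost -> high-cost' mapping at the same (low)
    sampling rate; the two input channels are the neck vibration
    signal and the in-ear-canal sound signal."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, ch, 9, padding=4), nn.ReLU(),
            nn.Conv1d(ch, 1, 9, padding=4),
        )

    def forward(self, x):           # x: (batch, 2, T)
        return self.net(x)          # (batch, 1, T), same sampling rate

class Stage2(nn.Module):
    """Illustrative super-resolution stage: low rate -> 4x higher rate."""
    def __init__(self, ch=16, up=4):
        super().__init__()
        self.up = nn.Upsample(scale_factor=up, mode="linear",
                              align_corners=False)
        self.net = nn.Sequential(
            nn.Conv1d(1, ch, 9, padding=4), nn.ReLU(),
            nn.Conv1d(ch, 1, 9, padding=4),
        )

    def forward(self, x):           # (batch, 1, T) -> (batch, 1, up * T)
        return self.net(self.up(x))

def enhance(vibration, in_ear, stage1, stage2):
    """Run both pre-trained stages on aligned 1-D float tensors."""
    x = torch.stack([vibration, in_ear]).unsqueeze(0)   # (1, 2, T)
    with torch.no_grad():
        y = stage2(stage1(x))                           # (1, 1, up * T)
    return y.squeeze()                                  # enhanced waveform
```

The intermediate output of Stage1 is exactly the interpretable result mentioned above: it can be listened to and evaluated on its own before super-resolution is applied.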
In one embodiment, the data set for pre-training the deep learning models is constructed as follows:
Step S51: the subject (user) speaks normally while wearing the wearable device.
To cover a variety of practical situations, several speaking scenarios can be set up, such as reading Chinese and English corpora.
Step S52: collect the in-ear-canal sound signal and the neck vibration signal.
Specifically, while the low-cost sensors record low-sampling-rate signals, a high-cost, high-quality microphone records simultaneously to capture high-quality reference signals. If the two recordings are misaligned, the method of step S41 can time-synchronize them, and the synchronized data form the training data set.
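A sketch of how the synchronized recordings could be turned into training pairs for the first model (file paths, sampling rates and the reuse of the align() helper from the synchronization sketch above are all assumptions):

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

def make_stage1_pairs(low_path, high_path, win=1024):
    """Build aligned (low-quality, high-quality) windows for training the
    first model; the reference is downsampled to the wearable rate so
    both halves of a pair share the same (low) sampling rate."""
    fs_lo, lo = wavfile.read(low_path)       # e.g. 10 kHz wearable recording
    fs_hi, hi = wavfile.read(high_path)      # e.g. 48 kHz reference microphone
    hi_lo = resample_poly(hi.astype(float), fs_lo, fs_hi)  # match the low rate
    lag = align(lo, hi_lo)                   # align() from the sketch above
    hi_lo = np.roll(hi_lo, lag)
    n = min(len(lo), len(hi_lo)) // win
    return [(lo[i * win:(i + 1) * win], hi_lo[i * win:(i + 1) * win])
            for i in range(n)]
```

Pairs for the second (super-resolution) model would instead keep the reference at its native high rate as the target.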
The deep learning models include, but are not limited to, convolutional neural networks and recurrent neural networks, and ultimately serve to enhance the low-sampling-rate signal of the low-cost sensor. By operating domain they divide into time-domain models and time-frequency-domain models.
A time-domain model operates in the time domain; a network combining a one-dimensional convolutional neural network with a recurrent neural network can be used to reconstruct the low-quality signal into a high-quality signal directly in the time domain.
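One illustrative shape for such a time-domain model, assuming a small stack of one-dimensional convolutions followed by a GRU (all sizes are arbitrary choices, not disclosed values):

```python
import torch
import torch.nn as nn

class TimeDomainEnhancer(nn.Module):
    """Illustrative time-domain network: 1-D convolutions capture local
    waveform structure, a GRU adds longer-range temporal context."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, ch, 15, padding=7), nn.ReLU(),
            nn.Conv1d(ch, ch, 15, padding=7), nn.ReLU(),
        )
        self.rnn = nn.GRU(ch, ch, batch_first=True)
        self.out = nn.Linear(ch, 1)

    def forward(self, x):                   # x: (batch, 1, T)
        h = self.conv(x).transpose(1, 2)    # (batch, T, ch)
        h, _ = self.rnn(h)                  # temporal context
        return self.out(h).transpose(1, 2)  # (batch, 1, T) enhanced waveform
```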
A time-frequency-domain model operates in the time-frequency domain: the input signal is first converted into a time-frequency representation by the short-time Fourier transform (STFT) and fed to the model (for example a two-dimensional convolutional neural network), and the model output is converted back into a time-series signal by the inverse short-time Fourier transform (iSTFT).
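An illustrative time-frequency-domain pipeline, assuming magnitude-only processing with the noisy phase reused at reconstruction (a common simplification that the disclosure neither requires nor excludes):

```python
import torch
import torch.nn as nn

class SpecEnhancer(nn.Module):
    """Illustrative time-frequency pipeline: STFT -> 2-D CNN on the
    magnitude spectrogram -> iSTFT with the input phase reused."""
    def __init__(self, n_fft=256, hop=64, ch=16):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.net = nn.Sequential(
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, x):                              # x: (batch, T) waveform
        win = torch.hann_window(self.n_fft, device=x.device)
        spec = torch.stft(x, self.n_fft, self.hop,
                          window=win, return_complex=True)  # (batch, F, frames)
        mag, phase = spec.abs(), torch.angle(spec)
        mag = self.net(mag.unsqueeze(1)).squeeze(1).clamp(min=0.0)
        return torch.istft(torch.polar(mag, phase),          # rebuild complex
                           self.n_fft, self.hop, window=win)  # (batch, T')
```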
In conclusion, the invention collects the in-ear-canal sound signal and the neck vibration signal with low-cost sensors mounted on a wearable device and sends them to an everyday smart device for processing and enhancement, yielding an enhanced speech signal. Compared with the signals acquired directly from the earphone or the piezoelectric ceramic sheet, the enhanced signal is markedly better in resolution (sampling rate) and perceived listening quality, and can match or exceed a signal captured directly by a high-cost, high-resolution sensor. Because the ceramic sheet, earphone and other parts mounted on the wearable device can be inexpensive, the hardware cost stays low, the device suits edge deployment, and low-quality signals can be restored to high quality.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic or semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., light pulses through a fiber optic cable), or electrical signals transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++ or Python, and conventional procedural programming languages such as the "C" programming language or similar languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA) or a programmable logic array (PLA), with state information of the computer readable program instructions, the electronic circuit executing the computer readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.
Claims (9)
1. A speech enhancement system based on low-cost wearable sensors, comprising a wearable device and a smart device, wherein the wearable device comprises a micro control unit, an audio processing unit, a filter, an in-ear earphone and a piezoelectric ceramic sheet; under the control of the micro control unit, the in-ear earphone and the piezoelectric ceramic sheet respectively collect the sound signal inside the ear canal and the neck vibration signal while the user speaks, and transmit them to the smart device;
the smart device aligns the received neck vibration signal and in-ear-canal sound signal, extracts a corresponding time-frequency diagram or time series, and inputs it into a pre-trained deep learning model to obtain a target-quality speech signal whose resolution and perceived listening quality are superior to those of the signals collected by the wearable device;
wherein the smart device obtains the target-quality speech signal according to the following steps:
inputting the neck vibration signal and the in-ear-canal sound signal into a first deep learning model to obtain a first target-quality signal, the first deep learning model reflecting the correspondence between a low-cost, low-sampling-rate sensor signal and a high-cost, low-sampling-rate sensor signal so as to convert the former into the latter;
and inputting the first target-quality signal into a second deep learning model to obtain the final target-quality speech signal, the second deep learning model reflecting the correspondence between a high-cost, low-sampling-rate sensor signal and a high-cost, high-sampling-rate sensor signal so as to convert the former into the latter.
2. The system of claim 1, wherein the wearable device housing carries an activation key that, upon an activation operation by the user, starts detection of voicing events; when a voicing event is detected, the collected in-ear-canal sound signal and neck vibration signal are transmitted to the smart device.
3. The system of claim 1, wherein, when the wearable device is worn, the piezoelectric ceramic sheet sits close to the position of the user's vocal cords.
4. The system of claim 1, wherein the micro control unit is an ESP32, the audio processing unit is the audio processing chip VS1053 and the filter is an LM358; the ESP32 performs data transmission through the SPI communication protocol, and the collected sound signal is converted into WAV format and stored on an SD card.
5. The system of claim 1, wherein the wearable device compresses the neck vibration signal and the in-ear-canal sound signal with a Huffman algorithm and transmits the compressed data to the smart device.
6. The system of claim 1, wherein the smart device aligns the received neck vibration signal and in-ear-canal sound signal according to the following steps:
converting the neck vibration signal and the in-ear-canal sound signal into energy sequences of a predetermined window size;
computing the cross-correlation of the two sequences to obtain a coarse-grained correlation peak and performing coarse time synchronization;
taking a stretch of raw data from both signals around the coarsely aligned position, computing its cross-correlation to obtain a fine-grained correlation peak, and performing fine time synchronization.
7. The system of claim 1, wherein the smart device is a smartphone, a tablet, a desktop computer or a vehicle-mounted device, and the wearable device is a neck-worn device.
8. A speech enhancement method based on low-cost wearable sensors, comprising the following steps:
collecting the neck vibration signal and the sound signal inside the ear canal while the user speaks;
aligning the neck vibration signal and the in-ear-canal sound signal, extracting a corresponding time-frequency diagram or time series, and inputting it into a pre-trained deep learning model to obtain a target-quality speech signal whose resolution and perceived listening quality are superior to those of the signals collected by the wearable device;
wherein the target-quality speech signal is obtained according to the following steps:
inputting the neck vibration signal and the in-ear-canal sound signal into a first deep learning model to obtain a first target-quality signal, the first deep learning model reflecting the correspondence between a low-cost, low-sampling-rate sensor signal and a high-cost, low-sampling-rate sensor signal so as to convert the former into the latter;
and inputting the first target-quality signal into a second deep learning model to obtain the final target-quality speech signal, the second deep learning model reflecting the correspondence between a high-cost, low-sampling-rate sensor signal and a high-cost, high-sampling-rate sensor signal so as to convert the former into the latter.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111075171.7A CN113794963B (en) | 2021-09-14 | 2021-09-14 | Speech enhancement system based on low-cost wearable sensor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111075171.7A CN113794963B (en) | 2021-09-14 | 2021-09-14 | Speech enhancement system based on low-cost wearable sensor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113794963A CN113794963A (en) | 2021-12-14 |
CN113794963B true CN113794963B (en) | 2022-08-05 |
Family
ID=78880183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111075171.7A Active CN113794963B (en) | 2021-09-14 | 2021-09-14 | Speech enhancement system based on low-cost wearable sensor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113794963B (en) |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10813559B2 (en) * | 2015-06-14 | 2020-10-27 | Facense Ltd. | Detecting respiratory tract infection based on changes in coughing sounds |
US9749766B2 (en) * | 2015-12-27 | 2017-08-29 | Philip Scott Lyren | Switching binaural sound |
US20180084341A1 (en) * | 2016-09-22 | 2018-03-22 | Intel Corporation | Audio signal emulation method and apparatus |
US10382092B2 (en) * | 2017-11-27 | 2019-08-13 | Verizon Patent And Licensing Inc. | Method and system for full duplex enhanced audio |
DK201970509A1 (en) * | 2019-05-06 | 2021-01-15 | Apple Inc | Spoken notifications |
US12075210B2 (en) * | 2019-10-04 | 2024-08-27 | Soundskrit Inc. | Sound source localization with co-located sensor elements |
US20210280322A1 (en) * | 2019-10-31 | 2021-09-09 | Facense Ltd. | Wearable-based certification of a premises as contagion-safe |
CN112235679B (en) * | 2020-10-29 | 2022-10-14 | 北京声加科技有限公司 | Signal equalization method and processor suitable for earphone and earphone |
- 2021-09-14: application CN202111075171.7A filed in China; patent CN113794963B granted, status active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN204190755U * | 2014-07-28 | 2015-03-04 | 胡健生 | Necklace-type throat-microphone intercom
CN204498328U * | 2015-03-20 | 2015-07-22 | 捷音特科技股份有限公司 | Piezoelectric ceramic dual-frequency bass enhancement earphone
CN105476152A * | 2015-11-23 | 2016-04-13 | 陈昊 | Cycling helmet with throat microphone function
CN106601227A * | 2016-11-18 | 2017-04-26 | 北京金锐德路科技有限公司 | Audio acquisition method and audio acquisition device
CN109729448A * | 2017-10-27 | 2019-05-07 | 北京金锐德路科技有限公司 | Voice control optimization method and device for a neck-worn voice interaction earphone
CN109410976A * | 2018-11-01 | 2019-03-01 | 北京工业大学 | Speech enhancement method based on binaural sound source localization and deep learning in binaural hearing aids
CN110044472A * | 2019-03-22 | 2019-07-23 | 武汉源海博创科技有限公司 | Intelligent online product abnormal-sound detection system
CN112420063A * | 2019-08-21 | 2021-02-26 | 华为技术有限公司 | Speech enhancement method and device
CN111883161A * | 2020-07-08 | 2020-11-03 | 东方通信股份有限公司 | Method and device for audio acquisition and position identification
Non-Patent Citations (1)
Title |
---|
Hong Jin et al. "Vocal-cord vibration speech recognition system based on piezoelectric ceramics." Microcontrollers & Embedded Systems Applications, 2020, No. 7, pp. 56-64. *
Also Published As
Publication number | Publication date |
---|---|
CN113794963A (en) | 2021-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA3075738C (en) | Low latency audio enhancement | |
US9536540B2 (en) | Speech signal separation and synthesis based on auditory scene analysis and speech modeling | |
CN106486130B (en) | Noise elimination and voice recognition method and device | |
CN109493877B (en) | Voice enhancement method and device of hearing aid device | |
CN111009257B (en) | Audio signal processing method, device, terminal and storage medium | |
US11605372B2 (en) | Time-based frequency tuning of analog-to-information feature extraction | |
CN110858483A (en) | Intelligent device, voice awakening method, voice awakening device and storage medium | |
CN105338459A (en) | MEMS (Micro-Electro-Mechanical System) microphone and signal processing method thereof | |
CN112992169A (en) | Voice signal acquisition method and device, electronic equipment and storage medium | |
WO2022121182A1 (en) | Voice activity detection method and apparatus, and device and computer-readable storage medium | |
CN110992967A (en) | Voice signal processing method and device, hearing aid and storage medium | |
WO2017000772A1 (en) | Front-end audio processing system | |
CN112383855A (en) | Bluetooth headset charging box, recording method and computer readable storage medium | |
WO2022199405A1 (en) | Voice control method and apparatus | |
CN111831116A (en) | Intelligent equipment interaction method based on PPG information | |
Schilk et al. | In-ear-voice: Towards milli-watt audio enhancement with bone-conduction microphones for in-ear sensing platforms | |
Sui et al. | TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms | |
CN113794963B (en) | Speech enhancement system based on low-cost wearable sensor | |
CN112735382A (en) | Audio data processing method and device, electronic equipment and readable storage medium | |
CN113039601B (en) | Voice control method, device, chip, earphone and system | |
Luo et al. | Audio-visual speech separation using i-vectors | |
KR102223653B1 (en) | Apparatus and method for processing voice signal and terminal | |
CN207518801U | Remote music playing device for a neck-worn voice interaction earphone |
CN207518804U | Telecommunication device for a neck-worn voice interaction earphone |
CN112908334A (en) | Hearing aid method, device and equipment based on directional pickup |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |