WO2024107342A1 - Dynamic effects karaoke - Google Patents
- Publication number: WO2024107342A1 (PCT/US2023/036670)
- Authority: WIPO (PCT)
- Prior art keywords: signal, song, audio, vocal, voice
Classifications
- G10H1/361—Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
- G10H1/14—Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour during execution
- G10H1/46—Volume control
- G10H1/366—Recording/reproducing of accompaniment with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
- G10H2210/056—Musical analysis for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; identification or separation of instrumental parts by their characteristic voices or timbres
- G10H2210/265—Acoustic effect simulation, i.e. volume, spatial, resonance or reverberation effects added to a musical sound, usually by appropriate filtering or delays
- G10H2240/121—Musical libraries, i.e. musical databases indexed by musical parameters, wavetables, indexing schemes using musical parameters, musical rule bases or knowledge bases, e.g. for automatic composing methods
- G10H2240/325—Synchronizing two or more audio tracks or files according to musical features or musical timings
- G10L21/013—Adapting to target pitch
- G10L21/0208—Noise filtering
- G10L21/0272—Voice signal separating
- G10L25/78—Detection of presence or absence of voice signals
Definitions
The output of audio input processing 210 is an audio vocal signal 215, which ideally represents only the user's voice, and in practice may include residual noise and song signals that have been significantly attenuated. The vocal signal 215 is then processed by a voice effect processing module 220, which introduces audio effects to enhance the user's vocal input. The module receives the song signal 205, allowing the effects to be dynamically adjusted to match the song as a whole and/or to vary during a song, for example, with different effects during a verse and during a chorus. In some examples, the song signal 205 includes data as well as audio, for example, metadata that is used for adjusting the voice effects. In some examples, not illustrated in the figure, acoustic conditions such as the noise level or other noise characteristics (e.g., spectral shape) are provided to the voice effect processing module 220 to modify the parameters of certain voice effects, for example, reducing the amount of reverberation in high-noise situations. The output of the module is an enhanced vocal signal 225. In some examples, the voice effect processing is also controlled directly by the user, for example, by setting parameters using manual or voice command inputs.

The enhanced vocal signal 225 is then combined with the song signal 205, illustrated in this example as a signal summation 230, yielding a combined signal 235, which is passed to an audio output processing module 240. This output processing module may amplify the combined signal to a level controlled by the user and adjust typical parameters such as balance and front-back fade in multiple-speaker situations. The output of the audio output processing module 240 is an audio driving signal 245, which drives the speaker(s) and which may be fed back to the audio input processing module 210 for the purpose of "echo" cancellation. In some examples, the audio output processing is adaptive to the environment of the vehicle, for example, increasing overall level compression in high-noise environments.
The song signal 205 is first processed by a song analysis module 310, and the analysis information is passed to a voice modification module 320, which processes the voice signal 215 to produce the enhanced voice signal 225. The song analysis module can provide an on/off indicator of whether a song is being played, or whether a song being played is suitable for Karaoke because it lacks vocals; it may make this determination, for instance, based on a spectral signal energy determination, which may distinguish non-song (e.g., voice-only) audio from song audio. The song analysis module may include a broad categorization 313 of the song as a whole, for example, by genre, classifying the song as "rock and roll," "folk," "a cappella," etc. The song categorization may also track varying parameters, such as tempo (e.g., beats per minute). Many other parameters may be determined from the acoustic signal and/or provided from metadata for the song, such as the level (power, amplitude), musical key, time signature, chorus versus verse indication, and data regarding the excluded lyrics, such as a representation of the words that are excluded from the song audio signal. A minimal sketch of this kind of signal-derived characterization appears below.
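The following sketch uses the open-source librosa library to estimate tempo and loudness dynamics from a song file; the function name and the particular feature set are illustrative assumptions, not the patent's algorithm.

```python
import librosa
import numpy as np

def analyze_song(path):
    """Estimate a few of the song properties discussed above."""
    y, sr = librosa.load(path, sr=None, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)   # tempo in beats per minute
    rms = librosa.feature.rms(y=y)[0]                # frame-wise signal level
    mean_rms = float(np.mean(rms)) + 1e-12
    return {
        "tempo_bpm": float(np.atleast_1d(tempo)[0]),
        "mean_level_db": 20.0 * np.log10(mean_rms),
        # crude loudness-dynamics measure: peak level relative to mean level
        "dynamics_db": 20.0 * np.log10((float(np.max(rms)) + 1e-12) / mean_rms),
    }
```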
The characterization of a song may also be done prior to playing the song, with the characterization data stored, for example, as metadata with the song audio.
The voice modification may implement one or more controllable modifications of the voice signal. Such modifications may include the addition of reverberation 322 or echo 323, and may be parameterized, for example, by characterizations of the echo (e.g., the time delay of the echo) or of the reverberation (e.g., the time impulse response). Other modifications may include an exciter 324 (i.e., adding harmonic distortions) and pitch modification, which may be parameterized by the target key of the song, for example, as automatically detected in the song audio or as provided in metadata for the song.
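A small sketch of key-targeted pitch modification, assuming librosa's pitch_shift; the pitch-class convention and the "smallest interval" policy are illustrative assumptions:

```python
import librosa

def shift_voice_to_key(voice, sr, detected_pitch_class, target_pitch_class):
    """Shift the voice by the smallest interval mapping the detected key's
    pitch class (0-11, C=0) onto the target key's pitch class."""
    steps = (target_pitch_class - detected_pitch_class) % 12
    if steps > 6:
        steps -= 12   # e.g., prefer shifting down 5 semitones over up 7
    return librosa.effects.pitch_shift(voice, sr=sr, n_steps=steps)
```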
Song modification may include time modification to match the user's singing rate, or selection of portions of a song based on recognition or matching of the sung lyrics to the original song lyrics. Such tracking may be beneficial in the vehicle environment, in which a driver may not have the luxury to safely read song lyrics while driving!
In some examples, the system is able to adapt to whether or not the user is singing, thereby providing a Karaoke experience if they are singing, and providing the original audio with vocals if they are not, for example, because they have forgotten the words. The audio input processing 210 functions as in the example of FIG. 2. The vocal signal 215 (which may or may not in fact include the user's vocals) is passed to a voice detector 450, which determines a voice level; this may be a binary indicator of whether or not the user is singing, or in some examples may represent the acoustic volume (e.g., amplitude, energy) of the singing. In some examples, voice is only detected during periods of time when original vocals are present, for example, inhibiting user vocals from being introduced at other times.
Based on the voice level, a dynamic remixer 440 passes, attenuates, or fully blocks the original vocals 435 while passing the vocal-removed song audio 432. The output signal 445 of the remixer 440 corresponds to the song audio 205 of FIG. 3, in some examples including an attenuated version of the original vocals 435 based on a level of the user's singing; during periods that the user is not singing, signal 445 corresponds to the original song including its original vocals. Attenuation of the original vocals is based on the voice level using a history (e.g., a time filtering) of the voice level (whether binary or representing an acoustic volume). In this way, the original vocals provide an audio cue to the words, and if the user merely starts singing at a lower level because they are unsure of the words, the original vocals are included at an attenuated level.
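A minimal sketch of this remixing logic, processing audio frame by frame with an exponentially smoothed voice level standing in for the "history (time filtering)"; the class name, threshold, and smoothing constant are illustrative assumptions:

```python
import numpy as np

class DynamicRemixer:
    def __init__(self, threshold=0.01, smoothing=0.95):
        self.threshold = threshold   # RMS level above which the user counts as singing
        self.smoothing = smoothing   # time filtering (history) of the voice level
        self.voice_activity = 0.0    # smoothed 0..1 activity estimate

    def process(self, user_vocal, original_vocals, vocal_removed_song):
        """Mix one frame: attenuate original vocals as the user sings more."""
        rms = np.sqrt(np.mean(user_vocal ** 2))
        active = 1.0 if rms > self.threshold else 0.0
        self.voice_activity = (self.smoothing * self.voice_activity
                               + (1.0 - self.smoothing) * active)
        # When the user is silent the original vocals pass at full level;
        # as the (smoothed) activity rises, they are attenuated toward zero.
        original_gain = 1.0 - self.voice_activity
        return vocal_removed_song + original_gain * original_vocals + user_vocal
```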
The voice effect processing 420, which is optional in the example illustrated in FIG. 4, can operate in the manner of the voice effect processing 220 of FIGS. 2 and 3, optionally inhibiting processing if the voice detector 450 indicates that the user is not singing. In the example illustrated in FIG. 4, the song source 401 provides the song in audio form with the original vocals, and an automatic demixer 430 separates that audio into the vocals 435 and the remainder of the audio as signal 432, either before or during the playing of the song, such that the combination (e.g., sum) of signals 432 and 435 yields the original song audio. Various demixers may be used, for example, as described in Mitsufuji et al., "Music Demixing Challenge 2021," arXiv:2108.13559v3 (2022). Note that various forms of demixing may be used; for example, with multi-part vocals (e.g., a cappella) a particular voice part may be removed while retaining the other voice parts. In some examples, the demixing is performed prior to the playing of the song, for example, to reduce the computation required during the playing of the song.
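As one concrete possibility, the open-source Demucs separator can perform this two-stem split ahead of playback. The sketch below shells out to its command-line interface; the flags follow Demucs's documented usage but should be verified against the installed version, and Demucs itself is merely one example of the demixers contemplated above, not the patent's prescribed tool.

```python
import subprocess

def demix_song(song_path, out_dir="separated"):
    # --two-stems=vocals produces vocals.wav (cf. signal 435) and
    # no_vocals.wav (cf. signal 432), whose sum approximates the original.
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", out_dir, song_path],
        check=True,
    )
```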
The demixed song may be provided remotely to the vehicle and/or stored in a song library in the vehicle. In some examples, the presence of vocals (e.g., signal energy) in the original vocals 435 is used to determine periods of time when the user is expected to sing, thereby allowing the voice detector 450 to operate only during those periods, and serving as the basis for a timeout after absence of the user's voice, after which the original vocals are reintroduced.
More generally, a fourth signal may be combined with the first and second signals to produce the third signal, and the particular order or approach of forming the third signal should not be understood to be limited to a particular sequence of operations, as long as the third signal includes at least some of the other signals that are combined to form it.
In some examples, users in multiple vehicles that are coupled by a mobile communication system participate in providing vocals for a single song. In some examples, a user in each car has a cellular smartphone that is coupled to the entertainment system in the vehicle, and inter-vehicle communication is via the smartphones, while in other examples, inter-vehicle mobile communication is integrated into the vehicles. In either case, the entertainment systems in the multiple cars are coordinated to provide a multi-car experience. The same song is stored at each vehicle and is played in both vehicles in a time-synchronized manner, and vocals are picked up in each vehicle. The local vocals are played out as described above, and are also transmitted to the remote vehicle, where they are combined with the local vocals there. Similarly, the remote vocals are received at the local system and added to the signal that is output in the local vehicle. While exact synchronization is desirable, to the extent that perfect synchronization cannot be achieved, the mismatch may be treated as a reverberation effect, such that the remote vocals are only heard as delayed reverberation without a direct (i.e., delay-free) component. The voice processing module may be modified to use the remote vocals in addition to such reverberation effects. In some examples, the remote vocals may be provided as background singers, for example, with their contribution being attenuated or spatially placed in the acoustic environment differently than the local vocals. In some examples, one vehicle may contribute a subset of vocal parts, with those parts being subsequently presented in another vehicle in which the users there present additional vocal parts. This approach may be accomplished by different vehicles cancelling different ones of the vocal parts for replacement by the users in those vehicles.
The mapping from the output of song analysis to parameters of vocal modification may be based on various techniques. One technique makes use of predetermined data that has been tuned to different situations; for example, the setting of reverberation time and exciter gain may be parameterized by the genre of the song, along the lines of the hypothetical mapping sketched below.
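The genres and values in this sketch are assumptions, not the source's tuning:

```python
# Predetermined, genre-keyed effect presets (illustrative values only).
EFFECT_PRESETS = {
    "ballad":     {"reverb_time_s": 2.5, "exciter_gain_db": 0.0},
    "rock":       {"reverb_time_s": 0.8, "exciter_gain_db": 4.0},
    "a_cappella": {"reverb_time_s": 1.5, "exciter_gain_db": 2.0},
}

def params_for_genre(genre):
    """Look up tuned parameters, falling back to a default preset."""
    return EFFECT_PRESETS.get(genre, EFFECT_PRESETS["rock"])
```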
In other examples, the mapping from a characterization of a song to parameters of the voice modification is a learned mapping, for instance, implemented as a decision tree, neural network, and/or regression model. A learned mapping may be based, for example, on training data that includes songs and manual user settings for those songs according to preference, thereby having the automated system essentially mimic a human's choice of settings. In some examples, such a learned mapping may be adapted to the preferences of particular users, for example, based on monitoring the users' modifications of automatically set parameters.
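A minimal sketch of such a learned mapping, assuming scikit-learn and a decision-tree regressor; the feature layout and the toy training data are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Per-song features: [tempo_bpm, dynamics_db, genre_id], paired with the
# user's manually chosen settings: [reverb_time_s, exciter_gain_db].
X = np.array([[72, 9.0, 0], [128, 3.5, 1], [95, 6.0, 2]], dtype=float)
y = np.array([[2.5, 0.0], [0.8, 4.0], [1.5, 2.0]])

model = DecisionTreeRegressor(max_depth=3).fit(X, y)
reverb_time_s, exciter_gain_db = model.predict([[110, 4.2, 1]])[0]
```

Per-user adaptation could then amount to periodically re-fitting the model on that user's own overrides of the automatic settings.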
The vehicle environment is one example in which the system may be deployed; it is equally possible to use the dynamic adjustment of vocal effects in more conventional environments such as night clubs. Also, multi-location synchronized (or time-shifted) implementations may include all or some of the locations being non-vehicle locations.
Implementations of the techniques described above may make use of software, hardware, or a combination of hardware and software. Implementations using software may make use of processor instructions stored on non-transitory machine-readable media, with execution of the instructions by a processor or a processing system causing the methods to be performed. The instructions may include high-level language, intermediate representations (e.g., "bytecode"), or machine-level instructions, and the processor executing the instructions may be a physical processor, or may involve a virtual processor that accepts the instructions and causes a physical processor hosting the virtual processor to perform steps of the methods. A processing system may include components that execute in one or more vehicles (or other user equipment) and may also include remote server (e.g., "cloud") based processors. A processing system may include special-purpose processors such as digital signal processors (DSPs), for instance, for execution of audio processing functions, and high-performance numerical processors, such as graphics processing units (GPUs) or tensor processors, which may be used for artificial-neural-network-implemented functions such as vocal removal. Hardware implementations may include special-purpose circuitry, for example, to implement audio processing such as echo cancellation or vocal removal.
Abstract
A computer-implemented Karaoke system, which may be deployed in a vehicle for use by a driver and/or one or more passengers of the vehicle, adjusts relevant settings depending on the properties of the song, for instance as automatically determined by analysis of the audio signal of a song. In some examples, the system may dynamically remix original vocals or user-provided vocals depending on whether the user is singing.
Description
DYNAMIC EFFECTS KARAOKE
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application No. 63/425,428, filed November 15, 2022, the content of which is incorporated herein.
BACKGROUND
[0002] This invention relates to dynamically processing vocal accompaniment of recorded audio content, and in particular to dynamic effects in a Karaoke system.
[0003] Conventional “Karaoke” is a type of interactive entertainment usually offered in clubs and bars, where people sing along to recorded music using a microphone. The music is typically an instrumental version of a well-known popular song. Lyrics are usually displayed on a video screen, along with a moving symbol, changing color, or music video images, to guide the singer. Hardware systems, and some software systems (e.g., smartphone apps), include features such as digital signal processing of a user’s voice, for example, to add reverberation and to tune the voice to a specified musical key.
[0004] Karaoke systems usually come with an electro-acoustic system, with microphone, amplifier, and loudspeaker, to reinforce the singer’s voice. The Karaoke set plays back a special music track from which the audio track of the lead singer is missing. The music itself incorporates many audio effects that were applied during music production. The electro-acoustic system should therefore also apply audio effects to the voice of the Karaoke singer, such that the music and the Karaoke contribution fit well together in terms of style.
[0005] Karaoke sets therefore usually offer various audio effects that can be selected to enrich the singer’s voice: for example, equalizers to emphasize relevant frequencies, compressors to reduce loudness variations, reverberation effects with adjustable reverberation time, delays that add decaying echoes of the voice to the output signal, chorus effects to create the impression of multiple singers singing simultaneously, exciters that add brightness to the voice, pitch shifting or vocal harmonizer effects, and many more.
[0006] In-vehicle Karaoke, sometimes referred to as “Carpool Karaoke” (referencing a television show starring James Corden), is a type of Karaoke that may be performed by passengers or the driver of a vehicle. Commercial products are available that support features such as capturing vocals from multiple passengers in a vehicle and reducing feedback sound from the vehicle’s speakers.
SUMMARY
[0007] Conventional Karaoke systems require adjusting individual effects for different types of songs. For example, it may be desirable to have a longer reverberation time for a slow song as compared to the reverberation time that may be best for a song with faster rhythms. Similarly, it may be preferable to match the delay time between two consecutive echoes to the tempo of a song. Depending on the instrumentation of the song, it might be necessary to adjust the equalizer to embed the voice into the mix. Ballads or emotional songs might require loudness variations, whereas in energetic songs the dynamics of the voice should be compressed. Additionally, these potential adjustments may vary within a song according to the current part of the song; different effects might be chosen for the refrain compared to the verse. Karaoke systems available today require that these effects be adjusted manually. It would be desirable to have a more automated system that adjusts the relevant parameters automatically, without necessarily requiring any manual adjustment.
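As a small worked example of matching echo delay to tempo, as suggested above (the one-echo-per-beat choice is an assumption):

```python
def echo_delay_seconds(tempo_bpm, echoes_per_beat=1):
    """Delay between consecutive echoes, locked to the song's beat."""
    beat_period = 60.0 / tempo_bpm
    return beat_period / echoes_per_beat

# e.g., a 120 BPM song gives 0.5 s between consecutive echoes
assert abs(echo_delay_seconds(120.0) - 0.5) < 1e-9
```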
[0008] A vehicle environment poses a number of challenges for a Karaoke system, including the relatively “dead” acoustic environment and the presence of road and other ambient noise, which can be significant in volume and can vary, for example, by speed and road type, and can include outside noises such as sirens and construction jackhammers. Very generally, one or more systems and methods described in this document dynamically change features of the processing of a user’s (i.e., a singer’s) input according to characteristics of the audio being sung and/or the acoustic environment in which the user is singing. Advantages of such processing can include an improved user experience (e.g., it is more fun or engaging to use the system) and/or higher-quality audio output (e.g., the result of combining the presented audio and the captured audio has more desirable and/or pleasant characteristics).
[0009] In this document, the word “Karaoke” should be interpreted broadly to include any situation in which a system is configured to present an acoustic signal to one or more users, and to capture audio produced by the one or more users during the presentation of the acoustic signal.
For the sake of discussion, the acoustic signal that is presented may be referred to below as a “song” without any connotation that the audio signal includes sung or spoken words, or that it includes music intended to be sung along to; and the captured audio (or their processed versions) may be referred to as the user’s “vocals” without any connotation that the captured audio necessarily includes lyrics or other spoken or sung words. Finally, there is no requirement that text of a song or the like is presented to a user of a Karaoke system, or that the captured audio is necessarily presented along with the song to a user.
[0010] In an aspect, a computer-implemented Karaoke system adjusts relevant settings depending on the properties of the song, for instance as automatically determined by analysis of the audio signal of a song.
[0011] In some embodiments, the Karaoke system is deployed for use in a vehicle, for example, by a driver and/or one or more passengers of the vehicle. Multiple microphones may be used, for instance, as distributed speaker-dedicated microphones or closely spaced in an array configuration with beamforming to focus on the specific directions from which passengers are singing. Using multiple microphones, the system may also be able to detect how many singers are participating in the Karaoke and in which seats in the car they are sitting. The system might then assign different audio effects, for instance, automatic gain control (AGC), to the individual contributors to ensure consistent levels for the individual singing contributions. For example, a singer in the backseat may be assigned effects typical for background singers. Some pitch shifting, such as transposing by one octave, may also be applied.
[0012] In some embodiments, a selected song is analyzed with respect to basic properties such as speed, loudness dynamics, music style, genre, song structure, etc. Depending on these properties, a set of effects is chosen and configured. In some embodiments, some of this information may already be available from a database, e.g., the pitch frequency, the tempo, and polyphony may be available from a MIDI file, and information on the genre may be available from the database, so that no automatic extraction is needed. In some embodiments, a predefined set of audio effects for a song can be prepared by manual tuning, for example, with a user manually tuning effects for a favorite song, thereby ensuring an optimal (i.e., most desirable) set of audio effects.
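Where such properties come from a MIDI file rather than from signal analysis, they can simply be read out; a sketch assuming the mido library:

```python
import mido

def tempo_from_midi(path):
    """Return the first tempo marking (in BPM) found in a MIDI file, or None."""
    for msg in mido.MidiFile(path):
        if msg.type == "set_tempo":
            return mido.tempo2bpm(msg.tempo)   # tempo is microseconds per beat
    return None
```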
[0013] In some embodiments, audio effects may change within one song. For instance, different effect settings might be used for the chorus and for the verse, with the chorus versus verse distinction determined automatically, for example, based on repetition of the chorus. As another instance, different effects may be applied at the end of a song as compared to during the song, such as introducing a delay effect at the end of a break in vocals without having the delay effect active continuously, which may be disturbing to the singer.
[0014] In some embodiments, the audio effects may be adjusted depending on the background noise. For instance, in high-noise situations a higher playback gain and less reverb may be applied. The background noise can be estimated using the same microphone that is used for Karaoke. Such adjustment may be particularly useful for in-vehicle applications, where background noise may be substantial and may vary over time. In some embodiments, the microphone(s) and loudspeaker(s) are not necessarily dedicated to Karaoke, for example, being integrated into an audio entertainment system, a hands-free telephony system, and/or a voice assistant system.
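A minimal sketch of this noise-dependent adjustment; the dB breakpoints and ranges are illustrative assumptions:

```python
import numpy as np

def adapt_to_noise(noise_db):
    """Map estimated background noise (dB) to a playback gain boost and a
    reverb scaling factor: louder noise -> more gain, less reverb."""
    gain_boost_db = float(np.interp(noise_db, [50.0, 80.0], [0.0, 9.0]))
    reverb_scale = float(np.interp(noise_db, [50.0, 80.0], [1.0, 0.3]))
    return gain_boost_db, reverb_scale
```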
[0015] In some embodiments, the Karaoke system is configured to interact with other Karaoke systems at other locations, thereby forming a distributed Karaoke system that enables users to participate from multiple locations. For in-vehicle Karaoke systems, the vehicles are connected via a mobile communication system, and the drivers and/or passengers in multiple vehicles can provide vocals for one song, for example, with the systems synchronizing or otherwise coordinating playback of the song and the vocals. The singers’ voices are not only played back in the local vehicle but also transmitted to the other vehicle, to be added to the Karaoke sound track together with the voice at the far end. The audio playback in both vehicles is synchronized as well as possible, and any remaining synchronization mismatch (which may not be avoidable) is accounted for in the reverberation effect. The voice from the far-end car can be processed with different audio effects to place it as a chorus on the surround loudspeakers. For example, the music in two vehicles A and B starts at the same time and is synchronized. The singers’ voices from car A are then fed to the effects section of the far-end car B where, for instance, surround effects, reverb, etc. are generated from them. The same is done for the singers’ voices in car B transmitted to car A.
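One way to absorb synchronization mismatch into reverberation, as described above, is to mix the received remote vocals through a reverb with no direct-path component, so that small timing errors shift an already-diffuse tail rather than creating a distinct echo. A sketch, with assumed parameter values:

```python
import numpy as np

def reverb_tail_only(remote_vocals, sr, rt60=1.2):
    """Convolve with a synthetic exponentially decaying impulse response
    whose direct-path taps are zeroed out."""
    n = int(rt60 * sr)
    rng = np.random.default_rng(0)
    ir = rng.standard_normal(n) * np.exp(-6.9 * np.arange(n) / n)  # ~60 dB decay
    ir[: sr // 100] = 0.0   # zero the first 10 ms: no direct component
    ir /= np.max(np.abs(ir)) + 1e-12
    return np.convolve(remote_vocals, ir)[: len(remote_vocals)]
```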
[0016] In one aspect, in general a method for dynamic audio modification of user input for presentation with playback of a source song includes processing a microphone signal to produce an audio voice signal representing a user input.
[0017] Parameter values for one or more audio modification approaches are determined based on characteristics of the source song, and the audio voice signal is processed with the audio modification approaches configured according to the determined parameter values to produce an enhanced vocal signal. An advantage of such modification is that the user does not have to readjust the parameters manually when changing songs, which may be particularly advantageous in situations in which the user is occupied with another task, for instance, driving a vehicle.
[0018] The enhanced vocal signal and the source song are combined to produce an audio driving signal, and the audio driving signal is provided for acoustic presentation to the user.
[0019] An acoustic signal is acquired at a microphone to produce the microphone signal. The acoustic signal includes at least the user’s voice and the acoustic presentation of the audio driving signal. Processing the microphone signal may then include removing the acoustic presentation of the audio driving signal based on an adaptation that makes use of at least one of a reference of the audio driving signal and the source song. The acoustic signal may include ambient noise, and the processing of the microphone signal then comprises noise reduction.
[0020] The audio modification approaches include one or more of reverberation, echo, excitation, and pitch modification processing.
[0021] Characteristics of the source song include one or more of a genre, tempo, musical key, and time signature.
[0022] Determining the parameter values includes determining time-varying parameter values that vary during a song. Such time variation may be advantageous because different parts of a song, for instance, a chorus and a verse, may warrant different processing.
[0023] An original vocals signal and a vocal-removed song signal may be determined to correspond to an original source song, and combining the enhanced vocal signal and the source song includes combining the enhanced vocal signal and the vocal-removed song signal.
[0024] The audio voice signal is processed to determine a voice level of a user input. The voice level may represent a presence or absence of voice, or may represent a volume or energy of voice.
[0025] The voice signal is determined to have a presence of the user input during a first period and an absence of the user input during a second period.
[0026] An audio driving signal is formed, including combining the audio voice signal and the vocal-removed song signal to produce an audio driving signal during the first period.
[0027] The audio driving signal is formed, including combining the original vocals signal and the vocal-removed song signal to produce the audio driving signal during the second period. Such presentation of the original vocals may be advantageous when the user forgets the words, and begins to sing at a lower level.
[0028] Forming the audio driving signal further includes, during the first period, combining the original vocals signal at an attenuated level that is based on the determined voice level (e.g., based on a history or a time filtering of the voice level). Such attenuated presentation of the original vocals may be advantageous when the user may not be sure of the words and begins to sing at a lower level.
[0029] Determining the original vocals signal and the vocal-removed song signal includes receiving said vocals signal and said vocal-removed signal prior to the playback of the source song.
[0030] Determining the original vocal signal and the vocal-removed song signal includes processing the original source song to demix vocal and vocal-removed components of the original source song.
[0031] Voice is detected in the microphone signal, and during a period when voice is not detected in the microphone signal, a signal corresponding to the original source song, including at least some of the original vocals signal, is provided for acoustic presentation to the user.
[0032] The microphone signal is acquired in a cabin of a first vehicle, and the audio driving signal is presented in the cabin of the first vehicle.
[0033] A remote vocal signal is received from a second vehicle, and the enhanced vocal signal, the remote vocal signal, and the source song are combined to produce the audio driving signal.
[0034] The enhanced vocal signal is provided for presentation in the second vehicle.
[0035] Presentation of the song in the first vehicle and the second vehicle are synchronized.
[0036] Determining the parameter values for one or more audio modification approaches based on characteristics of the source song is performed at the first vehicle during presentation of the song to the user.
[0037] Determining the parameter values for one or more audio modification approaches based on characteristics of the source song is performed prior to presentation of the song to the user.
[0038] In another aspect, in general, a non-transitory machine-readable medium has instructions stored on it. The instructions, when executed by a processor, cause the processor to perform all the steps of any one of the methods set forth above.
[0039] In another aspect, in general, an audio processing system includes a processor configured to perform all the steps of any one of the methods set forth above. The audio processing system may comprise an in-vehicle audio processing system. The audio processing system may be integrated into at least one of an audio entertainment system, a hands-free telephony system, or a voice assistant system.
[0040] Other features and advantages of the invention are apparent from the following description, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0041] FIG. 1 is a schematic diagram of a vehicle-based Karaoke system.
[0042] FIG. 2 is a schematic block diagram of an audio processing system.
[0043] FIG. 3 is a schematic block diagram of a voice processing module.
[0044] FIG. 4 is a schematic block diagram of an audio processing system with dynamic remixing.
DETAILED DESCRIPTION
[0045] Referring to FIG. 1, an audio processing system 100 provides a way for a user 110 to “sing along” with recorded audio presented in an acoustic environment, generally to provide an enjoyable experience for the user or for other listeners of the combination of the user’s vocal input and the recorded audio. This process is often referred to as “Karaoke,” and while the term is used to describe this system, no inference should be made regarding the characteristics of the system based on the usage of the word. Particular examples of a Karaoke system are described below in the context of the acoustic environment being a vehicle 120 (or multiple vehicles) and the users and/or listeners being drivers and/or passengers in a vehicle.
[0046] In the vehicle example illustrated in FIG. 1, the user 110 is a driver of a vehicle in which audio is played via one or more speakers 103 in the vehicle cabin, and audio is captured at one or more microphones 102 in the cabin. While a single microphone is illustrated, multiple microphones, for instance forming a directional microphone array, may be used. Furthermore, while the figure and discussion focus on a single user, examples of the system may have multiple users in a vehicle contributing vocal input together. In some examples there are further modes of monitoring the users. One such example is via a camera 104, which may be useful in detecting and analyzing the user’s vocal output (e.g., based on lip position, etc.), particularly in high-noise or high-music-volume situations. Some examples described below involve users in multiple vehicles, for instance in a first vehicle 120 and a second vehicle 120A, and the processing in the multiple vehicles may be coordinated, for example, to synchronize the input and/or output processing in the multiple vehicles. Finally, the songs that are played as part of this process may be stored locally in the vehicle (e.g., in the entertainment system), may be streamed from a server 160 in communication with the vehicle, or may be streamed (e.g., over a Bluetooth wireless link) from a user’s smartphone in the vehicle, for example, being stored on the smartphone or retrieved (e.g., over a cellular data link) from a remote server by an application executing on the smartphone. In some examples, metadata for the song may be available, for example, including genre, pre-computed acoustic or musical parameters (which may vary through the song), and lyric text.
[0047] In at least some examples, the audio processing described below is performed in an entertainment system 101, which generally includes a general-purpose computer processor and/or a digital signal processing (DSP) processor under software control (e.g., with software instructions stored in the entertainment system in a non-transitory machine-readable medium). Not illustrated are user inputs to the entertainment system (e.g., manual, or by voice command), which may be used to initiate the audio processing, select a desired song, adjust volume, and the like. Furthermore, the entertainment system 101 may receive input from vehicle systems 121, for example, providing vehicle state (e.g., speed of the vehicle, navigation state, etc.), which may affect the operation of the system. In some examples, the audio processing is performed on the same platform as an in-car communication (ICC) system, for example, utilizing the same microphones and speakers that are used to enhance communication between users (e.g., driver and passengers) in a vehicle.
[0048] Referring to FIG. 2, audio processing includes audio input processing 210, followed by voice effect processing 220, and then audio output processing 240. The audio input processing receives audio input from the vehicle microphone(s) 102 capturing an acoustic signal in the vehicle cabin. In general, the microphone 102 captures the user’s voice 257, noise 259 from the vehicle (e.g., road noise or interfering speech from non-singing individuals or voice assistants), and acoustic feedback 255 from the speaker(s) 103 to the microphone(s) 102. The audio input processing may perform one or both of acoustic “echo” cancellation and noise reduction. The echo cancellation removes as much as possible of the signals emitted from the speaker(s). This cancellation may be based on the signal 245, which represents the acoustic signal emitted by the speaker 103, or may be based on the audio signal 205 of the song being played, and may use an adaptive cancellation approach, for example, adapting a cancellation filter (e.g., using a least mean squares (LMS) approach). Various noise reduction approaches may be based, for example, on spectral subtraction or adaptive Wiener filtering. The audio input processing may further include acoustic beamforming or microphone selection to improve the capture of the user’s voice and/or exclude interfering noise and/or speaker feedback. The audio input processing may further include automatic gain control and/or equalization, for instance, to match the level of the song 205 or possibly to match a target level that is provided with the song representing the level of the excluded original vocals of the song.
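For illustration, a minimal sketch of such an adaptive cancellation filter, here a normalized-LMS (NLMS) variant in Python/NumPy, follows; the tap count, step size, and function name are illustrative assumptions, not the system’s actual implementation.

```python
# Minimal sketch of adaptive echo cancellation with a normalized LMS
# (NLMS) filter; all parameter values and names are illustrative
# assumptions, not the system's actual implementation.
import numpy as np

def nlms_echo_cancel(mic, ref, num_taps=256, mu=0.1, eps=1e-8):
    """Subtract an estimate of the speaker feedback from the mic signal.

    mic -- microphone samples (voice + echo + noise)
    ref -- reference samples (the audio driving signal fed to the speaker)
    """
    w = np.zeros(num_taps)              # FIR filter estimating the echo path
    out = np.zeros(len(mic))
    for n in range(num_taps - 1, len(mic)):
        x = ref[n - num_taps + 1:n + 1][::-1]        # recent reference samples
        echo_est = np.dot(w, x)                      # predicted echo at the mic
        e = mic[n] - echo_est                        # mic with echo removed
        w += (mu / (np.dot(x, x) + eps)) * e * x     # NLMS weight update
        out[n] = e
    return out
```

Normalizing the update by the reference power keeps the adaptation stable as the song level varies.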
[0049] The output of audio input processing 210 is an audio vocal signal 215, which ideally represents only the user’s voice, and in practice may include residual noise and song signals that have been significantly attenuated. The vocal signal 215 is then processed by a voice effect
processing module 220. As discussed further below, this module introduces audio effects to enhance the user’s vocal input. The module receives the song signal 205, thereby allowing the effects to be dynamically adjusted to match the song as a whole, and/or to vary during a song, for example, with different effects during a verse and during a chorus. In some examples, as described further below, the song signal 205 includes data as well as audio signals, for example, providing metadata that is used for adjusting the voice effects. In some examples, not illustrated in FIG. 2, acoustic conditions such as the noise level or other noise characteristics (e.g., spectral shape) are provided to the voice effect processing module 220, for example, to modify the parameters of certain voice effects, for example, reducing the amount of reverberation in high-noise situations. The output of the module is an enhanced vocal signal 225. In some examples, the voice effect processing is also controlled directly by the user, for example, by setting parameters using manual or voice command inputs.
[0050] The enhanced vocal signal 225 is then combined with the song signal 205, illustrated in this example as a signal summation 230, yielding a combined signal 235, which is passed to an audio output processing module 240. For example, this output processing module may amplify the combined signal to a level controlled by the user and adjust typical parameters such as balance and front-back fade in multiple-speaker situations. The output of the audio output processing module 240 is the audio driving signal 245, which drives the speaker(s), and which may be fed back to the audio input processing module 210 for the purpose of “echo” cancellation. In some examples, the audio output processing is adaptive to the environment of the vehicle, for example, increasing the overall level compression in high-noise environments.
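As an illustration of such environment-adaptive output processing, the following is a minimal sketch of a compression curve whose ratio grows with a measured noise level; the threshold, ratio range, and block interface are assumptions.

```python
# Minimal sketch of noise-adaptive level compression in the output stage;
# the threshold, ratio range, and normalization are illustrative assumptions.
import numpy as np

def compress_block(block, noise_level, base_ratio=2.0, thresh=0.25):
    """Compress samples above `thresh`; more noise -> higher ratio."""
    block = np.asarray(block, dtype=float)
    ratio = base_ratio + 4.0 * np.clip(noise_level, 0.0, 1.0)  # 2:1 up to 6:1
    mag = np.abs(block)
    gain = np.ones_like(mag)
    over = mag > thresh
    # above the threshold, the excess is reduced by the chosen ratio
    gain[over] = (thresh + (mag[over] - thresh) / ratio) / mag[over]
    return block * gain
```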
[0051] Referring to FIG. 3, in one example of a voice processing module 220, the song signal 205 is first processed by a song analysis module 310, and the analysis information is passed to a voice modification module 320, which processes the voice signal 215 to produce the enhanced voice signal 225. In one example, the song analysis module can provide an on/off indicator of whether a song is being played, or of whether a song being played is suitable for Karaoke because it is lacking vocals. The song analysis module 310 may make this determination, for instance, based on a spectral signal energy determination, which may distinguish non-song (e.g., voice only) audio from song audio. In another example, the song analysis module may include a broad categorization 313 of the song as a whole, for example, by genre, classifying the song as “rock and roll,” “folk,” “a cappella,” etc. The song categorization may also track varying parameters, such as tempo (e.g., beats per minute). Many other parameters may be determined from the acoustic signal and/or provided from metadata provided for the song, such as the level (power, amplitude), musical key, time signature, chorus vs. verse indication, and data regarding the excluded lyrics, such as a representation of the words that are excluded from the song audio signal. The characterization of a song may also be done prior to playing the song. The characterization data may be stored, for example, as metadata with the song signal audio.
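As one simple illustration of an on/off song indicator, the sketch below flags a song as playing when signal energy is sustained across recent frames (voice-only audio tends to pause more often than music); this energy-continuity heuristic and all thresholds are assumptions, not the system’s actual determination.

```python
# Minimal sketch of an on/off song-present indicator based on how
# continuously the signal carries energy; window length and thresholds
# are illustrative assumptions.
import numpy as np
from collections import deque

class SongDetector:
    def __init__(self, window_frames=100, energy_thresh=1e-4, active_frac=0.9):
        self.history = deque(maxlen=window_frames)   # recent frame activity
        self.energy_thresh = energy_thresh
        self.active_frac = active_frac

    def update(self, frame):
        """Return True while a song appears to be playing."""
        self.history.append(float(np.mean(frame ** 2) > self.energy_thresh))
        return sum(self.history) / len(self.history) >= self.active_frac
```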
[0052] The voice modification may implement one or more controllable modifications of the voice signal. Such modifications may include addition of reverberation 322 or echo 323. These modifications may be parameterized, for example, by characterizations of the echo (e.g., the time delay of the echo) or the reverberation characteristics (e.g., the time impulse response). Similarly, an exciter 324 (i.e., for adding harmonic distortion) may be parameterized by its gain, and pitch modification may be parameterized by the target key of the song, for example, as automatically detected in the song audio or as provided in metadata for the song.
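A minimal sketch of one such parameterized modification, an echo implemented as a feedback delay line, follows; the delay, feedback, and mix values are illustrative assumptions.

```python
# Minimal sketch of a parameterized echo effect (a feedback delay line);
# delay time, feedback, and wet/dry mix are illustrative assumptions.
import numpy as np

def add_echo(voice, sample_rate, delay_s=0.25, feedback=0.4, mix=0.5):
    d = int(delay_s * sample_rate)       # echo delay in samples
    out = np.copy(voice).astype(float)
    for n in range(d, len(voice)):
        out[n] += feedback * out[n - d]  # feed the delayed output back in
    return (1.0 - mix) * voice + mix * out
```

A reverberation effect could be sketched similarly with a bank of several such delay lines; keeping the feedback below 1.0 keeps the loop stable.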
[0053] Not illustrated in FIG. 3 are alternatives that include modification of the song audio based on the captured voice signal 215. For example, such song modification may include time modification to match the user’s singing rate, or selection of portions of a song based on recognition or matching of the sung lyrics to the original song lyrics. Such tracking may be particularly beneficial in the vehicle environment, in which a driver does not have the luxury of safely reading song lyrics while driving.
[0054] Referring to FIG. 4, in some examples, the system is able to adapt to whether or not the user is singing, thereby providing a Karaoke experience if they are singing, and providing the original audio with vocals if they are not, for example, because they have forgotten the words. In this example, the audio input processing 210 functions as in the example of FIG. 2. The vocal signal 215 (which may or may not in fact include the user’s vocals) is passed to a voice detector 450, which determines a voice level; this voice level may be a binary indicator of whether or not the user is singing, or, in some examples, may represent the acoustic volume (e.g., amplitude, energy) of the singing. In some examples, voice is only detected during periods of time when original vocals are present, for example, inhibiting user vocals from being introduced at other times. Based on the voice level output of the voice detector 450, a dynamic remixer 440 passes, attenuates, or fully blocks the original vocals 435 while passing the vocal-removed song audio 432. In some examples, during periods that the user is singing, the output signal 445 of the remixer 440 corresponds to the song audio 205 of FIG. 3, in some examples including an attenuated version of the original vocals 435 based on a level of the user’s singing, and during periods that the user is not singing, signal 445 corresponds to the original song including its original vocals. In some examples, attenuation of the original vocals is based on the voice level using a history (e.g., a time filtering) of the voice level (whether binary or representing an acoustic volume). In use, for example, when a user forgets the words of the song and stops singing, the original vocals provide an audio cue to the words, and if the user merely starts singing at a lower level because they are unsure of the words, the original vocals are included at an attenuated level. The voice effect processing 420, which is optional in the example illustrated in FIG. 4, can operate in the manner of the voice effect processing 220 of FIGS. 2 and 3, optionally inhibiting processing if the voice detector 450 indicates that the user is not singing. In the example illustrated in FIG. 4, the song source 401 provides the song in audio form with the original vocals, and an automatic demixer 430 separates that audio into the vocals 435 and the remainder of the audio as signal 432, either before or during the playing of the song, such that the combination (e.g., sum) of signals 432 and 435 yields the original song audio. A variety of demixers may be used, for example, as described in Mitsufuji et al., “Audio Demixing Challenge 2021,” arXiv:2108.13559v3 (2022). Note that various forms of demixing may be used; for example, with multi-part vocals (e.g., a cappella), a particular voice part may be removed while retaining the other voice parts. In some examples, the demixing is performed prior to the playing of the song, for example, to reduce the computation required during the playing of the song. For example, the demixed song may be provided remotely to the vehicle and/or stored in a song library in the vehicle. In some examples, the presence of vocals (e.g., signal energy) in the original vocals 435 is used to determine periods of time when the user is expected to sing, thereby allowing the voice detector 450 to operate to detect the user’s voice only during those periods, and providing the basis for a timeout after absence of the user’s voice, after which the original vocals should be reintroduced. Note that in this document, when referring to combining a first signal with a second signal to produce a third signal, such a combination is not limiting in that yet a fourth signal may be combined with the first and second signals to produce the third signal. Further, the particular order or approach of forming the third signal should not be understood to be limited to a particular sequence of operations, as long as the third signal includes at least some of the signals that are combined to form it.
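The following is a minimal sketch of the voice-detect-and-remix behavior described above, using a time-filtered frame RMS as the voice level; the frame interface, smoothing factor, and threshold are illustrative assumptions.

```python
# Minimal sketch of voice detection (450) driving a dynamic remixer (440):
# a smoothed (time-filtered) voice level attenuates the original vocals
# while the user sings; smoothing factor and threshold are assumptions.
import numpy as np

def remix_frame(voice_frame, vocals_frame, backing_frame, state,
                alpha=0.95, sing_thresh=0.02):
    """Return one output frame; `state` carries the smoothed voice level."""
    level = np.sqrt(np.mean(voice_frame ** 2))                 # frame RMS
    state["smoothed"] = alpha * state["smoothed"] + (1 - alpha) * level
    # silent user -> vocal gain 1.0 (original vocals as a cue);
    # confident singing -> vocal gain toward 0.0 (vocals blocked)
    vocal_gain = float(np.clip(1.0 - state["smoothed"] / sing_thresh, 0.0, 1.0))
    return backing_frame + vocal_gain * vocals_frame + voice_frame
```

A caller would initialize `state = {"smoothed": 0.0}` and invoke this once per audio frame.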
[0055] As introduced above, in some examples, users in multiple vehicles that are coupled by a mobile communication system (e.g., cellular data) participate in providing vocals for a single song. In some examples, a user in each car has a cellular smartphone that is coupled to the entertainment system in each vehicle, and inter-vehicle communication is via the smartphones, while in some examples, inter-vehicle mobile communication is integrated into the vehicles. There are a number of examples of coordination of the entertainment systems in multiple cars to provide a multicar experience. In one example, the same song is stored at each vehicle and is played in both vehicles in a time-synchronized manner, and vocals are picked up in each vehicle. The local vocals are played out as described above, and are also transmitted to the remote vehicle, where they are combined with the local vocals there. The remote vocals are also received at the local system and added to the signal that is output in the local vehicle. While exact synchronization is desirable, to the extent that a perfect synchronization cannot be achieved, the mismatch may be considered to be a reverberation effect, such that the remote vocals are only heard as delayed reverberation without a direct (i.e., delay-free) component. In some such examples, the voice processing module may be modified to make use of the remote vocals in addition to such reverberation effects. In some examples, the remote vocals may be provided as background singers, for example, with their contribution being attenuated or spatially placed in the acoustic environment differently than the local vocals.
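As a sketch of presenting received remote vocals as attenuated, spatially offset background singers, the following mixes them into a stereo output; the gain and pan values are illustrative assumptions.

```python
# Minimal sketch of mixing received remote vocals into the local output
# as attenuated, spatially offset background singers; `remote_gain` and
# the left/right pan weights are illustrative assumptions.
def mix_remote(local_mix, remote_vocals, remote_gain=0.5, pan=(0.8, 0.2)):
    left = local_mix + remote_gain * pan[0] * remote_vocals
    right = local_mix + remote_gain * pan[1] * remote_vocals
    return left, right
```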
[0056] In some multi-vehicle examples, rather than synchronizing the audio, one vehicle may contribute a subset of vocal parts, with those parts being subsequently presented in another vehicle in which the users there present additional vocal parts. This approach may be accomplished by different vehicles cancelling different ones of the vocal parts for replacement by the users in those vehicles.
[0057] Referring again to FIG. 3, the mapping from the output of song analysis to parameters of vocal modification may be based on various techniques. One technique makes use of predetermined data that has been tuned to different situations. For example, settings of reverberation time and exciter gain may be parameterized by the genre of the song, for example, as a genre-keyed lookup table (illustrated in the sketch below).
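A minimal sketch of such a genre-keyed parameter table follows; the genres shown and every numeric value are purely illustrative assumptions, standing in for tuned predetermined data.

```python
# Minimal sketch of a genre-keyed effect-parameter table; all values are
# purely illustrative assumptions, not tuned data.
EFFECT_PARAMS = {
    "rock and roll": {"reverb_time_s": 0.8, "exciter_gain": 0.6},
    "folk":          {"reverb_time_s": 1.2, "exciter_gain": 0.2},
    "a cappella":    {"reverb_time_s": 1.8, "exciter_gain": 0.1},
}
DEFAULT_PARAMS = {"reverb_time_s": 1.0, "exciter_gain": 0.3}

def params_for_genre(genre: str) -> dict:
    """Look up effect parameters for a classified genre."""
    return EFFECT_PARAMS.get(genre, DEFAULT_PARAMS)
```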
[0058] In some examples, the mapping from a characterization of a song to parameters of the voice modification is a learned mapping, for instance, implemented as a decision tree, neural
network, and/or regression model. Such a learned mapping may be based, for example, on training data that includes songs and manual user settings for those songs according to the users’ preferences, thereby having the automated system essentially mimic a human’s choice of settings. In some examples, such a learned mapping may be adapted to the preferences of particular users, for example, based on monitoring the user’s modification of automatically set parameters.
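A minimal sketch of such a learned mapping, fitting a decision-tree regressor from song features to effect parameters, is shown below; the feature encoding and the toy training rows are assumptions.

```python
# Minimal sketch of learning the song-characteristics -> effect-parameters
# mapping with a decision-tree regressor; the feature encoding and the toy
# training rows are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# features per song section: [genre_id, tempo_bpm, is_chorus]
X = np.array([[0, 120, 0], [0, 124, 1], [1, 90, 0], [2, 70, 0]])
# targets: [reverb_time_s, exciter_gain], e.g., from logged user settings
y = np.array([[0.8, 0.6], [1.0, 0.7], [1.2, 0.2], [1.8, 0.1]])

model = DecisionTreeRegressor(max_depth=3).fit(X, y)
reverb_time_s, exciter_gain = model.predict([[1, 95, 1]])[0]
```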
[0059] As introduced above, the vehicle environment is one example in which the system may be deployed. It is equally possible to use the dynamic adjustment of vocal effects in more conventional environments such as night clubs. Also, multi-location synchronized (or time-shifted) implementations may include all or some of the locations being non-vehicle locations.
[0060] Implementations of the techniques described above may make use of software, hardware, or a combination of hardware and software. Implementations using software may make use of processor instructions stored on non-transitory machine-readable media, with execution of the instructions by a processor or a processing system causing the methods to be performed. The instructions may include high-level language, intermediate representations (e.g., “bytecode”), or machine-level instructions, and the processor executing the instructions may be a physical processor, or may involve a virtual processor that accepts the instructions and causes a physical processor hosting the virtual processor to perform steps of the methods. A processing system may include components that execute in one or more vehicles (or other user equipment) and may also include remote server (e.g., “cloud”) based processors. A processing system may include special-purpose processors such as digital signal processors (DSPs), for instance, for execution of audio processing functions, and high-performance numerical processors, such as graphics processing units (GPUs) or tensor processors, which may be used for artificial-neural-network-implemented functions such as vocal removal. Hardware implementations may include special-purpose circuitry, for example, to implement audio processing such as echo cancellation or vocal removal.
[0061] A number of embodiments of the invention have been described. Nevertheless, it is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the following claims. Accordingly, other embodiments are also within the scope of the following claims. For example, various modifications may be made without departing from the scope of the invention. Additionally, some of the steps described above may be order independent, and thus can be performed in an order different from that described.
Claims
1. A method for dynamic audio modification of user input (257) for presentation with playback of a source song (205), the method comprising: processing a microphone signal to produce an audio voice signal (215) representing a user input; determining parameter values for one or more audio modification approaches (322-325) based on characteristics (312-313) of the source song; processing the audio voice signal (215) with the audio modification approaches (322-325) configured according to the determined parameter values to produce an enhanced vocal signal (225); combining the enhanced vocal signal (225) and the source song (205) to produce an audio driving signal (245); and providing the audio driving signal (245) for acoustic presentation to the user.
2. The method of claim 1, further comprising acquiring an acoustic signal at a microphone to produce the microphone signal, wherein said acoustic signal includes at least the user’s voice and the acoustic presentation of the audio driving signal.
3. The method of claim 2, wherein processing the microphone signal includes removing the acoustic presentation of the audio driving signal based on an adaptation that makes use of at least one of a reference of the audio driving signal and the source song.
4. The method of any of claims 1 through 3, wherein said acoustic signal includes ambient noise, and the processing of the microphone signal comprises noise reduction.
5. The method of any of claims 1 through 4, wherein the audio modification approaches include one or more of reverberation, echo, excitation, and pitch modification processing.
6. The method of any of claims 1 through 5, wherein the characteristics of the source song include one or more of a genre, tempo, musical key, and time signature.
7. The method of any of claims 1 through 6, wherein determining the parameter values includes determining time varying parameter values that vary during a song.
8. The method of any of claims 1 through 7, further comprising: determining an original vocals signal (435) and a vocal-removed song signal (432) corresponding to an original source song (401); and wherein combining the enhanced vocal signal (225) and the source song (205) comprises combining the enhanced vocal signal and the vocal-removed song signal.
9. A method for dynamic audio modification of user input (257) for presentation with playback of a source song (401), the method comprising: determining an original vocals signal (435) and a vocal-removed song signal (432) corresponding to an original source song (401); processing a microphone signal to produce an audio voice signal (215); processing the audio voice signal to determine a voice level of a user input, including determining a presence of the user input during a first period and an absence of the user input during a second period; forming an audio driving signal (245), including combining the audio voice signal (215) and the vocal-removed song signal (432) to produce an audio driving signal (245) during the first period, and combining the original vocals signal (435) and the vocal-removed song signal (432) to produce the audio driving signal (245) during the second period; and providing the audio driving signal (245) for acoustic presentation to the user.
10. The method of claim 9, wherein forming the audio driving signal further includes, during the first period, combining the original vocals signal (435) at an attenuated level that is based on the determined voice level.
11. The method of any of claims 8 through 10, wherein determining the original vocals signal (435) and the vocal-removed song signal (432) includes receiving said vocals signal and said vocal-removed signal prior to the playback of the source song.
12. The method of any of claims 8 through 10, wherein determining the original vocal signal and the vocal-removed song signal includes processing the original source song to demix vocal and vocal-removed components of the original source song.
13. The method of any of claims 8 through 12, further comprising detecting voice in the microphone signal, and during a period when voice is not detected in the microphone signal providing a signal corresponding to the original source song (401), including providing at least some of the original vocals signal (435), for acoustic presentation to the user.
14. The method of any of claims 1 through 13, wherein the microphone signal is acquired in a cabin of a first vehicle, and the audio driving signal is presented in the cabin of the first vehicle.
15. The method of claim 14, further comprising receiving a remote vocal signal from a second vehicle, and combining the enhanced vocal signal (225), the remote vocal signal, and the source song (205) to produce the audio driving signal (245).
16. The method of claim 15, further comprising providing the enhanced vocal signal for presentation in the second vehicle.
17. The method of any of claims 15 and 16, further comprising synchronizing presentation of the song in the first vehicle and the second vehicle.
18. The method of claim 14, wherein determining the parameter values for one or more audio modification approaches (322-325) based on characteristics (312-313) of the source song is performed at the first vehicle during presentation of the song to the user.
19. The method of claim 14, wherein determining the parameter values for one or more audio modification approaches (322-325) based on characteristics (312-313) of the source song is performed prior to presentation of the song to the user.
20. A non-transitory machine-readable medium comprising instructions stored thereon, the instructions, when executed by a processor, cause the processor to perform all the steps of any one of claims 1 through 19.
21. An audio processing system, comprising a processor configured to perform all the steps of any one of claims 1 through 19.
22. The audio processing system of claim 21, comprising an in-vehicle audio processing system.
23. The audio processing system of claim 21 or claim 22, integrated into at least one of an audio entertainment system, a hands-free telephony system, and a voice assistant system.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263425428P | 2022-11-15 | 2022-11-15 | |
US63/425,428 | 2022-11-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024107342A1 (en) | 2024-05-23 |
Family
ID=89378620
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/036670 WO2024107342A1 (en) | 2022-11-15 | 2023-11-02 | Dynamic effects karaoke |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024107342A1 (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5542000A (en) * | 1993-03-19 | 1996-07-30 | Yamaha Corporation | Karaoke apparatus having automatic effector control |
US5804752A (en) * | 1996-08-30 | 1998-09-08 | Yamaha Corporation | Karaoke apparatus with individual scoring of duet singers |
US20100107856A1 (en) * | 2008-11-03 | 2010-05-06 | Qnx Software Systems (Wavemakers), Inc. | Karaoke system |
EP3923269A1 (en) * | 2016-07-22 | 2021-12-15 | Dolby Laboratories Licensing Corp. | Server-based processing and distribution of multimedia content of a live musical performance |
US10332495B1 (en) * | 2018-07-10 | 2019-06-25 | Byton Limited | In vehicle karaoke |
CN113808557A (en) * | 2020-06-12 | 2021-12-17 | 比亚迪股份有限公司 | Vehicle-mounted audio processing system, method and device |
Non-Patent Citations (1)
Title |
---|
Mitsufuji et al.: "Audio Demixing Challenge 2021", arXiv:2108.13559v3, 2022
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23828834 Country of ref document: EP Kind code of ref document: A1 |