
US20210151082A1 - Systems and methods for mixing synthetic voice with original audio tracks - Google Patents

Systems and methods for mixing synthetic voice with original audio tracks

Info

Publication number
US20210151082A1
Authority
US
United States
Prior art keywords
audio
audio track
segment
track
dialog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US16/747,314
Other versions
US11430485B2 (en)
Inventor
Yadong Wang
Murthy Parthasarathi
Andrew Swan
Raja Ranjan Senapati
Shilpa Jois Rao
Anjali Chablani
Kyle Tacke
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netflix Inc
Original Assignee
Netflix Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netflix Inc
Priority to US16/747,314 (granted as US11430485B2)
Assigned to NETFLIX, INC. Assignment of assignors' interest (see document for details). Assignors: Murthy Parthasarathi, Anjali Chablani, Shilpa Jois Rao, Raja Ranjan Senapati, Andrew Swan, Kyle Tacke, Yadong Wang
Publication of US20210151082A1
Application granted
Publication of US11430485B2
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81 Monomedia components thereof
    • H04N 21/8106 Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/036 Insert-editing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/043
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/02 Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers
    • G11B 27/031 Electronic editing of digitised analogue information signals, e.g. audio or video signals
    • G11B 27/034 Electronic editing of digitised analogue information signals, e.g. audio or video signals on discs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/23418 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N 21/2343 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N 21/234336 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/485 End-user interface for client configuration
    • H04N 21/4852 End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/84 Generation or processing of descriptive data, e.g. content descriptors

Definitions

  • Digital content distribution systems may provide a variety of different types of content (e.g., tv shows, movies, etc.) to end users.
  • This content may include both audio and video data and may be sent to a user's content player as a multimedia stream.
  • Streaming content has become a very popular form of entertainment.
  • The ability to enjoy a film, television program, or other form of audiovisual content in the comfort of one's home offers many advantages to viewers.
  • One of these advantages is that visually impaired viewers can more easily view the displayed content, since they are able to adjust their position to a more comfortable distance from a screen or other display device than might be available at a movie theater.
  • Such visually impaired viewers may still miss certain details in a scene that is displayed, or not be able to recognize certain objects. This can reduce their understanding and enjoyment of the displayed content. For example, a visually impaired viewer may not recognize that a character is holding an object that can explain an element of a plot or that an object in a scene might provide a clue to what is happening. Similarly, a visually impaired viewer may not be able to recognize an expression on a character's face which could add to the viewer's understanding of the character or to the meaning of dialog spoken by the character.
  • The present disclosure describes systems and methods for mixing a synthesized voice description of a scene or other portion of a video recording with the existing audio track associated with the video recording.
  • The synthesized voice description can be used to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog.
  • A computer-implemented method includes accessing an audio track that is associated with a video recording, identifying a section of the accessed audio track having a specific audio characteristic, reducing a volume level of the audio track in the identified section, accessing an audio segment that includes a synthesized voice, and inserting the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • In some embodiments, the specific audio characteristic is that the identified section is one in which no dialog is spoken.
  • In other embodiments, the specific audio characteristic is that the identified section is one in which dialog is spoken.
  • In some embodiments, the volume of the identified section is reduced by an amount between 6 and 12 decibels (dB).
  • In some embodiments, the volume of the identified section is reduced by approximately 9 decibels (dB).
  • In some embodiments, the accessed audio segment includes an audio description of a portion of the video recording.
  • The audio description provides additional information regarding the portion of the video recording, where the additional information may include an explanation or description of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • The portion of the video recording for which the additional information is provided corresponds to the section of the audio track where the accessed audio segment is inserted.
  • In some embodiments, the accessed audio segment includes dialog spoken in a different language than the dialog in the audio track prior to implementing the method.
  • In some embodiments, the method includes processing the accessed audio segment to alter its length in time prior to inserting the audio segment into the identified section of the audio track.
  • The processing of the audio segment to alter the segment's length in time may include increasing the length of the audio segment.
  • Alternatively, the processing of the audio segment to alter the segment's length in time may include decreasing the length of the audio segment.
  • In some embodiments, the amount of reduction in the volume level of the audio track in the identified section depends upon the audio characteristic of the identified section.
  • In some embodiments, identifying a section of the accessed audio track having a specific audio characteristic is performed, at least in part, by a machine learning model.
  • A corresponding system (e.g., a server, computing device, etc.) is also described.
  • The system includes a set of modules stored in an electronic data storage or memory, with each module containing instructions for a computer-implemented process, function, or operation, and an electronic processor for executing the instructions.
  • The modules include one or more modules containing instructions for accessing an audio track that is associated with a video recording, identifying a section of the accessed audio track having a specific audio characteristic, reducing a volume level of the audio track in the identified section, accessing an audio segment that includes a synthesized voice, and inserting the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • Similarly, a computer-readable medium may include one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to access an audio track that is associated with a video recording, identify a section of the accessed audio track having a specific audio characteristic, reduce a volume level of the audio track in the identified section, access an audio segment that includes a synthesized voice, and insert the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • FIG. 1 is a diagram illustrating a system 100 containing a set of modules 102, with each module containing executable instructions that, when executed by an electronic processor, implement a method for altering an audio track associated with a video recording in accordance with an embodiment of the systems and methods described herein.
  • FIG. 2 is a flow chart or flow diagram of an exemplary computer-implemented method, operation, function or process 200 for altering an audio track associated with a video recording in accordance with an embodiment of the systems and methods described herein.
  • FIG. 3A is an illustration of an example time period from T0 to TN of an audio track that may be associated with a video recording. As shown, the time period includes multiple time segments or sections, which may contain dialog and/or a soundtrack.
  • FIG. 3B is an illustration of the audio track of FIG. 3A in which the soundtrack volume level has been lowered within a time segment or segments in which no dialog is spoken.
  • FIG. 3C is an illustration of how the soundtrack volume level within a time segment or segments in which no dialog is spoken may be lowered relatively abruptly or gradually decreased to the desired level.
  • FIG. 3D is an illustration of an example time period from T0 to TN of an audio track that may be associated with a video recording and is used to illustrate an embodiment in which the original audio track is modified by use of lektoring.
  • FIG. 3E is an illustration of the audio track of FIG. 3D in which the dialog volume level has been lowered within a time segment or segments in which dialog is spoken.
  • FIG. 3F is an illustration of the audio track of FIG. 3E in which a new audio segment of dialog (such as used in lektoring) has been inserted within the same time period as original dialog, but at a higher volume level than the lowered volume of the original dialog.
  • FIGS. 4A, 4B, and 4C are illustrations of example ways in which the original dialog in a time period may be lowered in volume for a case in which dialog used for lektoring will be added.
  • In FIG. 4A, the original dialog is lowered in volume abruptly.
  • In FIG. 4B, the original dialog is lowered in volume gradually.
  • In FIG. 4C, the original dialog is played at a lowered volume after a slight delay.
  • FIG. 5 is an illustration of how the original length in time of an audio track containing additional audio may be processed to adjust its length prior to being inserted into a section of an original audio track.
  • FIG. 6A is an illustration of how a machine learning model may be trained to identify segments or sections of an audio track having a specific audio characteristic.
  • FIG. 6B is an illustration of how the trained machine learning model of FIG. 6A may be used to classify a new segment of an audio track to determine if the new segment includes a specific audio characteristic (such as including or not including dialog).
  • FIG. 7 is a flow diagram illustrating a method, operation, function or process for converting dialog into synthesized speech and mixing the synthesized speech with an original audio track to add an audio description (AD) or a lektor, in accordance with an embodiment of the systems and methods described herein.
  • The present disclosure is generally directed to systems and methods for mixing a new section of an audio track into an existing audio track, where the existing track is associated with a video recording.
  • A multimedia stream used to provide content to viewers will typically include an audio track and a video recording.
  • The audio track and video recording are stored in an electronic data storage element or memory. The two may be combined into a single set of data or stored as separate but associated sets of data. If stored separately, the two sets of data are synchronized or aligned during the streaming process.
  • Another example of a situation in which an audio track might require alteration for a specific type of viewer arises in the process known as “lektoring”, which involves adding a voice-over translation of dialog to an audio track.
  • Lektoring typically involves a voice speaking a language that is different from the original language of the content item (hereinafter referred to as the target language).
  • The target-language voice is overlaid on top of the portion of an audio track containing the dialog spoken in the original language by the actors in a scene. It is a form of dubbing for an audience that speaks a different language than that originally spoken by the actors in a film or program.
  • The methods described herein may be used to introduce an audio segment containing additional audio description spoken in a different language into an audio track, either over dialog or adjacent to it in the audio track. If placed adjacent to dialog, the original dialog may itself be overlaid with dialog spoken in the different language, with the additional audio content inserted next to it.
  • The disclosed systems and methods may be used to mix or integrate a synthesized voice description of a portion of a video recording with the existing audio track associated with the recording.
  • The synthesized voice description can be used to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording. This is accomplished in some examples by inserting the synthesized voice description into a segment of the audio track in which no dialog is being spoken.
  • A segment in which no dialog is spoken may be automatically identified or detected by a substantial reduction in the frequency components associated with speech on the audio track, or by another indication of a time period or time slot in which no speech is present. It may also be identified by use of a Voice Activity Detection (VAD) technique. Such a segment may also be identified or detected by use of a machine learning model that has been trained to recognize when a section of audio track is lacking dialog (this may be accomplished by using a set of training data that includes multiple examples of sections of audio track having or lacking dialog, along with an associated label or indicator of whether dialog is present in each section).
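By way of illustration, the sketch below shows one crude way such no-dialog sections might be detected: frame the track, measure per-frame energy, and keep quiet stretches longer than a minimum gap. This is a minimal, energy-based stand-in for the VAD or machine-learning detectors described above, not the patent's actual method; the frame length, threshold, and minimum-gap values are illustrative assumptions, and the input is assumed to be a mono float array scaled to [-1, 1].

```python
import numpy as np

def find_no_dialog_sections(samples: np.ndarray, sample_rate: int,
                            frame_ms: int = 30,
                            energy_threshold: float = 0.01,
                            min_gap_s: float = 1.0):
    """Return (start_s, end_s) pairs where frame energy stays below a
    threshold, a crude proxy for 'no dialog is spoken'."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames.astype(np.float64) ** 2, axis=1)
    quiet = energy < energy_threshold

    sections, start = [], None
    for i, q in enumerate(quiet):
        if q and start is None:
            start = i                      # quiet run begins
        elif not q and start is not None:
            sections.append((start, i))    # quiet run ends
            start = None
    if start is not None:
        sections.append((start, n_frames))

    frame_s = frame_ms / 1000.0
    return [(a * frame_s, b * frame_s) for a, b in sections
            if (b - a) * frame_s >= min_gap_s]
```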
  • The automated process described herein enables a visually impaired viewer to be provided with additional information about a scene in a film or television program that might otherwise not be apparent to them, and it does so in a more efficient and comprehensible manner than conventional approaches.
  • Likewise, the automated process described herein provides a way to integrate a dubbed version of dialog into an existing audio track more efficiently than conventional approaches.
  • Embodiments of the systems and methods described herein enable a visually impaired viewer to obtain greater enjoyment of (or information from) a film or television program.
  • Embodiments described herein improve the conventional mixing process by automatically detecting or identifying specific time slots in which to place new audio content, followed by automatically adjusting the volume level in those time slots and inserting new audio content in a synthesized voice into the time slot or slots at a specific volume level.
  • Embodiments of the systems and methods described herein may be used to insert a translation, voice-over audio track, or additional audio description into an existing audio track.
  • For example, the described technique may be used in implementing a version of “lektoring”, in which a voice speaking a different language than that originally spoken in a scene (e.g., a Polish-speaking voice) is inserted into the audio track, over the existing dialog.
  • In lektoring, the new audio overlaps the original dialog.
  • The lowering of the original dialog's volume may begin first, and after a short time (e.g., 1 second, although this time delay is configurable, as is the amount or degree of lowering of the volume level of the original dialog) the new audio containing the dialog spoken in the different language may start.
  • Alternatively, the audio segments containing dialog spoken in a different language may be inserted into a section of the audio track in which no dialog is spoken (e.g., adjacent to the audio track position of the original dialog).
  • The technique(s) described may also be used to insert an audio description of a scene or character into the audio track.
  • In this use, the new audio content is inserted into the audio track in a region between portions of the track containing spoken dialog.
  • The inserted content may include dialog spoken in a different language or audio descriptions of features or elements of a scene (spoken in the original or in a different language), as described previously.
  • FIG. 1 is a diagram illustrating a system 100 containing a set of modules 102, with each module containing executable instructions that, when executed by an electronic processor, implement a method for altering an audio track associated with a video recording, in accordance with an embodiment of the systems and methods described herein.
  • System 100 may represent a server or other form of computing or data processing device.
  • Modules 102 each contain a set of executable instructions; when a set of instructions is executed by a suitable electronic processor (such as that indicated in the figure by “Physical Processor(s) 130”), system (or server or device) 100 operates to perform a specific process, operation, function, or method.
  • Modules 102 are stored in a memory 120, which typically includes an Operating System module 104 that contains instructions used (among other functions) to access and control the execution of the instructions contained in other modules.
  • The modules 102 in memory 120 are accessed for purposes of transferring data and executing instructions by use of a “bus” or communications line 114, which also serves to permit processor(s) 130 to communicate with the modules for purposes of accessing and executing a set of instructions.
  • Bus or communications line 114 also permits processor(s) 130 to interact with other elements of system 100, such as input or output devices 122, communications elements 124 for exchanging data and information with devices external to system 100, and additional memory devices 126.
  • The term “audio track” as used herein may refer to a track containing spoken dialog, music, sound effects, a mono track, a stereo track, a MIDI (Musical Instrument Digital Interface) track, or environmental sounds.
  • The phrase “an audio track that is associated with a video recording” may refer to an audio track containing one or more of spoken dialog, music, or background sounds that is meant to be played as part of a combined audio and video presentation.
  • The combined presentation may be a film or television program, for example.
  • The audio track may be stored separately and combined with a video recording, or it may be stored as part of the combined audio and video presentation.
  • The term “synthesized” as used herein may refer to an artificially generated voice or sound meant to represent a voice.
  • Modules 102 may contain one or more sets of instructions for performing the method that is described with reference to FIG. 2. These modules may include those illustrated, but may also include a greater or fewer number than those illustrated.
  • Audio Track Access Module 106 may contain instructions that, when executed, perform a process to access an audio track for a film or program.
  • Audio Track Processing Module 108 may contain instructions that, when executed, perform a process to lower the volume in a specific time slot of an audio track. Module 108 may also contain instructions that, when executed, cause the volume to be lowered in a specific manner. Alternatively, the instructions that cause the volume to be lowered in a specific manner may be contained in a different module.
  • New Audio Section Processing Module 110 may contain instructions that, when executed, perform a process to access a previously generated new section of audio description that corresponds to each of the time slots.
  • The generated section or sections are generated in a synthesized voice and stored in a suitable electronic data storage element.
  • Module 110 adjusts the volume level of each piece of the synthesized audio description and inserts each piece of synthesized audio content into the audio track in (or at) its desired time slot.
  • Audio Track Return Module 112 may contain instructions that, when executed, perform a process to store the final audio track in a suitable data storage element. The final audio track may then be combined or recombined with the video recording after appropriate synchronization, thereby resulting in a final version of the film or program in which the new (additional) audio content has been added.
  • FIG. 2 is a flow chart or flow diagram of an exemplary computer-implemented method, operation, function or process 200 for altering an audio track associated with a video recording in accordance with an embodiment of the systems and methods described herein. In some embodiments, this is performed for the purpose of mixing an audio description of a scene from a film or television program into an existing audio track for the film or program.
  • The steps shown in FIG. 2 may be performed by any suitable computer-executable code and/or computing system, including the system(s) illustrated in FIG. 1.
  • Each of the steps shown in FIG. 2 may represent an algorithm whose structure includes and/or is represented by multiple sub-steps, examples of which are provided in greater detail below.
  • A system for implementing the method of FIG. 2 includes an electronic processor (e.g., a CPU, controller, microcontroller, etc.) which accesses a set of software instructions that are stored in an electronic data storage element. Each set of instructions that is used to perform a specific function may be contained in a specific module. The processor executes the stored instructions in a module to perform the specific function or operation.
  • One or more of the modules described herein operates to cause a computing device to access an audio track for a film or program (e.g., by execution of the instructions contained in module 106 of FIG. 1).
  • The track is stored in an electronic data storage element or memory.
  • The track may be stored with its associated video recording in a common memory, or in a separate memory.
  • The accessed audio track may include one or more of spoken dialog, music, sound effects, a mono track, a stereo track, a MIDI track, background sounds, or environmental sounds.
  • In some embodiments, the track is accessed using an automated process that identifies an audio track as part of a combined audio and video project.
  • The track may also be accessed by a process that is coupled to the digital audio workstation (DAW) of a sound engineer (a digital audio workstation is an electronic device or application software used for recording, editing, and producing audio files).
  • The accessed track is processed at step 204 (e.g., by execution of the instructions contained in module 108) to identify or detect one or more time periods (i.e., time sections or segments, sometimes referred to as time slots) in the audio track having a specific audio characteristic.
  • These time slots represent sections of the audio track in which the new audio content may be inserted during later stages of the mixing or processing.
  • The terms “section” or “segment” as used herein in reference to an audio track may refer to a time period or time slot that is part of the audio track.
  • The term or phrase “specific audio characteristic” as used herein in reference to an audio track may refer to a section of the audio track in which no dialog is spoken, or to a section of the audio track in which dialog is spoken. In some embodiments, the term or phrase “specific audio characteristic” may refer to a section of the audio track having a certain volume level, a certain frequency range, a certain maximum frequency, a certain frequency spectrum, or another feature.
  • In some embodiments, the specific sections or time slots identified are those in which no dialog is spoken. That is, the specific audio characteristic is the lack of spoken dialog in the particular section of the audio track. In some embodiments, this is determined by examining the spectral range of audio within a time slot, as spoken dialog will typically be associated with a specific range of frequencies. In some embodiments, other characteristics of spoken dialog may be used to identify or detect the specific time slots. In other embodiments, the specific audio characteristic is that the identified time slots are those in which dialog is spoken.
  • In some embodiments, a machine learning model is trained to automatically determine one or more time slots or sections of the audio track in which dialog is spoken and/or one or more time slots or sections of the audio track in which dialog is not spoken. As will be described further with reference to FIGS. 6A and 6B, this is accomplished by generating a set of training data for the model that includes multiple examples of sections of audio, some of which include spoken dialog and some of which do not. The examples are each associated with a label or other indicator of whether they contain dialog. The examples and labels are input to the model for purposes of training the model. When trained, the model will operate to respond to an input sample of an audio track by providing an output that indicates whether the input sample contains spoken dialog or does not (or contains or does not contain another audio characteristic that the model was trained to identify).
  • The method at step 206 reduces the volume level within each of the detected time slots.
  • This process typically lowers the volume level by an amount between 6 and 12 decibels (dB), although other amounts of reduction (either greater or lesser) may be used.
  • One result of the reduction in volume level is to reduce any distraction or confusion to the listener caused by the original audio (such as music or sounds) after insertion of the new audio content.
  • In some embodiments, the reduction in volume level is relatively abrupt; that is, the volume level is sharply (i.e., substantially immediately) decreased by the desired amount.
  • In other embodiments, the volume level is gradually decreased over a period of time (e.g., hundreds of milliseconds to 2 seconds) to its final level, as illustrated in the sketch below.
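A minimal sketch of this ducking step follows, assuming a mono track held as a NumPy float array; the 9 dB default and the half-second linear ramp are illustrative values within the ranges stated above, not prescribed by the patent (set ramp_s=0 for the abrupt variant).

```python
import numpy as np

def duck_section(track: np.ndarray, sample_rate: int, start_s: float,
                 end_s: float, reduction_db: float = 9.0,
                 ramp_s: float = 0.5) -> np.ndarray:
    """Lower the volume of track[start:end] by reduction_db, with linear
    fades of ramp_s at each boundary (assumes the section is longer than
    the two ramps combined)."""
    out = track.astype(np.float64).copy()
    gain = 10.0 ** (-reduction_db / 20.0)   # 9 dB down ~ a factor of 0.355
    i0, i1 = int(start_s * sample_rate), int(end_s * sample_rate)
    ramp = int(ramp_s * sample_rate)

    out[i0 + ramp:i1 - ramp] *= gain        # hold at the reduced level
    if ramp > 0:
        out[i0:i0 + ramp] *= np.linspace(1.0, gain, ramp)  # fade down
        out[i1 - ramp:i1] *= np.linspace(gain, 1.0, ramp)  # fade back up
    return out
```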
  • Next, the process accesses a previously generated new section of audio description that corresponds to each of the time slots, as indicated by step 208 (and as implemented by the execution of the set of instructions contained in module 110).
  • The generated section or sections include a synthesized voice and are stored in a suitable electronic data storage element.
  • The phrase “an audio segment that includes a synthesized voice” may refer to a section of an audio track or recording in which there is a synthesized voice speaking dialog or making sounds.
  • The synthesized voice may be in the same language as the original audio track or in a second, different language.
  • In some embodiments, the generated section(s) include an audio description of a portion of the video recording, such as a description of an object in a scene, a description of a character's appearance, expression, or emotional state, or a description of the scene environment.
  • In some embodiments, the generated section(s) are for the purpose of providing a visually impaired viewer with additional information or context regarding a scene.
  • In other embodiments, the generated section(s) contain dialog spoken in a different language than in the original audio track.
  • The phrase “an audio description of a portion of the video recording” may refer to spoken dialog, spoken commentary, or sounds that are meant to provide additional information to a viewer, where such additional information may include one or more of an explanation or description of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • Each of the generated new sections of the audio track is associated with a specific scene or portion of the video recording and, from that, with a specific portion of the audio track. In some embodiments, this is done by use of conventional sound mixing technologies or synchronization tools. This allows the method illustrated in FIG. 2 to associate each of the new synthesized pieces of audio description with its corresponding time slot (or, in some cases, a time period after the time slot in which dialog for a specific scene is spoken) in the original audio track.
  • The method then adjusts the volume level of each piece of the synthesized audio description, as also illustrated at step 208.
  • In some embodiments, the adjusted volume level of each piece of synthesized audio description is determined by considering the original volume level in a time slot, i.e., the volume level that was present before it was reduced in step 206.
  • In some embodiments, the adjusted volume level of a new section of audio description is determined by increasing the volume level of the new section so that it is greater than the reduced volume level of the audio track in the time slot.
  • In some embodiments, adjusting the volume of the new section of audio track is performed automatically by a process that increases the volume level of the new section to a pre-determined amount greater than the level to which the volume has been reduced in the corresponding time slot or segment.
  • The pre-determined amount can be expressed as an increased decibel level.
  • Alternatively, the pre-determined amount can be one found (e.g., via experimental testing) to be sufficient to make the new audio section comprehensible to a listener who hears the new audio content at its adjusted volume together with the original audio content at its lowered volume.
  • The amount of desired volume adjustment can also be determined by using a machine learning model that has been trained to identify when a section of audio track is comprehensible to a listener hearing both an original section of audio at a lowered volume and a new section of audio at a higher volume. Note that the amount of volume adjustment may depend upon the contents of the audio in the original section of the audio track and/or the contents of the new audio track.
  • Each piece of synthesized audio content is then inserted into the audio track in (or at) its desired time slot, as illustrated at step 210 (and as implemented by the execution of the set of instructions contained in module 110 ). Additional mixing may be performed (if desired) to gradually increase the volume level of the inserted content to its final desired level. This may be done to reduce any confusion or disruption to the listener.
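One way this adjust-and-insert step might look in code is sketched below, assuming mono NumPy arrays; the RMS-based gain and the 6 dB headroom figure are assumptions for the example, not values taken from the patent.

```python
import numpy as np

def insert_description(track: np.ndarray, segment: np.ndarray,
                       sample_rate: int, slot_start_s: float,
                       headroom_db: float = 6.0) -> np.ndarray:
    """Mix a synthesized-voice segment into the (already ducked) track at
    slot_start_s, boosted so it sits headroom_db above the ducked bed."""
    out = track.astype(np.float64).copy()
    i0 = int(slot_start_s * sample_rate)
    i1 = min(i0 + len(segment), len(out))
    seg = segment[:i1 - i0].astype(np.float64)

    bed_rms = np.sqrt(np.mean(out[i0:i1] ** 2)) + 1e-12
    seg_rms = np.sqrt(np.mean(seg ** 2)) + 1e-12
    target_rms = bed_rms * 10.0 ** (headroom_db / 20.0)
    out[i0:i1] += seg * (target_rms / seg_rms)  # overlay, not replace
    return np.clip(out, -1.0, 1.0)              # guard against clipping
```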
  • The final audio track for the film or program is stored in a suitable data storage element, as indicated by step 212 (by execution of the instructions contained in module 112).
  • The final audio track may then be combined or recombined with the video recording after appropriate synchronization, thereby resulting in a final version of the film or program in which the new (additional) audio content has been added.
  • In some situations, the length of time of a piece of new (additional) audio description may not match that of the time slot into which it is to be inserted.
  • In that case, the time length of the piece to be inserted may be altered by speeding it up (if it is longer than the time slot) or slowing it down (if it is shorter than the time slot). In either case, the goal is to produce a piece of additional audio description that will fit into a specific time slot without introducing distracting distortion or confusion to a listener.
  • The process of modifying the time length of a piece of audio track to be inserted into an existing track may be performed automatically by a process that takes as inputs the length of the time slot in the existing audio track and the length of the new section of audio content.
  • The process can compare the two and, based on the comparison, process the new section as needed to make it fit into the existing track.
  • The processing may include one or more of compression, re-sampling, increasing the playback speed without changing the pitch (e.g., by using a commercially available audio processing tool, such as SoX), decreasing the playback speed (again without changing the pitch), or editing the track to remove pauses.
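As an illustration of the pitch-preserving speed change, the sketch below shells out to SoX's tempo effect; the file paths are placeholders, and deriving the tempo ratio from the two durations is an assumption for the example rather than the patent's stated procedure.

```python
import subprocess

def fit_clip_to_slot(in_path: str, out_path: str,
                     clip_s: float, slot_s: float) -> None:
    """Speed a narration clip up or down (pitch preserved) so its
    duration roughly matches the slot, using SoX's 'tempo' effect."""
    ratio = clip_s / slot_s   # >1.0 speeds up, <1.0 slows down
    subprocess.run(["sox", in_path, out_path, "tempo", f"{ratio:.3f}"],
                   check=True)
```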
  • In some embodiments, the inserted additional audio description is mixed into only the center channel of a multi-channel audio track. In other embodiments, the inserted additional audio description is mixed into other channels of a multi-channel audio track, or into both the center channel and one or more other channels of the audio track.
  • In some embodiments, the volume level in the specific section is reduced abruptly, such as by using a cut-off filter. In other embodiments, the volume level in the specific section is reduced gradually, such as by moving from the original level to the reduced level over a specific time period.
  • In some embodiments, the volume level of the inserted new section of the audio track is adjusted to gradually increase over a period of time to the volume level of the specific section prior to that section's reduction in volume. In other embodiments, the volume level of the inserted new section is set at the volume level of the specific section prior to the section's reduction in volume, without a gradual increase.
  • In some embodiments, the amount or degree of reduction in the volume level of the audio track in the specific section is varied depending upon the audio characteristic(s) of the specific section (i.e., containing or lacking dialog, containing or lacking a soundtrack, etc.).
  • While the method of FIG. 2 may be used to add audio content to a film or program audio track for the purpose of enhancing the information available to a visually impaired viewer, another use of an embodiment of the systems and methods described herein is that of adding a voice-over translation of dialog to an audio track.
  • As mentioned, lektoring typically involves overlaying a voice speaking a different language onto the portion of the audio track in which the dialog is spoken in the original language by the actors in a scene. It is a form of dubbing for an audience that speaks a different language than that spoken by the actors in a film or program.
  • In one embodiment, each section of dialog in the lektoring audio track may be inserted into the appropriate time slot corresponding to an interval of no spoken dialog in the original audio track (for example, the time slot following the spoken dialog to which a section of the lektoring audio track relates).
  • In another embodiment, the time slots of the original audio track in which dialog is spoken are identified and lowered in volume, and the appropriate lektoring section is inserted into each time slot.
  • In this case, the time slots in which dialog is being spoken would be “ducked” by a certain amount, and then the appropriate lektoring section would be inserted into the audio track at a higher volume level than the volume level of the section after the ducking operation.
  • FIG. 3A is an illustration of an example time period from T0 to TN of an audio track associated with a video recording.
  • As shown, the time period includes multiple time segments or sections, which contain dialog and/or a soundtrack.
  • The time period or segment between T0 and T1 contains both dialog and a soundtrack, with the dialog during that time period varying in amplitude and the soundtrack being at a lower, constant amplitude.
  • The time periods or segments between T1 and T2 and between T3 and TN−2 do not contain dialog, but do contain the audio corresponding to the soundtrack.
  • FIG. 3B is an illustration of the audio track of FIG. 3A in which the soundtrack volume level has been lowered within a time segment or segments in which no dialog is spoken. As shown, the time periods or segments between T1 and T2 and between T3 and TN−2 do not contain dialog. In these sections, the original volume level or amplitude of the soundtrack has been lowered or reduced by a specific amount. The amount by which the original level of the soundtrack is lowered may depend on the type of additional or new audio being inserted and/or the volume level of the soundtrack (which may include sound effects, music, or background noises).
  • FIG. 3C is an illustration of how the soundtrack volume level within a time segment or segments in which no dialog is spoken may be lowered relatively abruptly or instead gradually decreased to the desired level. As shown, in the time period or segment between T1 and T2 the soundtrack volume level has been lowered relatively abruptly and uniformly to the desired level. In contrast, in the time period or segment between T3 and TN−2 the soundtrack volume level has been gradually decreased or increased at the ends of the time period or segment.
  • FIG. 3D is an illustration of an example time period from T0 to TN of an audio track that may be associated with a video recording, and is used to illustrate an embodiment in which the original audio track is modified by use of lektoring.
  • FIG. 3E is an illustration of the audio track of FIG. 3D in which the dialog volume level has been lowered within a time segment or segments in which dialog is spoken (such as between T0 and T1 and between T2 and T3). As shown in the figure, the amplitude of the original dialog has been lowered in those time periods.
  • FIG. 3F is an illustration of the audio track of FIG. 3E in which a new audio segment of dialog (such as used in lektoring) has been inserted within the same time period as the original dialog, but at a higher volume level than the lowered volume of the original dialog.
  • The reduction in volume entering a time segment is referred to as the “attack” and typically has a duration between 100 ms and 500 ms, with a default value of 250 ms in some instances. This corresponds to an abrupt, although not instantaneous, reduction in the original volume of the track.
  • The increase in volume exiting a time segment is referred to as the “release” and may have a value in the range between 1000 ms and 2500 ms, with a default value of 2000 ms in some cases. This represents a more gradual increase in volume.
  • More generally, the beginning or ending of a section of original dialog or other audio over which (or into which) the new audio is to be inserted may be subject to an abrupt, gradual, or delayed increase or decrease in volume level.
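The attack/release behavior can be pictured as a per-sample gain envelope applied to the ducked section. The sketch below is a simple linear version using the 250 ms and 2000 ms defaults quoted above; the linear shape is an assumption, as the patent does not specify a curve.

```python
import numpy as np

def ducking_envelope(n_samples: int, sample_rate: int, gain: float,
                     attack_s: float = 0.25,
                     release_s: float = 2.0) -> np.ndarray:
    """Gain envelope for a ducked section: ramp down over the attack,
    hold at `gain`, then ramp back up over the release (assumes the
    section is longer than attack_s + release_s)."""
    attack = int(attack_s * sample_rate)
    release = int(release_s * sample_rate)
    env = np.full(n_samples, gain)
    env[:attack] = np.linspace(1.0, gain, attack)                # attack
    env[n_samples - release:] = np.linspace(gain, 1.0, release)  # release
    return env

# usage: section *= ducking_envelope(len(section), sr, gain=0.355)
```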
  • FIGS. 4A, 4B, and 4C are illustrations of example ways in which the original dialog in a time period may be lowered in volume for a case in which dialog used for lektoring will be added.
  • In FIG. 4A, the original dialog is lowered in volume abruptly.
  • In FIG. 4B, the original dialog is lowered in volume gradually.
  • In FIG. 4C, the original dialog is played at a lowered volume after a slight delay.
  • In the delayed case, the original dialog may be heard at a lowered volume level starting a short time (e.g., 1 second) after the dialog being spoken in the different language is heard.
  • FIG. 5 is an illustration of how the original length in time of an audio track containing additional audio may be processed to adjust its length prior to being inserted into a section of an original audio track.
  • As shown, the original length of the audio track to be inserted into the time period or segment between T3 and TN−2 exceeds the length of that time period.
  • In this case, the audio track containing the new audio may require further processing to make its length fit into the available time period.
  • Such further processing or modification may include one or more of compression, re-sampling, increasing the playback speed (while maintaining the pitch), decreasing the playback speed (while maintaining the pitch), or editing the track to remove pauses.
  • The amount of modification of the new section of audio track may vary depending upon the amount by which the new section needs to be shortened or lengthened to fit the time slot into which it is to be inserted.
  • Performing a compression process on the new section of audio track involves reducing the amount of data used to store the track. This can reduce the amount of time required to play back the section and hence make it possible to fit the new section into the original length of time available in the desired time slot.
  • Re-sampling the new section of audio track can be used to achieve a similar result, that is, to reduce the amount of data required for the new section and hence the playback time.
  • Increasing or decreasing the playback speed can be used to decrease or increase the length of time required to play back the new section by altering how fast the new section is played. This can be used to alter the total time required for playback of the new section and hence make it possible to fit the new section into the original length of time available in the desired time slot.
  • Another possible modification is to identify and remove pauses or periods of silence in the new section of track (if it is longer in time than the available time slot) or to add periods of silence between dialog or sounds in the new section of track (if it is shorter in time than the available time slot). In either case, the result is to modify the length of time required for the new section of audio track so that it fits into the desired time slot.
  • In some embodiments, an audio description dialog merging module may be implemented. In operation, the gap between neighboring audio description segments is calculated; if two descriptions are separated by a gap of less than 50 ms, the two segments are merged into one event.
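The 50 ms merging rule translates directly into a small helper; the (start_ms, end_ms, text) event representation is an assumption for the example.

```python
def merge_close_descriptions(events, max_gap_ms: float = 50.0):
    """Merge neighboring (start_ms, end_ms, text) audio description
    events whose separation gap is under max_gap_ms into one event."""
    merged = []
    for start, end, text in sorted(events):
        if merged and start - merged[-1][1] < max_gap_ms:
            prev_start, _, prev_text = merged[-1]
            merged[-1] = (prev_start, end, prev_text + " " + text)
        else:
            merged.append((start, end, text))
    return merged
```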
  • FIG. 6A is an illustration of how a machine learning model may be trained to identify segments or sections of an audio track having a specific audio characteristic.
  • A machine learning model may be trained by using a set of training data.
  • The training data may include (a) segments of audio, with some segments having a specific audio characteristic and some segments lacking that characteristic, and (b) a corresponding label, indicator, or annotation for each segment specifying the presence or absence of the specific audio characteristic.
  • The audio segments and labels are input to the model to “train” the model.
  • Once trained, the model will operate to respond to a new input sample of an audio track by providing an output that indicates whether the input sample has the specific audio characteristic (such as containing dialog) or does not.
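A compact sketch of this train-then-classify flow is shown below, using mean MFCC vectors as features and a logistic-regression classifier; the feature choice, file names, and labels are illustrative assumptions, as the patent does not specify a model architecture.

```python
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mfcc_summary(path: str) -> np.ndarray:
    """Summarize a clip as the mean of its MFCC frames."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

# placeholder training set: 1 = contains dialog, 0 = no dialog
train_paths = ["seg_001.wav", "seg_002.wav", "seg_003.wav", "seg_004.wav"]
train_labels = [1, 0, 1, 0]

X = np.stack([mfcc_summary(p) for p in train_paths])
model = LogisticRegression(max_iter=1000).fit(X, train_labels)

# classify a new segment of an audio track
has_dialog = model.predict(mfcc_summary("new_segment.wav")[None, :])[0]
```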
  • FIG. 6B is an illustration of how the trained machine learning model of FIG. 6A may be used to classify a new segment of an audio track to determine whether the new segment includes a specific audio characteristic (such as including or not including dialog).
  • As shown, a new segment or sample of an audio track is input to the trained machine learning model.
  • The model provides an output that represents a classification of the input sample as either including or not including the specific audio characteristic (such as whether the track segment includes spoken dialog or does not).
  • For example, the trained model would identify the time segment between T0 and T1 as including spoken dialog, and the time segment between T3 and TN−2 as not including spoken dialog.
  • FIG. 7 is a flow diagram illustrating a method, operation, function, or process for converting dialog into synthesized speech and mixing the synthesized speech with an original audio track to add an audio description (AD) or a lektor, in accordance with an embodiment of the systems and methods described herein.
  • The inputs to the processing pipeline are the target-language dialog list and the original audio track. The synthesized speech is generated from the dialog list.
  • The original audio track could be either stereo or 5.1 audio.
  • The output of the pipeline is the mixed version of the synthesized audio and the original audio.
  • The dialog list 702 includes one or more of a translation of the dialog in a film or television program (as would be used in lektoring) or dialog describing a visual element of the film or program (as would be used in providing an audio description (AD)).
  • The input dialog file may be a spreadsheet file containing the target-language dialog list with in-and-out timecodes.
  • Input text may be converted into Speech Synthesis Markup Language (SSML) format.
  • SSML provides control over pauses, formatting for dates and times, speaker pitch, and speaking rate.
  • The input dialog is parsed by a script parser 704.
  • Script parsing involves processing the input dialog list using the following steps: 1) extract on-screen text and timing; 2) extract the dialog text and in-and-out times from the input dialog list file; 3) remove markup terms such as ‘deep breath’ and ‘overlap’ from the dialog text; 4) convert dates such as year numbers into ‘date formatting’ with SSML; and 5) to control the speech rate, convert text into SSML with a different rate.
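For instance, a parsed dialog line might be wrapped in SSML as in the sketch below; the helper and the 110% rate are hypothetical, shown only to make the rate-control step concrete.

```python
from xml.sax.saxutils import escape

def to_ssml(text: str, rate_percent: int = 100) -> str:
    """Wrap one parsed dialog line in SSML, with an optional speaking
    rate to help the synthesized clip fit its time slot."""
    return ("<speak>"
            f'<prosody rate="{rate_percent}%">{escape(text)}</prosody>'
            "</speak>")

print(to_ssml("She opens the letter and smiles.", rate_percent=110))
```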
  • The text or SSML output from the parsing operation is then converted to synthesized speech using a text-to-speech engine 706.
  • The text-to-speech engine may be provided by a third party.
  • In some cases, the synthesized speech segment may be longer than the assigned time slot.
  • The text-to-speech process steps for insertion of an AD may include the following: 1) call a text-to-speech engine (whether local or third party) to generate the synthesized speech; 2) measure the length of the synthesized speech; and 3) if the length is longer than the assigned time slot, calculate the speed-up ratio and call the text-to-speech engine a second time with the new speech rate.
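The sketch below captures that two-pass length check. Here tts is a stand-in for whichever engine is called (the patent does not name one) and is assumed to return audio samples and their sample rate; the cap on the speed-up ratio is an added assumption to keep the result intelligible.

```python
def synthesize_to_fit(text: str, slot_s: float, tts, max_ratio: float = 1.3):
    """Two-pass synthesis: generate speech, measure it, and if it
    overruns the slot, re-synthesize at a faster rate.
    `tts(text, rate)` is a hypothetical engine wrapper returning
    (samples, sample_rate)."""
    samples, sr = tts(text, rate=1.0)
    length_s = len(samples) / sr
    if length_s > slot_s:
        ratio = min(length_s / slot_s, max_ratio)  # cap the speed-up
        samples, sr = tts(text, rate=ratio)
    return samples, sr
```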
  • In this manner, the new audio description as expressed in synthesized speech may be altered with regard to its speaking rate in order to reduce the time required for a specific item of dialog to be spoken.
  • The generated synthesized speech is then mixed with the original audio track by an automated mixing process 708 (as generally described above with reference to FIGS. 1-6B).
  • In summary, the present disclosure is directed to systems and methods for mixing a new section of an audio track into an existing audio track, where the existing track is associated with a video recording. More specifically, the disclosure is directed to systems and methods for mixing a synthesized voice description of a portion of a video recording with the existing audio track associated with the recording.
  • The synthesized voice description can be used to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog.
  • The automated process described herein enables a visually impaired viewer to be provided with additional information about a scene in a film or television program that might otherwise not be apparent to them, and it does so in a more efficient and comprehensible manner than conventional approaches.
  • Embodiments of the system and methods described herein enable a visually impaired viewer to obtain greater enjoyment of (or information from) a film or television program by implementing an improvement to the conventional way in which an audio track associated with a video is modified or enhanced by a human sound mixer.
  • The embodiments described herein improve the conventional mixing process by automatically detecting or identifying specific time slots in which to place new audio content, followed by automatically adjusting the volume level in those time slots and inserting new audio content in a synthesized voice into the time slot or slots at a specific volume level.
  • Embodiments of the system and methods described herein may also (or instead) be used to insert an audio section containing dialog spoken in a different language into an audio track in an automated and efficient manner, and one in which the inserted section is more comprehensible to a listener.
  • a computer-implemented method comprising: accessing an audio track that is associated with a video recording; identifying a section of the accessed audio track having a specific audio characteristic; reducing a volume level of the audio track in the identified section; accessing an audio segment that includes a synthesized voice; and inserting the accessed audio segment into the identified section of the audio track, the inserted audio segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • the audio description provides additional information regarding the portion of the video recording, the additional information including an explanation or description of one or more of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • processing of the audio segment to alter the segment's length in time further comprises increasing the length of time of the audio segment.
  • processing of the audio segment to alter the segment's length in time further comprises decreasing the length of time of the audio segment.
  • identifying a section of the accessed audio track having a specific audio characteristic is performed, at least in part, by a machine learning model.
  • a system comprising: at least one physical electronic processor; and a physical electronic memory comprising computer-executable instructions that, when executed by the physical electronic processor, cause the physical electronic processor to: access an audio track that is associated with a video recording; identify a section of the accessed audio track having a specific audio characteristic; reduce a volume level of the audio track in the identified section; access an audio segment that includes a synthesized voice; and insert the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • the specific audio characteristic is that the identified section is one in which no dialog is spoken, and the audio segment includes an audio description of a portion of the video recording.
  • the audio description provides additional information regarding the portion of the video recording, the additional information including an explanation or description of at least one of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • the audio segment is processed to alter its length in time prior to inserting the audio segment into the identified section of the audio track, and further, wherein the processing may vary depending upon the amount by which the audio segment needs to be shortened or lengthened to fit the time slot into which it is to be inserted.
  • a non-transitory computer-readable medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to: access an audio track that is associated with a video recording; identify a section of the accessed audio track having a specific audio characteristic; reduce a volume level of the audio track in the identified section; access an audio segment that includes a synthesized voice; and insert the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • non-transitory computer-readable medium of claim 19 further comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to process the audio segment to alter its length in time prior to inserting the audio segment into the identified section of the audio track.
  • computing devices and systems described and/or illustrated herein broadly represent any type or form of computing device or system capable of executing computer-readable instructions, such as those contained within the modules described herein.
  • these computing device(s) may each include at least one memory device and at least one physical electronic processor.
  • the term “memory device” generally refers to any type or form of volatile or non-volatile storage device or medium capable of storing data and/or computer-readable instructions.
  • a memory device may store, load, and/or maintain one or more of the modules described herein. Examples of memory devices include, without limitation, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Hard Disk Drives (HDDs), Solid-State Drives (SSDs), optical disk drives, caches, variations or combinations of one or more of the same, or any other suitable storage memory.
  • the term “physical processor” generally refers to any type or form of hardware-implemented processing unit capable of interpreting and/or executing computer-readable instructions.
  • a physical processor may access and/or modify one or more modules stored in the above-described memory device.
  • Examples of physical processors include, without limitation, microprocessors, microcontrollers, Central Processing Units (CPUs), Field-Programmable Gate Arrays (FPGAs) that implement softcore processors, Application-Specific Integrated Circuits (ASICs), portions of one or more of the same, variations or combinations of one or more of the same, or any other suitable physical processor.
  • modules described and/or illustrated herein may represent portions of a single module or application.
  • one or more of these modules may represent one or more software applications or programs that, when executed by a computing device, may cause the computing device to perform one or more tasks.
  • one or more of the modules described and/or illustrated herein may represent modules stored and configured to run on one or more of the computing devices or systems described and/or illustrated herein.
  • One or more of these modules may also represent all or portions of one or more special-purpose computers configured to perform one or more tasks.
  • one or more of the modules described herein may transform data, physical devices, and/or representations of physical devices from one form to another.
  • one or more of the modules recited herein may receive data to be transformed, transform the data, output a result of the transformation to perform a function, use the result of the transformation to make a decision or perform another function, and store the result of the transformation.
  • one or more of the modules recited herein may transform a processor, volatile memory, non-volatile memory, and/or any other portion of a physical computing device from one form to another by executing on the computing device, storing data on the computing device, and/or otherwise interacting with the computing device.
  • the term “computer-readable medium” generally refers to any form of device, carrier, or medium capable of storing or carrying computer-readable instructions.
  • Examples of computer-readable media include, without limitation, transmission-type media, such as carrier waves, and non-transitory-type media, such as magnetic-storage media (e.g., hard disk drives, tape drives, and floppy disks), optical-storage media (e.g., Compact Disks (CDs), Digital Video Disks (DVDs), and BLU-RAY disks), electronic-storage media (e.g., solid-state drives and flash media), and other distribution systems.

Abstract

The disclosed computer-implemented method may include accessing an audio track that is associated with a video recording, identifying a section of the accessed audio track having a specific audio characteristic, reducing a volume level of the audio track in the identified section, accessing an audio segment that includes a synthesized voice, and inserting the accessed audio segment into the identified section of the audio track, where the inserted segment has a higher volume level than the reduced volume level of the audio track in the identified section. The synthesized voice description can be used to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog. Various other methods, systems, and computer-readable media are also disclosed.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/937,755, filed on Nov. 19, 2019, which application is incorporated by reference in its entirety herein.
  • BACKGROUND
  • Digital content distribution systems may provide a variety of different types of content (e.g., TV shows, movies, etc.) to end users. This content may include both audio and video data and may be sent to a user's content player as a multimedia stream. As a result, streaming content has become a very popular form of entertainment. The ability to enjoy a film, television program or other form of audiovisual content in the comfort of one's home offers many advantages to viewers. One of these advantages is that of enabling visually impaired viewers to more easily view the displayed content by being able to adjust their position to a more comfortable distance from a screen or other display device than might be available at a movie theater.
  • However, such visually impaired viewers may still miss certain details in a scene that is displayed, or not be able to recognize certain objects. This can reduce their understanding and enjoyment of the displayed content. For example, a visually impaired viewer may not recognize that a character is holding an object that can explain an element of a plot or that an object in a scene might provide a clue to what is happening. Similarly, a visually impaired viewer may not be able to recognize an expression on a character's face which could add to the viewer's understanding of the character or to the meaning of dialog spoken by the character.
  • SUMMARY
  • As will be described in greater detail below, the present disclosure describes systems and methods for mixing a synthesized voice description of a scene or other portion of a video recording with the existing audio track associated with the video recording. The synthesized voice description can be used to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog.
  • In one example, a computer-implemented method includes accessing an audio track that is associated with a video recording, identifying a section of the accessed audio track having a specific audio characteristic, reducing a volume level of the audio track in the identified section, accessing an audio segment that includes a synthesized voice, and inserting the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • In one example, the specific audio characteristic is that the identified section is one in which no dialog is spoken.
  • In one example, the specific audio characteristic is that the identified section is one in which dialog is spoken.
  • In one example, the volume of the identified section is reduced by an amount between 6 and 12 decibels (dB).
  • In one example, the volume of the identified section is reduced by an amount of approximately 9 decibels (dB).
  • In one example, the accessed audio segment includes an audio description of a portion of the video recording.
  • In one example, the audio description provides additional information regarding the portion of the video recording, where the additional information may include an explanation or description of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • In one example, the portion of the video recording for which the additional information is provided corresponds to the section of the audio track where the accessed audio segment is inserted.
  • In one example, the accessed audio segment includes dialog spoken in a different language than dialog in the audio track prior to implementing the method.
  • In one example, the method includes processing the accessed audio segment to alter its length in time prior to inserting the audio segment into the identified section of the audio track.
  • In one example, the processing of the audio segment to alter the segment's length in time includes increasing the length of the audio segment.
  • In one example, processing of the audio segment to alter the segment's length in time includes decreasing the length of the audio segment.
  • In one example, the amount of reduction in the volume level of the audio track in the identified section depends upon the audio characteristic of the identified section.
  • In one example, identifying a section of the accessed audio track having a specific audio characteristic is performed, at least in part, by a machine learning model.
  • In addition, a corresponding system (e.g., a server, computing device, etc.) is disclosed. The system includes a set of modules stored in an electronic data storage or memory, with each module containing instructions for a computer-implemented process, function, or operation, and an electronic processor for executing the instructions. The modules include one or more modules containing instructions for accessing an audio track that is associated with a video recording, identifying a section of the accessed audio track having a specific audio characteristic, reducing a volume level of the audio track in the identified section, accessing an audio segment that includes a synthesized voice, and inserting the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • In some examples, the above-described method may be encoded as computer-readable instructions on a computer-readable medium. For example, a computer-readable medium may include one or more computer-executable instructions that, when executed by at least one processor of a computing device, may cause the computing device to access an audio track that is associated with a video recording, identify a section of the accessed audio track having a specific audio characteristic, reduce a volume level of the audio track in the identified section, access an audio segment that includes a synthesized voice, and insert the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • Features from any of the embodiments described herein may be used in combination with one another in accordance with the general principles described herein. These and other embodiments, features, and advantages will be more fully understood upon reading the following detailed description in conjunction with the accompanying drawings and claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The accompanying drawings illustrate a number of exemplary embodiments and are a part of the specification. Together with the following description, these drawings demonstrate and explain various principles of the present disclosure.
  • FIG. 1 is a diagram illustrating a system 100 containing a set of modules 102, with each module containing executable instructions that when executed by an electronic processor implement a method for altering an audio track associated with a video recording in accordance with an embodiment of the systems and methods described herein.
  • FIG. 2 is a flow chart or flow diagram of an exemplary computer-implemented method, operation, function or process 200 for altering an audio track associated with a video recording in accordance with an embodiment of the systems and methods described herein.
  • FIG. 3A is an illustration of an example time period from T0 to TN of an audio track that may be associated with a video recording. As shown, the time period includes multiple time segments or sections, which may contain dialog and/or a soundtrack.
  • FIG. 3B is an illustration of the audio track of FIG. 3A in which the soundtrack volume level has been lowered within a time segment or segments in which no dialog is spoken.
  • FIG. 3C is an illustration of how the soundtrack volume level within a time segment or segments in which no dialog is spoken may be lowered relatively abruptly or gradually decreased to the desired level.
  • FIG. 3D is an illustration of an example time period from T0 to TN of an audio track that may be associated with a video recording and is used to illustrate an embodiment in which the original audio track is modified by use of lektoring.
  • FIG. 3E is an illustration of the audio track of FIG. 3D in which the dialog volume level has been lowered within a time segment or segments in which dialog is spoken.
  • FIG. 3F is an illustration of the audio track of FIG. 3E in which a new audio segment of dialog (such as used in lektoring) has been inserted within the same time period as original dialog, but at a higher volume level than the lowered volume of the original dialog.
  • FIGS. 4A, 4B, and 4C are illustrations of example ways in which the original dialog in a time period may be lowered in volume for a case in which dialog used for lektoring will be added. In FIG. 4A the original dialog is lowered in volume abruptly, in FIG. 4B the original dialog is lowered in volume gradually, and in FIG. 4C the original dialog is played at a lowered volume after a slight delay.
  • FIG. 5 is an illustration of how the original length in time of an audio track containing additional audio may be processed to adjust its length prior to being inserted into a section of an original audio track.
  • FIG. 6A is an illustration of how a machine learning model may be trained to identify segments or sections of an audio track having a specific audio characteristic.
  • FIG. 6B is an illustration of how the trained machine learning model of FIG. 6A may be used to classify a new segment of an audio track to determine if the new segment includes a specific audio characteristic (such as including or not including dialog).
  • FIG. 7 is a flow diagram illustrating a method, operation, function or process for converting dialog into synthesized speech and mixing the synthesized speech with an original audio track to add an audio description (AD) or a lektor, in accordance with an embodiment of the systems and methods described herein.
  • Throughout the drawings, identical reference characters and descriptions indicate similar, but not necessarily identical, elements. While the exemplary embodiments described herein are susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. However, the exemplary embodiments described herein are not intended to be limited to the particular forms disclosed. Rather, the present disclosure covers all modifications, equivalents, and alternatives falling within the scope of the appended claims.
  • DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
  • The present disclosure is generally directed to systems and methods for mixing a new section of an audio track into an existing audio track, where the existing track is associated with a video recording. A multimedia stream used to provide content to viewers will typically include an audio track and a video recording. The audio track and video recording are stored in an electronic data storage element or memory. The two may be combined into a single set of data or stored as separate but associated sets of data. If stored separately, the two sets of data are synchronized or aligned during the streaming process.
  • As mentioned, visually impaired viewers may miss certain details in a scene that is displayed, or not be able to recognize certain objects. One possible approach is to overlay a synthesized voice over an existing audio track whenever it is desired to insert additional audio information. However, this may not be effective, as potential overlaps between the existing audio track and the additional audio information can confuse a listener.
  • Another example of a situation in which an audio track might require alteration for a specific type of viewer is in the process known as “lektoring”, which involves adding a voice-over translation of dialog to an audio track. Lektoring typically involves a voice speaking a language that is different from the original language of the content item (this different language is hereinafter referred to as the target language). The target-language speaking voice is overlaid on top of a portion of an audio track containing the dialog being spoken in the original language by actors in a scene. It is a form of dubbing for an audience that speaks a different language than that originally spoken by the actors in a film or program. One approach to implementing lektoring places the lektoring audio track over the existing dialog portions of a film or program's audio track, which plays in the background. However, this has the disadvantage that the voice-over may be confusing to a listener due to background sounds, such as the original dialog.
  • Note that although lektoring typically involves an overlay of dialog in a scene spoken in a different language, in some circumstances it may be desirable to introduce a description of an element of a scene or character into an audio track, with that description spoken in a different language than the original dialog. In this example, the methods described herein may be used to introduce an audio segment containing additional audio description spoken in a different language into an audio track, either over dialog or adjacent to it in the audio track. If adjacent to dialog, then the dialog may be subject to overlay by dialog spoken in the different language with the additional audio content inserted adjacent to it.
  • In one embodiment, the disclosed system and method are used to mix or integrate a synthesized voice description of a portion of a video recording with the existing audio track associated with the recording. The synthesized voice description can be used to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording. This is accomplished in some examples by inserting the synthesized voice description into a segment of the audio track in which no dialog is being spoken.
  • In some examples, a segment in which no dialog is spoken may be automatically identified or detected by a substantial reduction in the frequency components associated with speech on the audio track or by another indication of a time period or time slot in which no speech is present. It may also be identified by use of a Voice Activity Detection (VAD) technique. Such a segment may also be identified or detected by use of a machine learning model that has been trained to recognize when a section of audio track is lacking dialog (this may be accomplished by using a set of training data that includes multiple examples of sections of audio track having or lacking dialog and an associated label or indicator of whether dialog is present in that section).
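  • By way of illustration, the following sketch shows one simple energy-based stand-in for such detection: it flags time spans whose energy within a typical speech band (assumed here to be roughly 300-3400 Hz) stays below a threshold. The frame size, band edges, and threshold are illustrative assumptions rather than values taken from this disclosure; a production system would more likely use a dedicated VAD or a trained model.

    import numpy as np

    def find_non_dialog_sections(samples, sr, frame_ms=30, floor_db=-45.0):
        """Return (start_sec, end_sec) spans whose speech-band energy is low.

        'samples' is assumed to be 16-bit mono PCM as a NumPy integer array.
        """
        frame_len = int(sr * frame_ms / 1000)
        n_frames = len(samples) // frame_len
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        band = (freqs >= 300) & (freqs <= 3400)   # rough speech band
        spans, start = [], None
        for i in range(n_frames):
            frame = samples[i * frame_len:(i + 1) * frame_len].astype(np.float64)
            spectrum = np.abs(np.fft.rfft(frame))
            energy = np.sum(spectrum[band] ** 2) + 1e-12
            # Level relative to a full-scale 16-bit frame, in decibels.
            level_db = 10 * np.log10(energy / (frame_len * (2 ** 15) ** 2))
            t = i * frame_ms / 1000.0
            if level_db < floor_db and start is None:
                start = t                          # quiet span begins
            elif level_db >= floor_db and start is not None:
                spans.append((start, t))           # quiet span ends
                start = None
        if start is not None:
            spans.append((start, n_frames * frame_ms / 1000.0))
        return spans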
  • In some examples, the automated process described herein enables a visually impaired viewer to be provided with additional information about a scene in a film or television program that might otherwise not be apparent to them and does so in a more efficient and comprehensible manner than conventional approaches.
  • In some examples, the automated process described herein provides a way to integrate a dubbed version of dialog into an existing audio track in a more efficient manner than conventional approaches.
  • As will be described in greater detail, embodiments of the systems and methods described herein enable a visually impaired viewer to obtain greater enjoyment of (or information from) a film or television program. Embodiments described herein improve the conventional mixing process by automatically detecting or identifying specific time slots in which to place new audio content, followed by automatically adjusting the volume level in those time slots and inserting new audio content in a synthesized voice into the time slot or slots at a specific volume level.
  • As will be described, embodiments of the systems and methods described herein may be used to insert a translation, voice-over audio track, or additional audio description into an existing audio track. In one example, the described technique may be used in implementing a version of “lektoring”, in which a voice speaking a different language than that originally spoken in a scene (e.g., a Polish speaking voice) is inserted into the audio track, over the existing dialog. As a result, in lektoring, the new audio overlaps with the original dialog.
  • For example, the lowered volume of the original dialog may begin, and after a short time (e.g., 1 second, although this time delay is configurable, as is the amount or degree of lowering of the volume level of the original dialog) the new audio containing the dialog spoken in a different language may start. In addition to the example described, instead of being inserted over a portion of the audio track containing existing dialog, in one example, the audio segments containing dialog spoken in a different language may be inserted into a section of the audio track in which no dialog is spoken (e.g., adjacent to the audio track position of the original dialog).
  • The technique(s) described may also be used to insert an audio description of a scene or character into the audio track. In this example, the new audio content is inserted into the audio track in a region between portions of the track containing spoken dialog. The inserted content may include dialog spoken in a different language or audio descriptions of features or elements of a scene (spoken in the original or in a different language), as described previously.
  • The following will provide, with reference to FIGS. 1-7, detailed descriptions of a computer-implemented method, function, operation or process for altering an existing audio track associated with a video recording to mix into the track an audio section containing additional information about the associated video recording in an automated and efficient manner. Further, the following will also provide a detailed description of a system, server, or computing device for performing the described method, function, operation or process.
  • The following will also provide detailed descriptions of a computer-implemented method, function, operation or process for altering an existing audio track associated with a video recording to mix into the track an audio section containing dialog spoken in a different language than in the original audio track in an automated and efficient manner. Further, the following will also provide a detailed description of a system, server, or computing device for performing the described method, function, operation or process.
  • FIG. 1 is a diagram illustrating a system 100 containing a set of modules 102, with each module containing executable instructions that when executed by an electronic processor implement a method for altering an audio track associated with a video recording, in accordance with an embodiment of the systems and methods described herein.
  • As shown in the figure, system 100 may represent a server or other form of computing or data processing device. Modules 102 each contain a set of executable instructions, where when the set of instructions is executed by a suitable electronic processor (such as that indicated in the figure by “Physical Processor(s) 130”), system (or server or device) 100 operates to perform a specific process, operation, function or method. Modules 102 are stored in a memory 120, which typically includes an Operating System module 104 that contains instructions used (among other functions) to access and control the execution of the instructions contained in other modules. The modules 102 in memory 120 are accessed for purposes of transferring data and executing instructions by use of a “bus” or communications line 114, which also serves to permit processor(s) 130 to communicate with the modules for purposes of accessing and executing a set of instructions. Bus or communications line 114 also permits processor(s) 130 to interact with other elements of system 100, such as input or output devices 122, communications elements 124 for exchanging data and information with devices external to system 100, and additional memory devices 126.
  • In some embodiments, the term “audio track” as used herein may refer to a track containing spoken dialog, music, sound effects, a mono track, a stereo track, a MIDI (Musical Instrument Digital Interface) track or environmental sounds.
  • In some embodiments, the term or phrase “an audio track that is associated with a video recording” as used herein may refer to an audio track containing one or more of spoken dialog, music or background sounds that is meant to be played as part of a combined audio and video presentation. The combined presentation may be a film or television program, for example. The audio track may be stored separately and combined with a video recording or it may be stored as part of the combined audio and video presentation.
  • In some embodiments, the term “synthesized” as used herein may refer to an artificially generated voice or sound meant to represent a voice.
  • As shown in FIG. 1, modules 102 may contain one or more sets of instructions for performing a method that is described with reference to FIG. 2. These modules may include those illustrated but may also include a greater number or fewer number than those illustrated. For example, Audio Track Access Module 106 may contain instructions that when executed perform a process to access an audio track for a film or program. Audio Track Processing Module 108 may contain instructions that when executed perform a process to lower the volume in a specific time slot of an audio track. Module 108 may also contain instructions that when executed cause the volume to be lowered in a specific manner. Alternatively, the instructions that cause the volume to be lowered in a specific manner may be contained in a different module.
  • New Audio Section Processing Module 110 may contain instructions that when executed perform a process to access a previously generated new section of audio description that corresponds to each of the time slots. In some embodiments, the generated section or sections are generated in a synthesized voice and stored in a suitable electronic data storage element. In some embodiments, module 110 adjusts the volume level of each piece of the synthesized audio description and inserts each piece of synthesized audio content into the audio track in (or at) its desired time slot.
  • Audio Track Return Module 112 may contain instructions that when executed perform a process to store the final audio track in a suitable data storage element. The final audio track may then be combined or recombined with the video recording after appropriate synchronization, thereby resulting in a final version of the film or program in which the new (additional) audio content has been added.
  • FIG. 2 is a flow chart or flow diagram of an exemplary computer-implemented method, operation, function or process 200 for altering an audio track associated with a video recording in accordance with an embodiment of the systems and methods described herein. In some embodiments, this is performed for the purpose of mixing an audio description of a scene from a film or television program into an existing audio track for the film or program. The steps shown in FIG. 2 may be performed by any suitable computer-executable code and/or computing system, including the system(s) illustrated in FIG. 1. In one example, each of the steps shown in FIG. 2 may represent an algorithm whose structure includes and/or is represented by multiple sub-steps, examples of which will be provided in greater detail below.
  • As described, a system for implementing the method of FIG. 2 includes an electronic processor (e.g., a CPU, controller, microcontroller, etc.) which accesses a set of software instructions that are stored in an electronic data storage element. Each set of instructions that is used to perform a specific function may be contained in a specific module. The processor executes the stored instructions in a module to perform the specific function or operation.
  • As illustrated in FIG. 2, at step 202 one or more of the modules described herein operates to cause a computing device to access an audio track for a film or program (e.g., by execution of the instructions contained in module 106 of FIG. 1). The track is stored in an electronic data storage element or memory. The track is stored with its associated video recording in a common memory or in a separate memory. The accessed audio track includes one or more of spoken dialog, music, sound effects, a mono track, a stereo track, a midi track, background sounds or environmental sounds.
  • In some embodiments, the track is accessed using an automated process that identifies an audio track as part of a combined audio and video project. The track may also be accessed by a process that is coupled to the digital audio workstation (DAW) of a sound engineer (a digital audio workstation is an electronic device or application software used for recording, editing and producing audio files).
  • The accessed track is processed at step 204 (e.g., by execution of the instructions contained in module 108) to identify or detect one or more time periods (i.e., time sections or segments, sometimes referred to as time slots) in the audio track having a specific audio characteristic. In one example, these one or more time slots represent sections of the audio track in which the new audio content may be inserted during later stages of the mixing or processing.
  • In some embodiments, the term “section” or “segment” as used herein in reference to an audio track may refer to a time period or time slot that is part of the audio track.
  • In some embodiments, the term or phrase “specific audio characteristic” as used herein in reference to an audio track may refer to a section of the audio track in which no dialog is spoken or to a section of the audio track in which dialog is spoken. In some embodiments, the term or phrase “specific audio characteristic” may refer to a section of the audio track having a certain volume level, a certain frequency range, a certain maximum frequency, a certain frequency spectrum or other feature.
  • In some embodiments, the specific sections or time slots identified are those in which no dialog is spoken. That is, the specific audio characteristic is the lack of spoken dialog in the particular section of the audio track. In some embodiments, this is determined by examining the spectral range of audio within a time slot, as spoken dialog will typically be associated with a specific range of frequencies. In some embodiments, other characteristics of spoken dialog may be used to identify or detect the specific time slots. In other embodiments, the specific audio characteristic is that the identified time slots are those in which dialog is spoken.
  • In some embodiments, a machine learning model is trained to automatically determine one or more time slots or sections of the audio track in which dialog is spoken and/or one or more time slots or sections of the audio track in which dialog is not spoken. As will be described further with reference to FIGS. 6A and 6B, this is accomplished by generating a set of training data for the model that includes multiple examples of sections of audio, some of which include spoken dialog and some of which do not include spoken dialog. The examples are each associated with a label or other indicator of whether they contain or do not contain dialog. The examples and labels are input to the model for purposes of training the model. When trained, the model will operate to respond to an input sample of an audio track by providing an output that indicates whether the input sample contains spoken dialog or does not contain spoken dialog (or contains or does not contain another audio characteristic that the model was trained to identify).
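  • As a toy illustration of this train-then-classify flow, the sketch below fits a scikit-learn logistic regression to pre-computed per-segment feature vectors with dialog/no-dialog labels. The feature files and their format are hypothetical, and the disclosure does not specify a particular model type; any classifier trained on labeled audio sections would fit the description above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical inputs: one feature vector per audio section (e.g., band
    # energies) and a label marking whether that section contains dialog.
    X_train = np.load("segment_features.npy")  # shape: (n_segments, n_features)
    y_train = np.load("segment_labels.npy")    # 1 = dialog, 0 = no dialog

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    def contains_dialog(segment_features):
        """Classify the feature vector of a new audio segment."""
        return bool(model.predict(segment_features.reshape(1, -1))[0])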
  • Once the one or more time slots have been identified/detected, the method at step 206 (as implemented by module 108 or a separate set of instructions in a different module) reduces the volume level within each of the detected time slots. This process (sometimes referred to as “ducking”) typically lowers the volume level by an amount between 6 and 12 decibels (dB), although other amounts of reduction (either greater or lesser) may be used. One result of the reduction in volume level is to reduce any distraction or confusion to the listener caused by the original audio (such as music or sounds) after insertion of the new audio content. In some embodiments, the reduction in volume level is relatively abrupt, that is, the volume level is sharply (i.e., substantially immediately) decreased by the desired amount. In some embodiments, the volume level is gradually decreased over a period of time (e.g., hundreds of milliseconds to 2 seconds) to its final level.
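  • For reference, the arithmetic behind such a duck is a plain decibel-to-linear conversion; a 9 dB reduction corresponds to scaling the samples by roughly 0.355. A minimal sketch for a floating-point audio buffer:

    def duck(samples, reduction_db=9.0):
        """Scale a float audio buffer down by the given decibel amount."""
        gain = 10.0 ** (-reduction_db / 20.0)  # 9 dB -> gain of about 0.355
        return samples * gain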
  • After each of the identified/detected time slots has had its volume reduced, the process accesses a previously generated new section of audio description that corresponds to each of the time slots, as indicated by step 208 (and as implemented by the execution of the set of instructions contained in module 110). In some embodiments, the generated section or sections include a synthesized voice and are stored in a suitable electronic data storage element.
  • In some embodiments, the term or phrase “an audio segment that includes a synthesized voice” as used herein may refer to a section of an audio track or recording in which there is a synthesized voice speaking dialog or making sounds. The synthesized voice may be in the same language as the original audio track or in a second, different language.
  • In some embodiments, the generated section(s) include an audio description of a portion of the video recording, such as a description of an object in a scene, a description of a character's appearance, expression or emotional state, a description of the scene environment, etc. In some cases, the generated section(s) are for the purpose of providing a visually impaired viewer with additional information or context regarding a scene. In other embodiments, the generated section(s) contain dialog spoken in a different language than in the original audio track.
  • In some embodiments, the term or phrase “an audio description of a portion of the video recording” as used herein may refer to spoken dialog, spoken commentary or sounds that are meant to provide additional information to a viewer, where such additional information may include one or more of an explanation or description of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • In some embodiments, each of the generated new sections of the audio track is associated with a specific scene or portion of the video recording and, from that, with a specific portion of the audio track. In some embodiments, this is done by use of conventional sound mixing technologies or synchronization tools. This allows the method illustrated in FIG. 2 to associate each of the new synthesized pieces of audio description with its corresponding time slot (or in some cases, a time period after the time slot in which dialog for a specific scene is spoken) in the original audio track.
  • The method then adjusts the volume level of each piece of the synthesized audio description, as also illustrated at step 208. In some embodiments, the adjusted volume level of each piece of synthesized audio description is determined by considering the original volume level in a time slot, i.e., the volume level that was present before it was reduced in step 206. In some embodiments, the adjusted volume level of a new section of audio description is determined by increasing the volume level of the new section so that it is greater than that of the reduced volume level in the audio track for the time slot.
  • In some embodiments, adjusting the volume of the new section of audio track is performed automatically by a process that increases the volume level of the new section to a pre-determined amount greater than the level to which the volume has been reduced in the corresponding time slot or segment. The pre-determined amount can be expressed as an increased decibel level. The pre-determined amount can be the amount found (via experimental testing) to be sufficient to make the new audio section comprehensible to a listener when the listener hears the new audio content at its adjusted volume and the original audio content at its lowered volume.
  • The amount of desired volume adjustment can also be determined as a result of using a machine learning model that has been trained to identify when a section of audio track is comprehensible to a listener hearing both an original section of audio at a lowered volume and a new section of audio at a higher volume. Note that the amount of volume adjustment may depend upon the contents of the audio in the original section of the audio track and/or the contents in the new audio track.
  • Each piece of synthesized audio content is then inserted into the audio track in (or at) its desired time slot, as illustrated at step 210 (and as implemented by the execution of the set of instructions contained in module 110). Additional mixing may be performed (if desired) to gradually increase the volume level of the inserted content to its final desired level. This may be done to reduce any confusion or disruption to the listener.
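  • One way to realize this insertion step is sketched below, under the assumption of float mono buffers at a common sample rate: the synthesized segment is scaled so its level sits a fixed number of decibels above the already-ducked bed, then summed into the track. The 3 dB headroom and the RMS-based level matching are illustrative choices, not values from this disclosure.

    import numpy as np

    def insert_segment(track, segment, slot_start_sample, headroom_db=3.0):
        """Mix 'segment' into an already-ducked 'track' at a level above the bed.

        Assumes the segment has already been fitted to its time slot.
        """
        out = track.copy()
        end = slot_start_sample + len(segment)
        bed = out[slot_start_sample:end]
        bed_rms = np.sqrt(np.mean(bed ** 2)) + 1e-12
        seg_rms = np.sqrt(np.mean(segment ** 2)) + 1e-12
        target_rms = bed_rms * 10.0 ** (headroom_db / 20.0)
        out[slot_start_sample:end] = bed + segment * (target_rms / seg_rms)
        return np.clip(out, -1.0, 1.0)  # guard against clipping after the sum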
  • After insertion of each section of new audio (such as the additional content), and if desired further processing of the volume level for each inserted section, the final audio track for the film or program is stored in a suitable data storage element, as indicated by step 212 (by execution of the instructions contained in module 112). The final audio track may then be combined or recombined with the video recording after appropriate synchronization, thereby resulting in a final version of the film or program in which the new (additional) audio content has been added.
  • In some embodiments, the length of time of a piece of new (additional) audio description may not match that of the time slot into which it is to be inserted. In such a situation, the time length of the piece to be inserted may be altered by being sped up (if it is longer than the time slot) or slowed down (if it is shorter than the time slot). In either case, the goal is to produce a piece of additional audio description that will fit into a specific time slot without introducing distracting distortion or confusion to a listener.
  • The process of modifying the time length of a piece of audio track to be inserted into an existing track may be performed automatically by a process that has as inputs the length of time of the time slot in the existing audio track and the length of time of the new section of audio content. The process can compare the two and, based on the comparison, process the new section as needed to make it fit into the existing track. The processing may include one or more of compression, re-sampling, increasing the playback speed without changing the pitch (e.g., by using a commercially available audio processing tool, such as SOX), decreasing the playback speed (again without changing the pitch), editing the track to remove pauses, etc. These modification processes are described in further detail with reference to FIG. 5.
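  • A minimal sketch of the comparison-and-fit step follows, assuming the SoX command-line tool is installed; SoX's ‘tempo’ effect changes playback speed without changing pitch, which matches the behavior described above.

    import subprocess

    def fit_to_slot(in_path, out_path, segment_sec, slot_sec):
        """Speed a clip up (pitch preserved) so it fits its slot."""
        if segment_sec <= slot_sec:
            return in_path  # already fits; no processing needed
        ratio = segment_sec / slot_sec  # e.g., 6.0 s into a 5.0 s slot -> 1.2
        subprocess.run(["sox", in_path, out_path, "tempo", f"{ratio:.3f}"],
                       check=True)
        return out_path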
  • In some embodiments, the inserted additional audio description is mixed into only the center channel of a multi-channel audio track. In some embodiments, the inserted additional audio description is mixed into other channels of a multi-channel audio track, or into both the center channel and one or more of other channels of the audio track.
  • In some embodiments, the volume level in the specific section is reduced abruptly, such as by using a cut-off filter. In some embodiments, the volume level in the specific section is reduced gradually, such as by reducing from an original level to a reduced level over a specific time period.
  • In some embodiments, the volume level of the inserted new section of the audio track is adjusted to gradually increase over a period of time to the volume level of the specific section prior to the section's reduction in volume. In some embodiments, the volume level of the inserted new section is set at the volume level of the specific section prior to the section's reduction in volume without a gradual increase in volume level.
  • In some embodiments, the amount or degree of reduction in the volume level of the audio track in the specific section is varied depending upon the audio characteristic(s) of the specific section (i.e., containing or lacking dialog, containing or lacking a soundtrack, etc.).
  • Although the processes and methods described with reference to FIG. 2 may be used to add audio content to a film or program audio track for the purpose of enhancing the information available to a visually impaired viewer, another use of an embodiment of the systems and methods described herein is that of adding a voice-over translation of dialog to an audio track.
  • As mentioned, lektoring typically involves overlaying a voice speaking a different language onto the portion of an audio track in which dialog is being spoken in an original language by actors in a scene. It is a form of dubbing for an audience that speaks a different language than that spoken by the actors in a film or program. In a use of the processes and methods described herein for purposes of inserting a lektoring audio track, each section of dialog in the lektoring audio track may be inserted into the appropriate time slot corresponding to an interval of no spoken dialog in the original audio track (for example, the time slot following the spoken dialog to which a section of the lektoring audio track relates).
  • In another embodiment, the time slots of the original audio track in which dialog is spoken are identified and lowered in volume, and the appropriate lektoring section inserted into each time slot. In this embodiment, the time slots in which dialog is being spoken would be “ducked” by a certain amount and then the appropriate lektoring section would be inserted into the audio track at a higher volume level than the volume level of the section after the ducking operation.
  • FIG. 3A is an illustration of an example time period from T0 to TN of an audio track associated with a video recording. As shown, the time period includes multiple time segments or sections, which contain dialog and/or a soundtrack. For example, the time period or segment between T0 and T1 contains both dialog and a soundtrack, with the dialog during that time period varying in amplitude and the soundtrack being at a lower constant amplitude. Similarly, the time periods or segments between T1 and T2 and between T3 and TN−2 do not contain dialog but do contain the audio corresponding to the soundtrack.
  • FIG. 3B is an illustration of the audio track of FIG. 3A in which the soundtrack volume level has been lowered within a time segment or segments in which no dialog is spoken. As shown, the time periods or segments between T1 and T2 and between T3 and TN−2 do not contain dialog. In these sections the original volume level or amplitude of the soundtrack has been lowered or reduced by a specific amount. The amount by which the original level of the soundtrack is lowered may depend on the type of additional or new audio being inserted and/or the volume level of the soundtrack (which may include sound effects, music, or background noises).
  • FIG. 3C is an illustration of how the soundtrack volume level within a time segment or segments in which no dialog is spoken is lowered relatively abruptly or instead gradually decreased to the desired level. As shown, in the time period or segment between T1 and T2 the soundtrack volume level has been lowered relatively abruptly and uniformly to the desired level. In contrast, in the time period or segment between T3 and TN−2 the soundtrack volume level has been gradually decreased or increased at the ends of the time period or segment.
  • FIG. 3D is an illustration of an example time period from T0 to TN of an audio track that may be associated with a video recording and is used to illustrate an embodiment in which the original audio track is modified by use of lektoring. FIG. 3E is an illustration of the audio track of FIG. 3D in which the dialog volume level has been lowered within a time segment or segments in which dialog is spoken (such as that between T0 and T1 and between T2 and T3). As shown in the figure, the amplitude of the original dialog has been lowered in those time periods. FIG. 3F is an illustration of the audio track of FIG. 3E in which a new audio segment of dialog (such as used in lektoring) has been inserted within the same time period (T0 to T1) as a segment of original dialog, but at a higher volume level than the previously lowered volume of the original dialog. The same technique may of course be used in other time periods.
  • The reduction in volume entering a time segment is referred to as the “attack” and typically has a duration between 100 ms and 500 ms, with a default value of 250 ms in some instances. This corresponds to an abrupt although not instantaneous reduction in the original volume of the track. The increase in volume exiting a time segment is referred to as the “release” and may have a value in the range between 1000 ms and 2500 ms, with a default value of 2000 ms in some cases. This represents a more gradual increase in volume. Depending on the situation, the beginning or ending of a section of original dialog or other audio over which (or into which) the new audio is to be inserted may be subject to an abrupt, gradual, or delayed increase or decrease in volume level.
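  • A per-sample gain envelope built from these attack and release values might look like the following sketch; the linear ramps are an illustrative choice, and the function assumes the ducked span is long enough to hold both ramps.

    import numpy as np

    def ducking_envelope(n_samples, sr, duck_start_sec, duck_end_sec,
                         reduction_db=9.0, attack_ms=250, release_ms=2000):
        """Gain curve: ramp down over the attack, hold, ramp up over the release."""
        low = 10.0 ** (-reduction_db / 20.0)
        env = np.ones(n_samples)
        a = int(sr * attack_ms / 1000)
        r = int(sr * release_ms / 1000)
        s, e = int(duck_start_sec * sr), int(duck_end_sec * sr)
        env[s:s + a] = np.linspace(1.0, low, a)   # attack ramp
        env[s + a:e] = low                        # held ducked level
        tail = min(r, n_samples - e)
        env[e:e + tail] = np.linspace(low, 1.0, tail)  # release ramp
        return env  # multiply element-wise with the track to apply the duck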
  • FIGS. 4A, 4B, and 4C are illustrations of example ways in which the original dialog in a time period may be lowered in volume for a case in which dialog used for lektoring will be added. In FIG. 4A the original dialog is lowered in volume abruptly, in FIG. 4B the original dialog is lowered in volume gradually, and in FIG. 4C the original dialog is played at a lowered volume after a slight delay. As shown in FIG. 4C, the original dialog may be heard at a lowered volume level starting at a short time period (e.g., 1 second) after the dialog being spoken in the different language is heard.
  • FIG. 5 is an illustration of how the original length in time of an audio track containing additional audio may be processed to adjust its length prior to being inserted into a section of an original audio track. As shown, in the example the original length of the audio track to be inserted into the time period or segment between T3 and TN−2 exceeds the length of the time period. In such a situation, the audio track containing the new audio may require further processing to make its length fit into the available time period. Such further processing or modification may include one or more of compression, re-sampling, increasing the playback speed (while maintaining the pitch), decreasing the playback speed (while maintaining the pitch), editing the track to remove pauses, etc. The amount of modification of the new section of audio track may vary depending upon the amount by which the new section needs to be shortened or lengthened to fit the time slot into which it is to be inserted.
  • As an example, performing a compression process on the new section of audio track involves reducing the amount of data used to store the track. This can reduce the amount of time required to play back the section and hence make it possible to fit the new section into the original length of time available in the desired time slot.
  • Re-sampling the new section of audio track can be used to achieve a similar result, that is to reduce the amount of data required for the new section and hence the playback time.
  • Increasing or decreasing the playback speed (while maintaining the pitch) can be used to decrease or increase the length of time required to play back the new section by altering how fast the new section is played. This can be used to alter the total time required for playback of the new section and hence make it possible to fit the new section into the original length of time available in the desired time slot.
  • Another possible modification is to identify and remove pauses or periods of silence in the new section of track (if it is longer in time than the available time slot) or add periods of silence in between dialog or sounds in the new section of track (if it is shorter in time than the available time slot). In either case, the result is to modify the length of time required for the new section of audio track so that it fits into the desired time slot.
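  • A simple form of the pause-removal variant is sketched below: runs of near-silence longer than a cap are shortened to that cap, trimming the clip without touching speech. The amplitude floor and the 150 ms cap are illustrative assumptions, not values from this disclosure.

    import numpy as np

    def trim_pauses(samples, sr, floor=1e-3, max_pause_ms=150):
        """Collapse long stretches of near-silence, shortening the clip."""
        max_run = int(sr * max_pause_ms / 1000)
        out, run = [], 0
        for x in samples:
            if abs(x) < floor:
                run += 1
                if run <= max_run:
                    out.append(x)  # keep silence only up to the cap
            else:
                run = 0
                out.append(x)
        return np.array(out)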
  • It has been noticed that some neighboring sections of audio description may be very close to each other and may have a minimal time separation between them. In such cases, a default value used for a ducking process may cause the release and attack segments of audio to be very close to each other. This may create a noticeable and potentially annoying audio effect for a listener. To reduce this potentially annoying audio effect, in some embodiments, an audio description dialog merging module may be implemented. In operation, the gap between neighboring audio description segments is calculated. In the situation of two descriptions with a separation gap of less than 50 ms, the two segments are merged into one event.
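  • The merging rule itself reduces to a single pass over time-ordered events, as in this sketch:

    def merge_close_descriptions(segments, max_gap_sec=0.05):
        """Merge neighboring description events separated by under 50 ms.

        'segments' is assumed to be a time-sorted list of (start, end)
        pairs in seconds.
        """
        merged = [list(segments[0])]
        for start, end in segments[1:]:
            if start - merged[-1][1] < max_gap_sec:
                merged[-1][1] = end  # extend the previous event over the gap
            else:
                merged.append([start, end])
        return [tuple(m) for m in merged]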
  • FIG. 6A is an illustration of how a machine learning model may be trained to identify segments or sections of an audio track having a specific audio characteristic. As shown, in one example, a machine learning model may be trained by using a set of training data. The training data may include (a) segments of audio, with some segments having a specific audio characteristic and some segments lacking the specific audio characteristic, and (b) a corresponding label, indicator, or annotation for each segment specifying the presence or absence of the specific audio characteristic. The audio segments and labels are input to the model to “train” the model. When trained, the model will operate to respond to a new input sample of an audio track by providing an output that indicates whether the input sample has the specific audio characteristic (such as containing dialog) or does not have the specific audio characteristic.
  • FIG. 6B is an illustration of how the trained machine learning model of FIG. 6A may be used to classify a new segment of an audio track to determine if the new segment includes a specific audio characteristic (such as including or not including dialog). As shown, a new segment or sample of an audio track is input to the trained machine learning model. In response, the model provides an output that represents a classification of the input sample as either including or not including the specific audio characteristic (such as whether the track segment includes spoken dialog or does not). Thus, for example, the trained model would identify the time segment between T0 and T1 as including spoken dialog, and the time segment between T3 and TN−2 as not including spoken dialog. A toy sketch of this train-and-classify flow follows.
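  • For illustration only, the following toy sketch shows the train-and-classify flow of FIGS. 6A and 6B using scikit-learn. Random vectors stand in for real spectral features (such as MFCCs), so the numbers are placeholders rather than a working dialog detector.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# (a) training segments as feature vectors, (b) labels: 1 = contains
# spoken dialog, 0 = does not. Both are placeholders for this sketch.
X_train = rng.normal(size=(200, 13))    # e.g. 13 MFCC means per segment
y_train = rng.integers(0, 2, size=200)  # placeholder annotations

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# FIG. 6B: a new segment of the audio track is featurized the same way
# and classified as including / not including dialog.
new_segment = rng.normal(size=(1, 13))
print("contains dialog" if model.predict(new_segment)[0] else "no dialog")
```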
  • FIG. 7 is a flow diagram illustrating a method, operation, function, or process for converting dialog into synthesized speech and mixing the synthesized speech with an original audio track to add an audio description (AD) or a lektor, in accordance with an embodiment of the systems and methods described herein. In general terms, the inputs to the processing pipeline are the target-language dialog list and the original audio track; the synthesized speech is generated from the dialog list. In some embodiments, the original audio track may be either stereo or 5.1 audio. The output of the pipeline is the mixed version of the synthesized audio and the original audio.
  • As shown in FIG. 7, the content to be inserted into an audio track is provided as a dialog list 702. The dialog list 702 includes one or more of a translation of the dialog in a film or television program (as would be used in lektoring) or dialog describing a visual element of the film or program (as would be used in providing an audio description (AD)). The input dialog file may be a spreadsheet file that contains the target-language dialog list with in and out timecodes. To allow for more customization in the synthesized audio, input text may be converted into Speech Synthesis Markup Language (SSML) format. SSML provides control over pauses, formatting for dates and times, speaker pitch, and speaking rate; a brief sketch of generating such markup follows.
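  • As a hedged illustration, the snippet below converts one line of dialog into SSML with a speaking rate and an optional pause. The tags shown are standard SSML; the row fields (text, rate, pause) are assumptions for the sketch, not the format of the actual dialog list.

```python
def row_to_ssml(text: str, rate: str = "medium", pause_ms: int = 0) -> str:
    """Wrap a line of dialog in SSML with a speaking rate and optional pause."""
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return f'<speak><prosody rate="{rate}">{text}</prosody>{pause}</speak>'

# A faster-than-normal rate plus a 250 ms trailing pause.
print(row_to_ssml("She opens the letter and smiles.", rate="110%", pause_ms=250))
```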
  • Continuing the flow in FIG. 7, the input dialog is parsed by a script parser 704. Script parsing involves processing the input dialog list using the following steps: 1) on-screen text and timing are extracted; 2) the dialog text and its in and out times are extracted from the input dialog list file; 3) markup terms like 'deep breath' and 'overlap' are removed from the dialog text; 4) dates, such as year numbers, are converted into SSML date formatting; and 5) to control the speech rate, text can be converted into SSML with a different rate. A sketch of these parsing steps follows.
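  • The following sketch illustrates steps 2) through 4), assuming the dialog list has already been read into rows of (in timecode, out timecode, text). The parenthesized markup format and the regular expressions are illustrative assumptions, not the parser's actual rules.

```python
import re

# Assumed markup convention: terms such as (deep breath) or (overlap)
# appear inline in parentheses and should be stripped from the dialog.
MARKUP_TERMS = re.compile(r"\s*\((?:deep breath|overlap)\)\s*", re.IGNORECASE)

def parse_row(row):
    """Extract in/out times and cleaned dialog text from one dialog-list row."""
    t_in, t_out, text = row
    text = MARKUP_TERMS.sub(" ", text).strip()  # step 3: drop markup terms
    # step 4: wrap bare years in SSML date formatting
    text = re.sub(r"\b(\d{4})\b",
                  r'<say-as interpret-as="date" format="y">\1</say-as>', text)
    return t_in, t_out, text

print(parse_row(("00:01:02:10", "00:01:05:00",
                 "(deep breath) In 1969 they landed.")))
```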
  • The text or SSML output from the parsing operation is then converted to synthesized speech using a text-to-speech engine 706. In some cases, the text-to-speech engine may be provided by a third party. In some cases, the synthesized speech segment may be longer than the assigned time slot. As a result, in some embodiments, the text-to-speech process steps for insertion of an AD (which may also apply to a lektor segment in some cases) may include the following: 1) call a text-to-speech engine (whether local or third party) to generate the synthesized speech; 2) measure the synthesized speech length; and 3) if the length is longer than the assigned time slot length, calculate the speed-up ratio and call the text-to-speech engine a second time with the new speech rate. This measure-and-retry loop is sketched below.
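  • The measure-and-retry loop might look like the sketch below, where synthesize() is a hypothetical stand-in for whichever local or third-party text-to-speech engine is used; its name, signature, and fake duration model are assumptions, not a real API.

```python
def synthesize(ssml: str, rate: float = 1.0) -> float:
    """Hypothetical TTS call; returns the synthesized speech length in seconds."""
    base = 0.06 * len(ssml)  # fake duration model, for the sketch only
    return base / rate

def synthesize_to_fit(ssml: str, slot_len_s: float) -> float:
    length = synthesize(ssml)                     # 1) first synthesis pass
    if length > slot_len_s:                       # 2)-3) measure and compare
        speed_up = length / slot_len_s            # calculate the speed-up ratio
        length = synthesize(ssml, rate=speed_up)  # second pass at the new rate
    return length

print(synthesize_to_fit("<speak>She opens the letter and smiles.</speak>", 2.0))
```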
  • In some embodiments, the speaking rate of the new audio description, as expressed in synthesized speech, may be altered in order to reduce the time required to describe a specific item of dialog. The generated synthesized speech is then mixed with the original audio track by an automated mixing process 708 (as generally described above with reference to FIGS. 1-6B). A simplified sketch of this mixing step follows.
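  • A simplified version of the ducking-and-summing arithmetic is sketched below with NumPy. The sample rate, the approximately 9 dB duck, and the ramped attack and release are illustrative values consistent with the description above, not the system's actual parameters.

```python
import numpy as np

def duck_and_mix(original, speech, start, sr=48_000, duck_db=-9.0, ramp_s=0.05):
    """Reduce original-track volume over the speech span, then add the speech."""
    out = original.astype(np.float64).copy()
    n, ramp = len(speech), int(ramp_s * sr)
    gain = 10 ** (duck_db / 20)                # -9 dB -> ~0.355 linear
    env = np.full(n, gain)
    env[:ramp] = np.linspace(1.0, gain, ramp)  # attack: ramp volume down
    env[-ramp:] = np.linspace(gain, 1.0, ramp) # release: ramp volume back up
    out[start:start + n] = out[start:start + n] * env + speech
    return out

sr = 48_000
original = np.random.uniform(-0.1, 0.1, sr * 5)  # 5 s of placeholder audio
speech = np.random.uniform(-0.3, 0.3, sr * 2)    # 2 s "synthesized" segment
mixed = duck_and_mix(original, speech, start=sr) # insert at t = 1 s
```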
  • Accordingly, the present disclosure is directed to systems and methods for mixing a new section of an audio track into an existing audio track, where the existing track is associated with a video recording. More specifically, the disclosure is directed to systems and methods for mixing a synthesized voice description of a portion of a video recording with the existing audio track associated with the recording. The synthesized voice description can be used to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog. The automated process described herein enables a visually impaired viewer to be provided with additional information about a scene in a film or television program that might otherwise not be apparent to them and does so in a more efficient and comprehensible manner than conventional approaches.
  • Embodiments of the systems and methods described herein enable a visually impaired viewer to obtain greater enjoyment of (or information from) a film or television program by improving upon the conventional way in which an audio track associated with a video is modified or enhanced by a human sound mixer. Instead of a manual, time-consuming process involving a sound mixer and a voice actor, or one in which a synthesized voice is simply overlaid on top of an existing sound track, the embodiments described herein improve the conventional mixing process by automatically detecting or identifying specific time slots in which to place new audio content, then automatically adjusting the volume level in those time slots and inserting new audio content in a synthesized voice into the time slot or slots at a specific volume level.
  • Embodiments of the system and methods described herein may also (or instead) be used to insert an audio section containing dialog spoken in a different language into an audio track in an automated and efficient manner, and one in which the inserted section is more comprehensible to a listener.
  • Example Embodiments
  • 1. A computer-implemented method, comprising: accessing an audio track that is associated with a video recording; identifying a section of the accessed audio track having a specific audio characteristic; reducing a volume level of the audio track in the identified section; accessing an audio segment that includes a synthesized voice; and inserting the accessed audio segment into the identified section of the audio track, the inserted audio segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • 2. The computer-implemented method of claim 1, wherein the specific audio characteristic is that the identified section includes no spoken dialog.
  • 3. The computer-implemented method of claim 2, wherein the volume is reduced by an amount between 6 and 12 decibels (dB).
  • 4. The computer-implemented method of claim 2, wherein the accessed audio segment includes an audio description of a portion of the video recording.
  • 5. The computer-implemented method of claim 4, wherein the audio description provides additional information regarding the portion of the video recording, the additional information including an explanation or description of one or more of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • 6. The computer-implemented method of claim 5, wherein the portion of the video recording for which the additional information is provided corresponds to the inserted section of the audio track.
  • 7. The computer-implemented method of claim 2, wherein the accessed audio segment includes dialog spoken in a different language than dialog in the audio track prior to implementing the method.
  • 8. The computer-implemented method of claim 1, wherein the specific audio characteristic is that the identified section includes spoken dialog.
  • 9. The computer-implemented method of claim 8, wherein the volume is reduced by an amount of approximately 9 decibels (dB).
  • 10. The computer-implemented method of claim 1, further comprising processing the audio segment to alter the segment's length in time prior to inserting the audio segment into the identified section of the audio track.
  • 11. The computer-implemented method of claim 10, wherein the processing of the audio segment to alter the segment's length in time further comprises increasing the length of time of the audio segment.
  • 12. The computer-implemented method of claim 10, wherein the processing of the audio segment to alter the segment's length in time further comprises decreasing the length of time of the audio segment.
  • 13. The computer-implemented method of claim 1, wherein the amount of reduction in the volume level of the audio track in the identified section depends upon the audio characteristic of the identified section.
  • 14. The computer-implemented method of claim 1, wherein identifying a section of the accessed audio track having a specific audio characteristic is performed, at least in part, by a machine learning model.
  • 15. A system comprising: at least one physical electronic processor; and a physical electronic memory comprising computer-executable instructions that, when executed by the physical electronic processor, cause the physical electronic processor to: access an audio track that is associated with a video recording; identify a section of the accessed audio track having a specific audio characteristic; reduce a volume level of the audio track in the identified section; access an audio segment that includes a synthesized voice; and insert the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • 16. The system of claim 15, wherein the specific audio characteristic is that the identified section is one in which no dialog is spoken, and the audio segment includes an audio description of a portion of the video recording.
  • 17. The system of claim 16, wherein the audio description provides additional information regarding the portion of the video recording, the additional information including an explanation or description of at least one of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
  • 18. The system of claim 15, wherein the audio segment is processed to alter its length in time prior to inserting the audio segment into the identified section of the audio track, and further, wherein the processing may vary depending upon the amount by which the audio segment needs to be shortened or lengthened to fit the time slot into which it is to be inserted.
  • 19. A non-transitory computer-readable medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to: access an audio track that is associated with a video recording; identify a section of the accessed audio track having a specific audio characteristic; reduce a volume level of the audio track in the identified section; access an audio segment that includes a synthesized voice; and insert the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
  • 20. The non-transitory computer-readable medium of claim 19, further comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to process the audio segment to alter its length in time prior to inserting the audio segment into the identified section of the audio track.
  • As detailed above, the computing devices and systems described and/or illustrated herein broadly represent any type or form of computing device or system capable of executing computer-readable instructions, such as those contained within the modules described herein. In their most basic configuration, these computing device(s) may each include at least one memory device and at least one physical electronic processor.
  • In some examples, the term “memory device” generally refers to any type or form of volatile or non-volatile storage device or medium capable of storing data and/or computer-readable instructions. In one example, a memory device may store, load, and/or maintain one or more of the modules described herein. Examples of memory devices include, without limitation, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, Hard Disk Drives (HDDs), Solid-State Drives (SSDs), optical disk drives, caches, variations or combinations of one or more of the same, or any other suitable storage memory.
  • In some examples, the term “physical processor” generally refers to any type or form of hardware-implemented processing unit capable of interpreting and/or executing computer-readable instructions. In one example, a physical processor may access and/or modify one or more modules stored in the above-described memory device. Examples of physical processors include, without limitation, microprocessors, microcontrollers, Central Processing Units (CPUs), Field-Programmable Gate Arrays (FPGAs) that implement softcore processors, Application-Specific Integrated Circuits (ASICs), portions of one or more of the same, variations or combinations of one or more of the same, or any other suitable physical processor.
  • Although illustrated as separate elements, the modules described and/or illustrated herein may represent portions of a single module or application. In addition, in certain embodiments one or more of these modules may represent one or more software applications or programs that, when executed by a computing device, may cause the computing device to perform one or more tasks. For example, one or more of the modules described and/or illustrated herein may represent modules stored and configured to run on one or more of the computing devices or systems described and/or illustrated herein. One or more of these modules may also represent all or portions of one or more special-purpose computers configured to perform one or more tasks.
  • In addition, one or more of the modules described herein may transform data, physical devices, and/or representations of physical devices from one form to another. For example, one or more of the modules recited herein may receive data to be transformed, transform the data, output a result of the transformation to perform a function, use the result of the transformation to make a decision or perform another function, and store the result of the transformation. Additionally, or alternatively, one or more of the modules recited herein may transform a processor, volatile memory, non-volatile memory, and/or any other portion of a physical computing device from one form to another by executing on the computing device, storing data on the computing device, and/or otherwise interacting with the computing device.
  • In some embodiments, the term “computer-readable medium” generally refers to any form of device, carrier, or medium capable of storing or carrying computer-readable instructions. Examples of computer-readable media include, without limitation, transmission-type media, such as carrier waves, and non-transitory-type media, such as magnetic-storage media (e.g., hard disk drives, tape drives, and floppy disks), optical-storage media (e.g., Compact Disks (CDs), Digital Video Disks (DVDs), and BLU-RAY disks), electronic-storage media (e.g., solid-state drives and flash media), and other distribution systems.
  • The process parameters and sequence of the steps described and/or illustrated herein are given by way of example only and can be varied as desired. For example, while the steps illustrated and/or described herein may be shown or discussed in a particular order, these steps do not necessarily need to be performed in the order illustrated or discussed. The various exemplary methods described and/or illustrated herein may also omit one or more of the steps described or illustrated herein or include additional steps in addition to those disclosed.
  • The preceding description has been provided to enable others skilled in the art to best utilize various aspects of the exemplary embodiments disclosed herein. This exemplary description is not intended to be exhaustive or to be limited to any precise form disclosed. Many modifications and variations are possible without departing from the spirit and scope of the present disclosure. The embodiments disclosed herein should be considered in all respects illustrative and not restrictive. Reference should be made to the appended claims and their equivalents in determining the scope of the present disclosure.
  • Unless otherwise noted, the terms “connected to” and “coupled to” (and their derivatives), as used in the specification and claims, are to be construed as permitting both direct and indirect (i.e., via other elements or components) connection. In addition, the terms “a” or “an,” as used in the specification and claims, are to be construed as meaning “at least one of.” Finally, for ease of use, the terms “including” and “having” (and their derivatives), as used in the specification and claims, are interchangeable with and have the same meaning as the word “comprising.”

Claims (24)

1. A computer-implemented method, comprising:
accessing audio of an audio track that is associated with a video recording;
detecting, within an identified section of the audio track, presence or absence of spoken dialog;
in response to detecting the presence or absence of spoken dialog within the section of the audio track, reducing a volume level of the audio track in the identified section;
accessing an audio segment that includes a voice; and
modifying the audio track by inserting the accessed audio segment into the identified section of the audio track, the inserted audio segment having a higher volume level than the reduced volume level of the audio track in the identified section.
2. The computer-implemented method of claim 1, wherein detecting the presence or absence of spoken dialog comprises automatically detecting the absence of spoken dialog.
3. The computer-implemented method of claim 2, wherein automatically detecting the absence of spoken dialog comprises detecting a substantial reduction in frequency components associated with speech on the audio track.
4. The computer-implemented method of claim 2, wherein the accessed audio segment includes an audio description of a portion of the video recording, the audio description comprising dialog describing a visual element of the video recording for a visually impaired viewer.
5. The computer-implemented method of claim 4, wherein the audio description provides additional information regarding the portion of the video recording, the additional information including an explanation or description of one or more of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
6. The computer-implemented method of claim 5, wherein the portion of the video recording for which the additional information is provided corresponds to the section of the audio track detected to have the absence of spoken dialog.
7. (canceled)
8. The computer-implemented method of claim 1, wherein detecting the presence or absence of spoken dialog comprises automatically detecting the presence of spoken dialog.
9. (canceled)
10. The computer-implemented method of claim 1, further comprising processing the audio segment to alter a length of time of the audio segment prior to inserting the audio segment into the identified section of the audio track.
11. The computer-implemented method of claim 10, wherein the processing of the audio segment to alter the length of time of the audio segment further comprises at least one of increasing or decreasing the length of time of the audio segment to match a length of time of the identified section of the audio track.
12. (canceled)
13. The computer-implemented method of claim 1, wherein an amount of reduction in the volume level of the audio track in the identified section depends upon at least one characteristic of the identified section.
14. The computer-implemented method of claim 1, wherein detecting the presence or the absence of spoken dialog within the identified section of the audio track is performed, at least in part, by training a machine learning model to classify samples of audio tracks as containing or not containing spoken dialog and classifying the identified section of the audio track using the trained machine learning model.
15. A system comprising:
at least one physical electronic processor; and
a physical electronic memory comprising computer-executable instructions that, when executed by the physical electronic processor, cause the physical electronic processor to:
access audio of an audio track that is associated with a video recording;
detect, within an identified section of the audio track, presence or absence of spoken dialog;
reduce a volume level of the audio track in the identified section;
access an audio segment that includes a voice; and
modify the audio track by inserting the accessed audio segment into the identified section of the audio track, the inserted audio segment having a higher volume level than the reduced volume level of the audio track in the identified section.
16. The system of claim 15, wherein detecting the presence or absence of spoken dialog comprises detecting the absence of spoken dialog, and the audio segment includes an audio description of a portion of the video recording for the identified section of the audio track detected to have the absence of spoken dialog.
17. The system of claim 16, wherein the audio description provides additional information regarding the portion of the video recording, the additional information including an explanation or description of at least one of a scene, an object in a scene, a character's expression, a character's clothing, or a plot element.
18. The system of claim 15, wherein the audio segment is processed to alter its length in time prior to inserting the audio segment into the identified section of the audio track, and further, wherein the processing varies depending upon the amount by which the audio segment needs to be shortened or lengthened to fit a time slot in the audio track into which it is to be inserted.
19. A non-transitory computer-readable medium comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to:
access audio of an audio track that is associated with a video recording;
detect, within an identified section of the audio track, presence or absence of spoken dialog;
in response to detecting the presence or absence of spoken dialog within the section of the audio track, reduce a volume level of the audio track in the identified section;
access an audio segment that includes a voice; and
modify the audio track by inserting the accessed audio segment into the identified section of the audio track, the inserted segment having a higher volume level than the reduced volume level of the audio track in the identified section.
20. The non-transitory computer-readable medium of claim 19, further comprising one or more computer-executable instructions that, when executed by at least one processor of a computing device, cause the computing device to process the audio segment to alter its length in time prior to inserting the audio segment into the identified section of the audio track.
21. (canceled)
22. The computer-implemented method of claim 8, wherein the accessed audio segment includes dialog spoken in a different language than the spoken dialog detected in the identified section of the audio track.
23. The computer-implemented method of claim 13, wherein:
the characteristic of the identified section comprises presence of a soundtrack within the identified section; and
the amount of reduction in the volume level of the audio track in the identified section depends upon the presence of the soundtrack.
24. The computer-implemented method of claim 1, wherein detecting the presence or the absence of spoken dialog within the identified section of the audio track is performed, at least in part, by a Voice Activity Detection (VAD) technique.