US20230169991A1 - Systems and methods for improving audio conferencing services
- Publication number
- US20230169991A1 (application US 18/102,146)
- Authority
- US
- United States
- Prior art keywords
- conference
- utterance
- audio
- conference participant
- participant
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/563—User guidance or feature selection
- H04M3/566—User guidance or feature selection relating to a participants right to speak
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/686—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/005—Reproducing at a different information rate from the information rate of recording
-
- G—PHYSICS
- G11—INFORMATION STORAGE
- G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
- G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements, protocols or services for supporting real-time applications in data packet communication
- H04L65/40—Support for services or applications
- H04L65/403—Arrangements for multi-party communication, e.g. for conferences
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/42221—Conversation recording systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/568—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/56—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
- H04M3/568—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
- H04M3/569—Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants using the instant speaker's algorithm
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/147—Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/60—Information retrieval; Database structures therefor; File system structures therefor of audio data
- G06F16/68—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/227—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of the speaker; Human-factor methodology
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L12/00—Data switching networks
- H04L12/02—Details
- H04L12/16—Arrangements for providing special services to substations
- H04L12/18—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
- H04L12/1813—Arrangements for providing special services to substations for broadcast or conference, e.g. multicast for computer conferences, e.g. chat rooms
- H04L12/1831—Tracking arrangements for later retrieval, e.g. recording contents, participants activities or behavior, network status
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2203/00—Aspects of automatic or semi-automatic exchanges
- H04M2203/30—Aspects of automatic or semi-automatic exchanges related to audio recordings in general
- H04M2203/303—Marking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2203/00—Aspects of automatic or semi-automatic exchanges
- H04M2203/30—Aspects of automatic or semi-automatic exchanges related to audio recordings in general
- H04M2203/305—Recording playback features, e.g. increased speed
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2203/00—Aspects of automatic or semi-automatic exchanges
- H04M2203/50—Aspects of automatic or semi-automatic exchanges related to audio conference
- H04M2203/5072—Multiple active speakers
Definitions
- This disclosure relates to services provided during and after audio conferencing.
- Conferencing is an important way for a set of individuals who are remote from one another to communicate.
- Existing conferencing systems connect the conference participants and play the same audio or video to all participants in real time. These conferencing systems are associated with several disadvantages.
- Participants may tend to interrupt one another. Such interruptions cause the participants to lose their train of thought, and ideas are lost.
- The original participant may be distracted listening to what the interrupter is saying and is likely to lose his original thought.
- If the interrupter instead waits to speak until the original participant is done, the interrupter may lose his own thought and may never find a moment to contribute it to the conversation.
- One aspect relates to processing audio content of a conference.
- a first audio signal is received from a first conference participant, and a start and an end of a first utterance by the first conference participant are detected from the first audio signal.
- a second audio signal is received from a second conference participant, and a start and an end of a second utterance by the second conference participant are detected from the second audio signal.
- the second conference participant is provided with at least a portion of the first utterance at a time that is determined based at least in part on the start, the end, or both the start and the end of the second utterance.
- the time corresponds to at least one of a start time for providing the portion of the first utterance, a start point of the portion of the first utterance, and a duration of the first utterance.
- the portion of the first utterance is provided to the second conference participant before the start of the second utterance or after the end of the second utterance.
- the first utterance and the second utterance may overlap in time, and the providing of the portion of the first utterance may be based on determining that the first and second utterances overlap in time.
- the start of the second utterance may occur after the start of the first utterance and before the end of the first utterance.
- the portion of the first utterance may be based on a previous portion of the first utterance that is provided to the second conference participant before the start of the second utterance.
- the first and the second conference participants may be switched to a mode in which utterances are played sequentially to the first and second conference participants.
- in response to detecting the start of the second utterance, the providing of the portion of the first utterance to the second conference participant is stopped.
- An indication may be stored of a point in the portion of the first utterance at which the providing to the second conference participant was stopped, where in response to detecting the end of the second utterance, the providing of the portion of the first utterance to the second conference participant is resumed at the point referenced by the stored indication.
- resuming the providing of the portion of the first utterance to the second conference participant at the point referenced by the stored indication may include accessing a recorded version of the first audio signal at the point referenced by the stored indication, playing the portion of the first utterance from the point referenced by the stored indication (optionally at an accelerated rate), and providing conference audio to the second conference participant in real time when playback of the recorded version terminates.
- the recorded version of the first audio signal may be stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of the conference participants. Playing the portion of the first utterance may include playing the plurality of audio clips sequentially from the point referenced by the stored indication.
- the recorded version of the first audio signal is stored as at least some of a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants.
- the plurality of audio clips may be played from the point referenced by the stored indication in the same manner in which they were recorded, where two or more of the plurality of audio clips are played in an overlapping manner when the corresponding conference audio included overlapping utterances from multiple conference participants.
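- As a rough illustration of this delivery-timing logic, the following Python sketch (all names and the simple timing rule are assumptions introduced here, not taken from the patent) checks whether two utterances overlap and, if they do, defers delivery of the first speaker's clip until the second speaker's own utterance has ended:

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str
    start: float   # seconds from the start of the conference
    end: float


def overlap(a: Utterance, b: Utterance) -> bool:
    """True if the two utterances overlap in time."""
    return a.start < b.end and b.start < a.end


def delivery_time(first: Utterance, second: Utterance) -> float:
    """When to provide the first speaker's clip to the second speaker.

    If the utterances overlap, delivery is deferred until the second speaker
    finishes; otherwise the clip is simply provided once it is complete.
    """
    return second.end if overlap(first, second) else first.end


alice = Utterance("Alice", start=0.0, end=4.0)
bob = Utterance("Bob", start=2.5, end=6.0)   # Bob starts while Alice is still talking
print(delivery_time(alice, bob))             # 6.0 -> Alice's clip is played to Bob after he stops
```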
- the start of the first utterance by the first conference participant is detected by monitoring a volume level of an audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to a threshold value, and determining the start of the utterance when the monitored volume level of the audio stream exceeds the threshold value.
- the end of the first utterance by the first conference participant may be detected by monitoring the volume level of the audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to the threshold value, and determining the end of the utterance when the monitored volume level of the audio stream falls below the threshold value for a predefined duration of time.
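- A minimal sketch of this threshold-based detection in Python is shown below; the frame length, RMS threshold, and hold time are illustrative assumptions rather than values specified in the disclosure:

```python
import math
from typing import Iterable, Iterator, List, Tuple

FRAME_SECONDS = 0.02     # 20 ms analysis frames (assumption)
START_THRESHOLD = 0.02   # normalized RMS level treated as speech (assumption)
END_HOLD_SECONDS = 0.8   # silence required before declaring the end of an utterance (assumption)


def rms(frame: List[float]) -> float:
    """Root-mean-square level of one frame of normalized samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def utterance_events(frames: Iterable[List[float]]) -> Iterator[Tuple[str, float]]:
    """Yield ("start", t) and ("end", t) events from a stream of audio frames.

    A start is declared when the frame level exceeds the threshold; an end is
    declared only after the level stays below the threshold for END_HOLD_SECONDS.
    """
    in_utterance = False
    silent_for = 0.0
    for i, frame in enumerate(frames):
        t = i * FRAME_SECONDS
        loud = rms(frame) > START_THRESHOLD
        if not in_utterance and loud:
            in_utterance, silent_for = True, 0.0
            yield ("start", t)
        elif in_utterance:
            if loud:
                silent_for = 0.0
            else:
                silent_for += FRAME_SECONDS
                if silent_for >= END_HOLD_SECONDS:
                    in_utterance = False
                    yield ("end", t)
```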
- detecting the start of the first utterance includes receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output. Detecting the end of the first utterance may include receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
- a recording of the first utterance by the first conference participant may be initiated, and in response to detecting the end of the first utterance, the recording of the first utterance by the first conference participant may be terminated.
- the recorded utterance is stored as an audio clip in a playlist, where the playlist includes a plurality of audio clips of utterances by other conference participants.
- the stored audio clip in the playlist may be automatically categorized under a section identifying the conference or a subject of the conference, and may be automatically tagged with information identifying the first conference participant.
- User input may be received that is indicative of data to associate with the stored audio clip in the playlist, and the data may be stored with an association to the stored audio clip.
- the stored data may include at least one of a subject, description, transcription, keyword, flag, digital file, and uniform resource locator.
- the system comprises an audio detector configured to receive a first audio signal from a first conference participant, detect, from the first audio signal, a start and an end of a first utterance by the first conference participant, receive a second audio signal from a second conference participant, and detect, from the second audio signal, a start and an end of a second utterance by the second conference participant.
- the system further comprises a transmitter configured to provide, to the second conference participant, a portion of the first utterance including a delayed version of at least a portion of the first utterance at a time determined based at least in part on the start, the end, or both the start and the end of the second utterance.
- One aspect relates to a non-transitory computer-readable medium comprising computer-readable instructions encoded thereon for processing audio content of a conference.
- the computer-readable instructions comprise instructions for receiving a first audio signal from a first conference participant, detecting, from the first audio signal, a start and an end of a first utterance by the first conference participant, receiving a second audio signal from a second conference participant, and detecting, from the second audio signal, a start and an end of a second utterance by the second conference participant.
- the computer-readable instructions further comprise instructions for providing, to the second conference participant, at least a portion of the first utterance at a time determined based at least in part on the start, the end, or both the start and the end of the second utterance.
- a processor provides audio from the conference to a first conference participant, detects a start of an utterance by the first conference participant, and in response to detecting the start of the utterance, stops the provision of the audio from the conference to the first conference participant.
- An indication of a point in the audio from the conference at which the provision of the audio from the conference to the first conference participant was stopped is stored, and an end of the utterance by the first conference participant is detected.
- the processor resumes the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication.
- detecting a start of an utterance by the first conference participant comprises monitoring a volume level of an audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to a threshold value, and determining the start of the utterance when the monitored volume level of the audio stream exceeds the threshold value.
- Detecting an end of the utterance by the first conference participant may comprise monitoring the volume level of the audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to the threshold value, and determining the end of the utterance when the monitored volume level of the audio stream falls below the threshold value for a predefined duration of time.
- detecting a start of an utterance by the first conference participant comprises receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output.
- An end of the utterance by the first conference participant may be detected by receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
- the processor, in response to detecting the start of the utterance, initiates a recording of the utterance by the first conference participant. In response to detecting the end of the utterance, the processor terminates the recording of the utterance by the first conference participant.
- the recorded utterance may be stored as an audio clip in a playlist, where the playlist includes a plurality of audio clips of utterances by other conference participants.
- the stored audio clip in the playlist may be automatically categorized under a section identifying the conference or a subject of the conference, and the stored audio clip may be automatically tagged with information identifying the first conference participant.
- resuming the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication comprises accessing a recorded version of the audio from the conference at the point referenced by the stored indication, playing the recorded version of the audio from the conference from the point referenced by the stored indication at an accelerated rate, and providing the audio from the conference to the first conference participant in real time when playback of the recorded version terminates.
- the recorded version of the audio from the conference is stored as a plurality of audio clips in a playlist, where each audio clip includes an utterance by one of a plurality of conference participants.
- Playing the recorded version of the audio from the conference may comprise playing the plurality of audio clips from the point referenced by the stored indication in the same manner in which they were recorded, wherein two or more of the plurality of audio clips are played in an overlapping manner when the corresponding audio from the conference included overlapping utterances from multiple conference participants.
- the processor receives user input of data to associate with the stored audio clip in the playlist and stores the data with an association to the stored audio clip.
- the stored data comprises at least one of a subject, description, transcription, keyword, flag, digital file, and uniform resource locator.
- the system comprises a transmitter configured to provide audio from the conference to a first conference participant and an audio detector configured to detect a start of an utterance by the first conference participant and detect an end of the utterance by the first conference participant.
- the system further comprises a processor configured to, in response to detecting the start of the utterance, stop the provision of the audio from the conference to the first conference participant.
- the processor is further configured to store, in a memory, an indication of a point in the audio from the conference at which the provision of the audio from the conference to the first conference participant was stopped, and in response to detecting the end of the utterance, resume the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication.
- the audio detector is configured to detect a start of an utterance by the first conference participant by monitoring a volume level of an audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to a threshold value, and determining the start of the utterance when the monitored volume level of the audio stream exceeds the threshold value.
- the audio detector may be configured to detect an end of the utterance by the first conference participant by monitoring the volume level of the audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to the threshold value, and determining the end of the utterance when the monitored volume level of the audio stream falls below the threshold value for a predefined duration of time.
- the audio detector is configured to detect a start of an utterance by the first conference participant by receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output.
- the audio detector may be configured to detect an end of the utterance by the first conference participant by receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
- the processor is further configured to, in response to detecting the start of the utterance, initiate a recording of the utterance by the first conference participant, and in response to detecting the end of the utterance, terminate the recording of the utterance by the first conference participant.
- the processor may be further configured to store a reference to the recorded utterance as an audio clip in a playlist, wherein the playlist includes a plurality of audio clips of utterances by other conference participants.
- the processor may be further configured to automatically categorize the stored audio clip in the playlist under a section identifying the conference or a subject of the conference, and automatically tag the stored audio clip with information identifying the first conference participant.
- the processor is configured to store a first index point corresponding to the start of the utterance in response to detecting the start of the utterance, and store a second index point corresponding to the end of the utterance in response to detecting the end of the utterance.
- the processor is configured to resume the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication by accessing, from the memory, a recorded version of the audio from the conference at the point referenced by the stored indication, providing the recorded version of the audio from the conference from the point referenced by the stored indication for playback at an accelerated rate, and providing the audio from the conference to the first conference participant in real time when playback of the recorded version terminates.
- the recorded version of the audio from the conference may be stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants.
- the processor may be configured to provide the recorded version of the audio from the conference for playback by providing the plurality of audio clips for sequential playback from the point referenced by the stored indication.
- the recorded version of the audio from the conference may be stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants.
- the processor may be configured to provide the recorded version of the audio from the conference for playback by providing the plurality of audio clips for playback in the same manner in which they were recorded from the point referenced by the stored indication, where two or more of the plurality of audio clips are played in an overlapping manner when the corresponding audio from the conference included overlapping utterances from multiple conference participants.
- the processor is further configured to receive user input of data to associate with the stored audio clip in the playlist, and store the data with an association to the stored audio clip.
- the stored data may comprise at least one of a subject, description, transcription, keyword, flag, digital file, and uniform resource locator.
- One aspect relates to a non-transitory computer-readable medium comprising computer-readable instructions encoded thereon for processing audio content of a conference.
- the computer-readable instructions comprise instructions for providing audio from the conference to a first conference participant, detecting a start of an utterance by the first conference participant, and in response to detecting the start of the utterance, stopping the provision of the audio from the conference to the first conference participant.
- the computer-readable instructions further comprise instructions for storing an indication of a point in the audio from the conference at which the provision of the audio from the conference to the first conference participant was stopped, detecting an end of the utterance by the first conference participant, and in response to detecting the end of the utterance, resuming the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication.
- FIG. 1 is an example diagram of various timelines during a conference in voicechat mode, in accordance with an implementation of the disclosure.
- FIG. 2 is an example display that is shown to a user for joining a conference, in accordance with an implementation of the disclosure.
- FIGS. 3A and 3B show an example display that allows a user to access and play audio clips, conferences, a playlist, or subdivisions of a playlist, in accordance with an implementation of the disclosure.
- FIGS. 4 and 5 are example displays of a notifications dialog, in accordance with an implementation of the disclosure.
- FIG. 6 is an example display of a prompt for a user to modify playback settings, in accordance with an implementation of the disclosure.
- FIG. 7 is an example display of an interface for filtering audio clips, in accordance with an implementation of the disclosure.
- FIG. 8 is an example display of transcriptions associated with audio clips, in accordance with an implementation of the disclosure.
- FIG. 9 is an example display of a message displayed to a user to confirm the user wishes to reply to an audio clip, in accordance with an implementation of the disclosure.
- FIG. 10 is an example display of a message displayed to a user to confirm the user wishes to continue a conversation, in accordance with an implementation of the disclosure.
- FIG. 11 is an example display of an option to auto-join a conference, in accordance with an implementation of the disclosure.
- FIG. 12 is a flowchart of a process for conference enhancement, in accordance with an implementation of the disclosure.
- FIG. 13 is an example display of a roster, in accordance with an implementation of the disclosure.
- FIG. 14 is a block diagram of a computerized system for performing any of the techniques described herein, in accordance with an implementation of the disclosure.
- Systems and methods for improving video and audio conferencing services are provided. Specifically, techniques are described herein for processing the audio content of conferences in a manner advantageous to real-time or future playback. These techniques may, for example, reduce the deleterious effects of multiple conference participants speaking at the same time.
- the techniques described herein also enable users to access audio content of conferences in a variety of useful ways. For instance, audio content from one or more conferences may be stored in a playlist as “clips” and delineated by speaker, topic, and/or other criteria. Tools may be provided for accessing, modifying, and/or augmenting the clips in the playlist. For example, users may be provided with search and sort tools, and may be able to tag or otherwise associate data with the audio clips.
- a “clip” may refer to a single audio file corresponding to a single utterance spoken by a user.
- a “clip” may refer to a portion of a longer audio file that includes multiple utterances spoken by one or more users.
- the clip refers to the portion of the longer audio file that corresponds to a single utterance, and index points may be used to indicate the beginning and end of the clip within the longer audio file.
- when a user is interrupted during a conference, the systems and methods of the present disclosure allow the user to pause the conference and later return to it, picking up where the user left off. In this manner, users may pause the conference, direct their full attention to the interruption, and return to the conference at the paused point, so that they may devote their full attention to the conference and catch up (with optional accelerated playback) to the other participants without missing any portion of the conference.
- the present disclosure provides systems and methods for the conference to continue even after all participants have disconnected from the live conference. In particular, the audio and/or video signals recorded during the conference are saved, and users may return to play the signals and record new content to continue the conversation.
- a first mode may correspond to a “conference mode,” in which the audio clips are mixed and voices overlap in a reenactment of the actual conference.
- the “conference mode” may be referred to as a “natural mode” because the reenactment of the actual conference is similar to playing the conference in real time.
- a second mode may correspond to a “voicechat mode,” in which the clips are played sequentially one-by-one in the order of their start times.
- the “voicechat mode” may be referred to as a “sequential mode” because the clips are played sequentially.
- a third mode may correspond to an “interleaved sequential mode,” in which the clips are played in the sequence of their start time with earlier starting clips being paused during playing of the later overlapping clips.
- the “interleaved sequential mode” may be referred to as the “threaded mode,” in which clips corresponding to the same topic or conversational thread may be played sequentially, even if intervening clips corresponding to different topics or conversations were recorded in between the clips.
- one of these three modes may be set as a default mode, and a user wishing to transition to a different mode may select different mode buttons on the interface.
- the voicechat mode may be the default mode, and the user may select a conference mode button to transition to the conference mode, or an interleaved sequential mode button to transition to the interleaved sequential mode.
- a user's last used mode may be remembered, such that if the user logs out of the system or otherwise closes the interface, the system may apply the last used mode the next time the user logs in.
- Such settings may be associated with the user's account, such that the system may remember the user's preferences across different devices.
- FIG. 1 is an example diagram of various timelines during a conference in voicechat mode.
- the diagram includes five rows corresponding to real time (top row), what Alice hears (second row), what Bob hears (third row), what Charlie hears (fourth row), and what a passive listener hears (bottom row).
- the conference participants interrupt one another, and their utterances overlap.
- Nobody speaks from t=7 to t=10.
- Alice speaks her second utterance from t=10 to t=14.
- the clips are played sequentially in the order of their start times.
- Charlie's two utterances are played back-to-back, followed by Alice's second utterance.
- a passive listener listening to the conference simply hears the clips played sequentially in the order of their start times.
- Alice's first utterance is played first, followed by Bob's utterance, Charlie's two utterances, and Alice's second utterance.
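- The sequential ordering of voicechat mode can be sketched in a few lines of Python; the timeline below loosely mirrors the FIG. 1 example, and the data structure and function names are assumptions:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    speaker: str
    start: float      # start time in the live conference (seconds)
    duration: float


def voicechat_order(clips: List[Clip]) -> List[Clip]:
    """Sequential ("voicechat") mode: play clips one by one, ordered by their start times."""
    return sorted(clips, key=lambda c: c.start)


# Loosely modelled on FIG. 1: Bob interrupts Alice, Charlie then speaks twice,
# and Alice speaks again at t=10.
clips = [
    Clip("Alice", 0, 4), Clip("Bob", 2, 3),
    Clip("Charlie", 4, 2), Clip("Charlie", 6, 1), Clip("Alice", 10, 4),
]
print([c.speaker for c in voicechat_order(clips)])
# ['Alice', 'Bob', 'Charlie', 'Charlie', 'Alice'] -- every utterance is heard in full, in start order
```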
- all participants may return to the same live point in the conference after listening to the same content.
- the systems and methods of the present disclosure allow conference participants to hear every utterance without interference from overlapping speakers and without being interrupted, while maintaining a consistent schedule such that the participants are all roughly at the same point in the conference.
- the audio data are stored as a separate audio clip file for each utterance.
- each audio clip may correspond to a single utterance spoken by a single user.
- a single user may be associated with multiple audio clip files, and the audio clip files may be tagged according to subject threads.
- the audio data may be stored in a single continuous audio file, and index points may refer to timestamps of the continuous audio file indicative of when a speaker begins and stops speaking.
- different continuous audio files may be stored for each user or speaker, and the timestamp indexes may be saved as metadata associated with the audio files. The indexes may also be used to follow a subject thread.
- a separate file may be used to store the various indexes indicating start and end times of utterances regarding the same subject thread.
- index points indicative of time offsets and duration may be used to break up long audio files, rather than storing separate audio clip files for each utterance.
- a clip, an audio clip, or an audio clip file may refer to an individual audio file storing the utterance, or may refer to a portion of a longer audio file storing multiple utterances, where the portion containing the utterance is indicated by one or more index points.
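- As a hedged illustration, a clip reference of this kind could be represented as follows; every field name here is an assumption, not terminology from the patent:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClipRef:
    """A clip is either its own audio file, or an indexed span of a longer recording."""
    speaker: str
    audio_path: str                    # a per-utterance file, or a long per-speaker/per-conference file
    start_offset: float = 0.0          # index point: seconds into audio_path where the utterance begins
    duration: Optional[float] = None   # None means the clip runs to the end of the file
    thread: Optional[str] = None       # optional subject-thread tag
    tags: List[str] = field(default_factory=list)


# Two equivalent ways of referring to the same utterance:
as_own_file = ClipRef("Alice", "conf42/alice_utt_0003.wav")
as_index_span = ClipRef("Alice", "conf42/alice_full.wav", start_offset=61.2, duration=4.0)
```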
- the audio output from the conference is paused for a user when the user begins to speak, and an audio clip is recorded for each individual who is speaking at the same time, including the user.
- the audio clips may be played back to him in conference mode, voicechat mode, or interleaved sequential mode.
- the audio clip corresponding to the user may be omitted during playback.
- the playback is accelerated so that the user may catch up to the real-time conference.
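- The pause-record-catch-up flow for a single participant might look roughly like the following Python sketch; the class, its methods, and the 1.5x rate are illustrative assumptions:

```python
from typing import List, Optional


class ParticipantSession:
    """Illustrative pause/record/catch-up flow for one participant (not the patent's API)."""

    def __init__(self, name: str):
        self.name = name
        self.paused_at: Optional[float] = None   # conference time at which live audio was stopped
        self.missed_clips: List[str] = []        # clips recorded while this participant was speaking

    def on_utterance_start(self, conference_time: float) -> None:
        # Stop the live conference audio for this participant and remember where it stopped.
        self.paused_at = conference_time

    def on_other_speaker_clip(self, clip: str) -> None:
        # Accumulate clips spoken by others while this participant's audio is paused.
        if self.paused_at is not None:
            self.missed_clips.append(clip)

    def on_utterance_end(self, conference_time: float, rate: float = 1.5) -> None:
        # Replay everything missed since the pause point (optionally accelerated),
        # then resume providing the live conference audio.
        lag = conference_time - (self.paused_at or conference_time)
        print(f"{self.name}: replaying {len(self.missed_clips)} clip(s) covering "
              f"{lag:.1f} s at {rate}x, then rejoining the live conference")
        self.missed_clips.clear()
        self.paused_at = None


alice = ParticipantSession("Alice")
alice.on_utterance_start(conference_time=120.0)   # Alice begins speaking; her feed is paused
alice.on_other_speaker_clip("bob_utt_0007.wav")   # Bob speaks while Alice is talking
alice.on_utterance_end(conference_time=128.0)     # Alice finishes; missed audio is replayed
```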
- mute and unmute events can be detected and used to stop the conference call's live audio and start the recorded playback. For instance, when a first speaker unmutes to speak, the live audio may be stopped so that other speakers do not distract the first speaker. Then, when that speaker finishes speaking and mutes again, the recording of any other utterances is played back to the speaker.
- a user may select to link (1) a selection of the unmute button to a pause of the audio, and (2) a selection of the mute button to a play of the audio.
- play and pause buttons may be used to implicitly signal mute and unmute, respectively.
- the user when the user selects to play the audio, the user may be automatically muted, and when the user selects to pause the audio, the user may be automatically unmuted.
- This may be referred to as a play/pause feature, which may be used during a live conference.
- the user may select a pause button to switch mute off, such that the user may speak in an uninterrupted fashion while the audio is paused.
- the user may select a play button to switch mute on, so that the user may listen to the conference.
- mute, unmute, pause, and play buttons may all be explicitly provided so that the user may have the flexibility to configure the user settings.
- the playback is performed at an accelerated speed until the first speaker is caught up to the real time conference.
- the playback is stopped, and live audio is resumed.
- the user may select to skip the playback and immediately join the live conference without listening to the playback.
- the detection of speech and cessation of speech can be used as automated triggers to achieve the same effect as unmute and mute. Detection of the cessation of speech may involve detecting that an amount of sound level recorded at the speaker's microphone is below a threshold level for a certain period of time. Moreover, a small amount of buffer time may be used after the detection of the cessation of speech but before the end of the clip to ensure that the utterance has ended.
- conference should be understood to encompass an audio or a video conference (or other multimedia conference) having an audio component.
- Conference participants may participate in a conference in real time, or different users may participate in the conference at different times.
- a conference may include a live conversation that is recorded and capable of being listened or watched at a later time, audio or video content that may be recorded by one user that is not connected to a live conference, or a combination thereof.
- a conference may include a live portion, during which users are connected over the same line and may interact with one another. The live portion may be capable of being played to a participant or a non-participant of the live conference, who may then record a response to some content discussed during the live portion.
- Such a recording may be referred to as an offline portion of the conference.
- multiple participants may access a conference using a wide range of devices and/or applications, such as land phones, mobile phones, tablets, computers, or any other suitable device for accessing a conference.
- one participant may use a phone to dial-in to a conference, while another joins the conference using an Internet service that he accesses using a personal computer.
- the present disclosure relates to interactive web collaboration systems and methods, which are described in U.S. Pat. No. 7,571,212, entitled “Interactive web collaboration systems and methods,” which is incorporated herein by reference in its entirety.
- the present disclosure uses systems and methods similar to a musical application for creating and editing a musical enhancement file to process audio data collected during a conference.
- a musical application is described in U.S. Pat. No. 7,423,214, entitled “Systems and methods for creation and playback performance,” which is incorporated herein by reference in its entirety.
- FIG. 2 shows two example displays that may be shown to a user for joining a conference.
- the left hand side of FIG. 2 provides an example screen for the user to enter the user's call id for entering the conference system, and optionally the call id of one or more other users for inviting to enter the conference system.
- the conference may be initiated from a web page set up for a particular conference id.
- the user may enter a conference id indicative of a particular conference that the user wishes to join (not shown in FIG. 2 ).
- the right hand side of FIG. 2 allows the user to set various conference settings, such as mute, headset, volume, and gain.
- the sound recording from each participant may be stored as individual audio clips and associated with data.
- This data may include metadata that identifies the particular conference, the speaker, the time and date, the subject of the conference, the duration of the clip, any assets shown during the conference, such as a presentation or a view of a screen, and/or any other suitable information.
- the audio clips may be stored together with the data or metadata and added to a playlist accessible by users, who may be the participants of the conference or non-participants of the conference.
- When a user accesses a playlist, the user may be presented with a display screen, such as the exemplary display shown in FIGS. 3A and 3B.
- the display in FIGS. 3A and 3B allows the user to access and play individual audio clips, whole conferences, the entire playlist, or subdivisions thereof, such as individual subject threads.
- the display in FIGS. 3A and 3B includes a foreground section labeled “Conversational threads,” which lists the various subject headings associated with different tracks, and the message number (“Msg”) associated with the beginning of each corresponding track.
- the background of the display of FIGS. 3A and 3B includes an expanded list of the various tracks listed in the foreground “Conversational threads” section, including a list of all messages in each track, the participant who spoke the corresponding utterance, and the text of the utterance.
- the user may select to play clips individually or in sequence. After listening to a selected clip, the user may select to record a new audio clip to “reply” to the clip or to “continue” the conversation.
- the new audio clip may be recorded in real-time, automatically linked to the originally selected clip, and/or added to the playlist. Upon future playback, the new clip may optionally be played immediately after the originally selected clip.
- although the new clip may be demarcated as having been recorded after the original conference took place (e.g., using a visual or audio indicator), this process enables the clip to be played back as if it were spoken during the conference.
- a user can select to play a “thread,” which is limited to the selected clip and any replies on the same subject, as indicated by a heading and by continuation headings as the conversation moves back and forth over various subjects.
- users may select to record utterances into one of a number of ‘tracks’.
- tracks may be associated with individual participants, individual topics, or a combination of both.
- Some playback modes may play each utterance based on its start time, regardless of its track, while other playback modes may allow a user to select to listen to one or more tracks simultaneously, or play all the utterances in one track followed by all those in the next track.
- threads may be implemented using tracks.
- an utterance may be assigned to a track automatically, such as the track of the previous utterance that was recorded. The user may select to update the default track to a different track or a new track.
- headings or tags may be added to the audio clip to add the clip to one or more suitable threads.
- a listener may wish to play the thread related to the “Reply feature” listed in the “Conversational threads” dialog box shown in FIGS. 3A and 3B.
- the listener may select the heading of the “Discussion on ‘Reply’ feature” thread, and the utterances tagged with the appropriate thread are played.
- the utterances labeled 72, 73, 83, 84, and 85 are played in sequence, and utterances 74-82 are skipped because these utterances are related to a different thread.
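- Expressed as code, playing a thread amounts to filtering the playlist by thread tag before sequencing; the sketch below reuses the message numbers from this example, while the data structure itself is an assumption:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    number: int
    thread: str


messages = [Msg(n, "reply-feature") for n in (72, 73, 83, 84, 85)] + \
           [Msg(n, "other-topic") for n in range(74, 83)]


def thread_playlist(msgs: List[Msg], thread: str) -> List[int]:
    """Keep only clips tagged with the requested thread, in recorded order."""
    return [m.number for m in sorted(msgs, key=lambda m: m.number) if m.thread == thread]


print(thread_playlist(messages, "reply-feature"))   # [72, 73, 83, 84, 85]; clips 74-82 are skipped
```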
- a user When a user requests to “continue” a conversation, the user records a new audio clip or multiple clips in real-time and that audio clip is automatically added to the playlist and linked to the same discussion topic as the selected clip.
- This process enables multiple users to have continuous yet asynchronous verbal discussions. For example, a first and second user may participate in a conference on a Monday and the audio from that conference may be stored in the playlist under a heading indicating the conference date and/or topic. A third user may then play the conference the following Tuesday and select to “continue” the discussion.
- This feature allows the third user to record one or more audio clips and to link them to the original conference in the playlist, e.g., so that they appear under the same heading or are otherwise labeled to indicate their association with a subject thread in the original conference.
- the other conference participants may be alerted to the fact that the third user added a new clip to the conversation and may play the new clip.
- utterances labeled with “continue” or “reply” may be automatically recorded to different tracks to distinguish such utterances from utterances belonging to the main thread.
- FIGS. 4 and 5 show example displays of a notifications dialog, in accordance with an implementation of the present disclosure.
- the display in FIG. 4 shows the data sorted according to conversations, while the display in FIG. 5 shows the data sorted according to pages.
- the owners of various extensions are listed in the display, and the user may select to modify a type and an interval for each corresponding owner and extension.
- each extension corresponds to an owner id (i.e., the first three digits of the extension), followed by a conversation id (i.e., the last six digits of the extension).
- Each owner of an extension may select to receive notifications when an update to the conversation is made, or at fixed time intervals indicated by the interval.
- the type field shown in FIG. 4 refers to a type of notification.
- the digest type of notification may correspond to a summary of all the changes that have occurred in the conversation since the last notification was sent. Examples of changes include identifiers of users who made what changes and when such changes were made.
- another type of notification is a “link,” for which the user receives a link to the changed entity within the conversation.
- Other types of notifications with different levels of detail may be used, and the users may select to receive notifications having different levels of detail for different conversations or extensions.
- the user may select to modify a type and/or an interval for each extension.
- FIG. 6 shows an exemplary display of the system prompting the user to modify playback settings, according to an implementation of the present disclosure.
- a user may modify a number of playback settings to control how audio clips are played in the playlist.
- Playback settings may be set globally or specifically for individual speakers. For example, a user may set a play speed, or tempo, for all speakers, or the user may set the tempo individually for each speaker.
- a user may enter the user's identifier, the user's nickname, the user's conference identifier, and the user's tempo, which may refer to the relative speed at which the user's audio is played.
- FIG. 6 includes options for the user to set settings specific to the speakers.
- settings include tempo, pitch shift, filter, silence, and volume.
- Selecting the silence setting causes silences to be removed during the playback of the clips.
- Selecting the filter setting causes the audio signals for the corresponding speaker to be filtered, to remove noise for example.
- audio characteristics may be set and/or adjusted automatically.
- the tempo of each speaker can be detected, such as by detecting an average syllabic frequency uttered by the speaker, and automatically adjusted to match a user-selected target tempo or an interval of elapsed time available for catch-up.
- the syllabic frequency of a speaker may be detected and compared to a threshold syllabic frequency.
- the threshold syllabic frequency may correspond to a fixed maximum syllabic frequency that is set for intelligibility, or may correspond to the syllabic frequency of the fastest speaker in the conference.
- the amount of speed-up applied to a speaker's utterance may be dependent on this comparison.
- the utterances spoken by slower speakers may be sped up at a higher rate than the utterances spoken by faster speakers, because the syllabic frequencies of slower speakers are further from the maximum syllabic frequency than the syllabic frequencies of the faster speakers.
- utterances from different speakers may be individually adjusted (manually or automatically) in accordance with their tempos to ensure that the utterances are sped up for efficiency while still being intelligible.
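- A hedged sketch of this automatic tempo adjustment is given below; the target syllabic rate and the cap are illustrative assumptions, not values from the disclosure:

```python
TARGET_SYLLABLES_PER_SEC = 6.0   # illustrative intelligibility ceiling, not a value from the disclosure
MAX_SPEEDUP = 2.0                # cap so that accelerated playback remains intelligible (assumption)


def speedup_factor(detected_syllables_per_sec: float) -> float:
    """Speed-up that raises a speaker's tempo to, but not past, the target rate."""
    if detected_syllables_per_sec <= 0:
        return 1.0
    factor = TARGET_SYLLABLES_PER_SEC / detected_syllables_per_sec
    return max(1.0, min(factor, MAX_SPEEDUP))


print(speedup_factor(3.0))   # slow speaker    -> 2.0x (capped)
print(speedup_factor(5.0))   # average speaker -> 1.2x
print(speedup_factor(6.5))   # fast speaker    -> 1.0x (never slowed down in this sketch)
```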
- FIG. 7 shows an exemplary display of an interface for filtering audio clips, according to an implementation of the present disclosure.
- the user may configure various filters to control which audio clips in the playlist are played. Using one such filter, the user may select to play the audio clips of only a specific person or specific persons. For example, the user may select to play back all audio clips aside from those for which he is the speaker. Using another filter, the user may select to play back only audio clips associated with particular tags, keywords, or other data. Audio clips may be associated with tags, metadata, URLs, descriptions, priorities, external documents, etc. In some configurations, a user associates data with audio clips via manual data entry or by dragging and dropping the data onto a clip in the playlist.
- FIG. 8 shows an exemplary display of transcriptions associated with audio clips, according to an implementation of the present disclosure.
- Audio clips may be transcribed, and the transcription of the audio content may be made available via the playlist.
- the transcription may be manual, automatic, or a combination of both. For example, a user may select a clip, request automatic transcription, and then manually edit the results. It is contemplated that automatic transcription may be provided by outside services, e.g., Internet-based services.
- the audio clip may be automatically transmitted to the Internet service and the resulting transcription automatically retrieved and stored with the clip. Users of such services may correct transcriptions, and the corrections may be transmitted back to the service so that the service may improve its accuracy.
- as more corrections are contributed over time, the service can improve still further in accuracy. Together, these improvements should enable automatic transcription to be used to provide highly accurate text.
- the text may be used for communication, authoring of programming and user scripting languages, translation between natural languages, and targeted advertising, for example.
- participants in a conference may engage in cooperative browsing.
- a participant shares a data object (e.g., a document, audio clip, video, URL, etc.), and the data object (or a reference to the data object such as a hyperlink) is automatically transmitted to each participant and displayed in real-time.
- the data object may involve a video stream of the participant's computer screen, so that the other participants may view the participant's screen during the conference.
- These data objects may also be stored and linked to particular audio clips in the playlist (and/or to particular index points within the audio clips).
- the objects may be redisplayed at the appropriate time just as in the live conference.
- shared assets related to a conference may be presented to a user as a synchronized slideshow during playback. Alternatively, such assets may be viewed as a collection of resources in a separate window.
- When a participant begins speaking, the conference audio may be paused for that participant so as not to interfere with his or her speaking.
- the conference audio may be accumulated and stored in the interim, e.g., as audio clips are added to the playlist.
- the stored audio content is subsequently played to the participant, so that the participant can listen to what the other participants said while he or she was speaking.
- the clips can then be replayed in “conference mode,” in which the audio clips are mixed and voices overlap as they did in the actual conference; in sequential “voicechat mode,” in which the clips are played one by one; or in “interleaved sequential mode,” in which the clips are played one by one at their original relative start times, with earlier-starting clips paused during the playback of later-starting clips.
- the clips that are played sequentially may correspond to a single thread, subject, or conversation.
- a reply clip that includes a reply to an original clip may be played immediately after the original clip in the interleaved sequential mode, even if intervening clips were recorded between the times that the original clip and the reply clip were recorded.
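- A minimal sketch of one way to order clips for sequential playback while keeping each reply immediately after the clip it answers, as described above; the Clip fields and the handling of orphaned replies are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Clip:
    clip_id: str
    start_time: float               # seconds from the start of the conference
    reply_to: Optional[str] = None  # clip_id of the original clip, if this is a reply

def sequential_order(clips: List[Clip]) -> List[Clip]:
    """Orders clips by start time for one-by-one playback, but places each
    reply immediately after the clip it answers."""
    by_start = sorted(clips, key=lambda c: c.start_time)
    replies = [c for c in by_start if c.reply_to]
    ordered: List[Clip] = []
    for clip in by_start:
        if clip.reply_to:
            continue                # replies are inserted after their originals below
        ordered.append(clip)
        ordered.extend(r for r in replies if r.reply_to == clip.clip_id)
    placed = {c.clip_id for c in ordered}
    ordered.extend(c for c in by_start if c.clip_id not in placed)  # orphaned replies
    return ordered
```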
- the user may choose to accelerate playback so that he catches up to the live conference.
- the rate of acceleration may be automatically determined based on the elapsed time since the audio conference was stopped.
- FIG. 9 shows an exemplary display of a message displayed to a user to confirm that the user wishes to reply to an audio clip. As shown in FIG. 9 , the user may provide a maximum duration for the reply. Selecting to reply to an original audio clip may cause the next recorded utterance to be associated with the original audio clip.
- FIG. 10 shows an exemplary display of a message displayed to a user to confirm that the user wishes to continue a conversation by recording a new clip related to an existing clip or set of clips.
- the new recording may be automatically tagged to reflect the tags of the thread. Selecting to continue a conversation may cause the next recorded utterance to be tagged with the associated tags of the thread or conversation.
- utterances labeled with “continue” or “reply” may be automatically recorded to different tracks to distinguish such utterances from utterances belonging to the main thread.
- FIG. 11 shows an exemplary display of an option to auto-join a conference, in accordance with an implementation of the present disclosure.
- a user may select the option such that when he reaches the end of the playlist, he will be automatically added to the live conference and join with others already there and/or invite others to join him.
- the sound level being transmitted from each participant's device may be monitored.
- When the sound level exceeds a predefined threshold, recording may commence or an index point into a continuous recording may be noted.
- an individual clip may refer to the interval in a continuous recording between successive index points.
- When the sound level falls below a threshold, recording may stop, or another index point may be generated. Recording of the same clip or a different clip may resume if the sound level returns above a resume threshold within a defined period of time.
- the resume threshold may be the same or different from the predefined threshold that was originally used at the start of a clip.
- the thresholds may be different based on speakers or other variables such as background noise.
- the sound level threshold may be adjusted progressively as the clip is recorded. Other factors, such as vocabulary and phrase determination, may also be used to determine useful clip boundaries. Alternatively, if the sound level does not return above the threshold within the defined period of time, the recording may be terminated, and the audio clip is stored.
- the audio clip may also be associated with data, such as information identifying the speaker, topic, subject thread, related messages and vocabularies and grammars used. Speaker information may be provided, for instance, within the data stream transmitted by each participant's device, or the conferencing service may track each user when they access the conference. In particular, such metadata may include the speaker and the duration of the audio clip.
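- The clip-boundary behavior described above may be illustrated with the following sketch: recording starts when the monitored level exceeds a start threshold, and the clip is terminated and stored if the level stays below a resume threshold for a timeout. The frame-based representation, threshold values, and metadata fields are assumptions made for illustration.

```python
def segment_clips(levels, start_thresh=0.10, resume_thresh=0.08,
                  timeout_frames=50, speaker="unknown"):
    """levels: a sequence of per-frame sound levels for one participant.
    Returns clips as dicts with start/end frame indices and simple metadata."""
    clips, clip_start, silent_frames, recording = [], None, 0, False
    for i, level in enumerate(levels):
        if not recording:
            if level > start_thresh:              # start of an utterance
                recording, clip_start, silent_frames = True, i, 0
        else:
            if level > resume_thresh:
                silent_frames = 0                 # speech continues or resumes in time
            else:
                silent_frames += 1
                if silent_frames >= timeout_frames:
                    end = i - silent_frames       # trim the trailing silence
                    clips.append({"speaker": speaker, "start": clip_start,
                                  "end": end, "duration": end - clip_start})
                    recording = False
    if recording:                                 # stream ended mid-utterance
        clips.append({"speaker": speaker, "start": clip_start,
                      "end": len(levels), "duration": len(levels) - clip_start})
    return clips
```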
- FIG. 12 is a flowchart of a process 100 for conference enhancement, in accordance with an embodiment of the present invention.
- the steps of FIG. 12 may be performed by a process (which may be a software application, e.g., executed on a local device or a remote server).
- At step 102, the sound level of a participant on a conference is monitored.
- At step 104, it is determined whether the participant is speaking, i.e., whether the sound level emanating from the participant's audio stream (e.g., input by a microphone or other audio input device) is greater than a defined threshold value. If not, the process 100 returns to step 102 to continue monitoring the participant's sound level.
- At step 106, recording of the participant's audio stream as an audio clip is initiated.
- the conference audio is paused such that conference audio is not provided to the participant.
- the participant may speak without being interrupted by other speakers.
- a channel routine in the participant's out channel may detect that the participant is speaking by means of a signal or message from the in channel which is monitoring the speech, and stop sending the conference audio.
- the audio signals recorded from the conference are processed after the conference is over. In particular, such processing may include detecting a speaker or metadata associated with the clips.
- the “in channel” may refer to the playback, or the original raw clips
- the “out channel” may refer to the output of the processing or analysis, such as the speaker information or metadata associated with the utterances.
- pausing the conference audio is optional.
- the user may select whether the process 100 should automatically stop the conference audio upon detecting speech.
- the user may control the provision of the conference audio manually (e.g., by selecting a button).
- the conference audio may be recorded from the beginning, e.g., as soon as the conference starts, and the audio may be stored as audio clips that are added to a playlist.
- the time reached in the conference when the audio is stopped is noted and stored, as reflected by step 108 .
- an index point into the audio conference may be generated and stored.
- If the conference is not being recorded at the time it is stopped for the participant, it may be recorded at that point and stored until the user stops speaking.
- a track, a subject thread, or both is initiated or continued from the point at which the conference is stopped and the participant commences speaking.
- After initiating recording of the participant and stopping the conference audio, the process 100 proceeds to step 110, where it determines whether the participant has stopped speaking. As mentioned above, a user may be deemed to have stopped speaking when the sound level drops below a threshold for a particular duration of time. Alternatively, a user may manually indicate that he has stopped speaking. If the user has not yet stopped speaking, the process 100 proceeds to step 112 and continues to record the participant's utterances in the audio clip. The process 100 then loops between steps 110 and 112 until it is determined that the user has indeed stopped speaking.
- When it is determined that the user has stopped speaking, the process 100 proceeds to steps 114 and 116.
- the order of steps 114 and 116 is not significant, and either may be performed first or the steps may be performed in parallel.
- At step 114, the audio clip is stored in the playlist, where the clip may be tagged, annotated, and played, for example.
- At step 116, the previously stopped conference audio is accessed at the location indicated by the index point stored at step 108 and played back from that point.
- the conference may be played at a normal speed or at an accelerated pace, and the conference audio may be replayed in accordance with one of multiple modes. In conference mode, the audio is replayed as it was heard on the conference, with all audio streams from the various speakers mixed.
- In voicechat mode, the audio is replayed sequentially, with the audio streams of each speaker separated and played one after the other, or interleaved in interleaved sequential mode.
- the process may revert to transmitting the live conference audio.
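- A simplified sketch of the per-participant flow of FIG. 12 is shown below: the live feed is paused and an index point stored when the participant starts speaking, and playback from the index point at an accelerated rate is assumed to run until it reaches the live conference. The class and method names are illustrative assumptions, not an interface defined by this disclosure.

```python
class ParticipantFeed:
    """Per-participant controller for the pause/catch-up flow of FIG. 12.
    Times are expressed in seconds of conference time."""

    def __init__(self, playback_rate: float = 1.3):
        self.live = True                 # True while the live mix is being forwarded
        self.index_point = None          # conference time at which the feed was paused
        self.playback_rate = playback_rate

    def on_speech_start(self, conference_time: float) -> None:
        if self.live:
            self.index_point = conference_time   # step 108: note where playback must resume
            self.live = False                    # stop providing the live mix

    def on_speech_end(self, conference_time: float) -> float:
        """Returns the estimated time to catch up by replaying from the index
        point at the accelerated rate; the live mix would resume afterwards."""
        if self.index_point is None:
            return 0.0
        backlog = conference_time - self.index_point          # audio missed while speaking
        if self.playback_rate > 1.0:
            catch_up = backlog / (self.playback_rate - 1.0)   # gap closes at (rate - 1) per second
        else:
            catch_up = float("inf")
        self.live = True    # simplified: the real transition happens once playback catches up
        return catch_up
```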
- the replaying of conference audio to the user can be done either by replaying the clips sequentially, by reenacting the conference as a mix from recordings of the clips of each other speaker as described earlier, or from a recording of the audio mix as it would have been heard by that speaker, obtained by monitoring and recording the out-channel to each participant from the conference bridge as well as the in-channel (there is a different mix for each participant, with each mix leaving out that participant's voice).
- the clips may be analyzed in real time or after a conference to determine and save metadata associated with each clip. Such metadata may include the speaker of the clip or the duration of the clip. Alternatively, a recording of the mix of all speakers may be used, though in this case the user will hear his own speech played back to him.
- the conference audio begins playing from the playlist at exactly the point indicated by the stored timing information.
- the conference audio may start replaying from a prior point, e.g., a user-adjustable time.
- the user may manually accelerate the conference audio or the user may request automatic acceleration (the latter may be a default setting).
- the user may control acceleration, i.e., set the speed at which the audio is replayed, or the process may determine the speed of acceleration automatically.
- the speed of acceleration may be determined from the elapsed time since the conference audio was stopped. For instance, if the participant spoke for 1 minute as did others in parallel and others continue to speak, and he desires to catch up to the live conference within 2 minutes, the rate of acceleration may be calculated as 1.5×.
- the rate of acceleration may be speaker-specific, e.g., to ensure intelligibility.
- the process may automatically determine the tempo of each speaker and accelerate those tempos up by the same relative amount or to a global maximum tempo, or up to an average tempo necessary to enable the participant to catch up within a desired period of time.
- the playback may also be paused or slowed, manually or automatically, at certain events or time intervals to enable the participant to add tags or other annotations. For example, the playback may pause after the termination of each audio clip. These pauses may be taken into account when calculating the total acceleration necessary to catch up to the live conference.
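- The required catch-up rate can be expressed compactly. The sketch below reproduces the 1.5x example above and optionally accounts for annotation pauses; the function and argument names are illustrative assumptions.

```python
def catch_up_rate(backlog_s: float, window_s: float, pause_s: float = 0.0) -> float:
    """Audio to cover in the window is the backlog plus the audio that accrues
    during the window; the playing time available is the window minus pauses."""
    playing_time = window_s - pause_s
    if playing_time <= 0:
        raise ValueError("window too short for the requested pauses")
    return (backlog_s + window_s) / playing_time

# Example from the text: 1 minute behind, catch up within 2 minutes -> 1.5x.
print(catch_up_rate(60, 120))
```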
- the user is provided with a feature to save his place in the conference audio during playback, so that he can switch between the live conference and playback at his desire. In this manner, for example, the user can catch up on a missed portion of the conference during one or more breaks in the live conference.
- each participant may be provided with an indication of which other participants are actively listening to the live audio or are engaged in playback. Users may be able to “ping” or otherwise message the participants in playback mode to signal that their presence in live mode is requested.
- the process in response to a ping request, may automatically save the playback participant's place in the conference playback and automatically connect him to the live conference.
- the playback mode features discussed above can be used advantageously in circumstances other than when a participant speaks.
- a user may replay a conference after the conference has terminated and access the features above, e.g., the tools enabling acceleration and/or sequential or normal playback.
- the user may select to omit clips from certain speakers, e.g., clips originating from the user himself.
- playback mode may be used during the live conference even when the participant is not speaking. Specifically, when several conference participants speak at once, a participant (or all participants) may be automatically switched to playback mode and clips of each speaker may be played sequentially. In an example, all participants may be switched to playback in voicechat mode, and the clips may be played such as is shown in FIG. 1 .
- the playback can be at an accelerated rate, which may be configured on a speaker-by-speaker basis, as discussed above. Moreover, acceleration may be configured based on the accumulated backlog of audio clips. After listening to each clip (or skipping one or more of the clips), the participant may then be transitioned back to the live conference. The switching between playback mode and live mode may be seamless to the user, so that each participant experiences a pseudo-real-time conference free from individuals talking over one another.
- Conference participants may also make use of playback mode electively, e.g., to recap something previously said, or in case the participant has to leave the conference for a period of time.
- the user can, during the live conference, access tools that allow the user to “pause,” “rewind,” or otherwise skip backwards in time, and then playback the conference as desired.
- the action of playing back an earlier part of the conference may optionally cause the conference audio to stop and go into catch-up mode when the playback is paused. Alternatively, playback may continue until it reaches the live conference. This may be done while on mute so that the playback does not affect the ongoing conference, and the playback may be accomplished independently of the conference connection by means described earlier.
- the detection of speech causes the conference live audio to stop.
- the playback may be sent by the system to the telephone or computer connected to the conference in place of the usual conference mix. Echo cancelling may be used to prevent the playback from being heard back in the conference.
- the user may wish to play back into the conference for all the conference participants to hear.
- the user may unmute his microphone at the start of the portion he wishes to play back, share the play back, and mute his microphone after the play back.
- the system may switch off echo cancelling or otherwise cause selected recordings to play into the conference such as through a playback station connected as another user and controlled by any user through a web interface.
- the mute and unmute buttons may be implemented such that the audio is automatically paused when unmute is selected and automatically played when mute is selected.
- the roster may indicate presence information, such as indications of when a particular participant is speaking, or what portion of the conference one or more participants are listening to and at what speed.
- the roster may flag a significant change of speaker state (such as a participant joining or leaving a conference, for example) by displaying a popup notification on a user's desktop or browser.
- FIG. 13 is an exemplary display of a roster, in accordance with an implementation of the present disclosure.
- the roster shown in FIG. 13 includes a list of names of conference participants, as well as a status of each participant. The status indicates whether the corresponding participant is listening to the conference, speaking into the conference, or neither (i.e., viewing the page). In particular, if the participant is neither listening nor speaking to the conference, the participant may be viewing and/or editing a webpage associated with the conference. Such information of what the participant is viewing or editing may be displayed on the roster.
- the roster includes a current mode (i.e., conference or chat mode) associated with each participant. The state of each participant indicates whether the participant is muted or recording his voice.
- the “Last Rec” section of the roster indicates the last recording created by each corresponding user
- the “On Msg” section of the roster indicates the current message or clip that is being listened to by the user.
- the roster shown in FIG. 13 may be updated in real time as the various participants change their modes and states.
- the roster shown in FIG. 13 may also include a user option to select whether to play his own utterances during playback.
- the timeline display may indicate how the conversation shifts across different speakers and/or across various conversation threads. Different speakers and/or different threads may be indicated by different colors on segments of each thread line.
- the timeline display may provide a visual indicator referring to the track into which an utterance is recorded (such as a track number that is displayed for each utterance, for example).
- the timeline display may be shown in the roster as is described above. In the example view of the roster in FIG. 13 , the timeline display includes a row of twenty rectangles corresponding to each participant.
- Each row of rectangles corresponds to the last twenty utterances spoken by all the participants in the conference, and a highlighted rectangle in a particular row means that the corresponding participant spoke that utterance. Different colors or textures, or any other suitable graphical indicia, may be used to highlight a rectangle.
- the example timeline in FIG. 13 is shown as an illustrative example only, and other features may be included in the timeline that are not shown in FIG. 13 .
- a user may interact with the timeline display to navigate an entire conference, conversation, or a portion thereof. In this case, the user may zoom into the timeline display to focus on one or more particular utterances, or may zoom out to see a roadmap of the conference.
- the width of each rectangle may be based on the duration of the utterance, such that wider rectangles correspond to longer utterances.
- Such timelines may be referred to as scaled timelines.
- it may be undesirable to display very long silences on the timeline; such silences may instead be indicated using graphical indicia such as an ellipsis or a line across the timeline.
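- One possible rendering of such a scaled timeline is sketched below, with rectangle widths proportional to utterance duration and silences longer than a cutoff collapsed into an ellipsis marker. The data shapes, pixel scale, and cutoff value are assumptions made for illustration.

```python
def scaled_timeline(utterances, px_per_second=4, silence_cutoff_s=30):
    """utterances: list of (speaker, start_s, end_s) sorted by start time.
    Returns drawable segments as (kind, speaker_or_None, width_px) tuples."""
    segments, prev_end = [], None
    for speaker, start, end in utterances:
        if prev_end is not None:
            gap = start - prev_end
            if gap > silence_cutoff_s:
                segments.append(("ellipsis", None, 12))      # collapsed long silence
            elif gap > 0:
                segments.append(("silence", None, int(gap * px_per_second)))
        segments.append(("utterance", speaker, int((end - start) * px_per_second)))
        prev_end = end
    return segments
```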
- When one person starts speaking and his conference audio is paused, the other participants have the option of hearing his speech in the conference, either in the conference mode or in the voicechat mode. For example, all participants may be switched into voicechat mode as soon as the speaker starts speaking. In this case, all participants may listen to the speaker at normal speed such that the participants finish listening at the same time, as is described in more detail in relation to FIG. 1.
- a second participant starts speaking before the first speaker stops speaking. If the other participants continue to hear the first speaker in the normal conference audio, they have the option of hearing both speakers mixed together.
- the speakers may be separated by different positions in a stereo image.
- the conference audio may be stopped and replaced with playback of the first speaker from the time at which the second speaker starts. Then, audio from the second and any subsequent speakers may be played until the listeners are caught up to real time. When the listeners are caught up, they may be automatically joined back into the conference audio mix.
- waiting time is eliminated by ensuring all participants finish either speaking or listening to the sequence at the same time.
- all participants may select the voicechat mode as playback.
- all participants are automatically switched to playback mode when any of the participants begin speaking over each other. Remaining in voicechat mode during the conference causes the participants to listen to a conference in which every speaker appears to speak in turn.
- voicechat mode ensures that no speaker has to wait to speak, and that no one is interrupted. The voicechat mode may be particularly useful when the voice connections are peer to peer without the need to utilize a conference service on a central server.
- Each Speaker May Record his or her Voice Only
- each participant may control the sharing of the clips through shared playlists, chat messages, and/or emails with links to clips.
- the clips may be communicated directly between participants for playing, but the receiving participants may not be able to copy or save the clip or audio selection onto their personal devices.
- the clips may be configured to be capable of being saved only to a personal device associated with the speaker in the clip.
- the media signals for each speaker may be stored in files on a storage space that the speaker controls himself. If the speaker wishes to allow other users to listen to his media content, the speaker may provide the other users access to certain metadata associated with the media signals and provide permission to the other users to access the media signals.
- the metadata may include presentations or a video of a screen sharing used during the conference.
- the other users are only given access to play the content, but not to copy the content or store it on other devices. In this case, the speaker may update the permissions by removing permissions for one or more users after initially providing access.
- Internet traffic is reduced through the use of a caching server.
- each peer needs to transmit each clip only once, and the other peers may obtain the clips from the caching server.
- the clips may be encrypted when first recorded on the speaker's peer system and only decrypted when played on his system or on another authorized peer system.
- only authorized users may be permitted to participate in certain protected conferences. In such conferences, all channels may be encrypted such that they may not be played to a user unless the user has the appropriate conference key, individual speaker key, or both.
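- A minimal sketch of per-clip encryption under a shared conference key is shown below, using the Python cryptography package. The choice of cipher, library, and key-distribution scheme are assumptions; this disclosure does not prescribe a particular encryption mechanism.

```python
from cryptography.fernet import Fernet

conference_key = Fernet.generate_key()       # distributed only to authorized users
cipher = Fernet(conference_key)

def store_clip(raw_audio: bytes, path: str) -> None:
    with open(path, "wb") as f:
        f.write(cipher.encrypt(raw_audio))    # clip is encrypted at rest

def play_clip(path: str) -> bytes:
    with open(path, "rb") as f:
        return cipher.decrypt(f.read())       # only possible with the conference key
```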
- participants start in voicechat mode, see each other in the roster, and elect to speak in conference.
- In a broadcasting mode, the same audio is played to all participants as soon as possible.
- participants may send audio or visual messages to one another. The messages may be transmitted at a lower volume or with an alert tone and/or text message or presence flag.
- a user participates in a conference using the voicechat mode.
- the voicechat mode allows all users to speak whenever they have a thought and hear everyone else's thoughts played one at a time.
- Preconditions in this case include turning autoplay on in sequential mode with streaming, applying filters to new messages to hear all speakers except oneself (or an omit-self option for autoplay), setting auto join on, and setting speed to 130% (or 115%, 100%, or any other suitable speed).
- a gadget may be used for recording and playback, though the gadget may not omit a user's own utterances or have speed-up options.
- a phone connection with touch tones to select options may be used to select to not hear a user's own messages.
- the touch tones may be used to start and stop recording (in this case, a SoX-mediated read may be used to speed up playback like the player).
- a user may be listening to playback and catch up with the conference volume down and mute on. To speak, the user may select a pause button and unmute button. When the user is finished speaking, the user may select to mute himself, and then the play button to resume listening to the conference.
- a roster such as is shown in FIG. 13 is used so that the participants may determine where the other participants are in the conversation.
- a user plays back the audio clips during the conference.
- the user may wish to replay recent messages while in conference without missing anything. This case may arise if the subject matter is complex, or if the user was distracted by an interruption for example. Preconditions in this case include setting auto join on and speed to 130% (or 115%, 100% or any other suitable speed).
- the user may hang up the conference call and set the filters to hear speakers he wishes to replay (including optionally omitting himself).
- the user then plays the clips from the point at which he wishes to start, may select to skip any unwanted messages, and when he decides he has heard enough, plays to or skips to the end to rejoin the call.
- a roster such as is shown in FIG. 13 is used so that the participants may determine where the other participants are in the conversation.
- the roster may display an indication that a user has left the live conference and has gone back to replay a previous portion of the conference.
- an original user wishes to speak while another user is speaking, and what the original user wishes to say may affect what the other user is saying.
- Preconditions in this case include setting the scroll mode to latest, setting auto play on and being caught up, where playback volume is down, speed is set to 130% (or 115%, 100%, or any other suitable speed), filters are set to hear all speakers except himself (or selecting an omit self option), and having sequential play on (or natural play).
- the original user may turn the conference volume down, press pause, unmute himself, and speak.
- When the original user is finished speaking, he may press mute, turn playback volume up, and start play.
- the conference volume may be turned up, and playback volume may be turned down.
- the other participants may react as in the following examples (or these reactions may be automated by the systems and methods herein).
- the first speaker, upon hearing another speaker, may turn his conference volume down and pause playback. When he finishes speaking, he starts play omitting himself, with playback volume up. When he catches up, he turns down his play volume and turns up his conference volume.
- other participants may turn their conference volume down and play volume up, and the reverse when they are caught up.
- the system may download a speaker's own utterances even though they are not played immediately, so that the utterances may be cached in the speaker's browser for future use.
- a user is in a conference, and while someone else speaks, the user wants to hear what the other person has to say, but has a related thought to contribute and does not want to lose the thought.
- Preconditions for this use case include that the user is in the conference, background autoplay is on so play volume is down and conference volume is up.
- the user speaks the thought using single message reply by muting the conference and lowering the conference volume, then clicking a reply button in a widget so that the user replies to the most recent message being played. Playback is paused, the call is answered, the user speaks the thought, and hangs up the call. The user may then return to the conference by raising the playback volume and restarting playback at accelerated speed.
- the user may raise the conference volume and lower the playback volume.
- the user may restate the thought or play it back into the conference by locating the reply in the playlist, unmuting the conference, increasing the playback volume, and playing the reply. Then the user may return to the normal conference by lowering the playback volume and restoring autoplay.
- the system may assist the user in finding an appropriate moment to inject the thought before the flow of the conversation moves on.
- a user joins a conference late and wants to catch up.
- the user may start play with speed-up by speaker, and possibly filter by speaker, with autojoin on.
- the user may stop play or let it continue in the background at low or no volume to be able to do other things.
- a user wishes to join a conference when his agenda item comes up.
- the systems and methods of the present disclosure may provide an agenda linked to threads for each item.
- the user may play the thread from the agenda item with auto play on, such that the user may hear a signal when his agenda item is up.
- This implementation requires that the participants indicate they are associated with particular agenda items.
- the user may select to have autojoin on so that the user is brought into the conference when the messages in the user's thread start.
- In a seventh use case, when a user is listening to a playback of a conference that has previously been recorded, the user may wish to continue the conference by recording additional utterances.
- This may be implemented in a mode referred to as conference continuation mode, in which the user may record additional utterances to continue the thread of discussion and update the conference. Later, when the user or other users listen to the playback of the conference, the user's additional utterances are included in the playback.
- the additional utterances may be added at the beginning, middle, or end of the original conference, and redirection tags may be inserted automatically so that later listeners are redirected to the additional utterances at the proper time.
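- The redirection-tag mechanism may be illustrated as follows: a continuation clip is appended to the stored playlist and an earlier clip is tagged so that playback jumps to the new clip at the proper point. The data model and traversal below are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PlaylistClip:
    clip_id: str
    audio_ref: str                      # path or URL of the stored audio
    redirect_to: Optional[str] = None   # clip to jump to after this one, if set

def continue_conference(playlist: List[PlaylistClip], after_clip_id: str,
                        new_clip: PlaylistClip) -> None:
    """Appends a later-recorded utterance and tags an earlier clip so that
    playback is redirected to the new utterance at the proper point."""
    playlist.append(new_clip)
    for clip in playlist:
        if clip.clip_id == after_clip_id:
            clip.redirect_to = new_clip.clip_id

def playback_order(playlist: List[PlaylistClip]) -> List[str]:
    """Walks the playlist in stored order, following redirection tags."""
    by_id = {c.clip_id: c for c in playlist}
    order, visited = [], set()
    for clip in playlist:
        cur = clip
        while cur is not None and cur.clip_id not in visited:
            visited.add(cur.clip_id)
            order.append(cur.clip_id)
            cur = by_id.get(cur.redirect_to) if cur.redirect_to else None
    return order
```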
- Playback of conference clips may be through the conference connection in place of the usual conference mix on the out channel.
- Alternatively, playback can be by media playing in various formats, such as those provided by HTML.
- it is helpful to stream the playback so that it can catch up to users speaking, including while clips are being recorded.
- audio processing may be undertaken by audio utilities included in the streaming process, such as SoX (Sound eXchange), which is a well-known Unix utility.
- a simple streaming technique may be implemented using the following steps.
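- A minimal sketch of one such streaming technique is shown below, assuming SoX is used to concatenate recorded clips, apply a tempo effect (which changes speed without changing pitch), and write a WAV stream to standard output. The file names, chunk size, and 1.3x tempo are illustrative assumptions rather than the specific steps of this disclosure.

```python
import subprocess

def stream_clips(clip_paths, tempo=1.3, chunk_size=4096):
    """Yields chunks of a tempo-adjusted WAV stream built from the given clips."""
    # SoX concatenates its input files, applies the tempo effect, and writes
    # WAV data to stdout ("-"), which is read and forwarded chunk by chunk.
    cmd = ["sox", *clip_paths, "-t", "wav", "-", "tempo", str(tempo)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            yield chunk     # forward each chunk to the listener as it arrives

# Example: for chunk in stream_clips(["clip1.wav", "clip2.wav"]): send(chunk)
```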
- FIG. 14 is a block diagram of a computing device, such as any of the components of the system of FIG. 1 , for performing any of the processes described herein.
- Each of the components of these systems may be implemented on one or more computing devices 1400 .
- a plurality of the components of these systems may be included within one computing device 1400 .
- a component and a storage device may be implemented across several computing devices 1400 .
- the computing device 1400 comprises at least one communications interface unit, an input/output controller 1410 , system memory, and one or more data storage devices.
- the system memory includes at least one random access memory (RAM 1402 ) and at least one read-only memory (ROM 1404 ). All of these elements are in communication with a central processing unit (CPU 1406 ) to facilitate the operation of the computing device 1400 .
- the computing device 1400 may be configured in many different ways. For example, the computing device 1400 may be a conventional standalone computer or alternatively, the functions of computing device 1400 may be distributed across multiple computer systems and architectures. In FIG. 14 , the computing device 1400 is linked, via network or local network, to other servers or systems.
- the computing device 1400 may be configured in a distributed architecture, wherein databases and processors are housed in separate units or locations. Some units perform primary processing functions and contain at a minimum a general controller or a processor and a system memory. In distributed architecture implementations, each of these units may be attached via the communications interface unit 1408 to a communications hub or port (not shown) that serves as a primary communication link with other servers, client or user computers and other related devices.
- the communications hub or port may have minimal processing capability itself, serving primarily as a communications router.
- a variety of communications protocols may be part of the system, including, but not limited to: Ethernet, SAP, SAS™, ATP, BLUETOOTH™, GSM and TCP/IP.
- the CPU 1406 comprises a processor, such as one or more conventional microprocessors and one or more supplementary co-processors such as math co-processors for offloading workload from the CPU 1406 .
- the CPU 1406 is in communication with the communications interface unit 1408 and the input/output controller 1410 , through which the CPU 1406 communicates with other devices such as other servers, user terminals, or devices.
- the communications interface unit 1408 and the input/output controller 1410 may include multiple communication channels for simultaneous communication with, for example, other processors, servers or client terminals.
- the CPU 1406 is also in communication with the data storage device.
- the data storage device may comprise an appropriate combination of magnetic, optical or semiconductor memory, and may include, for example, RAM 1402, ROM 1404, a flash drive, an optical disc such as a compact disc, or a hard disk or drive.
- the CPU 1406 and the data storage device each may be, for example, located entirely within a single computer or other computing device; or connected to each other by a communication medium, such as a USB port, serial port cable, a coaxial cable, an Ethernet cable, a telephone line, a radio frequency transceiver or other similar wireless or wired medium or combination of the foregoing.
- the CPU 1406 may be connected to the data storage device via the communications interface unit 1408 .
- the CPU 1406 may be configured to perform one or more particular processing functions.
- the data storage device may store, for example, (i) an operating system 1412 for the computing device 1400; (ii) one or more applications 1414 (e.g., computer program code or a computer program product) adapted to direct the CPU 1406 in accordance with the systems and methods described here, and particularly in accordance with the processes described in detail with regard to the CPU 1406; or (iii) database(s) 1416 adapted to store information required by the program.
- the operating system 1412 and applications 1414 may be stored, for example, in a compressed, an uncompiled and an encrypted format, and may include computer program code.
- the instructions of the program may be read into a main memory of the processor from a computer-readable medium other than the data storage device, such as from the ROM 1404 or from the RAM 1402 . While execution of sequences of instructions in the program causes the CPU 1406 to perform the process steps described herein, hard-wired circuitry may be used in place of, or in combination with, software instructions for implementation of the processes of the present disclosure. Thus, the systems and methods described are not limited to any specific combination of hardware and software.
- Suitable computer program code may be provided for performing one or more functions described herein.
- the program also may include program elements such as an operating system 1412 , a database management system and “device drivers” that allow the processor to interface with computer peripheral devices (e.g., a video display, a keyboard, a computer mouse, etc.) via the input/output controller 1410 .
- Non-volatile media include, for example, optical, magnetic, or opto-magnetic disks, or integrated circuit memory, such as flash memory.
- Volatile media include dynamic random access memory (DRAM), which typically constitutes the main memory.
- Computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM or EEPROM (electronically erasable programmable read-only memory), a FLASH-EEPROM, any other memory chip or cartridge, or any other non-transitory medium from which a computer can read.
- Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the CPU 1406 (or any other processor of a device described herein) for execution.
- the instructions may initially be borne on a magnetic disk of a remote computer (not shown).
- the remote computer can load the instructions into its dynamic memory and send the instructions over an Ethernet connection, cable line, or even telephone line using a modem.
- a communications device local to a computing device 1400 (e.g., a server) can receive the data and place the data on the system bus for the CPU 1406.
- the system bus carries the data to main memory, from which the processor retrieves and executes the instructions.
- the instructions received by main memory may optionally be stored in memory either before or after execution by the processor.
- instructions may be received via a communication port as electrical, electromagnetic or optical signals, which are exemplary forms of wireless communications or data streams that carry various types of information.
Abstract
Systems and methods are disclosed herein for improving audio conferencing services. One aspect relates to processing audio content of a conference. A first audio signal is received from a first conference participant, and a start and an end of a first utterance by the first conference participant are detected from the first audio signal. A second audio signal is received from a second conference participant, and a start and an end of a second utterance by the second conference participant is detected from the second audio signal. The second conference participant is provided with at least a portion of the first utterance, wherein at least one of start time, start point, and duration is determined based at least in part on the start, end, or both, of the second utterance.
Description
- This disclosure claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 61/842,331, filed on Jul. 2, 2013, which is hereby incorporated herein by reference in its entirety. This application is related to co-pending PCT Application No. (Attorney Docket No. 003108-0007-WO1) filed Jul. 1, 2014, which is hereby incorporated herein by reference in its entirety.
- This disclosure relates to services provided during and after audio conferencing.
- Conferencing is an important way for a set of individuals who are remote from one another to communicate. Existing conferencing systems connect the conference participants in real time, and play the same audio or video to all participants in real time. These conferencing systems are associated with several disadvantages or problems. First, during a conference, participants may tend to interrupt one another. Such interruptions cause the participants to lose their train of thought, and ideas are lost. In particular, when one participant is interrupted by another, the original participant may be distracted listening to what the interrupter is saying and would likely lose his original thought. Alternatively, if the interrupter waits to speak until the original participant is done, the interrupter may lose his own thought and may never find a moment to contribute it to the conversation. Second, it can be difficult for listeners to effectively understand what the participants are saying when two or more conference participants are trying to speak over one another. When this happens, whichever speaker prevails may be affected by circumstances such as rank of the participants, which would impede the useful flow of the information. Third, conferencing systems may sometimes have poor channel conditions that cause delays, which may increase the frequency of interruptions. Also, someone participating in a conference may be distracted at his location and miss important content. Furthermore, someone who is interested in only a part of the content of the conference has to attend the whole conference to hear or say his part. Techniques are needed to improve existing conferencing services to remedy these problems.
- Systems and methods are disclosed herein for improving audio conferencing services. One aspect relates to processing audio content of a conference. A first audio signal is received from a first conference participant, and a start and an end of a first utterance by the first conference participant are detected from the first audio signal. A second audio signal is received from a second conference participant, and a start and an end of a second utterance by the second conference participant is detected from the second audio signal. The second conference participant is provided with at least a portion of the first utterance at a time that is determined based at least in part on the start, the end, or both the start and the end of the second utterance.
- In one embodiment, the time corresponds to at least one of a start time for providing the portion of the first utterance, a start point of the portion of the first utterance, and a duration of the first utterance.
- In one embodiment, the portion of the first utterance is provided to the second conference participant before the start of the second utterance or after the end of the second utterance. The first utterance and the second utterance may overlap in time, and the providing of the portion of the first utterance may be based on determining that the first and second utterances overlap in time. In particular, the start of the second utterance may occur after the start of the first utterance and before the end of the first utterance, and the portion of the first utterance may be based on a previous portion of the first utterance that is provided to the second conference participant before the start of the second utterance. In an example, upon detection of the start of the second utterance, the first and the second conference participants may be switched to a mode in which utterances are played sequentially to the first and second conference participants.
- In one embodiment, in response to detecting the start of the second utterance, the providing of the portion of the first utterance to the second conference participant is stopped. An indication may be stored of a point in the portion of the first utterance at which the providing to the second conference participant was stopped, where in response to detecting the end of the second utterance, the providing of the portion of the first utterance to the second conference participant is resumed at the point referenced by the stored indication. In an example, resuming the providing of the portion of the first utterance to the second conference participant at the point referenced by the stored indication may include accessing a recorded version of the first audio signal at the point referenced by the stored indication, playing the portion of the first utterance from the point referenced by the stored indication (optionally at an accelerated rate), and providing conference audio to the second conference participant in real time when playback of the recorded version terminates. The recorded version of the first audio signal may be stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of the conference participants. Playing the portion of the first utterance may include playing the plurality of audio clips sequentially from the point referenced by the stored indication. In an example, the recorded version of the first audio signal is stored as at least some of a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants. The plurality of audio clips may be played from the point referenced by the stored indication in the same manner in which they were recorded, where two or more of the pluralities of audio clips are played in an overlapping manner when the corresponding conference audio included overlapping utterances from multiple conference participants.
- In one embodiment, the start of the first utterance by the first conference participant is detected by monitoring a volume level of an audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to a threshold value, and determining the start of the utterance when the monitored volume level of the audio stream exceeds the threshold value. The end of the first utterance by the first conference participant may be detected by monitoring the volume level of the audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to the threshold value, and determining the end of the utterance when the monitored volume level of the audio stream falls below the threshold value for a predefined duration of time.
- In one embodiment, detecting the start of the first utterance includes receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output. Detecting the end of the first utterance may include receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output. In response to detecting the start of the first utterance, a recording of the first utterance by the first conference participant may be initiated, and in response to detecting the end of the first utterance, the recording of the first utterance by the first conference participant may be terminated.
- In one embodiment, the recorded utterance is stored as an audio clip in a playlist, where the playlist includes a plurality of audio clips of utterances by other conference participants. The stored audio clip in the playlist may be automatically categorized under a section identifying the conference or a subject of the conference, and may be automatically tagged with information identifying the first conference participant. User input may be received that is indicative of data to associate with the stored audio clip in the playlist, and the data may be stored with an association to the stored audio clip. The stored data may include at least one of a subject, description, transcription, keyword, flag, digital file, and uniform resource locator.
- One aspect relates to a system for processing audio content of a conference. The system comprises an audio detector configured to receive a first audio signal from a first conference participant, detect, from the first audio signal, a start and an end of a first utterance by the first conference participant, receive a second audio signal from a second conference participant, and detect, from the second audio signal, a start and an end of a second utterance by the second conference participant. The system further comprises a transmitter configured to provide, to the second conference participant, a portion of the first utterance including a delayed version of at least a portion of the first utterance at a time determined based at least in part on the start, the end, or both the start and the end of the second utterance.
- One aspect relates to a non-transitory computer-readable medium comprising computer-readable instructions encoded thereon for processing audio content of a conference. The computer-readable instructions comprise instructions for receiving a first audio signal from a first conference participant, detecting, from the first audio signal, a start and an end of a first utterance by the first conference participant, receiving a second audio signal from a second conference participant, and detecting, from the second audio signal, a start and an end of a second utterance by the second conference participant. The computer-readable instructions further comprise instructions for providing, to the second conference participant, at least a portion of the first utterance at a time determined based at least in part on the start, the end, or both the start and end of the second utterance.
- One aspect relates to a system or method for processing audio content of a conference. A processor provides audio from the conference to a first conference participant, detects a start of an utterance by the first conference participant, and in response to detecting the start of the utterance, stops the provision of the audio from the conference to the first conference participant. An indication of a point in the audio from the conference at which the provision of the audio from the conference to the first conference participant was stopped is stored, and an end of the utterance by the first conference participant is detected. In response to detecting the end of the utterance, the processor resumes the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication.
- In one embodiment, detecting a start of an utterance by the first conference participant comprises monitoring a volume level of an audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to a threshold value, and determining the start of the utterance when the monitored volume level of the audio stream exceeds the threshold value. Detecting an end of the utterance by the first conference participant may comprise monitoring the volume level of the audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to the threshold value, and determining the end of the utterance when the monitored volume level of the audio stream falls below the threshold value for a predefined duration of time.
- In one embodiment, detecting a start of an utterance by the first conference participant comprises receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output. An end of the utterance by the first conference participant may be detected by receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
- In one embodiment, in response to detecting the start of the utterance, the processor initiates a recording of the utterance by the first conference participant. In response to detecting the end of the utterance, the processor terminates the recording of the utterance by the first conference participant. The recorded utterance may be stored as an audio clip in a playlist, where the playlist includes a plurality of audio clips of utterances by other conference participants. The stored audio clip in the playlist may be automatically categorized under a section identifying the conference or a subject of the conference, and the stored audio clip may be automatically tagged with information identifying the first conference participant.
- In one embodiment, resuming the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication comprises accessing a recorded version of the audio from the conference at the point referenced by the stored indication, playing the recorded version of the audio from the conference from the point referenced by the stored indication at an accelerated rate, and providing the audio from the conference to the first conference participant in real time when playback of the recorded version terminates. The recorded version of the audio from the conference may be stored as a plurality of audio clips in a playlist, where each audio clip includes an utterance by one of a plurality of conference participants. Playing the recorded version of the audio from the conference may comprise playing the plurality of audio clips sequentially from the point referenced by the stored indication. In some embodiments, the recorded version of the audio from the conference is stored as a plurality of audio clips in a playlist, where each audio clip includes an utterance by one of a plurality of conference participants. Playing the recorded version of the audio from the conference may comprise playing the plurality of audio clips from the point referenced by the stored indication in the same manner in which they were recorded, wherein two or more of the plurality of audio clips are played in an overlapping manner when the corresponding audio from the conference included overlapping utterances from multiple conference participants. In some embodiments, the processor receives user input of data to associate with the stored audio clip in the playlist and stores the data with an association to the stored audio clip. The stored data comprises at least one of a subject, description, transcription, keyword, flag, digital file, and uniform resource locator.
- One aspect relates to a system for processing audio content of a conference. The system comprises a transmitter configured to provide audio from the conference to a first conference participant and an audio detector configured to detect a start of an utterance by the first conference participant and detect an end of the utterance by the first conference participant. The system further comprises a processor configured to, in response to detecting the start of the utterance, stop the provision of the audio from the conference to the first conference participant. The processor is further configured to store, in a memory, an indication of a point in the audio from the conference at which the provision of the audio from the conference to the first conference participant was stopped, and in response to detecting the end of the utterance, resume the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication.
- In one embodiment, the audio detector is configured to detect a start of an utterance by the first conference participant by monitoring a volume level of an audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to a threshold value, and determining the start of the utterance when the monitored volume level of the audio stream exceeds the threshold value. The audio detector may be configured to detect an end of the utterance by the first conference participant by monitoring the volume level of the audio stream sourced from the first conference participant, comparing the monitored volume level of the audio stream to the threshold value, and determining the end of the utterance when the monitored volume level of the audio stream falls below the threshold value for a predefined duration of time.
- In one embodiment, the audio detector is configured to detect a start of an utterance by the first conference participant by receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output. The audio detector may be configured to detect an end of the utterance by the first conference participant by receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
- In one embodiment, the processor is further configured to, in response to detecting the start of the utterance, initiate a recording of the utterance by the first conference participant, and in response to detecting the end of the utterance, terminate the recording of the utterance by the first conference participant. The processor may be further configured to store a reference to the recorded utterance as an audio clip in a playlist, wherein the playlist includes a plurality of audio clips of utterances by other conference participants. The processor may be further configured to automatically categorize the stored audio clip in the playlist under a section identifying the conference or a subject of the conference, and automatically tag the stored audio clip with information identifying the first conference participant.
- In one embodiment, the processor is configured to store a first index point corresponding to the start of the utterance in response to detecting the start of the utterance, and store a second index point corresponding to the end of the utterance in response to detecting the end of the utterance.
- In one embodiment, the processor is configured to resume the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication by accessing, from the memory, a recorded version of the audio from the conference at the point referenced by the stored indication, providing the recorded version of the audio from the conference from the point referenced by the stored indication for playback at an accelerated rate, and providing the audio from the conference to the first conference participant in real time when playback of the recorded version terminates. The recorded version of the audio from the conference may be stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants. The processor may be configured to provide the recorded version of the audio from the conference for playback by providing the plurality of audio clips for sequential playback from the point referenced by the stored indication. The recorded version of the audio from the conference may be stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants. The processor may be configured to provide the recorded version of the audio from the conference for playback by providing the plurality of audio clips for playback in the same manner in which they were recorded from the point referenced by the stored indication, where two or more of the plurality of audio clips are played in an overlapping manner when the corresponding audio from the conference included overlapping utterances from multiple conference participants. In some embodiments, the processor is further configured to receive user input of data to associate with the stored audio clip in the playlist, and store the data with an association to the stored audio clip. The stored data may comprise at least one of a subject, description, transcription, keyword, flag, digital file, and uniform resource locator.
- One aspect relates to a non-transitory computer-readable medium comprising computer-readable instructions encoded thereon for processing audio content of a conference. The computer-readable instructions comprise instructions for providing audio from the conference to a first conference participant, detecting a start of an utterance by the first conference participant, and in response to detecting the start of the utterance, stopping the provision of the audio from the conference to the first conference participant. The computer-readable instructions further comprise instructions for storing an indication of a point in the audio from the conference at which the provision of the audio from the conference to the first conference participant was stopped, detecting an end of the utterance by the first conference participant, and in response to detecting the end of the utterance, resuming the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication.
- The above and other features of the present disclosure, including its nature and its various advantages, will be more apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings in which:
-
FIG. 1 is an example diagram of various timelines during a conference in voicechat mode, in accordance with an implementation of the disclosure. -
FIG. 2 is an example display that is shown to a user for joining a conference, in accordance with an implementation of the disclosure. -
FIGS. 3A and 3B show an example display that allows a user to access and play audio clips, conferences, a playlist, or subdivisions of a playlist, in accordance with an implementation of the disclosure. -
FIGS. 4 and 5 are example displays of a notifications dialog, in accordance with an implementation of the disclosure. -
FIG. 6 is an example display of a prompt for a user to modify playback settings, in accordance with an implementation of the disclosure. -
FIG. 7 is an example display of an interface for filtering audio clips, in accordance with an implementation of the disclosure. -
FIG. 8 is an example display of transcriptions associated with audio clips, in accordance with an implementation of the disclosure. -
FIG. 9 is an example display of a message displayed to a user to confirm the user wishes to reply to an audio clip, in accordance with an implementation of the disclosure. -
FIG. 10 is an example display of a message displayed to a user to confirm the user wishes to continue a conversation, in accordance with an implementation of the disclosure. -
FIG. 11 is an example display of an option to auto-join a conference, in accordance with an implementation of the disclosure. -
FIG. 12 is a flowchart of a process for conference enhancement, in accordance with an implementation of the disclosure. -
FIG. 13 is an example display of a roster, in accordance with an implementation of the disclosure. -
FIG. 14 is a block diagram of a computerized system for performing any of the techniques described herein, in accordance with an implementation of the disclosure. - To provide an overall understanding of the invention, certain illustrative embodiments will now be described, including a system for providing various services to an audio conference. However, it will be understood by one of ordinary skill in the art that the systems and methods described herein may be adapted and modified as is appropriate for the application being addressed and that the systems and methods described herein may be employed in other suitable applications, and that such other additions and modifications will not depart from the scope of the present disclosure. Moreover, one of ordinary skill in the art will understand that other embodiments may be used to implement the systems and methods described herein.
- Systems and methods for improving video and audio conferencing services are provided. Specifically, techniques are described herein for processing the audio content of conferences in a manner advantageous to real-time or future playback. These techniques may, for example, reduce the deleterious effects of multiple conference participants speaking at the same time. The techniques described herein also enable users to access audio content of conferences in a variety of useful ways. For instance, audio content from one or more conferences may be stored in a playlist as “clips” and delineated by speaker, topic, and/or other criteria. Tools may be provided for accessing, modifying, and/or augmenting the clips in the playlist. For example, users may be provided with search and sort tools, and may be able to tag or otherwise associate data with the audio clips. Users may also be provided with tools for efficiently playing audio clips in the playlist, e.g., using various filtering and playback settings. In some configurations, users may be able to add new audio clips to the playlist, which may be linked to existing clips and conferences in a number of ways. As used herein, a “clip” may refer to a single audio file corresponding to a single utterance spoken by a user. Alternatively, a “clip” may refer to a portion of a longer audio file that includes multiple utterances spoken by one or more users. In this case, the clip refers to the portion of the longer audio file that corresponds to a single utterance, and index points may be used to indicate the beginning and end of the clip within the longer audio file. Moreover, users participating in a conference often have external interruptions that are distracting. In this case, the systems and methods of the present disclosure allow such users to pause the conference and return to the conference, picking up where they left off. In this manner, users may pause the conference, direct their full attention to the interruption, and return to the conference at the paused point so that upon return to the conference they may devote their full attention to the conference and catch up (with optional accelerated playback) to the other participants without missing any portion of the conference. Moreover, the present disclosure provides systems and methods for the conference to continue even after all participants have disconnected from the live conference. In particular, the audio and/or video signals recorded during the conference are saved, and users may return to play the signals and record new content to continue the conversation.
- In an embodiment, as individuals speak during a conference, audio is recorded for each individual and the sound levels monitored to determine the start and end of each utterance. These utterances may be indexed and/or used to create clips that are added to a playlist for subsequent playback. The clips may then be replayed naturally in a selected mode. For example, a first mode may correspond to a “conference mode,” in which the audio clips are mixed and voices overlap in a reenactment of the actual conference. The “conference mode” may be referred to as a “natural mode” because the reenactment of the actual conference is similar to playing the conference in real time. A second mode may correspond to a “voicechat mode,” in which the clips are played sequentially one-by-one in the order of their start times. The “voicechat mode” may be referred to as a “sequential mode” because the clips are played sequentially. A third mode may correspond to an “interleaved sequential mode,” in which the clips are played in the sequence of their start time with earlier starting clips being paused during playing of the later overlapping clips. The “interleaved sequential mode” may be referred to as the “threaded mode,” in which clips corresponding to the same topic or conversational thread may be played sequentially, even if intervening clips corresponding to different topics or conversations were recorded in between the clips.
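- As a rough, non-authoritative sketch of how the three playback modes described above might order recorded clips, the following Python fragment assumes a simple Clip record with a speaker, a start time, a duration, and an optional thread tag; all names and the scheduling policies are illustrative assumptions rather than the actual implementation.

    # Illustrative sketch only: orders recorded clips for the playback modes
    # described above. Clip fields and scheduling policies are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Clip:
        speaker: str
        start: float      # seconds into the conference when the utterance began
        duration: float
        thread: str = ""  # optional topic/thread tag

    def conference_mode(clips):
        # "Natural" reenactment: every clip keeps its original start time,
        # so overlapping utterances are mixed just as they were spoken.
        return [(c.start, c) for c in clips]

    def voicechat_mode(clips):
        # Sequential playback: clips play one by one in order of start time,
        # each one beginning when the previous one ends.
        schedule, t = [], 0.0
        for c in sorted(clips, key=lambda c: c.start):
            schedule.append((t, c))
            t += c.duration
        return schedule

    def threaded_mode(clips):
        # "Threaded" playback: clips sharing a thread tag are grouped and
        # played back to back, with groups ordered by their earliest start.
        threads = {}
        for c in sorted(clips, key=lambda c: c.start):
            threads.setdefault(c.thread, []).append(c)
        schedule, t = [], 0.0
        for group in sorted(threads.values(), key=lambda g: g[0].start):
            for c in group:
                schedule.append((t, c))
                t += c.duration
        return schedule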
- In some implementations, one of these three modes may be set as a default mode, and a user wishing to transition to a different mode may select different mode buttons on the interface. In an example, the voicechat mode may be the default mode, and the user may select a conference mode button to transition to the conference mode, or an interleaved sequential mode button to transition to the interleaved sequential mode. Moreover, a user's last used mode may be remembered, such that if the user logs out of the system or otherwise closes the interface, the system may apply the last used mode the next time the user logs in. Such settings may be associated with the user's account, such that the system may remember the user's preferences across different devices.
-
FIG. 1 is an example diagram of various timelines during a conference in voicechat mode. As shown in FIG. 1, there are at least three participants, Alice, Bob, and Charlie, in the conference. The diagram includes five rows corresponding to real time (top row), what Alice hears (second row), what Bob hears (third row), what Charlie hears (fourth row), and what a passive listener hears (bottom row). In real time, the conference participants interrupt one another, and their utterances overlap. In particular, Bob begins speaking at time t=1, during Alice's first utterance. Charlie begins speaking at time t=2, also during Alice's first utterance and during Bob's first utterance. Nobody speaks from t=7 to t=10, and Alice speaks her second utterance from t=10 to t=14. - In voicechat mode, the clips are played sequentially in the order of their start times. In the second row of
FIG. 1 , even though Bob begins speaking at time t=1, Alice does not hear his utterance until she is done speaking her first utterance at time t=3. Moreover, Alice does not hear Charlie's utterance that actually began at time t=2 until time t=8 after she has heard Bob's complete utterance. Alice selects to speak her second utterance starting at time t=10, such that Charlie's second utterance that actually began at time t=5 is not played to Alice until her second utterance is complete at time t=14. - According to the third row of
FIG. 1 , Bob interrupts Alice at time t=1, such that Bob does not hear the last two seconds of Alice's first utterance until Bob is done speaking his first utterance at time t=6. Immediately after Bob is done listening to Alice's first utterance, Charlie's two utterances are played back-to-back, followed by Alice's second utterance. According to the fourth row ofFIG. 1 , Charlie interrupts Alice's first utterance beginning at time t=2, and does not hear the rest of Alice's first utterance until Charlie's first utterance is complete at time t=4. According to the bottom row ofFIG. 1 , a passive listener listening to the conference simply hears the clips played sequentially in the order of their start times. In particular, Alice's first utterance is played first, followed by Bob's utterance, Charlie's two utterances, and Alice's second utterance. As is shown inFIG. 1 , all participants are caught up to the same point in the conference at the same time (i.e., shown as time t=16 inFIG. 1 ). In particular, because all participants are either listening to or speaking the same utterances, but perhaps broken up in different ways, all participants may return to the same live point in the conference after listening to the same content. In this manner, the systems and methods of the present disclosure allow conference participants to hear every utterance without interference from overlapping speakers and without being interrupted, while maintaining a consistent schedule such that the participants are all roughly at the same point in the conference. - In some implementations of the present disclosure, the audio data are stored as a separate audio clip file for each utterance. In particular, each audio clip may correspond to a single utterance spoken by a single user. In this case, a single user may be associated with multiple audio clip files, and the audio clip files may be tagged according to subject threads. Alternatively, the audio data may be stored in a single continuous audio file, and index points may refer to timestamps of the continuous audio file indicative of when a speaker begins and stops speaking. In this case, different continuous audio files may be stored for each user or speaker, and the timestamp indexes may be saved as metadata associated with the audio files. The indexes may also be used to follow a subject thread. In particular, a separate file may be used to store the various indexes indicating start and end times of utterances regarding the same subject thread. In this manner, index points indicative of time offsets and duration may be used to break up long audio files, rather than storing separate audio clip files for each utterance. As used herein, a clip, an audio clip, or an audio clip file may refer to an individual audio file storing the utterance, or may refer to a portion of a longer audio file storing multiple utterances, where the portion containing the utterance is indicated by one or more index points.
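- The two storage layouts described above, a separate audio file per utterance versus a portion of a longer file delimited by index points, could be represented roughly as follows; the class name, field names, and file names are purely illustrative assumptions.

    # Rough sketch of the two clip-storage layouts described above.
    # Class and field names are assumptions, not the actual schema.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class ClipRef:
        speaker: str
        threads: list = field(default_factory=list)   # subject-thread tags
        # Layout 1: one audio file per utterance.
        clip_path: Optional[str] = None
        # Layout 2: a portion of a longer per-speaker recording, delimited
        # by index points (seconds from the start of that file).
        source_path: Optional[str] = None
        start_offset: Optional[float] = None
        end_offset: Optional[float] = None

        def is_indexed(self) -> bool:
            return self.source_path is not None

    # The same utterance expressed both ways (hypothetical file names).
    as_file = ClipRef(speaker="Alice", threads=["Reply feature"],
                      clip_path="alice_utt_0001.wav")
    as_index = ClipRef(speaker="Alice", threads=["Reply feature"],
                       source_path="alice_full.wav",
                       start_offset=12.4, end_offset=17.9)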
- In an embodiment, while a user speaks during a conference, the audio output from the conference is paused for a user when the user begins to speak, and an audio clip is recorded for each individual who is speaking at the same time, including the user. When the user finishes speaking, the audio clips may be played back to him in conference mode, voicechat mode, or interleaved sequential mode. Optionally, the audio clip corresponding to the user may be omitted during playback. In some embodiments, the playback is accelerated so that the user may catch up to the real-time conference.
- Often on a conference call, participants will mute the microphone of the phone or computer by which they connect to the conference, to prevent unwanted background noise or interruptions, such as a cell phone call, from interfering with the call. When they wish to speak on the conference, they unmute the microphone. When they stop speaking, participants often mute the microphone again, so that others on the conference are not disturbed by background noise from their environment.
- These mute and unmute events can be detected and used to stop the conference call's live audio and start the recorded playback. For instance, when a first speaker unmutes to speak, the live audio may be stopped so that other speakers do not distract the first speaker. Then, when that speaker finishes speaking and mutes again, the recording of any other utterances is played back to the speaker. A user may select to link (1) a selection of the unmute button to a pause of the audio, and (2) a selection of the mute button to a play of the audio.
- Alternatively, play and pause buttons may be used to implicitly signal mute and unmute, respectively. In particular, when the user selects to play the audio, the user may be automatically muted, and when the user selects to pause the audio, the user may be automatically unmuted. This may be referred to as a play/pause feature, which may be used during a live conference. In particular, the user may select a pause button to switch mute off, such that the user may speak in an uninterrupted fashion while the audio is paused. Moreover, the user may select a play button to switch mute on, so that the user may listen to the conference. In another example, mute, unmute, pause, and play buttons may all be explicitly provided so that the user may have the flexibility to configure the user settings.
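- A minimal sketch of the play/pause feature described above, in which pausing the conference audio implicitly unmutes the microphone and resuming playback mutes it again; the audio and microphone interfaces here are hypothetical stand-ins, not an actual API.

    # Sketch of the play/pause feature described above. The conference_audio
    # and microphone objects are hypothetical stand-ins for whatever
    # interfaces a real conferencing client exposes.
    class PlayPauseController:
        def __init__(self, conference_audio, microphone):
            self.audio = conference_audio
            self.mic = microphone

        def pause(self):
            # User presses pause: stop the conference audio and open the mic
            # so the user can speak without interruption.
            self.audio.pause()
            self.mic.unmute()

        def play(self):
            # User presses play: close the mic and resume listening.
            self.mic.mute()
            self.audio.resume()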
- Optionally, the playback is performed at an accelerated speed until the first speaker is caught up to the real time conference. When the speaker is caught up, the playback is stopped, and live audio is resumed. Alternatively, the user may select to skip the playback and immediately join the live conference without listening to the playback. In some embodiments, if the first speaker does not use the mute control, the detection of speech and cessation of speech can be used as automated triggers to achieve the same effect as unmute and mute. Detection of the cessation of speech may involve detecting that an amount of sound level recorded at the speaker's microphone is below a threshold level for a certain period of time. Moreover, a small amount of buffer time may be used after the detection of the cessation of speech but before the end of the clip to ensure that the utterance has ended.
- As used herein, the term “conference” should be understood to encompass an audio or a video conference (or other multimedia conference) having an audio component. Conference participants may participate in a conference in real time, or different users may participate in the conference at different times. In particular, a conference may include a live conversation that is recorded and capable of being listened or watched at a later time, audio or video content that may be recorded by one user that is not connected to a live conference, or a combination thereof. In particular, a conference may include a live portion, during which users are connected over the same line and may interact with one another. The live portion may be capable of being played to a participant or a non-participant of the live conference, who may then record a response to some content discussed during the live portion. Such a recording may be referred to as an offline portion of the conference. In an embodiment, multiple participants may access a conference using a wide-range of devices and/or applications, such as land phones, mobile phones, tablets, computers, or any suitable device for accessing a conference. For example, one participant may use a phone to dial-in to a conference, while another joins the conference using an Internet service that he accesses using a personal computer. In some implementations, the present disclosure relates to interactive web collaboration systems and methods, which are described in U.S. Pat. No. 7,571,212, entitled “Interactive web collaboration systems and methods,” which is incorporated herein by reference in its entirety. In some implementations, the present disclosure uses systems and methods similar to a musical application for creating and editing a musical enhancement file to process audio data collected during a conference. Such a musical application is described in U.S. Pat. No. 7,423,214, entitled “Systems and methods for creation and playback performance,” which is incorporated herein by reference in its entirety.
-
FIG. 2 shows two example displays that may be shown to a user for joining a conference. In particular, the left hand side ofFIG. 2 provides an example screen for the user to enter the user's call id for entering the conference system, and optionally the call id of one or more other users for inviting to enter the conference system. In some embodiments, the conference may be initiated from a web page set up for a particular conference id. Alternatively, the user may enter a conference id indicative of a particular conference that the user wishes to join (not shown inFIG. 2 ). The right hand side ofFIG. 2 allows the user to set various conference settings, such as mute, headset, volume, and gain. Regardless of the means used to access the conference, the sound recording from each participant may be stored as individual audio clips and associated with data. This data may include metadata that identifies the particular conference, the speaker, the time and date, the subject of the conference, the duration of the clip, any assets shown during the conference, such as a presentation or a view of a screen, and/or any other suitable information. The audio clips may be stored together with the data or metadata and added to a playlist accessible by users, who may be the participants of the conference or non-participants of the conference. - When a user accesses a playlist, the user may be presented with a display screen, such as the exemplary display shown in
FIGS. 3A and 3B. The display in FIGS. 3A and 3B allows the user to access and play individual audio clips, whole conferences, the entire playlist, or subdivisions thereof such as individual subject threads. In particular, the display in FIGS. 3A and 3B includes a foreground section labeled “Conversational threads,” which lists the various subject headings associated with different tracks, and the message number (“Msg”) associated with the beginning of each corresponding track. The background of the display of FIGS. 3A and 3B includes an expanded list of the various tracks listed in the foreground “Conversational threads” section, including a list of all messages in each track, the participant who spoke the corresponding utterance, and the text of the utterance. - The user may select to play clips individually or in sequence. After listening to a selected clip, the user may select to record a new audio clip to “reply” to the clip or to “continue” the conversation. The new audio clip may be recorded in real-time, automatically linked to the originally selected clip, and/or added to the playlist. Upon future playback, the new clip may optionally be played immediately after the originally selected clip. Although the new clip may be demarcated as having been recorded after the original conference took place (e.g., using a visual or audio indicator), this process enables the clip to be played back as if it were spoken during the conference.
- In some configurations, a user can select to play a “thread,” which is limited to the selected clip and any replies to a subject indicated by a heading and continuation headings as the conversation moves back and forth over various subjects. In particular, users may select to record utterances into one of a number of ‘tracks’. In some embodiments, tracks may be associated with individual participants, individual topics, or a combination of both. Some playback modes may play each utterance based on its start time, regardless of its track, while other playback modes may allow a user to select to listen to one or more tracks simultaneously, or play all the utterances in one track followed by all those in the next track. In some embodiments, threads may be implemented using tracks. For example, upon being recorded, an utterance may be assigned to a track automatically, such as the track of the previous utterance that was recorded. The user may select to update the default track to a different track or a new track. After the utterance is recorded, headings or tags may be added to the audio clip to add the clip to one or more suitable threads. In an example, a listener may wish to play the thread related to the “Reply feature” listed in the “Conversational threads” dialog box shown in
FIGS. 3A and 3B. In this case, the listener may select the heading of the “Discussion on ‘Reply’ feature” thread, and the utterances tagged with the appropriate thread are played. In particular, the utterances labeled 72, 73, 83, 84, and 85 are played in sequence, and utterances 74-82 are skipped because these utterances are related to a different thread. - When a user requests to “continue” a conversation, the user records a new audio clip or multiple clips in real-time and that audio clip is automatically added to the playlist and linked to the same discussion topic as the selected clip. This process enables multiple users to have continuous yet asynchronous verbal discussions. For example, a first and second user may participate in a conference on a Monday and the audio from that conference may be stored in the playlist under a heading indicating the conference date and/or topic. A third user may then play the conference the following Tuesday and select to “continue” the discussion. This feature allows the third user to record one or more audio clips and to link them to the original conference in the playlist, e.g., so that they appear under the same heading or are otherwise labeled so as to indicate their association with a subject thread in the original conference. The other conference participants may be alerted to the fact that the third user added a new clip to the conversation and may play the new clip. In some implementations, utterances labeled with “continue” or “reply” may be automatically recorded to different tracks to distinguish such utterances from utterances belonging to the main thread.
- These other participants, e.g., the first user, may in turn “reply” to that clip or “continue” the conversation in the same manner. Thus, a continuous verbal discussion, including conferences between multiple participants and individual contributions, and which allows for asynchronous communication, may be maintained.
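- As a rough illustration of the thread playback described above (e.g., playing messages 72, 73, 83, 84, and 85 while skipping 74-82), clips carrying the selected thread tag could simply be filtered and played in message order; the field and function names below are assumptions.

    # Illustrative sketch: play only the clips tagged with a selected thread,
    # in message order, skipping intervening clips from other threads.
    def play_thread(clips, thread_name, play_fn):
        # clips: iterable of objects with .msg_no and .threads attributes
        # play_fn: callable that plays a single clip (assumed to exist)
        selected = [c for c in clips if thread_name in c.threads]
        for clip in sorted(selected, key=lambda c: c.msg_no):
            play_fn(clip)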
-
FIGS. 4 and 5 show example displays of a notifications dialog, in accordance with an implementation of the present disclosure. The display inFIG. 4 shows the data sorted according to conversations, while the display inFIG. 5 shows the data sorted according to pages. As is shown inFIG. 4 , the owners of various extensions are listed in the display, and the user may select to modify a type and an interval for each corresponding owner and extension. In the example shown inFIG. 4 , each extension corresponds to an owner id (i.e., the first three digits of the extension), followed by a conversation id (i.e., the last 6 digits of the extension). Each owner of an extension may select to receive notifications when an update to the conversation is made, or at fixed time intervals indicated by the interval. The type field shown inFIG. 4 refers to a type of notification. In an example, the digest type of notification may correspond to a summary of all the changes that have occurred in the conversation since the last notification was sent. Examples of changes include identifiers of users who made what changes and when such changes were made. Alternatively, another type of notification is a “link,” for which the user receives a link to the changed entity within the conversation. Other types of notifications with different levels of detail may be used, and the users may select to receive notifications having different levels of detail for different conversations or extensions. Similarly, as is shown inFIG. 5 , the user may select to modify a type and/or an interval for each extension. -
FIG. 6 shows an exemplary display of the system prompting the user to modify playback settings, according to an implementation of the present disclosure. In an embodiment, a user may modify a number of playback settings to control how audio clips are played in the playlist. Playback settings may be set globally or specifically for individual speakers. For example, a user may set a play speed, or tempo, for all speakers, or the user may set the tempo individually for each speaker. As is shown inFIG. 6 , a user may enter the user's identifier, the user's nickname, the user's conference identifier, and the user's tempo, which may refer to the relative speed at which the user's audio is played. - Moreover,
FIG. 6 includes options for the user to set settings specific to the speakers. Such settings include tempo, pitch shift, filter, silence, and volume. Selecting the silence setting causes silences to be removed during the playback of the clips. Selecting the filter setting causes the audio signals for the corresponding speaker to be filtered, to remove noise for example. Optionally, audio characteristics may be set and/or adjusted automatically. For example, the tempo of each speaker can be detected, such as by detecting an average syllabic frequency uttered by the speaker, and automatically adjusted to match a user-selected target tempo or an interval of elapsed time available for catch-up. For example, the syllabic frequency of a speaker may be detected and compared to a threshold syllabic frequency. The threshold syllabic frequency may correspond to a fixed maximum syllabic frequency that is set for intelligibility, or may correspond to the syllabic frequency of the fastest speaker in the conference. The amount of speed-up applied to a speaker's utterance may be dependent on this comparison. In an example, the utterances spoken by slower speakers may be sped up at a higher rate than the utterances spoken by faster speakers, because the syllabic frequencies of slower speakers are further from the maximum syllabic frequency than the syllabic frequencies of the faster speakers. In this manner, utterances from different speakers may be individually adjusted (manually or automatically) in accordance with their tempos to ensure that the utterances are sped up for efficiency while still being intelligible. -
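- A rough sketch of the tempo adjustment described above, in which slower speakers are sped up more than faster ones relative to a threshold syllabic frequency chosen for intelligibility; the constants and function names are assumptions, not values taken from the disclosure.

    # Sketch: derive a per-speaker speed-up factor from a measured syllabic
    # frequency. Slower speakers (fewer syllables per second) receive a
    # larger factor, capped so that nobody is pushed past the threshold
    # rate. All numbers here are illustrative assumptions.
    def speedup_factor(speaker_rate_sps, threshold_rate_sps, max_factor=2.0):
        if speaker_rate_sps <= 0:
            return 1.0
        factor = threshold_rate_sps / speaker_rate_sps
        return max(1.0, min(factor, max_factor))

    # Example: a 3.5 syllable/sec speaker against a 5.0 syllable/sec
    # threshold is sped up by about 1.43x, while a 4.8 syllable/sec
    # speaker is barely adjusted (about 1.04x).
    print(speedup_factor(3.5, 5.0), speedup_factor(4.8, 5.0))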
FIG. 7 shows an exemplary display of an interface for filtering audio clips, according to an implementation of the present disclosure. In an embodiment, the user may configure various filters to control which audio clips in the playlist are played. Using one such filter, the user may select to play the audio clips of only a specific person or specific persons. For example, the user may select to play back all audio clips aside from those for which he is the speaker. Using another filter, the user may select to play back only audio clips associated with particular tags, keywords, or other data. Audio clips may be associated with tags, metadata, URLs, descriptions, priorities, external documents, etc. In some configurations, a user associates data with audio clips via manual data entry or by dragging and dropping the data onto a clip in the playlist. -
FIG. 8 shows an exemplary display of transcriptions associated with audio clips, according to an implementation of the present disclosure. Audio clips may be transcribed, and the transcription of the audio content may be made available via the playlist. The transcription may be manual, automatic, or a combination of both. For example, a user may select a clip, request automatic transcription, and then manually edit the results. It is contemplated that automatic transcription may be provided by outside services, e.g., Internet-based services. - The audio clip may be automatically transmitted to the Internet service and the resulting transcription automatically retrieved and stored with the clip. Users of such services may correct transcriptions, and the corrections may be transmitted back to the service so that the service may improve its accuracy. By providing information concerning the speaker for each clip, the subject thread in which he is speaking, and the vocabulary associated with that thread, the service can improve still further in accuracy. The effect of all these improvements together should enable automatic transcription to be utilized for provision of highly accurate text. The text may be used for communication, authoring of programming and user scripting languages, translation between natural languages, and targeted advertising, for example.
- In an embodiment, participants in a conference may engage in cooperative browsing. In this case, a participant shares a data object (e.g., a document, audio clip, video, URL, etc.), and the data object (or a reference to the data object such as a hyperlink) is automatically transmitted to each participant and displayed in real-time. In an example, the data object may involve a video stream of the participant's computer screen, so that the other participants may view the participant's screen during the conference. These data objects may also be stored and linked to particular audio clips in the playlist (and/or to particular index points within the audio clips). Upon playback, the objects may be redisplayed at the appropriate time just as in the live conference. In some implementations, shared assets related to a conference may be presented to a user as a synchronized slideshow during playback. Alternatively, such assets may be viewed as a collection of resources in a separate window.
- As discussed above, when a conference participant begins speaking during a conference, the conference audio may be paused for that participant so as to not interfere with his or her speaking. The conference audio may be accumulated and stored in the interim, e.g., as audio clips are added to the playlist. When the participant stops speaking, the stored audio content is subsequently played to the participant, so that the participant can listen to what the other participants said while he or she was speaking. The clips can then be replayed in “conference mode” in which the audio clips are mixed and voices overlap as they did in the actual conference, in sequential “voicechat mode” in which the clips are played one by one, or in “interleaved sequential mode” in which the clips are played one by one starting at the time relative to each other as they occurred with earlier starting clips paused during the playback of later starting clips. In the interleaved sequential mode, the clips that are played sequentially may correspond to a single thread, subject, or conversation. In particular, a reply clip that includes a reply to an original clip may be played immediately after the original clip in the interleaved sequential mode, even if intervening clips were recorded between the times that the original clip and the reply clip were recorded. In any such mode, the user may choose to accelerate playback so that he catches up to the live conference. In some configurations, the rate of acceleration may be automatically determined based on the elapsed time since the audio conference was stopped.
- A similar feature may be provided when a user is playing back a stored conference, e.g., using the playlist. If a user chooses to “reply” to a particular clip or “continue” a conversation, playback may be paused while the user records a new audio clip. Subsequently, playback may resume at a normal or accelerated rate.
FIG. 9 shows an exemplary display of a message displayed to a user to confirm that the user wishes to reply to an audio clip. As shown in FIG. 9, the user may provide a maximum duration for the reply. Selecting to reply to an original audio clip may cause the next recorded utterance to be associated with the original audio clip. FIG. 10 shows an exemplary display of a message displayed to a user to confirm that the user wishes to continue a conversation by recording a new clip related to an existing clip or set of clips. As shown in FIG. 10, the new recording may be automatically tagged to reflect the tags of the thread. Selecting to continue a conversation may cause the next recorded utterance to be tagged with the associated tags of the thread or conversation. In some implementations, utterances labeled with “continue” or “reply” may be automatically recorded to different tracks to distinguish such utterances from utterances belonging to the main thread. -
FIG. 11 shows an exemplary display of an option to auto-join a conference, in accordance with an implementation of the present disclosure. In particular, a user may select the option such that when he reaches the end of the playlist, he will be automatically added to the live conference and join with others already there and/or invite others to join him. - To store audio clips from each speaker, the sound level being transmitted from each participant's device may be monitored. In an embodiment, when the sound level surpasses a predefined threshold, recording may commence or an index point to a continuous recording is noted. As used herein, an individual clip may refer to the interval in a continuous recording between successive index points. When the sound level subsequently decreases below the same or a different threshold value, recording may stop, or another index point may be generated. Recording of the same clip or a different clip may resume if the sound level returns above a resume threshold within a defined period of time. The resume threshold may be the same or different from the predefined threshold that was originally used at the start of a clip. The thresholds may be different based on speakers or other variables such as background noise. In some embodiments, in order to optimize the size of the clip for transcription, translation or other purposes, the sound level threshold may be adjusted progressively as the clip is recorded. Other factors, such as vocabulary and phrase determination, may also be used to determine useful clip boundaries. Alternatively, if the sound level does not return above the threshold within the defined period of time, the recording may be terminated, and the audio clip is stored. The audio clip may also be associated with data, such as information identifying the speaker, topic, subject thread, related messages and vocabularies and grammars used. Speaker information may be provided, for instance, within the data stream transmitted by each participant's device, or the conferencing service may track each user when they access the conference. In particular, such metadata may include the speaker and the duration of the audio clip.
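- The level-based utterance detection described above can be viewed as a small state machine over the monitored sound level; the thresholds, hang time, and return values in this sketch are illustrative assumptions only.

    # Sketch of utterance boundary detection from monitored sound levels.
    # Thresholds and timings are illustrative; as described above, a real
    # system might tune them per speaker, adjust them progressively while a
    # clip is recorded, and account for background noise.
    class UtteranceDetector:
        def __init__(self, start_threshold=0.10, resume_threshold=0.08,
                     hang_time=0.8):
            self.start_threshold = start_threshold    # level that opens a clip
            self.resume_threshold = resume_threshold  # level that keeps it open
            self.hang_time = hang_time                # silence needed to close it
            self.in_utterance = False
            self.quiet_since = None

        def process(self, level, now):
            """Return 'start', 'end', or None for a (level, timestamp) sample."""
            if not self.in_utterance:
                if level >= self.start_threshold:
                    self.in_utterance, self.quiet_since = True, None
                    return "start"
                return None
            if level >= self.resume_threshold:
                self.quiet_since = None
                return None
            if self.quiet_since is None:
                self.quiet_since = now
            if now - self.quiet_since >= self.hang_time:
                self.in_utterance = False
                return "end"
            return None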
-
FIG. 12 is a flowchart of a process 100 for conference enhancement, in accordance with an embodiment of the present invention. The steps of FIG. 12 may be performed by a process (which may be a software application, e.g., executed on a local device or a remote server). At step 102, the sound level of a participant on a conference is monitored. At step 104, it is determined whether the participant is speaking, i.e., whether the sound level emanating from the participant's audio stream (e.g., input by a microphone or other audio input device) is greater than a defined threshold value. If not, the process 100 returns to step 102 to continue monitoring the participant's sound level. Otherwise, the process 100 proceeds to step 106, where recording is initiated of the participant's audio stream as an audio clip. In addition, upon determining that the participant has begun speaking, or in response to the initiation of recording (e.g., a file write routine), the conference audio is paused such that conference audio is not provided to the participant. In this manner, the participant may speak without being interrupted by other speakers. In particular, a channel routine in the participant's out channel may detect that the participant is speaking by means of a signal or message from the in channel which is monitoring the speech, and stop sending the conference audio. In some embodiments, the audio signals recorded from the conference are processed after the conference is over. In particular, such processing may include detecting a speaker or metadata associated with the clips. In this case, the “in channel” may refer to the playback, or the original raw clips, and the “out channel” may refer to the output of the processing or analysis, such as the speaker information or metadata associated with the utterances. In some embodiments, pausing the conference audio is optional. In this case, the user may select whether the process 100 should automatically stop the conference audio upon detecting speech. Alternatively, the user may control the provision of the conference audio manually (e.g., by selecting a button).
- Generally, the conference audio may be recorded from the beginning, e.g., as soon as the conference starts, and the audio may be stored as audio clips that are added to a playlist. In this scenario, the time reached in the conference when the audio is stopped is noted and stored, as reflected by step 108. For instance, an index point into the audio conference may be generated and stored. Alternatively, if the conference is not being recorded at the time it is stopped for the participant, it may be recorded at that point and stored until the user stops speaking. Optionally, a track, a subject thread, or both is initiated or continued from the point at which the conference is stopped and the participant commences speaking.
process 100 proceeds to step 110, where it determines whether the participant has stopped speaking. As mentioned above, a user may be deemed to have stopped speaking when the sound level drops below a threshold for a particular duration of time. Alternatively, a user may manually indicate that he has stopped speaking. If the user has not yet stopped speaking, theprocess 100 proceeds to step 112 and continues to record the participant's utterances in the audio clip. Theprocess 100 then loops betweensteps - After determining the user has stopped speaking, the
process 100 proceeds tosteps steps step 114, the audio clip is stored in the playlist, where the clip may be tagged, annotated, and played, for example. Atstep 116, the previously stopped conference audio is accessed at the location indicated by the index points stored atstep 108, and played back from that point. The conference may be played at a normal speed or at an accelerated pace, and the conference audio may be replayed in accordance with one of multiple modes. In conference mode, the audio is replayed as it was heard on the conference, with all audio streams from the various speakers mixed. In voicechat mode, the audio is replayed sequentially, with the audio streams of each speaker separated and played one after the other sequentially or interleaved in interleave sequential mode. After the participant has “caught up” to the live conference, the process may revert to transmitting the live conference audio. - The replaying of conference audio to the user can be by replaying the audio either sequentially or reenacting the conference as a mix from recordings of the clips by each other speaker as described earlier or from a recording of the audio mix as it would have been heard by that speaker by monitoring and recording the out-channel to each participant from the conference bridge as well as the in-channel (as there is a different mix for each participant with each mix leaving out that participant's voice). The clips may be analyzed in real time or after a conference to determine and save metadata associated with each clip. Such metadata may include the speaker of the clip or the duration of the clip. Alternatively a recording of the mix of all speakers may be used though in this case the user will hear his own speech played back to him.
- In an embodiment, the conference audio begins playing from the playlist at exactly the point indicated by the stored timing information. Alternatively, the conference audio may start replaying from a prior point, e.g., a user-adjustable time. The user may manually accelerate the conference audio or the user may request automatic acceleration (the latter may be a default setting). In addition, the user may control acceleration, i.e., set the speed at which the audio is replayed, or the process may determine the speed of acceleration automatically. In some configurations, the speed of acceleration may be determined from the elapsed time since the conference audio was stopped. For instance, if the participant spoke for 1 minute as did others in parallel and others continue to speak, and he desires to catch up to the live conference within 2 minutes the rate of acceleration may be calculated as 1.5×. There may, in some instances, be a maximum rate of acceleration for intelligibility. In some configurations, the rate of acceleration may be speaker-specific, e.g., to ensure intelligibility. For example, the process may automatically determine the tempo of each speaker and accelerate those tempos up by the same relative amount or to a global maximum tempo, or up to an average tempo necessary to enable the participant to catch up within a desired period of time.
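- The 1.5x example above can be computed directly from the accumulated backlog and the desired catch-up window, since live audio keeps accumulating while the participant catches up; the cap on the rate is an assumed parameter.

    # Sketch of the catch-up calculation described above: 1 minute of
    # backlog, caught up within 2 minutes, gives (60 + 120) / 120 = 1.5x.
    # The maximum rate for intelligibility is an assumed parameter.
    def catch_up_rate(backlog_sec, window_sec, max_rate=2.0):
        # The participant must play backlog + window seconds of audio
        # within window seconds of wall-clock time.
        rate = (backlog_sec + window_sec) / window_sec
        return min(rate, max_rate)

    print(catch_up_rate(60, 120))   # -> 1.5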
- The playback may also be paused or slowed, manually or automatically, at certain events or time intervals to enable the participant to add tags or other annotations. For example, the playback may pause after the termination of each audio clip. These pauses may be taken into account when calculating the total acceleration necessary to catch up to the live conference. In an embodiment, the user is provided with a feature to save his place in the conference audio during playback, so that he can switch between the live conference and playback at his desire. In this manner, for example, the user can catch up on a missed portion of the conference during one or more breaks in the live conference.
- During a conference, each participant may be provided with an indication of which other participants are actively listening to the live audio or are engaged in playback. Users may be able to “ping” or otherwise message the participants in playback mode to signal that their presence in live mode is requested. In some embodiments, in response to a ping request, the process may automatically save the playback participant's place in the conference playback and automatically connect him to the live conference.
- The playback mode features discussed above can be used advantageously in circumstances other than when a participant speaks. As discussed above, a user may replay a conference after the conference has terminated and access the features above, e.g., the tools enabling acceleration and/or sequential or normal playback. During playback, the user may select to omit clips from certain speakers, e.g., clips originating from the user himself. In some configurations, playback mode may be used during the live conference even when the participant is not speaking. Specifically, when several conference participants speak at once, a participant (or all participants) may be automatically switched to playback mode and clips of each speaker may be played sequentially. In an example, all participants may be switched to playback in voicechat mode, and the clips may be played such as is shown in
FIG. 1 . The playback, of course, can be at an accelerated rate, which may be configured on a speaker-by-speaker basis, as discussed above. Moreover, acceleration may be configured based on the accumulated backlog of audio clips. After listening to each clip (or skipping one or more of the clips), the participant may then be transitioned back to the live conference. The switching between playback mode and live mode may be seamless to the user, so that each participant experiences a pseudo-real-time conference free from individuals talking over one another. - Conference participants may also make use of playback mode electively, e.g., to recap something previously said, or in case the participant has to leave the conference for a period of time. In this scenario, the user can, during the live conference, access tools that allow the user to “pause,” “rewind,” or otherwise skip backwards in time, and then playback the conference as desired.
- The action of playing back an earlier part of the conference may optionally cause the conference audio to stop and go into catch up mode when the playback is paused. Alternatively playback may continue until it reaches the live conference. This may be done while on mute so that the playback does not affect the ongoing conference and the playback may be accomplished independently of the conference connection by means described earlier. In some embodiments, the detection of speech causes the conference live audio to stop. In general, the playback may be sent by the system to the telephone or computer connected to the conference in place of the usual conference mix. Echo cancelling may be used to prevent the playback from being heard back in the conference.
- Playback into Conference
- Alternatively, the user may wish to play back into the conference for all the conference participants to hear. In this case, the user may unmute his microphone at the start of the portion he wishes to play back, share the play back, and mute his microphone after the play back. In embodiments wherein the mute button is unnecessary or not used, the system may switch off echo cancelling or otherwise cause selected recordings to play into the conference such as through a playback station connected as another user and controlled by any user through a web interface. Alternatively, the mute and unmute buttons may be implemented, and the audio is automatically paused when the unmute is selected and is automatically played when the mute is selected.
- Participants may be informed of other participants' actions by displaying a roster of the participants. The roster may indicate presence information, such as indications of when a particular participant is speaking, or what portion of the conference one or more participants are listening to and at what speed. In some implementations, the roster may flag a significant change of speaker state (such as a participant joining or leaving a conference, for example) by displaying a popup notification on a user's desktop or browser.
-
FIG. 13 is an exemplary display of a roster, in accordance with an implementation of the present disclosure. In particular, the roster shown inFIG. 13 includes a list of names of conference participants, as well as a status of each participant. The status indicates whether the corresponding participant is listening to the conference, speaking into the conference, or neither (i.e., viewing the page). In particular, if the participant is neither listening nor speaking to the conference, the participant may be viewing and/or editing a webpage associated with the conference. Such information of what the participant is viewing or editing may be displayed on the roster. Moreover, the roster includes a current mode (i.e., conference or chat mode) associated with each participant. The state of each participant indicates whether the participant is muted or recording his voice. Furthermore, the “Last Rec” section of the roster indicates the last recording created by each corresponding user, and the “On Msg” section of the roster indicates the current message or clip that is being listened to by the user. The roster shown inFIG. 13 may be updated in real time as the various participants change their modes and states. In an example, the roster shown inFIG. 13 may also include a user option to select whether to play his own utterances during playback. - There may also be a timeline display of the conference, which displays the utterances of each speaker. The timeline display may indicate how the conversation shifts across different speakers and/or across various conversation threads. Different speakers and/or different threads may be indicated by different colors on segments of each thread line. In some implementations, the timeline display may provide a visual indicator referring to the track into which an utterance is recorded (such as a track number that is displayed for each utterance, for example). In some implementations, the timeline display may be shown in the roster as is described above. In the example view of the roster in
FIG. 13, the timeline display includes a row of twenty rectangles corresponding to each participant. Each row of rectangles corresponds to the last twenty utterances spoken by all the participants in the conference, and a highlighted rectangle in a particular row means that the corresponding participant spoke that utterance. Different colors or textures, or any other suitable graphical indicia, may be used to highlight a rectangle. The example timeline in FIG. 13 is shown as an illustrative example only, and other features may be included in the timeline that are not shown in FIG. 13. For example, a user may interact with the timeline display to navigate an entire conference, conversation, or a portion thereof. In this case, the user may zoom into the timeline display to focus on one or more particular utterances, or may zoom out to see a roadmap of the conference. Furthermore, the width of each rectangle may be based on a duration of the utterance, such that wider rectangles correspond to longer utterances. Such timelines may be referred to as scaled timelines. In this case, it may be further desirable to provide an indication on the timeline of when silences and pauses occur. Indicating when silences and pauses occur may be desirable for tuning the process of detecting utterances. However, it may be undesirable to display very long silences on the timeline, which may instead be indicated using a graphical indicium such as an ellipsis or a line across the timeline.
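- The roster and timeline described above amount to a small amount of per-participant state that is updated in real time; the following dataclass is only an illustrative guess at its shape, and every field name is an assumption.

    # Illustrative guess at the per-participant state behind the roster and
    # timeline display described above; field names are assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class RosterEntry:
        name: str
        status: str = "viewing"      # "listening", "speaking", or "viewing"
        mode: str = "conference"     # "conference" or "chat"
        muted: bool = True
        recording: bool = False
        last_recording: int = 0      # message number of the last clip recorded
        on_message: int = 0          # message number currently being heard
        recent_utterances: list = field(default_factory=list)  # last 20 flags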
- When one person starts speaking and his conference audio is paused, the other participants have the option of hearing his speech in the conference in the conference mode or in the voicechat mode. For example, all participants may be switched into voicechat mode as soon as the speaker starts speaking. In this case, all participants may listen to the speaker at normal speed such that the participants finish listening at the same time, as is described in more detail in relation to
FIG. 1 . - In some embodiments, a second participant starts speaking before the first speaker stops speaking. If the other participants continue to hear the first speaker in the normal conference audio, they have the option of hearing both speakers mixed together. The speakers may be separated by different positions in a stereo image. Alternatively, the conference audio may be stopped and replaced with playback of the first speaker from the time at which the second speaker starts. Then, audio from the second and any subsequent speakers may be played until the listeners are caught up to real time. When the listeners are caught up, they may be automatically joined back into the conference audio mix.
- In some embodiments, waiting time is eliminated by ensuring all participants finish either speaking or listening to the sequence at the same time. In this case, all participants may select the voicechat mode as playback. In some cases, all participants are automatically switched to playback mode when any of the participants begin speaking over each other. Remaining in voicechat mode during the conference causes the participants to listen to a conference in which every speaker appears to speak in turn. Advantageously, voicechat mode ensures that no speaker has to wait to speak, and that no one is interrupted. The voicechat mode may be particularly useful when the voice connections are peer to peer without the need to utilize a conference service on a central server.
- With the introduction and adoption of webrtc, conference services using peer to peer connections will be increasingly prevalent. It may be desirable for each participant to record his own voice and no one else's. In this manner, the systems and methods described herein may ensure that each person or assignee has ownership and control of his own utterances, and such utterances are each person's or assignee's own property. Thus, each person may control the sharing of the clips through shared playlists, chat messages, and/or emails with links to clips. In some embodiments, the clips may be communicated directly between participants for playing, but the receiving participants may not be able to copy or save the clip or audio selection onto their personal devices. The clips may be configured to be capable of being saved only to a personal device associated with the speaker in the clip. In some implementations, the media signals for each speaker may be stored in files on a storage space that the speaker controls himself. If the speaker wishes to allow other users to listen to his media content, the speaker may provide the other users access to certain metadata associated with the media signals and provide permission to the other users to access the media signals. For example, the metadata may include presentations or a video of a screen sharing used during the conference. In an example, the other users are only given access to play the content, but not to copy the content and store them on other devices. In this case, the speaker may update the permissions by removing permissions for one or more users after initially providing access.
- In some embodiments, Internet traffic is reduced through the use of a caching server. By using the caching server, each peer needs to transmit each clip only once, and the other peers may obtain the clips from the caching server. The clips may be encrypted when first recorded on the speaker's peer system and only decrypted when played on his system or on another authorized peer system. In particular, only authorized users may be permitted to participate in certain protected conferences. In such conferences, all channels may be encrypted such that they may not be played to a user unless the user has the appropriate conference key, individual speaker key, or both.
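- The description does not tie clip encryption to any particular cipher or key-distribution scheme. The sketch below uses the third-party cryptography package's Fernet recipe purely as an illustrative symmetric cipher, and layers a conference key over an individual speaker key as one possible reading of the key arrangement; in practice, the keys would be distributed only to authorized participants.

```python
from cryptography.fernet import Fernet  # pip install cryptography (illustrative choice)

# Keys would in practice be distributed only to authorized conference participants.
conference_key = Fernet.generate_key()
speaker_key = Fernet.generate_key()

def encrypt_clip(raw_audio: bytes) -> bytes:
    # Encrypt on the speaker's own system before anything leaves the device;
    # wrapping with both keys means a listener needs the conference key and
    # the individual speaker key to play the clip.
    inner = Fernet(speaker_key).encrypt(raw_audio)
    return Fernet(conference_key).encrypt(inner)

def decrypt_clip(token: bytes) -> bytes:
    inner = Fernet(conference_key).decrypt(token)
    return Fernet(speaker_key).decrypt(inner)

clip = b"\x00\x01..."  # placeholder for recorded audio bytes
assert decrypt_clip(encrypt_clip(clip)) == clip
```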
- In some embodiments, participants start in voicechat mode, see each other in the roster, and elect to speak in conference. In a broadcasting mode, the same audio is played to all participants as soon as possible. In the broadcasting mode, participants may send audio or visual messages to one another. The messages may be transmitted at a lower volume or with an alert tone and/or text message or presence flag.
- Below are several use cases which may be implemented with the features described herein. Any of the features may be implemented manually, automatically, or a combination of both.
- In a first use case, a user participates in a conference using the voicechat mode. In this case, the voicechat mode allows all users to speak whenever they have a thought and hear everyone else's thoughts played one at a time. Preconditions in this case include turning autoplay on in sequential mode with streaming, applying filters to new messages to hear all speakers except oneself (or an omit-self option for autoplay), setting auto join on, and setting speed to 130% (or 115%, 100%, or any other suitable speed). In another example, a gadget may be used for recording and playback, though the gadget may not omit a user's own utterances or have speed-up options. In another example, a phone connection with touch tones to select options may be used to select not to hear a user's own messages. Moreover, the touch tones may be used to start and stop recording (in this case a SoX-mediated read may be used to speed up playback in the same way as the player). A user may be listening to playback and catching up with the conference volume down and mute on. To speak, the user may select the pause button and the unmute button. When the user is finished speaking, the user may select to mute himself, and then select the play button to resume listening to the conference. In some embodiments, a roster such as is shown in
FIG. 13 is used so that the participants may determine where the other participants are in the conversation. - In a second use case, a user plays back the audio clips during the conference. In particular, the user may wish to replay recent messages while in conference without missing anything. This case may arise if the subject matter is complex, or if the user was distracted by an interruption, for example. Preconditions in this case include setting auto join on and speed to 130% (or 115%, 100%, or any other suitable speed). When the user wishes to replay a clip, the user may hang up the conference call and set the filters to hear the speakers he wishes to replay (including optionally omitting himself). The user then plays the clips from the point at which he wishes to start, may select to skip any unwanted messages, and, when he decides he has heard enough, plays to or skips to the end to rejoin the call. In some embodiments, a roster such as is shown in
FIG. 13 is used so that the participants may determine where the other participants are in the conversation. In particular, the roster may display an indication that a user has left the live conference and has gone back to replay a previous portion of the conference. - In a third use case, an original user wishes to speak while another user is speaking, and what the original user wishes to say may affect what the other user is saying. Preconditions in this case include setting the scroll mode to latest, setting auto play on and being caught up, with playback volume down, speed set to 130% (or 115%, 100%, or any other suitable speed), filters set to hear all speakers except himself (or selecting an omit-self option), and sequential play on (or natural play). When the other user is speaking and the original user wants to be heard as well, the original user may turn the conference volume down, press pause, unmute himself, and speak. When the original user is finished speaking, he may press mute, turn playback volume up, and start play. After the original user is caught up, the conference volume may be turned up, and playback volume may be turned down. The other participants may react as in the following examples (or these reactions may be automated by the systems and methods herein). The first speaker, upon hearing another speaker, may turn his conference volume down and pause playback. When he finishes speaking, he starts play omitting himself, with playback volume up. When he catches up, he turns down his play volume and turns up his conference volume. Upon detecting another speaker in parallel, other participants may turn their conference volume down and play volume up, and do the reverse when they are caught up. In some embodiments, the system may download a speaker's own utterances even though they are not played immediately, so that the utterances may be cached in the speaker's browser for future use.
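- The volume and mute transitions in this use case can be read as a small state machine. The sketch below is one hedged interpretation of those steps; the state names (LIVE, SPEAKING, CATCHING_UP) and the numeric volume levels are invented for illustration.

```python
from enum import Enum, auto

class Mode(Enum):
    LIVE = auto()         # listening to the live conference mix
    SPEAKING = auto()     # conference volume down, playback paused, mic unmuted
    CATCHING_UP = auto()  # mic muted, replaying missed clips at raised playback volume

class ParticipantState:
    def __init__(self) -> None:
        self.mode = Mode.LIVE
        self.conference_volume = 1.0
        self.playback_volume = 0.0
        self.muted = True

    def start_speaking(self) -> None:
        # Another speaker is already talking, but this participant speaks anyway.
        self.conference_volume = 0.0
        self.playback_volume = 0.0
        self.muted = False
        self.mode = Mode.SPEAKING

    def finish_speaking(self) -> None:
        # Mute again and replay what was missed while speaking.
        self.muted = True
        self.playback_volume = 1.0
        self.mode = Mode.CATCHING_UP

    def caught_up(self) -> None:
        # Back to the live mix once playback reaches real time.
        self.playback_volume = 0.0
        self.conference_volume = 1.0
        self.mode = Mode.LIVE

state = ParticipantState()
state.start_speaking(); state.finish_speaking(); state.caught_up()
assert state.mode is Mode.LIVE and state.muted
```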
- In a fourth use case, a user is in a conference, and while someone else speaks, the user wants to hear what the other person has to say, but has a related thought to contribute and does not want to lose the thought. Preconditions for this use case include that the user is in the conference and that background autoplay is on, so play volume is down and conference volume is up. The user speaks the thought using a single-message reply by muting the conference, lowering the conference volume, and clicking a reply button in a widget so that the user replies to the most recent message being played. Playback is paused, the call is answered, the user speaks the thought, and the user hangs up the call. The user may then return to the conference by raising the playback volume and restarting playback at accelerated speed. When the user is caught up to real time, the user may raise the conference volume and lower the playback volume. At a convenient moment in the conference, the user may restate the thought or play it back into the conference by locating the reply in the playlist, unmuting the conference, increasing the playback volume, and playing the reply. Then the user may return to the normal conference by lowering the playback volume and restoring autoplay. In some embodiments, the system may assist the user in finding an appropriate moment to inject the thought before the flow of the conversation moves on.
- In a fifth use case, a user joins a conference late and wants to catch up. In this case, the user may start playback with speed-up by speaker, and possibly filter by speaker, with autojoin on. When the user is called into the conference, the user may stop playback or let it continue in the background at low or no volume so as to be able to do other things.
- In a sixth use case, a user wishes to join a conference when his agenda item comes up. In this case, the systems and methods of the present disclosure may provide an agenda linked to threads for each item. The user may play the thread from the agenda item with auto play on, such that the user may hear a signal when his agenda item is up. This implementation requires that the participants indicate they are associated with particular agenda items. Optionally, the user may select to have autojoin on so that the user is brought into the conference when the messages in the user's thread start.
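- A minimal sketch of the agenda-item trigger is shown below, assuming a hypothetical feed that yields the agenda item and speaker for each new message; the incoming_messages iterable, the watch_agenda helper, and the returned strings are illustrative assumptions rather than a defined interface.

```python
from typing import Iterable, Tuple

def watch_agenda(incoming_messages: Iterable[Tuple[str, str]],
                 my_item: str,
                 autojoin: bool = True) -> str:
    """Consume (agenda_item, speaker) pairs until the user's item starts.

    Returns a description of the action taken; a real implementation would
    play an alert signal and, if autojoin is on, bring the user into the call.
    """
    for agenda_item, speaker in incoming_messages:
        if agenda_item == my_item:
            if autojoin:
                return f"joining conference: {speaker} opened item '{my_item}'"
            return f"alert: item '{my_item}' has started"
    return "conference ended before the item came up"

# Example feed of messages tagged with their agenda items.
feed = [("budget", "alice"), ("budget", "bob"), ("roadmap", "carol")]
print(watch_agenda(iter(feed), my_item="roadmap"))
```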
- In a seventh use case, when a user is listening to a playback of a conference that has previously been recorded, the user may wish to continue the conference by recording additional utterances. This may be implemented in a mode referred to as conference continuation mode, in which the user may record additional utterances to continue the thread of discussion and update the conference. Later, when the user or other users listen to the playback of the conference, the user's additional utterances are included in the playback. In an example, the additional utterances may be added at the beginning, middle, or end of the original conference, and redirection tags may be inserted automatically so that later listeners are redirected to the additional utterances at the proper time.
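- One possible reading of the redirection tags is sketched below with an invented playlist structure: each tag marks a point in the original conference from which later listeners jump to the appended utterance before continuing. The Entry and Playlist types and the clip identifiers are illustrative assumptions, not a defined format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entry:
    clip_id: str
    redirect_to: Optional[str] = None   # redirection tag: play this clip next

@dataclass
class Playlist:
    entries: List[Entry] = field(default_factory=list)

    def continue_conference(self, after_clip: str, new_clip: str) -> None:
        # Append the new utterance and tag the chosen point in the original
        # conference so later listeners are redirected to it at the proper time.
        self.entries.append(Entry(new_clip))
        for entry in self.entries:
            if entry.clip_id == after_clip:
                entry.redirect_to = new_clip
                break

    def playback_order(self) -> List[str]:
        played, order = set(), []
        for entry in self.entries:
            if entry.clip_id in played:
                continue
            order.append(entry.clip_id)
            played.add(entry.clip_id)
            if entry.redirect_to and entry.redirect_to not in played:
                order.append(entry.redirect_to)   # follow the redirection tag
                played.add(entry.redirect_to)
        return order

pl = Playlist([Entry("A"), Entry("B"), Entry("C")])
pl.continue_conference(after_clip="B", new_clip="D")
print(pl.playback_order())   # ['A', 'B', 'D', 'C']
```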
- Playback of conference clips may be through the conference connection in place of the usual conference mix on the out channel. Alternatively, it may be by media playback in various formats, such as those provided by HTML. In this case, it is helpful to stream the playback so that it can catch up to users who are speaking, including while clips are still being recorded. During streaming, audio processing may be undertaken by audio utilities included in the streaming process, such as SoX (Sound eXchange), a well-known Unix utility. A simple streaming technique may be implemented using the following steps (a sketch of the loop follows the list).
- 1. set up
-   a. read file and note size or last position
-   b. pass to SoX
-     i. sleep long enough for SoX to do some processing, e.g. 100 msecs
-     ii. pass output of SoX to browser
-     iii. repeat steps i and ii until there is no more output from SoX
-   c. sleep long enough for speaker to record more audio, e.g. 1 second
-   d. repeat steps a, b, and c until file size has not grown
- 2. close down.
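- A sketch of the loop above in Python is shown below, invoking the SoX command line through a long-lived subprocess. The raw-PCM format arguments, the 1.3x tempo, the file name, and the use of standard output as a stand-in for the browser connection are illustrative assumptions; the step letters from the list are noted in the comments.

```python
import os
import select
import subprocess
import sys
import time

RECORDING = "speaker.raw"   # raw 16-bit mono PCM, appended to while the speaker records
SOX_FMT = ["-t", "raw", "-r", "8000", "-e", "signed", "-b", "16", "-c", "1"]

def drain(sox: subprocess.Popen) -> None:
    # ii./iii. forward whatever output SoX has ready (stdout stands in for the browser).
    while select.select([sox.stdout], [], [], 0)[0]:
        data = os.read(sox.stdout.fileno(), 65536)
        if not data:
            return
        sys.stdout.buffer.write(data)

def stream_with_sox(tempo: float = 1.3) -> None:
    # 1. set up: one long-lived SoX process that speeds the audio up as it arrives.
    sox = subprocess.Popen(
        ["sox"] + SOX_FMT + ["-"] + SOX_FMT + ["-", "tempo", str(tempo)],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    position = 0
    while True:
        with open(RECORDING, "rb") as f:     # a. read the file from the last position
            f.seek(position)
            chunk = f.read()
        if chunk:
            position += len(chunk)
            sox.stdin.write(chunk)           # b. pass the new audio to SoX
            sox.stdin.flush()
            time.sleep(0.1)                  # i. give SoX ~100 ms to do some processing
            drain(sox)
            continue
        time.sleep(1.0)                      # c. let the speaker record more audio
        if os.path.getsize(RECORDING) == position:
            break                            # d. the file has stopped growing
    sox.stdin.close()                        # 2. close down and flush remaining output
    sys.stdout.buffer.write(sox.stdout.read())
    sox.wait()
```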
-
FIG. 14 is a block diagram of a computing device, such as any of the components of the system of FIG. 1, for performing any of the processes described herein. Each of the components of these systems may be implemented on one or more computing devices 1400. In certain aspects, a plurality of the components of these systems may be included within one computing device 1400. In certain implementations, a component and a storage device may be implemented across several computing devices 1400. - The
computing device 1400 comprises at least one communications interface unit, an input/output controller 1410, system memory, and one or more data storage devices. The system memory includes at least one random access memory (RAM 1402) and at least one read-only memory (ROM 1404). All of these elements are in communication with a central processing unit (CPU 1406) to facilitate the operation of the computing device 1400. The computing device 1400 may be configured in many different ways. For example, the computing device 1400 may be a conventional standalone computer or alternatively, the functions of computing device 1400 may be distributed across multiple computer systems and architectures. In FIG. 14, the computing device 1400 is linked, via network or local network, to other servers or systems. - The
computing device 1400 may be configured in a distributed architecture, wherein databases and processors are housed in separate units or locations. Some units perform primary processing functions and contain at a minimum a general controller or a processor and a system memory. In distributed architecture implementations, each of these units may be attached via the communications interface unit 1408 to a communications hub or port (not shown) that serves as a primary communication link with other servers, client or user computers and other related devices. The communications hub or port may have minimal processing capability itself, serving primarily as a communications router. A variety of communications protocols may be part of the system, including, but not limited to: Ethernet, SAP, SAS™, ATP, BLUETOOTH™, GSM and TCP/IP. - The
CPU 1406 comprises a processor, such as one or more conventional microprocessors and one or more supplementary co-processors such as math co-processors for offloading workload from the CPU 1406. The CPU 1406 is in communication with the communications interface unit 1408 and the input/output controller 1410, through which the CPU 1406 communicates with other devices such as other servers, user terminals, or devices. The communications interface unit 1408 and the input/output controller 1410 may include multiple communication channels for simultaneous communication with, for example, other processors, servers or client terminals. - The
CPU 1406 is also in communication with the data storage device. The data storage device may comprise an appropriate combination of magnetic, optical or semiconductor memory, and may include, for example, RAM 1402, ROM 1404, flash drive, an optical disc such as a compact disc or a hard disk or drive. The CPU 1406 and the data storage device each may be, for example, located entirely within a single computer or other computing device; or connected to each other by a communication medium, such as a USB port, serial port cable, a coaxial cable, an Ethernet cable, a telephone line, a radio frequency transceiver or other similar wireless or wired medium or combination of the foregoing. For example, the CPU 1406 may be connected to the data storage device via the communications interface unit 1408. The CPU 1406 may be configured to perform one or more particular processing functions. - The data storage device may store, for example, (i) an
operating system 1412 for the computing device 1400; (ii) one or more applications 1414 (e.g., computer program code or a computer program product) adapted to direct the CPU 1406 in accordance with the systems and methods described here, and particularly in accordance with the processes described in detail with regard to the CPU 1406; or (iii) database(s) 1416 adapted to store information that may be utilized to store information required by the program. - The
operating system 1412 and applications 1414 may be stored, for example, in a compressed, an uncompiled and an encrypted format, and may include computer program code. The instructions of the program may be read into a main memory of the processor from a computer-readable medium other than the data storage device, such as from the ROM 1404 or from the RAM 1402. While execution of sequences of instructions in the program causes the CPU 1406 to perform the process steps described herein, hard-wired circuitry may be used in place of, or in combination with, software instructions for implementation of the processes of the present disclosure. Thus, the systems and methods described are not limited to any specific combination of hardware and software. - Suitable computer program code may be provided for performing one or more functions described herein. The program also may include program elements such as an
operating system 1412, a database management system and “device drivers” that allow the processor to interface with computer peripheral devices (e.g., a video display, a keyboard, a computer mouse, etc.) via the input/output controller 1410. - The term “computer-readable medium” as used herein refers to any non-transitory medium that provides or participates in providing instructions to the processor of the computing device 1400 (or any other processor of a device described herein) for execution. Such a medium may take many forms, including but not limited to, non-volatile media and volatile media. Non-volatile media include, for example, optical, magnetic, or opto-magnetic disks, or integrated circuit memory, such as flash memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes the main memory. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM or EEPROM (electronically erasable programmable read-only memory), a FLASH-EEPROM, any other memory chip or cartridge, or any other non-transitory medium from which a computer can read.
- Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the CPU 1406 (or any other processor of a device described herein) for execution. For example, the instructions may initially be borne on a magnetic disk of a remote computer (not shown). The remote computer can load the instructions into its dynamic memory and send the instructions over an Ethernet connection, cable line, or even telephone line using a modem. A communications device local to a computing device 1400 (e.g., a server) can receive the data on the respective communications line and place the data on a system bus for the processor. The system bus carries the data to main memory, from which the processor retrieves and executes the instructions. The instructions received by main memory may optionally be stored in memory either before or after execution by the processor. In addition, instructions may be received via a communication port as electrical, electromagnetic or optical signals, which are exemplary forms of wireless communications or data streams that carry various types of information.
- While various embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.
Claims (30)
1. A method for processing audio content of a conference, the method comprising:
receiving a first audio signal from a first conference participant;
detecting, from the first audio signal, a start and an end of a first utterance by the first conference participant;
receiving a second audio signal from a second conference participant;
detecting, from the second audio signal, a start and an end of a second utterance by the second conference participant; and
providing, to the second conference participant, at least a portion of the first utterance at a time that is determined based at least in part on the start, the end, or both the start and the end of the second utterance.
2. The method of claim 1 , wherein the time corresponds to at least one of a start time for providing the portion of the first utterance, a start point of the portion of the first utterance, and a duration of the first utterance, and wherein the portion of the first utterance is provided to the second conference participant before the start of the second utterance or after the end of the second utterance.
3. The method of claim 1 , wherein the start of the second utterance occurs after the start of the first utterance and before the end of the first utterance, and the portion of the first utterance is based on a previous portion of the first utterance that is provided to the second conference participant before the start of the second utterance.
4. The method of claim 3 , further comprising, upon detection of the start of the second utterance, switching the first and second conference participants to a mode in which utterances are played sequentially to the first and second conference participants.
5. The method of claim 1 , wherein in response to detecting the start of the second utterance, the method further comprises stopping the providing of the portion of the first utterance to the second conference participant, the method further comprising storing an indication of a point in the first utterance at which the providing to the second conference participant was stopped, wherein in response to detecting the end of the second utterance, the method further comprises resuming the providing of the portion of the first utterance to the second conference participant at the point referenced by the stored indication.
6. The method of claim 5 , wherein resuming the providing of the portion of the first utterance to the second conference participant at the point referenced by the stored indication comprises:
accessing a recorded version of the first audio signal at the point referenced by the stored indication;
playing the portion of the first utterance from the point referenced by the stored indication; and
providing conference audio to the second conference participant in real time when playback of the recorded version terminates.
7. The method of claim 6 , wherein the recorded version of the first audio signal is stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of the conference participants, and wherein playing the portion of the first utterance comprises:
playing the plurality of audio clips sequentially from the point referenced by the stored indication.
8. The method of claim 6 , wherein the recorded version of the first audio signal is stored as at least some of a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants, and the method further comprising:
playing the plurality of audio clips from the point referenced by the stored indication in the same manner in which they were recorded, wherein two or more of the plurality of audio clips are played in an overlapping manner when the corresponding conference audio included overlapping utterances from multiple conference participants.
9. The method of claim 1 , wherein detecting the start of the first utterance comprises receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output, and detecting the end of the first utterance comprises receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
10. The method of claim 1 further comprising:
storing the first utterance as an audio clip in a playlist, wherein the playlist includes a plurality of audio clips of utterances by other conference participants;
automatically categorizing the stored audio clip in the playlist under a section identifying the conference or a subject of the conference; and
automatically tagging the stored audio clip with information identifying the first conference participant.
11. A system for processing audio content of a conference, the system comprising:
an audio detector configured to:
receive a first audio signal from a first conference participant;
detect, from the first audio signal, a start and an end of a first utterance by the first conference participant;
receive a second audio signal from a second conference participant;
detect, from the second audio signal, a start and an end of a second utterance by the second conference participant; and
a transmitter configured to provide, to the second conference participant, at least a portion of the first utterance at a time determined based at least in part on the start, the end, or both the start and the end of the second utterance.
12. The system of claim 11 , wherein the time corresponds to at least one of a start time for providing the portion of the first utterance, a start point of the portion of the first utterance, and a duration of the first utterance, and wherein the portion of the first utterance is provided to the second conference participant before the start of the second utterance or after the end of the second utterance.
13. The system of claim 11 , wherein the start of the second utterance occurs after the start of the first utterance and before the end of the first utterance, and the portion of the first utterance is based on a previous portion of the first utterance that is provided to the second conference participant before the start of the second utterance.
14. The system of claim 13 , further comprising a processor configured to, upon detection of the start of the second utterance, switch the first and second conference participants to a mode in which utterances are played sequentially to the first and second participants.
15. The system of claim 11 , wherein:
in response to detecting the start of the second utterance, the transmitter stops the providing of the portion of the first utterance to the second conference participant;
the system further comprises memory configured to store an indication of a point in the first utterance at which the transmitter stops the providing of the portion of the first utterance to the second conference participant; and
in response to the audio detector detecting the end of the second utterance, the transmitter resumes the providing of the portion of the first utterance to the second conference participant at the point referenced by the stored indication.
16. The system of claim 15 , wherein the transmitter resumes the providing of the portion of the first utterance to the second conference participant at the point referenced by the stored indication by:
accessing a recorded version of the first audio signal at the point referenced by the stored indication;
playing the portion of the first utterance from the point referenced by the stored indication; and
providing conference audio to the second conference participant in real time when playback of the recorded version terminates.
17. The system of claim 16 , wherein the recorded version of the first audio signal is stored as a plurality of audio clips in a playlist in memory, each audio clip includes an utterance by one of the conference participants, and wherein the transmitter plays the portion of the first utterance by playing the plurality of audio clips sequentially from the point referenced by the stored indication.
18. The system of claim 16 , wherein the recorded version of the first audio signal is stored as at least some of a plurality of audio clips in a playlist in memory, each audio clip includes an utterance by one of a plurality of conference participants, and wherein the transmitter plays the plurality of audio clips from the point referenced by the stored indication in the same manner in which they were recorded, wherein two or more of the plurality of audio clips are played in an overlapping manner when the corresponding conference audio included overlapping utterances from multiple conference participants.
19. The system of claim 11 , wherein the audio detector detects the start of the first utterance by receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output, and detects the end of the first utterance by receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
20. The system of claim 11 further comprising:
a memory configured to store the first utterance as an audio clip in a playlist, wherein the playlist includes a plurality of audio clips of utterances by other conference participants; and
a processor configured to:
automatically categorize the stored audio clip in the playlist under a section identifying the conference or a subject of the conference; and
automatically tag the stored audio clip with information identifying the first conference participant.
21. A method for processing audio content of a conference, the method comprising:
providing audio from the conference to a first conference participant;
detecting a start of an utterance by the first conference participant;
in response to detecting the start of the utterance, stopping the provision of the audio from the conference to the first conference participant;
storing an indication of a point in the audio from the conference at which the provision of the audio from the conference to the first conference participant was stopped;
detecting an end of the utterance by the first conference participant; and
in response to detecting the end of the utterance, resuming the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication.
22. The method of claim 21 , wherein:
detecting a start of an utterance by the first conference participant comprises:
monitoring a volume level of an audio stream sourced from the first conference participant;
comparing the monitored volume level of the audio stream to a threshold value; and
determining the start of the utterance when the monitored volume level of the audio stream exceeds the threshold value; and
detecting an end of the utterance by the first conference participant comprises:
monitoring the volume level of the audio stream sourced from the first conference participant;
comparing the monitored volume level of the audio stream to the threshold value; and
determining the end of the utterance when the monitored volume level of the audio stream falls below the threshold value for a predefined duration of time.
23. The method of claim 21 , wherein detecting a start of an utterance by the first conference participant comprises receiving a first selection from the first conference participant to unmute an audio input interface or to pause an audio output, and detecting an end of the utterance by the first conference participant comprises receiving a second selection from the first conference participant to mute the audio input interface or to play the audio output.
24. The method of claim 21 further comprising:
in response to detecting the start of the utterance, initiating a recording of the utterance by the first conference participant; and
in response to detecting the end of the utterance, terminating the recording of the utterance by the first conference participant.
25. The method of claim 24 further comprising:
storing a reference to the recorded utterance as an audio clip in a playlist, wherein the playlist includes a plurality of audio clips of utterances by other conference participants;
automatically categorizing the stored audio clip in the playlist under a section identifying the conference or a subject of the conference; and
automatically tagging the stored audio clip with information identifying the first conference participant.
26. The method of claim 25 further comprising:
receiving user input of data to associate with the stored audio clip in the playlist; and
storing the data with an association to the stored audio clip, wherein the stored data comprises at least one of a subject, description, transcription, keyword, flag, digital file, and uniform resource locator.
27. The method of claim 21 , further comprising:
in response to detecting the start of the utterance, storing a first index point corresponding to the start of the utterance; and
in response to detecting the end of the utterance, storing a second index point corresponding to the end of the utterance.
28. The method of claim 21 , wherein resuming the provision of the audio from the conference to the first conference participant at the point referenced by the stored indication comprises:
accessing a recorded version of the audio from the conference at the point referenced by the stored indication;
playing the recorded version of the audio from the conference from the point referenced by the stored indication at an accelerated rate; and
providing the audio from the conference to the first conference participant in real time when playback of the recorded version terminates.
29. The method of claim 28 , wherein the recorded version of the audio from the conference is stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants, and wherein playing the recorded version of the audio from the conference comprises:
playing the plurality of audio clips sequentially from the point referenced by the stored indication.
30. The method of claim 28 , wherein the recorded version of the audio from the conference is stored as a plurality of audio clips in a playlist, each audio clip including an utterance by one of a plurality of conference participants, and wherein playing the recorded version of the audio from the conference comprises:
playing the plurality of audio clips from the point referenced by the stored indication in the same manner in which they were recorded, wherein two or more of the plurality of audio clips are played in an overlapping manner when the corresponding audio from the conference included overlapping utterances from multiple conference participants.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/102,146 US20230169991A1 (en) | 2013-07-02 | 2023-01-27 | Systems and methods for improving audio conferencing services |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361842331P | 2013-07-02 | 2013-07-02 | |
US14/321,348 US9087521B2 (en) | 2013-07-02 | 2014-07-01 | Systems and methods for improving audio conferencing services |
US14/791,910 US9538129B2 (en) | 2013-07-02 | 2015-07-06 | Systems and methods for improving audio conferencing services |
US15/364,654 US10553239B2 (en) | 2013-07-02 | 2016-11-30 | Systems and methods for improving audio conferencing services |
US16/778,691 US20200411038A1 (en) | 2013-07-02 | 2020-01-31 | Systems and methods for improving audio conferencing services |
US18/102,146 US20230169991A1 (en) | 2013-07-02 | 2023-01-27 | Systems and methods for improving audio conferencing services |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/778,691 Continuation US20200411038A1 (en) | 2013-07-02 | 2020-01-31 | Systems and methods for improving audio conferencing services |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230169991A1 true US20230169991A1 (en) | 2023-06-01 |
Family
ID=51211820
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/321,348 Expired - Fee Related US9087521B2 (en) | 2013-07-02 | 2014-07-01 | Systems and methods for improving audio conferencing services |
US14/791,910 Active US9538129B2 (en) | 2013-07-02 | 2015-07-06 | Systems and methods for improving audio conferencing services |
US15/364,654 Expired - Fee Related US10553239B2 (en) | 2013-07-02 | 2016-11-30 | Systems and methods for improving audio conferencing services |
US16/778,691 Abandoned US20200411038A1 (en) | 2013-07-02 | 2020-01-31 | Systems and methods for improving audio conferencing services |
US18/102,146 Pending US20230169991A1 (en) | 2013-07-02 | 2023-01-27 | Systems and methods for improving audio conferencing services |
Family Applications Before (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/321,348 Expired - Fee Related US9087521B2 (en) | 2013-07-02 | 2014-07-01 | Systems and methods for improving audio conferencing services |
US14/791,910 Active US9538129B2 (en) | 2013-07-02 | 2015-07-06 | Systems and methods for improving audio conferencing services |
US15/364,654 Expired - Fee Related US10553239B2 (en) | 2013-07-02 | 2016-11-30 | Systems and methods for improving audio conferencing services |
US16/778,691 Abandoned US20200411038A1 (en) | 2013-07-02 | 2020-01-31 | Systems and methods for improving audio conferencing services |
Country Status (3)
Country | Link |
---|---|
US (5) | US9087521B2 (en) |
EP (2) | EP3017589B1 (en) |
WO (1) | WO2015001492A1 (en) |
-
2014
- 2014-07-01 WO PCT/IB2014/062776 patent/WO2015001492A1/en active Application Filing
- 2014-07-01 US US14/321,348 patent/US9087521B2/en not_active Expired - Fee Related
- 2014-07-01 EP EP14741676.2A patent/EP3017589B1/en not_active Not-in-force
- 2014-07-01 EP EP18182549.8A patent/EP3448006B1/en active Active
-
2015
- 2015-07-06 US US14/791,910 patent/US9538129B2/en active Active
-
2016
- 2016-11-30 US US15/364,654 patent/US10553239B2/en not_active Expired - Fee Related
-
2020
- 2020-01-31 US US16/778,691 patent/US20200411038A1/en not_active Abandoned
-
2023
- 2023-01-27 US US18/102,146 patent/US20230169991A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
EP3017589B1 (en) | 2018-08-08 |
US10553239B2 (en) | 2020-02-04 |
EP3448006B1 (en) | 2023-03-15 |
US20200411038A1 (en) | 2020-12-31 |
US9538129B2 (en) | 2017-01-03 |
US20150012270A1 (en) | 2015-01-08 |
EP3017589A1 (en) | 2016-05-11 |
EP3448006A1 (en) | 2019-02-27 |
US9087521B2 (en) | 2015-07-21 |
US20150312518A1 (en) | 2015-10-29 |
US20170236532A1 (en) | 2017-08-17 |
WO2015001492A1 (en) | 2015-01-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED