
CN111800552A - Audio output processing method, device and system and electronic equipment - Google Patents

Audio output processing method, device and system and electronic equipment

Info

Publication number
CN111800552A
Authority
CN
China
Prior art keywords
audio
output
electronic device
matched
moment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010617753.2A
Other languages
Chinese (zh)
Inventor
赵泽清
徐培来
贾宸
张银平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202010617753.2A, Critical
Publication of CN111800552A
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 3/00: Automatic or semi-automatic exchanges
    • H04M 3/42: Systems providing special services or facilities to subscribers
    • H04M 3/56: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M 3/568: Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities; audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00: Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/40: Support for services or applications
    • H04L 65/403: Arrangements for multi-party communication, e.g. for conferences

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides an audio output processing method, device, system and electronic device. Any electronic device participating in a multi-party call is denoted as a first electronic device. Before outputting to-be-output audio that was received and cached at a first moment and sent by a second electronic device participating in the multi-party call, the first electronic device obtains cached first audio from within a preset time period of the first moment. If a second audio matching the to-be-output audio is detected in the first audio, the to-be-output audio contains content that the participant using the first electronic device has already heard, so the embodiment can directly eliminate the portion of the to-be-output audio that matches the second audio. This ensures that the first electronic device does not repeatedly output audio with the same content, eliminates the interference caused when the audio played by the first electronic device itself is recorded again, and also eliminates the interference caused by recording audio played by other electronic devices, so the method can be applied to various types of multi-party call scenarios.

Description

Audio output processing method, device and system and electronic equipment
Technical Field
The present application relates to the field of multi-party call applications, and more particularly, to an audio output processing method, apparatus, system and electronic device.
Background
A multi-party call is an online voice call among multiple parties realized by various means. It enables real-time remote communication among several participants and is currently applied mainly in scenarios such as conferences and games.
In practical applications, while playing a received audio signal that was collected and sent by an electronic device B, any electronic device A participating in a multi-party call often collects the played audio signal again and transmits it back to the electronic device B. As a result, the participant using the electronic device B hears his or her own output audio, and this useless output frequently degrades the quality of the multi-party call.
Moreover, if several participants of the multi-party call are located in the same place, for example participant A and participant B are in the same room, the audio signal produced by participant A is collected not only by the electronic device used by participant A but also by the electronic device used by participant B, which outputs it, and that output is collected again by the electronic device used by participant A. Consequently, the audio signals output by the electronic devices of participant B and the other participants contain more echo interference signals.
To improve the quality of multi-party calls, the prior art generally applies an echo noise cancellation technique during audio acquisition by an electronic device to remove the interference caused by the audio the device itself plays. However, this approach cannot remove the interference caused by audio played by other electronic devices, so its application range is limited and it cannot cover all types of multi-party call scenarios.
Disclosure of Invention
In view of the above, in order to reduce interference audio in audio output in a multi-party call and improve quality of the multi-party call, in one aspect, the present application provides an audio output processing method, including:
for a first electronic device participating in a multi-party call, acquiring to-be-output audio that is received and cached at a first moment and sent by a second electronic device participating in the multi-party call;
obtaining first audio that has been cached within a preset time period of the first moment, wherein the first audio comprises audio entered into the first electronic device at a second moment and/or audio output at a third moment, the second moment and the third moment both being earlier than the first moment;
and detecting that second audio matched with the audio to be output exists in the first audio, and eliminating the audio matched with the second audio in the audio to be output.
Optionally, the detecting that a second audio matched with the audio to be output exists in the first audio includes:
acquiring Mel-frequency cepstral coefficient (MFCC) features of the audio to be output;
sequentially matching the MFCC features of the audio to be output with the MFCC features corresponding to the audio contained in the first audio;
and determining that the second audio corresponding to the matching result meeting the first condition is matched with the audio to be output.
Optionally, the detecting that a second audio matched with the audio to be output exists in the first audio includes:
extracting a first audio fingerprint feature of the audio to be output;
sequentially matching the first audio fingerprint features with second audio fingerprint features corresponding to each audio contained in the first audio;
and determining that the second audio corresponding to the matching result meeting the second condition is matched with the audio to be output.
Optionally, the detecting that a second audio matched with the audio to be output exists in the first audio includes:
acquiring first voiceprint information of the audio to be output and second voiceprint information corresponding to each audio contained in the first audio;
based on the first voiceprint information and the second voiceprint information, performing content identification on the audio corresponding to the same voiceprint in the audio to be output and the first audio to obtain a corresponding first audio text and a corresponding second audio text;
acquiring the text similarity of the first audio text and the second audio text;
and determining that the second audio corresponding to the text similarity larger than the first similarity threshold is matched with the audio to be output.
Optionally, the eliminating the audio matched with the second audio in the audio to be output includes:
if the second audio matches the whole audio to be output, eliminating the audio to be output;
if the second audio matches part of the audio to be output, performing cancellation processing on the audio to be output by using the second audio to obtain target output audio;
outputting the target output audio.
Optionally, the method further includes:
detecting whether, among the second electronic devices participating in the multi-party call, there is a third electronic device within a preset spatial range of the first electronic device;
if the third electronic device exists, determining that the audio from the third electronic device in the audio to be output is audio to be eliminated;
and eliminating the audio to be eliminated.
In yet another aspect, the present application also provides an audio output processing apparatus, including:
the to-be-output audio acquisition module is used for acquiring, for a first electronic device participating in a multi-party call, to-be-output audio that is received and cached at a first moment and sent by a second electronic device participating in the multi-party call;
the first audio acquisition module is used for acquiring a first audio which is cached within a preset time period away from the first moment, wherein the first audio comprises an audio which is input by the first electronic equipment at a second moment and/or an audio which is output at a third moment, and the second moment and the third moment are both earlier than the first moment;
and the audio detection module is used for detecting that second audio matched with the audio to be output exists in the first audio and eliminating the audio matched with the second audio in the audio to be output.
Optionally, the audio detection module includes:
the feature obtaining unit is used for obtaining Mel-frequency cepstral coefficient (MFCC) features of the audio to be output;
the first matching unit is used for sequentially matching the MFCC characteristics of the audio to be output with the MFCC characteristics corresponding to the audio contained in the first audio;
the first determining unit is used for determining that the second audio corresponding to the matching result meeting the first condition is matched with the audio to be output;
and the first eliminating unit is used for eliminating the audio matched with the second audio in the audio to be output.
In another aspect, the present application further provides an electronic device, including:
an audio collector; an audio player;
a memory for storing a program for implementing the audio output processing method as described above;
the processor is used for loading and executing the program stored in the memory so as to realize the steps of the audio output processing method.
In yet another aspect, the present application further provides an audio output processing system, the system comprising:
a plurality of electronic devices participating in the multi-party call, wherein each of the electronic devices is the electronic device described above;
and the communication server is in communication connection with the electronic devices respectively and is used for constructing a virtual space for realizing multi-party call so that the electronic devices access the virtual space to realize mutual communication.
It can be seen that the present application provides an audio output processing method, apparatus, system and electronic device. Any electronic device participating in a multi-party call is denoted as a first electronic device. Before outputting to-be-output audio that was received and cached at a first moment and sent by a second electronic device participating in the multi-party call, the first electronic device obtains first audio cached within a preset time period of the first moment, such as audio recorded at a past second moment and/or audio output at a third moment, and detects whether the first audio contains a second audio matching the to-be-output audio. If so, the to-be-output audio contains audio that would interfere with the participant using the first electronic device, that is, audio the participant has already heard, and the embodiment can directly eliminate the matching portion of the to-be-output audio. The first electronic device is thus guaranteed not to output audio with the same content repeatedly; the interference caused by re-recording audio played by the first electronic device itself is eliminated, as is the interference caused by recording audio played by other electronic devices, which greatly improves the user experience and makes the scheme suitable for various multi-party call scenarios.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 shows a schematic flow diagram of an alternative example of an audio output processing method proposed by the present application;
FIG. 2 is a schematic diagram illustrating an alternative scenario flow of the audio output processing method proposed in the present application;
FIG. 3 shows a schematic flow diagram of yet another alternative example of the audio output processing method proposed by the present application;
FIG. 4 shows a schematic flow diagram of yet another alternative example of the audio output processing method proposed by the present application;
FIG. 5 shows a schematic flow diagram of yet another alternative example of the audio output processing method proposed by the present application;
FIG. 6 shows a schematic structural diagram of an alternative example of the audio output processing apparatus proposed in the present application;
FIG. 7 is a schematic diagram showing a hardware configuration of an alternative example of an electronic device implementing the audio output processing method proposed in the present application;
FIG. 8 shows a schematic diagram of an alternative example of an audio output processing system proposed by the present application.
Detailed Description
In view of the technical problems set forth in the background section, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be understood that "system", "apparatus", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this application and the appended claims, the terms "a," "an," "the," and/or "said" do not denote the singular only and may also include the plural unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; they do not form an exclusive list, and a method or apparatus may include other steps or elements. An element preceded by "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element.
In the description of the embodiments herein, "/" means "or" unless otherwise specified, for example, a/B may mean a or B; "and/or" herein is merely an association describing an associated object, and means that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more than two. The terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Additionally, flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.
Referring to fig. 1, which shows a flow diagram of an alternative example of an audio output processing method proposed in the present application, the method may be applied to an electronic device, which may include, but is not limited to, a smart phone, a tablet computer, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), an e-book reader, a desktop computer, and the like. As shown in fig. 1, the method may include:
step S11, aiming at a first electronic device participating in the multi-party call, acquiring audio to be output, which is received and cached at a first moment and sent by a second electronic device participating in the multi-party call;
in an application scenario requiring a multi-party call, such as a conference, a game, and online teaching, any one of the electronic devices participating in the multi-party call may be denoted as a first electronic device, and in a general case, the audio signal uploaded by the other electronic devices participating in the multi-party call (i.e., the other electronic devices participating in the multi-party call except the first electronic device, for convenience of description, the other electronic devices may be denoted as second electronic devices) may be acquired and played as the audio to be output of the first electronic device.
In practical application, the audio to be output sent by the second electronic device includes various audio frequencies in an environment where the second electronic device is located, and if the environment where the second electronic device is located has more noise, the audio recorded by the second electronic device will include more useless noise audio frequencies, where the noise audio frequencies may be environmental noise, or audio frequencies played by other users or electronic devices, and the like, and may be determined according to a specific application scenario, and the details of the application are not described herein.
For example, in the multi-party call scenario shown in fig. 2, suppose participant A and participant B use their respective electronic devices to join a multi-party call from the same room in city C1, while participant C joins from city C2. Under conventional multi-party call processing, the audio spoken by each participant is recorded by his or her own electronic device and sent to the other electronic devices for output; in this process, an electronic device may also record the audio spoken by other participants present in its environment and send it to the other electronic devices.
Taking participant A speaking as an example, the electronic device 1 used by participant A and the electronic device 2 used by participant B both record participant A's audio. The electronic device 1 then sends the recorded audio of participant A to the electronic device 2 and the electronic device 3 for playing, and at the same time the electronic device 2 sends its recording of participant A to the electronic device 1 and the electronic device 3 for playing. Thus, for participant A's electronic device 1, the audio to be output received from the second electronic devices includes participant A's audio and participant B's audio sent by the electronic device 2, as well as participant C's audio sent by the electronic device 3. Similarly, for participant B's electronic device 2, the received audio to be output includes participant A's audio and participant B's audio sent by the electronic device 1, and participant C's audio sent by the electronic device 3; and for participant C's electronic device 3, the received audio to be output includes the audio of participants A and B sent by the electronic device 1 and the audio of participants A and B sent by the electronic device 2.
It can be seen that, in the multi-party call scenario shown in fig. 2, the electronic device of participant A (or participant B) receives not only the audio of the participants using the other electronic devices, but also participant A's (or participant B's) own audio, while the electronic device of participant C receives repeated copies of the audio of participant A and participant B.
Further, while each electronic device plays the received audio, its audio collector continues to collect sound, so the audio played at the local end is recorded again and sent to the other electronic devices. As this process repeats, every electronic device may receive audio with the same content multiple times.
It should be noted that, in the multi-party call scenario shown in fig. 2, speaking times of different participants may be different, and therefore, audio generation and sending times of different participants in fig. 2 may be different, which is not distinguished in fig. 2, and the transmission line in the figure only indicates audio categories that may be transmitted in the case where different participants speak, but does not indicate an audio transmission sequence of each participant.
In other multi-party call application scenarios, for example where the electronic devices participating in the call are located in different spaces, it follows from the above analysis that the audio to be output sent by a second electronic device contains not only the audio of the participant using that device, but possibly also the audio played by that device and/or that audio after reflection within the space, and so on. This may cause the audio to be output received by the first electronic device to contain the same content multiple times; the specific situations are not described in detail in this application.
Step S12, obtaining first audio that has been cached within a preset time period of the first moment;
in combination with the above analysis of the content included in the to-be-output audio received by the first electronic device, the to-be-output audio may include, in addition to the target audio that needs to be output by the first electronic device (i.e., the audio that needs to be heard by the participant using the first electronic device), the audio that the participant has heard or has spoken or repeated by himself, so that the first electronic device needs to be preprocessed before outputting the to-be-output audio, so as to avoid outputting the unwanted audio, such as the audio that the participant of the first electronic device has heard or has spoken or repeated by himself.
In order to achieve the above object, compared to a conventional audio processing method, that is, the second electronic device performs noise reduction on the audio recorded therein, and then sends the audio subjected to noise reduction to the first electronic device, this method can only eliminate the audio played by the second electronic device and recorded again, and cannot eliminate the audio played by other recorded electronic devices, which still causes that the audio to be output received by the first electronic device includes the audio that has been heard by the participant using the first electronic device.
Therefore, the application proposes that the audio output end of the first electronic device is processed, that is, before the audio to be output obtained by the first electronic device is played, the audio to be output is subjected to noise reduction and elimination processing, so as to ensure that the audio finally output by the first electronic device is the audio content which needs to be heard by the participant using the first electronic device, and the useless audio listed above cannot be output, so that the interference of the output of the useless audio on the participant of the first electronic device is avoided, and the user experience is improved.
Based on the above inventive concept, in order to eliminate the unwanted audio as listed above from the audio to be output received by the first electronic device at the first time (i.e. any time during the multi-party call, but not the historical time, which may generally refer to the current time), the present embodiment needs to detect whether the audio to be output contains such unwanted audio, and in particular, in combination with the description of the unwanted audio above, it may be detected whether the first electronic device has entered and/or output audio with matching content at the latest time, that is, whether the audio to be output contains audio that has been heard by the participant of the first electronic device.
Because the audio recorded by the first electronic device at each moment and the received to-be-output audio sent by the second electronic device are generally cached first, this embodiment can obtain, directly from the cache space of the first electronic device, the cached first audio within a preset time period of the first moment, such as audio entered into the first electronic device at a second moment and/or audio output at a third moment. The second moment and the third moment are both earlier than the first moment; that is, if the first moment is the current moment, the second and third moments are historical moments. The present application does not limit the time difference between the second moment and the first moment; this difference and the preset time period are usually small and may be determined according to the delay of audio recording and playing, and the value of the preset time period in step S12 is likewise not limited.
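A minimal sketch of how such a cache of recently recorded and output audio might be kept on the first electronic device is shown below; the class name, the in-memory deque, and the five-second default window are illustrative assumptions rather than anything specified by the disclosure.

```python
import time
from collections import deque

class RecentAudioCache:
    """Keeps audio segments recorded or output by the first electronic device,
    tagged with the moment they were cached, so segments within a preset time
    period of the first moment can be retrieved as the 'first audio'."""

    def __init__(self, window_seconds=5.0):
        self.window = window_seconds          # the preset time period
        self.entries = deque()                # (timestamp, samples) pairs

    def add(self, samples, timestamp=None):
        self.entries.append((timestamp if timestamp is not None else time.time(), samples))

    def first_audio(self, first_moment):
        # Drop entries that have aged out of the preset time period.
        while self.entries and first_moment - self.entries[0][0] > self.window:
            self.entries.popleft()
        # Everything left was cached within the preset period before the first moment.
        return [samples for ts, samples in self.entries if ts < first_moment]
```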
In combination with the multi-party call scenarios listed above, the audio entered into the first electronic device at the second moment (i.e., historical audio) may include audio spoken by the participant using the first electronic device, audio played by a second electronic device located in the same geographic space as the first electronic device or spoken by that device's participant, and may even include echo audio produced after the audio played in that space, or spoken by the participant, has undergone multiple reflections, and the like.
As for the audio output by the first electronic device at the third moment, as analyzed above it refers to audio the first electronic device has already output, that is, audio the participant of the first electronic device has already heard, and the first electronic device does not need to output audio with that content again within the preset time period. The processing performed before the first electronic device output the audio at the third moment may itself follow the audio output processing method described in this embodiment, so that the audio output at the third moment was the content the participant of the first electronic device needed to hear and was not audio with content the first electronic device had already recorded and/or output within the preceding preset time period.
It should be noted that, the specific implementation process of obtaining the first audio in step S12 is not described in detail in this application, and the first audio may be obtained by using the time point information corresponding to the audio recorded and cached at each time, but is not limited to this.
And step S13, detecting that the first audio has the second audio matched with the audio to be output, and eliminating the audio matched with the second audio in the audio to be output.
Continuing the above description of the inventive concept of the audio output processing method proposed by the present application, the unwanted audio needs to be eliminated from the to-be-output audio that the first electronic device received from the second electronic device at the first moment. Combining the above analysis of the unwanted audio, this embodiment may treat as unwanted audio any portion of the to-be-output audio that matches the first audio, that is, audio in the to-be-output audio that the first electronic device itself has recorded and/or output. Accordingly, this embodiment may perform match detection between the to-be-output audio and the consecutive audio frames contained in the first audio cached at the historical moments described above. If there is a second audio that matches at least part of the consecutive audio frames of the to-be-output audio, the unwanted audio listed above can be considered present, and this embodiment may eliminate the portion of the to-be-output audio that matches the second audio. The specific implementations of the audio match detection method and of the audio elimination are not limited in the present application.
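The overall flow of steps S12 and S13 could then look roughly as follows; `matches_audio` and `eliminate_matched` are placeholders for whichever detection and elimination techniques (MFCC, fingerprint, or voiceprint based) the later embodiments describe, and the function is only a sketch, not the claimed implementation.

```python
def process_audio_to_output(audio_to_output, first_audio, matches_audio, eliminate_matched):
    """Before playback, compare the cached audio to be output against the
    recently recorded/output first audio and strip any matched content."""
    for candidate in first_audio:            # each audio contained in the first audio
        if matches_audio(audio_to_output, candidate):
            second_audio = candidate         # a second audio matching the audio to be output
            audio_to_output = eliminate_matched(audio_to_output, second_audio)
    return audio_to_output                   # whatever remains may be played
```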
It should be understood that the audio to be eliminated may be a part of the audio to be output (i.e., a part of consecutive audio frames) or may be the entire audio to be output, which may be determined according to the actual application scenario. If the eliminated audio is the partial audio, the first electronic device may output the audio that remains in the audio to be output without being eliminated.
For example, taking the application scenario shown in fig. 2 above, the to-be-output audio received by the electronic device 3 includes audio newly spoken by participant B and sent by the electronic device 2, together with audio that the electronic device 2 recorded again after playing participant A's speech. Because the electronic device 3 has already output participant A's speech, before outputting the to-be-output audio sent this time by the electronic device 2, the electronic device 3 detects, according to the audio output processing method described above, that the cached audio of participant A matches part of the to-be-output audio; it therefore eliminates participant A's audio from the to-be-output audio and outputs participant B's audio.
Similarly, before outputting their cached to-be-output audio, the other electronic devices in the multi-party call scenario shown in fig. 2 perform useless-audio elimination on the corresponding to-be-output audio according to the audio output processing method described above. As a result, the electronic device 1 and the electronic device 2 each output only the not-yet-played audio of participant C and do not play the various audio recorded by the other electronic device located in the same geographic space, thereby avoiding repeated playing; and the electronic device 3 outputs only the not-yet-played audio of participant A and participant B, neither repeating their audio nor outputting its own user's audio, which greatly improves the user experience.
In some embodiments, the to-be-output audio sent by the second electronic device may contain content that the participant using the second electronic device repeats, that is, two audios with the same content are collected by the second electronic device within a short interval and both are transmitted to the first electronic device and cached as to-be-output audio. In this case, when processing the to-be-output audio sent by the second electronic device, in addition to detecting in the manner described in the above steps whether it contains audio that has already been played or recorded, it may also be detected whether matching audio exists at some adjacent future moment; if so, the cached matching audio at the future moment may be deleted directly while the current audio is being processed.
In still other embodiments, the present application may also, in the manner described in this embodiment, only detect whether the currently processed to-be-output audio contains audio that has already been played or recorded, without checking whether matching audio exists at future moments. After processing of the current to-be-output audio is completed, the audio cached at a future moment is processed, in turn, according to the audio output processing method described in this embodiment; if that cached future audio matches the audio just output, the method will still detect it and eliminate it, thereby avoiding repeated playing of audio with the same content.
It should be understood that, according to the above-described detection manner, if there is no second audio matching the audio to be output in the first audio, the first electronic device may directly output the audio to be output, and a specific output method is not limited.
To sum up, in this embodiment any electronic device participating in a multi-party call is regarded as a first electronic device. Before outputting to-be-output audio that was received and cached at a first moment and sent by a second electronic device participating in the call, the first audio cached within a preset time period of the first moment is obtained, such as audio recorded at a past second moment and/or audio output at a third moment, to detect whether the first audio contains a second audio matching the to-be-output audio. If it does, the to-be-output audio contains audio that interferes with the participant using the first electronic device, namely audio the participant has already heard, and this embodiment can directly eliminate the matching portion of the to-be-output audio. The first electronic device is thus guaranteed not to output audio with the same content repeatedly; the interference caused by re-recording its own played audio is eliminated, as is the interference caused by recording audio played by other electronic devices, which greatly improves the user experience and makes the method suitable for various multi-party call scenarios.
Referring to fig. 3, which shows a schematic flow chart of yet another optional example of the audio output processing method proposed in the present application, this embodiment may be an optional detailed implementation of the audio output processing method proposed in the foregoing embodiment, although the method is not limited to the detailed implementation described here. As shown in fig. 3, the method may include:
step S31, aiming at a first electronic device participating in the multi-party call, acquiring audio to be output, which is received and cached at a first moment and sent by a second electronic device participating in the multi-party call;
step S32, obtaining a first audio cached in a preset time period away from the first moment;
regarding the implementation processes of step S31 and step S32, reference may be made to the description of corresponding parts in the foregoing embodiments, which are not repeated herein.
Step S33, obtaining MFCC characteristics of the audio to be output;
In practical applications, the human ear's perception of an audio signal tends to focus on specific frequency regions rather than the whole spectral envelope, and the cochlea effectively filters on a roughly logarithmic frequency scale, approximately linear below 1000 Hz and logarithmic above 1000 Hz, which makes the human ear more sensitive to low-frequency signals than to high-frequency signals. Human perception of the frequency content of an audio signal can therefore be regarded as following a subjectively defined non-linear scale, referred to as the "Mel" scale.
It can be seen that the MFCC (Mel-Frequency Cepstral Coefficients) feature combines the auditory perception characteristics of the human ear with the speech production mechanism, treating the ear as a particular filter that considers only certain frequency components. The present application can therefore determine whether two audios match by comparing the MFCC features they contain.
In general, the acquired audio to be output is first divided into frames. After the audio frames contained in the audio to be output are obtained, the spectrum of each frame is calculated to obtain a short-time spectrum; the spectrum is smoothed and its harmonics suppressed by a Mel filter bank, and a logarithmic Mel energy spectrum is output; after DCT (Discrete Cosine Transform) decorrelation, the energy is concentrated in the low-frequency part, yielding the MFCC features. That is, MFCC feature extraction generally comprises two parts, conversion to the Mel scale and cepstrum analysis, and the present application does not limit the specific implementation used to extract the MFCC features from the audio to be output.
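As one possible illustration of this extraction pipeline (framing, short-time spectrum, Mel filtering, log compression and DCT), the sketch below uses the librosa library, which bundles these steps; any equivalent MFCC implementation would serve.

```python
import librosa  # assumed available; any MFCC implementation would do

def extract_mfcc(samples, sample_rate, n_mfcc=13):
    """Frame the signal, take the short-time spectrum, apply the Mel filter
    bank, log-compress and DCT-decorrelate; librosa performs these steps
    internally and returns an (n_mfcc, n_frames) coefficient matrix."""
    return librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=n_mfcc)
```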
Step S34, the MFCC characteristics of the audio to be output are sequentially matched with the MFCC characteristics corresponding to the audio contained in the first audio;
similar to the above extraction process of the MFCC features of the audio to be output, the extraction process of the MFCC features of each audio included in the first audio is not described in detail in this application.
In some embodiments, since the first electronic device may process each received to-be-output audio in the manner described in this embodiment, the MFCC features of that audio will already have been obtained; if the audio is detected as suitable for output, it may be cached for a period of time as audio already output by the first electronic device. When step S34 is executed, the first electronic device can therefore read the MFCC features corresponding to the already-output audio in the first audio directly from the cache space. Similarly, for audio recorded by the first electronic device itself, if it has already undergone match detection against to-be-output audio received at other moments, its MFCC features can also be read directly from the cache space rather than re-extracted at every match detection according to the method above. This reduces the data processing workload of the first electronic device and improves the efficiency of audio match detection.
To match the MFCC features of the audio to be output against those of each audio in the first audio, a difference can be computed directly over the corresponding dimensions, and the resulting difference used to decide whether the audios match. Since MFCC features are usually multi-dimensional feature vectors, other similarity detection algorithms may also be adopted to implement match detection between the audio to be output and the different first audios; the present application does not limit the specific similarity detection algorithm, that is, the implementation of step S34 is not limited to the match detection methods recorded here.
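A simple way to realize such a comparison, under the assumption that both segments are already roughly aligned, is to average per-frame cosine similarity over the overlapping frames and test it against a threshold; this is only one of many possible similarity measures, and the threshold value is an assumption.

```python
import numpy as np

def mfcc_match_score(mfcc_a, mfcc_b):
    """Average per-frame cosine similarity between two (n_mfcc, n_frames) MFCC
    matrices over their overlapping frames; a sliding window or DTW alignment
    could be used instead when the segments are offset in time."""
    n = min(mfcc_a.shape[1], mfcc_b.shape[1])
    a, b = mfcc_a[:, :n], mfcc_b[:, :n]
    num = np.sum(a * b, axis=0)
    den = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-9
    return float(np.mean(num / den))

def satisfies_first_condition(mfcc_a, mfcc_b, similarity_threshold=0.9):
    # First condition (illustrative): similarity of the MFCC features exceeds a threshold.
    return mfcc_match_score(mfcc_a, mfcc_b) > similarity_threshold
```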
Step S35, determining that the second audio corresponding to the matching result meeting the first condition is matched with the audio to be output;
after the matching detection between the audio to be output and each first audio is realized according to the above-described mode, if the obtained matching result meets the first condition, the audio to be output can be considered to be matched with the corresponding audio contained in the first audio, and the corresponding audio at this time is recorded as a second audio; otherwise, if the matching result does not satisfy the first condition, it may be considered that the audio to be output is not matched with the corresponding audio in the first audio, and the matching detection may be continued on the other audio in the first audio.
It can be seen that the first condition may be a condition indicating a match between two audios, such as a difference between corresponding MFCC features being smaller than a first threshold (usually smaller but not limited to a value), or a similarity between corresponding MFCC features being larger than a similarity threshold (usually larger but not limited to a value), and so on, and the content of the first condition may be determined according to the specific implementation method of the match detection, and the content of the first condition is not limited in this application.
In step S36, the audio matching the second audio in the audio to be output is eliminated.
In some embodiments, when the to-be-output audio is detected in the manner described above to contain interfering audio to be eliminated, the present application may further check whether the degree of matching between the to-be-output audio and the second audio reaches a matching threshold (i.e., a threshold at which the second audio and the to-be-output audio are considered the same; the audio content need not be exactly identical, and although the threshold is usually large its value is not limited). If the matching degree reaches the threshold, the second audio can be considered to match the entire to-be-output audio, output of the to-be-output audio may be prohibited, and the to-be-output audio may be eliminated as a whole.
If the matching degree does not reach the matching threshold, the second audio may be considered to match only part of the audio to be output, and the audio to be output also contains a part that still needs to be output.
Specifically, in a possible implementation manner, the first electronic device may adopt a noise cancellation method such as echo noise cancellation, reference signal subtraction, signal amplitude feature extraction, and the like, to perform cancellation processing on the audio to be output, so as to cancel an audio matched with the second audio in the audio to be output and output an unmatched target output audio.
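A hedged sketch of this elimination step follows: a full match suppresses the whole audio to be output, while a partial match removes the matched sample range and keeps the rest as the target output audio. The hard cut shown here merely stands in for the echo-cancellation or reference-subtraction techniques mentioned above, and the index arguments are assumed to come from the match detection.

```python
import numpy as np

def eliminate_matched(audio_to_output, match_start, match_end, full_match=False):
    """If the second audio matches the whole audio to be output, suppress it;
    otherwise cut out the matched sample range [match_start, match_end) and
    keep the remainder as the target output audio."""
    if full_match:
        return np.zeros(0, dtype=audio_to_output.dtype)   # nothing is played
    return np.concatenate([audio_to_output[:match_start],
                           audio_to_output[match_end:]])
```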
It should be understood that, through the matching detection, if each matching result does not satisfy the first condition, it is determined that the audio to be output is not matched with the first audio, which indicates that no interfering audio exists in the audio to be output, and the audio to be output may be output.
To sum up, in this embodiment, before outputting the to-be-output audio received and cached at the first moment and sent by the second electronic device participating in the multi-party call, the first electronic device obtains the first audio cached within a preset time period of the first moment, acquires the MFCC features of the to-be-output audio and of each audio contained in the first audio, and matches them in turn. If a second audio is found whose matching result satisfies the first condition, that is, a second audio matching the to-be-output audio, the to-be-output audio can be considered to contain interfering audio, namely the audio matching the second audio, and the first electronic device eliminates that audio from the to-be-output audio. This ensures that the first electronic device does not repeatedly output audio with the same content, prevents the interference caused by outputting audio it recorded itself or audio from various sources that has already been output, and improves the user experience during the multi-party call.
Referring to fig. 4, which is a schematic flow diagram of another optional example of the audio output processing method provided in the present application, this embodiment may be an optional detailed implementation of the audio output processing method provided in the foregoing embodiment, and differs from the above detailed embodiment in how the match detection between the audio to be output and the first audio is implemented.
Therefore, as shown in fig. 4, in the audio output processing method proposed in the present application, the implementation process of detecting the presence of the second audio matching the audio to be output in the first audio may include, but is not limited to, the following steps:
step S41, extracting a first audio fingerprint characteristic of the audio to be output;
Audio fingerprinting refers to extracting, through a specific algorithm, unique digital features of a piece of audio in the form of identifiers, and using those features to identify audio among massive audio samples; it can be applied to audio identification, copyright supervision, content deduplication, and the like. This embodiment therefore extracts an audio fingerprint feature from the audio to be output using audio fingerprinting technology, and records the extracted feature as the first audio fingerprint feature.
It should be understood that, for different types of audio to be output and/or according to different audio matching requirements, one or more audio fingerprint algorithms may be selected to extract the first audio fingerprint feature, and this application is not described in detail herein.
Step S42, matching the first audio fingerprint characteristics with second audio fingerprint characteristics corresponding to each audio contained in the first audio in sequence;
step S43, it is determined that the second audio corresponding to the matching result that satisfies the second condition matches the audio to be output.
The manner of obtaining the second audio fingerprint features corresponding to each audio included in the first audio is similar to the process of obtaining the MFCC features corresponding to each audio included in the first audio, which is described in the corresponding portion of step S34, and is not described again in this embodiment. And for the extraction method of the second audio fingerprint feature, similar to the above first audio feature extraction method, detailed description is not given in this application.
In addition, the second condition of the present application is similar to the content of the first condition, and it may indicate that the condition for determining that the audio to be output matches the second audio according to the matching detection manner described in this embodiment, and the specific content included in the condition may be determined according to a specific implementation method of the matching detection, which is not listed in this application.
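For illustration only, the sketch below builds a toy landmark-style fingerprint (hashes of consecutive dominant spectral peaks) and treats a sufficiently large overlap of hashes as satisfying the second condition; real fingerprinting schemes are considerably more robust, and the overlap threshold is an assumption.

```python
import numpy as np
import librosa  # assumed available for the short-time Fourier transform

def fingerprint(samples, n_fft=2048, hop_length=512):
    """Toy fingerprint: take the strongest frequency bin per frame and hash
    pairs of consecutive peaks into a set of identifiers."""
    spec = np.abs(librosa.stft(samples, n_fft=n_fft, hop_length=hop_length))
    peaks = spec.argmax(axis=0)               # dominant bin per frame
    return {hash((int(peaks[i]), int(peaks[i + 1]))) for i in range(len(peaks) - 1)}

def fingerprints_match(fp_a, fp_b, overlap_threshold=0.6):
    # Second condition (illustrative): a large enough fraction of shared hashes.
    if not fp_a or not fp_b:
        return False
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b)) >= overlap_threshold
```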
It should be understood that, through the matching detection, if each matching result does not satisfy the second condition, it is determined that the audio to be output is not matched with the first audio, which indicates that no interfering audio exists in the audio to be output, and the audio to be output may be output.
Therefore, in this embodiment, before the first electronic device outputs the to-be-output audio received and cached at the first moment and sent by the second electronic device participating in the multi-party call, it acquires the first audio cached within a preset time period of the first moment, then acquires the audio fingerprint features of the to-be-output audio and of each audio contained in the first audio, and determines, from the comparison of these fingerprint features, whether the first audio contains a second audio matching the to-be-output audio, so as to eliminate the matching audio from the to-be-output audio. This avoids the interference caused by the first electronic device outputting audio it recorded itself or audio from various sources that has already been output.
Referring to fig. 5, which is a schematic flow diagram of another optional example of the audio output processing method provided in the present application, this embodiment may be an optional detailed implementation of the audio output processing method provided in the foregoing embodiments, differing from the foregoing detailed embodiments in how the match detection between the audio to be output and the first audio is implemented.
Therefore, as shown in fig. 5, in the audio output processing method proposed in the present application, the implementation process of detecting the presence of the second audio matching the audio to be output in the first audio may include, but is not limited to, the following steps:
step S51, acquiring first voiceprint information of the audio to be output and second voiceprint information corresponding to each audio contained in the first audio;
voiceprint (Voiceprint) is a spectrum of sound waves carrying verbal information that is not only specific, but also characterized by relative stability. Based on this, this application can be through the voiceprint comparison, and whether discernment is same people, and then dispel the processing to the repeated audio frequency of same people that each electronic equipment types, avoid repeated broadcast.
Based on this, the present application may obtain the first voiceprint information of the audio to be output and the second voiceprint information corresponding to each audio included in the first audio.
Step S52, based on the first voiceprint information and the second voiceprint information, identifying the content of the audio corresponding to the same voiceprint in the audio to be output and the first audio to obtain a corresponding first audio text and a corresponding second audio text;
as described above, the embodiment can compare the obtained first voiceprint information with each second voiceprint information, for example, directly calculate the difference value of the corresponding voiceprint information, and determine whether the sound source outputting the audio with the two voiceprint information is the same person by determining whether the difference value reaches a specific threshold (the value of the difference value is not limited); certainly, the similarity calculation between the first voiceprint information and each second voiceprint information can be realized through other similarity detection modes, and then whether a sound source outputting audio with the two voiceprint information is the same person or not is judged according to the similarity.
If, according to the comparison method described above, the sound source of the audio to be output and that of a certain audio contained in the first audio are determined to be the same person, that is, to have the same voiceprint, then in order to prevent the first electronic device from outputting repeated content spoken by the user with that voiceprint, this embodiment may perform content recognition on the audio to be output and on the corresponding audio in the first audio to obtain the first audio text of the audio to be output and the second audio text of that audio. How to recognize the audio text contained in an audio may be implemented with artificial intelligence techniques such as speech recognition, and the details are not described herein.
Step S53, acquiring the text similarity of the first audio text and the second audio text;
step S54, determining that the second audio corresponding to the text similarity greater than the first similarity threshold matches the audio to be output.
After the above detection determines that an audio entered and/or output by the first electronic device comes from the same person as the audio to be output, it can further be checked whether its content matches that of the audio to be output. This embodiment may therefore use a similarity calculation to obtain the text similarity between the first audio text and the second audio text obtained above. If there is a second audio whose text similarity exceeds the first similarity threshold (a threshold indicating that the content of the corresponding audio matches that of the audio to be output; its value is not limited in this application), that second audio is determined to match the audio to be output, which indicates that interfering audio exists in the audio to be output, and the audio matching the second audio may be eliminated from the audio to be output. For the specific elimination process, reference may be made to the description of the corresponding parts of the foregoing embodiments.
It should be noted that this application does not limit the method for acquiring the text similarity between the first audio text and the second audio text. A second audio whose text similarity is greater than the first similarity threshold may match the audio to be output either partially or completely, which can be controlled by adjusting the first similarity threshold: the larger the first similarity threshold, the higher the degree of matching between the second audio and the audio to be output; conversely, the smaller the first similarity threshold, the lower the degree of matching. The application may subsequently decide how to eliminate the relevant part of the audio to be output according to this degree of matching.
It should be understood that if, after the similarity detection, no text similarity is greater than the first similarity threshold, and in particular if every text similarity is smaller than a second similarity threshold (which may be a lower value indicating that the corresponding audio does not match the content of the audio to be output at all, its value not being limited), it may be determined that no interfering audio exists in the audio to be output, and the audio to be output may be output directly.
To sum up, in this embodiment, before the first electronic device outputs the audio to be output that was received and cached at the first time and sent by the second electronic devices participating in the multi-party call, it acquires the first audio cached within a preset time period from the first time and determines whether the first audio contains audio with the same voiceprint as the audio to be output. If not, the first electronic device may output the audio to be output. If audio with the same voiceprint exists, text-similarity detection may further be performed between the audio text contained in that audio and the audio text contained in the audio to be output, so as to determine whether the first audio contains a second audio matching the content of the audio to be output and to eliminate the audio matched with the second audio in the audio to be output, thereby preventing the first electronic device from outputting the various audio it has recorded itself, as well as interference caused by audio from various sources that has already been output.
In some embodiments, before the first electronic device outputs the audio to be output that was received and cached at the first time and sent by the second electronic devices participating in the multi-party call, it may obtain the first audio cached within a preset time period from the first time, and this application may also use, but is not limited to, combinations of the audio matching detection methods described in the above embodiments to detect whether a second audio matching the audio to be output exists in the first audio; the specific implementation may refer to the description of the corresponding parts of the above embodiments and is not repeated here. Such combined detection improves the reliability and accuracy of the matching detection between the first audio and the audio to be output, and further improves the reliability of eliminating interfering audio from the audio to be output.
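A minimal sketch of such a combined check is given below; the individual detector callables stand in for the MFCC, fingerprint and voiceprint-plus-text methods of the earlier embodiments, and the voting rule is only one possible way of combining them.

```python
# Illustrative sketch: combine several independent matching detectors and
# require a minimum number of agreeing votes before declaring the cached
# audio an interfering match. The detectors themselves are assumed to be
# implemented elsewhere (e.g. the MFCC, fingerprint and voiceprint methods).
from typing import Callable, Iterable

Detector = Callable[[object, object], bool]

def is_interfering(to_output, cached, detectors: Iterable[Detector],
                   min_votes: int = 2) -> bool:
    votes = sum(1 for detect in detectors if detect(to_output, cached))
    return votes >= min_votes  # agreement raises reliability and accuracy
```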
In still other embodiments, on the basis of the audio output processing methods described in the above embodiments, the present application may further detect whether, among the second electronic devices participating in the multi-party call, there exists a third electronic device located within a preset spatial range of the first electronic device; if such a third electronic device exists, the audio from the third electronic device in the audio to be output is determined to be audio to be eliminated, and that audio is eliminated.
In this way, the embodiment can directly determine the audio sent by other electronic devices in the same geographic space (i.e., within the preset spatial range) as audio to be eliminated (i.e., the interfering audio described above) and eliminate it directly, without subsequently performing matching detection in the manner described above, which reduces the data-processing workload of the first electronic device and improves the efficiency of eliminating interfering audio.
It should be noted that this application does not limit how to determine whether the electronic devices are in the same geographic space; an audio-content matching method, a geographic-location detection method, and the like may be used, but are not limited thereto. Having determined which electronic devices are in the same geographic space, this application may add the same attribute value to the audio entered by those devices, and different attribute values to the audio entered by devices in different geographic spaces, so that after receiving audio sent by other electronic devices, an electronic device can directly determine whether it is audio to be eliminated by comparing attribute values. The implementation is not limited to the attribute-value method described in this embodiment, and the specific content of the attribute values is not limited.
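As an illustration of the attribute-value idea (field names and types are assumptions, not taken from the original disclosure), audio from devices known to share a geographic space could be tagged and filtered as follows:

```python
# Illustrative sketch: tag each uploaded audio with the space it was recorded
# in; on receipt, audio carrying the receiver's own space tag is treated as
# audio to be eliminated without any further matching detection.
from dataclasses import dataclass

@dataclass
class AudioPacket:
    sender_id: str
    space_id: str   # identical for all devices determined to share one space
    payload: bytes  # encoded audio data

def should_eliminate(packet: AudioPacket, local_space_id: str) -> bool:
    return packet.space_id == local_space_id
```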
Referring to fig. 6, which is a schematic structural diagram of an alternative example of the audio output processing apparatus proposed in the present application, the apparatus may be applied to an electronic device and, as shown in fig. 6, may include:
the audio to be output acquiring module 61 is configured to acquire, for a first electronic device participating in a multi-party call, an audio to be output sent by a second electronic device participating in the multi-party call, which is received and cached at a first time;
a first audio obtaining module 62, configured to obtain a first audio that is cached within a preset time period from the first time.
The first audio comprises audio input by the first electronic equipment at a second moment and/or audio output by the first electronic equipment at a third moment, and the second moment and the third moment are earlier than the first moment.
The audio detection module 63 is configured to detect that a second audio matched with the audio to be output exists in the first audio, and to eliminate the audio matched with the second audio in the audio to be output.
In one possible implementation, the audio detection module 63 may include:
the characteristic obtaining unit is used for obtaining Mel-frequency cepstral coefficient (MFCC) features of the audio to be output;
the first matching unit is used for sequentially matching the MFCC features of the audio to be output with the MFCC features corresponding to each audio contained in the first audio;
the first determining unit is used for determining that the second audio corresponding to the matching result meeting the first condition is matched with the audio to be output;
and the first eliminating unit is used for eliminating the audio matched with the second audio in the audio to be output.
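For illustration, a minimal sketch of such an MFCC-based match is shown below, assuming the librosa library is available and that the two signals are roughly time-aligned; the distance measure and threshold are assumptions rather than the condition actually used by the apparatus.

```python
# Illustrative sketch: extract MFCC features with librosa and declare a match
# when the mean per-frame Euclidean distance over the overlapping frames is
# small enough. The threshold value is purely illustrative.
import librosa
import numpy as np

def mfcc_features(signal: np.ndarray, sr: int, n_mfcc: int = 13) -> np.ndarray:
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

def mfcc_match(to_output: np.ndarray, cached: np.ndarray, sr: int,
               max_mean_distance: float = 25.0) -> bool:
    a = mfcc_features(to_output, sr)
    b = mfcc_features(cached, sr)
    frames = min(a.shape[1], b.shape[1])  # compare only the overlapping frames
    if frames == 0:
        return False
    distance = float(np.mean(np.linalg.norm(a[:, :frames] - b[:, :frames], axis=0)))
    return distance <= max_mean_distance
```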
In yet another possible implementation manner, the audio detection module 63 may include:
the audio fingerprint feature extraction unit is used for extracting a first audio fingerprint feature of the audio to be output;
the second matching unit is used for sequentially matching the first audio fingerprint features with second audio fingerprint features corresponding to each audio contained in the first audio;
and the second determining unit is used for determining that the second audio corresponding to the matching result meeting the second condition is matched with the audio to be output.
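The disclosure does not specify the fingerprint format, so the sketch below merely illustrates one common approach based on spectrogram peak positions; it assumes the compared signals are time-aligned, whereas production fingerprinting schemes typically hash peak pairs to remain robust to time offsets.

```python
# Illustrative sketch: fingerprint an audio signal by the positions of its
# strongest spectrogram peaks, then compare two fingerprints by set overlap.
import numpy as np
from scipy.signal import stft

def fingerprint(signal: np.ndarray, sr: int, peaks_per_frame: int = 3) -> set:
    _, _, spec = stft(signal, fs=sr, nperseg=1024)
    magnitude = np.abs(spec)
    peaks = set()
    for t in range(magnitude.shape[1]):
        strongest = np.argsort(magnitude[:, t])[-peaks_per_frame:]
        peaks.update((t, int(f)) for f in strongest)
    return peaks

def fingerprint_match(fp_a: set, fp_b: set, min_overlap: float = 0.6) -> bool:
    if not fp_a or not fp_b:
        return False
    overlap = len(fp_a & fp_b) / min(len(fp_a), len(fp_b))
    return overlap >= min_overlap  # stands in for the "second condition"
```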
In yet another possible implementation manner, the audio detection module 63 may include:
the voiceprint information acquisition unit is used for acquiring first voiceprint information of the audio to be output and second voiceprint information corresponding to each audio contained in the first audio;
the content identification unit is used for carrying out content identification on the audio corresponding to the same voiceprint in the audio to be output and the first audio based on the first voiceprint information and the second voiceprint information to obtain a corresponding first audio text and a corresponding second audio text;
a text similarity obtaining unit, configured to obtain a text similarity between the first audio text and the second audio text;
and the third determining unit is used for determining that the second audio corresponding to the text similarity larger than the first similarity threshold is matched with the audio to be output.
On the basis of the above embodiments, the audio detection module 63 may further include:
the first eliminating unit is used for eliminating the audio to be output in a case where the second audio matches the entire audio to be output;
the noise elimination processing unit is used for carrying out noise elimination processing on the audio to be output by utilizing a second audio under the condition that the second audio is matched with part of the audio in the audio to be output so as to obtain a target output audio;
an audio output unit for outputting the target output audio.
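The two branches of the elimination logic can be sketched as follows; the time-domain subtraction is only a stand-in for whatever noise elimination processing is actually used, and a real implementation would first align the signals and typically work in the frequency domain.

```python
# Illustrative sketch: drop the audio to be output entirely on a full match,
# otherwise subtract the matched second audio from the overlapping region to
# obtain the target output audio (float samples assumed).
from typing import Optional
import numpy as np

def eliminate(to_output: np.ndarray, second_audio: np.ndarray,
              full_match: bool) -> Optional[np.ndarray]:
    if full_match:
        return None  # nothing is played back
    n = min(len(to_output), len(second_audio))
    target = to_output.copy()
    target[:n] -= second_audio[:n]  # crude interference suppression
    return target  # target output audio handed to the audio output unit
```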
On the basis of the foregoing embodiments, the audio output processing apparatus proposed by the present application may further include:
the device detection module is used for detecting whether third electronic equipment which is in a preset space range with the first electronic equipment exists in second electronic equipment participating in multi-party call;
the audio to be eliminated determining module is used for determining that the audio from the third electronic equipment in the audio to be output is the audio to be eliminated when the detection result of the equipment detecting module is that the third electronic equipment exists;
and the elimination processing module is used for eliminating the audio to be eliminated.
It should be noted that, various modules, units, and the like in the embodiments of the foregoing apparatuses may be stored in the memory as program modules, and the processor executes the program modules stored in the memory to implement corresponding functions, and for the functions implemented by the program modules and their combinations and the achieved technical effects, reference may be made to the description of corresponding parts in the embodiments of the foregoing methods, which is not described in detail in this embodiment.
The present application further provides a storage medium on which a computer program may be stored, the computer program being called and loaded by a processor to implement the steps of the audio output processing method described in the above embodiments.
Referring to fig. 7, which schematically illustrates the hardware structure of an optional example of an electronic device for implementing the audio output processing method provided in the present application, the electronic device may include: at least one audio collector 71, at least one audio player 72, at least one memory 73 and at least one processor 74, wherein:
the at least one audio collector 71, the at least one audio player 72, the at least one memory 73 and the at least one processor 74 may be connected to a communication bus through corresponding communication interfaces, and data interaction between the at least one audio collector and the at least one audio player is achieved through the communication bus.
The audio collector 71 may include a microphone or other sound pickup device, and this application does not limit its composition or operating principle. In practical applications, the audio collector 71 may collect audio present in the environment where the electronic device is currently located, including but not limited to audio output by a participant who uses the electronic device to participate in the multi-party call, audio re-recorded when the electronic device plays certain audio, and audio output by other participants or electronic devices in the environment, which may be determined according to the specific application scenario.
In practical applications of this embodiment, if the electronic device includes a plurality of audio collectors 71, they may be deployed at different positions, and during audio collection the parameters of the corresponding main audio collector may be adjusted according to the sound-source position, so as to improve the reliability of recording the audio output by the target sound source; the specific implementation is not described in detail in this embodiment.
The audio player 72 may include a loudspeaker or the like and is configured to output the target output audio derived from the audio sent by other electronic devices participating in the multi-party call, where the target output audio is obtained by processing the audio to be output sent by those devices according to the audio output processing method provided in this application and eliminating the interfering audio therein; the process of obtaining the target output audio may refer to the description of the corresponding parts of the foregoing embodiments and is not repeated here.
In practical applications, if the electronic device includes a plurality of audio players 72, they may be deployed at different positions of the electronic device to achieve a stereo surround-sound effect; of course, according to other audio playback requirements, the plurality of audio players 72 may be deployed using corresponding strategies, and this application does not describe in detail the deployment positions of the audio players 72 or the operating principle of the audio played through them.
The memory 73 may be used to store programs implementing the audio output processing methods described above; the processor 74 may be configured to load and execute a program stored in the memory 73 to implement each step of the audio output processing method described in any one of the above method embodiments, and for a specific implementation process, reference may be made to the description of the corresponding part in the above embodiment, which is not described in detail in this embodiment.
In this embodiment, the memory 73 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device or other non-volatile solid-state storage device. The processor 74 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
It should be understood that the structure of the electronic device shown in fig. 7 does not limit the electronic device in the embodiments of the present application; in practical applications, the electronic device may include more or fewer components than those shown in fig. 7, or combine certain components, such as various communication interfaces, other input devices, output devices, and the like, which are not listed here.
Referring to fig. 8, which is a schematic diagram of an alternative example of the audio output processing system proposed in the present application, the system may include: a plurality of electronic devices 81 participating in a multi-party call, and a communication server 82 in communication connection with the plurality of electronic devices, respectively, wherein:
regarding the structural composition of the electronic device 81 and the function implementation process thereof, reference may be made to the description of the embodiment of the electronic device described above, which is not described again in this embodiment.
The communication server 82 may be configured to construct a virtual space for implementing a multi-party call, so that the plurality of electronic devices 81 access the virtual space to implement mutual communication, and a detailed implementation process of this embodiment is not described in detail.
In practical application of the present application, the communication server 82 may be an independent physical server, or a server cluster formed by a plurality of physical servers, or a cloud server supporting cloud computing, and the like, and may implement data interaction with each electronic device 81 through the internet, and a specific interaction process may be determined by combining a multi-party call application scenario, which is not described in detail herein.
With regard to the implementation of the audio output processing method described in the above method embodiments and the application scenario shown in fig. 2, it should be understood that data interaction between different electronic devices may be implemented through the communication server: each of the plurality of electronic devices participating in the multi-party call sends its uploaded audio to the communication server, and the communication server forwards the uploaded audio to the other electronic devices currently participating in the multi-party call.
Therefore, in the above method embodiments, when the first electronic device receives and caches the audio to be output sent by the second electronic devices participating in the multi-party call, what actually happens is that the second electronic devices send their recorded audio to the communication server, the communication server sends that audio to the electronic devices participating in the multi-party call, such as the first electronic device, and the first electronic device receives and caches the audio sent by the communication server as the audio to be output. The data transmission between devices involved in the other steps of the method is similar and is not described in detail in this application.
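A toy, in-process sketch of that forwarding behaviour is given below; it ignores networking, codecs, buffering and ordering, and the class and method names are assumptions used only for illustration.

```python
# Illustrative sketch: the communication server keeps a registry of devices
# in one virtual space and relays every uploaded audio to all other devices.
class CommunicationServer:
    def __init__(self):
        self.virtual_space = {}  # device_id -> callable that delivers audio

    def join(self, device_id, deliver):
        self.virtual_space[device_id] = deliver

    def upload(self, sender_id, audio):
        for device_id, deliver in self.virtual_space.items():
            if device_id != sender_id:  # never echo audio back to its sender
                deliver(audio)
```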
Finally, it should be noted that, in the present specification, the embodiments are described in a progressive or parallel manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device, the system and the electronic equipment disclosed by the embodiment correspond to the method disclosed by the embodiment, so that the description is relatively simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of audio output processing, the method comprising:
aiming at a first electronic device participating in multi-party communication, acquiring audio to be output, which is received and cached at a first moment and is sent by a second electronic device participating in multi-party communication;
obtaining a first audio which is cached within a preset time period away from the first moment, wherein the first audio comprises an audio which is input by the first electronic equipment at a second moment and/or an audio which is output at a third moment, and the second moment and the third moment are both earlier than the first moment;
and detecting that second audio matched with the audio to be output exists in the first audio, and eliminating the audio matched with the second audio in the audio to be output.
2. The method of claim 1, the detecting a presence of a second audio in the first audio that matches the audio to be output, comprising:
acquiring Mel-frequency cepstral coefficient (MFCC) features of the audio to be output;
sequentially matching the MFCC features of the audio to be output with the MFCC features corresponding to the audio contained in the first audio;
and determining that the second audio corresponding to the matching result meeting the first condition is matched with the audio to be output.
3. The method of claim 1, the detecting a presence of a second audio in the first audio that matches the audio to be output, comprising:
extracting a first audio fingerprint feature of the audio to be output;
sequentially matching the first audio fingerprint features with second audio fingerprint features corresponding to each audio contained in the first audio;
and determining that the second audio corresponding to the matching result meeting the second condition is matched with the audio to be output.
4. The method of claim 1, the detecting a presence of a second audio in the first audio that matches the audio to be output, comprising:
acquiring first voiceprint information of the audio to be output and second voiceprint information corresponding to each audio contained in the first audio;
based on the first voiceprint information and the second voiceprint information, performing content identification on the audio corresponding to the same voiceprint in the audio to be output and the first audio to obtain a corresponding first audio text and a corresponding second audio text;
acquiring the text similarity of the first audio text and the second audio text;
and determining that the second audio corresponding to the text similarity larger than the first similarity threshold is matched with the audio to be output.
5. The method according to any one of claims 1 to 4, wherein the eliminating the audio matched with the second audio in the audio to be output comprises:
if the second audio matches the whole audio to be output, eliminating the audio to be output;
if the second audio matches part of the audio in the audio to be output, performing noise elimination processing on the audio to be output by using the second audio to obtain a target output audio;
outputting the target output audio.
6. The method of any of claims 1-4, further comprising:
detecting whether a third electronic device in a preset space range with the first electronic device exists in second electronic devices participating in multi-party call;
if the third electronic equipment exists, determining that the audio from the third electronic equipment in the audio to be output is the audio to be eliminated;
and eliminating the audio to be eliminated.
7. An audio output processing apparatus, the apparatus comprising:
the audio to be output acquiring module is used for acquiring, for a first electronic device participating in a multi-party call, audio to be output which is received and cached at a first moment and sent by a second electronic device participating in the multi-party call;
the first audio acquisition module is used for acquiring a first audio which is cached within a preset time period away from the first moment, wherein the first audio comprises an audio which is input by the first electronic equipment at a second moment and/or an audio which is output at a third moment, and the second moment and the third moment are both earlier than the first moment;
and the audio detection module is used for detecting that second audio matched with the audio to be output exists in the first audio and eliminating the audio matched with the second audio in the audio to be output.
8. The apparatus of claim 7, the audio detection module comprising:
the characteristic obtaining unit is used for obtaining Mel-frequency cepstral coefficient (MFCC) features of the audio to be output;
the first matching unit is used for sequentially matching the MFCC features of the audio to be output with the MFCC features corresponding to each audio contained in the first audio;
the first determining unit is used for determining that the second audio corresponding to the matching result meeting the first condition is matched with the audio to be output;
and the first eliminating unit is used for eliminating the audio matched with the second audio in the audio to be output.
9. An electronic device, the electronic device comprising:
an audio collector; an audio player;
a memory for storing a program for implementing the audio output processing method according to any one of claims 1 to 6;
the processor is used for loading and executing the program stored in the memory so as to realize the steps of the audio output processing method according to any one of claims 1 to 6.
10. An audio output processing system, the system comprising:
a plurality of electronic devices participating in a multi-party call, the electronic devices being the electronic device of claim 9;
and the communication server is in communication connection with the electronic devices respectively and is used for constructing a virtual space for realizing multi-party call so that the electronic devices access the virtual space to realize mutual communication.
CN202010617753.2A 2020-06-30 2020-06-30 Audio output processing method, device and system and electronic equipment Pending CN111800552A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010617753.2A CN111800552A (en) 2020-06-30 2020-06-30 Audio output processing method, device and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN111800552A (en) 2020-10-20

Family

ID=72810914

Country Status (1)

Country Link
CN (1) CN111800552A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694665A (en) * 2020-12-28 2022-07-01 阿里巴巴集团控股有限公司 Voice signal processing method and device, storage medium and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102025852A (en) * 2009-09-23 2011-04-20 宝利通公司 Detection and suppression of returned audio at near-end
US20160065743A1 (en) * 2014-08-27 2016-03-03 Oki Electric Industry Co., Ltd. Stereo echo suppressing device, echo suppressing device, stereo echo suppressing method, and non transitory computer-readable recording medium storing stereo echo suppressing program
CN106576103A (en) * 2014-08-13 2017-04-19 微软技术许可有限责任公司 Reversed echo canceller
CN106603877A (en) * 2015-10-16 2017-04-26 鸿合科技有限公司 Collaborative conference voice collection method and apparatus
CN108551534A (en) * 2018-03-13 2018-09-18 维沃移动通信有限公司 The method and device of multiple terminals voice communication
CN109547655A (en) * 2018-12-30 2019-03-29 广东大仓机器人科技有限公司 A kind of method of the echo cancellation process of voice-over-net call

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20201020)