CN113535913A - Answer scoring method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113535913A
Authority
CN
China
Prior art keywords
word
awakening
preset
wake
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110614234.5A
Other languages
Chinese (zh)
Other versions
CN113535913B (en)
Inventor
梁华东
李鑫
胡铭铭
黄倩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202110614234.5A
Publication of CN113535913A
Application granted
Publication of CN113535913B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses an answer scoring method and apparatus, an electronic device, and a storage medium. The answer scoring method comprises: performing wake-up word detection on answer audio to obtain a detection result, wherein the answer audio is collected while a user answers a preset question, the detection result comprises at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question; and matching the detection result with the preset answer to obtain an answer score. This scheme can improve the efficiency and accuracy of answer scoring.

Description

Answer scoring method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to an answer scoring method and apparatus, an electronic device, and a storage medium.
Background
In real life, questionnaire scoring is common in application scenarios such as cognitive impairment screening and mental health testing. At present, such scoring is usually carried out either manually, through face-to-face interactive questions and answers, which is inefficient, or by a customer service robot that transcribes the speech into text and performs keyword matching on the text. In the latter case, the quality of the tested person's spoken language directly affects the accuracy of the speech transcription and therefore the answer score. In view of this, how to improve the efficiency and accuracy of answer scoring has become an urgent problem to be solved.
Disclosure of Invention
The application mainly solves the technical problem of providing an answer scoring method and device, electronic equipment and a storage medium, and the answer scoring efficiency and accuracy can be improved.
In order to solve the above technical problem, a first aspect of the present application provides an answer scoring method, including: performing wake-up word detection on answer audio to obtain a detection result, where the answer audio is collected while a user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question; and matching the detection result with the preset answer to obtain an answer score.
In order to solve the above technical problem, a second aspect of the present application provides an answer scoring apparatus, including a wake-up detection module and an answer scoring module. The wake-up detection module is configured to perform wake-up word detection on answer audio to obtain a detection result, where the answer audio is collected while a user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question. The answer scoring module is configured to match the detection result with the preset answer to obtain an answer score.
In order to solve the above technical problem, a third aspect of the present application provides an electronic device, including a memory and a processor coupled to each other, where the memory stores program instructions, and the processor is configured to execute the program instructions to implement the answer scoring method in the first aspect.
In order to solve the above technical problem, a fourth aspect of the present application provides a computer-readable storage medium storing program instructions executable by a processor, the program instructions being configured to implement the answer scoring method in the first aspect.
In the above scheme, wake-up word detection is performed on the answer audio to obtain a detection result, where the answer audio is collected while a user answers a preset question, the detection result includes at least one target wake-up word, the at least one target wake-up word comes from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question. On this basis, the detection result is matched with the preset answer to obtain an answer score. In the answer scoring process, on the one hand, scoring requires only the answer audio collected while the user answers the preset question, so that answer scoring is as close as possible to person-to-person interaction. On the other hand, the answer score is obtained merely by performing wake-up word detection on the answer audio and matching the at least one target wake-up word with the preset answer, without transcribing the whole answer audio, which reduces the influence of spoken language quality on the answer score as much as possible. The efficiency and accuracy of answer scoring can therefore be improved.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of the answer scoring method of the present application;
FIG. 2 is a block diagram of answer scoring based on voice wake-up;
FIG. 3 is a schematic process diagram of an embodiment of the answer scoring method of the present application;
FIG. 4 is a flowchart illustrating an embodiment of step S11 in FIG. 1;
FIG. 5 is a flowchart illustrating an embodiment of obtaining a wake-up threshold;
FIG. 6 is a schematic flow chart illustrating another embodiment of step S11 in FIG. 1;
FIG. 7 is a schematic flow chart diagram illustrating another embodiment of an answer scoring method according to the present application;
FIG. 8 is a block diagram of an embodiment of the answer scoring device of the present application;
FIG. 9 is a block diagram of an embodiment of an electronic device of the present application;
FIG. 10 is a block diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following describes in detail the embodiments of the present application with reference to the drawings attached hereto.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship. Further, the term "plurality" herein means two or more.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of the answer scoring method of the present application. It should be noted that any embodiment of the answer scoring method in the present application may be applied to any questionnaire scoring scene such as cognitive impairment screening, mental health testing, postoperative follow-up visit, and the like, and is not limited herein. Specifically, the method may include the steps of:
Step S11: Perform wake-up word detection on the answer audio to obtain a detection result.
In the embodiment of the disclosure, the answer audio is collected when the user answers the preset question, the detection result includes at least one target wake-up word, and the at least one target wake-up word is from a wake-up word set, and the wake-up word set is obtained based on a preset answer to the preset question.
In one implementation scenario, the wake-up word set may be created based on the score words of the preset answer. Taking the cognitive impairment screening scenario as an example, the preset question may be "Imagine you have plenty of money in 1 yuan, 5 yuan and 10 yuan denominations. Now you need to pay me 13 yuan; please give me 3 payment methods. I will not give you change; you need to pay me 13 yuan." The scoring point is whether the user can give three or more payment combinations, and the full score is 3 points. The preset answer may be "one 10 yuan and three 1 yuan; two 5 yuan and three 1 yuan; thirteen 1 yuan; one 5 yuan and eight 1 yuan". If the user provides three or more correct payment methods, 3 points are obtained; for two correct payment methods, 2 points; for one correct payment method, 1 point; and 0 points otherwise. On this basis, "1 yuan", "5 yuan", "10 yuan", "one", "two", "three", "eight" and "thirteen" in the preset answer can all be regarded as score words, so the score words can be used directly as preset wake-up words to create the wake-up word set ("1 yuan", "5 yuan", "10 yuan", "one", "two", "three", "eight", "thirteen"). Other cases can be handled by analogy and are not enumerated here.
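As a quick check of the preset answer above, the following minimal sketch enumerates every way to pay 13 with denominations 1, 5 and 10, and applies the scoring rule (one point per correct payment method, capped at 3). The function names are illustrative assumptions and do not appear in the application.

```python
from itertools import product

DENOMINATIONS = (10, 5, 1)  # denominations named in the preset question
TARGET = 13                 # amount to be paid


def payment_combinations(target=TARGET, denominations=DENOMINATIONS):
    """Return every combination (denomination -> count) that sums to target."""
    combos = []
    ranges = [range(target // d + 1) for d in denominations]
    for counts in product(*ranges):
        if sum(c * d for c, d in zip(counts, denominations)) == target:
            # Keep only the denominations that are actually used.
            combos.append({d: c for d, c in zip(denominations, counts) if c})
    return combos


def score_payment_answer(num_correct_methods, full_score=3):
    """One point per correct payment method, capped at the full score."""
    return max(0, min(num_correct_methods, full_score))


if __name__ == "__main__":
    for combo in payment_combinations():
        print(combo)
    print(score_payment_answer(2))
```

The enumeration yields exactly the four combinations listed in the preset answer, which is why three or more correct methods are enough for the full score.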
In another implementation scenario, in order to improve the robustness of answer scoring, the preset wake-up words may include at least one of a first wake-up word and a second wake-up word, where the first wake-up word is obtained by synonym expansion of a score word based on preset initials and finals, and the second wake-up word is obtained by dialect conversion of the first wake-up word based on a preset dialect.
In a specific implementation scenario, the preset initials and finals may be set according to actual needs. For example, data investigation and analysis show that the wake-up words commonly used in current voice wake-up interaction take the following forms, or variations and combinations of them: "small + word" (e.g., small, love), reduplicated words (e.g., question), non-reduplicated words (e.g., ding-dong), and in particular "name + name" combinations (e.g., small). In addition, since the pronunciation of a Chinese character is one syllable (tone + initial and final), yinping (tone 1) or, more generally, level tones (tone 1 yinping and tone 2 yangping) are preferred, as are zero initials (y, w) and single finals (simple or compound). On this basis, still taking the above preset question as an example, the score words can be synonymously expanded according to level tone (one, three, ten, thirteen, yuan), zero initial (one, five) and single final (one), together with the synonymous expansion of "one" by the retroflex sound into "one" and the synonymous expansion of "one" by the single final into "kuai" (the colloquial word for yuan), to obtain the following first wake-up words (1 kuai, 5 kuai, 10 kuai, 1 yuan, 5 yuan, 10 yuan, 13 yuan, 2 yuan, 8 yuan). Other preset questions can be handled by analogy and are not enumerated here.
In another specific implementation scenario, the preset dialect may be set according to actual needs, for example to the Hefei dialect, the Nanjing dialect or the Hangzhou dialect. Still taking the above preset question as an example, unit words such as "kuai" and "yuan" can be converted, using the Hefei dialect, into forms such as "coins", "children" and "hair".
In another specific implementation scenario, the score words may also be combined and expanded. Still taking the above preset question as an example, wake-up words such as "acanthopanax bark three" and "eleven plus three one" can be obtained by combination and expansion, which is not limited here. Other preset questions can be handled by analogy and are not enumerated here.
It should be noted that the wake-up word set of a preset question may be created before the user's answers are scored. That is, once the preset questions and their preset answers are available, a wake-up word set can be created for each preset question.
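The creation of a wake-up word set described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the two lookup tables (`synonym_map`, `dialect_map`) and the decision to keep the original score words in the set are hypothetical, and the sample entries are placeholders rather than real dialect data.

```python
def build_wake_word_set(score_words, synonym_map=None, dialect_map=None):
    """Build a wake-up word set for one preset question.

    score_words : score words taken from the preset answer
    synonym_map : hypothetical table, score word -> synonym expansions
                  (these expansions play the role of first wake-up words)
    dialect_map : hypothetical table, first wake-up word -> dialect forms
                  (these forms play the role of second wake-up words)
    """
    synonym_map = synonym_map or {}
    dialect_map = dialect_map or {}

    # First wake-up words: synonym expansion of each score word.
    first_words = set()
    for word in score_words:
        first_words.update(synonym_map.get(word, ()))

    # Second wake-up words: dialect conversion of each first wake-up word.
    second_words = set()
    for word in first_words:
        second_words.update(dialect_map.get(word, ()))

    # Assumption: the score words themselves stay in the set.
    return set(score_words) | first_words | second_words


if __name__ == "__main__":
    wake_words = build_wake_word_set(
        ["1 yuan", "one"],
        synonym_map={"1 yuan": ["1 kuai"]},
        dialect_map={"1 kuai": ["1 kuai (dialect form)"]},
    )
    print(sorted(wake_words))
```

Because the set depends only on the preset question and its preset answer, it can be computed once and stored before any answer audio is collected.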
In one implementation scenario, referring also to fig. 2, fig. 2 is a block diagram of answer scoring based on voice wake-up. As shown in fig. 2, after the answer audio of the user for the preset question is collected, VAD (Voice Activity Detection) endpoint processing may be performed on the answer audio to locate the speech start position and the speech end position, so that the voiced segments can be extracted from the answer audio and wake-up word detection can be performed only on the voiced segments. When facing people whose spoken expression ability has declined, such as the elderly, this greatly alleviates the influence on wake-up word detection of the many silent or noisy segments that long pauses for thinking leave in the answer audio, which helps improve the real-time performance of wake-up word detection and reduce resource consumption. For the specific process of endpoint processing, reference may be made to the technical details of VAD, which are not repeated here.
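The application does not specify a particular VAD algorithm. Purely as an illustration of what endpoint processing produces, the sketch below uses a simple frame-energy threshold, which is an assumption and much cruder than a production VAD; the function name, frame length and threshold are hypothetical.

```python
def voiced_segments(samples, frame_len=160, energy_threshold=0.01):
    """Return (start, end) sample-index pairs of voiced segments.

    A frame counts as voiced when its mean squared amplitude exceeds
    energy_threshold; consecutive voiced frames are merged into one segment.
    Trailing samples that do not fill a whole frame are ignored.
    """
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_threshold:
            if start is None:
                start = i * frame_len          # speech start position
        elif start is not None:
            segments.append((start, i * frame_len))  # speech end position
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments


if __name__ == "__main__":
    audio = [0.0] * 320 + [0.5] * 480 + [0.0] * 160
    print(voiced_segments(audio))
```

Only the returned segments would then be passed to wake-up word detection, so long silent pauses cost nothing downstream.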
In one implementation scenario, still referring to fig. 2, wake-up word detection may be performed based on a wake-up engine and the wake-up word set, where the wake-up engine may include, but is not limited to, an HMM-GMM (Hidden Markov Model-Gaussian Mixture Model) or a deep neural network (e.g., a convolutional neural network, a long short-term memory network, a depthwise separable convolutional neural network, etc.), which is not limited here. For the specific process of detecting wake-up words with an HMM-GMM or a deep neural network, reference may be made to the technical details of voice wake-up, which are not repeated here.
Step S12: Match the detection result with the preset answer to obtain an answer score.
Specifically, by detecting the wake-up word for the answer audio, the preset wake-up word that is awakened in the wake-up word set, that is, the target wake-up word, can be detected. On the basis, the detection result containing the target awakening word can be matched with the preset answer to obtain the answer score of the preset question.
In an implementation scenario, referring also to fig. 2, according to the scoring rule of the preset question, when the preset question is a keyword-detection question (e.g., a recall question, a picture-recognition question, etc.), the detection result may be directly keyword-matched with the preset answer to obtain the answer score. Taking a picture-naming question as an example, the preset question may ask the user to recognize pictures of four animals: a peacock, a zebra, a butterfly and a tiger. The preset answer may be "peacock, zebra, butterfly, tiger", and the created wake-up word set may be ("peacock", "zebra", "butterfly", "tiger"). Suppose that the user fails to recognize some of the animals in the answer audio and that wake-up word detection yields a detection result containing the target wake-up words "zebra" and "tiger". Keyword matching between the detection result and the preset answer then succeeds for "zebra" and "tiger", so the answer score of the preset question is determined to be 2 points. Other cases can be handled by analogy and are not enumerated here.
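The keyword matching in the animal-picture example can be sketched in a few lines. The function name and the one-point-per-keyword rule are assumptions taken from the example above rather than a general rule stated by the application.

```python
def keyword_match_score(detected_words, preset_answers, points_per_match=1):
    """Score a keyword-detection question.

    Each preset answer keyword found among the detected target wake-up
    words earns points_per_match; repeated detections count only once.
    """
    matched = set(detected_words) & set(preset_answers)
    return points_per_match * len(matched)


if __name__ == "__main__":
    preset = ["peacock", "zebra", "butterfly", "tiger"]
    detected = ["zebra", "tiger"]
    print(keyword_match_score(detected, preset))
```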
In an implementation scenario, still referring to fig. 2, according to the scoring rule of the preset question, when the preset question is a pattern-extraction-and-combination question (e.g., digit reading, giving change, etc.), the detection result may be processed according to the rule corresponding to the question type and then matched with the preset answer to obtain the user's answer score for the preset question. Still taking the aforementioned preset question "Imagine you have plenty of money in 1 yuan, 5 yuan and 10 yuan denominations. Now you need to pay me 13 yuan; please give me 3 payment methods. I will not give you change; you need to pay me 13 yuan." as an example, suppose the detection result includes the following target wake-up words: "five yuan", "one yuan", "five yuan", "ten yuan", "coin", "two notes", "five yuan", "coin", "thirteen" and "coin". Fuzzy pattern extraction on this detection result yields combinations of target wake-up words, and each combination can be fuzzy-matched with the preset answer; for example, the combination <two, five, coin> can be fuzzy-matched to <two 5 yuan and three 1 yuan>. Other combinations can be matched by analogy and are not described here. In addition, the user's answer may mention denominations that are unrelated to the preset answer, such as "one 7 yuan and one 5 yuan, with the 7 yuan split into two 1 yuan and one 5 yuan"; in this case, a special identifier (e.g. 4) may be used to replace the unrelated value (e.g. 7) to improve the precision of fuzzy matching.
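The matching stage for this question type can be sketched as below. The sketch assumes that pattern extraction has already grouped the target wake-up words into (count, denomination) pairs per payment method, and it uses exact set matching as a stand-in for the fuzzy matching described above. The names `PRESET_ANSWERS`, `UNKNOWN`, `normalize` and `score_combinations`, and the `"#"` placeholder, are all hypothetical.

```python
# Preset answer as a set of payment methods; each method is a set of
# (count, denomination) pairs.
PRESET_ANSWERS = {
    frozenset({(1, 10), (3, 1)}),
    frozenset({(2, 5), (3, 1)}),
    frozenset({(13, 1)}),
    frozenset({(1, 5), (8, 1)}),
}
VALID_DENOMINATIONS = {1, 5, 10}
UNKNOWN = "#"  # hypothetical special identifier for unrelated values


def normalize(combo):
    """Replace denominations unrelated to the preset answer with UNKNOWN."""
    return frozenset(
        (count, denom if denom in VALID_DENOMINATIONS else UNKNOWN)
        for count, denom in combo
    )


def score_combinations(combos, full_score=3):
    """One point per distinct payment method that matches the preset answer."""
    matched = {normalize(c) for c in combos} & PRESET_ANSWERS
    return min(len(matched), full_score)


if __name__ == "__main__":
    extracted = [
        [(2, 5), (3, 1)],          # matches a preset payment method
        [(1, 10), (3, 1)],         # matches a preset payment method
        [(1, 7), (1, 5), (1, 1)],  # contains an unrelated value, no match
    ]
    print(score_combinations(extracted))
```

Mapping every unrelated value to a single placeholder keeps such combinations from accidentally matching a preset payment method.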
In one implementation scenario, referring also to fig. 3, fig. 3 is a schematic process diagram of an embodiment of the answer scoring method of the present application. As shown in fig. 3, in addition to preset questions answered by voice, preset questions answered by touch drawing may be included. For example, the pentagon question in the MMSE (Mini-Mental State Examination) requires the user to draw two pentagons that intersect to form a quadrilateral, with each pentagon having one vertex located inside the other. In this case, the user's touch data for the preset question can be acquired and preprocessed (redundant track point removal, stroke segmentation, stroke order determination, stroke track smoothing, redundant stroke removal, etc.), and the preprocessed touch data is then scored by a discrimination engine to obtain the answer score of the preset question. It should be noted that the discrimination engine may include, but is not limited to, discrimination rules and discrimination models (e.g., a support vector machine, a logistic regression model, a naive Bayes model, a random forest model, etc.), which is not limited here.
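Two of the preprocessing steps named above, redundant track point removal and stroke track smoothing, can be illustrated as follows. The application does not say how these steps are implemented; the distance threshold and the moving-average window used here are assumptions, as are the function names.

```python
def remove_redundant_points(points, min_dist=1.0):
    """Drop track points closer than min_dist to the last kept point."""
    kept = []
    for x, y in points:
        if not kept:
            kept.append((x, y))
            continue
        last_x, last_y = kept[-1]
        if ((x - last_x) ** 2 + (y - last_y) ** 2) ** 0.5 >= min_dist:
            kept.append((x, y))
    return kept


def smooth_stroke(points, window=3):
    """Smooth a stroke with a centered moving average over its points."""
    half = window // 2
    smoothed = []
    for i in range(len(points)):
        neighbours = points[max(0, i - half):i + half + 1]
        mean_x = sum(p[0] for p in neighbours) / len(neighbours)
        mean_y = sum(p[1] for p in neighbours) / len(neighbours)
        smoothed.append((mean_x, mean_y))
    return smoothed


if __name__ == "__main__":
    raw = [(0, 0), (0, 0), (0.5, 0), (2, 0), (4, 1), (6, 0)]
    print(smooth_stroke(remove_redundant_points(raw)))
```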
In an implementation scenario, still referring to fig. 3, after the user has answered all the preset questions, the answer scores of all the preset questions can be aggregated to obtain a comprehensive score, and this score can be used for auxiliary analysis in application scenarios such as cognitive impairment screening, mental health testing and postoperative follow-up. Taking cognitive impairment screening as an example, by designing preset questions that test memory, language, visuospatial ability, executive ability, calculation, comprehension and judgment, and obtaining the answer scores of the preset questions in these different aspects, it is possible to assist in analyzing whether the user's cognitive function is impaired; when the impairment affects the user's daily or social abilities, the user may be considered to have cognitive impairment. It should be noted that, in the cognitive impairment screening scenario, the preset questions may be derived from the MMSE, the MoCA-B (Montreal Cognitive Assessment-Basic), etc., which is not limited here. Other scenarios can be handled by analogy and are not enumerated here.
As in the scheme summarized above, the answer score is obtained by performing wake-up word detection on the answer audio collected while the user answers the preset question, and by matching the resulting target wake-up words, which come from a wake-up word set built from the preset answer, with the preset answer. Only the answer audio needs to be collected, which keeps scoring close to person-to-person interaction, and the whole answer audio does not need to be transcribed, which reduces the influence of spoken language quality on the score. Both the efficiency and the accuracy of answer scoring can thus be improved.
Referring to fig. 4, fig. 4 is a schematic flowchart illustrating an embodiment of step S11 in fig. 1. Specifically, the method may include the steps of:
Step S41: Perform wake-up word detection on the answer audio to obtain the wake-up stimulus of at least one candidate wake-up word.
In the embodiment of the present disclosure, the candidate wake-up words come from the wake-up word set, and the wake-up word set includes a plurality of preset wake-up words and the wake-up threshold corresponding to each preset wake-up word, where each wake-up threshold is obtained by testing with sample audio related to the corresponding preset wake-up word. It should be noted that the sample audio may be collected before answer scoring, so that the wake-up word set of each preset question is available before answer scoring.
In one implementation scenario, the sample audio related to a preset wake-up word may include first audio, second audio and third audio, where the first audio contains the preset wake-up word, the second audio contains a first reference word of the preset wake-up word, and the third audio contains a second reference word of the preset wake-up word. The first reference word has the same meaning as the preset wake-up word but a different pronunciation, and the second reference word has a similar pronunciation to the preset wake-up word but a different meaning. In this way, the wake-up threshold of the preset wake-up word can be determined jointly from the first audio, the second audio and the third audio, which improves the accuracy of the wake-up threshold and therefore the accuracy of wake-up word detection.
In a specific implementation scenario, after the wake-up word set of a preset question is obtained, for each preset wake-up word, a first reference word (same meaning, different pronunciation) and a second reference word (similar pronunciation, different meaning) may be obtained, and first audio containing the preset wake-up word, second audio containing the first reference word and third audio containing the second reference word may be collected.
In another specific implementation scenario, still taking the preset question "Imagine you have plenty of money in 1 yuan, 5 yuan and 10 yuan denominations. Now you need to pay me 13 yuan; please give me 3 payment methods. I will not give you change; you need to pay me 13 yuan." as an example, the process of acquiring the wake-up word set may refer to the related description in the foregoing embodiment and is not repeated here. Taking the preset wake-up word "coin" as an example, a first reference word with the same meaning but a different pronunciation (a colloquial synonym, rendered "steel coin" in the sentences below) and a second reference word with a similar pronunciation but a different meaning ("hard pen") may be obtained. First audio containing the preset wake-up word "coin" may then be collected (e.g., "please give me my change in coins, not paper money", "this is a new one-yuan coin"), along with second audio containing the first reference word (e.g., "lend me a steel coin", "I have no steel coins on hand, so I will give you paper money as change") and third audio containing the second reference word (e.g., "this hard-pen calligraphy is well written", "I have not written hard-pen characters for a long time"). Other scenarios can be handled by analogy and are not enumerated here.
Referring to fig. 5, fig. 5 is a flowchart illustrating an embodiment of obtaining the wake-up threshold. The method specifically comprises the following steps:
Step S51: Perform wake-up tests using the first audio, the second audio and the third audio respectively, to obtain a first data distribution of the preset wake-up word, a second data distribution of the first reference word and a third data distribution of the second reference word.
Specifically, the first data distribution includes a first initial threshold and a first volume mean, the second data distribution includes a second initial threshold and a second volume mean, and the third data distribution includes a third initial threshold and a third volume mean. For convenience of description, the first, second and third initial thresholds may be denoted $S_0$, $S_1$ and $S_2$, and the first, second and third volume means may be denoted $v_0$, $v_1$ and $v_2$. The first data distribution of the preset wake-up word can then be represented as $(S_0, v_0)$, the second data distribution of the first reference word as $(S_1, v_1)$, and the third data distribution of the second reference word as $(S_2, v_2)$.
In one implementation scenario, the mean volume amplitude of the voiced segments in the test audio may be computed to obtain the volume mean. The voiced segments in the test audio can be obtained by the VAD endpoint processing described above, which is not repeated here. It should be noted that, when the wake-up word is the preset wake-up word, the test audio is the first audio and the volume mean is the first volume mean $v_0$; when the wake-up word is the first reference word, the test audio is the second audio and the volume mean is the second volume mean $v_1$; and when the wake-up word is the second reference word, the test audio is the third audio and the volume mean is the third volume mean $v_2$.
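The volume mean can be sketched as the average absolute amplitude over the voiced segments only. The function name and the representation of segments as (start, end) sample-index pairs are assumptions made for illustration.

```python
def volume_mean(samples, segments):
    """Mean absolute amplitude over the voiced segments of one test audio.

    samples  : sequence of amplitude values
    segments : (start, end) sample-index pairs of voiced segments,
               e.g. as produced by VAD endpoint processing
    """
    total = 0.0
    count = 0
    for start, end in segments:
        for value in samples[start:end]:
            total += abs(value)
            count += 1
    # With no voiced samples there is nothing to average.
    return total / count if count else 0.0


if __name__ == "__main__":
    audio = [0.0, 0.0, 0.5, -0.5, 1.0, 0.0]
    print(volume_mean(audio, [(2, 5)]))
```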
In one implementation scenario, a wake-up test may be performed on the wake-up word using the test audio, the wake-up success rate corresponding to each of several different test wake-up thresholds may be counted, and a test wake-up threshold whose wake-up success rate is higher than a preset threshold may be selected as the initial threshold. Specifically, after the wake-up test is performed on the wake-up word using the test audio, a wake-up excitation of the wake-up word is obtained. A higher wake-up excitation indicates that the test audio is more likely to contain the wake-up word; conversely, a lower wake-up excitation indicates that the test audio is less likely to contain the wake-up word. In this case, different test wake-up thresholds (e.g., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, etc.) may be set, and the wake-up success rate of the wake-up word under each test wake-up threshold is counted. For example, when the wake-up excitation is greater than the test wake-up threshold, it may be determined that the test audio contains the wake-up word, and if the test audio actually contains the wake-up word, the wake-up is considered successful. On this basis, the wake-up success rates corresponding to the different test wake-up thresholds can be obtained through statistics, and a test wake-up threshold whose wake-up success rate is higher than the preset threshold is selected as the initial threshold (e.g., if the wake-up success rate corresponding to the test wake-up threshold 0.6 is higher than the preset threshold, 0.6 is used as the initial threshold).
It should be noted that when the wake-up word is the preset wake-up word, the test audio is the first audio and the initial threshold is the first initial threshold S0; when the wake-up word is the first reference word, the test audio is the second audio and the initial threshold is the second initial threshold S1; and when the wake-up word is the second reference word, the test audio is the third audio and the initial threshold is the third initial threshold S2.
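The threshold sweep described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the success rate is computed over clips that actually contain the wake-up word, and when several test thresholds qualify the highest one is returned. The text does not specify either of the last two choices.

```python
from typing import Optional, Sequence


def volume_mean(voiced_samples: Sequence[float]) -> float:
    """Mean absolute amplitude over voiced samples (VAD is assumed to be done upstream)."""
    if not voiced_samples:
        raise ValueError("no voiced samples")
    return sum(abs(s) for s in voiced_samples) / len(voiced_samples)


def pick_initial_threshold(
    excitations: Sequence[float],
    contains_word: Sequence[bool],
    test_thresholds: Sequence[float],
    min_success_rate: float,
) -> Optional[float]:
    """Sweep the test thresholds and return one whose success rate exceeds min_success_rate.

    Assumption: the success rate is measured on clips that really contain the word,
    and the highest qualifying threshold is chosen.
    """
    positives = [e for e, c in zip(excitations, contains_word) if c]
    if not positives:
        raise ValueError("no clip contains the wake-up word")
    qualifying = []
    for t in test_thresholds:
        # A wake-up succeeds when the excitation is strictly greater than the threshold.
        rate = sum(e > t for e in positives) / len(positives)
        if rate > min_success_rate:
            qualifying.append(t)
    return max(qualifying) if qualifying else None


# Hypothetical excitations for four clips that all contain the wake-up word.
demo_excitations = [0.95, 0.85, 0.65, 0.3]
demo_labels = [True, True, True, True]
demo_thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
initial = pick_initial_threshold(demo_excitations, demo_labels, demo_thresholds, 0.7)
```

Choosing the highest qualifying threshold is one plausible tie-breaking rule that keeps false wake-ups low; other rules would fit the description equally well.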
Step S52: a first adjustment weight is obtained based on a difference between the first data distribution and the second data distribution, and a second adjustment weight is obtained based on a difference between the first data distribution and the third data distribution.
In one implementation scenario, the difference between the first data distribution and the second data distribution may be measured to obtain a distribution gap d1, a threshold difference S1-S0 between the second initial threshold and the first initial threshold may be obtained, and a volume difference v1-v0 between the second volume mean and the first volume mean may be obtained, so that the first adjustment weight f1 is obtained based on the distribution gap d1, the threshold difference S1-S0, and the volume difference v1-v0. The first adjustment weight f1 is negatively correlated with the distribution gap d1, positively correlated with the threshold difference S1-S0, and negatively correlated with the volume difference v1-v0. Specifically, the first adjustment weight f1 can be expressed as:
f1=((S1-S0)/d1)/((v1-v0)/v0)……(1)
In one implementation scenario, the difference between the first data distribution and the third data distribution may be measured to obtain a distribution gap d2, a threshold difference S2-S0 between the third initial threshold and the first initial threshold may be obtained, and a volume difference v2-v0 between the third volume mean and the first volume mean may be obtained, so that the second adjustment weight f2 is obtained based on the distribution gap d2, the threshold difference S2-S0, and the volume difference v2-v0. The second adjustment weight f2 is negatively correlated with the distribution gap d2, positively correlated with the threshold difference S2-S0, and negatively correlated with the volume difference v2-v0. Specifically, the second adjustment weight f2 can be expressed as:
f2=((S2-S0)/d2)/((v2-v0)/v0)……(2)
In addition, the distribution gap can be obtained by measuring the JS divergence. Taking the distribution gap between the first data distribution and the second data distribution as an example, the JS divergence between the first data distribution and the second data distribution can be expressed as:
JS(Pg1||Pg2)=(1/2)*KL(Pg1||(Pg1+Pg2)/2)+(1/2)*KL(Pg2||(Pg1+Pg2)/2)……(3)
In the above formula (3), KL represents the KL divergence function, Pg1 represents the distribution function of the first data distribution, and Pg2 represents the distribution function of the second data distribution. Note that the KL divergence between two distributions P and Q can be expressed as:
KL(P||Q)=Σx P(x)*log(P(x)/Q(x))……(4)
The distribution gap between the first data distribution and the third data distribution can be measured analogously and is not described again here. In the above manner, the distribution gap between the data distributions is calculated, and the adjustment weight is set to be negatively correlated with the distribution gap, positively correlated with the threshold difference, and negatively correlated with the volume difference. Thus, when the distribution gap between the preset wake-up word and its reference word is small, or the volume difference is small, or the threshold difference is large, the adjustment weight is increased so as to distinguish the preset wake-up word from its reference word, which helps improve the wake-up success rate.
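As a rough illustration of step S52, the sketch below computes a JS divergence and then an adjustment weight in the form of formulas (1) and (2). It assumes each distribution function is available as a discrete histogram over the same bins, which the text does not state; the function names and the numeric values are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Discrete KL divergence; terms with p(x) == 0 contribute nothing.

    Assumes q(x) > 0 wherever p(x) > 0, which always holds when q is the
    mixture used inside js_divergence below.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """JS divergence as the average KL divergence to the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def adjustment_weight(s_ref: float, s0: float, v_ref: float, v0: float, d: float) -> float:
    """f = ((S_ref - S0) / d) / ((v_ref - v0) / v0), following formulas (1) and (2)."""
    if d == 0 or v_ref == v0 or v0 == 0:
        raise ValueError("the weight is undefined for d == 0, v_ref == v0 or v0 == 0")
    return ((s_ref - s0) / d) / ((v_ref - v0) / v0)


# Hypothetical histograms of the preset wake-up word and its first reference word.
p_g1 = [0.5, 0.5, 0.0]
p_g2 = [0.0, 0.5, 0.5]
d1 = js_divergence(p_g1, p_g2)
f1 = adjustment_weight(s_ref=700.0, s0=600.0, v_ref=0.75, v0=0.5, d=d1)
```

The explicit guard reflects that the formulas are undefined when the two distributions coincide or the two volume means are equal; how the original method handles those cases is not described.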
Step S53: Adjust the first initial threshold using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word.
Specifically, a first adjustment proportion may be determined using the first adjustment weight and the second adjustment weight, where both the first adjustment weight and the second adjustment weight are positively correlated with the first adjustment proportion, and the sum of the first initial threshold and the product of the adjustment step size and the first adjustment proportion is used as the wake-up threshold. For convenience of description, the wake-up threshold may be denoted as Sm, and the wake-up threshold Sm can be expressed as:
Sm=S0+(2*(f1*f2)/(f1+f2))*50……(5)
In the above formula (5), 2*(f1*f2)/(f1+f2) represents the first adjustment proportion and 50 represents the adjustment step size. The adjustment step size may also be set to 30, 40, etc. according to actual needs, and is not limited herein. In the above manner, the first adjustment proportion is determined using the first adjustment weight and the second adjustment weight, both of which are positively correlated with the first adjustment proportion, and the sum of the first initial threshold and the product of the adjustment step size and the first adjustment proportion is used as the wake-up threshold. That is, the wake-up threshold of the preset wake-up word is adjusted upward from the first initial threshold by combining the first adjustment weight of the first reference word and the second adjustment weight of the second reference word, which helps improve the accuracy of wake-up word detection.
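Formula (5) can be written out as a short sketch. The function name and the sample values are hypothetical; the first adjustment proportion is the harmonic mean of f1 and f2, so it is dominated by the smaller of the two weights.

```python
def wake_up_threshold(s0: float, f1: float, f2: float, step: float = 50.0) -> float:
    """Sm = S0 + (2 * f1 * f2 / (f1 + f2)) * step, following formula (5)."""
    if f1 + f2 == 0:
        raise ValueError("the first adjustment proportion is undefined when f1 + f2 == 0")
    proportion = 2 * (f1 * f2) / (f1 + f2)  # harmonic mean of the two weights
    return s0 + proportion * step


# Hypothetical values: initial threshold 600 and adjustment weights 1.0 and 3.0.
s_m = wake_up_threshold(600.0, 1.0, 3.0)
```

With these values the proportion is 1.5, so the threshold moves from 600 to 675.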
In one implementation scenario, as described above, the preset wake-up word includes at least one of a first wake-up word and a second wake-up word, the first wake-up word is obtained by synonymy expanding the score word based on the preset initial consonant, and the second wake-up word is obtained by dialect converting the first wake-up word based on the preset dialect. On this basis, the wake-up threshold of the first wake-up word is higher than the wake-up threshold of the second wake-up word. Assigning a higher wake-up threshold to the first wake-up word helps reduce false wake-ups, and assigning a lower wake-up threshold to the second wake-up word helps improve the wake-up success rate.
In a specific implementation scenario, for both the first wake-up word (e.g., unary) and the second wake-up word (e.g., coin), the corresponding wake-up thresholds Sm can be obtained through the foregoing steps. On this basis, for the first wake-up word (e.g., unary), a preset up value (e.g., 70) may further be added to its wake-up threshold Sm (e.g., 780) to update the wake-up threshold (e.g., to 850); and for the second wake-up word (e.g., coin), a preset down value (e.g., 70) may further be subtracted from its wake-up threshold Sm (e.g., 670) to update the wake-up threshold (e.g., to 600). Other cases can be deduced by analogy and are not enumerated here.
In the above manner, wake-up tests are performed using the first audio, the second audio, and the third audio, respectively, to obtain the first data distribution of the preset wake-up word, the second data distribution of the first reference word, and the third data distribution of the second reference word; the first adjustment weight is obtained based on the difference between the first data distribution and the second data distribution, and the second adjustment weight is obtained based on the difference between the first data distribution and the third data distribution; on this basis, the first initial threshold is adjusted using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word.
In one implementation scenario, take the preset question "Imagine you have plenty of money in denominations of 1 yuan, 5 yuan, and 10 yuan. Now you need to pay me 13 yuan; please give me 3 ways to pay. I will not give you change, so you need to pay me 13 yuan" as an example. When the answer audio is "2 5-yuan and 3 1-yuan", the wake-up excitation of the candidate wake-up word "2", the wake-up excitation of the candidate wake-up word "5 yuan", the wake-up excitation of the candidate wake-up word "3", and the wake-up excitation of the candidate wake-up word "1 yuan" can be obtained. Other cases can be deduced by analogy and are not enumerated here.
Step S42: For each candidate wake-up word, determine whether to take the candidate wake-up word as a target wake-up word based on the magnitude relationship between its wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word.
Specifically, for each candidate wake-up word, if its wake-up excitation is greater than the wake-up threshold corresponding to the candidate wake-up word, the candidate wake-up word may be taken as a target wake-up word; otherwise, if its wake-up excitation is not greater than the corresponding wake-up threshold, the candidate wake-up word is not taken as a target wake-up word. Still taking the aforementioned preset question "Imagine you have plenty of money in denominations of 1 yuan, 5 yuan, and 10 yuan. Now you need to pay me 13 yuan; please give me 3 ways to pay. I will not give you change, so you need to pay me 13 yuan" as an example, for the candidate wake-up word "2", if its wake-up excitation is greater than its corresponding wake-up threshold, the candidate wake-up word "2" may be added to the detection result as a target wake-up word; otherwise, the candidate wake-up word "2" is not taken as a target wake-up word. The candidate wake-up words "5 yuan", "3", and "1 yuan" are handled in the same way and are not illustrated one by one.
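The comparison in step S42 amounts to a per-word filter, sketched below. The function name and every excitation and threshold value are hypothetical; "excitation" here is the wake-up excitation (stimulus) produced by wake-up word detection.

```python
from typing import Dict, List


def select_target_words(
    excitations: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Keep a candidate only when its excitation is strictly greater than its own threshold."""
    return [word for word, e in excitations.items() if e > thresholds[word]]


# Hypothetical excitations for the answer "2 5-yuan and 3 1-yuan".
demo_excitations = {"2": 910.0, "5 yuan": 880.0, "3": 640.0, "1 yuan": 700.0}
demo_thresholds = {"2": 850.0, "5 yuan": 850.0, "3": 600.0, "1 yuan": 700.0}
targets = select_target_words(demo_excitations, demo_thresholds)
```

In this made-up run "1 yuan" is rejected because its excitation equals, but does not exceed, its threshold.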
In the above scheme, wake-up word detection is performed on the answer audio to obtain the wake-up excitation of at least one candidate wake-up word, where the at least one candidate wake-up word is from the wake-up word set, the wake-up word set includes a plurality of preset wake-up words and the wake-up thresholds corresponding to the preset wake-up words, and the wake-up thresholds are obtained by testing with sample audio related to the preset wake-up words. On this basis, for each candidate wake-up word, whether to take the candidate wake-up word as a target wake-up word is determined based on the magnitude relationship between its wake-up excitation and the corresponding wake-up threshold. Since the wake-up threshold corresponding to each preset wake-up word is obtained by testing with sample audio related to that preset wake-up word, determining whether to wake up in combination with the wake-up threshold can improve the wake-up success rate and reduce the false wake-up rate.
Referring to fig. 6, fig. 6 is a schematic flowchart illustrating another embodiment of step S11 in fig. 1. In the embodiment of the disclosure, in the process of performing answer scoring, the wake-up threshold value can be adaptively adjusted according to the actual environment. Specifically, the method may include the steps of:
Step S61: Perform wake-up word detection on the answer audio to obtain the wake-up excitation of at least one candidate wake-up word.
In this embodiment of the disclosure, at least one candidate wake-up word is from a wake-up word set, where the wake-up word set includes a plurality of preset wake-up words and wake-up thresholds corresponding to the preset wake-up words, and the wake-up thresholds are obtained by using a sample audio test related to the preset wake-up words.
Step S62: Obtain the measured volume average of the user in the answering environment.
In one implementation scenario, the measured volume average may be obtained before the user answers the preset question. For example, the user may be prompted that answer scoring is about to begin and asked to input a segment of environmental test audio via a microphone, so that VAD endpoint processing may be performed on the environmental test audio and the mean volume amplitude of its voiced segments may be used as the measured volume average.
In another implementation scenario, as described above, if there are multiple preset questions in the whole answer scoring process, then for each preset question, the measured volume average may be obtained through statistics based on the preset questions the user has already answered. For example, when the user is about to answer the second preset question, VAD endpoint processing may be performed on the user's answer audio for the first preset question, and the mean volume amplitude of the voiced segments in that answer audio is used as the measured volume average; when the user is about to answer the third preset question, VAD endpoint processing may be performed on the user's answer audio for the first and second preset questions, and the mean volume amplitude of the voiced segments therein is used as the measured volume average, and so on, which is not enumerated here.
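One way to picture the second scenario is to pool the voiced samples of all answers given so far and average their absolute amplitudes. This is only a sketch: pooling at the sample level is an assumption (the text does not say how answers are combined), VAD is assumed to have already extracted the voiced samples, and the function name is hypothetical.

```python
from typing import List, Sequence


def measured_volume_average(voiced_per_answer: Sequence[Sequence[float]]) -> float:
    """Mean absolute amplitude over the voiced samples of all answers given so far."""
    pooled: List[float] = [abs(s) for answer in voiced_per_answer for s in answer]
    if not pooled:
        raise ValueError("no voiced samples available yet")
    return sum(pooled) / len(pooled)


# Hypothetical voiced samples from the answers to the first and second preset questions.
answers_so_far = [[0.5, -0.5], [1.0, -1.0]]
v3_before_question_3 = measured_volume_average(answers_so_far)
```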
Step S63: a second adjustment ratio is determined based on the measured volume average.
In the embodiment of the present disclosure, the second adjustment ratio is positively correlated with the measured volume average; that is, the larger the measured volume average, the larger the second adjustment ratio, and conversely, the smaller the measured volume average, the smaller the second adjustment ratio. For convenience of description, the second adjustment ratio may be denoted as λ, and the second adjustment ratio λ may be specifically expressed as:
λ=(1+((v3-v0)/v0)*50)……(6)
In the above formula (6), v3 represents the measured volume average, and v0 represents the first volume mean of the preset wake-up word, which is obtained based on the first audio containing the preset wake-up word; reference may be made to the related description in the foregoing embodiments, which is not repeated here. In addition, 50 denotes the adjustment step size, which may also be set to 30, 40, etc. according to actual application requirements, and is not limited herein.
In an implementation scenario, in a case that the measured volume average is obtained before the user answers the preset question, the second adjustment proportion of each preset wake-up word in each wake-up word set may be determined based on the measured volume average. That is, before the preset question is answered, the second adjustment proportion of the wake-up threshold corresponding to each preset wake-up word in the wake-up word set corresponding to each preset question may be obtained.
In another implementation scenario, in a case that the measured volume average value is obtained before each preset question based on already answered preset questions, for each preset question, the second adjustment proportion of each preset wake-up word in the wake-up word set corresponding to the preset question may be determined based on the measured volume average value obtained for the preset question. That is, every time a preset question is answered, a second adjustment proportion of each preset awakening word in the awakening word set corresponding to the preset question is determined based on an actually measured volume average value obtained by statistics of the answered preset question.
Step S64: Adjust the wake-up threshold using the second adjustment ratio.
In an implementation scenario, when the measured volume average is obtained before the user answers the preset question, the corresponding wake-up threshold may be adjusted using the second adjustment ratio for each preset wake-up word. As described in the foregoing disclosed embodiments, the wake-up threshold may be denoted as Sm. For ease of distinction, the adjusted wake-up threshold may be denoted as Sf, and the adjusted wake-up threshold Sf can be expressed as:
Sf=Sm*λ……(7)
In another implementation scenario, when the measured volume average is obtained before each preset question through statistics based on the already answered preset questions, then for each preset question, the second adjustment ratio of each preset wake-up word in the wake-up word set corresponding to that preset question may be used to adjust the corresponding wake-up threshold, and the specific calculation may be as shown in the above formula (7).
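Formulas (6) and (7) can be combined into a small sketch. The function names and the sample values are hypothetical, and the scale of the result depends on the volume units, which the text does not specify.

```python
def second_adjustment_ratio(v3: float, v0: float, step: float = 50.0) -> float:
    """lambda = 1 + ((v3 - v0) / v0) * step, following formula (6)."""
    if v0 == 0:
        raise ValueError("v0 must be non-zero")
    return 1 + ((v3 - v0) / v0) * step


def adapted_threshold(s_m: float, v3: float, v0: float, step: float = 50.0) -> float:
    """Sf = Sm * lambda, following formula (7)."""
    return s_m * second_adjustment_ratio(v3, v0, step)


# When the measured volume equals the test-time volume mean, the threshold is unchanged.
unchanged = adapted_threshold(700.0, v3=1.0, v0=1.0)
# A louder environment (hypothetical values) raises the ratio and hence the threshold.
louder_ratio = second_adjustment_ratio(v3=1.5, v0=1.0)
```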
Step S65: For each candidate wake-up word, determine whether to take the candidate wake-up word as a target wake-up word based on the magnitude relationship between its wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word.
Reference may be made specifically to the foregoing disclosure embodiments, which are not described herein again.
In the above scheme, before determining whether to take a candidate wake-up word as a target wake-up word based on its corresponding wake-up threshold, the measured volume average of the user in the answering environment is first obtained, the second adjustment ratio is determined based on the measured volume average (the second adjustment ratio being positively correlated with the measured volume average), and the wake-up threshold is then adjusted using the second adjustment ratio. In this way, the wake-up threshold can be adaptively adjusted during answer scoring, which improves the accuracy of answer scoring.
Referring to fig. 7, fig. 7 is a schematic flowchart illustrating an answer scoring method according to another embodiment of the present application.
Specifically, the method may include the steps of:
Step S71: Perform wake-up word detection on the answer audio to obtain a detection result.
In the embodiment of the present disclosure, the answer audio is collected when the user answers a preset question, the detection result includes at least one target wake-up word, and the at least one target wake-up word is from a wake-up word set, the wake-up word set is obtained based on a preset answer to the preset question, the wake-up word set includes a plurality of preset wake-up words and wake-up thresholds corresponding to the preset wake-up words, and the wake-up thresholds are obtained by using a sample audio test related to the preset wake-up words.
Step S72: Correct the wake-up threshold based on the difference between the detection result and the scoring result.
In the embodiment of the disclosure, the scoring result includes at least one actual wake-up word, and the at least one actual wake-up word is included in the answer audio and is from the wake-up word set. That is, the actual wake-up word is the preset wake-up word actually included in the answer audio.
Specifically, referring to fig. 3, by comparing the detection result with the scoring result, the preset wake-up words that should have been woken up but were not, and the preset wake-up words that should not have been woken up but were, can be determined. On this basis, the wake-up thresholds corresponding to these two kinds of preset wake-up words can be corrected. Specifically, the wake-up threshold of a preset wake-up word that should have been woken up but was not can be lowered, and the wake-up threshold of a preset wake-up word that should not have been woken up but was can be raised, so as to improve the wake-up success rate and reduce the false wake-up rate at the same time.
In one implementation scenario, take the preset question "Imagine you have plenty of money in denominations of 1 yuan, 5 yuan, and 10 yuan. Now you need to pay me 13 yuan; please give me 3 ways to pay. I will not give you change, so you need to pay me 13 yuan" as an example. The corresponding wake-up word set can refer to the foregoing disclosed embodiments and is not described again here. When the answer audio is "2 5-yuan and 3 1-yuan", the preset wake-up words that should be woken up include "2", "5 yuan", "3", and "1 yuan". If the detection result includes the 4 target wake-up words "2", "5 yuan", "3", and "10 yuan", it can be determined that the preset wake-up word that should have been woken up but was not is "1 yuan", and the preset wake-up word that should not have been woken up but was is "10 yuan". For the preset wake-up word "1 yuan", the corresponding wake-up threshold may be appropriately lowered, and for the preset wake-up word "10 yuan", the corresponding wake-up threshold may be appropriately raised. Other cases can be deduced by analogy and are not enumerated here.
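The correction in step S72 can be sketched with two set differences. The fixed correction amount `delta`, the function name, and all threshold values are hypothetical; the text only says the thresholds are adjusted "appropriately".

```python
from typing import Dict, Iterable


def correct_thresholds(
    thresholds: Dict[str, float],
    detected: Iterable[str],
    actual: Iterable[str],
    delta: float,
) -> Dict[str, float]:
    """Lower thresholds of missed words and raise thresholds of falsely woken words."""
    detected_set, actual_set = set(detected), set(actual)
    updated = dict(thresholds)
    for word in actual_set - detected_set:   # should have been woken up but was not
        updated[word] -= delta
    for word in detected_set - actual_set:   # should not have been woken up but was
        updated[word] += delta
    return updated


# Hypothetical thresholds for the words in the example above.
demo_thresholds = {"2": 800.0, "5 yuan": 800.0, "3": 800.0, "1 yuan": 800.0, "10 yuan": 800.0}
corrected = correct_thresholds(
    demo_thresholds,
    detected=["2", "5 yuan", "3", "10 yuan"],
    actual=["2", "5 yuan", "3", "1 yuan"],
    delta=20.0,
)
```

Here the threshold of "1 yuan" drops to 780 and that of "10 yuan" rises to 820, while correctly detected words keep their thresholds.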
In another implementation scenario, the embodiments of the present disclosure may be specifically executed in the early stage of application, so that in the early stage of application, if the wake-up threshold is still not accurate enough, the correction is performed in time. Alternatively, embodiments of the present disclosure may specifically also be performed in a test phase to perform testing ahead of the application, so that corrections can be made in time if the wake-up threshold is still not accurate enough.
According to the scheme, the detection result is obtained by detecting the awakening words of the answer audio, the awakening threshold value is corrected based on the difference between the detection result and the scoring result, the accuracy of the awakening threshold value can be improved, the awakening success rate can be improved at the same time, and the false awakening rate is reduced.
Referring to fig. 8, fig. 8 is a block diagram of an embodiment of the answer scoring device 80 of the present application. The answer scoring device 80 includes an awakening detection module 81 and an answer scoring module 82, wherein the awakening detection module 81 is configured to perform awakening word detection on the answer audio to obtain a detection result; the answer audio is acquired when a user answers a preset question, the detection result comprises at least one target awakening word, the at least one target awakening word is from an awakening word set, and the awakening word set is obtained based on a preset answer of the preset question; the answer scoring module 82 is configured to match the detection result with a preset answer to obtain an answer score.
In the above scheme, during answer scoring, on the one hand, answer scoring can be achieved merely by collecting the answer audio of the user answering the preset question, so that answer scoring is as close as possible to a human interaction form; on the other hand, at least one target wake-up word in the answer audio can be obtained merely through wake-up word detection, and the answer score is obtained by matching the at least one target wake-up word against the preset answer, without speech transcription of the entire answer audio. This helps reduce the influence of spoken language quality on the answer score as far as possible, and thus improves the efficiency and accuracy of answer scoring.
In some disclosed embodiments, the wake-up detection module 81 includes a wake-up word detection sub-module configured to perform wake-up word detection on the answer audio to obtain the wake-up excitation of at least one candidate wake-up word, where the at least one candidate wake-up word is from the wake-up word set, the wake-up word set includes a plurality of preset wake-up words and the wake-up thresholds corresponding to the preset wake-up words, and the wake-up thresholds are obtained by testing with sample audio related to the preset wake-up words; the wake-up detection module 81 further includes a target wake-up word obtaining sub-module configured to determine, for each candidate wake-up word, whether to take the candidate wake-up word as a target wake-up word based on the magnitude relationship between its wake-up excitation and the wake-up threshold corresponding to the candidate wake-up word.
Therefore, since the wake-up threshold corresponding to each preset wake-up word is obtained by testing with sample audio related to that preset wake-up word, determining whether to wake up in combination with the wake-up threshold can improve the wake-up success rate and reduce the false wake-up rate.
In some disclosed embodiments, the sample audio associated with the preset wake up word includes a first audio, a second audio, and a third audio; the first audio comprises a preset awakening word, the second audio comprises a first reference word of the preset awakening word, the third audio comprises a second reference word of the preset awakening word, the first reference word and the preset awakening word are synonymous with different tones, and the second reference word and the preset awakening word are synonymous with different tones.
Therefore, the sample audio related to the preset awakening word is set to comprise the first audio, the second audio and the third audio, the first audio comprises the preset awakening word, the second audio comprises the first reference word, the third audio comprises the second reference word, the first reference word and the preset awakening word are synonymous with different tones, the second reference word and the preset awakening word are synonymous with different tones, namely, the awakening threshold of the preset awakening word can be determined jointly by combining the first audio, the second audio and the third audio, the accuracy of the awakening threshold can be improved, and the accuracy of detection of the awakening word is improved.
In some disclosed embodiments, the answer scoring device 80 includes a threshold obtaining module. The threshold obtaining module includes a wake-up test sub-module configured to perform wake-up tests using the first audio, the second audio, and the third audio, respectively, to obtain a first data distribution of the preset wake-up word, a second data distribution of the first reference word, and a third data distribution of the second reference word, where the first data distribution includes a first initial threshold and a first volume mean, the second data distribution includes a second initial threshold and a second volume mean, and the third data distribution includes a third initial threshold and a third volume mean. The threshold obtaining module includes a weight obtaining sub-module configured to obtain a first adjustment weight based on the difference between the first data distribution and the second data distribution, and to obtain a second adjustment weight based on the difference between the first data distribution and the third data distribution. The threshold obtaining module includes an initial adjustment sub-module configured to adjust the first initial threshold using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word.
Therefore, wake-up tests are performed using the first audio, the second audio, and the third audio, respectively, to obtain the first data distribution of the preset wake-up word, the second data distribution of the first reference word, and the third data distribution of the second reference word; the first adjustment weight is obtained based on the difference between the first data distribution and the second data distribution, and the second adjustment weight is obtained based on the difference between the first data distribution and the third data distribution; on this basis, the first initial threshold is adjusted using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold of the preset wake-up word.
In some disclosed embodiments, the wake-up test sub-module includes a mean statistics unit configured to compute the mean volume amplitude of the voiced segments in the test audio to obtain the volume mean; the wake-up test sub-module includes a threshold test unit configured to perform a wake-up test on the wake-up word using the test audio, count the wake-up success rates respectively corresponding to different test wake-up thresholds, and select a test wake-up threshold whose wake-up success rate is higher than a preset threshold as the initial threshold. When the wake-up word is the preset wake-up word, the test audio is the first audio, the volume mean is the first volume mean, and the initial threshold is the first initial threshold; when the wake-up word is the first reference word, the test audio is the second audio, the volume mean is the second volume mean, and the initial threshold is the second initial threshold; when the wake-up word is the second reference word, the test audio is the third audio, the volume mean is the third volume mean, and the initial threshold is the third initial threshold.
Therefore, the volume mean value is obtained by counting the volume amplitude mean value of the sound section in the test audio, the test audio is used for carrying out awakening test on the awakening word, the awakening success rates corresponding to different test awakening threshold values are counted and selected, and the test awakening threshold value with the awakening success rate higher than the preset threshold value is selected as the initial threshold value, so that the accuracy of data distribution can be improved.
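The two statistics that make up a data distribution can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function names, the amplitude floor used to decide which samples count as sound, the use of per-utterance wake-up excitation values, and the choice of the highest qualifying threshold are all assumptions that the text above does not specify.

```python
from typing import Sequence


def volume_mean(samples: Sequence[float], silence_floor: float = 0.02) -> float:
    """Mean absolute amplitude over samples treated as containing sound.

    Assumption: a sample counts as sound when |x| > silence_floor.
    """
    voiced = [abs(x) for x in samples if abs(x) > silence_floor]
    return sum(voiced) / len(voiced) if voiced else 0.0


def initial_threshold(excitations: Sequence[float],
                      test_thresholds: Sequence[float],
                      min_success_rate: float = 0.9) -> float:
    """Select a test threshold whose wake-up success rate exceeds min_success_rate.

    excitations holds one (hypothetical) wake-up excitation per test utterance;
    an utterance wakes up at threshold t when its excitation is >= t.
    Assumption: among the qualifying thresholds, the highest one is chosen.
    """
    qualifying = []
    for t in test_thresholds:
        success_rate = sum(1 for e in excitations if e >= t) / len(excitations)
        if success_rate > min_success_rate:
            qualifying.append(t)
    if not qualifying:
        raise ValueError("no test threshold reaches the required success rate")
    return max(qualifying)
```

Running both functions once per audio set (first, second and third audio) would give the three (initial threshold, volume mean value) pairs referred to as the first, second and third data distributions.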
In some disclosed embodiments, the weight acquisition submodule includes a difference measurement unit configured to measure the difference between the first data distribution and a candidate data distribution to obtain a distribution difference, where the candidate data distribution includes a candidate initial threshold and a candidate volume mean value. The weight acquisition submodule further includes a difference calculation unit configured to obtain a threshold difference between the candidate initial threshold and the first initial threshold, and a volume difference between the candidate volume mean value and the first volume mean value. The weight acquisition submodule further includes a weight calculation unit configured to obtain an adjustment weight based on the distribution difference, the threshold difference and the volume difference. The adjustment weight is negatively correlated with the distribution difference, positively correlated with the threshold difference, and negatively correlated with the volume difference. When the candidate data distribution is the second data distribution, the adjustment weight is the first adjustment weight; when the candidate data distribution is the third data distribution, the adjustment weight is the second adjustment weight.
Therefore, the distribution difference between the data distributions is calculated, and the adjustment weight is set to be negatively correlated with the distribution difference, positively correlated with the threshold difference, and negatively correlated with the volume difference. As a result, when the distribution difference between the preset wake-up word and its reference word is small, or the volume difference is small, or the threshold difference is large, the adjustment weight is increased so that the preset wake-up word can be distinguished from its reference word, which helps improve the wake-up success rate.
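The stated correlations can be illustrated with a small sketch. The text only fixes the direction of each correlation, so the concrete formula below, the use of absolute differences, and the function name are assumptions chosen for illustration; any function with the same monotonic behaviour would fit the description equally well.

```python
def adjustment_weight(distribution_diff: float,
                      threshold_diff: float,
                      volume_diff: float) -> float:
    """Illustrative adjustment weight (hypothetical formula).

    Decreases as the distribution difference grows, increases as the
    threshold difference grows, and decreases as the volume difference grows.
    Assumption: all three inputs are treated as non-negative magnitudes.
    """
    d = abs(distribution_diff)
    t = abs(threshold_diff)
    v = abs(volume_diff)
    return (1.0 + t) / ((1.0 + d) * (1.0 + v))
```

Calling the function with the second data distribution as the candidate would yield the first adjustment weight, and calling it with the third data distribution would yield the second adjustment weight.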
In some disclosed embodiments, the adjustment initialization submodule includes a first ratio obtaining unit configured to determine a first adjustment ratio using the first adjustment weight and the second adjustment weight, where both the first adjustment weight and the second adjustment weight are positively correlated with the first adjustment ratio. The adjustment initialization submodule further includes a wake-up threshold calculation unit configured to take, as the wake-up threshold, the sum of the first initial threshold and the product of an adjustment step size and the first adjustment ratio.
Therefore, the first adjustment ratio is determined using the first adjustment weight and the second adjustment weight, both of which are positively correlated with the first adjustment ratio. On this basis, the sum of the first initial threshold and the product of the adjustment step size and the first adjustment ratio is taken as the wake-up threshold. In other words, the wake-up threshold of the preset wake-up word is adjusted upward from the first initial threshold in proportion to the first adjustment weight of the first reference word and the second adjustment weight of the second reference word, which further improves the accuracy of wake-up word detection.
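As a rough sketch of this calculation: the wake-up threshold is the first initial threshold plus the adjustment step size multiplied by the first adjustment ratio. How the two weights are combined into the ratio is not specified beyond the positive correlation, so the arithmetic mean used here, the default step size, and the function names are assumptions.

```python
def first_adjustment_ratio(first_weight: float, second_weight: float) -> float:
    """Combine the two adjustment weights into one ratio.

    Assumption: the arithmetic mean, which grows with either weight.
    """
    return 0.5 * (first_weight + second_weight)


def wake_threshold(first_initial_threshold: float,
                   first_weight: float,
                   second_weight: float,
                   step: float = 0.05) -> float:
    """wake-up threshold = first initial threshold + step * first adjustment ratio."""
    ratio = first_adjustment_ratio(first_weight, second_weight)
    return first_initial_threshold + step * ratio
```

With non-negative weights, a reference word that is easily confused with the preset wake-up word (a large weight) pushes the threshold further above the first initial threshold.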
In some disclosed embodiments, the wake-up detection module 81 includes a volume measurement sub-module configured to obtain a measured volume average value of the user in the answering environment; the wake-up detection module 81 includes a second ratio obtaining sub-module configured to determine a second adjustment ratio based on the measured volume average value, where the second adjustment ratio is positively correlated with the measured volume average value; and the wake-up detection module 81 includes a threshold adjustment sub-module configured to adjust the wake-up threshold using the second adjustment ratio.
Therefore, before it is determined, based on the wake-up threshold corresponding to a candidate wake-up word, whether to take the candidate wake-up word as a target wake-up word, the measured volume average value of the user in the answering environment is obtained and the second adjustment ratio is determined from it, the second adjustment ratio being positively correlated with the measured volume average value. The wake-up threshold is then further adjusted using the second adjustment ratio, so that the wake-up threshold is adapted during the answer scoring process and the accuracy of answer scoring is improved.
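A minimal sketch of this adaptive step is given below. The text states only that the second adjustment ratio grows with the measured volume average value and that the threshold is adjusted with it; expressing the ratio relative to a reference volume (for example the volume mean value recorded during testing) and applying it multiplicatively are assumptions, as are the function names.

```python
def second_adjustment_ratio(measured_volume_mean: float,
                            reference_volume_mean: float) -> float:
    """Ratio that increases with the measured volume average value.

    Assumption: the ratio is the measured value divided by a reference
    volume mean, such as the one recorded when the threshold was tested.
    """
    if reference_volume_mean <= 0:
        raise ValueError("reference_volume_mean must be positive")
    return measured_volume_mean / reference_volume_mean


def adapt_threshold(wake_threshold: float, ratio: float) -> float:
    """Adjust the wake-up threshold with the second adjustment ratio.

    Assumption: multiplicative scaling; a louder speaker gets a higher threshold.
    """
    return wake_threshold * ratio
```

For example, a user who speaks twice as loudly as the test recordings would, under these assumptions, have every wake-up threshold doubled before the excitation comparison.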
In some disclosed embodiments, the volume measurement sub-module is specifically configured to obtain the measured volume average value before the user answers the preset questions; the second ratio obtaining sub-module is specifically configured to determine, based on the measured volume average value, a second adjustment ratio for each preset wake-up word in each wake-up word set; and the threshold adjustment sub-module is specifically configured to adjust, for each preset wake-up word, the corresponding wake-up threshold using the second adjustment ratio of that preset wake-up word.
Therefore, before the user answers the preset questions, the wake-up thresholds of all preset wake-up words in the wake-up word sets corresponding to the plurality of preset questions are adaptively adjusted in a single uniform pass, which further improves the accuracy of the wake-up thresholds while keeping the complexity of the adaptive adjustment low.
In some disclosed embodiments, the volume measurement sub-module is specifically configured to obtain, for each preset question, the measured volume average value by statistics over the preset questions the user has already answered; the second ratio obtaining sub-module is specifically configured to determine, for each preset question, a second adjustment ratio for each preset wake-up word in the wake-up word set corresponding to that preset question, based on the measured volume average value obtained for that preset question; and the threshold adjustment sub-module is specifically configured to adjust, for each preset question, the corresponding wake-up threshold using the second adjustment ratio of each preset wake-up word in the corresponding wake-up word set.
Therefore, while the user is answering the preset questions, the wake-up thresholds of the preset wake-up words in the wake-up word set of each question are adaptively adjusted separately, which improves the precision of the adaptive adjustment and makes the wake-up thresholds as accurate as possible.
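The two adaptation strategies described above (one uniform adjustment before answering, versus a per-question adjustment based on the questions already answered) can be contrasted in a short sketch. The data layout, the function names, the shared ratio for all words in a set, and the multiplicative scaling are assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical layout: preset wake-up word -> wake-up threshold, one dict per question.
WakeWordSet = Dict[str, float]


def scale_set(word_set: WakeWordSet, ratio: float) -> WakeWordSet:
    """Return a copy of the set with every wake-up threshold scaled by ratio."""
    return {word: threshold * ratio for word, threshold in word_set.items()}


def adapt_uniformly(word_sets: List[WakeWordSet],
                    measured_volume: float,
                    reference_volume: float) -> List[WakeWordSet]:
    """Strategy 1: measure once before answering, then adjust every set."""
    ratio = measured_volume / reference_volume
    return [scale_set(s, ratio) for s in word_sets]


def adapt_per_question(word_sets: List[WakeWordSet],
                       answered_volumes: List[float],
                       reference_volume: float,
                       question_index: int) -> WakeWordSet:
    """Strategy 2: adjust only one question's set, using the mean volume
    of the questions the user has already answered."""
    if not answered_volumes:
        # Nothing answered yet: keep the thresholds unchanged.
        return dict(word_sets[question_index])
    measured = sum(answered_volumes) / len(answered_volumes)
    return scale_set(word_sets[question_index], measured / reference_volume)
```

The first strategy costs a single measurement, while the second tracks how the user's volume changes over the course of the session at the price of recomputing the ratio for every question.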
In some disclosed embodiments, the wake-up word set is created based on score words of the preset answer, and the preset wake-up words include at least one of a first wake-up word and a second wake-up word; the first wake-up word is obtained by synonymy expansion of the score word based on preset initial consonants and vowels, and the second wake-up word is obtained by dialect conversion of the first wake-up word based on a preset dialect.
Therefore, the wake-up word set is created based on the score words of the preset answer, the preset wake-up words include at least one of the first wake-up word and the second wake-up word, the first wake-up word is obtained by synonymy expansion of the score word based on the preset initial consonants and vowels, and the second wake-up word is obtained by dialect conversion of the first wake-up word based on the preset dialect, so that the robustness of answer scoring can be improved.
In some disclosed embodiments, the wake threshold of the first wake word is higher than the wake threshold of the second wake word.
Therefore, by assigning a higher wake-up threshold to the first wake-up word, false wake-up can be reduced, and by assigning a lower wake-up threshold to the second wake-up word, the wake-up success rate can be improved.
Referring to fig. 9, fig. 9 is a schematic block diagram of an embodiment of an electronic device 90 according to the present application. The electronic device 90 includes a memory 91 and a processor 92 coupled to each other; the memory 91 stores program instructions, and the processor 92 is configured to execute the program instructions to implement the steps in any of the answer scoring method embodiments described above. Specifically, the electronic device 90 may include, but is not limited to, desktop computers, notebook computers, mobile phones, tablet computers, and the like.
In particular, the processor 92 is configured to control itself and the memory 91 to implement the steps of any of the answer scoring method embodiments described above. The processor 92 may also be referred to as a CPU (Central Processing Unit). The processor 92 may be an integrated circuit chip having signal processing capabilities. The processor 92 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor. In addition, the processor 92 may be jointly implemented by integrated circuit chips.
With the above scheme, in the answer scoring process, on the one hand, answer scoring can be achieved merely by collecting the answer audio of the user answering the preset question, so that answer scoring is as close as possible to the form of human interaction; on the other hand, at least one target wake-up word in the answer audio can be obtained merely by performing wake-up word detection on the answer audio, and the answer score is obtained by matching the at least one target wake-up word against the preset answer, without performing speech transcription on the whole answer audio, which helps reduce the influence of spoken language quality on answer scoring as much as possible. The efficiency and accuracy of answer scoring can therefore be improved.
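The overall flow summarized above (compare each candidate's wake-up excitation with its wake-up threshold, then match the surviving target wake-up words against the preset answer) might be sketched as follows. The excitation values are assumed to come from some wake-up word detector that is not modelled here; the names, the "excitation >= threshold" acceptance rule, and the one-point-per-score-word scoring rule are hypothetical.

```python
from typing import Dict, Set


def detect_target_words(excitations: Dict[str, float],
                        thresholds: Dict[str, float]) -> Set[str]:
    """Keep candidate wake-up words whose excitation reaches their threshold.

    Assumption: a candidate is accepted when excitation >= threshold.
    """
    return {word for word, excitation in excitations.items()
            if word in thresholds and excitation >= thresholds[word]}


def answer_score(target_words: Set[str],
                 score_words: Dict[str, Set[str]],
                 points_per_word: float = 1.0) -> float:
    """Match target wake-up words against the preset answer.

    score_words maps each score word of the preset answer to the set of
    wake-up word variants derived from it. Assumption: each score word hit
    by at least one variant earns points_per_word.
    """
    hits = sum(1 for variants in score_words.values() if variants & target_words)
    return hits * points_per_word
```

A short usage example under the same assumptions:

```python
targets = detect_target_words(
    {"photosynthesis": 0.8, "chlorophyll": 0.4},
    {"photosynthesis": 0.6, "chlorophyll": 0.5},
)
score = answer_score(
    targets,
    {"photosynthesis": {"photosynthesis"}, "chlorophyll": {"chlorophyll"}},
)
```

No transcript of the whole answer is produced at any point; only the excitations of the words in the wake-up word set are needed.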
Referring to fig. 10, fig. 10 is a block diagram illustrating an embodiment of a computer-readable storage medium 100 according to the present application. Computer readable storage medium 100 stores program instructions 101 executable by a processor, program instructions 101 for implementing the steps of any of the answer scoring method embodiments described above.
With the above scheme, in the answer scoring process, on the one hand, answer scoring can be achieved merely by collecting the answer audio of the user answering the preset question, so that answer scoring is as close as possible to the form of human interaction; on the other hand, at least one target wake-up word in the answer audio can be obtained merely by performing wake-up word detection on the answer audio, and the answer score is obtained by matching the at least one target wake-up word against the preset answer, without performing speech transcription on the whole answer audio, which helps reduce the influence of spoken language quality on answer scoring as much as possible. The efficiency and accuracy of answer scoring can therefore be improved.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The foregoing description of the various embodiments tends to emphasize the differences between the embodiments; for the same or similar parts, the embodiments may be referred to one another, and for brevity these parts are not described again here.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division into modules or units is merely a logical division, and an actual implementation may use another division; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

Claims (15)

1. An answer scoring method, comprising:
performing awakening word detection on the answer audio to obtain a detection result; the answer audio is acquired when a user answers a preset question, the detection result comprises at least one target awakening word, the at least one target awakening word is from an awakening word set, and the awakening word set is obtained based on a preset answer of the preset question;
and matching the detection result with the preset answer to obtain an answer score.
2. The method of claim 1, wherein the performing awakening word detection on the answer audio to obtain a detection result comprises:
performing awakening word detection on the answer audio to obtain awakening excitation of at least one candidate awakening word; the at least one candidate awakening word is from the awakening word set, the awakening word set comprises a plurality of preset awakening words and awakening thresholds corresponding to the preset awakening words, and the awakening thresholds are obtained by testing with sample audio related to the preset awakening words;
for each candidate awakening word, determining whether to take the candidate awakening word as the target awakening word based on the magnitude relation between the awakening excitation and the awakening threshold corresponding to the candidate awakening word.
3. The method of claim 2, wherein the sample audio related to the preset wake up word comprises a first audio, a second audio and a third audio;
the first audio comprises the preset awakening word, the second audio comprises a first reference word of the preset awakening word, the third audio comprises a second reference word of the preset awakening word, the first reference word and the preset awakening word are synonymous with different tones, and the second reference word and the preset awakening word are synonymous with different tones.
4. The method according to claim 3, wherein the step of obtaining the wake threshold of the preset wake word comprises:
respectively carrying out awakening tests by using the first audio, the second audio and the third audio to obtain a first data distribution of the preset awakening word, a second data distribution of the first reference word and a third data distribution of the second reference word; wherein the first data distribution comprises a first initial threshold and a first volume mean, the second data distribution comprises a second initial threshold and a second volume mean, and the third data distribution comprises a third initial threshold and a third volume mean;
obtaining a first adjustment weight based on a difference between the first data distribution and the second data distribution, and obtaining a second adjustment weight based on a difference between the first data distribution and the third data distribution;
and adjusting the first initial threshold value by using the first adjustment weight value and the second adjustment weight value to obtain the awakening threshold value of the preset awakening word.
5. The method of claim 4, wherein the respectively carrying out awakening tests by using the first audio, the second audio and the third audio to obtain a first data distribution of the preset wake-up word, a second data distribution of the first reference word and a third data distribution of the second reference word comprises:
counting the volume amplitude average value of the sound-containing segments in the test audio to obtain a volume mean value; and
carrying out an awakening test on the wake-up word by using the test audio, counting the awakening success rates respectively corresponding to different test awakening thresholds, and selecting a test awakening threshold whose awakening success rate is higher than a preset threshold as an initial threshold;
wherein, when the wake-up word is the preset wake-up word, the test audio is the first audio, the volume mean value is the first volume mean value, and the initial threshold is the first initial threshold; when the wake-up word is the first reference word, the test audio is the second audio, the volume mean value is the second volume mean value, and the initial threshold is the second initial threshold; and when the wake-up word is the second reference word, the test audio is the third audio, the volume mean value is the third volume mean value, and the initial threshold is the third initial threshold.
6. The method of claim 4, wherein obtaining a first adjustment weight based on a difference between the first data distribution and the second data distribution, or obtaining a second adjustment weight based on a difference between the first data distribution and the third data distribution comprises:
measuring the difference between the first data distribution and the candidate data distribution to obtain a distribution gap; wherein the candidate data distribution comprises a candidate initial threshold and a candidate volume mean; and
acquiring a threshold difference value between the candidate initial threshold value and the first initial threshold value, and acquiring a volume difference value between the candidate volume average value and the first volume average value;
obtaining an adjustment weight value based on the distribution gap, the threshold difference value and the volume difference value;
wherein the adjustment weight is negatively correlated with the distribution gap, the adjustment weight is positively correlated with the threshold difference, the adjustment weight is negatively correlated with the volume difference, and the adjustment weight is the first adjustment weight when the candidate data distribution is the second data distribution, and the adjustment weight is the second adjustment weight when the candidate data distribution is the third data distribution.
7. The method according to claim 4, wherein the adjusting the first initial threshold value by using the first adjustment weight and the second adjustment weight to obtain the wake-up threshold value of the preset wake-up word comprises:
determining a first adjustment proportion by using the first adjustment weight and the second adjustment weight; wherein the first adjustment weight and the second adjustment weight are positively correlated with the first adjustment proportion;
and taking the sum of the product of the adjustment step size and the first adjustment proportion and the first initial threshold as the awakening threshold.
8. The method of claim 2, wherein before the determining, for each candidate awakening word, whether to take the candidate awakening word as the target awakening word based on the magnitude relation between the awakening excitation and the awakening threshold corresponding to the candidate awakening word, the method further comprises:
acquiring an actually measured volume average value of a user in a question answering environment;
determining a second adjustment proportion based on the measured volume average value; wherein the second adjustment proportion is positively correlated with the measured volume average value;
and adjusting the awakening threshold value by utilizing the second adjustment proportion.
9. The method according to claim 8, wherein there are a plurality of the preset questions, and each preset question corresponds to one awakening word set; the acquiring an actually measured volume average value of a user in a question answering environment comprises:
before the user answers the preset question, acquiring the measured volume average value;
the determining a second adjustment ratio based on the measured volume average includes:
respectively determining a second adjustment proportion of each preset awakening word in each awakening word set based on the measured volume average value;
the adjusting the wake-up threshold by using the second adjustment ratio includes:
and for each preset awakening word, adjusting a corresponding awakening threshold value by using a second adjustment proportion of the preset awakening word.
10. The method according to claim 8, wherein there are a plurality of the preset questions, and each preset question corresponds to one awakening word set; the acquiring an actually measured volume average value of a user in a question answering environment comprises:
for each preset question, obtaining the measured volume average value by statistics over the preset questions already answered by the user;
the determining a second adjustment proportion based on the measured volume average value includes:
for each preset question, determining a second adjustment proportion of each preset awakening word in the awakening word set corresponding to the preset question based on the measured volume average value obtained by statistics for the preset question;
the adjusting the awakening threshold value by utilizing the second adjustment proportion includes:
and for each preset question, adjusting a corresponding awakening threshold value by using the second adjustment proportion of each preset awakening word in the corresponding awakening word set.
11. The method of claim 2, wherein the set of wake-up words is created based on score words of the preset answer, and the preset wake-up words comprise at least one of a first wake-up word and a second wake-up word;
the first awakening word is obtained by synonymy expanding the score word based on a preset initial consonant and vowel, and the second awakening word is obtained by dialect conversion of the first awakening word based on a preset dialect.
12. The method of claim 11, wherein a wake threshold of the first wake word is higher than a wake threshold of the second wake word.
13. An answer scoring device, comprising:
the awakening detection module is used for carrying out awakening word detection on the answer audio to obtain a detection result; the answer audio is acquired when a user answers a preset question, the detection result comprises at least one target awakening word, the at least one target awakening word is from an awakening word set, and the awakening word set is obtained based on a preset answer of the preset question;
and the answer scoring module is used for matching the detection result with the preset answer to obtain an answer score.
14. An electronic device comprising a memory and a processor coupled to each other, the memory having stored therein program instructions for execution by the processor to implement the answer scoring method of any one of claims 1 to 12.
15. A computer-readable storage medium having stored thereon program instructions executable by a processor for implementing the answer scoring method of any one of claims 1 to 12.
CN202110614234.5A 2021-06-02 2021-06-02 Answer scoring method and device, electronic equipment and storage medium Active CN113535913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110614234.5A CN113535913B (en) 2021-06-02 2021-06-02 Answer scoring method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110614234.5A CN113535913B (en) 2021-06-02 2021-06-02 Answer scoring method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113535913A true CN113535913A (en) 2021-10-22
CN113535913B CN113535913B (en) 2023-12-01

Family

ID=78095006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110614234.5A Active CN113535913B (en) 2021-06-02 2021-06-02 Answer scoring method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113535913B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070010330A1 (en) * 2005-01-04 2007-01-11 Justin Cooper System and method forming interactive gaming over a TV network
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
CN106782536A (en) * 2016-12-26 2017-05-31 北京云知声信息技术有限公司 A kind of voice awakening method and device
CN109815491A (en) * 2019-01-08 2019-05-28 平安科技(深圳)有限公司 Answer methods of marking, device, computer equipment and storage medium
US20200090639A1 (en) * 2018-09-13 2020-03-19 Quanta Computer Inc. Speech correction system and speech correction method
CN111126553A (en) * 2019-12-25 2020-05-08 平安银行股份有限公司 Intelligent robot interviewing method, equipment, storage medium and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
白菊;何聚厚;: "应用于问答系统的Lucene相似度检索算法改进", 计算机技术与发展, no. 11, pages 85 - 88 *

Also Published As

Publication number Publication date
CN113535913B (en) 2023-12-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant