CN111710340A - Method, device, server and storage medium for identifying user identity based on voice - Google Patents
- Publication number
- CN111710340A (application number CN202010505687.XA)
- Authority
- CN
- China
- Prior art keywords
- verification
- pronunciation
- user
- voice
- voiceprint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Telephonic Communication Services (AREA)
Abstract
Embodiments of the invention disclose a method, device, server and storage medium for identifying a user's identity based on voice. The method comprises: acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more characters, symbols or numbers; and confirming whether the verification pronunciation matches a preset pronunciation of the verification information, and if so, determining the user's identity according to the voiceprint of the verification pronunciation. By verifying the pronunciation a user produces when reading the verification information, the method can effectively block illegal operations by anyone other than the account holder in common privacy-sensitive scenarios such as account login, transaction inquiry and transaction record deletion, improving account security and enhancing user experience.
Description
Technical Field
Embodiments of the invention relate to voice recognition technology, and in particular to a method, device, server and storage medium for identifying a user's identity based on voice.
Background
Fingerprints and faces are biological features widely used for identity confirmation — fingerprint door locks, face-based check-in, phone fingerprint sensors, face payment and the like — but another biometric is usually overlooked: the voice. An adult's voice changes so little with age that the change can essentially be ignored, while the differences between the voices of men and women, or children and adults, are large; at a finer granularity, the voices of different individuals also differ considerably, and these differences can be exploited as an effective voice biometric.
With the rapid spread of smartphones, face data has become progressively easier to acquire, and fingerprint recognition has advanced to under-screen sensors. Voiceprint collection, by contrast, requires nothing more than a microphone. From the user's perspective, most people dislike seeing their own face in a camera, so face recognition offers a mediocre experience; fingerprint recognition usually requires both the sensor and the finger to be clean, otherwise accuracy degrades sharply; voiceprint recognition only requires the user to speak, which is a better experience.
Voice-based verification is therefore more readily accepted by users. However, prior-art voice recognition schemes either describe only the overall identity-confirmation process or, where they do explain the key technology of each module, rely on outdated techniques that do not follow the current trend toward deep learning and cannot effectively represent the differences between individual identities.
Disclosure of Invention
The invention provides a method for identifying a user's identity based on voice, aiming to improve account security and enhance user experience.
In a first aspect, an embodiment of the present invention provides a method for identifying a user identity based on voice, including:
acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more characters, symbols or numbers;
and confirming whether the verification pronunciation matches a preset pronunciation of the verification information, and if so, determining the user's identity according to the voiceprint of the verification pronunciation.
Optionally, before the acquiring of the verification pronunciation generated by the user reading verification information, the method further includes:
acquiring a plurality of sample pronunciations generated by the user reading a plurality of pieces of sample information;
and if the plurality of sample pronunciations all match preset pronunciations corresponding to the sample information, confirming a voiceprint range of the user according to the plurality of sample pronunciations.
Optionally, the confirming whether the verification pronunciation matches the preset pronunciation of the verification information includes: converting the verification pronunciation into a verification graphic feature using a first deep learning model, and judging whether the verification graphic feature matches a preset graphic feature corresponding to the preset pronunciation of the verification information.
Optionally, the converting the verification pronunciation into the verification graphic feature using the first deep learning model includes:
converting each character, symbol or number of the verification pronunciation individually into a verification graphic feature using the first deep learning model.
Optionally, the converting the verification pronunciation into the verification graphic feature using the first deep learning model includes:
converting all characters, symbols or numbers of the verification pronunciation into one verification graphic feature for recognition using the first deep learning model.
Optionally, the determining the user identity according to the voiceprint of the verification pronunciation includes: converting the verification pronunciation into a frequency distribution graph using a second deep learning model, and confirming the user's voiceprint, and the user identity corresponding to the voiceprint, according to the frequency distribution graph.
Optionally, the verification information is one or more random characters, symbols or numbers in a first part plus one or more characters, symbols or numbers in a second part.
In a second aspect, an embodiment of the present invention further provides a device for recognizing a user identity based on voice, where the device includes:
a voice acquisition module, configured to acquire a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more characters, symbols or numbers;
and a voice recognition module, configured to confirm whether the verification pronunciation matches a preset pronunciation of the verification information, and if so, determine the user's identity according to the voiceprint of the verification pronunciation.
In a third aspect, an embodiment of the present invention further provides a server, where the server includes:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for identifying a user identity based on voice as described in any of the above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing any of the above methods for identifying a user identity based on voice.
Drawings
Fig. 1 is a flowchart of a method for identifying a speaker identity based on a dynamic voice password according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for identifying the identity of a speaker based on a dynamic voice password according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of identifying digits from waveforms in the second embodiment;
FIG. 4 is a schematic structural diagram of an apparatus for recognizing a user identity based on speech in an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer server according to an embodiment of the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it. Note also that, for convenience of description, the drawings show only the structures related to the invention rather than all structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. A process may be terminated when its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc.
Furthermore, the terms "first," "second," and the like may be used herein to describe various orientations, actions, steps, elements, or the like, but the orientations, actions, steps, or elements are not limited by these terms. These terms are only used to distinguish one direction, action, step or element from another direction, action, step or element. For example, a first deep learning model may be referred to as a second deep learning model, and similarly, a second deep learning model may be referred to as a first deep learning model, without departing from the scope of the present application. Both the first deep learning model and the second deep learning model are deep learning models, but they are not the same deep learning model. The terms "first", "second", etc. are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Example one
Fig. 1 is a flowchart of a method for identifying a speaker's identity based on a dynamic voice password according to a first embodiment of the present invention. This embodiment is suitable for confirming a user's identity by recognizing the user's voice, and specifically includes the following steps:
Step 100, acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more characters, symbols or numbers.
In this embodiment, the verification information is generated by the server in response to the user's verification request and is sent to the user terminal for voice verification. It resembles a mobile phone verification code — for example the digit string 01273569 or the text string "the weather is good today". This embodiment uses a numeric verification code as the example. Common numeric codes have 4, 6 or 8 digits; because a voice signal must later be collected and its length must remain useful, an 8-digit code is adopted. The verification pronunciation is the audio produced when the user reads the verification information; it can be captured by the terminal's microphone or an app's recording function and uploaded to the server. After the server receives the user's verification pronunciation, it matches it against the preset pronunciation in its database to verify whether the user's pronunciation is accurate.
In this embodiment, each character, symbol or number of the verification information is generated randomly — for example by a random number generator or a custom function. Randomly generating every position avoids a verification-information pool that is too small and lacks diversity. In an optional embodiment, the characters, symbols or numbers in the verification information are mutually independent and non-repeating — for example the digit string 18475920, in which no digit repeats — ensuring that every digit is recognized and matched individually and improving recognition accuracy. In another optional embodiment, any two characters, symbols or numbers whose acoustic pronunciations are close are kept at least 2 positions apart: for example, when the digits 2 and 8 both appear, they must be separated by at least 2 positions, because their pronunciations are acoustically close and linguistically easy to confuse, especially when adjacent. In a further optional embodiment, the verification information is one or more random characters, symbols or numbers in a first part plus one or more characters, symbols or numbers in a second part.
To improve the verification pass rate and enhance user experience, the numeric verification code is formed from a fixed-length random part plus a fixed part. For example, three verification codes generated during registration — 84672591, 07632591 and 48672591 — share the fixed last four digits 2591, while the first four digits are random; during verification, a new code such as 36702591 is obtained by randomizing the first four digits again. This prepares for voiceprint recognition: fixing part of the text turns voiceprint recognition into a text-dependent task, which can effectively improve the recognition rate and increase the verification pass rate.
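As an illustrative sketch (not part of the claimed method), the code-generation rules described above — a random non-repeating prefix, a fixed suffix, and confusable digits kept at least 2 positions apart — can be expressed as follows. The fixed suffix 2591 and the confusable pair {2, 8} come from the examples in the text; the function names are hypothetical.

```python
import random

CONFUSABLE = {("2", "8")}  # acoustically similar digit pairs named in the text


def far_enough(code, min_gap=2):
    """Check that every confusable pair is at least min_gap positions apart."""
    for a, b in CONFUSABLE:
        pos_a = [i for i, d in enumerate(code) if d == a]
        pos_b = [i for i, d in enumerate(code) if d == b]
        for i in pos_a:
            for j in pos_b:
                if abs(i - j) < min_gap:
                    return False
    return True


def make_code(fixed_suffix="2591", total_len=8):
    """Random non-repeating prefix + fixed suffix, e.g. '3670' + '2591'."""
    prefix_len = total_len - len(fixed_suffix)
    while True:
        # Exclude suffix digits so the whole code has no repeated digit.
        pool = [d for d in "0123456789" if d not in fixed_suffix]
        prefix = "".join(random.sample(pool, prefix_len))
        code = prefix + fixed_suffix
        if far_enough(code):
            return code
```

A new code is produced per verification request, e.g. `make_code()` might return `"36702591"`; the rejection loop simply retries until the confusable-distance rule is satisfied.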
Step 110, confirming whether the verification pronunciation matches a preset pronunciation of the verification information, and if so, determining the user's identity according to the voiceprint of the verification pronunciation.
In this embodiment, a voiceprint is the sound-wave spectrum, displayed by electro-acoustic instruments, that carries speech information. Voiceprints are both distinctive and stable: no two people share the same voiceprint, and a person's voiceprint changes little with age, so identifying users by voiceprint has a high success rate. The preset pronunciation is a standard pronunciation of each character, symbol or number, which can be obtained by averaging the pronunciations of many people; it is compared with the user's verification pronunciation to judge whether the user's pronunciation is accurate. After the server receives the verification pronunciation the user produced from the verification information, a match with the preset pronunciation indicates that the user read the verification information accurately, and the server proceeds to voiceprint recognition to further confirm the user's identity. In voiceprint recognition, the server matches the user's verification pronunciation against the voice sample recorded for that user in the database: if the similarity between the voiceprint of the verification pronunciation and that of the stored sample falls within a preset threshold range, verification passes; otherwise it fails.
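The patent does not specify the similarity measure or the threshold; as a minimal sketch, the threshold-based voiceprint match could look like the following, with cosine similarity and the 0.8 cutoff both being illustrative assumptions:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two voiceprint feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def voiceprint_matches(candidate, enrolled, threshold=0.8):
    """Pass verification when similarity falls within the preset threshold range."""
    return cosine_similarity(candidate, enrolled) >= threshold
```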
Illustratively, a user logging in to a mobile app must authenticate: the user taps to start verification, the server randomly generates a verification code, and the user reads the code aloud, the audio being sent to the server for matching. If verification passes, the user is allowed to log in to the app; otherwise login is refused, ensuring the user's safety when using the app.
This embodiment discloses a method for identifying a user's identity based on voice: acquiring a verification pronunciation generated by a user reading verification information, the verification information comprising one or more characters, symbols or numbers; and confirming whether the verification pronunciation matches a preset pronunciation of the verification information, and if so, determining the user's identity according to the voiceprint of the verification pronunciation. By verifying the pronunciation a user produces when reading the verification information, the method can effectively block illegal operations by anyone other than the account holder in common privacy-sensitive scenarios such as account login, transaction inquiry and transaction record deletion, improving account security and enhancing user experience.
Example two
Fig. 2 is a flowchart of a method for identifying a speaker's identity based on a dynamic voice password according to a second embodiment of the present invention. This embodiment is suitable for confirming a user's identity by recognizing the user's voice, and specifically includes the following steps:
Step 200, acquiring a plurality of sample pronunciations generated by the user reading a plurality of pieces of sample information.
In this embodiment, the sample information is generated by the server in response to the user's authentication request and sent to the user terminal for voice verification; like the verification information, it resembles a mobile phone verification code, such as the digit string 01273569 or the text string "the weather is good today". A sample pronunciation is the audio produced when the user reads the sample information, captured through the terminal's microphone or an app's recording function and uploaded to the server. The format and generation rules of the sample information are the same as those of the verification information in the first embodiment, where they are described in detail. For example, when a user first uses an app, sample pronunciations must be collected to serve as reference audio for later logins: the server sends several pieces of sample information, such as 84672591, 07632591 and 48672591, to the user's handset, and the user reads each of them, producing sample pronunciations that are sent to the server for verification.
Step 210, if the plurality of sample pronunciations all match the preset pronunciations corresponding to the sample information, confirming the user's voiceprint range according to the plurality of sample pronunciations.
In this embodiment, after the server receives the user's sample pronunciations, it matches them against the preset pronunciations; if all the sample pronunciations fall within the fault-tolerance range of the preset pronunciations, verification succeeds, and the server stores the mean of the voiceprints corresponding to the user's sample pronunciations, together with the user's identity information, in the database. For example, several pieces of sample information — 84672591, 07632591 and 48672591 — are collected for user A, corresponding to voiceprint A, voiceprint B and voiceprint C. The server averages voiceprints A, B and C to obtain voiceprint D, confirms a threshold range for voiceprint D, and stores voiceprint D and its threshold range together with user A's information, such as name and telephone number. When someone claiming to be user A requests identity verification, the voiceprint X of that person's voice is collected and matched against user A's voiceprint D in the database; if voiceprint X falls within the threshold range of voiceprint D, the person is verified to be user A.
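The enrollment step above — averaging sample voiceprints A, B, C into a reference D with a threshold range — can be sketched as follows; the voiceprints are treated as plain feature vectors, and the per-dimension tolerance is an illustrative assumption, since the patent does not define the threshold range:

```python
def enroll(voiceprints, tolerance=0.15):
    """Average several sample voiceprints into a reference plus a threshold range.

    `voiceprints` is a list of equal-length feature vectors (A, B, C in the
    text); the returned reference corresponds to voiceprint D.
    """
    n = len(voiceprints)
    reference = [sum(col) / n for col in zip(*voiceprints)]
    return reference, tolerance


def within_range(candidate, reference, tolerance):
    """Check whether candidate voiceprint X falls in D's threshold range."""
    return all(abs(x - r) <= tolerance for x, r in zip(candidate, reference))
```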
In this embodiment, the first deep learning model is vgg13 (a lightweight variant of the VGG model), used to identify specific characters, symbols or numbers from a spectrogram. The preset graphic feature is obtained by converting a pronunciation into a frequency distribution graph; since each digit has a different frequency distribution, the digits can be distinguished by their distribution graphs. Each character, symbol or number of the verification pronunciation is converted individually into a verification graphic feature using the first deep learning model: the whole string is split into single characters, symbols or numbers that are recognized one by one — for example, the digit string 57043102 is split into the eight digits 5, 7, 0, 4, 3, 1, 0 and 2. During recognition, the silent segments of the verification pronunciation are removed to split the speech signal, and each voiced segment is then recognized, achieving Chinese digit recognition on a continuous speech signal. Referring to fig. 3, which illustrates identifying digits from waveforms in this embodiment, the waveform shown is a time-domain signal; because the time domain carries little information, offers few usable features, and discriminates poorly between signals, speech analysis usually applies a Fourier transform to convert the time domain into the frequency domain, where the signal's distribution is observed. After the transform, each digit has a distinct frequency distribution and distinct formants, so a specific digit can be identified from the frequency distribution and formants in the frequency graph.
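The silence-removal step that splits the continuous signal into per-digit voiced segments can be sketched with a crude energy-based voice activity detector (frame length and threshold are illustrative assumptions; a real system would work on sampled audio):

```python
def split_voiced_segments(signal, frame_len=160, energy_threshold=0.01):
    """Split a waveform into voiced segments by dropping low-energy frames.

    Frames whose mean squared amplitude is below the threshold count as
    silence; runs of voiced frames become segments (ideally one per digit).
    """
    segments, current = [], []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= energy_threshold:
            current.extend(frame)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```

Each returned segment would then be converted to a spectrogram and classified individually by the first deep learning model.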
For the vgg13 model, the training data cover more than 60 speakers, male and female, each contributing single pronunciations of the digits 0 through 9. Data augmentation is applied on this basis — trimming, loudness compensation, adding background noise, and adjusting pitch and speaking rate — and ten-fold cross-validation is used during training.
In an alternative embodiment, all characters, symbols or numbers of the verification pronunciation are converted by the first deep learning model into a single verification graphic feature and recognized as a whole — for example, the digit string 57043102 is recognized in one pass.
In another alternative embodiment, the mel-energy spectrogram of the speech signal is used as the input feature and the first deep learning model is a CNN-LSTM. Unlike the vgg13 model, the input and output are no longer single-digit speech signals but continuous speech: for example, the input is the pronunciation of 01234567 and the output is the eight digits 01234567. This is an end-to-end approach, with CTC added as the loss function; CTC automatically aligns the per-frame predictions, merges repeated terms, and yields the sequence with the highest confidence.
In this embodiment, the second deep learning model is an identity coding model whose structure combines CNN, RNN, LSTM, GRU and the like, selected for performance. The verification pronunciation is input to the identity coding model and converted into a frequency distribution graph; because every person's voiceprint — and therefore frequency distribution graph — differs, reading the frequency distribution graph distinguishes each person's voiceprint. In the voiceprint feature extraction stage, the traditional MFCC features are abandoned: MFCCs are obtained through manual extraction and dimensionality reduction, which subjectively screens out too much information and hinders the model's own learning. The mel-energy spectrogram instead characterizes the frequency distribution of sounds audible to humans — a deep feature by which people distinguish things by sound — and its distribution in the mel frequency domain is better suited to building a speaker recognition system. Although the mel scale is nonlinear with respect to ordinary frequency, human perception of pitch is linear in the mel domain. After this conversion, the speech signal becomes an image carrying voiceprint information; for a single signal the mel-energy spectrogram is black and white and can be understood as a single-channel picture.
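The patent does not state which mel-scale formula it assumes; one common convention for the Hz-to-mel conversion underlying the mel-energy spectrogram is:

```python
import math


def hz_to_mel(f_hz):
    """One standard Hz-to-mel conversion (the patent does not specify
    which variant it uses): mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

Under this formula 0 Hz maps to 0 mel and 1000 Hz maps to approximately 1000 mel, with the scale compressing progressively at higher frequencies — the nonlinearity mentioned above.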
In the training stage of the identity coding model, the training objective matters more than the structure at this stage. With model reuse in mind, a discriminative model is adopted rather than a generative one, so a single model covers many categories; however, a plain classification setup still suffers from the fixed number of categories.
In a first alternative embodiment, the identity coding task is trained directly rather than via a classification task. Specifically, a twin (Siamese) network structure is used: two models with the same structure and shared weights receive a pair of inputs each time. When the two inputs come from the same person, the objective is to reduce the difference between the two network outputs; when they come from different people, the objective is to increase it. After convergence, the model can be understood to have learned to distinguish the voices of the same person from those of different people, and the result vector it outputs serves as the identity code.
In a second alternative embodiment, building on the idea of the first, a better result can be obtained with a triplet loss function. Each input is a triple: an anchor, a positive example from the same person as the anchor, and a negative example from a different person. The objective is that, for every input, the difference between the anchor and the positive example is far smaller than the difference between the anchor and the negative example, which better drives the network from voiceprint characterization toward distinguishing the identities of different people.
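The triplet objective above can be sketched as a hinge-style loss over plain feature vectors; the squared-distance metric and the margin value are illustrative assumptions, not specified in the text:

```python
def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: push the anchor-positive distance below
    the anchor-negative distance by at least `margin` (margin value is an
    illustrative assumption)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

The loss is zero once the negative is sufficiently farther from the anchor than the positive, and grows as the ordering is violated.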
The embodiment of the invention discloses a method for identifying a user identity based on voice, which comprises the following steps: acquiring a plurality of sample pronunciations generated by a user reading a plurality of pieces of sample information; if the plurality of sample pronunciations match the preset pronunciations corresponding to the sample information, confirming the voiceprint range of the user according to the plurality of sample pronunciations; acquiring a verification pronunciation generated by the user reading verification information, wherein the verification information comprises one or more of characters, symbols or numbers; converting the verification pronunciation into a verification graphic feature by using a first deep learning model, and judging whether the verification graphic feature matches the preset graphic feature corresponding to the preset pronunciation of the verification information; and converting the verification pronunciation into a frequency distribution graph by using a second deep learning model, and confirming the voiceprint of the user and the user identity corresponding to the voiceprint according to the frequency distribution graph. According to the method for identifying the user identity based on voice provided by the embodiment of the invention, the user identity is identified by recognizing the verification pronunciation produced when the user reads the verification information. In common links involving privacy-related operations, such as account login, transaction information inquiry and transaction record deletion, illegal operations by anyone other than the account owner can thus be effectively prevented, improving the security of the user account and enhancing the user experience.
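The two-stage check summarized above, content match first, then voiceprint match, can be sketched as control flow. Here `content_matches` and `speaker_id` are hypothetical stand-ins for the first and second deep learning models, which this sketch does not implement:

```python
def verify_user(utterance, prompt_text, enrolled_user,
                content_matches, speaker_id):
    """Two-stage check from the claimed method (hypothetical sketch):
    1) first model:  does the utterance actually pronounce the prompt?
    2) second model: does the voiceprint belong to the enrolled user?"""
    if not content_matches(utterance, prompt_text):
        return False  # wrong content (e.g. a replay of an old prompt)
    return speaker_id(utterance) == enrolled_user

# Stub callables standing in for the two deep learning models.
ok = verify_user("raw-audio-bytes", "X7#q", "alice",
                 content_matches=lambda u, t: True,
                 speaker_id=lambda u: "alice")
print(ok)  # True
```

Ordering the checks this way rejects replayed recordings of a different prompt before the (typically more expensive) voiceprint comparison runs.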
Example three
The device for recognizing a user identity based on voice provided by the embodiment of the invention can execute the method for recognizing a user identity based on voice provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method. Fig. 4 is a schematic structural diagram of an apparatus 300 for recognizing a user identity based on voice in an embodiment of the present invention. Referring to fig. 4, the apparatus 300 for recognizing a user identity based on voice according to an embodiment of the present invention may specifically include:
the voice acquisition module is used for acquiring verification pronunciation generated by reading verification information by a user, wherein the verification information comprises one or more of characters, symbols or numbers;
and the voice recognition module is used for confirming whether the verification pronunciation is matched with the preset pronunciation of the verification information or not, and if so, determining the identity of the user according to the voiceprint of the verification pronunciation.
Optionally, before the obtaining of the verification pronunciation generated by the user reading verification information, the method further includes:
acquiring a plurality of sample pronunciations generated by a user reading a plurality of pieces of sample information;
and if the plurality of sample pronunciations are matched with the preset pronunciations corresponding to the sample information, confirming the voiceprint range of the user according to the plurality of sample pronunciations.
Optionally, the determining whether the verification pronunciation matches the preset pronunciation of the verification information includes: and converting the verification pronunciation into a verification graphic feature by using a first deep learning model, and judging whether the verification graphic feature is matched with a preset graphic feature corresponding to a preset pronunciation of the verification information.
Optionally, the converting the verification pronunciation into the verification graphic feature by using the first deep learning model includes:
each word, symbol or number of the verification utterance is separately converted to a verification graph feature using a first deep learning model.
Optionally, the converting the verification pronunciation into the verification graphic feature by using the first deep learning model includes:
and converting all characters, symbols or numbers of the verification pronunciation into a verification graphic feature for recognition by using a first deep learning model.
Optionally, the determining the user identity according to the voiceprint of the verification pronunciation includes: and converting the verification pronunciation into a frequency distribution graph by using a second deep learning model, and confirming the voiceprint of the user and the user identity corresponding to the voiceprint according to the frequency distribution graph.
Optionally, the verification information consists of a random first part of one or more characters, symbols or numbers plus a second part of one or more characters, symbols or numbers.
The embodiment discloses a device for identifying a user identity based on voice, comprising: a voice acquisition module, used for acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more of characters, symbols or numbers; and a voice recognition module, used for confirming whether the verification pronunciation matches the preset pronunciation of the verification information and, if so, determining the identity of the user according to the voiceprint of the verification pronunciation. According to the device provided by the embodiment of the invention, the user identity is identified by recognizing the verification pronunciation produced when the user reads the verification information. In common links involving privacy-related operations, such as account login, transaction information inquiry and transaction record deletion, illegal operations by anyone other than the account owner can thus be effectively prevented, improving the security of the user account and enhancing the user experience.
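The optional verification-information format described above (a random first part plus a second part of characters, symbols, or numbers) could be generated along these lines. The alphabet, part lengths, and fixed suffix are purely illustrative assumptions, not specified by the patent:

```python
import random
import string

def make_challenge(n_random=4, fixed_suffix="88"):
    """Build verification info in the optional format above:
    a random first part drawn from letters, digits and symbols,
    plus a second, fixed part. All concrete choices here are
    assumptions for illustration."""
    alphabet = string.ascii_uppercase + string.digits + "#@%&"
    random_part = "".join(random.choice(alphabet) for _ in range(n_random))
    return random_part + fixed_suffix

challenge = make_challenge()
print(challenge)  # e.g. "K3#Q88" -- the random part varies per call
```

Randomizing the prompt per attempt is what defeats simple replay of a previously recorded pronunciation.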
Example four
Fig. 5 is a schematic structural diagram of a computer server according to an embodiment of the present invention, as shown in fig. 5, the computer server includes a memory 410 and a processor 420, the number of the processors 420 in the computer server may be one or more, and one processor 420 is taken as an example in fig. 5; the memory 410 and the processor 420 in the device may be connected by a bus or other means, and fig. 5 illustrates the connection by a bus as an example.
The memory 410, as a computer-readable storage medium, is used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for recognizing a user identity based on voice in the embodiment of the present invention (e.g., the voice acquisition module 310 and the voice recognition module 320 in the apparatus 300 for recognizing a user identity based on voice). By running the software programs, instructions and modules stored in the memory 410, the processor 420 executes the various functional applications and data processing of the device/terminal/equipment, thereby implementing the method for recognizing a user identity based on voice.
Wherein the processor 420 is configured to run the computer program stored in the memory 410, and implement the following steps:
acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more of characters, symbols or numbers;
and confirming whether the verification pronunciation is matched with the preset pronunciation of the verification information, and if so, determining the identity of the user according to the voiceprint of the verification pronunciation.
In one embodiment, the computer program of the computer device provided by the embodiment of the present invention is not limited to the above method operations, and may also perform related operations in the method for recognizing the user identity based on voice provided by any embodiment of the present invention.
The memory 410 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 410 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 410 may further include memory located remotely from the processor 420, which may be connected to devices/terminals/devices through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment discloses a server for identifying a user identity based on voice, configured to execute the following method: acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more of characters, symbols or numbers; and confirming whether the verification pronunciation matches the preset pronunciation of the verification information and, if so, determining the identity of the user according to the voiceprint of the verification pronunciation. According to the server provided by the embodiment of the invention, the user identity is identified by recognizing the verification pronunciation produced when the user reads the verification information. In common links involving privacy-related operations, such as account login, transaction information inquiry and transaction record deletion, illegal operations by anyone other than the account owner can thus be effectively prevented, improving the security of the user account and enhancing the user experience.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method for recognizing a user identity based on speech, the method including:
acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more of characters, symbols or numbers;
and confirming whether the verification pronunciation is matched with the preset pronunciation of the verification information, and if so, determining the identity of the user according to the voiceprint of the verification pronunciation.
Of course, the storage medium containing the computer-executable instructions provided by the embodiments of the present invention is not limited to the method operations described above, and may also perform related operations in a method for recognizing a user identity based on a voice according to any embodiment of the present invention.
The computer-readable storage media of embodiments of the invention may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The embodiment discloses a storage medium for recognizing a user identity based on voice, configured to execute the following method: acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more of characters, symbols or numbers; and confirming whether the verification pronunciation matches the preset pronunciation of the verification information and, if so, determining the identity of the user according to the voiceprint of the verification pronunciation. According to the storage medium provided by the embodiment of the invention, the user identity is identified by recognizing the verification pronunciation produced when the user reads the verification information. In common links involving privacy-related operations, such as account login, transaction information inquiry and transaction record deletion, illegal operations by anyone other than the account owner can thus be effectively prevented, improving the security of the user account and enhancing the user experience.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A method for recognizing user identity based on voice is characterized by comprising the following steps:
acquiring a verification pronunciation generated by a user reading verification information, wherein the verification information comprises one or more of characters, symbols or numbers;
and confirming whether the verification pronunciation is matched with the preset pronunciation of the verification information, and if so, determining the identity of the user according to the voiceprint of the verification pronunciation.
2. The method for recognizing a user identity based on voice according to claim 1, wherein before the acquiring of the verification pronunciation generated by the user reading the verification information, the method further comprises:
acquiring a plurality of sample pronunciations generated by a user reading a plurality of pieces of sample information;
and if the plurality of sample pronunciations are matched with the preset pronunciations corresponding to the sample information, confirming the voiceprint range of the user according to the plurality of sample pronunciations.
3. The method for recognizing user identity based on voice according to claim 1, wherein the confirming whether the verification pronunciation matches the preset pronunciation of the verification information comprises: and converting the verification pronunciation into a verification graphic feature by using a first deep learning model, and judging whether the verification graphic feature is matched with a preset graphic feature corresponding to a preset pronunciation of the verification information.
4. The method for recognizing a user identity based on voice according to claim 3, wherein the converting the verification pronunciation into the verification graphic feature by using the first deep learning model comprises:
each word, symbol or number of the verification utterance is separately converted to a verification graph feature using a first deep learning model.
5. The method for recognizing a user identity based on voice according to claim 3, wherein the converting the verification pronunciation into the verification graphic feature by using the first deep learning model comprises:
and converting all characters, symbols or numbers of the verification pronunciation into a verification graphic feature for recognition by using a first deep learning model.
6. The method of claim 1, wherein the determining the user identity based on the voiceprint of the verification utterance comprises: and converting the verification pronunciation into a frequency distribution graph by using a second deep learning model, and confirming the voiceprint of the user and the user identity corresponding to the voiceprint according to the frequency distribution graph.
7. The method of claim 1, wherein the verification information consists of a random first part of one or more characters, symbols or numbers plus a second part of one or more characters, symbols or numbers.
8. An apparatus for recognizing user identity based on voice, comprising:
the voice acquisition module is used for acquiring verification pronunciation generated by reading verification information by a user, wherein the verification information comprises one or more of characters, symbols or numbers;
and the voice recognition module is used for confirming whether the verification pronunciation is matched with the preset pronunciation of the verification information or not, and if so, determining the identity of the user according to the voiceprint of the verification pronunciation.
9. A server, characterized in that the server comprises:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement a method for recognizing a user identity based on speech as recited in any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method for recognizing a user identity on the basis of speech as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010505687.XA CN111710340A (en) | 2020-06-05 | 2020-06-05 | Method, device, server and storage medium for identifying user identity based on voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010505687.XA CN111710340A (en) | 2020-06-05 | 2020-06-05 | Method, device, server and storage medium for identifying user identity based on voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111710340A true CN111710340A (en) | 2020-09-25 |
Family
ID=72539403
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010505687.XA Pending CN111710340A (en) | 2020-06-05 | 2020-06-05 | Method, device, server and storage medium for identifying user identity based on voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111710340A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102413100A (en) * | 2010-09-25 | 2012-04-11 | 盛乐信息技术(上海)有限公司 | Voiceprint authentication system for voiceprint password picture prompt and implementation method thereof |
CN102708867A (en) * | 2012-05-30 | 2012-10-03 | 北京正鹰科技有限责任公司 | Method and system for identifying faked identity by preventing faked recordings based on voiceprint and voice |
EP3107091A1 (en) * | 2015-06-17 | 2016-12-21 | Baidu Online Network Technology (Beijing) Co., Ltd | Voiceprint authentication method and apparatus |
US20170318013A1 (en) * | 2016-04-29 | 2017-11-02 | Yen4Ken, Inc. | Method and system for voice-based user authentication and content evaluation |
CN107517207A (en) * | 2017-03-13 | 2017-12-26 | 平安科技(深圳)有限公司 | Server, auth method and computer-readable recording medium |
CN109243467A (en) * | 2018-11-14 | 2019-01-18 | 龙马智声(珠海)科技有限公司 | Sound-groove model construction method, method for recognizing sound-groove and system |
CN110751945A (en) * | 2019-10-17 | 2020-02-04 | 成都三零凯天通信实业有限公司 | End-to-end voice recognition method |
CN111048099A (en) * | 2019-12-16 | 2020-04-21 | 随手(北京)信息技术有限公司 | Sound source identification method, device, server and storage medium |
CN111081080A (en) * | 2019-05-29 | 2020-04-28 | 广东小天才科技有限公司 | Voice detection method and learning device |
CN111192574A (en) * | 2018-11-14 | 2020-05-22 | 奇酷互联网络科技(深圳)有限公司 | Intelligent voice interaction method, mobile terminal and computer readable storage medium |
2020-06-05: CN application CN202010505687.XA, publication CN111710340A (en), status: active, Pending
Non-Patent Citations (4)
Title |
---|
吴哲顺 (Wu Zheshun), "Research and Implementation of a Voiceprint Recognition System Based on Collaborative Edge Computing", China Master's Theses Full-text Database, Information Science and Technology, no. 02, pages 30-41 * |
杨楠 (Yang Nan), "Research and Implementation of Speaker Recognition Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 07, 15 July 2019 (2019-07-15), pages 13-30 * |
杨楠 (Yang Nan), "Research and Implementation of Speaker Recognition Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 07, pages 54-55 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113098850A (en) * | 2021-03-24 | 2021-07-09 | 北京嘀嘀无限科技发展有限公司 | Voice verification method and device and electronic equipment |
CN112948788A (en) * | 2021-04-13 | 2021-06-11 | 网易(杭州)网络有限公司 | Voice verification method, device, computing equipment and medium |
CN112948788B (en) * | 2021-04-13 | 2024-05-31 | 杭州网易智企科技有限公司 | Voice verification method, device, computing equipment and medium |
CN113449083A (en) * | 2021-08-31 | 2021-09-28 | 深圳市信润富联数字科技有限公司 | Operation safety management method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200925 |