
CN113555022B - Method, device, equipment and storage medium for identifying same person based on voice - Google Patents


Info

Publication number
CN113555022B
CN113555022B
Authority
CN
China
Prior art keywords
voice
recognized
preset
extracting
age
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110836229.9A
Other languages
Chinese (zh)
Other versions
CN113555022A (en)
Inventor
Liu Yuan
Wang Jianzong
Peng Junqing
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110836229.9A priority Critical patent/CN113555022B/en
Publication of CN113555022A publication Critical patent/CN113555022A/en
Application granted granted Critical
Publication of CN113555022B publication Critical patent/CN113555022B/en


Abstract

The invention relates to the field of artificial intelligence and discloses a method, a device, equipment and a storage medium for identifying the same person based on voice. The method comprises the following steps: extracting characteristic parameters of the voice to be recognized; determining the age bracket of the target user based on a preset vector machine model and the characteristic parameters; extracting voice data corresponding to the age bracket from a voice database of preset registered users; respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network and outputting tone characteristic vectors; and judging from these vectors whether the target user and the registered user are the same person. By performing format conversion and age recognition on the voice, the invention selects registered-user voice from the same age bracket as the target user for same-person comparison, improving both the voice recognition rate and the accuracy of same-person identification. The invention also relates to blockchain technology: the voice to be recognized and the characteristic parameters can be stored in a blockchain.

Description

Method, device, equipment and storage medium for identifying same person based on voice
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to a method, apparatus, device, and storage medium for identifying the same person based on voice.
Background
With the continuous development of artificial intelligence, speech is widely applied in many fields. In human-computer interaction, for example, devices can be controlled by voice and intelligent voice dialogues can be conducted through a robot; speech can also support auxiliary disease diagnosis, health management, remote consultation, and the like. A large number of human-computer interaction products therefore need to distinguish speakers, that is, to identify and distinguish the identity of a speaker through voice.
In the prior art, when a speaker's identity is identified from voice, only a section of voice features of limited length is extracted from the target user's voice data for recognition. The result cannot accurately represent the individual characteristics of a speaker, and because the recognition result is calculated from probabilities, very high resolution is difficult to achieve, so the accuracy of same-person identification is low.
Disclosure of Invention
The invention mainly aims to solve the technical problem that voice-based same-person recognition in the prior art has low accuracy.
The first aspect of the present invention provides a method for identifying the same person based on voice, comprising: acquiring the voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized; performing parameter analysis on the mark parameter information to determine the format type and attribute information of the voice to be recognized; performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after format conversion; performing age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users; and respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the tone characteristic vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
Optionally, in a first implementation manner of the first aspect of the present invention, the performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting feature parameters of the voice to be recognized after format conversion includes: extracting the sampling rate, the bit rate and the sound channel in the attribute information of the voice to be recognized according to the format type; judging whether the sampling rate and the bit rate meet preset requirements; if the preset requirement is not met, converting the sampling rate and the bit rate based on a preset conversion rule, and judging whether the sound channel of the voice to be recognized is a mono channel or not; if the channel is not the mono channel, converting the channel into the mono channel according to a preset channel conversion rule; and extracting the characteristic parameters of the voice to be recognized after the format conversion, wherein the characteristic parameters comprise time domain characteristic parameters and frequency domain characteristic parameters.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing age recognition on the voice to be recognized based on a preset vector machine model and the feature parameters, determining an age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users includes: performing dimension reduction and aggregation processing on the characteristic parameters to obtain age characteristic parameters; performing age recognition on the voice to be recognized based on a preset vector machine model and the age characteristic parameters to obtain a recognition result; comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence coefficient of the recognition result; determining the age bracket of the target user according to the confidence level; and extracting voice data corresponding to the age bracket from a voice database of preset registered users.
Optionally, in a third implementation manner of the first aspect of the present invention, the inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting tone characteristic vectors, comparing the tone characteristic vectors, and determining whether the target user and the registered user are the same person includes: respectively extracting the voiceprint features of the voice data and of the voice to be recognized based on a preset deep convolutional neural network; clustering the voiceprint features to obtain tone characteristic vectors; calculating the similarity value of the tone characteristic vectors, and judging whether the similarity value is not smaller than a preset tone similarity threshold; and if so, determining that the target user and the registered user are the same person.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the clustering the voiceprint features to obtain tone characteristic vectors includes: calculating chroma feature values of the voiceprint features, and generating a voiceprint matrix according to the chroma feature values; inputting the voiceprint features into the deep convolutional neural network, and outputting tone feature representations; and mapping the tone feature representations to a preset feature space, and quantitatively characterizing them according to the feature space to obtain the tone characteristic vectors.
Optionally, in a fifth implementation manner of the first aspect of the present invention, before extracting the voice data and the voiceprint feature of the voice to be recognized in the preset deep convolutional neural network, the method further includes: respectively carrying out framing treatment on the voice data and the voice data to be recognized to obtain an audio frame; extracting short-time energy of the audio frame, and judging whether the short-time energy is smaller than a preset energy threshold, wherein the short-time energy is the intensity degree of the audio frame at different moments; if yes, the corresponding audio frame is removed.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after the inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting tone characteristic vectors, comparing the tone characteristic vectors, and determining whether the target user and the registered user are the same person, the method further includes: extracting frame voiceprint features of the voice to be recognized; calculating posterior probabilities of the frame voiceprint features based on a preset time delay neural network; calculating one-hot values of the posterior probabilities; classifying the frame voiceprint features according to the one-hot values, and identifying the frame voiceprint features according to the classification result; and performing voice registration for the target user according to the identification, and storing the voice to be recognized into the voice database of registered users.
A second aspect of the present invention proposes a voice-based same-person recognition apparatus, comprising: the acquisition module is used for acquiring the voice to be recognized of the target user and extracting the mark parameter information of the voice to be recognized; the analysis module is used for carrying out parameter analysis on the mark parameter information and determining the format type and attribute information of the voice to be recognized; the conversion module is used for carrying out format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after the format conversion; the recognition module is used for carrying out age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users; and the comparison module is used for respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the tone characteristic vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
Optionally, in a first implementation manner of the second aspect of the present invention, the conversion module includes: a first extracting unit, configured to extract a sampling rate, a bit rate and a channel in the attribute information of the speech to be recognized according to the format type; the judging unit is used for judging whether the sampling rate and the bit rate meet preset requirements; the first conversion unit is used for converting the sampling rate and the bit rate based on a preset conversion rule and judging whether the sound channel of the voice to be recognized is mono or not if the sampling rate and the bit rate do not meet the preset requirement; the second conversion unit is used for converting the sound channel into a mono channel according to a preset sound channel conversion rule if the sound channel is not mono; and the second extraction unit is used for extracting the characteristic parameters of the voice to be recognized after the format conversion, wherein the characteristic parameters comprise time domain characteristic parameters and frequency domain characteristic parameters.
Optionally, in a second implementation manner of the second aspect of the present invention, the identification module is specifically configured to: perform dimension reduction and aggregation processing on the characteristic parameters to obtain age characteristic parameters; perform age recognition on the voice to be recognized based on a preset vector machine model and the age characteristic parameters to obtain a recognition result; compare the recognition result with the recognition rate in the vector machine model, and calculate the confidence coefficient of the recognition result; determine the age bracket of the target user according to the confidence level; and extract voice data corresponding to the age bracket from a voice database of preset registered users.
Optionally, in a third implementation manner of the second aspect of the present invention, the comparison module includes: the third extraction unit is used for respectively extracting the voice data and the voiceprint characteristics of the voice to be recognized based on a preset deep convolutional neural network; the clustering unit is used for carrying out clustering processing on the voiceprint features to obtain tone feature vectors; the calculating unit is used for calculating the similarity value of the tone characteristic vector and judging whether the similarity value is not smaller than a preset tone similarity threshold value or not; and the determining unit is used for determining that the target user and the registered user are the same person if the similarity value is not smaller than a preset tone similarity threshold value.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the clustering unit is specifically configured to: calculate chroma feature values of the voiceprint features, and generate a voiceprint matrix according to the chroma feature values; input the voiceprint features into the deep convolutional neural network, and output tone feature representations; and map the tone feature representations to a preset feature space, quantitatively characterizing them according to the feature space to obtain the tone characteristic vectors.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the voice-based co-person identifying device further includes a rejection module, which is specifically configured to: respectively carrying out framing treatment on the voice data and the voice data to be recognized to obtain an audio frame; extracting short-time energy of the audio frame, and judging whether the short-time energy is smaller than a preset energy threshold value or not; if yes, the corresponding audio frame is removed.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the voice-based same-person recognition device further includes a registration module, specifically configured to: extract frame voiceprint features of the voice to be recognized; calculate posterior probabilities of the frame voiceprint features based on a preset time delay neural network; calculate one-hot values of the posterior probabilities; classify the frame voiceprint features according to the one-hot values, and identify them according to the classification result; and perform voice registration for the target user according to the identification, storing the voice to be recognized into the voice database of registered users.
A third aspect of the present invention provides a speech-based co-person recognition device comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the voice-based co-person recognition device to perform the steps of the voice-based co-person recognition method described above.
A fourth aspect of the invention provides a computer readable storage medium having instructions stored thereon which, when run on a computer, cause the computer to perform the steps of the above-described voice-based method of identification of a person.
According to the technical scheme provided by the invention, the mark parameter information in the voice to be recognized of the target user is acquired and subjected to parameter analysis, and the characteristic parameters of the voice to be recognized are extracted; the age bracket of the target user is determined based on the preset vector machine model and the characteristic parameters, and voice data corresponding to the age bracket is extracted from the voice database of preset registered users, so that the registered-user voice data needed for same-person recognition is selected in a targeted way and the voice recognition rate is improved; the voice data and the voice to be recognized are then respectively input into the preset deep convolutional neural network, the tone characteristic vectors are output, and whether the target user and the registered user are the same person is judged.
Drawings
FIG. 1 is a schematic diagram of a first embodiment of a voice-based method for identifying a person in an embodiment of the present invention;
FIG. 2 is a diagram of a second embodiment of a voice-based method of recognition of a person in an embodiment of the present invention;
FIG. 3 is a diagram of a third embodiment of a voice-based method of recognition of a person in an embodiment of the present invention;
FIG. 4 is a diagram of a fourth embodiment of a voice-based method of recognition of a person in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a voice-based co-identification apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another embodiment of a voice-based co-identification apparatus in accordance with an embodiment of the present invention;
FIG. 7 is a schematic diagram of an embodiment of a voice-based co-identification device in accordance with an embodiment of the invention.
Detailed Description
The embodiment of the invention provides a method, a device, equipment and a storage medium for identifying the same person based on voice. The mark parameter information in the voice to be recognized of a target user is acquired and subjected to parameter analysis, and the characteristic parameters of the voice to be recognized are extracted; the age bracket of the target user is determined based on a preset vector machine model and the characteristic parameters, and voice data corresponding to the age bracket is extracted from a voice database of preset registered users, so that the registered-user voice data needed for same-person recognition is selected in a targeted way and the voice recognition rate is improved; the voice data and the voice to be recognized are then respectively input into a preset deep convolutional neural network, tone characteristic vectors are output, and whether the target user and the registered user are the same person is judged.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, the following describes details of an embodiment of the present invention, referring to fig. 1, a first embodiment of a voice-based method for identifying a person in an embodiment of the present invention includes:
101, acquiring voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
In the embodiment of the invention, the voice-based same-person identification method can be applied to intelligent diagnosis and treatment and remote consultation. Based on medical big data, the medical platform collects the voice to be recognized of a target user in real time during a digital inquiry, uploads the collected voice to a server for storage, preprocesses the voice to be recognized when storing it, and extracts the mark parameter information from the preprocessed voice. The preprocessing includes framing, windowing, and pre-emphasis. Framing cuts the voice signal according to its short-time stationarity; the frame length is generally 20 ms and the frame shift is generally 10 ms. Windowing generally uses a Hamming or Hanning window: since the main-lobe width corresponds to frequency resolution (the wider the main lobe, the lower the resolution), the window function should concentrate energy in the main lobe, or keep the relative amplitude of the largest side lobe as small as possible. The Hamming window has larger side-lobe attenuation in its amplitude-frequency characteristic and can reduce the Gibbs effect, so it is generally selected for windowing the voice signal. Because the voice signal is affected by glottal excitation and oral-nasal radiation, frequency components above 800 Hz are attenuated at about 6 dB/octave, so the energy of the high-frequency part needs to be raised by pre-emphasis to compensate for the high-frequency loss; a first-order high-pass filter is generally adopted to realize pre-emphasis. In addition, the preprocessing may also include anti-aliasing filtering.
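A minimal sketch of this preprocessing chain, assuming NumPy and the common 0.97 pre-emphasis coefficient (which the patent does not specify):

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int = 8000) -> np.ndarray:
    """Pre-emphasis, 20 ms framing with 10 ms shift, Hamming windowing."""
    # First-order high-pass pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    frame_len = int(0.020 * sr)    # 20 ms frame length
    frame_shift = int(0.010 * sr)  # 10 ms frame shift
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift  # assumes len >= frame_len

    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    # Hamming window: strong side-lobe attenuation, reduces the Gibbs effect
    return frames * np.hamming(frame_len)
```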
In addition, the embodiment of the invention can acquire and process the related voice data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly covers computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
The server in the embodiment of the invention can be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms.
102, carrying out parameter analysis on the mark parameter information, and determining the format type and attribute information of the voice to be recognized;
Parameter analysis is carried out on each parameter in the mark parameter information; because the mark parameter information of different format types differs, the format type and attribute information of the voice to be recognized are determined according to the kinds of parameters it contains. The format types of voice include WAV, MP3 and OGG. These speech formats follow industry-unified standards: whatever format the recording device uses, the player must be able to decode it, otherwise the speech cannot be played normally. Specifically, the mark parameter information is input into a voice format recognition unit of the server to verify the file header, standard parameters, and the like of the WAV format; if it meets the WAV standard, the voice is judged to be in WAV format. Otherwise, the mark parameter information undergoes decoder/parameter verification against the MP3 format standard, and if it meets the MP3 standard, the voice is judged to be in MP3 format. Otherwise, it undergoes decoder/parameter verification against the OGG format standard, and if it meets the OGG standard, the voice is judged to be in OGG format; the determined format is returned to the server. If the format of the voice still cannot be recognized, a prompt is issued that the format is not supported.
In this embodiment, WAV is a sound file format developed by Microsoft that conforms to the RIFF (Resource Interchange File Format) specification and is used to save audio information resources on the Windows platform. The WAV header is a block of data at the beginning of the file that describes the body data. The WAV file header consists of four parts: the RIFF block (RIFF-Chunk), the format block (Format-Chunk), the fact block (Fact-Chunk), and the data block (Data-Chunk). The header contains 44 bytes of mark parameter information, including a 4-byte RIFF flag, a 4-byte file length, a 4-byte "WAVE" type block identifier, and so on; by checking this mark parameter information, it can be determined whether the voice file is a WAV.
MP3, short for MPEG Audio Layer 3, is an efficient computer audio coding scheme that converts audio files at a large compression ratio into smaller files with the extension .mp3 while substantially preserving the sound quality of the source file. An MP3 file is divided into three parts: ID3V2, audio data, and ID3V1. Parameter information such as the sampling rate and bit rate is recorded in the audio data, and whether the voice file is an MP3 can be determined by checking this mark parameter information.
OGG, short for Ogg Vorbis, is an audio compression format similar to music formats such as MP3. Decoding an OGG file yields a bit stream that begins with three packet headers, in file order: the identification header (Identification Header), the comment header (Comment Header), and the setup header (Setup Header). The identification header records the version and simple audio characteristics of the stream (such as sampling rate and number of channels); by checking these audio characteristics, it can be determined whether the voice to be recognized is in OGG format.
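A hedged sketch of such header-based identification; the byte signatures below ("RIFF"/"WAVE" for WAV, an "ID3" tag or the MPEG frame sync for MP3, "OggS" for OGG) are standard magic values, but a real recognition unit would verify many more fields:

```python
def detect_format(path: str) -> str:
    """Identify WAV / MP3 / OGG from the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(12)
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":  # RIFF flag + "WAVE" type block
        return "WAV"
    if head[:3] == b"ID3" or (len(head) >= 2 and head[0] == 0xFF
                              and (head[1] & 0xE0) == 0xE0):
        return "MP3"                                   # ID3V2 tag or MPEG frame sync
    if head[:4] == b"OggS":                            # OGG page capture pattern
        return "OGG"
    raise ValueError("unsupported voice format")       # the "format not supported" prompt
```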
103, carrying out format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after the format conversion;
According to the format type, the sampling rate, bit rate and sound channel in the attribute information of the voice to be recognized are extracted, and the sampling rate and bit rate are compared against the preset requirements, that is, it is judged whether they meet the preset requirements; if they already meet the preset requirements, no format conversion is needed. The preset requirements are a sampling rate of 8k and a bit rate of 16 bit, and the format conversion of the voice to be recognized consists of converting its sampling rate and bit rate into values meeting the preset requirements according to a preset conversion rule.
It is then judged whether the sound channel of the voice to be recognized is mono; if not, that is, the voice to be recognized is two-channel, it is converted into mono according to a preset channel conversion rule. Specifically, the server provides a JNI interface for channel separation implemented by a C++ program; parameters (left channel or right channel) are defined in the interface, the Java program calls the interface and passes in the left channel, and after the C++ program finishes processing, the two-channel voice is converted into mono and returned to the Java program.
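The patent does not name a conversion tool; as an illustrative sketch, the same 8k/16bit/mono target can be reached with a single ffmpeg call:

```python
import subprocess

def convert_to_target(src: str, dst: str) -> None:
    """Resample to 8 kHz, 16-bit PCM, mono, as the preset requirements demand."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "8000",           # target sampling rate: 8k
         "-acodec", "pcm_s16le",  # 16-bit PCM samples
         "-ac", "1",              # down-mix two channels to mono
         dst],
        check=True,
    )
```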
Characteristic parameters are extracted from the converted voice to be recognized, including time domain characteristic parameters and frequency domain characteristic parameters: the time domain characteristic parameters include the short-time zero-crossing rate, short-time energy spectrum, and pitch period, and the frequency domain characteristic parameters include linear prediction cepstral coefficients (LPCC) and Mel-frequency cepstral coefficients (MFCC).
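A sketch of extracting such parameters with librosa, an assumed library choice (the patent does not prescribe one):

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=8000, mono=True)  # assumed input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # frequency domain: MFCC
zcr = librosa.feature.zero_crossing_rate(y)             # time domain: short-time ZCR
energy = librosa.feature.rms(y=y) ** 2                  # time domain: short-time energy
features = np.vstack([mfcc, zcr, energy])               # one column of parameters per frame
```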
104, performing age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users;
Inputting the characteristic parameters into a preset vector machine model, carrying out parameter analysis on the characteristic parameters by the vector machine model, extracting age characteristic parameters from the characteristic parameters, carrying out age recognition on the voice to be recognized according to the age characteristic parameters, and obtaining a recognition result, namely determining an age bracket corresponding to the voice to be recognized. After determining the age group corresponding to the voice to be recognized, extracting voice data of the same age group from a voice database of a preset registered user. The voice database of the preset registered user comprises voice data of all registered users registered through the medical platform in the server.
After the age bracket of the target user corresponding to the voice to be recognized is determined, voice data corresponding to that age bracket is extracted from the voice database of preset registered users, that is, the voice data of registered users in the same age bracket is extracted from the voice database. The voice data and the voice to be recognized are each subjected to framing to obtain audio frames; the short-time energy of each audio frame is extracted, and it is judged whether the short-time energy is smaller than a preset energy threshold, where the short-time energy measures the intensity of the audio frame at different moments. If the short-time energy is smaller than the preset energy threshold, the corresponding audio frames are removed; that is, low-energy frames are filtered out of both the voice data and the voice to be recognized, improving the subsequent accuracy of the tone characteristic vectors.
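A minimal sketch of this frame filtering; the energy threshold value is a placeholder assumption:

```python
import numpy as np

def filter_low_energy_frames(frames: np.ndarray,
                             energy_threshold: float = 1e-4) -> np.ndarray:
    """Drop frames whose short-time energy falls below the preset threshold."""
    energy = np.sum(frames ** 2, axis=1)  # short-time energy per frame
    return frames[energy >= energy_threshold]
```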
105, respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the tone characteristic vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
The voice data and the voice to be recognized are respectively input into the deep convolutional neural network preset by the server; the network analyzes the voice and outputs a three-dimensional tone characteristic representation, which is mapped into a preset feature space where the tone characteristics are quantitatively represented, giving the tone characteristic vectors. The tone characteristic vector of the registered user and the tone characteristic vector corresponding to the voice to be recognized are compared in the feature space: when they are consistent, the target user corresponding to the voice to be recognized and the registered user are the same person; when they are inconsistent, they are not the same person. The preset deep convolutional neural network is trained in advance; its training process is prior art and will not be described in detail herein.
In this embodiment, the voice data of any two target users may also be obtained, and format conversion, age identification, and comparison of tone characteristic vectors may be performed on them to determine whether the two target users are the same person. That is, same-person identification may be performed between a target user to be identified and a registered user, or between two target users to be identified.
In the embodiment of the invention, format conversion and age recognition are performed on the voice to be recognized; then, according to the processing result, the voice data of registered users in the same age bracket as the voice to be recognized is selected in a targeted way for same-person recognition, which improves the voice recognition rate; tone characteristic vectors are then extracted from the voices for vector comparison, which improves the accuracy of same-person identification.
Referring to fig. 2, a second embodiment of a voice-based method for identifying a person in an embodiment of the present invention includes:
201, acquiring voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
202, carrying out parameter analysis on the mark parameter information, and determining the format type and attribute information of the voice to be recognized;
203, performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after format conversion;
204, performing dimension reduction and aggregation processing on the characteristic parameters to obtain age characteristic parameters;
In this embodiment, the dimension reduction uses a principal component analysis algorithm (PCA) to reduce the dimensionality of the characteristic-parameter data, and the aggregation uses a K-means clustering algorithm to aggregate the characteristic-parameter data; the age characteristic parameters are obtained after the characteristic parameters undergo dimension reduction and aggregation. The process of performing dimension reduction and aggregation on data belongs to the prior art and is not described herein.
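An illustrative sketch with scikit-learn's PCA and K-means; the component and cluster counts are assumptions, not values from the patent:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def age_feature_parameters(feature_matrix: np.ndarray) -> np.ndarray:
    """feature_matrix: one row of characteristic parameters per frame."""
    reduced = PCA(n_components=8).fit_transform(feature_matrix)  # dimension reduction
    kmeans = KMeans(n_clusters=4, n_init=10).fit(reduced)        # aggregation
    return kmeans.cluster_centers_.mean(axis=0)                  # pooled age characteristic vector
```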
205, performing age recognition on the voice to be recognized based on a preset vector machine model and the age characteristic parameters to obtain a recognition result;
The server is provided with a vector machine model. Following the support vector machine (Support Vector Machine) method, the age characteristic parameters are mapped into a high-dimensional or even infinite-dimensional feature space (a Hilbert space) through a nonlinear mapping φ; by applying the kernel function expansion theorem, and with almost no increase in computational complexity, the age bracket to which the voice to be recognized belongs is recognized according to the vector machine model and the age characteristic parameters, giving the recognition result.
206, comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence coefficient of the recognition result;
207, determining the age bracket of the target user according to the confidence level;
The recognition rate of each age bracket is set in the vector machine model; the recognition result is compared with these recognition rates and the confidence coefficient of the recognition result is calculated, that is, the confidence of the recognition result is analyzed according to the recognition rate. When the confidence value is not smaller than a preset confidence threshold, the recognition result is considered accurate, and the age bracket of the target user corresponding to the voice to be recognized can be determined.
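A hedged sketch of steps 205-207 using scikit-learn's kernel SVM, with predict_proba standing in for the patent's confidence computation; the age brackets, training data, and 0.8 threshold are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

# train_features / train_labels are assumed to exist: rows of age characteristic
# parameters labelled with age brackets such as "child", "youth", "elderly".
svm = SVC(kernel="rbf", probability=True)  # kernel trick: implicit high-dimensional mapping
svm.fit(train_features, train_labels)

# age_features is assumed to come from step 204 (dimension reduction + aggregation)
probs = svm.predict_proba(age_features.reshape(1, -1))[0]
confidence = probs.max()
if confidence >= 0.8:                      # assumed confidence threshold
    age_bracket = svm.classes_[probs.argmax()]
```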
208, extracting voice data corresponding to the age bracket from a voice database of preset registered users;
209, respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting corresponding tone characteristic vectors, comparing the tone characteristic vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
After the age bracket of the target user corresponding to the voice to be recognized is determined, the voice data corresponding to that age bracket, i.e. the voice data of registered users in the same age bracket, is extracted from the voice database of preset registered users. The voice data and the voice to be recognized are respectively input into the deep convolutional neural network preset by the server; the network analyzes the voice and outputs a three-dimensional tone characteristic representation, which is mapped into a preset feature space where the tone characteristics are quantitatively represented, giving the tone characteristic vectors. The tone characteristic vector of the registered user and the tone characteristic vector corresponding to the voice to be recognized are compared in the feature space: when they are consistent, the target user corresponding to the voice to be recognized and the registered user are the same person; when they are inconsistent, they are not. The preset deep convolutional neural network is trained in advance; its training process is prior art and will not be described in detail herein.
In the embodiment of the present invention, steps 201 to 203 are identical to steps 101 to 103 in the first embodiment of the voice-based person recognition method described above, and will not be described herein.
In the embodiment of the invention, the voice to be recognized undergoes format conversion and its characteristic parameters are extracted; the characteristic parameters undergo dimension reduction and aggregation to obtain the age characteristic parameters; age recognition is performed according to the age characteristic parameters, the confidence of the recognition result is calculated, and the age bracket of the voice to be recognized is determined. The voice data of registered users in the same age bracket as the voice to be recognized is thus selected in a targeted way for same-person recognition, improving the voice recognition rate.
Referring to fig. 3, a third embodiment of a voice-based method for identifying a person in an embodiment of the present invention includes:
301, acquiring voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
302, carrying out parameter analysis on the mark parameter information, and determining the format type and attribute information of the voice to be recognized;
303, performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after format conversion;
304, performing age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users;
305, respectively extracting voice data and voiceprint characteristics of the voice to be recognized based on a preset deep convolutional neural network;
In the present embodiment, a voiceprint feature is a feature representing the voice characteristics of a user, i.e., a speaker. Voiceprint features can be extracted by a preset neural network model, where the neural network model is a pretrained sequential deep convolutional neural network.
Specifically, according to a voice endpoint detection method (VAD) in the deep convolutional neural network, endpoint detection is performed on the voice data and the voice to be recognized respectively, dividing them into several segments of audio data of speaker speech, for example the audio data corresponding to the time segments 0-3 seconds, 4-7 seconds, and 7-10 seconds. Feature (embedding) extraction is then performed on the audio data, i.e., the voiceprint feature corresponding to each segment of audio data is extracted. A voiceprint feature can be regarded as a vector whose dimension can be set as desired, for example 128 or 512 dimensions; this vector characterizes the unique characteristics of the speaker, and audio data of different durations can all be reduced to a vector of fixed dimension. For example, the matrix corresponding to each segment of audio data may be input into the deep convolutional neural network: the voiceprint is a set of frequencies arranged in time order, that is, a two-dimensional time-frequency array, and the network outputs the corresponding fixed-dimension vector for each segment.
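A minimal PyTorch sketch of how a variable-length time-frequency matrix can be mapped to a fixed-dimension embedding; the patent's actual network architecture is not disclosed, so this one is an illustrative assumption. Adaptive pooling is what makes the output dimension independent of clip duration:

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size map regardless of clip duration
        )
        self.fc = nn.Linear(64 * 4 * 4, embed_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, time_frames) -- any number of time frames
        return self.fc(self.conv(spec).flatten(1))

net = EmbeddingNet()
emb = net(torch.randn(1, 1, 40, 300))  # a 3 s or a 10 s clip both map to (1, 512)
```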
306, clustering the voiceprint features to obtain tone feature vectors;
In this embodiment, the clustering may use a K-means clustering algorithm or spectral clustering, where K represents the number of categories and may be determined according to the number of speakers in the target speech data.
After the server extracts the corresponding voiceprint features, a matrix can be formed by the voiceprint features. Each row of the matrix may represent a voiceprint feature corresponding to a segment of audio data in the speech, where the voiceprint feature is a fixed dimension vector, and the duration of the audio data corresponding to each row may be different. For example, the first row of the matrix may represent vectors for 0-3 seconds, the second row may represent vectors for 4-7 seconds, the third row may represent vectors for 7-10 seconds, and so on. Clustering the matrix of the voiceprint features to obtain a clustering result of the voiceprint features corresponding to each section of audio data. After clustering, vectorizing the voiceprint features to obtain tone feature vectors.
307, calculating the similarity value of the tone characteristic vectors, and judging whether the similarity value is not smaller than a preset tone similarity threshold value;
308, if the similarity value is not less than the preset tone similarity threshold value, determining that the target user and the registered user are the same person.
Similarity calculation is performed between the tone characteristic vector of the target user corresponding to the voice to be recognized and the tone characteristic vector of the registered user, that is, the similarity value of the two tone characteristic vectors is calculated, and whether the two users are the same person is judged according to the similarity value. When the similarity value of the two tone characteristic vectors is not smaller than the preset tone similarity threshold, the target user and the registered user are determined to be the same person; when the similarity value is smaller than the threshold, they are determined not to be the same person.
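A minimal sketch of this comparison using cosine similarity, a common choice for speaker embeddings (the patent does not name the similarity measure, and 0.8 is an assumed threshold):

```python
import numpy as np

def is_same_person(vec_registered: np.ndarray, vec_query: np.ndarray,
                   threshold: float = 0.8) -> bool:
    """Cosine similarity between the two tone characteristic vectors."""
    sim = np.dot(vec_registered, vec_query) / (
        np.linalg.norm(vec_registered) * np.linalg.norm(vec_query))
    return sim >= threshold
```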
In the embodiment of the present invention, steps 301 to 304 are identical to steps 101 to 104 in the first embodiment of the above-mentioned voice-based identification method, and are not described herein.
In the embodiment of the invention, the voiceprint characteristics in the voices of the target user and the registered user are extracted, the voiceprint characteristics are clustered to obtain the tone characteristic vector, the similarity value of the tone characteristic vector is calculated, and whether the target user and the registered user are the same person or not is determined according to the comparison of the similarity value and the similarity threshold value, so that the accuracy of identifying the same person is improved.
Referring to fig. 4, a fourth embodiment of a voice-based method for identifying a person in an embodiment of the present invention includes:
401, acquiring voice to be recognized of a target user, and extracting mark parameter information of the voice to be recognized;
402, carrying out parameter analysis on the mark parameter information to determine the format type and attribute information of the voice to be recognized;
403, performing format conversion on the voice to be recognized according to the format type and the attribute information, and extracting characteristic parameters of the voice to be recognized after format conversion;
404, performing age recognition on the voice to be recognized based on a preset vector machine model and the characteristic parameters, determining the age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users;
405, respectively inputting the voice data and the voice to be recognized into a preset deep convolutional neural network, outputting tone characteristic vectors, comparing the tone characteristic vectors, and judging whether the target user and the registered user are the same person;
406, extracting frame voiceprint features of the voice to be recognized;
When the target user and the registered user are not the same person, the target user can register by voice to become a registered user; or, when they are the same person, the voice can be registered and added to the voice database of the registered user to enrich that user's voice data. In this embodiment, the voiceprint features can be Mel-frequency cepstral coefficients (MFCC). Specifically, the MFCC characteristic parameters of the voice to be recognized are extracted directly; because framing is performed automatically during MFCC extraction and an MFCC is obtained for each frame, the frame voiceprint feature corresponding to each frame of the voice to be recognized is obtained. Alternatively, the voice to be recognized is first sliced frame by frame and the MFCC features of each slice are extracted separately, likewise yielding the MFCC features, i.e., the frame voiceprint features, corresponding to each frame.
407, calculating posterior probabilities of the frame voiceprint features based on a preset time delay neural network;
In this embodiment, a universal background model (Universal Background Model, UBM) implemented with a time delay neural network (Time Delay Neural Network, TDNN), i.e., a TDNN-UBM model, is used to calculate the posterior probabilities. Specifically, each frame's voiceprint features are used as input, and the posterior probability corresponding to each frame of the voice to be recognized is obtained from the TDNN-UBM model.
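The TDNN-UBM itself is not specified in detail; as a hedged stand-in, the sketch below uses a classical GMM-UBM (scikit-learn's GaussianMixture), whose predict_proba likewise yields a per-frame posterior over components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# background_mfcc_frames is an assumed background corpus, shape (n_frames, 13);
# frame_mfccs holds the frame voiceprint features of the voice to be recognized.
ubm = GaussianMixture(n_components=64, covariance_type="diag")
ubm.fit(background_mfcc_frames)

posteriors = ubm.predict_proba(frame_mfccs)  # (n_frames, 64); each row sums to 1
```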
408, calculating one-hot values of the posterior probabilities;
409, classifying the frame voiceprint features according to the one-hot values, and identifying the frame voiceprint features according to the classification result;
The frame voiceprint features are classified according to the obtained posterior probabilities: the one-hot value of each posterior probability is calculated, the frame voiceprint features corresponding to the same one-hot value are grouped into the same class, and the one-hot value is recorded as the type identifier of the corresponding class; that is, the frame voiceprint features are identified according to their one-hot values.
In this embodiment, the collected voice to be recognized is classified, and the corresponding type identifiers are recorded for searching and matching in the subsequent voiceprint recognition process. The posterior probability corresponding to each frame segment is obtained by means of the TDNN-UBM model, and the frame segments are classified based on the posterior probabilities, thereby completing the convergence of the voice to be recognized and extracting its key features. Frame segments of the same type are then grouped into the same class to obtain more definite identification characteristics, providing more comprehensive verification for the subsequent recognition process and improving recognition accuracy. In addition, when classifying the frame voiceprint features based on the posterior probabilities, the classification standard can be generated by calculating the one-hot value of each posterior probability, which improves classification accuracy.
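A minimal sketch of the one-hot classification step, following on from the posteriors above (pure NumPy; the grouping structure is an illustrative assumption):

```python
import numpy as np

labels = posteriors.argmax(axis=1)             # winning component per frame
one_hot = np.eye(posteriors.shape[1])[labels]  # the one-hot value of each posterior

type_ids: dict[int, list[int]] = {}
for frame_idx, label in enumerate(labels):
    # frames sharing the same one-hot value fall into the same class;
    # the value doubles as that class's type identifier
    type_ids.setdefault(int(label), []).append(frame_idx)
```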
410, performing voice registration for the target user according to the identification, and storing the voice to be recognized into the voice database of registered users.
According to the type identifier, the user corresponding to the voice to be recognized performs voice registration: a 1:1 registration interface provided by the medical platform deployed on the server is called, parameters (the identification number of the registered user and the path of the voice) are defined in the registration interface, and registration is completed by calling it. When registration is completed, the voice used for voice registration is stored in the voice database of registered users. Note that the voice used at registration time is not the original voice to be recognized, but the 8k, 16 bit, WAV-format voice obtained after the above steps (same-person recognition processing, classification, and identification).
In the embodiment of the present invention, steps 401 to 405 are identical to steps 101 to 105 in the first embodiment of the above-mentioned voice-based identification method, and are not described herein.
In the embodiment of the invention, after same-person identification is finished, the frame voiceprint features of the target user's voice are extracted, classified, and identified, so that the target user's voice is registered and stored in the voice database to expand the registered users and voice data.
The above describes the voice-based same-person recognition method in the embodiment of the present invention; the following describes the voice-based same-person recognition device. Referring to fig. 5, an embodiment of the voice-based same-person recognition device in the embodiment of the present invention includes:
The obtaining module 501 is configured to obtain a voice to be recognized of a target user, and extract flag parameter information of the voice to be recognized;
The analysis module 502 is configured to perform parameter analysis on the flag parameter information, and determine a format type and attribute information of the voice to be recognized;
a conversion module 503, configured to perform format conversion on the speech to be recognized according to the format type and the attribute information, and extract feature parameters of the speech to be recognized after format conversion;
The recognition module 504 is configured to perform age recognition on the voice to be recognized based on a preset vector machine model and the feature parameters, determine an age bracket of the target user, and extract voice data corresponding to the age bracket from a voice database of a preset registered user;
And the comparison module 505 is configured to input the voice data and the voice to be recognized into a preset deep convolutional neural network, output corresponding tone characteristic vectors, and compare the tone characteristic vectors of the voice data and the voice to be recognized, so as to determine whether the target user and the registered user are the same person.
In the embodiment of the invention, the voice-based same-person recognition device performs format conversion and age recognition on the voice to be recognized, so that, according to the processing result, the voice data of registered users in the same age bracket as the voice to be recognized is selected in a targeted way for same-person recognition, improving the voice recognition rate; tone characteristic vectors are then extracted from the voices for vector comparison, improving the accuracy of same-person identification.
Referring to fig. 6, another embodiment of the voice-based same-person recognition device in the embodiment of the present invention includes:
The obtaining module 501 is configured to obtain the voice to be recognized of a target user, and extract flag parameter information of the voice to be recognized;
The analysis module 502 is configured to perform parameter analysis on the flag parameter information, and determine the format type and attribute information of the voice to be recognized;
The conversion module 503 is configured to perform format conversion on the voice to be recognized according to the format type and the attribute information, and extract feature parameters of the voice to be recognized after format conversion;
The recognition module 504 is configured to perform age recognition on the voice to be recognized based on a preset vector machine model and the feature parameters, determine the age bracket of the target user, and extract voice data corresponding to the age bracket from a voice database of preset registered users;
The comparison module 505 is configured to input the voice data and the voice to be recognized into a preset deep convolutional neural network respectively, output the corresponding tone feature vectors, and compare the tone feature vectors of the voice data and the voice to be recognized to determine whether the target user and the registered user are the same person.
Wherein the conversion module 503 includes:
a first extracting unit 5031, configured to extract the sampling rate, the bit rate, and the channel from the attribute information of the voice to be recognized according to the format type;
a judging unit 5032, configured to judge whether the sampling rate and the bit rate meet a preset requirement;
a first conversion unit 5033, configured to, if the sampling rate and the bit rate do not meet the preset requirement, convert them based on a preset conversion rule and judge whether the channel is mono;
a second conversion unit 5034, configured to, if the channel is not mono, convert the channel of the voice to be recognized into mono according to a preset channel conversion rule;
a second extracting unit 5035, configured to extract feature parameters of the voice to be recognized after format conversion, where the feature parameters include time-domain feature parameters and frequency-domain feature parameters.
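As a concrete illustration of this conversion flow, the following Python sketch normalizes an input file and then extracts simple time-domain and frequency-domain parameters. The 16 kHz sampling rate, the 16-bit depth, and the specific features are illustrative assumptions only; the patent speaks of preset requirements without fixing values, and the pydub and librosa libraries are one possible tooling choice rather than anything the patent prescribes.

import numpy as np
import librosa
from pydub import AudioSegment

TARGET_RATE = 16000   # assumed preset sampling-rate requirement
TARGET_WIDTH = 2      # assumed preset bit depth: 2 bytes = 16-bit

def normalize_audio(path, fmt):
    # Load according to the parsed format type, then enforce the presets.
    audio = AudioSegment.from_file(path, format=fmt)
    if audio.frame_rate != TARGET_RATE or audio.sample_width != TARGET_WIDTH:
        audio = audio.set_frame_rate(TARGET_RATE).set_sample_width(TARGET_WIDTH)
    if audio.channels != 1:
        audio = audio.set_channels(1)   # down-mix to mono
    samples = np.array(audio.get_array_of_samples(), dtype=np.float32)
    return samples / 32768.0            # scale 16-bit PCM into [-1, 1]

def extract_feature_parameters(y):
    zcr = librosa.feature.zero_crossing_rate(y)        # a time-domain parameter
    mfcc = librosa.feature.mfcc(y=y, sr=TARGET_RATE)   # frequency-domain parameters
    return np.concatenate([zcr.mean(axis=1), mfcc.mean(axis=1)])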
Wherein, the recognition module 504 is specifically configured to:
perform dimension reduction and aggregation processing on the feature parameters to obtain age feature parameters;
perform age recognition on the voice to be recognized based on a preset vector machine model and the age feature parameters to obtain a recognition result;
compare the recognition result with the recognition rate in the vector machine model, and calculate the confidence of the recognition result;
determine the age bracket of the target user according to the confidence;
extract voice data corresponding to the age bracket from a voice database of preset registered users.
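A minimal sketch of this age-recognition step, assuming principal component analysis for the dimension reduction and aggregation, a support vector classifier as the preset vector machine model, and a 0.6 confidence cut-off; none of these concrete choices are fixed by the patent.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def fit_age_model(train_features, train_brackets):
    # train_features: (n_samples, n_dims); train_brackets: one age-bracket label per sample
    pca = PCA(n_components=32).fit(train_features)        # dimension reduction / aggregation
    svm = SVC(probability=True).fit(pca.transform(train_features), train_brackets)
    return pca, svm

def predict_age_bracket(pca, svm, feature_params, min_confidence=0.6):
    reduced = pca.transform(feature_params.reshape(1, -1))
    probs = svm.predict_proba(reduced)[0]   # per-bracket confidence
    best = int(np.argmax(probs))
    # Reject the result when confidence falls below the (assumed) threshold.
    return svm.classes_[best] if probs[best] >= min_confidence else None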
Wherein the comparison module 505 includes:
a third extracting unit 5051, configured to extract voiceprint features of the voice data and of the voice to be recognized respectively, based on a preset deep convolutional neural network;
a clustering unit 5052, configured to perform clustering processing on the voiceprint features to obtain tone feature vectors;
a calculating unit 5053, configured to calculate the similarity value of the tone feature vectors, and judge whether the similarity value is not less than a preset tone similarity threshold;
a determining unit 5054, configured to determine that the target user and the registered user are the same person if the similarity value is not less than the preset tone similarity threshold.
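The patent does not name the similarity measure, but cosine similarity is a common choice for comparing embedding vectors; a sketch under that assumption, with an illustrative threshold of 0.8, follows.

import numpy as np

TONE_SIM_THRESHOLD = 0.8   # assumed value of the preset tone similarity threshold

def same_person(enrolled_vec, probe_vec):
    # Cosine similarity between the two tone feature vectors.
    sim = np.dot(enrolled_vec, probe_vec) / (
        np.linalg.norm(enrolled_vec) * np.linalg.norm(probe_vec))
    return sim >= TONE_SIM_THRESHOLD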
Wherein, the clustering unit 5052 is specifically configured to:
calculate a chroma feature value of the voiceprint features, and generate a voiceprint matrix according to the chroma feature value;
input the voiceprint features into the deep convolutional neural network, and output a tone feature representation;
map the tone feature representation into a preset feature space, and quantitatively characterize the tone feature representation according to the feature space to obtain a tone feature vector.
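The following sketch shows the chroma-based part of this step using librosa; mean pooling plus L2 normalization stands in for the network's mapping into the preset feature space, since the patent does not specify that architecture.

import numpy as np
import librosa

def tone_feature_vector(y, sr=16000):
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # voiceprint matrix of chroma values
    pooled = chroma.mean(axis=1)                       # aggregate across frames
    # Map onto the unit sphere as a stand-in for the feature-space mapping.
    return pooled / np.linalg.norm(pooled)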
Wherein, the voice-based same-person recognition device further includes a rejection module 506, which is specifically configured to:
perform framing processing on the voice data and the voice to be recognized respectively to obtain audio frames;
extract the short-time energy of each audio frame, and judge whether the short-time energy is less than a preset energy threshold;
if so, remove the corresponding audio frame.
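Short-time energy is simply the sum of squared samples within a frame, so the rejection step can be sketched in a few lines; the 25 ms frame length and the threshold value here are illustrative assumptions, not values from the patent.

import numpy as np

def drop_low_energy_frames(y, frame_len=400, energy_threshold=1e-4):
    # 400 samples = 25 ms at 16 kHz (assumed); trim to a whole number of frames.
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)   # short-time energy per frame
    return [frame for frame, e in zip(frames, energy) if e >= energy_threshold]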
Wherein, the voice-based same-person recognition device further includes a registration module 507, which is specifically configured to:
extract frame voiceprint features of the voice to be recognized;
calculate posterior probabilities of the frame voiceprint features based on a preset time-delay neural network;
calculate the one-hot value of the posterior probabilities;
classify the frame voiceprint features according to the one-hot value, and label the frame voiceprint features according to the classification result;
perform voice registration for the target user according to the labels, and store the voice to be recognized in the voice database of registered users.
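Turning a posterior distribution into a one-hot value is an argmax operation; the sketch below illustrates it on dummy posteriors, since the time-delay neural network itself is outside the scope of this example.

import numpy as np

def one_hot_values(posteriors):
    # posteriors: (n_frames, n_classes), each row a posterior distribution
    labels = posteriors.argmax(axis=1)
    one_hot = np.zeros_like(posteriors)
    one_hot[np.arange(len(labels)), labels] = 1.0
    return one_hot

# Example: 3 frames, 4 hypothetical voiceprint classes.
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.5, 0.2, 0.1],
                 [0.1, 0.1, 0.1, 0.7]])
print(one_hot_values(post))   # one row per frame, a single 1.0 per row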
In the embodiment of the invention, the voice-based same-person recognition device performs format conversion and age recognition on the voice, so that voices of registered users in the same age bracket as the target user are extracted for same-person comparison, improving both the voice recognition rate and the accuracy of same-person recognition.
Referring to fig. 7, an embodiment of the voice-based same-person recognition device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 7 is a schematic diagram of a voice-based same-person recognition device according to an embodiment of the present invention. The voice-based same-person recognition device 700 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732. The memory 720 and the storage medium 730 may be transitory or persistent storage. A program stored on the storage medium 730 may include one or more modules (not shown), and each module may include a series of instruction operations on the voice-based same-person recognition device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 to execute the series of instruction operations in the storage medium 730 on the voice-based same-person recognition device 700.
The voice-based same-person recognition device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will appreciate that the structure shown in fig. 7 does not limit the voice-based same-person recognition device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, where each block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The present invention also provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium, storing instructions that, when run on a computer, cause the computer to perform the steps of the voice-based same-person recognition method.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A voice-based same-person recognition method, characterized by comprising the following steps:
acquiring a voice to be recognized of a target user, and extracting flag parameter information of the voice to be recognized;
performing parameter analysis on the flag parameter information to determine the format type and attribute information of the voice to be recognized;
extracting the sampling rate, the bit rate, and the channel from the attribute information of the voice to be recognized according to the format type;
judging whether the sampling rate and the bit rate meet a preset requirement;
if the preset requirement is not met, converting the sampling rate and the bit rate based on a preset conversion rule, and judging whether the channel of the voice to be recognized is mono;
if the channel is not mono, converting the channel into mono according to a preset channel conversion rule;
extracting feature parameters of the voice to be recognized after format conversion, wherein the feature parameters include time-domain feature parameters and frequency-domain feature parameters;
performing age recognition on the voice to be recognized based on a preset vector machine model and the feature parameters, determining an age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users;
inputting the voice data and the voice to be recognized into a preset deep convolutional neural network respectively, outputting corresponding tone feature vectors, comparing the tone feature vectors of the voice data and the voice to be recognized, and judging whether the target user and the registered user are the same person.
2. The voice-based same-person recognition method according to claim 1, wherein the performing age recognition on the voice to be recognized based on a preset vector machine model and the feature parameters, determining an age bracket of the target user, and extracting voice data corresponding to the age bracket from a voice database of preset registered users comprises:
performing dimension reduction and aggregation processing on the feature parameters to obtain age feature parameters;
performing age recognition on the voice to be recognized based on the preset vector machine model and the age feature parameters to obtain a recognition result;
comparing the recognition result with the recognition rate in the vector machine model, and calculating the confidence of the recognition result;
determining the age bracket of the target user according to the confidence;
extracting the voice data corresponding to the age bracket from the voice database of preset registered users.
3. The voice-based same-person recognition method according to claim 2, wherein the inputting the voice data and the voice to be recognized into a preset deep convolutional neural network respectively, outputting tone feature vectors, comparing the tone feature vectors, and judging whether the target user and the registered user are the same person comprises:
extracting voiceprint features of the voice data and of the voice to be recognized respectively based on the preset deep convolutional neural network;
performing clustering processing on the voiceprint features to obtain tone feature vectors;
calculating the similarity value of the tone feature vectors, and judging whether the similarity value is not less than a preset tone similarity threshold;
if so, determining that the target user and the registered user are the same person.
4. The voice-based same-person recognition method according to claim 3, wherein the performing clustering processing on the voiceprint features to obtain tone feature vectors comprises:
calculating a chroma feature value of the voiceprint features, and generating a voiceprint matrix according to the chroma feature value;
inputting the voiceprint features into the deep convolutional neural network, and outputting a tone feature representation;
mapping the tone feature representation into a preset feature space, and quantitatively characterizing the tone feature representation according to the feature space to obtain a tone feature vector.
5. The voice-based same-person recognition method according to claim 4, further comprising, before the extracting voiceprint features of the voice data and of the voice to be recognized respectively based on the preset deep convolutional neural network:
performing framing processing on the voice data and the voice to be recognized respectively to obtain audio frames;
extracting the short-time energy of each audio frame, and judging whether the short-time energy is less than a preset energy threshold, wherein the short-time energy represents the intensity of the audio frame at different moments;
if so, removing the corresponding audio frame.
6. The voice-based same-person recognition method according to any one of claims 1 to 5, further comprising, after the inputting the voice data and the voice to be recognized into a preset deep convolutional neural network respectively, outputting tone feature vectors, comparing the tone feature vectors, and judging whether the target user and the registered user are the same person:
extracting frame voiceprint features of the voice to be recognized;
calculating posterior probabilities of the frame voiceprint features based on a preset time-delay neural network;
calculating the one-hot value of the posterior probabilities;
classifying the frame voiceprint features according to the one-hot value, and labeling the frame voiceprint features according to the classification result;
performing voice registration for the target user according to the labels, and storing the voice to be recognized in the voice database of registered users.
7. A voice-based same-person recognition apparatus, characterized in that the voice-based same-person recognition apparatus comprises:
an obtaining module, configured to acquire a voice to be recognized of a target user, and extract flag parameter information of the voice to be recognized;
an analysis module, configured to perform parameter analysis on the flag parameter information, and determine the format type and attribute information of the voice to be recognized;
a conversion module, configured to extract the sampling rate, the bit rate, and the channel from the attribute information of the voice to be recognized according to the format type; judge whether the sampling rate and the bit rate meet a preset requirement; if the preset requirement is not met, convert the sampling rate and the bit rate based on a preset conversion rule, and judge whether the channel of the voice to be recognized is mono; if the channel is not mono, convert the channel into mono according to a preset channel conversion rule; and extract feature parameters of the voice to be recognized after format conversion, wherein the feature parameters include time-domain feature parameters and frequency-domain feature parameters;
a recognition module, configured to perform age recognition on the voice to be recognized based on a preset vector machine model and the feature parameters, determine an age bracket of the target user, and extract voice data corresponding to the age bracket from a voice database of preset registered users;
a comparison module, configured to input the voice data and the voice to be recognized into a preset deep convolutional neural network respectively, output corresponding tone feature vectors, compare the tone feature vectors of the voice data and the voice to be recognized, and judge whether the target user and the registered user are the same person.
8. A voice-based same-person recognition device, characterized in that the voice-based same-person recognition device comprises:
a memory and at least one processor, the memory storing instructions, the memory and the at least one processor being interconnected by a line;
the at least one processor invoking the instructions in the memory to cause the voice-based same-person recognition device to perform the steps of the voice-based same-person recognition method according to any one of claims 1-6.
9. A computer-readable storage medium having instructions stored thereon, characterized in that the instructions, when executed by a processor, implement the steps of the voice-based same-person recognition method according to any one of claims 1-6.
CN202110836229.9A 2021-07-23 Method, device, equipment and storage medium for identifying same person based on voice Active CN113555022B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110836229.9A CN113555022B (en) 2021-07-23 Method, device, equipment and storage medium for identifying same person based on voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110836229.9A CN113555022B (en) 2021-07-23 Method, device, equipment and storage medium for identifying same person based on voice

Publications (2)

Publication Number Publication Date
CN113555022A (en) 2021-10-26
CN113555022B (en) 2024-11-12


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448756A (en) * 2018-11-14 2019-03-08 北京大生在线科技有限公司 A kind of voice age recognition methods and system
CN111326163A (en) * 2020-04-15 2020-06-23 厦门快商通科技股份有限公司 Voiceprint recognition method, device and equipment


Similar Documents

Publication Publication Date Title
CN111816218B (en) Voice endpoint detection method, device, equipment and storage medium
JP6906067B2 (en) How to build a voiceprint model, devices, computer devices, programs and storage media
Tiwari MFCC and its applications in speaker recognition
Patel et al. Speech recognition and verification using MFCC & VQ
Kumar et al. Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm
CN110767239A (en) Voiceprint recognition method, device and equipment based on deep learning
TW200816164A (en) Intelligent classification of sound signals with application and method
CN113436612B (en) Intention recognition method, device, equipment and storage medium based on voice data
CN112382300A (en) Voiceprint identification method, model training method, device, equipment and storage medium
WO2021217979A1 (en) Voiceprint recognition method and apparatus, and device and storage medium
CN116343751B (en) Voice translation-based audio analysis method and device
CN110767238B (en) Blacklist identification method, device, equipment and storage medium based on address information
Hu et al. Singer identification based on computational auditory scene analysis and missing feature methods
CN113555022B (en) Method, device, equipment and storage medium for identifying same person based on voice
CN111061909A (en) Method and device for classifying accompaniment
Chen et al. A robust feature extraction algorithm for audio fingerprinting
Mane et al. Identification & Detection System for Animals from their Vocalization
Balpande et al. Speaker recognition based on mel-frequency cepstral coefficients and vector quantization
CN110931020B (en) Voice detection method and device
CN113555022A (en) Voice-based same-person identification method, device, equipment and storage medium
CN106971725B (en) Voiceprint recognition method and system with priority
CN117475360B (en) Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN113903361B (en) Speech quality inspection method, device, equipment and storage medium based on artificial intelligence
CN115116431B (en) Audio generation method, device, equipment and storage medium based on intelligent reading kiosk
Nagajyothi et al. Voice recognition based on vector quantization using LBG

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant