CN114677634B - Face-signing identification method and apparatus, electronic device and storage medium - Google Patents
Face-signing identification method and apparatus, electronic device and storage medium
- Publication number
- CN114677634B (application CN202210595750.2A)
- Authority
- CN
- China
- Prior art keywords
- face
- user
- video data
- authentication
- audio data
- Prior art date
- Legal status: Active (listed status is an assumption, not a legal conclusion)
Classifications
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints (G—Physics; G06F—Electric digital data processing; G06F21/00—Security arrangements; G06F21/31—User authentication)
- G10L17/02—Speaker identification or verification techniques: preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
- G10L17/04—Speaker identification or verification techniques: training, enrolment or model building
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for comparison or discrimination
Abstract
The application provides a face-signing identification method, a face-signing identification apparatus, an electronic device, and a storage medium, relating to the technical field of human-computer interaction. The method comprises the following steps: receiving face-signing video data of a user and extracting target audio data from the face-signing video data; analyzing the face-signing video data to generate video data representing the process of the user answering questions; and obtaining an analysis result based on the target audio data, or on the target audio data together with the video data, and performing face-signing authentication on the user. By adopting the method provided by the embodiments of the application, the security of face-signing can be improved, and the user's face-signing experience can be improved.
Description
Technical Field
The application relates to the field of human-computer interaction, and in particular to a face-signing identification method and apparatus, an electronic device, and a storage medium.
Background
At present, intelligent video face-signing is increasingly widely applied, and replacing human customer service with AI technology introduces certain risks. Mainstream face-signing products currently determine whether the face-signing user is a real person signing in person by checking whether a person is in front of the camera, whether the person's face matches an identity photo via liveness analysis, and whether the speech-to-text transcription of the user's answer matches a known answer.
However, performing video face-signing in this manner still carries significant security problems. For example, the face in front of the camera may be fake: a fraudster may hold a photo, a video, or a printed face of another person in front of the camera. There may also be cases where a real person is in front of the camera but the speaker is not that person, where the voice answering the questions does not match the user's real age or gender, where the user's face is occluded, or where the voice collected by the face-signing product does not belong to the face-signing user. Face-signing therefore currently suffers from low security and poor user experience.
Disclosure of Invention
Embodiments of the present application provide a face-signing identification method and apparatus, an electronic device, and a storage medium, so as to improve the security of face-signing.
In a first aspect, an embodiment of the present application provides a face-signing identification method, including:
receiving face-signing video data of a user, and extracting target audio data from the face-signing video data;
analyzing the face-signing video data to generate video data representing the process of the user answering questions;
and obtaining an analysis result based on the target audio data, or on the target audio data together with the video data, and performing face-signing authentication on the user.
In this implementation, the user's face-signing video data can be analyzed: on top of recognizing the sound and images captured during face-signing, the user's audio data and face-signing images are further analyzed to judge whether the user exhibits fraudulent behavior during face-signing, so that the security of the user's face-signing can be improved. The specific analysis modes include: judging whether the user exhibits fraudulent behavior during face-signing based on the audio data alone, or based on an analysis of both the target audio data and the video data. Both analysis modes can effectively improve the security of video face-signing.
Optionally, extracting the target audio data from the face-signing video data may include:
segmenting the face-signing video data in the time domain according to the order in which the user answers questions, to obtain a plurality of segmented videos;
and performing audio-video separation on each segmented video to obtain the segmented audio data corresponding to each segmented video, and using the segmented audio data as the target audio data.
Optionally, after performing audio-video separation on each segmented video to obtain the segmented audio data corresponding to each segmented video, the method may further include:
performing text extraction on each piece of segmented audio data to obtain the answer text of the user answering the questions;
recognizing the answer text and judging whether any word from a preset warning word set appears in the answer text;
performing face-signing authentication on the user based on the target audio data then comprises: determining that the user fails face-signing authentication when a word from the preset warning word set appears in the answer text.
In this implementation, when analyzing the user's target audio data and video data, the user is pre-authenticated based on the preset warning words, and the subsequent face-signing steps are cancelled when the user fails pre-authentication, so that face-signing resources can be saved and face-signing efficiency improved.
Optionally, obtaining an analysis result based on the target audio data and the video data and performing face-signing authentication on the user includes:
sequentially using the images in the video data as input to a mouth state recognition model;
for each image, detecting whether a human face is present in the image, and when a human face is present, using the image as a target detection image;
using the target detection image as input to a face pose recognition model to obtain feature angles representing the user's face pose, wherein the feature angles comprise the values of the face pitch, yaw, and roll angles;
extracting the target user face from the target detection image in which the sum of the feature angles is smallest;
generating an analysis video of the user answering the questions according to the target user face and the answer text of the user answering the questions;
and judging whether the mouth shapes of the face in the face-signing video data and in the analysis video are consistent, and determining that the user passes face-signing authentication when they are consistent.
In this implementation, a simulated face representing the user answering the questions can be generated from the user's face-signing video; the simulated face is driven by the answer text extracted from the face-signing video data to generate a video simulating the user answering the questions, and the generated simulated video is compared with the user's face-signing video, enabling face-signing authentication of the user. Furthermore, by computing the three degree-of-freedom angles representing the user's head pose, the face image in which the user is most frontal to the camera is identified and used to generate the simulated face, which reduces the deviation of the simulated face-signing video and further improves the accuracy of face-signing authentication.
Optionally, the target audio data includes voiceprint features of the user, and extracting the target audio data from the face-signing video data includes: extracting the voiceprint feature of each question answered by the user from the face-signing video data when a single face-signing authentication of the user is to be performed;
performing face-signing authentication on the user based on the target audio data then comprises: taking the voiceprint feature of the user answering the first question as the reference voiceprint feature, taking the voiceprint features of the user answering the other questions as analysis voiceprint features, and comparing the reference voiceprint feature against each analysis voiceprint feature in turn to determine whether the user passes face-signing authentication.
Optionally, after taking the voiceprint feature of the user answering the first question as the reference voiceprint feature, taking the voiceprint features of the user answering the other questions as analysis voiceprint features, comparing the reference voiceprint feature against the analysis voiceprint features in turn, and determining whether the user passes face-signing authentication, the method further includes:
storing the reference voiceprint feature in a preset database;
when a new face-signing authentication is performed, confirming whether the current user is a new face-signing user, where a new face-signing user is one who has not previously participated in face-signing authentication;
when the current user is a new face-signing user, extracting multiple segments of audio data of the current user answering questions from the current user's face-signing video data, and obtaining the current user's voiceprint feature based on the multiple segments of audio data;
comparing the current user's voiceprint feature against the reference voiceprint features in the preset database and determining their similarity;
performing face-signing authentication on the user based on the target audio data then comprises: determining that the current user fails face-signing authentication when the similarity is higher than a preset threshold.
In this implementation, face-signing authentication can be tailored to different face-signing application scenarios, and the authentication steps are set adaptively according to the face-signing questions and the number of times the user has undergone face-signing, so that both face-signing efficiency and authentication accuracy can be improved.
Optionally, performing face-signing authentication on the user based on the target audio data may include:
using the target audio data as input to a trained audio recognition model, so as to obtain the user's age feature and gender feature simultaneously from the audio recognition model;
acquiring the user's registered age and gender information, comparing the age feature against the age information and the gender feature against the gender information to obtain an analysis result, and determining whether the user passes face-signing authentication based on the analysis result.
In this implementation, by constructing a single audio recognition model that identifies the user's age and gender simultaneously, the speech of the user answering questions can be recognized with one model, which increases recognition speed and reduces the resources used for model deployment. Moreover, since voice data is hard to obtain, multi-task learning allows the small samples of each task to be combined, improving the overall sample size and the recognition accuracy of the model.
In a second aspect, an embodiment of the present application provides a face-signing identification apparatus, including:
a data acquisition module, configured to receive face-signing video data of a user and extract target audio data from the face-signing video data;
an analysis module, configured to analyze the face-signing video data to generate video data representing the process of the user answering questions;
and an authentication module, configured to obtain an analysis result based on the target audio data, or on the target audio data together with the video data, and perform face-signing authentication on the user.
In this implementation, the user's face-signing video data can be analyzed: on top of recognizing the sound and images captured during face-signing, the user's audio data and face-signing images are further analyzed to judge whether the user exhibits fraudulent behavior during face-signing, so that the security of the user's face-signing can be improved. The specific analysis modes include: judging whether the user exhibits fraudulent behavior during face-signing based on the audio data alone, or based on an analysis of both the target audio data and the video data. Both analysis modes can effectively improve the security of face-signing.
Optionally, the data acquisition module may be configured to:
segment the face-signing video data in the time domain according to the order in which the user answers questions, to obtain a plurality of segmented videos; and perform audio-video separation on each segmented video to obtain the segmented audio data corresponding to each segmented video, using the segmented audio data as the target audio data.
Optionally, the face-signing identification apparatus may further include a text recognition module, configured to perform text extraction on each piece of segmented audio data to obtain the answer text of the user answering the questions, recognize the answer text, and judge whether any word from a preset warning word set appears in the answer text.
The authentication module may be specifically configured to determine that the user fails face-signing authentication when a word from the preset warning word set appears in the answer text.
In this implementation, when analyzing the user's target audio data and video data, the user is pre-authenticated based on the preset warning words, and the subsequent face-signing steps are cancelled when the user fails pre-authentication, so that face-signing resources can be saved and face-signing efficiency improved.
Optionally, the authentication module may be specifically configured to:
sequentially use the images in the video data as input to a mouth state recognition model; for each image, detect whether a human face is present, and when a human face is present, use the image as a target detection image; use the target detection image as input to a face pose recognition model to obtain feature angles representing the user's face pose, wherein the feature angles comprise the values of the face pitch, yaw, and roll angles; extract the target user face from the target detection image in which the sum of the feature angles is smallest; generate an analysis video of the user answering the questions according to the target user face and the answer text of the user answering the questions; and judge whether the mouth shapes of the face in the face-signing video data and in the analysis video are consistent, determining that the user passes face-signing authentication when they are consistent.
In this implementation, a simulated face representing the user answering the questions can be generated from the user's face-signing video; the simulated face is driven by the answer text extracted from the face-signing video data to generate a video simulating the user answering the questions, and the generated simulated video is compared with the user's face-signing video, enabling face-signing authentication of the user. Furthermore, by computing the three degree-of-freedom angles representing the user's head pose, the face image in which the user is most frontal to the camera is identified and used to generate the simulated face, which reduces the deviation of the simulated face-signing video and further improves the accuracy of face-signing authentication.
Optionally, the target audio data includes voiceprint features of the user, and the data acquisition module may be specifically configured to:
extract the voiceprint feature of each question answered by the user from the face-signing video data when a single face-signing authentication of the user is to be performed.
The authentication module may be specifically configured to:
take the voiceprint feature of the user answering the first question as the reference voiceprint feature, take the voiceprint features of the user answering the other questions as analysis voiceprint features, and compare the reference voiceprint feature against each analysis voiceprint feature in turn to determine whether the user passes face-signing authentication.
Optionally, the data acquisition module may further be specifically configured to:
store the reference voiceprint feature in a preset database; when a new face-signing authentication is performed, confirm whether the current user is a new face-signing user, where a new face-signing user is one who has not previously participated in face-signing authentication; when the current user is a new face-signing user, extract multiple segments of audio data of the current user answering questions from the current user's face-signing video data, and obtain the current user's voiceprint feature based on the multiple segments of audio data; and compare the current user's voiceprint feature against the reference voiceprint features in the preset database, determining their similarity.
The authentication module may be specifically configured to:
determine that the current user fails face-signing authentication when the similarity is higher than a preset threshold.
In this implementation, face-signing authentication can be tailored to different face-signing application scenarios, and the authentication steps are set adaptively according to the face-signing questions and the number of times the user has undergone face-signing, so that both face-signing efficiency and authentication accuracy can be improved.
Optionally, the authentication module may further be configured to:
use the target audio data as input to a trained audio recognition model, so as to obtain the user's age feature and gender feature simultaneously from the audio recognition model; acquire the user's registered age and gender information, compare the age feature against the age information and the gender feature against the gender information to obtain an analysis result, and determine whether the user passes face-signing authentication based on the analysis result.
In this implementation, by constructing a single audio recognition model that identifies the user's age and gender simultaneously, the speech of the user answering questions can be recognized with one model, which increases recognition speed and reduces the resources used for model deployment. Moreover, since voice data is hard to obtain, multi-task learning allows the small samples of each task to be combined, improving the overall sample size and the recognition accuracy of the model.
In a third aspect, an embodiment of the present application provides an electronic device comprising a memory and a processor, wherein the memory stores program instructions, and the processor, when reading and executing the program instructions, performs the steps in any one of the foregoing implementations.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium having computer program instructions stored therein, wherein the computer program instructions, when read and executed by a processor, perform the steps in any one of the foregoing implementations.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of the steps of a face-signing identification method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the steps of extracting target audio data from user face-signing video data according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the steps of performing face-signing authentication on a user based on configured keywords according to an embodiment of the present application;
Fig. 4 is a schematic diagram of the steps of analyzing target audio data and video data according to an embodiment of the present application;
Fig. 5 is a schematic diagram of the steps of performing face-signing authentication on a user based on voiceprint features according to an embodiment of the present application;
Fig. 6 is a schematic diagram of the authentication steps when multiple face-signings must be performed for a user according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the steps of performing face-signing authentication on a user according to gender and age according to an embodiment of the present application;
Fig. 8 is a schematic view of a face-signing identification apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. For example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
During research, the applicant found that in unmanned video face-signing a user may commit fraud while answering questions, for example: the user's mouth does not move while answering; there is no voice when answering; multiple speakers are present when answering; the voice does not match the user's gender; the voiceprint is inconsistent across answers; or the voice age does not match the user's real age. Such situations are generally treated as user fraud.
At present, no face-signing product on the market performs fraud recognition on the audio and video of a user answering questions, and none can judge whether the user answers the questions in person, so the current face-signing process suffers from low security and degraded user experience.
On this basis, the embodiments of the present application analyze the audio and video of the user answering questions, so as to determine whether fraud occurs and whether the user answers the questions in person. Referring to fig. 1, fig. 1 is a schematic diagram of the steps of a face-signing identification method according to an embodiment of the present application; the steps may include:
in step S11, the user' S label video data is received, and the target audio data is extracted from the label video data.
The video data of the face sign can be videos recorded when a user faces to a face and carries out problem confirmation, face sign videos of a call between an AI virtual person and the user videos, or video data obtained by the user through self-service self-shooting and face sign; the surface sign can be a procedure of paying the cost required by the loan from the user to the loan bank and carrying out the interview and the signature, can be applied to the video recording scene of various businesses requiring the uploading of standardized documents, such as trusted payment, borrower subscription and the like, and can also be applied to the risk prompt of large-amount and high-risk passenger groups of common consumption loan and operation products and other scenes requiring the user to confirm information and store audio and video.
The target audio data comprise voice content representing the user answering the countersign questions or voice characteristics of the user, and can be extracted from the countersign video data based on audio-video separation.
For example, the user may perform the face-to-face authentication through a mobile terminal, or may perform the face-to-face authentication through a fixed terminal set by a bank or other institution, where the mobile terminal may be an electronic device with a networking function, and the electronic device may be a configurator of an engineering device, a mobile phone, a tablet computer, a personal digital assistant, or a dedicated face-to-face terminal. The fixed terminal may be a computer, server, etc. The mobile terminal and the fixed terminal can be provided with a camera or can be externally connected with the camera, the camera can be a camera, and the camera is used for collecting the face label video data of a user.
In step S12, the face-signing video data is parsed to generate video data representing the process of the user answering questions.
For example, the video data may be fed to a mouth state recognition model to determine whether the user is answering the questions; the training and application of the mouth state recognition model are explained below.
In step S13, face-signing authentication is performed on the user based on the analysis result obtained from the target audio data and the video data.
Methods of analyzing the target audio data and the video data may include, but are not limited to: generating a simulated portrait of the user based on the video data, driving the simulated portrait to speak the text content corresponding to the target audio data, and comparing the mouth shape of the simulated portrait with the mouth shape of the user; or performing lip reading on the video data, performing text recognition on the target audio data, and comparing the text obtained by lip reading with the text recognized from the target audio data.
In the embodiment that compares the mouth shape of the simulated portrait with that of the user, the comparison may be performed by extracting mouth key points from both, computing the mouth-area difference between corresponding key points, and judging whether the area difference is smaller than a preset threshold so as to decide whether the user passes face-signing authentication. In the embodiment that compares the lip-read text with the recognized text, the comparison may be keyword matching between the texts, or computing the correlation between the two texts and deciding whether the user passes face-signing authentication based on the correlation.
In this way, the embodiments of the present application can analyze the user's face-signing video data and, on top of recognizing the sound and images captured during face-signing, further analyze the user's audio data and face-signing images to judge whether the user commits fraud during face-signing, thereby improving the security of face-signing.
In an optional embodiment, for step S11, an embodiment of the present application provides an implementation of extracting target audio data from user face-signing video data. Referring to fig. 2, fig. 2 is a schematic diagram of the steps of extracting target audio data from user face-signing video data; the steps may include:
In step S21, the face-signing video data is segmented in the time domain according to the order in which the user answers questions, so as to obtain a plurality of segmented videos.
Illustratively, the face-signing video data may be segmented using each moment the face-signing system sends a question to the user as a marker; the face-signing system may be a system running on the fixed or mobile terminal used for face-signing authentication. The order in which the user answers may follow the order of the questions themselves or of the question types; for example, the face-signing system may send the user an instruction to answer a question, or send text or speech prompting the user to read aloud.
In step S22, audio-video separation is performed on each segmented video to obtain the segmented audio data corresponding to each segmented video, and the segmented audio data is used as the target audio data.
The audio and video may be separated by extracting the audio track of each segmented video with video editing software, or by directly extracting the full audio track of the face-signing data to obtain the total audio data and then segmenting it at each time point at which the face-signing system sent a question to the user, yielding the segmented audio data corresponding to each segmented video.
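As a hedged illustration only, not the patent's prescribed implementation: the sketch below cuts face-signing video at question timestamps and writes out the audio of each segment using the moviepy library (v1 API). The file names and the timestamp list are hypothetical stand-ins for values the face-signing system would log.

```python
from moviepy.editor import VideoFileClip

def split_audio_segments(video_path, question_times):
    """Cut the face-signing video at each question timestamp and return
    the paths of the per-question audio files.

    question_times: list of (start_sec, end_sec) tuples, one per question
    (assumed to be recorded by the face-signing system)."""
    clip = VideoFileClip(video_path)
    audio_paths = []
    for i, (start, end) in enumerate(question_times):
        segment = clip.subclip(start, end)        # segmented video
        out = f"answer_{i}.wav"
        segment.audio.write_audiofile(out)        # audio-video separation
        audio_paths.append(out)
    return audio_paths

# Hypothetical usage with three logged question windows:
# paths = split_audio_segments("face_signing.mp4", [(5, 15), (20, 32), (40, 55)])
```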
In an optional embodiment, after the segmented audio data corresponding to each segmented video is obtained in step S22, the embodiments of the present application further provide an implementation of performing face-signing authentication on the user based on configured keywords. Referring to fig. 3, fig. 3 is a schematic diagram of the steps of performing face-signing authentication on a user based on configured keywords; the steps may include:
In step S31, text extraction is performed on each piece of segmented audio data to obtain the answer text of the user answering the questions.
In step S32, the answer text is recognized, and it is judged whether any word from a preset warning word set appears in the answer text.
The preset warning word set can be configured for the specific face-signing scenario. For example, when the application scenario is a bank processing a user's housing loan, warning words such as "intermediary", "black-market operation", and "reselling" may be placed in the preset warning word set.
In step S33, when a word from the preset warning word set appears in the answer text, it is determined that the user fails face-signing authentication.
Specifically, the answer text may first be word-segmented, for example with the jieba segmentation tool for Python; the resulting tokens are filtered against a stop-word dictionary, and the filtered tokens are compared with the words in the preset warning word set to determine whether any match exists. If a match exists, the user can be directly determined to have failed face-signing authentication, and the remaining face-signing steps are skipped.
In this way, when analyzing the user's target audio data and video data, the embodiments of the present application pre-authenticate the user based on the preset warning words and cancel the subsequent face-signing steps when pre-authentication fails, which saves face-signing resources and improves face-signing efficiency.
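A minimal sketch of the warning-word pre-authentication, assuming a jieba-segmented Chinese answer text; the stop-word list and warning-word set shown are illustrative, not drawn from the patent.

```python
import jieba  # Chinese word segmentation tool named in the description

STOP_WORDS = {"的", "了", "是", "我"}       # hypothetical stop-word dictionary
WARNING_WORDS = {"中介", "黑产", "转手"}     # hypothetical preset warning word set

def passes_pre_authentication(answer_text: str) -> bool:
    """Return False (authentication fails) if any warning word
    survives segmentation and stop-word filtering."""
    tokens = jieba.lcut(answer_text)                       # word segmentation
    filtered = [t for t in tokens if t not in STOP_WORDS]  # stop-word filtering
    return not (set(filtered) & WARNING_WORDS)

# Hypothetical pipeline hook:
# if not passes_pre_authentication(answer_text):
#     cancel_remaining_face_signing_steps()  # placeholder function
```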
In an alternative embodiment, for step S13, an embodiment of the present application provides an implementation of analyzing the target audio data and the video data. Referring to fig. 4, fig. 4 is a schematic diagram of the steps of analyzing target audio data and video data; the steps may include:
In step S41, the images in the video data are sequentially used as input to a mouth state recognition model; for each image, it is detected whether a human face is present, and when a human face is present, the image is used as a target detection image.
Specifically, the video of the user answering questions can be decomposed into an image sequence, the image array run through a face detection model, and the images containing a face stored as new video data to serve as input for the face pose recognition model.
The mouth state recognition model can be an 8-layer AlexNet-style network. To improve the recognition rate of the network, face detection can be performed before open-mouth recognition, with the lower half of the detected face used as the user's mouth-shape recognition region and fed into the mouth recognition model.
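An illustrative sketch only: OpenCV's stock Haar cascade stands in for the unspecified face detector, and the lower half of each detected face box is cropped as the mouth region, as described above; feeding the crop to the mouth state network is left as a placeholder.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_region(frame):
    """Detect a face and return the lower half of its bounding box,
    which serves as the mouth-shape recognition region; None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return frame[y + h // 2 : y + h, x : x + w]   # lower half of the face

# Each cropped region would then be resized and fed to the 8-layer
# mouth state recognition network (not shown here).
```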
Even if the user's mouth moves, the user may merely be opening the mouth and mouthing without sound while another person answers the question. It is therefore necessary to judge whether the user's mouth shape is aligned with the mouth shape implied by the answer text (which can be obtained through speech recognition): if aligned, the user can be confirmed to be answering in person; if not, fraud is indicated.
In step S42, the target detection image is used as input to a face pose recognition model to obtain the feature angles representing the user's face pose, wherein the feature angles comprise the values of the face pitch, yaw, and roll angles.
Here the human head may be modeled as a rigid object. Under this assumption, the pose of the head is limited to three degrees of freedom (DOF): pitch, yaw, and roll. Head pose estimation for the user can therefore be performed over these three degrees of freedom.
In step S43, the target user face is extracted from the target detection image in which the sum of the feature angles is smallest.
The relative angle of the face to the camera can be judged from the sum of the pitch, yaw, and roll angles: the face in the target detection image with the smallest sum is considered the most frontal and is taken as the target user face. The term "target user face" is used because a face can be extracted from every target detection image, but its angle to the camera may not be frontal; the selected face is so named to distinguish it from the other extracted faces.
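A hedged sketch of selecting the most frontal frame: `estimate_pose` is a hypothetical stand-in for the face pose recognition model, returning (pitch, yaw, roll) in degrees. Summing absolute angle values and taking the minimum follows the selection rule described above.

```python
from typing import Callable, List, Tuple
import numpy as np

def most_frontal_face(
    face_images: List[np.ndarray],
    estimate_pose: Callable[[np.ndarray], Tuple[float, float, float]],
) -> np.ndarray:
    """Return the face image whose summed pose angles are smallest,
    i.e. the face most frontal to the camera."""
    def angle_sum(img):
        pitch, yaw, roll = estimate_pose(img)     # three head-pose DOF
        return abs(pitch) + abs(yaw) + abs(roll)
    return min(face_images, key=angle_sum)

# Hypothetical usage with detected faces and some pose model:
# target_user_face = most_frontal_face(detected_faces, pose_model.predict)
```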
In step S44, an analysis video of the user answering the questions is generated based on the target user face and the answer text of the user answering the questions.
The text and the face can be encoded into text features and face features, an LSTM-based decoding structure can generate image frames from these features, and the generated frames are synthesized into a video; the text thus drives the target user's face to speak, producing the analysis video.
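A skeletal, illustrative PyTorch module only: the patent names no architecture beyond encoding text and face features and decoding frames with an LSTM, so every layer size and name below is an assumption, and a practical talking-face generator would be considerably more elaborate.

```python
import torch
import torch.nn as nn

class TalkingFaceGenerator(nn.Module):
    """Encode a face image and a text-feature sequence, then decode a
    frame sequence with an LSTM, as sketched in the description."""
    def __init__(self, text_dim=128, face_dim=256, hidden=512,
                 frame_pixels=64 * 64 * 3):
        super().__init__()
        self.face_encoder = nn.Sequential(            # face -> face feature
            nn.Flatten(), nn.Linear(frame_pixels, face_dim), nn.ReLU())
        self.decoder = nn.LSTM(text_dim + face_dim, hidden, batch_first=True)
        self.to_frame = nn.Linear(hidden, frame_pixels)  # hidden -> frame

    def forward(self, face_img, text_feats):
        # face_img: (B, 3, 64, 64); text_feats: (B, T, text_dim)
        face_feat = self.face_encoder(face_img)                   # (B, face_dim)
        face_seq = face_feat.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        out, _ = self.decoder(torch.cat([text_feats, face_seq], dim=-1))
        return torch.sigmoid(self.to_frame(out))                  # (B, T, frame_pixels)
```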
In step S45, it is judged whether the mouth shape of the face in the face-signing video data matches that in the analysis video; when they match, it is determined that the user passes face-signing authentication.
Illustratively, mouth key points may be extracted from the original video and from the generated speaking video respectively, the mouth-area difference between corresponding key points computed, and the difference compared with a preset mouth-area difference threshold. If the difference stays below the threshold, the mouth shapes are consistent and the user is indeed speaking; otherwise the user may be suspected of face-signing fraud.
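A hedged sketch of the mouth-area comparison: `mouth_landmarks` is a hypothetical landmark extractor returning the mouth contour points in a fixed order, the polygon area is computed with the shoelace formula, and the threshold value is illustrative.

```python
import numpy as np

def polygon_area(points: np.ndarray) -> float:
    """Shoelace formula over an ordered (N, 2) array of mouth key points."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def mouths_consistent(orig_frames, synth_frames, mouth_landmarks, threshold=0.15):
    """Compare per-frame mouth areas between the face-signing video and the
    generated analysis video; consistent if the mean relative difference
    stays below the threshold (values here are assumptions)."""
    diffs = []
    for a, b in zip(orig_frames, synth_frames):
        area_a = polygon_area(mouth_landmarks(a))
        area_b = polygon_area(mouth_landmarks(b))
        diffs.append(abs(area_a - area_b) / max(area_a, area_b, 1e-6))
    return float(np.mean(diffs)) < threshold
```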
In this way, the embodiments of the present application can generate a simulated face representing the user answering the questions from the user's face-signing video, drive the simulated face with the answer text extracted from the face-signing video data to generate a video simulating the user's answers, and compare the generated simulated video with the user's face-signing video, enabling face-signing authentication. Furthermore, by computing the three degree-of-freedom angles of the user's head pose, the face image most frontal to the camera is identified and used to generate the simulated face, reducing the deviation of the simulated video and further improving the accuracy of face-signing authentication.
During research, the applicant also found that voice, like the human face, is distinctive: each person's voiceprint differs. In a video face-signing scenario where AI replaces a human agent, the several questions in one face-signing session may be answered by several different people, each with a different voice. Even with manual face-signing, because different agents handle different sessions, one person may impersonate multiple users across multiple sessions; since information is not synchronized between agents and no voiceprint analysis is performed, such fraud can succeed.
Such fraud requires voiceprint analysis to identify. If the same user's voiceprint is inconsistent across the answers within one face-signing session, fraud is certain; and if the same voice appears across multiple face-signing sessions, a fraudulent user is likely impersonating multiple users for video face-signing. Therefore, to prevent voiceprint fraud, the embodiments of the present application further provide an implementation of performing face-signing authentication on the user based on voiceprint features.
Referring to fig. 5, fig. 5 is a schematic diagram of the steps of performing face-signing authentication on a user based on voiceprint features. The target audio data may further include the user's voiceprint features, and the implementation may include the following steps:
in step S51, upon determining that the user has been subjected to one face-to-face authentication, a voiceprint feature of the user answering each question is extracted from the face-to-face video data.
In step S52, the voiceprint feature of the user answering the first question is used as a reference voiceprint feature, the voiceprint feature of the user answering other questions is used as an analysis voiceprint feature, and the reference voiceprint feature and the analysis voiceprint feature are sequentially analyzed to determine whether the user passes the face-to-face authentication.
The accuracy of the voiceprint recognition is easily affected by different lengths of voices and text contents, the recognition accuracy is higher when the voices are longer, and the recognition accuracy is also higher if the voices are generated by the same text. In the face-to-face scene, answers are mostly fixed answers, for example, answer-type questions enable a user to answer yes, or reading-type questions enable the user to read for a while, for example, "i know that the loan is only used for consumption loan", in order to improve the accuracy of voiceprint recognition, different kinds of questions can be processed differently.
In some application scenarios, only one face-to-face signature authentication needs to be performed on the user, and if the questions are answer-type questions and the user needs to answer a plurality of questions, voiceprint features are extracted from the questions, and pairwise analysis is performed. In this embodiment, the first question may be the first question answered by the user, or may be a question selected from a plurality of questions by other means, such as the question answered by the user for the longest time, the last question answered by the user, and the like.
When audio data analysis is performed, similarity calculation may be performed on a user voice of an answer question as a reference voice to extract a reference voiceprint feature and a reference voiceprint feature extracted from a subsequent voice, so as to determine whether fraud exists.
If an answer-type question and a reading-type question simultaneously appear in the one-time face-signing scene, if the user reads a text and then answers the question, the speech of the utterance may be read as a reference sound, from which a reference voiceprint feature is extracted, if the subsequent reading text exists or exists, the voiceprint features extracted from the voice of the subsequent reading text can be analyzed with the basic voiceprint features to judge whether fraud exists or not, if there are subsequently answered questions that are spoken with short speech, long spoken text, such cross-text analysis can be subject to significant errors, where the sound of answering questions cannot be analyzed directly, therefore, after the voices of all answering questions are spliced, a long voice is formed, voiceprint features are extracted from the long voice, similarity calculation is carried out on the voiceprint features and the voiceprint features, and whether the voices of the same person are the voices or not is determined.
If the questions are answered first and then the texts are read aloud, the voice of a first answer question can be used as reference voice to extract reference voiceprint features, similarity calculation is carried out on the extracted voiceprint features and the voiceprint features extracted from the voice of subsequent answer questions, the answer questions are consistent, if the aloud texts exist, the voices of all answer questions can be spliced after the questions are asked by the user, and then the voices are analyzed with the voiceprint features extracted from the aloud questions, so that whether the user has a fraud suspicion or not is determined.
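A hedged sketch of the splice-and-compare logic: `embed_voice` is a hypothetical voiceprint extractor returning a fixed-length embedding (any speaker-embedding model could fill this role), and cosine similarity is one common choice of similarity function, assumed here along with the threshold.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(reference_wave, answer_waves, embed_voice, threshold=0.75):
    """Splice all short answer utterances into one long waveform, extract
    one voiceprint from it, and compare with the reference voiceprint;
    the threshold value is illustrative."""
    spliced = np.concatenate(answer_waves)        # long utterance from short answers
    ref_emb = embed_voice(reference_wave)         # reference voiceprint feature
    ans_emb = embed_voice(spliced)                # analysis voiceprint feature
    return cosine_similarity(ref_emb, ans_emb) >= threshold
```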
In some possible application scenarios, the user must undergo face-signing authentication multiple times. Therefore, after step S52, an embodiment of the present application provides an implementation of authentication when multiple face-signings are needed. Referring to fig. 6, fig. 6 is a schematic diagram of the authentication steps when multiple face-signings must be performed for a user; the steps may include:
in step S61, the reference voiceprint feature is stored in a preset database.
The database may run on a data storage terminal, which may be a single storage device, a storage array composed of multiple storage devices such as a Redundant Array of Independent Disks (RAID), a single server, or a server cluster composed of several servers.
In step S62, when a new face-signing authentication is performed, it is confirmed whether the current user is a new face-signing user, i.e. one who has not previously participated in face-signing authentication.
In step S63, when the current user is a new face-signing user, multiple segments of audio data of the current user answering questions are extracted from the current user's face-signing video data, and the current user's voiceprint feature is obtained based on the multiple segments of audio data.
In step S64, the current user's voiceprint feature is compared against the reference voiceprint features in the preset database, the similarity between them is determined, and when the similarity is higher than a preset threshold, it is determined that the current user fails face-signing authentication.
At the end of each face-signing session, if the current user is confirmed to be a new face-signing user, all voices in that user's session are spliced, a voiceprint feature is extracted from the spliced speech, and similarity computation is performed against the voiceprint data in the database. If the similarity exceeds a preset warning value, such as 80% or 90%, the user can be confirmed to be committing fraud; otherwise no match exists, and the voiceprint feature can be stored in the database.
Illustratively, the voiceprint feature extraction algorithm may include: preparing samples, each sample being a user's sound waveform plus a user ID, with at least two voice segments per user; preprocessing the voice data to remove the influence of silence and noise and to align the voices; extracting Mel spectrogram/Fbank features from the preprocessed data and converting them into spectrograms; designing a neural network over the spectrogram data and training the model; and taking the last-layer features of the model as the user's voiceprint feature, with a suitable similarity function chosen for voiceprint comparison.
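A minimal, hedged sketch of the feature front end using librosa (the patent names no library): it loads a waveform, trims leading and trailing silence, and computes the log-Mel spectrogram a speaker-embedding network would consume. All parameter values are assumptions.

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Waveform -> silence-trimmed log-Mel spectrogram of shape (n_mels, frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=30)     # crude silence removal
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)

# The spectrogram would then be fed to the trained network, whose
# last-layer activations serve as the voiceprint feature.
```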
In this way, the embodiments of the present application can perform face-signing authentication adapted to different face-signing application scenarios, setting the authentication steps according to the face-signing questions and the number of times the user has undergone face-signing, so that both face-signing efficiency and authentication accuracy can be improved.
Further, during the research process, the applicant also finds that the voices are different between men and women, the voices of women and men have greater recognition, the voices can also be distinguished from the ages, and the voices of the elderly, middle-aged people and young people also have different meanings, so that the embodiment of the present application further provides an implementation manner of performing face-tag authentication on the user according to the gender and the age, please refer to fig. 7, and fig. 7 is a schematic diagram of steps of performing face-tag authentication on the user according to the gender and the age provided by the embodiment of the present application. The step of authenticating the face-to-face for gender and age may include:
in step S71, the target audio data is used as an input of the trained audio recognition model to simultaneously obtain the age feature and the gender feature of the user based on the audio recognition model.
In step S72, age information and gender information of the user are obtained, the age characteristic and the age information are analyzed, the gender information and the gender characteristic are analyzed, the analysis result is obtained, and it is determined whether the user passes through face-to-face authentication based on the analysis result.
In order to improve the efficiency of face-to-face recognition, the embodiment of the invention does not train the model for recognition of gender and age independently, but provides a mode of using multi-task learning, and gender and age are recognized simultaneously through one model.
Converting the voice of the user answering the questions into a spectrogram, wherein the specific steps can comprise that a time domain voice waveform is subjected to Fourier transform to obtain a frequency domain spectrogram; constructing a multi-task learning network based on the spectrogram, wherein a backbone network of the network can be realized by using resnet34 and is marked as main _ resnet 34;
aiming at gender identification, after a main network main _ resnet34, a gender identification classification network is established and is marked as sex _ net, the network can be a full connection of a layer 2, a loss function can be marked as loss _ sex, and cross entropy is used as the loss function; for age identification, after a main network main _ resnet34, an age identification regression network is established, denoted as age _ net, which may be a 2-layer fully connected network, and a loss function denoted as loss _ age, using cross entropy as a loss function;
The model is trained with a loss that is a weighted sum of the gender loss and the age loss; the weights may both be 1, meaning the two tasks are equally important, i.e. loss = loss_sex + loss_age. After training, voice-based age and gender prediction can be performed by feeding in a segment of the user's speech answering a face-sign question; if the predicted age and gender do not match the user's real age and gender, fraud may be present. In addition, since the age prediction error may be relatively large, an age error threshold may be set, and the user may be considered to pass face-sign authentication when the difference between the predicted age and the real age is smaller than that threshold.
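The following PyTorch sketch mirrors this description: a shared resnet34 backbone (main_resnet34) feeding a 2-layer gender classification head (sex_net) and a 2-layer age head (age_net), with loss = loss_sex + loss_age. It is an illustrative assumption, not the patented implementation; in particular, the age head is shown as a regression trained with MSE, whereas the text names cross entropy.

```python
# Hypothetical multi-task sketch: shared backbone, two heads, summed losses.
import torch
import torch.nn as nn
from torchvision.models import resnet34

class FaceSignVoiceNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=None)
        # Spectrograms are single-channel "images"; adapt the first conv.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Identity()            # expose 512-d features
        self.main_resnet34 = backbone
        self.sex_net = nn.Sequential(nn.Linear(512, 128), nn.ReLU(),
                                     nn.Linear(128, 2))  # gender logits
        self.age_net = nn.Sequential(nn.Linear(512, 128), nn.ReLU(),
                                     nn.Linear(128, 1))  # age estimate

    def forward(self, spec):                   # spec: (B, 1, H, W)
        feat = self.main_resnet34(spec)
        return self.sex_net(feat), self.age_net(feat).squeeze(-1)

def total_loss(sex_logits, age_pred, sex_label, age_label):
    loss_sex = nn.functional.cross_entropy(sex_logits, sex_label)
    loss_age = nn.functional.mse_loss(age_pred, age_label)  # assumption: MSE
    return loss_sex + loss_age                 # equal weighting, as above
```

A predicted (gender, age) pair can then be compared with the user's registered gender and age, flagging possible fraud when the gender differs or the age gap exceeds the error threshold.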
Therefore, the present application can build a single audio recognition model that recognizes the user's age and gender at the same time and apply it to the speech of the user answering questions, which speeds up model inference and reduces the resources needed for model deployment. Moreover, voice data is hard to collect; the multi-task setup lets the small sample sets of each task be pooled, increasing the model's overall sample size and its recognition accuracy.
Based on the same inventive concept, the embodiment of the present application further provides a face-sign identification apparatus 80; please refer to fig. 8, which is a schematic diagram of the face-sign identification apparatus provided in the embodiment of the present application. The face-sign identification apparatus 80 may include:
The data acquisition module 81, configured to receive the face-sign video data of a user and extract target audio data from the face-sign video data.
The analysis module 82, configured to analyze the face-sign video data to generate video data representing the process of the user answering questions.
The authentication module 83, configured to obtain an analysis result based on the target audio data and the video data, and perform face-sign authentication on the user.
Optionally, the data acquisition module 81 may be configured to:
segment the face-sign video data in the time domain according to the order in which the user answers questions, obtaining a plurality of segmented videos; and perform audio-video separation on each segmented video to obtain the segmented audio data corresponding to it, taking the segmented audio data as the target audio data.
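A hedged sketch of this segmentation and audio-video separation step, assuming the ffmpeg command-line tool is installed and that the question start/end times are known from the face-sign flow:

```python
# Minimal sketch: cut each answer segment and split off its audio track
# as 16 kHz mono WAV. Assumes the ffmpeg CLI is on PATH.
import subprocess

def split_segment_audio(video_path, boundaries, out_pattern="answer_{:02d}.wav"):
    """boundaries: list of (start_sec, end_sec), one per answered question."""
    outputs = []
    for i, (start, end) in enumerate(boundaries):
        out = out_pattern.format(i)
        subprocess.run([
            "ffmpeg", "-y", "-i", video_path,
            "-ss", str(start), "-to", str(end),  # time-domain segmentation
            "-vn",                               # drop video: audio-video separation
            "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
            out,
        ], check=True)
        outputs.append(out)
    return outputs
```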
Optionally, the face-sign identification apparatus 80 may further include a text recognition module, configured to perform text extraction on each piece of segmented audio data to obtain the answer text of the user answering the questions, and to scan the answer text to judge whether any word from a preset warning-word set appears in it.
The authentication module 83 may be specifically configured to determine that the user fails face-sign authentication when a word from the preset warning-word set appears in the answer text.
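A minimal sketch of that check, with an illustrative (hypothetical) warning-word set:

```python
# The answer text comes from speech-to-text on each audio segment.
# The warning-word set below is an example only, not from the embodiment.
WARNING_WORDS = {"not me", "on behalf of", "didn't apply"}

def fails_warning_word_check(answer_text: str) -> bool:
    """True if any preset warning word appears; the user then fails
    face-sign authentication, as described above."""
    text = answer_text.lower()
    return any(w in text for w in WARNING_WORDS)
```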
Optionally, the authentication module 83 may be specifically configured to:
take the images in the video data in turn as the input of a mouth state recognition model; for each image, detect whether a face is present, and when one is, take the image as a target detection image; feed the target detection image to a face pose recognition model to obtain the characteristic angles representing the user's face pose, where the characteristic angles comprise the values of the face pitch, yaw, and roll angles; extract the target user face from the target detection image whose sum of characteristic angles is smallest; generate an analysis video of the user answering the question from the target user face and the answer text of the user answering the question; and judge whether the mouth shapes of the face in the face-sign video data and in the analysis video are consistent, determining that the user passes face-sign authentication when they are.
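The frame-selection part of this pipeline can be sketched as follows, assuming OpenCV for decoding and a caller-supplied, hypothetical estimate_pose(frame) helper that returns (pitch, yaw, roll) in degrees, or None when no face is detected:

```python
# Pick the frame whose pose angles sum to the minimum, i.e. the most
# frontal face, as the target detection image described above.
import cv2

def most_frontal_frame(video_path, estimate_pose):
    cap = cv2.VideoCapture(video_path)
    best_frame, best_score = None, float("inf")
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pose = estimate_pose(frame)        # None if no face in this frame
        if pose is None:
            continue                       # skip frames without a face
        pitch, yaw, roll = pose
        score = abs(pitch) + abs(yaw) + abs(roll)  # sum of characteristic angles
        if score < best_score:
            best_frame, best_score = frame, score
    cap.release()
    return best_frame
```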
Optionally, the target audio data includes the voiceprint features of the user, and the data acquisition module 81 may be specifically configured to:
extract, when face-sign authentication of the user is determined, the voiceprint feature of each question answered by the user from the face-sign video data.
The authentication module 83 may be specifically configured to:
take the voiceprint feature of the user answering the first question as the reference voiceprint feature and the voiceprint features of the user answering the other questions as analysis voiceprint features, and compare the reference voiceprint feature with each analysis voiceprint feature in turn to determine whether the user passes face-sign authentication.
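A sketch of that sequential comparison, reusing cosine_similarity from the voiceprint sketch above; the 0.75 threshold is an assumed value, not one taken from the embodiment:

```python
# First answer's voiceprint is the reference; every later answer must stay
# above a similarity floor, otherwise the speaker may have changed mid-session.
def same_speaker_throughout(voiceprints, threshold=0.75):
    reference, rest = voiceprints[0], voiceprints[1:]
    return all(cosine_similarity(reference, vp) >= threshold for vp in rest)
```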
Optionally, the data acquisition module 81 may be further specifically configured to:
store the reference voiceprint feature in a preset database; when a new face-sign authentication is performed, confirm whether the current user is a new face-sign user, a new face-sign user being one who has not participated in face-sign authentication before; when the current user is a new face-sign user, extract multiple segments of audio data of the current user answering questions from the current user's face-sign video data, and obtain the current user's voiceprint feature from those segments; and compare the current user's voiceprint feature with the reference voiceprint features in the preset database to determine their similarity.
The authentication module 83 may be specifically configured to:
determine that the current user fails face-sign authentication when the similarity is higher than a preset threshold.
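A sketch of that database check, again with an assumed threshold. Note the direction of the test is inverted relative to the in-session check: for a supposedly new user, high similarity to any stored reference voiceprint suggests the same voice appearing under another identity, so authentication fails.

```python
# The in-memory list stands in for the preset database of reference
# voiceprints; cosine_similarity is from the earlier voiceprint sketch.
def is_repeat_voice(new_voiceprint, reference_db, threshold=0.75):
    return any(cosine_similarity(new_voiceprint, ref) >= threshold
               for ref in reference_db)
```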
Optionally, the authentication module 83 may also be configured to:
take the target audio data as the input of the trained audio recognition model so as to obtain the user's age feature and gender feature simultaneously from the model; and obtain the user's age information and gender information, analyze the age feature against the age information and the gender feature against the gender information to obtain an analysis result, and determine whether the user passes face-sign authentication based on the analysis result.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device comprising a memory and a processor, where the memory stores program instructions and the processor, upon reading and executing those instructions, performs the steps of any of the implementations above.
Based on the same inventive concept, embodiments of the present application further provide a computer-readable storage medium storing computer program instructions which, when read and executed by a processor, perform the steps of any of the implementations above.
The computer-readable storage medium may be a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), or any other medium capable of storing program code. The storage medium stores a program, and the processor executes the program after receiving an execution instruction.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
Alternatively, all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part as a computer program product. The computer program product includes one or more computer instructions which, when loaded and executed on a computer, produce in whole or in part the processes or functions described in accordance with the embodiments of the invention.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.).
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (9)
1. A face-sign identification method, comprising:
receiving face-sign video data of a user, and extracting target audio data from the face-sign video data;
analyzing the face-sign video data to generate video data representing the process of the user answering questions; and
obtaining an analysis result based on the target audio data and the video data, and performing face-sign authentication on the user;
wherein the obtaining an analysis result based on the target audio data and the video data and performing face-sign authentication on the user comprises:
taking the images in the video data in turn as the input of a mouth state recognition model;
for each image, detecting whether a face is present in the image, and when a face is present, taking the image as a target detection image;
taking the target detection image as the input of a face pose recognition model to obtain characteristic angles representing the face pose of the user, wherein the characteristic angles comprise the values of the face pitch angle, yaw angle and roll angle;
extracting a target user face from the target detection image whose sum of characteristic angles is smallest;
generating an analysis video of the user answering the question according to the target user face and the answer text of the user answering the question; and
judging whether the mouth shapes of the face in the face-sign video data and in the analysis video are consistent, and determining that the user passes face-sign authentication when they are consistent.
2. The method of claim 1, wherein the extracting target audio data from the face-sign video data comprises:
segmenting the face-sign video data in the time domain according to the order in which the user answers questions, to obtain a plurality of segmented videos; and
performing audio-video separation on each segmented video to obtain the segmented audio data corresponding to each segmented video, and taking the segmented audio data as the target audio data.
3. The method of claim 2, wherein after the audio-video separation is performed on each segmented video to obtain the segmented audio data corresponding to each segmented video, the method further comprises:
performing text extraction on each piece of segmented audio data to obtain the answer text of the user answering the questions; and
scanning the answer text to judge whether any word from a preset warning-word set appears in the answer text;
and wherein the performing face-sign authentication on the user based on the target audio data comprises:
determining that the user fails face-sign authentication when a word from the preset warning-word set appears in the answer text.
4. The method of claim 1, wherein the target audio data comprises voiceprint features of the user;
the extracting target audio data from the face-sign video data comprises:
extracting the voiceprint feature of each question answered by the user from the face-sign video data when face-sign authentication of the user is determined;
and the performing face-sign authentication on the user based on the target audio data comprises:
taking the voiceprint feature of the user answering the first question as a reference voiceprint feature and the voiceprint features of the user answering the other questions as analysis voiceprint features, and comparing the reference voiceprint feature with each analysis voiceprint feature in turn to determine whether the user passes face-sign authentication.
5. The method of claim 4, wherein after the voiceprint feature of the user answering the first question is taken as the reference voiceprint feature, the voiceprint features of the user answering the other questions are taken as analysis voiceprint features, and the reference voiceprint feature and the analysis voiceprint features are compared in turn to determine whether the user passes face-sign authentication, the method further comprises:
storing the reference voiceprint feature in a preset database;
when a new face-sign authentication is performed, determining whether the current user is a new face-sign user, wherein a new face-sign user is one who has not participated in face-sign authentication before;
when the current user is a new face-sign user, extracting multiple segments of audio data of the current user answering questions from the current user's face-sign video data, and obtaining the current user's voiceprint feature based on the multiple segments of audio data; and
comparing the current user's voiceprint feature with the reference voiceprint features in the preset database to determine the similarity between them;
and wherein the performing face-sign authentication on the user based on the target audio data comprises:
determining that the current user fails face-sign authentication when the similarity is higher than a preset threshold.
6. The method of claim 1, wherein the performing face-sign authentication on the user based on the target audio data comprises:
taking the target audio data as the input of a trained audio recognition model so as to obtain the age feature and gender feature of the user simultaneously from the audio recognition model; and
obtaining the age information and gender information of the user, analyzing the age feature against the age information and the gender feature against the gender information to obtain an analysis result, and determining whether the user passes face-sign authentication based on the analysis result.
7. A face-sign identification apparatus, comprising:
a data acquisition module, configured to receive face-sign video data of a user and extract target audio data from the face-sign video data;
an analysis module, configured to analyze the face-sign video data to generate video data representing the process of the user answering questions; and
an authentication module, configured to obtain an analysis result based on the target audio data and the video data, and perform face-sign authentication on the user;
wherein the authentication module is specifically configured to: take the images in the video data in turn as the input of a mouth state recognition model; for each image, detect whether a face is present, and when one is, take the image as a target detection image; take the target detection image as the input of a face pose recognition model to obtain characteristic angles representing the face pose of the user, wherein the characteristic angles comprise the values of the face pitch angle, yaw angle and roll angle; extract a target user face from the target detection image whose sum of characteristic angles is smallest; generate an analysis video of the user answering the question according to the target user face and the answer text of the user answering the question; judge whether the mouth shapes of the face in the face-sign video data and in the analysis video are consistent; and determine that the user passes face-sign authentication when they are consistent.
8. An electronic device, comprising a memory and a processor, wherein the memory stores program instructions and the processor, when executing the program instructions, performs the steps of the method of any one of claims 1-6.
9. A computer-readable storage medium, having stored thereon computer program instructions which, when read and executed by a processor, perform the steps of the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210595750.2A (CN114677634B) | 2022-05-30 | 2022-05-30 | Surface label identification method and device, electronic equipment and storage medium
Publications (2)
Publication Number | Publication Date
---|---
CN114677634A | 2022-06-28
CN114677634B | 2022-09-27
Family
ID=82080187
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---
CN202210595750.2A (granted as CN114677634B, active) | Surface label identification method and device, electronic equipment and storage medium | 2022-05-30 | 2022-05-30
Country Status (1)
Country | Link |
---|---
CN | CN114677634B
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115471327A (en) * | 2022-11-02 | 2022-12-13 | 平安银行股份有限公司 | Remote face signing method and device for banking transaction and computer storage medium |
CN116112630B (en) * | 2023-04-04 | 2023-06-23 | 成都新希望金融信息有限公司 | Intelligent video face tag switching method |
Citations (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105119872A (en) * | 2015-02-13 | 2015-12-02 | 腾讯科技(深圳)有限公司 | Identity verification method, client, and service platform |
CN106341380A (en) * | 2015-10-15 | 2017-01-18 | 收付宝科技有限公司 | Method, device and system for performing remote identity authentication on user |
WO2017198014A1 (en) * | 2016-05-19 | 2017-11-23 | 阿里巴巴集团控股有限公司 | Identity authentication method and apparatus |
CN107464115A (en) * | 2017-07-20 | 2017-12-12 | 北京小米移动软件有限公司 | personal characteristic information verification method and device |
CN107809608A (en) * | 2016-08-24 | 2018-03-16 | 方正国际软件(北京)有限公司 | A kind of generation method and device of digital signature video |
CN108399395A (en) * | 2018-03-13 | 2018-08-14 | 成都数智凌云科技有限公司 | The compound identity identifying method of voice and face based on end-to-end deep neural network |
CN108717663A (en) * | 2018-05-18 | 2018-10-30 | 深圳壹账通智能科技有限公司 | Face label fraud judgment method, device, equipment and medium based on micro- expression |
CN109510806A (en) * | 2017-09-15 | 2019-03-22 | 阿里巴巴集团控股有限公司 | Method for authenticating and device |
CN110136698A (en) * | 2019-04-11 | 2019-08-16 | 北京百度网讯科技有限公司 | For determining the method, apparatus, equipment and storage medium of nozzle type |
CN110400251A (en) * | 2019-06-13 | 2019-11-01 | 深圳追一科技有限公司 | Method for processing video frequency, device, terminal device and storage medium |
CN110533288A (en) * | 2019-07-23 | 2019-12-03 | 平安科技(深圳)有限公司 | Business handling process detection method, device, computer equipment and storage medium |
CN110555330A (en) * | 2018-05-30 | 2019-12-10 | 百度在线网络技术(北京)有限公司 | image surface signing method and device, computer equipment and storage medium |
CN111429143A (en) * | 2019-01-10 | 2020-07-17 | 上海小蚁科技有限公司 | Transfer method, device, storage medium and terminal based on voiceprint recognition |
CN111598686A (en) * | 2020-07-21 | 2020-08-28 | 成都新希望金融信息有限公司 | Video surface signing method and system based on intelligent face recognition |
CN112288398A (en) * | 2020-10-29 | 2021-01-29 | 平安信托有限责任公司 | Surface label verification method and device, computer equipment and storage medium |
CN112788269A (en) * | 2020-12-30 | 2021-05-11 | 未鲲(上海)科技服务有限公司 | Video processing method, device, server and storage medium |
CN112818316A (en) * | 2021-03-08 | 2021-05-18 | 南京大正智能科技有限公司 | Voiceprint-based identity recognition and application method, device and equipment |
CN113032758A (en) * | 2021-03-26 | 2021-06-25 | 平安银行股份有限公司 | Video question-answer flow identity identification method, device, equipment and storage medium |
CN113903338A (en) * | 2021-10-18 | 2022-01-07 | 深圳追一科技有限公司 | Surface labeling method and device, electronic equipment and storage medium |
CN114090989A (en) * | 2021-11-03 | 2022-02-25 | 支付宝(杭州)信息技术有限公司 | Identity authentication method, system and device |
CN114155460A (en) * | 2021-11-29 | 2022-03-08 | 平安科技(深圳)有限公司 | Method and device for identifying user type, computer equipment and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8897500B2 (en) * | 2011-05-05 | 2014-11-25 | At&T Intellectual Property I, L.P. | System and method for dynamic facial features for speaker recognition |
KR101971697B1 (en) * | 2012-02-24 | 2019-04-23 | 삼성전자주식회사 | Method and apparatus for authenticating user using hybrid biometrics information in a user device |
CN106203369A (en) * | 2016-07-18 | 2016-12-07 | 三峡大学 | Active stochastic and dynamic for anti-counterfeiting recognition of face instructs generation system |
CN110413841A (en) * | 2019-06-13 | 2019-11-05 | 深圳追一科技有限公司 | Polymorphic exchange method, device, system, electronic equipment and storage medium |
CN113707124A (en) * | 2021-08-30 | 2021-11-26 | 平安银行股份有限公司 | Linkage broadcasting method and device of voice operation, electronic equipment and storage medium |
CN114245204B (en) * | 2021-12-15 | 2023-04-07 | 平安银行股份有限公司 | Video surface signing method and device based on artificial intelligence, electronic equipment and medium |
Non-Patent Citations (3)
Title
---
Raahat Devender Singh et al. Video content authentication techniques: a comprehensive survey. Multimedia Systems, 2017.
Lin Xiaoqin et al. Identity authentication for remote interviews based on voice comparison. Journal of East China Normal University (Natural Science), 2020-11-25, pp. 164-171.
Yang Longsheng et al. Lip reading recognition for reliable identity authentication. Video Engineering, 2018-10-05, Vol. 42, No. 10, pp. 88-91.
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10249304B2 (en) | Method and system for using conversational biometrics and speaker identification/verification to filter voice streams | |
US20210327431A1 (en) | 'liveness' detection system | |
CN114677634B (en) | Surface label identification method and device, electronic equipment and storage medium | |
US20190259388A1 (en) | Speech-to-text generation using video-speech matching from a primary speaker | |
CN110956966B (en) | Voiceprint authentication method, voiceprint authentication device, voiceprint authentication medium and electronic equipment | |
US20210350346A1 (en) | System and method for using passive multifactor authentication to provide access to secure services | |
Sahoo et al. | Emotion recognition from audio-visual data using rule based decision level fusion | |
US20220318349A1 (en) | Liveness detection using audio-visual inconsistencies | |
CN110136726A (en) | A kind of estimation method, device, system and the storage medium of voice gender | |
Mamyrbayev et al. | Development of security systems using DNN and i & x-vector classifiers | |
CN116935889B (en) | Audio category determining method and device, electronic equipment and storage medium | |
CN109817223A (en) | Phoneme marking method and device based on audio fingerprints | |
Saleema et al. | Voice biometrics: the promising future of authentication in the internet of things | |
CN111785280B (en) | Identity authentication method and device, storage medium and electronic equipment | |
CN108831230B (en) | Learning interaction method capable of automatically tracking learning content and intelligent desk lamp | |
Hazen et al. | Multimodal face and speaker identification for mobile devices | |
Cotter | Laboratory exercises for an undergraduate biometric signal processing course | |
US20240346850A1 (en) | Method and system for performing video-based automatic identity verification | |
Pawade et al. | Voice Based Authentication Using Mel-Frequency Cepstral Coefficients and Gaussian Mixture Model | |
CN118447855A (en) | Speaker role recognition method and device based on double-recording system | |
JP2000148187A (en) | Speaker recognizing method, device using the method and program recording medium therefor | |
Stewart et al. | LIVENESS'DETECTION SYSTEM | |
Singh | Home automation framework through Voice recognition System for home security | |
CN114973058A (en) | Surface labeling method and device, electronic equipment and storage medium | |
Lerato | Hierachical methods for large population speaker identification using telephone speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |