CN114490947A - Dialog service method, device, server and medium based on artificial intelligence - Google Patents
- Publication number
- CN114490947A, CN202210142940.9A, CN202210142940A
- Authority
- CN
- China
- Prior art keywords
- current
- state
- user
- data
- sound data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/3344—Query execution using natural language analysis
- G06F40/30—Handling natural language data: semantic analysis
- G06N3/044—Neural networks: recurrent networks, e.g. Hopfield networks
- G06N3/08—Neural networks: learning methods
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/26—Speech to text systems
- G10L25/63—Speech or voice analysis for estimating an emotional state
- G10L2015/225—Feedback of the input speech
Abstract
The application is applicable to the field of artificial intelligence, and particularly relates to a dialog service method, device, server and medium based on artificial intelligence. The method comprises: acquiring the current sound data and current facial data of a user when a dialog service is triggered; performing audio analysis and semantic analysis on the current sound data to determine a current tone state; performing expression analysis on the current facial data to determine a current expression state; when the current tone state is a first target state and the current expression state is a second target state, extracting a question sentence from the current sound data; and inputting the question sentence into a trained matching model to obtain a reply sentence corresponding to the question sentence. The user's intention is identified by combining tone and expression, and the corresponding reply sentence is generated only when the states corresponding to both tone and expression satisfy the conditions, so that the conversation is not continued when the user shows dissatisfaction or other negative intentions; the user's intention can therefore be grasped accurately, and the service quality of the dialog service is improved.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a dialog service method, apparatus, server, and medium based on artificial intelligence.
Background
Currently, a user can obtain related information or services through an online service platform with an interactive function. When a user consults through such an online service platform, the platform generally uses artificial intelligence to analyze the questions raised by the user and to automatically and intelligently generate reply sentences, which improves the reply efficiency of the platform. However, artificial intelligence cannot completely replace manual handling of all questions, and semantic analysis of the sentences alone cannot accurately represent the user's intention, so the user is easily left dissatisfied with the platform's replies, which affects the user experience and leads to complaints. Therefore, in order to improve service quality, how to accurately identify the user's intention has become a problem to be solved urgently.
Disclosure of Invention
In view of this, embodiments of the present application provide a dialog service method, device, server and medium based on artificial intelligence, so as to solve the problem in the prior art that the user's intention is not identified accurately, which affects the service quality.
In a first aspect, an embodiment of the present application provides a dialog service method based on artificial intelligence, where the dialog service method includes:
when a user triggers a conversation service, acquiring current sound data and current face data of the user;
performing audio analysis and semantic analysis on the current sound data to determine a corresponding current tone state, performing expression analysis on the current facial data to determine a corresponding current expression state;
when the current tone state is a first target state and the current expression state is a second target state, extracting question sentences in the current sound data, wherein the first target state represents that the user does not have negative tone, and the second target state represents that the user does not have negative expression;
inputting the question sentence into the trained matching model to obtain a reply sentence corresponding to the question sentence.
In a second aspect, an embodiment of the present application provides a dialog service device based on artificial intelligence, where the dialog service device includes:
the data acquisition module is used for acquiring current voice data and current face data of a user when the user triggers a conversation service;
the data analysis module is used for carrying out audio analysis and semantic analysis on the current sound data, determining a corresponding current tone state, carrying out expression analysis on the current facial data and determining a corresponding current expression state;
a question extraction module, configured to extract a question sentence from the current sound data when the current mood state is a first target state and the current expression state is a second target state, where the first target state indicates that the user does not have negative mood, and the second target state indicates that the user does not have negative expression;
and the question answering module is used for inputting the question sentences into the trained matching model to obtain answering sentences corresponding to the question sentences.
In a third aspect, an embodiment of the present application provides a server, where the server includes a processor, a memory, and a computer program stored in the memory and operable on the processor, and the processor, when executing the computer program, implements the conversation service method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the conversation service method according to the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product, which, when run on a server, causes the server to execute the session service method according to the first aspect.
Compared with the prior art, the embodiments of the present application have the following advantages. When the user triggers the dialog service, the current sound data and current facial data of the user are acquired; audio analysis and semantic analysis are performed on the current sound data to determine the corresponding current mood state, and expression analysis is performed on the current facial data to determine the corresponding current expression state; when the current mood state is a first target state and the current expression state is a second target state, the question sentence in the current sound data is extracted and input into the trained matching model to obtain the reply sentence corresponding to the question sentence. The user's intention is identified by combining mood and expression, and the corresponding reply sentence is generated only when the states corresponding to both mood and expression satisfy the conditions, so that the conversation is not continued when the user shows dissatisfaction or other negative intentions; the user's intention can therefore be grasped accurately, and the service quality of the dialog service is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the embodiments or the prior art description will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings may be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic flow chart illustrating a dialog service method based on artificial intelligence according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a dialog service method based on artificial intelligence according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of a dialog service device based on artificial intelligence according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a server according to a fourth embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when," "upon," "in response to determining," or "in response to detecting." Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining," "in response to determining," "upon detecting [the described condition or event]," or "in response to detecting [the described condition or event]."
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The server in the embodiment of the present application may be a palm top computer, a desktop computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cloud terminal device, a Personal Digital Assistant (PDA), and the like, and the specific type of the server is not limited in this embodiment of the present application.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
It should be understood that, the sequence numbers of the steps in the following embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In order to explain the technical means of the present application, the following description will be given by way of specific examples.
Referring to fig. 1, a schematic flow chart of a session service method based on artificial intelligence according to an embodiment of the present application is shown, where the session service method is applied to a server, the server is connected to a corresponding terminal device of a user side, and configures a corresponding human-computer interaction service interface for the terminal device of the user side to provide a session service. The server is connected with a corresponding database to obtain corresponding data. As shown in fig. 1, the conversation service method may include the steps of:
step S101, when the user triggers the dialogue service, the current voice data and the current face data of the user are obtained.
The user triggers the conversation service by clicking the conversation-service button in the human-computer interaction service interface that the server configures on the user's terminal device. The terminal device on the user side is also equipped with a sound collector and an image collector, namely a microphone and a camera, which can be used to acquire the sound data and facial data of the user who triggers the conversation service. The terminal device then sends the acquired sound data and facial data to the server, at which point the server is considered to have acquired the current sound data and current facial data.
Before sending the sound data and facial data to the server, the terminal device may first ask the user who triggered the conversation service whether the terminal device is allowed to send the user's sound data and facial data to the server. Only if the user grants permission are the user's sound data and facial data provided to the server; otherwise, the current conversation service is ended.
If the user does not allow the terminal device to send the user's sound data and facial data to the server, a permission reminder can be output to remind the user that the sound data and facial data must be allowed to be sent to the server for the conversation service; alternatively, a service-limitation reminder can be output to remind the user that the current service may be inaccurate. For example, a "permission reminder" dialog box pops up on the terminal device, in which the user can choose to grant or refuse permission, or a "limitation reminder" dialog box pops up displaying "the current conversation service may be inaccurate".
In one embodiment, the server is connected to a corresponding display device and can directly output the human-computer interaction interface, so that the server can be used by the user directly. In this case the server is connected to a display device and an input device: the display device displays the corresponding human-computer interaction service interface, and the user triggers the conversation service through the input device. The server is also equipped with a sound collector and an image collector, namely a microphone and a camera, so that it can directly acquire the sound data and facial data of the user who triggers the conversation service.
Step S102, carrying out audio analysis and semantic analysis on the current sound data, determining the corresponding current tone state, carrying out expression analysis on the current face data, and determining the corresponding current expression state.
In the present application, the processing of the sound data can be divided into audio analysis and semantic analysis. Audio analysis may analyze the timbre, pitch, volume and so on of the voice, and then judge whether the voice is sharp, agitated, and the like; semantic analysis may analyze the words corresponding to the voice using natural language processing, and then judge whether the words carry blaming, complaining or similar semantics. The mood state can then be determined by combining the result of the audio analysis with the result of the semantic analysis. The mood state may refer to a state in which the user shows a negative emotion, or a state in which the user shows no negative emotion. For example, when the voice in the sound data is relatively hurried and sharp, and at the same time the words corresponding to the sound data are blaming or complaining, the voice and the words can be combined to determine that the mood state corresponding to the sound data is the mood corresponding to a negative emotion.
Voice-based emotion recognition can be divided into two broad categories, based on how emotions are represented. The first representation uses emotion categories; the most commonly used six basic emotions are happiness, sadness, anger, disgust, fear and surprise. The second representation is based on several dimensional vectors, the most common being arousal and valence: arousal represents the degree of activation and valence represents how positive the emotion is, and both express their degree through numerical values. For example, the value interval of a dimension may be [-1, 1], where -1 may represent very calm/negative and 1 may represent very aroused/positive. Through emotion recognition of the voice, the sound data can finally be assigned to different emotion types, with certain of the six basic emotions defined as corresponding to a dissatisfied mood state, or given a score, with scores beyond a certain threshold defined as corresponding to a dissatisfied mood state.
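As a loose illustration of the two representations just described (not taken from the patent), a dissatisfied mood can be flagged either by membership in an assumed set of negative emotion categories or by thresholding a valence score; the category set and the threshold value below are assumptions.

    NEGATIVE_CATEGORIES = {"sadness", "anger", "disgust", "fear"}  # assumed negative subset

    def is_dissatisfied_by_category(emotion_label: str) -> bool:
        # Categorical representation: certain basic emotions count as dissatisfied.
        return emotion_label in NEGATIVE_CATEGORIES

    def is_dissatisfied_by_dimension(valence: float, threshold: float = -0.3) -> bool:
        # Dimensional representation: valence lies in [-1, 1]; below the assumed
        # threshold the mood state is treated as dissatisfied.
        return valence < threshold

    print(is_dissatisfied_by_category("anger"))   # True
    print(is_dissatisfied_by_dimension(-0.7))     # True
    print(is_dissatisfied_by_dimension(0.4))      # False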
For emotion recognition of text, that is, analyzing the semantics of the text, the present application may use Natural Language Processing (NLP) to perform semantic analysis of the text data, namely opinion mining, so as to extract a viewpoint, where the viewpoint represents whether the text expresses a positive or a negative attitude.
Face-based emotion recognition may refer to analyzing expressions. Image analysis based on face recognition can identify expressions, micro-expressions and the like, perform the corresponding expression-similarity matching, and thereby determine the state corresponding to the expression, where this state may correspond to the user having a negative emotion or to the user not having a negative emotion.
The expression analysis can be divided into two parts: face recognition and expression classification. The obtained facial data can first undergo OpenCV face recognition, and Keras can then be used for expression classification and emotion recognition, finally obtaining the expression state.
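A minimal sketch of this two-stage pipeline follows, assuming an OpenCV Haar-cascade face detector and a hypothetical pre-trained Keras expression classifier; the model file name "expression_cnn.h5", the 48x48 input size and the three-way label set are illustrative assumptions, not details from the patent.

    import cv2
    import numpy as np
    from tensorflow.keras.models import load_model

    EXPRESSION_LABELS = ["negative", "neutral", "positive"]   # assumed label set
    classifier = load_model("expression_cnn.h5")              # hypothetical trained model
    face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def current_expression_state(frame_bgr: np.ndarray) -> str:
        # Stage 1: face recognition (detection) with OpenCV.
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return "neutral"   # no face found: treat as no negative expression
        x, y, w, h = faces[0]
        # Stage 2: expression classification with the Keras model.
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = classifier.predict(face.reshape(1, 48, 48, 1), verbose=0)[0]
        return EXPRESSION_LABELS[int(np.argmax(probs))]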
Optionally, performing audio analysis and semantic analysis on the current sound data, and determining the corresponding current mood state includes:
detecting the amplitude and frequency of sound in the current sound data;
identifying the semantics of the current sound data and determining the corresponding attitude of the current sound data;
when the amplitude is lower than a first threshold value or the frequency is lower than a second threshold value and the corresponding attitude of the current voice data is positive, determining that the current tone state is a first target state;
and when the attitude corresponding to the current sound data is negative and/or the amplitude is not lower than the first threshold and the frequency is not lower than the second threshold, determining that the current tone state is not a first target state.
The amplitude and the frequency of the sound in the sound data are detected by processing the sound wave of the sound data, wherein the amplitude is used for representing the loudness of the sound, and the frequency is used for representing the pitch of the sound. Based on the analysis of loudness and pitch, a first threshold value corresponding to amplitude may be given, above which a greater loudness is indicated, and a second threshold value corresponding to frequency, above which a higher pitch is indicated.
The semantics of the sound data are recognized and the attitude corresponding to the sound data is determined. Since the first target state is defined as the user having no negative tone, the first target state requires that the amplitude is lower than the first threshold or the frequency is lower than the second threshold and that the attitude corresponding to the current sound data is positive; all other situations are not the first target state.
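The following sketch illustrates this decision rule: RMS amplitude stands in for loudness, the dominant FFT frequency stands in for pitch, and the two threshold values are arbitrary assumptions rather than values given in the patent.

    import numpy as np

    def detect_amplitude_and_frequency(samples: np.ndarray, sample_rate: int):
        # Rough loudness (RMS amplitude) and pitch (dominant frequency via FFT).
        amplitude = float(np.sqrt(np.mean(samples ** 2)))
        spectrum = np.abs(np.fft.rfft(samples))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        frequency = float(freqs[int(np.argmax(spectrum))])
        return amplitude, frequency

    def is_first_target_state(amplitude: float, frequency: float, attitude: str,
                              amp_threshold: float = 0.3,
                              freq_threshold: float = 300.0) -> bool:
        # First target state: the attitude is positive AND either the amplitude
        # or the frequency falls below its threshold; anything else is not it.
        return attitude == "positive" and (amplitude < amp_threshold
                                           or frequency < freq_threshold)

    rate = 16000
    t = np.arange(rate) / rate
    calm_voice = 0.1 * np.sin(2 * np.pi * 180 * t)        # quiet, low-pitched tone
    amp, freq = detect_amplitude_and_frequency(calm_voice, rate)
    print(is_first_target_state(amp, freq, "positive"))   # True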
Optionally, recognizing the semantics of the current sound data, and determining the corresponding attitude of the current sound data includes:
performing noise reduction processing on the current sound data to obtain noise-reduced sound data;
extracting human voice data from the voice data subjected to noise reduction;
and converting the voice data into character data, identifying the character data by using natural language processing, and determining the attitude corresponding to the character data as the attitude corresponding to the current voice data.
Specifically, semantic recognition adopts the NLP technique on the text recognized from the sound data; however, the accuracy of the semantic recognition is affected by the voice-to-text conversion, so before converting to text data, the sound data needs to be denoised and the user's voice needs to be extracted for conversion.
The speech-to-text conversion technology may be Automatic Speech Recognition (ASR), in which speech is taken as the research object and converted into the corresponding text through speech signal processing and pattern recognition.
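A sketch of the whole attitude-determination pipeline is given below. The helper functions are placeholders standing in for a noise-reduction routine, a voice-activity detector, an ASR engine and an NLP sentiment classifier; their names, the stub transcript and the keyword rule are assumptions for illustration only.

    def denoise(audio):
        # Placeholder for a noise-reduction step (e.g. spectral gating).
        return audio

    def extract_human_voice(audio):
        # Placeholder for voice-activity detection that keeps only speech segments.
        return audio

    def speech_to_text(audio) -> str:
        # Placeholder for an ASR engine converting speech to text.
        return "why has my request still not been handled"

    def text_attitude(text: str) -> str:
        # Placeholder NLP attitude classifier using a crude keyword rule.
        negative_markers = ("not", "complaint", "bad")
        return "negative" if any(w in text for w in negative_markers) else "positive"

    def attitude_of_sound_data(raw_audio) -> str:
        cleaned = denoise(raw_audio)
        voice = extract_human_voice(cleaned)
        text = speech_to_text(voice)
        return text_attitude(text)

    print(attitude_of_sound_data(None))  # "negative" for the stub transcript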
Step S103, when the current tone state is the first target state and the current expression state is the second target state, extracting question sentences in the current sound data.
Wherein the first target state represents that the user does not have negative tone, i.e. tone expression is neutral or satisfactory, and the second target state represents that the user does not have negative expression, i.e. expression is neutral or satisfactory.
When the current tone state is the first target state and the current expression state is the second target state, it indicates that the user's current emotion is not one of dissatisfaction or the like, and the question raised by the user can continue to be handled in the automatic-reply mode.
When the current tone state is not the first target state or the current expression state is not the second target state, it indicates that the user's current emotion is one of dissatisfaction or the like, possibly because automatic replies cannot accurately answer the user's questions. In that case the question raised by the user should no longer be responded to in the automatic-reply mode, so as to avoid further provoking the user's emotions.
In the present application, extracting the question sentence from the current sound data means analyzing the sound data, determining the target field or key field in the sound data, and taking that field as the question the user wants to consult about in the current sound data. The ASR technique described above may likewise be used to convert the sound data into text before the question is extracted from it.
In the present application, the question sentences in the sound data are extracted by an extraction model; the extraction model focuses on interrogative tone and interrogative words, and extracts the question sentences based on them.
Optionally, before extracting the question sentences in the current sound data, the method further includes:
if the current sound data has the previous sound data, performing audio analysis and semantic analysis on the previous sound data to determine a corresponding previous tone state;
acquiring a trained extraction model;
adjusting the trained extraction model according to the previous mood state to obtain an adjusted extraction model;
accordingly, extracting question sentences in the current sound data includes:
and extracting the question sentences in the current sound data by using the adjusted extraction model.
When the current sound data is not the first sound data sent by the user in the conversation service, previous sound data exists for the current sound data, and the mood state of the previous sound data is related to the current sound data to a certain degree. Therefore, before the question sentence is extracted from the current sound data, the previous mood state can be used as an input to the extraction model to adjust the mood-related parameters in the extraction model, so that the adjusted extraction model can extract the question sentence in the current sound data more accurately. For example, when the previous mood state of the previous sound data is dissatisfied, the mood associated with the question sentence in the current sound data may be heightened; the weight of mood in the extraction model is therefore increased before the question sentence is extracted.
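As a rough sketch of this adjustment (not the patent's actual model), the extraction model is abstracted below as a dictionary of feature weights, and the 1.5 scaling factor for the mood weight is an arbitrary assumption.

    def adjust_extraction_model(model_weights: dict, previous_mood_state: str) -> dict:
        # Raise the influence of mood-related features when the previous turn
        # showed dissatisfaction, before extracting the question sentence.
        adjusted = dict(model_weights)
        if previous_mood_state == "dissatisfied":
            adjusted["mood"] = adjusted.get("mood", 1.0) * 1.5
        return adjusted

    weights = {"interrogative_tone": 1.0, "query_words": 1.0, "mood": 1.0}
    print(adjust_extraction_model(weights, "dissatisfied"))
    # {'interrogative_tone': 1.0, 'query_words': 1.0, 'mood': 1.5}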
And step S104, inputting the question sentence into the trained matching model to obtain a reply sentence corresponding to the question sentence.
The matching model can adopt machine learning, a neural network and the like, and in the application, the matching model can be a neural network model based on a bidirectional Long Short-Term Memory network (BLSTM) algorithm.
First, the user's question sentences and the corresponding candidate reply sentences are subjected to word segmentation, stop-word removal, part-of-speech tagging and the like; then all the word segments are converted into word vectors by word2vec to obtain the feature representation of the text corresponding to the question sentence. A question sentence in the training set is denoted Q = {q1, q2, ..., qn}, and the corresponding answer is denoted A = {a1, a2, ..., an}. The specific steps include: constructing a BLSTM model that introduces the attention mechanism; converting the question sentence Q and the candidate reply sentence A into feature vectors Q' and A' respectively and computing the output vectors in the hidden layer of the neural network; and then feeding the output of the BLSTM model into the attention model to perform a weighted calculation on the output vectors, finally obtaining the vector representation of the corresponding reply sentence.
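A small preprocessing sketch of these steps is shown below, using gensim's word2vec; the toy corpus, the stop-word list and the tokenizer are assumptions, and the 150-dimensional vectors simply match the setting mentioned in the next paragraph.

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess

    STOP_WORDS = {"the", "a", "is", "under", "for", "what"}   # assumed stop-word list

    def preprocess(sentence: str):
        # Word segmentation plus stop-word removal (part-of-speech tagging omitted).
        return [t for t in simple_preprocess(sentence) if t not in STOP_WORDS]

    questions = ["What eco-friendly materials exist under scene A"]
    answers = ["Y material is used under the Z project"]
    corpus = [preprocess(s) for s in questions + answers]

    # Train 150-dimensional word vectors on the toy corpus.
    w2v = Word2Vec(sentences=corpus, vector_size=150, window=5, min_count=1, workers=1)

    # Feature representation of a question: the sequence of its word vectors.
    q_vectors = [w2v.wv[t] for t in preprocess(questions[0])]
    print(len(q_vectors), q_vectors[0].shape)   # number of tokens, (150,)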
The word vectors are trained using word2vec with the dimensionality set to 150. The data set comes from the WebQA data set of web search and is divided into a training set and a test set. Two evaluation criteria are adopted to verify the proposed question answering: Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR), defined as

MAP = (1/|Q|) · Σ_q avg(P(q)),    MRR = (1/|Q|) · Σ_q 1/rank(q)

where avg(P(q)) represents the average precision value for question q and rank(q) represents the ranked position of the first correct answer among the candidate answers; MRR only considers the ranking of the first correct answer, while MAP evaluates the ranking of all correct answers.
The BLSTM model here introduces the attention mechanism, a concept that allows the model to focus on particular parts of its output. Rather than treating every word vector equally, the idea of the attention mechanism is to focus locally on the overall information by means of automatic weighting, in a way that mimics how the human brain pays different amounts of attention to different things.
In the BLSTM model based on the attention mechanism, each output of the bidirectional LSTM is sent to the attention layer and automatically weighted, so as to obtain the weighted vector representation of the sentence:
- M = tanh(H + B)
- α = softmax(W^T M)
- γ = Hα^T
In these formulas, H represents the matrix formed by the output vectors [h1, h2, ..., hT], where T is the length of the input sentence, and B represents a bias vector relative to H. First, the hidden-layer representation M of H is obtained with the tanh function; then, using the known context vector W and M, the weight representation α of the sentence H is calculated with the softmax function; finally, the vector representation γ of the sentence is obtained by multiplying the output matrix H by the transpose of the weight vector α. The vector W represents the relatively important parts of the sentence information, and its value can be adjusted for different situations during model training.
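A direct NumPy transcription of these three formulas is sketched below (the dimensions and random inputs are illustrative only):

    import numpy as np

    def attention_sentence_vector(H: np.ndarray, W: np.ndarray, B: np.ndarray) -> np.ndarray:
        # H: (d, T) matrix of BLSTM outputs [h1 ... hT]; W: (d,) context vector;
        # B: (d, T) bias. Implements M = tanh(H + B), alpha = softmax(W^T M),
        # gamma = H alpha^T.
        M = np.tanh(H + B)
        scores = W @ M                         # shape (T,)
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                   # softmax over the T time steps
        gamma = H @ alpha                      # weighted sentence vector, shape (d,)
        return gamma

    d, T = 4, 3
    rng = np.random.default_rng(0)
    gamma = attention_sentence_vector(rng.normal(size=(d, T)),
                                      rng.normal(size=d),
                                      np.zeros((d, T)))
    print(gamma.shape)   # (4,)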
Compared with the plain LSTM, the accuracy of the BLSTM model with the attention mechanism is clearly improved: by combining the attention mechanism, the model ignores meaningless information and captures more contextual semantic information, so the effectiveness of the answers produced by the proposed natural-language-processing model is significantly improved.
For example, consider an environmental service platform: a user opens the platform to make a service consultation, and the platform collects the user's voice and facial images through the corresponding sound collector and image collector. When the current tone state corresponding to the voice is a normal state (the speech is calm, the pitch is even, and so on) and the current expression state corresponding to the facial image is a normal state (smiling, no furrowed brows, and so on), it indicates that the user is not dissatisfied. The conversation service can therefore continue: the voice is extracted and the question is determined to be "what environmentally friendly materials are available under scene A"; the question is input into the trained BLSTM model, and the corresponding reply sentence can be obtained: "The following results were found for you: Y material is used in the Z project."
Optionally, after inputting the question sentence into the trained matching model to obtain a reply sentence corresponding to the question sentence, the method further includes:
acquiring a reply utterance corresponding to the question sentence from a reply utterance database;
and combining the reply sentences with the reply dialogues to obtain the replies corresponding to the question sentences.
The reply utterance is an utterance preset in the reply-utterance database according to the dialogue scene; the reply utterance can be combined with the corresponding reply sentence to form the final reply content. For example, the question sentence is "what environmentally friendly materials are available under scene A", the corresponding reply utterance is "The following information is found for you: [reply sentence]", and the reply sentence is "Y material, Z material"; the reply content is therefore: "The following information is found for you: Y material, Z material".
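A minimal sketch of this combination step, with an assumed utterance template and scene key:

    REPLY_UTTERANCES = {   # stand-in for the reply-utterance database
        "consultation": "The following information is found for you: {reply_sentence}.",
    }

    def build_response(reply_sentence: str, scene: str = "consultation") -> str:
        template = REPLY_UTTERANCES.get(scene, "{reply_sentence}")
        return template.format(reply_sentence=reply_sentence)

    print(build_response("Y material, Z material"))
    # The following information is found for you: Y material, Z material.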
In one embodiment, if a voice response is required, the content of the response is converted into voice according to a preset sound and output.
In the embodiments of the present application, when the user triggers the dialog service, the current sound data and current facial data of the user are acquired; audio analysis and semantic analysis are performed on the current sound data to determine the corresponding current mood state, and expression analysis is performed on the current facial data to determine the corresponding current expression state; when the current mood state is the first target state and the current expression state is the second target state, the question sentence in the current sound data is extracted and input into the trained matching model to obtain the corresponding reply sentence. The user's intention is identified by combining mood and expression, and the corresponding reply sentence is generated only when the states corresponding to both satisfy the conditions, so that the conversation is not continued when the user shows dissatisfaction or other negative intentions; the user's intention can therefore be grasped accurately, and the service quality of the dialog service is improved.
Referring to fig. 2, which is a schematic flowchart of a dialog service method based on artificial intelligence according to a second embodiment of the present application, as shown in fig. 2, the dialog service method may include the following steps:
in step S201, when the user triggers the dialog service, the current voice data and the current face data of the user are acquired.
Step S202, audio analysis and semantic analysis are carried out on the current sound data, a corresponding current tone state is determined, expression analysis is carried out on the current face data, and a corresponding current expression state is determined.
The contents of steps S201 and S202 are the same as those of steps S101 and S102; reference may be made to the description of steps S101 and S102, which is not repeated here.
Step S203, acquiring the personnel state information of the manual service when the current tone state is the third target state and/or the current expression state is the fourth target state.
Wherein the third target state represents that the user has negative tone, i.e. tone expression is unsatisfactory, and the fourth target state represents that the user has negative expression, i.e. expression is unsatisfactory.
The third target state is opposite to the first target state, the fourth target state is opposite to the second target state, and the situation that the current emotion of the user is discontented is shown when the current tone state is the third target state or the current expression state is the fourth target state.
In the present application, when the user's question cannot be answered automatically, manual intervention can also be adopted, and the corresponding question is answered by a human agent, thereby improving the user experience. On this basis, the server can be connected to the manual service platform to obtain the state information of each member of the manual service staff, including whether the person is online, whether the person is busy, whether the person can handle the relevant scene, and the like.
In one embodiment, after acquiring the personnel state information of the manual service, if no person can currently provide the manual conversation service, a manual-transfer prompt is also output to the user to remind the user that a transfer to a human agent is in progress.
And step S204, determining the target person according to the person state information, and sending the current sound data to the target person.
According to the personnel state information, a person who is online, not busy, and able to answer questions in the corresponding scene is determined as the target person, and the current sound data are sent to the target person through the manual service platform, so that the target person can review the current sound data and the manual service is invoked. For example, after the target person is identified, an access request is sent to the target person, and after the target person confirms, the data are sent to them.
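A sketch of the target-person selection just described; the record fields and example staff list are assumptions:

    def select_target_person(staff_states, scene):
        # Pick someone who is online, not busy, and able to handle the scene.
        for person in staff_states:
            if person["online"] and not person["busy"] and scene in person["scenes"]:
                return person
        return None   # nobody available: output a manual-transfer prompt instead

    staff = [
        {"id": "a01", "online": True, "busy": True,  "scenes": {"consultation"}},
        {"id": "a02", "online": True, "busy": False, "scenes": {"consultation", "complaint"}},
    ]
    print(select_target_person(staff, "complaint")["id"])   # a02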
After the current voice data is sent to the target person, the reply of the target person to the current voice data can be obtained, and the reply is sent to the user, namely, manual conversation service is provided for the user. In one embodiment, the target person is docked with the user so that the target person and the user can have a direct conversation. For example, the dialog service interface is used to interface the target person with the user and the dialog may be conducted using voice and/or text.
Optionally, after sending the current sound data to the target person, the method further includes:
docking the target person with the user based on the user-triggered dialog service;
and under the condition of successful docking, acquiring reply information sent by the target personnel, and sending the reply information to the user.
The conversation service provided by the server can be docked with the target person, namely one end of the conversation service is changed from the robot to the target person, while the other end remains the user, thereby docking the target person with the user.
In the embodiments of the present application, when the user triggers the conversation service, the current sound data and current facial data of the user are acquired; audio analysis and semantic analysis are performed on the current sound data to determine the corresponding current tone state, and expression analysis is performed on the current facial data to determine the corresponding current expression state. When the current tone state is the third target state or the current expression state is the fourth target state, the manual service is called to continue the dialogue. This prevents the situation where automatic replies cannot accurately answer the user's questions and leave the user dissatisfied; the automatic conversation is stopped, so the user's emotions are not further provoked.
Corresponding to the session service method of the foregoing embodiment, fig. 3 shows a block diagram of a session service device based on artificial intelligence according to a third embodiment of the present application, where the session service device is applied to a server, the server is connected to a corresponding terminal device of a user side, and configures a corresponding human-computer interaction service interface for the terminal device of the user side to provide a session service. The server is connected with a corresponding database to obtain corresponding data. For convenience of explanation, only portions related to the embodiments of the present application are shown.
Referring to fig. 3, the conversation service apparatus includes:
a data obtaining module 31, configured to obtain current voice data and current face data of a user when the user triggers a session service;
the data analysis module 32 is configured to perform audio analysis and semantic analysis on current sound data, determine a corresponding current tone state, perform expression analysis on current facial data, and determine a corresponding current expression state, where a first target state indicates that the user does not have negative tone, and a second target state indicates that the user does not have negative expression;
the question extraction module 33 is configured to extract a question and a sentence from the current sound data when the current mood state is the first target state and the current expression state is the second target state;
and the question answering module 34 is used for inputting the question sentence into the trained matching model to obtain an answering sentence corresponding to the question sentence.
Optionally, the session service device further includes:
the voice state obtaining module is used for carrying out audio analysis and semantic analysis on the previous sound data if the previous sound data exists in the current sound data before extracting the question sentences in the current sound data, and determining the corresponding previous voice state;
the extraction model acquisition module is used for acquiring a trained extraction model;
the extraction model adjusting module is used for adjusting the trained extraction model according to the previous mood state to obtain an adjusted extraction model;
accordingly, the problem extraction module 33 includes:
and the problem extraction unit is used for extracting the problem sentences in the current sound data by using the adjusted extraction model.
Optionally, the session service device further includes:
the state information acquisition module is used for acquiring the personnel state information of manual service after audio analysis and semantic analysis are carried out on the current sound data and the corresponding current tone state is determined and when the current tone state is a third target state and/or the current expression state is a fourth target state, the third target state represents that the user has negative tone, and the fourth target state represents that the user has negative expression;
and the first data sending module is used for determining the target personnel according to the personnel state information and sending the current sound data to the target personnel.
Optionally, the session service device further includes:
the docking module is used for docking the target person with the user based on the conversation service triggered by the user after the current sound data is sent to the target person;
and the second data sending module is used for acquiring the reply information sent by the target personnel and sending the reply information to the user in the case of successful docking.
Optionally, the session service device further includes:
the dialect obtaining module is used for obtaining a reply dialect corresponding to the question sentence from the reply dialect database after inputting the question sentence into the trained matching model to obtain the reply sentence corresponding to the question sentence;
and the answer determining module is used for combining the answer sentences with the answer dialogues to obtain answers corresponding to the question sentences.
Optionally, the data analysis module 32 includes:
a detection unit for detecting the amplitude and frequency of the sound in the current sound data;
the attitude determination unit is used for identifying the semantics of the current sound data and determining the attitude corresponding to the current sound data;
the first state determining unit is used for determining that the current mood state is a first target state when the amplitude is lower than a first threshold value or the frequency is lower than a second threshold value and the corresponding attitude of the current voice data is positive;
and a second state determination unit, configured to determine that the current mood state is not the first target state when the attitude corresponding to the current sound data is negative and/or the amplitude is not lower than the first threshold and the frequency is not lower than the second threshold.
Optionally, the attitude determination unit includes:
the noise reduction processing unit is used for carrying out noise reduction processing on the current sound data to obtain noise-reduced sound data;
the voice extracting subunit is used for extracting voice data from the voice data subjected to noise reduction;
and the attitude determining subunit is used for converting the voice data into character data, identifying the character data by using natural language processing, and determining the attitude corresponding to the character data as the attitude corresponding to the current voice data.
It should be noted that, because the contents of information interaction, execution process, and the like between the modules are based on the same concept as that of the embodiment of the method of the present application, specific functions and technical effects thereof may be specifically referred to a part of the embodiment of the method, and details are not described here.
Fig. 4 is a schematic structural diagram of a server according to a fourth embodiment of the present application. As shown in fig. 4, the server 4 of this embodiment includes: at least one processor 40 (only one shown in fig. 4), a memory 41, and a computer program 42 stored in the memory 41 and executable on the at least one processor 40, the steps of any of the various dialog service method embodiments described above being implemented when the computer program 42 is executed by the processor 40.
The server 4 may include, but is not limited to, a processor 40, a memory 41. Those skilled in the art will appreciate that fig. 4 is merely an example of the server 4 and does not constitute a limitation of the server 4, and may include more or less components than those shown, or combine certain components, or different components, such as input output devices, network access devices, etc.
The Processor 40 may be a CPU, and the Processor 40 may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field-Programmable Gate arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 41 may in some embodiments be an internal storage unit of the server 4, such as a hard disk or a memory of the server 4. The memory 41 may be an external storage device of the server 4 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the server 4. Further, the memory 41 may also include both an internal storage unit of the server 4 and an external storage device. The memory 41 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of a computer program. The memory 41 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the above-mentioned apparatus may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method of the embodiments described above can be implemented by a computer program, which can be stored in a computer readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc.

The computer readable medium may include at least: any entity or device capable of carrying computer program code, recording medium, computer memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, and software distribution media, such as a USB disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
When the computer program product runs on a server, the server, upon executing the computer program product, implements the steps in the above method embodiments.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/server and method may be implemented in other ways. For example, the above-described apparatus/server embodiments are merely illustrative, and for example, a division of modules or units is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.
Claims (10)
1. A dialog service method based on artificial intelligence, which is characterized in that the dialog service method comprises:
when a user triggers a dialog service, acquiring current sound data and current facial data of the user;
performing audio analysis and semantic analysis on the current sound data to determine a corresponding current tone state, and performing expression analysis on the current facial data to determine a corresponding current expression state;
when the current tone state is a first target state and the current expression state is a second target state, extracting a question sentence from the current sound data, wherein the first target state represents that the user does not have a negative tone, and the second target state represents that the user does not have a negative expression;
inputting the question sentence into the trained matching model to obtain a reply sentence corresponding to the question sentence.
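By way of illustration only (this is not part of the claim), the turn-level flow of claim 1 can be pictured as a single handler; every function name below (analyze_tone, analyze_expression, extract_question, matching_model) is a hypothetical placeholder that a real system would back with speech, vision and NLP models.

```python
from typing import Optional

def analyze_tone(sound_data: bytes) -> str:
    """Audio + semantic analysis -> current tone state (placeholder result)."""
    return "no_negative_tone"

def analyze_expression(face_data: bytes) -> str:
    """Facial expression analysis -> current expression state (placeholder result)."""
    return "no_negative_expression"

def extract_question(sound_data: bytes) -> str:
    """Pull the question sentence out of the user's utterance (placeholder)."""
    return "How do I reset my password?"

def matching_model(question: str) -> str:
    """Trained matching model mapping a question sentence to a reply sentence (placeholder)."""
    return "You can reset it from the account settings page."

def handle_dialog_turn(sound_data: bytes, face_data: bytes) -> Optional[str]:
    """One dialog turn: answer automatically only when neither signal is negative."""
    tone_state = analyze_tone(sound_data)              # first target state check
    expression_state = analyze_expression(face_data)   # second target state check
    if (tone_state == "no_negative_tone"
            and expression_state == "no_negative_expression"):
        return matching_model(extract_question(sound_data))
    return None  # negative signals are escalated instead (see claim 3)

print(handle_dialog_turn(b"audio-bytes", b"face-bytes"))
```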
2. The dialog service method according to claim 1, wherein before the extracting of the question sentence from the current sound data, the dialog service method further comprises:
if previous sound data exists for the current sound data, performing audio analysis and semantic analysis on the previous sound data to determine a corresponding previous tone state;
acquiring a trained extraction model;
adjusting the trained extraction model according to the previous tone state to obtain an adjusted extraction model;
accordingly, the extracting of the question sentence from the current sound data comprises:
and extracting the question sentence from the current sound data by using the adjusted extraction model.
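The claim does not fix how the extraction model is "adjusted"; one possible reading, sketched below purely as an assumption, is to condition the extraction step on the previous tone state, e.g. by relaxing an acceptance threshold when the prior turn already sounded tense so that fewer questions are missed.

```python
from typing import Callable, List, Tuple

def score_candidate_questions(utterance: str) -> List[Tuple[str, float]]:
    """Hypothetical extraction model: score each sentence as a question.
    A sentence ending in '?' simply gets a high score in this stand-in."""
    candidates = []
    for sentence in utterance.replace("?", "?.").split("."):
        sentence = sentence.strip()
        if sentence:
            candidates.append((sentence, 0.9 if sentence.endswith("?") else 0.2))
    return candidates

def adjust_extraction_model(previous_tone_state: str) -> Callable[[str], List[str]]:
    """Condition extraction on the previous tone state (assumed mechanism)."""
    # A tense previous turn lowers the acceptance threshold.
    threshold = 0.5 if previous_tone_state == "no_negative_tone" else 0.15
    def extract(utterance: str) -> List[str]:
        return [s for s, score in score_candidate_questions(utterance) if score >= threshold]
    return extract

extract = adjust_extraction_model("negative_tone_detected")
print(extract("My order is late. When will it arrive?"))
```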
3. The dialog service method according to claim 1, wherein after the performing of audio analysis and semantic analysis on the current sound data to determine the corresponding current tone state, the dialog service method further comprises:
when the current tone state is a third target state and/or the current expression state is a fourth target state, acquiring personnel state information of the manual customer service, wherein the third target state represents that the user has a negative tone, and the fourth target state represents that the user has a negative expression;
and determining a target person according to the personnel state information, and sending the current sound data to the target person.
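For illustration only, the escalation path can be read as scanning the agent status records for an idle agent and forwarding the user's audio to that agent; the record fields and the transport stub below are assumptions, not part of the claim.

```python
from typing import Dict, List, Optional

def send_to_agent(agent_id: str, payload: bytes) -> None:
    """Stub transport; a real system would push the audio to the agent's console."""
    print(f"forwarding {len(payload)} bytes of audio to agent {agent_id}")

def pick_target_agent(agent_status: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the first idle agent, or None if everyone is busy."""
    for agent in agent_status:
        if agent.get("state") == "idle":
            return agent
    return None

def escalate_to_human(agent_status: List[Dict[str, str]], current_sound_data: bytes) -> bool:
    agent = pick_target_agent(agent_status)
    if agent is None:
        return False                       # no free agent; the caller decides what to do
    send_to_agent(agent["id"], current_sound_data)
    return True

escalate_to_human([{"id": "a42", "state": "idle"}], b"audio-bytes")
```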
4. The dialog service method according to claim 3, wherein after the sending of the current sound data to the target person, the dialog service method further comprises:
docking the target person with the user based on the user-triggered dialog service;
and under the condition of successful docking, acquiring reply information sent by the target person, and sending the reply information to the user.
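Continuing the same illustrative sketch, once the target person is docked into the user's session their reply is relayed back; the session object and its methods are invented here for clarity and are not prescribed by the claim.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DialogSession:
    """Toy stand-in for the session created when the user triggers the dialog service."""
    user_id: str
    agent_id: str = ""
    outbox: List[str] = field(default_factory=list)

    def dock(self, agent_id: str) -> bool:
        self.agent_id = agent_id           # attach the target person to this session
        return True                        # assume docking always succeeds in this sketch

    def send_to_user(self, message: str) -> None:
        self.outbox.append(message)        # stub delivery to the user's client

def relay_agent_reply(session: DialogSession, agent_id: str, agent_reply: str) -> bool:
    if not session.dock(agent_id):         # docking failed: nothing is relayed
        return False
    session.send_to_user(agent_reply)      # reply information goes back to the user
    return True

session = DialogSession(user_id="u7")
relay_agent_reply(session, "a42", "I have reissued your invoice.")
```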
5. The dialog service method according to claim 1, wherein after the inputting of the question sentence into the trained matching model to obtain a reply sentence corresponding to the question sentence, the dialog service method further comprises:
acquiring a reply utterance corresponding to the question sentence from a reply utterance database;
and combining the reply utterance with the reply sentence to obtain an answer corresponding to the question sentence.
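Illustratively, the combination can be as simple as prefixing a stored reply utterance to the model's reply sentence; the lookup keys and wording below are invented for the example.

```python
# Hypothetical reply-utterance database keyed by question type.
REPLY_UTTERANCE_DB = {
    "password_reset": "Thanks for asking, that is a common question.",
    "default": "Happy to help.",
}

def compose_answer(question_type: str, reply_sentence: str) -> str:
    """Combine the stored reply utterance with the matching model's reply sentence."""
    utterance = REPLY_UTTERANCE_DB.get(question_type, REPLY_UTTERANCE_DB["default"])
    return f"{utterance} {reply_sentence}"

print(compose_answer("password_reset", "You can reset it from the account settings page."))
```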
6. The dialog service method according to any one of claims 1 to 5, wherein the performing of audio analysis and semantic analysis on the current sound data to determine the corresponding current tone state comprises:
detecting the amplitude and frequency of sound in the current sound data;
identifying the semantics of the current sound data, and determining the attitude corresponding to the current sound data;
when the amplitude is lower than a first threshold or the frequency is lower than a second threshold, and the attitude corresponding to the current sound data is positive, determining that the current tone state is the first target state;
and when the attitude corresponding to the current sound data is negative, and/or the amplitude is not lower than the first threshold and the frequency is not lower than the second threshold, determining that the current tone state is not the first target state.
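This reduces to a simple threshold rule, sketched below for illustration; the numeric threshold values are placeholders, since the claim fixes only the comparisons, not the values.

```python
AMPLITUDE_THRESHOLD = 0.6    # first threshold (normalised loudness), assumed value
FREQUENCY_THRESHOLD = 220.0  # second threshold in Hz, assumed value

def classify_tone_state(amplitude: float, frequency: float, attitude: str) -> str:
    """Return 'first_target_state' when the voice is calm and the attitude is positive."""
    calm_voice = amplitude < AMPLITUDE_THRESHOLD or frequency < FREQUENCY_THRESHOLD
    if calm_voice and attitude == "positive":
        return "first_target_state"       # no negative tone detected
    return "not_first_target_state"       # loud/high-pitched voice or negative attitude

print(classify_tone_state(amplitude=0.4, frequency=180.0, attitude="positive"))
```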
7. The dialog service method according to claim 6, wherein the identifying of the semantics of the current sound data and the determining of the attitude corresponding to the current sound data comprise:
performing noise reduction processing on the current sound data to obtain noise-reduced sound data;
extracting human voice data from the noise-reduced sound data;
and converting the human voice data into text data, recognizing the text data by using natural language processing, and determining the attitude corresponding to the text data as the attitude corresponding to the current sound data.
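The claim describes a four-stage pipeline (denoise, isolate the human voice, transcribe, classify attitude); in the illustrative sketch below each stage is a stub so only the control flow is visible, and a crude word list stands in for the natural language processing step.

```python
NEGATIVE_WORDS = {"terrible", "useless", "angry", "refund"}  # illustrative lexicon only

def denoise(sound_data: bytes) -> bytes:
    return sound_data                      # placeholder for spectral noise reduction

def extract_human_voice(sound_data: bytes) -> bytes:
    return sound_data                      # placeholder for voice-activity detection

def speech_to_text(sound_data: bytes) -> str:
    return "this product is useless"       # placeholder for a speech-to-text engine

def attitude_of(current_sound_data: bytes) -> str:
    """Run the pipeline and return 'negative' or 'positive'."""
    text = speech_to_text(extract_human_voice(denoise(current_sound_data)))
    return "negative" if NEGATIVE_WORDS & set(text.split()) else "positive"

print(attitude_of(b"audio-bytes"))
```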
8. A dialog service device based on artificial intelligence, characterized in that the dialog service device comprises:
the data acquisition module is used for acquiring current sound data and current facial data of a user when the user triggers a dialog service;
the data analysis module is used for performing audio analysis and semantic analysis on the current sound data to determine a corresponding current tone state, and performing expression analysis on the current facial data to determine a corresponding current expression state, wherein the first target state represents that the user does not have a negative tone, and the second target state represents that the user does not have a negative expression;
the question extraction module is used for extracting a question sentence from the current sound data when the current tone state is a first target state and the current expression state is a second target state;
and the question answering module is used for inputting the question sentence into the trained matching model to obtain a reply sentence corresponding to the question sentence.
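For orientation only, the four modules of the device map naturally onto a class with one method per module; the method bodies below are placeholders, not the patent's implementation.

```python
from typing import Tuple

class DialogServiceDevice:
    """Skeleton mirroring the four modules; every return value is a stand-in."""

    def acquire_data(self, user_id: str) -> Tuple[bytes, bytes]:
        """Data acquisition module: current sound data and current facial data."""
        return b"...sound...", b"...face..."

    def analyze(self, sound_data: bytes, face_data: bytes) -> Tuple[str, str]:
        """Data analysis module: tone state from audio + semantics, expression state from the face."""
        return "first_target_state", "second_target_state"

    def extract_question(self, sound_data: bytes) -> str:
        """Question extraction module: pull the question sentence from the audio."""
        return "How do I reset my password?"

    def answer(self, question: str) -> str:
        """Question answering module: the trained matching model returns a reply sentence."""
        return "You can reset it from the account settings page."
```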
9. A server, characterized in that the server comprises a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the dialog service method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the dialog service method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210142940.9A CN114490947A (en) | 2022-02-16 | 2022-02-16 | Dialog service method, device, server and medium based on artificial intelligence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210142940.9A CN114490947A (en) | 2022-02-16 | 2022-02-16 | Dialog service method, device, server and medium based on artificial intelligence |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114490947A true CN114490947A (en) | 2022-05-13 |
Family
ID=81482095
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210142940.9A Pending CN114490947A (en) | 2022-02-16 | 2022-02-16 | Dialog service method, device, server and medium based on artificial intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114490947A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991178A (en) * | 2019-11-08 | 2020-04-10 | 苏宁金融科技(南京)有限公司 | Intelligent customer service and artificial customer service switching method and device and computer equipment |
CN111932056A (en) * | 2020-06-19 | 2020-11-13 | 北京文思海辉金信软件有限公司 | Customer service quality scoring method and device, computer equipment and storage medium |
CN113157371A (en) * | 2021-04-26 | 2021-07-23 | 平安科技(深圳)有限公司 | Manual customer service switching method and device, storage medium and computer equipment |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115410576A (en) * | 2022-08-26 | 2022-11-29 | 国网河南省电力公司信息通信公司 | Intelligent customer service system and intelligent customer service robot |
CN115641837A (en) * | 2022-12-22 | 2023-01-24 | 北京资采信息技术有限公司 | Intelligent robot conversation intention recognition method and system |
CN116631446A (en) * | 2023-07-26 | 2023-08-22 | 上海迎智正能文化发展有限公司 | Behavior mode analysis method and system based on speech analysis |
CN116631446B (en) * | 2023-07-26 | 2023-11-03 | 上海迎智正能文化发展有限公司 | Behavior mode analysis method and system based on speech analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110807388B (en) | Interaction method, interaction device, terminal equipment and storage medium | |
US11226673B2 (en) | Affective interaction systems, devices, and methods based on affective computing user interface | |
Wu et al. | Emotion recognition from text using semantic labels and separable mixture models | |
CN104598644B (en) | Favorite label mining method and device | |
CN111833853B (en) | Voice processing method and device, electronic equipment and computer readable storage medium | |
CN113420807A (en) | Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method | |
CN114490947A (en) | Dialog service method, device, server and medium based on artificial intelligence | |
CN109871450B (en) | Multi-mode interaction method and system based on textbook reading | |
CN107704612A (en) | Dialogue exchange method and system for intelligent robot | |
CN111261162A (en) | Speech recognition method, speech recognition apparatus, and storage medium | |
CN111858875B (en) | Intelligent interaction method, device, equipment and storage medium | |
CN111311364B (en) | Commodity recommendation method and system based on multi-mode commodity comment analysis | |
CN114138960A (en) | User intention identification method, device, equipment and medium | |
CN116796857A (en) | LLM model training method, device, equipment and storage medium thereof | |
CN114005446A (en) | Emotion analysis method, related equipment and readable storage medium | |
CN117010907A (en) | Multi-mode customer service method and system based on voice and image recognition | |
Abhishek et al. | Aiding the visually impaired using artificial intelligence and speech recognition technology | |
CN110795544A (en) | Content search method, device, equipment and storage medium | |
CN111383138B (en) | Restaurant data processing method, device, computer equipment and storage medium | |
CN108538292B (en) | Voice recognition method, device, equipment and readable storage medium | |
CN112233648B (en) | Data processing method, device, equipment and storage medium combining RPA and AI | |
Jia et al. | A deep learning system for sentiment analysis of service calls | |
CN114048319B (en) | Humor text classification method, device, equipment and medium based on attention mechanism | |
CN116881730A (en) | Chat scene matching system, method, equipment and storage medium based on context | |
Elbarougy et al. | Continuous audiovisual emotion recognition using feature selection and lstm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||