
CN116386149A - Sign language information processing method and system - Google Patents

Sign language information processing method and system

Info

Publication number
CN116386149A
Authority
CN
China
Prior art keywords
sign language
data
image data
confidence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310651017.2A
Other languages
Chinese (zh)
Other versions
CN116386149B (en)
Inventor
杨阳
潘彦蓉
曾珂
张晔
张梦醒
童凯
曹宁
张小兵
刘睿
胡天航
金澄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sure Enough Barrier Free Technology Suzhou Co ltd
Original Assignee
Sure Enough Barrier Free Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sure Enough Barrier Free Technology Suzhou Co ltd filed Critical Sure Enough Barrier Free Technology Suzhou Co ltd
Priority to CN202310651017.2A priority Critical patent/CN116386149B/en
Publication of CN116386149A publication Critical patent/CN116386149A/en
Application granted granted Critical
Publication of CN116386149B publication Critical patent/CN116386149B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of this specification provide a sign language information processing method and system. The method is performed by a processor and comprises the following steps: acquiring sign language image data; determining, based on the sign language image data, multidimensional data of the sign language image data through a preprocessing model, wherein the multidimensional data comprises at least hand image data and background image data, and the preprocessing model is a machine learning model; determining, based on the multidimensional data of the sign language image data, at least one sign language word contained in the sign language image data and its recognition confidence; determining at least one set of semantic data and its semantic confidence based on the at least one sign language word and its recognition confidence; and selecting at least one set of semantic data whose semantic confidence meets a first preset condition as output data, and sending the output data and prompt information to a user.

Description

Sign language information processing method and system
Technical Field
The invention relates to the field of sign language translation, in particular to a sign language information processing method and a sign language information processing system.
Background
Sign language is the main way hearing-impaired people communicate with the outside world; it is an interaction system composed of hand shapes, movements, expressions, postures, and the like. However, very few hearing people are proficient in sign language, so hearing-impaired people have difficulty conveying their ideas and intentions to hearing people, and deaf-mute people still face many difficulties in daily life and travel.
In order to facilitate communication between hearing-impaired people and hearing people, CN112257513A proposes a training method, a translation method, and a system for a sign language video translation model. That application extracts hand features and character features by processing isolated-word sign language video data and trains a machine learning model with the extracted feature information; the trained sign language video translation model can then translate sign language information. However, because most hearing-impaired people have received less education and have their own ways of expressing and understanding, their written expressions and grammatical structures differ greatly from those of hearing people, so that even through text they cannot communicate with hearing people conveniently and without barriers. This can lead to incorrect translations, or to translation results that hearing people cannot quickly understand.
Therefore, it is desirable to provide a sign language information processing method and system that can accurately translate sign language information and improve the communication efficiency between hearing people and hearing-impaired people.
Disclosure of Invention
One or more embodiments of the present specification provide a sign language information processing method, the method being performed by a processor and comprising: acquiring sign language image data; determining, based on the sign language image data, multidimensional data of the sign language image data through a preprocessing model, wherein the multidimensional data comprises at least hand image data and background image data, and the preprocessing model is a machine learning model; determining, based on the multidimensional data of the sign language image data, at least one sign language word contained in the sign language image data and its recognition confidence; determining at least one set of semantic data and its semantic confidence based on the at least one sign language word and its recognition confidence; and selecting at least one set of semantic data whose semantic confidence meets a first preset condition as output data, and sending the output data and prompt information to a user.
One or more embodiments of the present specification provide a sign language information processing system, the system comprising an acquisition module, a determination module, and an output module. The acquisition module is configured to acquire sign language image data. The determination module is configured to: determine, based on the sign language image data, multidimensional data of the sign language image data through a preprocessing model, wherein the multidimensional data comprises at least hand image data and background image data, and the preprocessing model is a machine learning model; determine, based on the multidimensional data of the sign language image data, at least one sign language word contained in the sign language image data and its recognition confidence; and determine at least one set of semantic data and its semantic confidence based on the at least one sign language word and its recognition confidence. The output module is configured to select at least one set of semantic data whose semantic confidence meets a first preset condition as output data, and send the output data and prompt information to a user.
One or more embodiments of the present specification provide a sign language information processing apparatus including at least one processor and at least one memory; the at least one memory is configured to store computer instructions; the at least one processor is configured to execute at least some of the computer instructions to implement a sign language information processing method.
One or more embodiments of the present specification provide a computer-readable storage medium storing computer instructions; when a computer reads the computer instructions in the storage medium, the computer performs the sign language information processing method.
Drawings
The present specification will be further elucidated by way of example embodiments, which will be described in detail by means of the accompanying drawings. The embodiments are not limiting, in which like numerals represent like structures, wherein:
FIG. 1 is an exemplary flow chart of a sign language information processing method according to some embodiments of the present description;
FIG. 2 is an exemplary schematic diagram of a plurality of iterations shown in accordance with some embodiments of the present description;
FIG. 3 is an exemplary diagram of updating a first partition shown in accordance with some embodiments of the present disclosure;
FIG. 4 is an exemplary schematic diagram of training a sign language word recognition model shown in accordance with some embodiments of the present description;
FIG. 5 is an exemplary diagram of determining semantic data and its semantic confidence according to some embodiments of the present description.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present specification, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only some examples or embodiments of the present specification, and it is possible for those of ordinary skill in the art to apply the present specification to other similar situations according to the drawings without inventive effort. Unless otherwise apparent from the context of the language or otherwise specified, like reference numerals in the figures refer to like structures or operations.
It will be appreciated that "system," "apparatus," "unit" and/or "module" as used herein is one method for distinguishing between different components, elements, parts, portions or assemblies at different levels. However, if other words can achieve the same purpose, the words can be replaced by other expressions.
As used in this specification and the claims, the terms "a," "an," "the," and/or "the" are not specific to a singular, but may include a plurality, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements are explicitly identified, and they do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus.
A flowchart is used in this specification to describe the operations performed by the system according to embodiments of the present specification. It should be appreciated that the preceding or following operations are not necessarily performed in order precisely. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
Today, hearing-impaired people still face many difficulties in daily life and travel, and sign language recognition has been developed and widely studied to bridge the communication gap between hearing-impaired people and hearing people. Existing sign language recognition technologies can be divided into contact-based and contactless types. Sign language translation gloves are a typical contact-based device, but they are expensive and inconvenient to carry, which limits their practical value. Contactless sign language recognition systems, on the other hand, have difficulty recognizing the information in sign language images and are prone to unclear translations.
In view of this, some embodiments of this specification provide a sign language information processing method and system that send semantic data and prompt information to a user based on sign language image data, so as to recognize sign language information more accurately, improve the communication efficiency between hearing people and hearing-impaired people, and save labor, time, and cost.
Disclosed in some embodiments of the present specification is a sign language information processing system, which includes an acquisition module, a determination module, and an output module.
In some embodiments, the acquisition module is configured to acquire sign language image data.
In some embodiments, the determining module is configured to: determine, based on the sign language image data, multidimensional data of the sign language image data through a preprocessing model; determine, based on the multidimensional data of the sign language image data, at least one sign language word contained in the sign language image data and its recognition confidence; and determine at least one set of semantic data and its semantic confidence based on the at least one sign language word and its recognition confidence. See fig. 1 for further explanation of the preprocessing model, the multidimensional data, and related content.
In some embodiments, the determining module is further configured to determine, based on the multidimensional data of the sign language image data, at least one sign language word contained in the sign language image data and its recognition confidence through multiple rounds of iteration. See fig. 2 for further explanation of the multiple iterations.
In some embodiments, the determining module is further configured to determine at least one set of semantic data and its semantic confidence through a large language model based on the at least one sign language word and its recognition confidence. See fig. 5 for further description of the large language model.
In some embodiments, the output module is configured to select at least one set of semantic data with semantic confidence satisfying a first preset condition as output data, and send the output data and the prompt information to the user. See fig. 2 for more description of determining output data.
Fig. 1 is an exemplary flow chart of a sign language information processing method according to some embodiments of the present description. In some embodiments, the process 100 may be performed by a processor. As shown in fig. 1, the process 100 includes the following steps.
Step 110, obtaining sign language image data.
Sign language image data is image data that captures the gesture actions of the person performing them. The image data may include video data, image sequence data, and the like. The person performing the gesture actions may include, but is not limited to, a hearing-impaired person, a hearing person, and the like. A gesture action may be a hand movement related to sign language.
In some embodiments, the sign language image data includes at least image data of a hand region. The hand region includes at least a palm and a portion of an arm.
In some embodiments, the processor may obtain sign language image data in a variety of ways. For example, the processor may obtain hand motion information for an hearing impaired person based on the acquisition device. Acquisition device refers to a device, such as a camera, capable of capturing a video or sequence of images. The processor may also obtain sign language image data from the storage device.
Step 120, determining multidimensional data of the sign language image data through a preprocessing model based on the sign language image data.
The multi-dimensional data is multi-dimensional information included in sign language image data.
In some embodiments, the multi-dimensional data may include hand image data, background image data.
Hand image data is image data that includes at least the hand region of the person performing the gesture actions. Hand image data can reflect the spatial information, motion trajectory, and the like of the hands. In some embodiments, the hand image data may comprise a sequence of continuously segmented images of the hand region.
In some embodiments, the hand image data may include overall image data of the person performing the gesture actions. The overall image data is image data that includes both the two-hand region and the torso region involved in performing the hand movements.
The background image data refers to environmental information of a person performing the gesture. In some embodiments, the background image data may include a sequence of images of the continuously segmented background region.
In some embodiments, the sign language image data further comprises an image of a facial region, and the multi-dimensional data further comprises expression image data.
The image of the face region is image data of the face of the person performing the gesture operation. For example, the image of the facial region may reflect at least one of the identity, expression, etc. of the person performing the gesture.
The expression image data is image data that includes at least the expression information of the person performing the gesture actions. In some embodiments, the expression image data may include a sequence of continuously segmented images of the facial region.
By obtaining expression image data, some embodiments of this specification can subsequently obtain additional context, improving translation accuracy.
In some embodiments, a preprocessing model may be used to extract and segment the multidimensional data in the sign language image data. Segmentation may refer to distinguishing multi-dimensional data in sign language image data. For example, hand image data, background image data, and the like in sign language image data are distinguished.
In some embodiments, the preprocessing model may be a machine learning model. In some embodiments, the input of the preprocessing model may include sign language image data, and the output may include multi-dimensional data of the sign language image data (e.g., hand image data, background image data, expression image data, etc.).
In some embodiments, the pre-processing model may be trained from a plurality of first training samples with first labels. In some embodiments, the first training sample may include sample sign language image data, and the first training sample may be obtained from historical data. In some embodiments, the first label is historical multidimensional data corresponding to the first training sample, and the first label may be determined by a processor or by human labeling.
In some embodiments of the present disclosure, data in each dimension is identified and segmented based on the preprocessing model, so that subsequent sign language word recognition model processing is facilitated, mutual interference between data in each dimension can be reduced, and subsequent calculation amount can be reduced.
In some embodiments, the preprocessing model may also be a conventional segmentation algorithm. In some embodiments, the processor may determine the multidimensional data of the sign language image data by a conventional segmentation algorithm based on the sign language image data. Exemplary conventional segmentation algorithms may include, but are not limited to, a combination of one or more of thresholding, region growing, edge detection, and the like.
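As an illustration of such a conventional approach, the following is a minimal sketch of a thresholding-based split of a single frame into a hand region and a background region. It assumes OpenCV is available; the function name and the YCrCb skin-tone bounds are illustrative assumptions, not values given in this specification.

```python
import cv2
import numpy as np

def split_hand_background(frame_bgr: np.ndarray):
    """Return (hand_region, background_region) for a single BGR video frame."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # Rough skin-tone bounds in the Y/Cr/Cb channels (assumed values).
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Remove small speckles so the mask roughly follows the hand region.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    hand = cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)
    background = cv2.bitwise_and(frame_bgr, frame_bgr, mask=cv2.bitwise_not(mask))
    return hand, background
```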
Step 130, determining at least one sign language word and the recognition confidence of the sign language word contained in the sign language image data based on the multidimensional data of the sign language image data.
The term "handicapped" refers to the words used for communication by the hearing impaired. The hand words may include words, phrases, and the like.
The recognition confidence is used for measuring the recognition accuracy of certain hand words. The higher the recognition confidence, the higher the recognition accuracy. The recognition confidence may be in the form of a value, a percentage, a score, etc.
The processor may determine the at least one sign language word and the recognition confidence level of the sign language word included in the sign language image data in a plurality of ways. In some embodiments, the processor may extract a key frame image in the sign language image data, calculate image similarity between the key frame image and a plurality of reference images, determine a reference word corresponding to a reference image having a highest image similarity as the word, and determine the recognition confidence according to the image similarity. For example, the greater the image similarity, the higher the recognition confidence. The key frame image is an action frame containing key changes of sign language image data. The reference image is a preset image containing gesture actions corresponding to sign words.
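The following sketch illustrates, under stated assumptions, the key-frame matching approach described above: compare a key frame against preset reference images and take the best-matching reference word, reusing the similarity score as the recognition confidence. The cosine-over-pixels similarity metric is an assumption; the specification only requires some image similarity measure.

```python
import numpy as np

def match_key_frame(key_frame, references):
    """references: dict mapping a sign language word to a reference image
    of the same shape as key_frame (both numpy arrays)."""
    def cosine(a, b):
        a = a.ravel().astype(float)
        b = b.ravel().astype(float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    scores = {word: cosine(key_frame, image) for word, image in references.items()}
    best_word = max(scores, key=scores.get)
    recognition_confidence = scores[best_word]  # greater similarity -> higher confidence
    return best_word, recognition_confidence
```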
In some embodiments, the processor may determine at least one sign language word contained in the sign language image data and its recognition confidence through multiple rounds of iteration based on the multidimensional data of the sign language image data. See the relevant description of fig. 2 for more on the multiple iterations.
Step 140, determining at least one set of semantic data and its semantic confidence based on the at least one sign language word and its recognition confidence.
Semantic data refers to the meaning expressed by at least one sign language word. In some embodiments, the semantic data may be a combination of multiple sign language words. Multiple sign language words may form one or more combinations, i.e., correspond to one or more sets of semantic data. For example, if the sign language words include (night, stay up, mid-night, morning), the corresponding semantic data may include (stayed up all night without sleeping), (stayed up from night until morning), (stayed up at night, in the morning I ...), and the like.
In some embodiments, the semantic data may also include mood types. The mood type may be determined in a number of ways. In some embodiments, the mood type may be determined from the expression type output by the auxiliary paraphrasing model. In some embodiments, mood types may also be determined from the large language model. For example, the semantic data output by the large language model may include a mood type. For more description of the auxiliary paraphrasing model, the large language model, see the relevant description of fig. 5.
The semantic confidence refers to a measure of the accuracy of the result of semantic parsing based on sign language words. The higher the semantic confidence, the higher the accuracy of the resolved results. Semantic confidence may be in the form of a value, a percentage, a score, etc.
In some embodiments, the processor may determine the semantic confidence in a variety of ways. For example, the processor may determine the semantic confidence based on n-gram-matching automatic evaluation methods such as BLEU and NIST.
In some embodiments, the processor may determine at least one set of semantic data and its semantic confidence by the large language model based on the at least one sign language word and its recognition confidence. For more on the large language model see the relevant description of fig. 5.
Step 150, selecting at least one set of semantic data whose semantic confidence meets a first preset condition as output data, and sending the output data to a user.
The first preset condition is a determination condition for evaluating whether the semantic data is output data. In some embodiments, the first preset condition may include that the semantic confidence is greater than a semantic confidence threshold, or the like. The semantic confidence threshold may be based on a system default value, an empirical value, an artificial preset value, etc., or any combination thereof, and may be set according to actual requirements, which is not limited in this specification.
In some embodiments, the processor may output at least one set of semantic data satisfying the first preset condition to the user through the terminal device. In some embodiments, when there are multiple sets of semantic data meeting the first preset condition, the processor may further send a prompt message through the terminal device to prompt the user to confirm the semantic data.
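A minimal sketch of this selection step, assuming the first preset condition is a simple confidence threshold (the threshold value here is an assumption):

```python
def select_output(semantic_candidates, threshold=0.8):
    """semantic_candidates: list of (semantic_text, semantic_confidence) tuples."""
    # First preset condition (assumed here): semantic confidence above a threshold.
    output = [(text, conf) for text, conf in semantic_candidates if conf > threshold]
    # Multiple qualifying sets -> prompt the user to confirm which one is meant.
    needs_confirmation = len(output) > 1
    return output, needs_confirmation
```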
In some embodiments of the present disclosure, processing the multidimensional data to obtain at least one group of sign language words reduces the computation and recognition required on a large amount of complex sign language image data, lowering computational complexity; determining semantic data based on at least one group of sign language words also yields higher accuracy than translating sign language directly from the sign language image data.
It should be noted that the above description of the process 100 is for illustration and description only, and is not intended to limit the scope of applicability of the present disclosure. Various modifications and changes to the process 100 will be apparent to those skilled in the art in light of the present description. However, such modifications and variations are still within the scope of the present description.
FIG. 2 is an exemplary schematic diagram of multiple iterations shown in accordance with some embodiments of the present description.
In some embodiments, the processor may determine at least one sign language word contained in the sign language image data and its recognition confidence through multiple rounds of iteration based on the multidimensional data of the sign language image data.
In some embodiments, as shown in fig. 2, at least one round of the multiple iterations may include: determining the first division interval of the current round of iteration; determining first division data 220 based on the first division interval 210; determining, through a sign language word recognition model and based on the first division data 220, at least one candidate sign language word and its recognition confidence 230 within the first division interval; updating the first division interval based at least on the at least one candidate sign language word and its recognition confidence 230 within the first division interval; and taking the updated first division interval as the first division interval of the next round of iteration, stopping iteration when a preset iteration condition is met.
The first division interval is an interval obtained by dividing the sign language image data. In some embodiments, the first division interval may be obtained by dividing the sign language image data in units of frame numbers. In some embodiments, the first division interval may be an interval range determined by a start frame position and an end frame position. For example, the first division interval [A, B] indicates that the A-th frame is the start frame position and the B-th frame is the end frame position.
In some embodiments, in response to the current iteration being the first round, the processor may determine the first division interval based on a preset interval length; in response to the current iteration being the k-th round (k is an integer and k ≠ 1), the processor may take the first division interval obtained after the (k-1)-th round of updating as the first division interval of the current round. For more explanation of updating the first division interval, see the related description below.
The preset interval length refers to an interval length of a first division interval for determining a first round of iteration. The preset interval length can be preset by a system or people, etc.
In some embodiments, the processor may determine a sign language expression speed of the sign language image data through a sign language speed recognition model; and determining the preset interval length based on the sign language expression speed, the maximum frame length of the image corresponding to the single sign language word in the sign language word database and the standard image frame rate.
The sign language expression speed means a speed at which a gesture operation corresponding to a sign language is completed. In some embodiments, the sign language expression speed may be represented by the time required to complete a gesture action corresponding to a single sign language word.
In some embodiments, the sign language speed recognition model may be a machine learning model. For example, the sign language speed recognition model may be a neural network model or the like. In some embodiments, the input of the sign language speed recognition model may include sign language image data and the output may include sign language expression speed.
In some embodiments, the sign language speed recognition model may be trained from a plurality of second training samples with second labels. In some embodiments, the second training sample may include sample sign language image data, and the second training sample may be obtained from historical data. In some embodiments, the second label is a historical sign language expression speed corresponding to the second training sample, and the second label may be determined by the processor or by human labeling.
The frame length refers to the number of frames required to complete the gesture action expressing a single sign language word.
The maximum frame length is the maximum of the frame lengths corresponding to the sign language words in the sign language word database. The sign language word database may include image data corresponding to a plurality of sign language words. The sign language word database may be obtained based on historical sign language data or recorded by a relevant technician.
The standard image frame rate refers to the image frame rate of the images corresponding to the sign language words in the sign language word database. The image frame rate represents the frequency (rate) at which bitmap images, measured in frames, appear continuously on a display.
In some embodiments, the processor may determine the preset interval length based on the maximum frame length, the standard image frame rate, the sign language expression speed corresponding to the maximum frame length, the sign language expression speed output by the sign language speed recognition model, and the image frame rate of the sign language image data. For example, the preset interval length may be determined as the product of the maximum frame length, a sign language expression speed scaling factor, and an image frame rate scaling factor. The sign language expression speed corresponding to the maximum frame length may be determined from the image duration of the single sign language word corresponding to the maximum frame length. The sign language expression speed scaling factor may be determined from the ratio of the sign language expression speed corresponding to the maximum frame length to the sign language expression speed output by the sign language speed recognition model. The image frame rate scaling factor may be determined from the ratio of the image frame rate of the sign language image data to the standard image frame rate of the sign language word database.
In some embodiments of the present disclosure, determining the preset interval length that matches the current sign language expression speed from the standard image frame rate, the maximum frame length, and the expression speed corresponding to the maximum frame length yields higher accuracy than an interval length set purely from experience, and avoids the excessive interference with subsequent sign language recognition that a preset interval length that is too narrow or too wide would cause.
In some embodiments, the processor may determine the first division interval based on the preset interval length and the first frame of the sign language image data. For example, the start frame position of the first division interval may be the first frame of the image data, and the end frame position may be the frame that is one preset interval length after the first frame.
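Under the assumption that the expression speed is represented as a rate (higher means faster signing) rather than as a completion time, the preset interval length described above could be computed as in the following sketch; all names and example values are illustrative.

```python
def preset_interval_length(max_frame_len, db_speed, observed_speed, video_fps, db_fps):
    """max_frame_len: longest single-word clip (frames) in the sign language word database.
    db_speed: expression speed implied by that clip; observed_speed: speed output by the
    sign language speed recognition model (both treated as rates, higher = faster).
    video_fps: frame rate of the incoming sign language image data; db_fps: standard frame rate."""
    speed_ratio = db_speed / observed_speed   # faster signing -> shorter interval
    fps_ratio = video_fps / db_fps            # higher capture frame rate -> more frames per sign
    return round(max_frame_len * speed_ratio * fps_ratio)

# Example: a 60-frame maximum-length sign recorded at 25 fps, signed twice as fast
# and captured at 30 fps -> preset_interval_length(60, 1.0, 2.0, 30, 25) == 36 frames.
```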
In some embodiments of the present disclosure, since the number of frames required to complete each sign language word differs, the interval length of the image data corresponding to a sign language word can be obtained through iterative updating, further improving the accuracy of subsequent sign language word recognition.
The first division data refers to the image data corresponding to the first division interval.
In some embodiments, the first division data includes the portion of the multidimensional data that lies within the first division interval. For example, the first division data includes the portion of the hand image data within the first division interval. As another example, the first division data includes the portion of the background image data within the first division interval.
In some embodiments, the processor may obtain the first division data in a number of ways. For example, the processor may divide the multidimensional data according to the first division interval to obtain the first division data.
A candidate sign language word refers to a sign language word recognized within the first division interval during the iterative process.
For more on the recognition confidence, see the relevant description of fig. 1.
In some embodiments, the processor may determine the at least one candidate sign language word and its recognition confidence within the first division interval in a variety of ways. For example, the processor may determine at least one candidate sign language word and its recognition confidence within the first division interval in a manner similar to determining the sign language word and its recognition confidence in fig. 1.
In some embodiments, the processor may determine at least one candidate sign language word and its recognition confidence within the first division interval through a sign language word recognition model based on the first division data.
In some embodiments, the sign language word recognition model may be a machine learning model. For example, the sign language word recognition model may be a convolutional neural network model or the like.
In some embodiments, the input of the sign language word recognition model may include first division data and the output may include at least one candidate sign language word and its recognition confidence.
In some embodiments, the output of the sign language word recognition model may also include the recognition interval corresponding to each candidate sign language word. The recognition interval refers to the sub-interval of the first division interval occupied by a single candidate sign language word. In some embodiments, the recognition interval may be the interval range between the start frame position and the end frame position corresponding to a single candidate sign language word.
In some embodiments, the input of the sign language word recognition model further includes a sign language expression speed within the first division interval. For more description of sign language expression speed, see the related description above.
In some embodiments of this specification, taking the sign language expression speed within the first division interval into account allows the sign language word recognition model to locate and recognize the key sign language actions more accurately, improving recognition accuracy.
In some embodiments, the sign language word recognition model may be obtained through training. For more on model training of the sign language word recognition model, see the relevant description of fig. 4.
In some embodiments, the processor may determine an update parameter for the first division interval based on the recognition confidence of the at least one candidate sign language word, and update the first division interval based on the update parameter. The update parameter is a parameter used to adjust the first division interval. For example, the update parameters may include a parameter for adjusting the start frame position of the first division interval, a parameter for adjusting the end frame position of the first division interval, and a parameter for adjusting the preset interval length. An update parameter may be an adjustment value in units of frames. In some embodiments, the processor may determine the update parameter based on a correspondence between the recognition confidence of the at least one candidate sign language word and the update parameter.
In some embodiments, as shown in fig. 3, the processor may judge whether the at least one recognition confidence corresponding to the at least one candidate sign language word recognized within the first division interval meets a second preset condition, and update the first division interval according to the judgment result.
The second preset condition is a judgment condition for determining how to update the first division interval.
In some embodiments, the second preset condition includes the recognition confidence being greater than a recognition confidence threshold. The recognition confidence threshold may be determined in a number of ways. For example, the recognition confidence threshold may be based on experience or system default settings.
In some embodiments, the recognition confidence threshold may be determined based on sign language expression speed. In some embodiments, the recognition confidence threshold may be positively correlated with sign language expression speed. The faster the sign language is expressed, the greater the recognition confidence threshold.
The sign language expression speed influences the recognition result of the sign language word recognition model. In some embodiments of the present disclosure, determining the recognition confidence threshold based on the sign language expression speed can reduce the influence of an inaccurate recognition confidence threshold on the update manner and improve the effectiveness of the iterative update process.
In some embodiments, as shown in fig. 3, in response to the recognition confidence meeting the second preset condition, the processor may update the first division interval based on the recognition interval of the candidate sign language word and the preset interval length; in response to the recognition confidence not meeting the second preset condition, the processor may update the first division interval based on a preset step size.
In some embodiments, in response to each of the at least one recognition confidence and/or the mean of the at least one recognition confidence meeting the second preset condition, the processor may take the end frame position of the recognition interval of the last recognized candidate sign language word as the start frame position of the updated first division interval, and take the frame position one preset interval length later as the end frame position of the updated first division interval, thereby updating the first division interval. For example, if the first division interval is [0, 100], the recognition interval of candidate sign language word 1 is [0, 30], the recognition interval of candidate sign language word 2 is [40, 85], and the preset interval length is 100, the first division interval may be updated to [85, 185].
In some embodiments, in response to any one of the at least one recognition confidence and/or the mean of the at least one recognition confidence not meeting the second preset condition, the processor may shift the start frame position and end frame position of the previous first division interval by the preset step size, and update the first division interval based on the adjusted start and end frame positions. For example, if the first division interval is [0, 100] and the preset step size is 10 frames, the first division interval is updated to [10, 110].
The preset step size is the length by which the first division interval slides in each update. For example, a preset step size of 10 frames means that the first division interval is moved backward by 10 frames to obtain the updated first division interval. The preset step size may be determined in a number of ways; for example, it may be preset by a person or by the system.
In some embodiments, the preset step size may be determined based on sign language expression speed. In some embodiments, the preset step size may be inversely related to sign language expression speed. The faster the sign language expression speed, the shorter the preset step size.
In some embodiments of this specification, determining the preset step size from the sign language expression speed improves the correspondence between sign language words and the first division interval, further improving recognition accuracy.
In some embodiments of the present disclosure, determining the update manner according to the second preset condition improves the accuracy of the first division interval in each iteration and reduces errors in the temporal division of the image data, thereby improving the accuracy of the sign language word recognition results.
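A minimal sketch of the two update branches described above; a division interval is represented as a [start_frame, end_frame] pair, and the threshold, preset step size, and preset interval length are assumed inputs.

```python
def update_partition(partition, recognitions, preset_len, preset_step, conf_threshold):
    """partition: [start_frame, end_frame]; recognitions: list of
    (word, (start, end), confidence) tuples found in the current interval."""
    confidences = [conf for _, _, conf in recognitions]
    # Second preset condition (every confidence above the threshold; the text also
    # allows judging the mean confidence instead).
    if confidences and min(confidences) > conf_threshold:
        # Confident branch: start right after the last recognized interval.
        last_end = max(end for _, (_, end), _ in recognitions)
        return [last_end, last_end + preset_len]
    # Otherwise slide the whole window forward by the preset step size.
    start, end = partition
    return [start + preset_step, end + preset_step]

# Example from the text: partition [0, 100], recognition intervals [0, 30] and [40, 85],
# preset length 100 -> [85, 185]; a failed confidence check with step 10 -> [10, 110].
```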
The preset iteration condition is a determination condition for evaluating whether or not the iteration is stopped. In some embodiments, the preset iteration condition may include that the number of iterative updates has reached a preset number of times threshold, that sign language image data is completely identified, and so on. The preset times threshold may be a system default value, a system preset value, or the like.
In some embodiments, the processor may take the updated first division interval as the first division interval of the next round of iteration, and continue updating in subsequent rounds until the preset iteration condition is met and the adjustment and updating of the first division interval stop.
In some embodiments, after the iteration stops, the processor may determine the at least one sign language word contained in the sign language image data and its recognition confidence based on the at least one candidate sign language word and its recognition confidence determined in each of the multiple rounds of iteration. For example, the processor may take, from the candidate sign language words output in each round of iteration, those whose recognition confidence is greater than the recognition confidence threshold, together with their recognition confidences, as the finally recognized sign language words and their recognition confidences.
In some embodiments, during the iterative process, the recognition results of the processor for the first division interval may include multiple groups. Each group of recognition results includes at least one candidate sign language word and its recognition confidence. For example, one group of recognition results may be [(sign language word 1, recognition interval 1, confidence 1), (sign language word 2, recognition interval 2, confidence 2)], and another group may be [(sign language word 3, recognition interval 3, confidence 3), (sign language word 4, recognition interval 4, confidence 4)].
In some embodiments, when the recognition results for the first division interval include multiple groups, the processor may perform multiple rounds of iterative updating of the first division interval based on each group of recognition results, finally outputting multiple groups of iterative updating results, and determine multiple groups of final recognition results from the multiple groups of iterative updating results. Each group of iterative updating results includes at least one candidate sign language word and its confidence; each group of final recognition results includes at least one sign language word and its confidence.
For example, continuing the example above, the processor may perform multiple rounds of iterative updating based on one group of recognition results [(sign language word 1, recognition interval 1, confidence 1), (sign language word 2, recognition interval 2, confidence 2)] and, separately, based on the other group [(sign language word 3, recognition interval 3, confidence 3), (sign language word 4, recognition interval 4, confidence 4)], finally obtaining two groups of iterative updating results and thereby two groups of final recognition results.
In some embodiments, the processor may output the group of final recognition results with the greatest mean recognition confidence. In some embodiments, the processor may output a group of final recognition results that meets a preset confidence condition. In some embodiments, the preset confidence condition may include that the recognition confidence of each sign language word in the group of final recognition results is greater than a first threshold. In some embodiments, the preset confidence condition may include that the mean recognition confidence of the sign language words in the group of final recognition results is greater than a second threshold. The first threshold and the second threshold are threshold conditions related to the recognition confidence and may be preset by the system or by a person.
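As an illustration of one of the selection strategies above (choosing the group with the greatest mean recognition confidence), a minimal sketch:

```python
def pick_final_group(result_groups):
    """result_groups: list of groups, each group a list of (sign_language_word, confidence)."""
    def mean_confidence(group):
        return sum(conf for _, conf in group) / len(group)
    return max(result_groups, key=mean_confidence)
```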
In some embodiments of the present disclosure, judging the recognition confidence against the second preset condition determines the update manner of the first division interval in each of the multiple rounds of iteration, so that the multidimensional data is divided according to a new first division interval in each round that better matches the actual signing (for example, a faster or slower sign language expression speed), making the correspondence between sign language words and the first division interval better and improving recognition accuracy.
FIG. 4 is an exemplary schematic diagram of training a sign language word recognition model shown in accordance with some embodiments of the present description.
In some embodiments, as shown in FIG. 4, the processor may obtain the sign language word recognition model through training. The training process comprises the following steps: acquiring a sample set, wherein the sample set comprises at least one gold standard sample and its label 410, each gold standard sample corresponds to the image expression data of a sign language word, and the label of a gold standard sample comprises the text expression data of the sign language word corresponding to that gold standard sample and a maximum confidence value; and training an initial sign language word recognition model 430 based on the sample set to determine the sign language word recognition model 440.
The sample set is a sample set used for training an initial sign language word recognition model to obtain the sign language word recognition model. In some embodiments, the sample set may include at least one gold standard sample, a label thereof, and the like.
A gold standard sample corresponds to the image expression data of a sign language word.
The label corresponding to a gold standard sample is related to the sign language word corresponding to that gold standard sample. In some embodiments, the label corresponding to a gold standard sample may include the text expression data of the corresponding sign language word and a maximum confidence value.
The text expression data expresses, in text form, the sign language word corresponding to the gold standard sample.
The maximum confidence value refers to the maximum recognition confidence of the sign language word corresponding to the gold standard sample. For example, when the recognition confidence is any value in the range 0 to 1, the maximum confidence value may be 1, indicating that the image expression data of the sign language word completely matches the corresponding text expression data.
In some embodiments, the processor may obtain the gold standard sample and its corresponding tag in a variety of ways. For example, the processor may obtain gold standard samples and their corresponding tags based on a number of experiments. An exemplary experiment may be: the Chinese sign language data set CSL (Chinese sign language dataset) is selected, each sentence in the Chinese sign language data set is recorded by sign language recording personnel through multiple gestures to obtain a gold standard sample, and labels corresponding to the gold standard sample are marked and determined by personnel or a processor.
In some embodiments, using cameras disposed at at least one preset point location, the processor may separately capture, from at least one shooting angle, image recording data of at least one sign language recorder performing at least one sign language word, and synchronously record the voice expression data of the one or more sign language words, where the sign language recorders express sign language with different dialect characteristics and/or different regional characteristics; the image recording data is then split by single sign language word, the image expression data of each single sign language word obtained after splitting is taken as a gold standard sample, and the text expression data obtained by text recognition of the corresponding voice expression data is taken as its label.
A preset point location is a preset shooting position. For example, the preset point location may be set in the surrounding environment of the sign language recorder.
The shooting angle may include the shooting height, shooting direction, shooting distance, and the like. The shooting height refers to the shooting position of the camera in the vertical direction, and may include level, upward-tilted, and downward-tilted shots. The shooting direction refers to the relative position of the camera and the sign language recorder within the full 360 degrees around them on the same horizontal plane. The shooting distance refers to the horizontal distance between the camera and the sign language recorder.
A sign language recorder is a person who performs the gesture actions and participates in the sign language image recording.
The image recording data refers to data obtained by a camera. The image recording data may be images or videos.
The voice expression data is the voice data corresponding to a sign language word. For example, while performing the gesture action, the sign language recorder speaks the corresponding sign language word.
Splitting refers to dividing the image recording data so that each piece of sub-image recording data after splitting corresponds to a single sign language word.
In some embodiments, the processor may split the image recording data based on a video segmentation algorithm, which may include at least one of a temporal segmentation method, a scene segmentation method, a keyframe segmentation method, and the like.
In some embodiments of the present disclosure, cameras at multiple preset point locations and multiple shooting angles are used to obtain a large number of training samples, which improves the accuracy and quality of the sample set and facilitates the subsequent training of the sign language word recognition model.
In some embodiments, the sample set further includes at least one variant sample and its tag 420.
Variant samples refer to samples that were subjected to variant processing based on gold standard samples. The variant samples are used to expand the sample set.
The label corresponding to the variant sample is associated with a sign word corresponding to the variant sample.
In some embodiments, the label corresponding to a variant sample may include the text expression data of the sign language word corresponding to the variant sample and a recognition confidence.
In some embodiments, variant processing includes processing of gold standard samples using video processing methods. In some embodiments, variant processing may include types of stationary transforms and non-stationary transforms.
In some embodiments, a stationary transformation is an enhancement of the gold standard sample, including at least one of increasing or decreasing the frame rate, increasing or decreasing the brightness, and the like.
In some embodiments, a non-stationary transformation is an interference applied to the gold standard sample, including at least one of frame extraction (dropping frames), stitching, and the like, applied to the image data. The frame extraction may be random or based on a preset interval. Through non-stationary transformation, the recognition confidence corresponding to the variant sample can be reduced.
In some embodiments, the label corresponding to a variant sample is related to the type of variant processing. For example, the recognition confidence of a variant sample produced by a stationary transformation is equal to, or slightly lower than, the maximum confidence value of the corresponding gold standard sample; the recognition confidence of a variant sample produced by a non-stationary transformation is substantially lower than the maximum confidence value of the corresponding gold standard sample. A slight reduction means the reduction falls within a first preset range, and a substantial reduction means the reduction falls within a second preset range, where the first preset range is smaller than the second preset range; both may be preset by the system or by a person.
According to some embodiments of the specification, the gold standard sample is subjected to variant processing, so that the sample set is expanded, the association relation between input and output is enhanced, the accuracy of a sign language word recognition model is improved, and overfitting is prevented.
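The following sketch illustrates the two kinds of variant processing under stated assumptions: a stationary brightness change that largely preserves the label's confidence, and a non-stationary random frame drop that lowers it. The gain value, drop ratio, and the idea that the caller adjusts the label confidence accordingly are assumptions.

```python
import random
import numpy as np

def stationary_variant(frames, gain=1.2):
    """Brightness change: the variant label keeps (or nearly keeps) the maximum confidence."""
    return [np.clip(f.astype(float) * gain, 0, 255).astype(np.uint8) for f in frames]

def non_stationary_variant(frames, drop_ratio=0.2):
    """Random frame drop: the variant label's recognition confidence is reduced substantially."""
    kept = [f for f in frames if random.random() > drop_ratio]
    return kept if kept else frames[:1]
```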
In some embodiments, the samples in the sample set are required to meet the diversity distribution requirement.
In some embodiments, the diversity distribution requirements include at least a shooting angle diversity distribution requirement and an individual diversity distribution requirement.
The requirement for the diversity distribution of shooting angles means that image expression data shot from different angles exist in a sample set, and the quantity of the image expression data shot from each angle reaches a first preset proportion. The first preset ratio may be based on experience or system default settings.
The individual diversity distribution requirement means that the sample set contains image expression data from different sign language recorders with different hand characteristics, and that the amount of image expression data for each hand characteristic reaches a second preset proportion. The second preset proportion may be based on experience or a system default setting.
In some embodiments, the diversity distribution requirements may also include morphological diversity distribution requirements.
The morphological diversity distribution requirement means that the sample set contains image expression data with different sign language speeds, and that the amount of image expression data for each sign language speed reaches a third preset proportion. The third preset proportion may be based on experience or a system default setting. Note that the sign language speed may be the overall sign language speed of the image recording data, or the sign language expression speed corresponding to a single sign language word.
In some embodiments of the present disclosure, having the sample set meet the morphological diversity distribution requirement ensures that image expression data covering different signing speeds is obtained, improving the diversity of the samples.
In some embodiments of this specification, having the sample set meet the diversity distribution requirements achieves sample diversity, improves the accuracy of the trained model, and avoids overfitting or underfitting of the sign language word recognition model.
In some embodiments, the sign language word recognition model may be trained on a large number of labeled samples. Illustratively, a plurality of labeled gold standard samples and a plurality of labeled variant samples are input into an initial sign language word recognition model, a loss function is constructed from the labels and the outputs of the initial sign language word recognition model, and the parameters of the initial sign language word recognition model are iteratively updated by gradient descent or another method based on the loss function. When a preset condition is met, training is complete and the trained sign language word recognition model is obtained. The preset condition may be that the loss function converges, that the number of iterations reaches a threshold, or the like.
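A hedged sketch of one training update is shown below, assuming a PyTorch-style model that jointly predicts a sign language word and its recognition confidence; the combined cross-entropy and mean-squared-error loss is one possible construction of the loss function, not the only one contemplated here.

import torch
import torch.nn as nn

def train_step(model, optimizer, clips, word_ids, conf_targets):
    """One gradient-descent update of the initial sign language word recognition model.

    clips:        batch of image expression data, shape (B, T, C, H, W)
    word_ids:     indices of the labelled sign language words, shape (B,)
    conf_targets: labelled confidence values (maximum value for gold standard
                  samples, reduced values for variant samples), shape (B,)
    """
    word_logits, conf_pred = model(clips)   # model returns (logits, confidence)
    loss = (nn.functional.cross_entropy(word_logits, word_ids)
            + nn.functional.mse_loss(conf_pred.squeeze(-1), conf_targets))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()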
Some embodiments of the present disclosure extend the distribution of sample sets by providing diverse samples for different sign language words, making the generalization of the sign language word recognition model more robust.
FIG. 5 is an exemplary diagram of determining semantic data and its semantic confidence according to some embodiments of the present description.
In some embodiments, the processor may determine at least one set of semantic data and its semantic confidence 550 through the large language model 540 based on the at least one sign language word and its recognition confidence 510.
A large language model (Large Language Model, LLM) is a class of computer programs trained with deep learning algorithms whose purpose is to simulate the generation of human language, producing text similar to text written by humans. A large language model may adopt a multi-level, multi-granularity model structure, meaning that when generating natural language text the model can progressively increase the level and granularity so as to produce more detailed and complex text. Exemplary large language models may include, but are not limited to, ChatGPT, GPT-3, BERT, XLNet, and the like.
In some embodiments, the input of the large language model 540 may be one or more of the sign language words and their recognition confidence 510, and the output may be at least one set of semantic data and its semantic confidence 550. For example, when the inputs of the large language model include (night, confidence a), (stay up, confidence b), (midnight, confidence c), (morning, confidence d), the outputs may include (stayed up all night without sleeping, confidence e), (stayed up from night until morning, confidence f), (stayed up at night, in the morning I..., confidence g), etc. For more explanation about sign language words and their recognition confidence, and semantic data and their semantic confidence, see fig. 1.
In some embodiments, the input of the large language model 540 may also include a background type 523, an expression type 526.
The background type can refer to the background where the hearing impaired person in the sign language image is located when talking. For example, the types of contexts may include banks, vegetable markets, hospitals, and the like.
The expression type can refer to the expression appearing when the hearing impaired person in the sign language image converses. For example, expressions may include happiness, anger, and distraction, etc.
In some embodiments, the background type and the expression type may be obtained in a variety of ways. For example, the background type and the expression type may be determined based on manual input.
In some embodiments, as shown in FIG. 5, the background type 523, the expression type 526 may be obtained based on the auxiliary paraphrasing model 520.
In some embodiments, the auxiliary paraphrasing model 520 may be a machine learning model. In some embodiments, the auxiliary paraphrasing model 520 may be a machine learning model with the custom structure described below; it may also be a machine learning model of another structure, such as a neural network model, a convolutional neural network model, or the like.
In some embodiments, the auxiliary paraphrasing model 520 may include a scene determination layer 522 and an expression recognition layer 525.
The scene determination layer 522 may determine the type of background in which the hearing impaired person is talking based on the background image data.
In some embodiments, the input of the scene determination layer 522 may be the background image data 521 and the output may be the background type 523. Since the background of a dialog typically does not change frequently, the input to the scene determination layer 522 may also be a single extracted frame of the background image. See fig. 1 for more description of the background image data.
In some embodiments, the scene determination layer 522 may be a convolutional neural network (CNN) or a similar model.
In some embodiments, the scene determination layer 522 may be trained based on a number of third training samples with third tags. The third training sample may be sample background image data, and the third label of the third training sample may be a background type corresponding to the sample background image data. In some embodiments, a third training sample may be obtained based on historical data and a third label may be determined based on manual annotation.
The expression recognition layer 525 may determine the type of expression that occurs when the hearing impaired person converses based on the expression image data.
In some embodiments, the input of the expression recognition layer 525 may be the expression image data 524 and the output may be the expression type 526. See fig. 1 for more explanation of the expression image data.
In some embodiments, the expression recognition layer 525 may be a neural network (NN) or a similar model.
In some embodiments, expression recognition layer 525 may be trained based on a number of fourth training samples with fourth tags. The fourth training sample may be sample expression image data, and the fourth label of the fourth training sample may be an expression type corresponding to the sample expression image data. In some embodiments, a fourth training sample may be obtained based on historical data and a fourth label may be determined based on manual annotation.
In some embodiments, the fourth training samples may include expression images occluded by fingers, for example, face images partially blocked by sign language gestures. The fourth label of such a sample may include the expression type corresponding to the occluded expression image. Training the expression recognition layer with finger-occluded expression images improves its ability to recognize the expression type when the face is partially hidden, so that a more accurate expression type can be obtained.
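Purely as an illustration, the auxiliary paraphrasing model might be organized as below, with a CNN-based scene determination layer and a small fully connected expression recognition layer; the layer sizes, class counts, and the assumed 64x64 expression image resolution are assumptions, not limitations of this specification.

import torch
import torch.nn as nn

class AuxiliaryParaphrasingModel(nn.Module):
    """Sketch: scene determination layer (CNN) + expression recognition layer (NN)."""

    def __init__(self, n_backgrounds=10, n_expressions=7):
        super().__init__()
        self.scene_layer = nn.Sequential(            # scene determination layer 522
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_backgrounds),
        )
        self.expression_layer = nn.Sequential(        # expression recognition layer 525
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 128), nn.ReLU(),
            nn.Linear(128, n_expressions),
        )

    def forward(self, background_image, expression_image):
        # A single extracted background frame suffices, since the background
        # rarely changes during a conversation.
        background_type = self.scene_layer(background_image).argmax(-1)
        expression_type = self.expression_layer(expression_image).argmax(-1)
        return background_type, expression_type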
In some embodiments of the present disclosure, obtaining the background type and the expression type through the auxiliary paraphrasing model makes it possible to quickly obtain the background type and the expression type relevant to the dialogue. By inputting the background type and the expression type into the large language model, the generation range of the semantic data can be constrained and its content enriched, so that more accurate and complete semantic data and semantic confidence are obtained. For example, when the background type is a bank and the sign language words include (I, previous, take), the corresponding semantic data should be "I want to withdraw money" rather than "I want to pick up vegetables"; when the background type is a vegetable market and the sign language words include (I, vegetable, front), the corresponding semantic data should be "I want to buy vegetables" rather than "I want money for buying vegetables".
In some embodiments, the input of the large language model 540 may also include data 530 associated with the segment of sign language imagery.
The data associated with the sign language video may refer to data associated with the sign language video content or scene. For example, the data associated with the sign language video may be historical data of the conversation in the actual conversation scene. The historical data may include the preceding sentences of the current conversation, or the preceding sentences of a similar conversation.
In some embodiments of the present disclosure, by inputting data associated with the sign language image into the large language model, the transferability of the associated data is fully utilized, and more accurate semantic data and confidence can be obtained.
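One possible way to assemble the inputs of the large language model into a prompt is sketched below; the prompt wording, the build_prompt helper, and the example values are illustrative assumptions, and the resulting text would then be submitted to a model such as ChatGPT through its interface.

def build_prompt(word_conf_pairs, background=None, expression=None, history=None):
    """Assemble a prompt asking a large language model to return candidate
    semantic data, each with a semantic confidence."""
    lines = ["Recognized sign language words (word, recognition confidence):"]
    lines += [f"- {w}: {c:.2f}" for w, c in word_conf_pairs]
    if background:
        lines.append(f"Conversation background: {background}")
    if expression:
        lines.append(f"Speaker expression: {expression}")
    if history:
        lines.append(f"Earlier sentences in the conversation: {history}")
    lines.append("Return several candidate sentences, each with a confidence in [0, 1].")
    return "\n".join(lines)

prompt = build_prompt([("night", 0.9), ("stay up", 0.8), ("morning", 0.7)],
                      background="hospital", expression="tired")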
In some embodiments, an off-the-shelf large language model may be employed for processing, for example, an artificial intelligence model such as ChatGPT.
In some embodiments of the present disclosure, a large language model is used to determine the at least one set of semantic data and its semantic confidence, so that more accurate and more reliable semantic data can be obtained. Adopting a ready-made large language model requires no training, is easy to obtain, and saves labor and time costs.
Some embodiments of the present specification provide a sign language information processing apparatus, the apparatus including at least one processor and at least one memory; at least one memory for storing computer instructions; at least one processor is configured to execute at least some of the computer instructions to implement the sign language information processing method of any one of the embodiments of the present specification.
Some embodiments of the present description provide a computer-readable storage medium storing computer instructions that, when read by a computer in the storage medium, are executed by the computer to implement a sign language information processing method according to any one of the embodiments of the present description.
While the basic concepts have been described above, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations of the present disclosure may occur to those skilled in the art. Such modifications, improvements, and adaptations are suggested by this specification and are therefore intended to fall within the spirit and scope of the exemplary embodiments of this specification.
Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that, in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not intended to imply that more features than are recited in the claims are required by the present description. Indeed, claimed subject matter may lie in less than all features of a single embodiment disclosed above.
In some embodiments, numbers are used to describe quantities of components and attributes; it should be understood that such numbers, as used in the description of the embodiments, are in some examples modified by the terms "about," "approximately," or "substantially." Unless otherwise indicated, "about," "approximately," or "substantially" indicates that the number allows a variation of 20%. Accordingly, in some embodiments, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties sought by the individual embodiments. In some embodiments, the numerical parameters should take into account the specified significant digits and employ a general method of retaining that number of digits. Although the numerical ranges and parameters set forth herein are approximations, in particular embodiments such numerical values are set as precisely as practicable.
Each patent, patent application publication, and other material, such as articles, books, specifications, publications, and documents, referred to in this specification is hereby incorporated by reference in its entirety. Application history documents that are inconsistent with or conflict with the content of this specification are excluded, as are documents, now or later appended to this specification, that would limit the broadest scope of the claims of this specification. It is noted that if the description, definition, and/or use of a term in material appended to this specification is inconsistent or in conflict with what is described in this specification, the description, definition, and/or use of the term in this specification controls.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.

Claims (10)

1. A sign language information processing method, the method being performed by a processor, the method comprising:
acquiring sign language image data;
based on the sign language image data, determining multidimensional data of the sign language image data through a preprocessing model, wherein the multidimensional data at least comprises hand image data and background image data, and the preprocessing model is a machine learning model;
determining at least one sign language word and identification confidence degree contained in the sign language image data based on the multidimensional data of the sign language image data;
determining at least one set of semantic data and semantic confidence thereof based on the at least one sign language word and recognition confidence thereof;
and selecting at least one group of semantic data with the semantic confidence meeting a first preset condition as output data, and sending the output data and prompt information to a user.
2. The method of claim 1, wherein the determining at least one sign language word and its recognition confidence contained in the sign language image data based on the multidimensional data of the sign language image data comprises:
determining at least one sign language word and identification confidence thereof contained in the sign language image data through multiple rounds of iteration based on the multidimensional data of the sign language image data;
at least one of the plurality of iterations includes:
determining a first partition interval of the round of iteration;
determining first division data based on the first division section, wherein the first division data comprises partial data of the multidimensional data in the first division section;
determining at least one candidate sign language word and recognition confidence level in the first division interval through a sign language word recognition model based on the first division data, wherein the sign language word recognition model is a machine learning model;
updating the first partition interval based at least on the at least one candidate sign language word and its recognition confidence;
taking the updated first partition interval as the first partition interval of the next iteration, and stopping iteration until a preset iteration condition is met;
the determining, through the multiple rounds of iteration, the at least one sign language word contained in the sign language image data and the recognition confidence thereof comprises:
and determining the at least one sign language word contained in the sign language image data and the recognition confidence thereof based on the at least one candidate sign language word and the recognition confidence thereof determined by each iteration in the plurality of iterations.
3. The method of claim 2, wherein the updating the first partition interval based at least on the at least one candidate sign language word and its recognition confidence comprises:
updating the first divided section based on the recognition section of the at least one candidate word and a preset section length in response to the recognition confidence satisfying a second preset condition;
and updating the first partition interval based on a preset step length in response to the recognition confidence degree not meeting a second preset condition.
4. The method of claim 2, wherein the sign language word recognition model is derived based on training comprising:
acquiring a sample set, wherein the sample set comprises at least one gold standard sample and a label thereof, one gold standard sample in the at least one gold standard sample corresponds to image expression data of a sign language word, and the label of the gold standard sample comprises text expression data of the sign language word corresponding to the gold standard sample and a maximum confidence value;
Training an initial sign language word recognition model based on the sample set, and determining the sign language word recognition model.
5. The method of claim 4, wherein the method of obtaining the at least one gold standard sample and its tag comprises:
shooting, by a camera arranged at at least one preset point and according to at least one shooting angle, image recording data of at least one sign language recording person performing at least one sign language word, and synchronously recording voice expression data of the one or more sign language words, wherein the at least one sign language recording person expresses sign language with different dialect characteristics and/or different regional characteristics respectively;
splitting the image recording data by single sign language word, taking the image expression data of each single sign language word obtained after splitting as a gold standard sample, and taking text expression data, obtained by text recognition of the voice expression data corresponding to the image expression data, as the label.
6. The method of claim 1, wherein determining at least one set of semantic data and its semantic confidence based on the at least one sign language word and its recognition confidence comprises:
And determining the at least one group of semantic data and the semantic confidence thereof through a large language model based on the at least one sign language word and the recognition confidence thereof.
7. A sign language information processing system, comprising an acquisition module, a determination module, and an output module;
the acquisition module is used for acquiring sign language image data;
the determining module is used for:
based on the sign language image data, determining multidimensional data of the sign language image data through a preprocessing model, wherein the multidimensional data at least comprises hand image data and background image data, and the preprocessing model is a machine learning model;
determining at least one sign language word and identification confidence degree contained in the sign language image data based on the multidimensional data of the sign language image data;
determining at least one set of semantic data and semantic confidence thereof based on the at least one sign language word and recognition confidence thereof;
the output module is used for selecting at least one group of semantic data with the semantic confidence meeting a first preset condition as output data and sending the output data and prompt information to a user.
8. The system of claim 7, wherein the determination module is further configured to:
Determining at least one sign language word and identification confidence thereof contained in the sign language image data through multiple rounds of iteration based on the multidimensional data of the sign language image data;
at least one of the plurality of iterations includes:
determining a first partition interval of the round of iteration;
determining first division data based on the first division section, wherein the first division data comprises partial data of the multidimensional data in the first division section;
determining at least one candidate sign language word and recognition confidence level in the first division interval through a sign language word recognition model based on the first division data, wherein the sign language word recognition model is a machine learning model;
updating the first partition interval based at least on the at least one candidate sign language word and its recognition confidence;
taking the updated first partition interval as the first partition interval of the next iteration, and stopping iteration until a preset iteration condition is met;
the determining, through the multiple rounds of iteration, the at least one sign language word contained in the sign language image data and the recognition confidence thereof comprises:
and determining the at least one sign language word contained in the sign language image data and the recognition confidence thereof based on the at least one candidate sign language word and the recognition confidence thereof determined by each iteration in the plurality of iterations.
9. A sign language information processing apparatus, comprising at least one processor and at least one memory;
the at least one memory is configured to store computer instructions;
the at least one processor is configured to execute at least some of the computer instructions to implement the sign language information processing method according to any one of claims 1 to 6.
10. A computer-readable storage medium storing computer instructions, wherein when the computer instructions in the storage medium are read by a computer, the computer performs the sign language information processing method according to any one of claims 1 to 6.
CN202310651017.2A 2023-06-05 2023-06-05 Sign language information processing method and system Active CN116386149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310651017.2A CN116386149B (en) 2023-06-05 2023-06-05 Sign language information processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310651017.2A CN116386149B (en) 2023-06-05 2023-06-05 Sign language information processing method and system

Publications (2)

Publication Number Publication Date
CN116386149A true CN116386149A (en) 2023-07-04
CN116386149B CN116386149B (en) 2023-08-22

Family

ID=86971513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310651017.2A Active CN116386149B (en) 2023-06-05 2023-06-05 Sign language information processing method and system

Country Status (1)

Country Link
CN (1) CN116386149B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992783A (en) * 2019-10-29 2020-04-10 东莞市易联交互信息科技有限责任公司 Sign language translation method and translation equipment based on machine learning
CN111913575A (en) * 2020-07-24 2020-11-10 合肥工业大学 Method for recognizing hand-language words
CN113272893A (en) * 2019-01-09 2021-08-17 三星电子株式会社 System and method for multiple spoken language detection
CN113723327A (en) * 2021-09-06 2021-11-30 河海大学 Real-time Chinese sign language recognition interactive system based on deep learning
CN114067362A (en) * 2021-11-16 2022-02-18 平安普惠企业管理有限公司 Sign language recognition method, device, equipment and medium based on neural network model
CN115909505A (en) * 2022-12-26 2023-04-04 达闼科技(北京)有限公司 Control method and device of sign language recognition equipment, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN116386149B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN109524006B (en) Chinese mandarin lip language identification method based on deep learning
Koller et al. Deep learning of mouth shapes for sign language
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
CN107972028B (en) Man-machine interaction method and device and electronic equipment
Liu et al. Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition
Wazalwar et al. Interpretation of sign language into English using NLP techniques
KR102167760B1 (en) Sign language analysis Algorithm System using Recognition of Sign Language Motion process and motion tracking pre-trained model
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
US20230325611A1 (en) Video translation platform
CN115329779A (en) Multi-person conversation emotion recognition method
Koller et al. Read my lips: Continuous signer independent weakly supervised viseme recognition
CN115187704A (en) Virtual anchor generation method, device, equipment and storage medium
Burton et al. The speaker-independent lipreading play-off; a survey of lipreading machines
Hrúz et al. Automatic fingersign-to-speech translation system
KR20190121593A (en) Sign language recognition system
Arakane et al. Conformer-based lip-reading for Japanese sentence
CN115312030A (en) Display control method and device of virtual role and electronic equipment
CN116386149B (en) Sign language information processing method and system
Ivanko et al. Designing advanced geometric features for automatic Russian visual speech recognition
CN116912373B (en) Animation processing method and system
US20230290371A1 (en) System and method for automatically generating a sign language video with an input speech using a machine learning model
Chiţu¹ et al. Automatic visual speech recognition
Zheng et al. Review of lip-reading recognition
Chitu et al. Visual speech recognition automatic system for lip reading of Dutch
Yu Computer-aided english pronunciation accuracy detection based on lip action recognition algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant