GB2231246A - Converting text input into moving-face picture - Google Patents
- Publication number
- GB2231246A
- Authority
- GB
- United Kingdom
- Prior art keywords
- mouth
- shape
- phoneme
- picture
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Processing Or Creating Images (AREA)
- Image Generation (AREA)
Abstract
A moving picture of a face with mouth-shape variations corresponding to a text sentence input is produced. The input sentence is divided into a train of phonemes and a speech synthesis technique capable of outputting a voice feature of each phoneme and its duration is utilized. Based on the voice feature, a mouth-shape feature corresponding to each phoneme is determined 3. Based on the mouth-shape feature, the value of a mouth-shape parameter is determined 5, 4 for representing a mouth shape. Further, the value of the mouth-shape parameter for each frame of the moving picture is controlled 2 in accordance with the duration of each phoneme, thereby synthesizing the moving face picture having mouth-shape variations which agree with the speech output.
Description
PICTURE SYNTHESIZING METHOD AND APPARATUS
The present invention relates to a method for synthesizing a picture through digital processing, and more particularly, to a system for synthesizing a (still or moving) picture of a face which represents changes in the shape of a mouth accompanying the production of a speech output.
When a man utters a vocal sound, vocal information is produced by an articulator, and at the same time, his mouth moves as he utters (i.e. the shape of the mouth changes in outward appearance). A method which converts a sentence input as an input text to speech information and outputs it is called speech synthesis, and this method has achieved a fair success. In contrast thereto, few reports have been published on a method for producing a picture of a face which has mouth-shape variations in correspondence to an input sentence, except the following report by Kiyotoshi Matsuoka and Kenji Kurosu.
The method proposed by Matsuoka and Kurosu is disclosed in a published paper [Kiyotoshi Matsuoka and Kenji Kurosu: "A moving picture program for training in speech reading for the deaf," Journal of the Institute of Electronic Information and Communication Engineers of Japan, Vol. J70-D, No. 11, pp. 2167-2171 (November 1987)].
Besides, there has also been reported, as related prior art, a method for presuming mouth-shape variations corresponding to an input text. This method is disclosed in a published paper [Shigeo Morishima, Kiyoharu Aizawa and Hiroshi Harashima: "Studies of automatic synthesis of expressions on the basis of speech information," 4th NICOGRAPH article contest, Collection of Articles, pp. 139-146, Nihon Computer Graphics Association (November 1988)]. This article proposes a method which calculates the logarithmic mean power of the input speech information and controls the opening of the mouth accordingly, and a method which calculates a linear prediction coefficient corresponding to the formant characteristic of the vocal tract and presumes the mouth shape.
The method by Matsuoka and Kurosu has been described above as a conventional method for producing pictures of a face which have mouth-shape variations corresponding to a sentence (an input text) being input, but this method poses the following problems. Although a vocal sound and the mouth shape are closely related to each other in utterance, the method basically syllabicates the sentence and selects mouth-shape patterns on the basis of character-level correspondence; consequently, the correlation between the speech generating mechanism and the mouth-shape generation is insufficient. This introduces difficulty in producing the mouth shape correctly in correspondence to the speech output. Further, although a phoneme (a minimum unit in utterance, a syllable being composed of a plurality of phonemes) differs in duration in accordance with its connection to the preceding and following phonemes, the method by Matsuoka and Kurosu fixedly assigns four frames to each syllable, and consequently, it is difficult to represent natural mouth-shape variations in correspondence to the input sentence. Moreover, in the case of outputting the sound and the mouth-shape picture in response to the sentence being input, it is difficult to match them with each other.
The method proposed by Morishima, Aizawa and Harashima is to presume the mouth shape on the basis of input speech information, and hence cannot be applied to the production of a moving picture which has mouth-shape variations corresponding to the input sentence.
In view of the above, an object of the present invention is to provide picture synthesizing method and apparatus which permit the representation of mouth-shape variations, which correspond accurately to speech outputs and agree with the durations of phonemes.
According to an aspect of the present invention, the picture synthesizing method for generating a moving face picture with mouth-shape variations corresponding to a sentence input divides the sentence input into a train of phonemes and utilizes the speech synthesis technique capable of outputting a voice feature of each phoneme and its duration. Based on the voice feature, a mouth-shape feature corresponding to each phoneme is determined. Based on the mouth-shape feature, the value of a mouth-shape parameter is determined for representing a concrete mouth shape. Further, the value of the mouth-shape parameter for each frame of the moving picture is controlled in accordance with the duration of each phoneme, thereby synthesizing the moving face picture having mouth-shape variations which agree with the speech output.
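By way of illustration only, the flow just described can be condensed into a few lines of Python; the table contents, feature strings and parameter values below are hypothetical stand-ins for the outputs of a real speech synthesizer and conversion table, not part of the disclosure.

```python
FRAME_PERIOD = 1.0 / 30.0  # one NTSC frame period (1/30 second)

# Hypothetical stand-ins for the converter (3) and conversion table (4).
VOICE_TO_MOUTH_FEATURE = {"k": "lv2 lhx jaw2 tbck", "o": "lv2 lh1 jaw2"}
MOUTH_FEATURE_TO_PARAMS = {
    "lv2 lhx jaw2 tbck": {"lip_gap": 0.4, "lip_round": 0.2},
    "lv2 lh1 jaw2": {"lip_gap": 0.6, "lip_round": 0.7},
}

def synthesize_mouth_frames(phonemes):
    """phonemes: list of (voice feature, duration in seconds) pairs, as the
    speech synthesizer (1) would emit them for the input sentence."""
    frames = []
    for voice_feature, duration in phonemes:
        feature = VOICE_TO_MOUTH_FEATURE[voice_feature]    # converter 3
        params = MOUTH_FEATURE_TO_PARAMS[feature]          # unit 5 + table 4
        n_frames = max(1, round(duration / FRAME_PERIOD))  # time adjuster 2
        frames.extend([params] * n_frames)                 # gate 10 -> generator 6
    return frames

# Example: two phonemes of 50 ms and 120 ms yield about 2 and 4 frames each.
print(len(synthesize_mouth_frames([("k", 0.05), ("o", 0.12)])))
```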
According to another aspect of the present invention, the picture synthesizing apparatus comprises: an input terminal for receiving a sentence input; a speech synthesizer which divides the input sentence from the input terminal into a train of phonemes and outputs a voice feature for each phoneme and its duration; a converter which converts the voice feature for each phoneme into a mouth-shape feature; a conversion table which establishes correspondence between various mouth-shape features and mouth-shape parameters representing concrete mouth shapes; a unit which obtains from the conversion table a mouth-shape parameter corresponding to the mouth-shape feature for each phoneme; a time adjuster wherein the value of the mouth-shape parameter output from the unit is controlled in accordance with the duration for each phoneme from the speech synthesizer so as to generate a moving picture provided as a train of pictures spaced apart for a fixed period of time; and a picture generator which generates a picture in accordance with the mouth-shape parameter output from said unit under control of the time adjuster.
According to still another aspect of the present invention, the moving picture synthesizing apparatus comprises: an input terminal for receiving a sentence input; a speech synthesizer which divides the input sentence from the input terminal into a train of phonemes and outputs a voice feature for each phoneme and its duration; a converter which converts the voice feature for each phoneme into a mouth-shape feature; a conversion table which establishes correspondence between various mouth-shape features and mouth-shape parameters representing concrete mouth shapes; a unit which obtains from the conversion table a mouth-shape parameter corresponding to the mouth-shape feature for each phoneme; a time adjuster wherein the value of the mouth-shape parameter output from the unit is controlled in accordance with the duration for each phoneme from the speech synthesizer so as to generate a moving picture provided as a train of pictures spaced apart for a fixed period of time; a picture generator which generates a picture in accordance with the mouth-shape parameter output from the unit under control of the time adjuster; a transition detector for detecting a transition from a certain phoneme to the next in accordance with the output of the time adjuster; a memory capable of storing, for at least one frame period, the values of the mouth-shape parameters used in the picture generator; and a mouth-shape parameter modifier for obtaining an intermediate value between the value of the mouth-shape parameter stored in the memory and the value of the mouth-shape parameter provided from the unit. During the transition from a certain phoneme to the next an intermediate mouth shape is generated, producing a moving face picture with smooth mouth-shape variations.
The present invention will be described in detail below in comparison with the prior art with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram corresponding to an embodiment of the present invention;
Figs. 2A and 2B are diagrams showing examples of parameters for representing a mouth shape;
Fig. 3 is a block diagram corresponding to an example of the operation of a time adjuster employed in the present invention;
Fig. 4 is a block diagram corresponding to another embodiment of the present invention;
Fig. 5 is a block diagram corresponding to an example of the operation of a transition detector employed in the embodiment shown in Fig. 4; and
Fig. 6 is a block diagram corresponding to the operation of a conventional picture synthesizing system.
To make differences between prior art and the present invention clear, an example of prior art will first be described.
The method of the first-mentioned paper is executed in the form of a program, and the basic concept of obtaining mouth-shape variations corresponding to the input sentence is shown in Fig. 6.
In Fig. 6 reference numeral 50 indicates a syllable separator, 51 a unit making correspondence between syllables and mouth-shape patterns, 52 a table containing correspondence between syllables and mouth-shape patterns, 53 a mouth-shape selector, and 54 a memory for mouth-shape. Next, the operations of these units will be described in brief.
The syllable separator 50 divides an input sentence (an input text) in syllables. For instance, an input "kuma" in
Japanese is divided into the syllables "ku" and "ma". The table 52 prestores the correspondence between prepared syllables and mouth-shape patterns. The syllables each represent a group of sounds such as "a", "ka", etc. The mouth-shape patterns include big ones (<A>, <I>, <U>, <E>, <K>, etc.) and small ones (<u>, <o>, <k>, <s>, etc.) and indicate the kinds of mouth shapes. The correspondence between the syllables and the mouth-shape patterns is prestored as a table in such forms as <A><*><A> for "a" and <K><*><A> for "ka", for example; the symbol <*> indicates an intermediate mouth shape. The unit 51 reads out, for each syllable from the syllable separator 50, the corresponding mouth-shape pattern from the table 52. The memory for mouth-shape 54 prestores, for each of the above-mentioned mouth-shape patterns, a concrete mouth shape as a graphic form or shape parameter. The mouth-shape selector 53, when it receives mouth-shape patterns from the unit 51, sequentially refers to the contents of the memory for mouth-shape 54 and selects and outputs concrete mouth shapes as output pictures. At this time, intermediate mouth shapes (intermediate between the preceding and following mouth shapes) are also produced. For providing the output as a moving picture, the mouth shape for each syllable is fixedly assigned four frames.
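For comparison with the invention described below, this prior-art flow can be sketched as follows; the table contents are illustrative, and the fixed four-frame assignment is the limitation noted earlier.

```python
# Sketch of the prior-art flow of Fig. 6 (table contents are illustrative).
SYLLABLE_TO_PATTERN = {
    "a":  ["<A>", "<*>", "<A>"],   # <*> = intermediate mouth shape
    "ka": ["<K>", "<*>", "<A>"],
}
FRAMES_PER_SYLLABLE = 4  # fixed assignment, regardless of phoneme duration

def prior_art_mouth_sequence(syllables):
    """Unit 51: look up each syllable's mouth-shape pattern in table 52,
    then assign the fixed number of frames per syllable."""
    sequence = []
    for syllable in syllables:
        pattern = SYLLABLE_TO_PATTERN[syllable]
        sequence.extend([pattern] * FRAMES_PER_SYLLABLE)
    return sequence

# e.g. prior_art_mouth_sequence(["ka", "a"]) yields 8 frames in all.
```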
In the following, the present invention will be described.
Fig. 1 is a block diagram for explaining an embodiment of the present invention. Now, assume that input information is an input text (a sentence) obtainable from a keyboard or file unit such as a magnetic disk. In Fig. 1 reference numeral 1 indicates a speech synthesizer, 2 a time adjuster, 3 a speech feature to mouth-shape feature converter, 4 a conversion table of mouth-shape features to mouth-shape parameters, 5 a unit obtaining mouth-shape parameters, 6 a picture generator, 10 a gate, 900 an input text (sentence) terminal, and 901 an output picture terminal.
Next, the operation of each unit will be described.
The speech synthesizer 1 synthesizes a speech output corresponding to an input sentence. Various systems have been proposed for speech synthesis, but it is postulated here to utilize an existing speech rule synthesizing method which employs a Klatt-type formant speech synthesizer as a vocal tract model, because it matches well with the mouth-shape generation. This method is described in detail in a published paper [Seiichi Yamamoto, Norio Higuchi and Tohru Shimizu: "Trial Manufacture of a Speech Rule Synthesizer with Text-Editing Function," Institute of Electronic Information and Communication Engineers of Japan, Technical Report SP87-137 (March 1988)]. No detailed description will be given of the speech synthesizer, because it is a known technique and is not itself the subject of the present invention. The speech synthesizer needs only to output information of a vocal sound feature and a duration for each phoneme so as to establish accurate correspondence between the generated voice and the mouth shapes. According to the method by Yamamoto, Higuchi and Shimizu, the speech synthesizer outputs vocal sound features such as an articulation mode, an articulation point, a distinction between voiced and voiceless sounds and pitch control information, together with a duration based thereon, and thus fulfils the requirement. Other speech synthesizing methods can be employed, as long as they provide such information.
Moreover, if the information of a vocal sound feature and a duration for each phoneme is obtained, the present invention can be applied to an input text of English, French,
German, etc. as well as Japanese.
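A minimal sketch of the per-phoneme record such a synthesizer must supply might look as follows; the field names are assumptions chosen for illustration, not the format of the cited synthesizer.

```python
from dataclasses import dataclass

@dataclass
class PhonemeInfo:
    """Per-phoneme output required of the speech synthesizer
    (field names are illustrative assumptions)."""
    articulation_mode: str   # e.g. "plosive", "fricative"
    articulation_point: str  # e.g. "bilabial", "velar"
    voiced: bool             # voiced / voiceless distinction
    pitch: float             # pitch control information
    duration: float          # duration of the phoneme in seconds
```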
The time adjuster 2 is provided to control the input of a mouth-shape parameter into the picture generator 6 on the basis of the duration of each phoneme (the duration of an i-th phoneme being represented by ti) which is provided from the speech synthesizer 1. That is, when a picture (a moving picture, in particular) is output as a television signal of 30 frames per second by the NTSC television system, for example, it is necessary that the picture be generated as information for each 1/30 second. The operation of the time adjuster 2 will be described in detail later on.
The converter 3 converts the vocal sound feature from the speech synthesizer 1 to a mouth-shape feature corresponding to the phoneme concerned. The mouth-shape features are, for example, (1) the degree of opening of the mouth (from appreciably open to completely shut), (2) the degree of roundness of the lips (from round to drawn to both sides), (3) the height of the lower jaw (from raised to lowered), and (4) the degree to which the tongue is seen. Based on an observation of how a man actually utters each phoneme, the correspondence between the vocal sound feature and the mouth-shape feature is formulated.
For example, in the case of a Japanese sentence "konnichiwa" being input, vocal sound features are converted to mouth-shape features as follows:
- ## (voiceless sound): lv0, lh4, jaw4
- k: lv2, lhx, jaw2, tbck
- o: lv2, lh1, jaw2

In the above, lv, lh and jaw represent the degree of opening of the mouth, the degree of roundness of the lips and the height of the lower jaw, respectively; the numerals represent their values; x indicates that the degree is determined by the preceding and succeeding phonemes; and tbck represents the degree to which the tongue is seen (in this case, that the tongue is slightly seen at the back of the mouth).
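Encoded as data, the example above might be held as follows (the feature values are those given in the text; the dictionary layout itself is an illustrative assumption):

```python
# The "konnichiwa" example encoded as data.
MOUTH_FEATURES = {
    "##": {"lv": 0, "lh": 4, "jaw": 4},                  # voiceless pause
    "k":  {"lv": 2, "lh": "x", "jaw": 2, "tbck": True},  # "x": set by neighbours
    "o":  {"lv": 2, "lh": 1, "jaw": 2},
}
```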
The conversion table 4 for converting the mouth-shape feature to the corresponding mouth-shape parameter is a table which provides the parameter values representing a concrete mouth shape for each of the afore-mentioned mouth-shape features. Examples of parameters for representing mouth shapes are shown in Figs. 2A and 2B. Fig. 2A is a front view of the mouth portion. The mouth shape is defined by the positions of eight points P1 through P8, the degree to which the upper and lower teeth are seen is defined by the positions of points Q1 and Q2, and the thicknesses of the upper and lower lips are defined by values h1 and h2. Fig. 2B is a side view of the mouth portion, and inversions of the upper and lower lips are defined by angles θ1 and θ2. These parameters are adopted for representing natural mouth shapes.
However, more kinds of parameters can be utilized, and mouth shapes may also be represented by parameters and indications other than those of Figs. 2A and 2B. In the conversion table 4 there are prestored, in the form of a table, sets of values of the above-mentioned parameters P1 to P8, Q1, Q2, h1, h2, θ1 and θ2, predetermined on the basis of the results of measurements of the mouth shapes of a man when he actually utters vocal sounds.
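One conversion-table entry can accordingly be pictured as the following record; the structure names are assumptions, and the numeric contents would come from the measurements just described, so none are given here.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MouthShapeParams:
    """One conversion-table entry, naming the parameters of Figs. 2A/2B."""
    lip_points: List[Tuple[float, float]]   # P1..P8, front-view lip contour
    teeth_points: Tuple[Tuple[float, float], Tuple[float, float]]  # Q1, Q2
    lip_thickness: Tuple[float, float]      # h1 (upper lip), h2 (lower lip)
    lip_inversion: Tuple[float, float]      # theta1, theta2, side-view angles
```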
In response to the mouth-shape feature corresponding to the phoneme concerned, provided from the speech feature to mouth-shape feature converter 3, the unit 5 refers to the conversion table 4 to read out therefrom a set of values of mouth-shape parameters for the phoneme.
The gate 10 is provided for controlling whether or not the above-mentioned mouth-shape parameters for the phoneme are sent to the picture generator 6; it sends the mouth-shape parameters to the picture generator 6 the number of times specified by the time adjuster 2 (this number multiplied by 1/30 second being the time for which the mouth shape for the phoneme is displayed).
The picture generator 6 generates a picture of the mouth based on the mouth-shape parameters sent every 1/30 second from the unit 5 via the gate 10. A picture including the whole face in addition to the mouth portion is generated as required. The details of the generation of a picture of a mouth or face based on mouth-shape parameters are described in, for example, a published paper [Masahide Kaneko, Yoshinori Hatori and Kiyoshi Koike: "Detection of Shape Variations and Coding of a Moving Face Picture Based on a Three-Dimensional Model," Journal of the Institute of Electronic Information and Communication Engineers of Japan, B, Vol. J71-B, No. 12, pp. 1554-1563 (December 1988)]. In rough terms, a three-dimensional wire frame model is first prepared which represents the three-dimensional configuration of the head of a person, and mouth portions (lips, teeth, jaws, etc., in concrete terms) of the three-dimensional wire frame model are modified in accordance with the mouth-shape parameters provided. By providing to the modified model information specifying the shading and color of each part of the model for each picture element, it is possible to obtain a realistic picture of the mouth or face.
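As a rough, hypothetical illustration of this modification step (the cited paper defines the actual procedure), the lip contour vertices of a wire frame model can be conformed to the target points P1 through P8 as follows:

```python
# Hypothetical sketch: conform the lip contour vertices of a 3-D wire frame
# head model to the target points P1..P8 from the mouth-shape parameters.
def apply_mouth_params(model, params):
    """model: dict with 'lip_contour' -> list of eight [x, y, z] vertices.
    params: dict with 'lip_points' -> eight target (x, y) positions."""
    for vertex, (px, py) in zip(model["lip_contour"], params["lip_points"]):
        vertex[0], vertex[1] = px, py   # move the vertex in the image plane,
                                        # keeping its depth (z) unchanged
    return model
```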
Now, the operation of the time adjuster 2 will be described in detail. Fig. 3 is a block diagram explanatory of the structure and operation of the time adjuster 2.
In Fig. 3 reference numeral 21 indicates a delay, 22 a comparator, 23 and 24 memories, 25 and 26 adders, 27 a switch, 28 and 29 branches, 30 a time normalizer, 201 and 202 output lines of the comparator 22, 902 an initial reset signal terminal, 903 a constant (1/30) input terminal, and 920 and 921 terminals of the switch 27. Next, the operation of each of these parts will be described.
The memory 23 is provided for storing the total duration T(I) = t1 + t2 + ... + tI up to an I-th phoneme. Prior to the start of picture synthesis, a zero is set in the memory 23 by an initial reset signal from the terminal 902. When the duration tI of the I-th phoneme is provided from the speech synthesizer 1, the total duration T(I-1) up to the (I-1)th phoneme stored in the memory 23 and the duration tI of the I-th phoneme are added by the adder 25 to obtain the sum T(I) = T(I-1) + tI, and the delay 21 serves to store the total duration T(I-1) up to the (I-1)th phoneme until processing for the (I+1)th phoneme is initiated. In response to the output T(I-1) of the delay 21, the time normalizer 30 obtains the integer N which satisfies (1/30) x N <= T(I-1) < (1/30) x (N+1), and outputs the value (1/30) x N, where 1/30 is the constant which provides a one-frame period of 1/30 second. The switch 27 is connected to the terminal 920 by the output 202 from the comparator 22 when processing for the I-th phoneme is started. At this time, the sum t = (1/30) x N + 1/30 of the output of the time normalizer 30 and the constant 1/30 is calculated by the adder 26. The comparator 22 compares the value t with the value T(I) and provides a signal on the output line 201 or 202 depending on whether t <= T(I) or t > T(I). The latter case means the expiration of the duration of the I-th phoneme, issuing through the output line 202 an instruction to the speech synthesizer 1 to output information of the (I+1)th phoneme, an instruction to the memory 24 to reset its contents, an instruction to the switch 27 to connect it to the terminal 920, and an instruction to the delay 21 to output the value of the delayed total duration T(I). The memory 24 is provided for temporarily storing the output of the adder 26. The switch 27 is connected to the terminal 921 while t <= T(I) holds, during which the adder 26 renews the preceding sum t by adding thereto the constant 1/30 for each frame. In this way, while t <= T(I) holds, the comparator 22 provides the signal on the output line 201 to enable the gate 10 in Fig. 1, through which the mouth-shape parameters corresponding to the I-th phoneme are supplied to the picture generator 6 for the duration of the I-th phoneme.
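In software terms, the frame-counting behaviour of the time adjuster reduces to the following sketch, written under the notation above; it models the logic, not the hardware units.

```python
import math

FRAME_PERIOD = 1.0 / 30.0  # one-frame period (1/30 second)

def frames_per_phoneme(durations):
    """Frame counts per phoneme, mirroring the time adjuster of Fig. 3:
    a frame boundary t belongs to the I-th phoneme while t <= T(I),
    where T(I) = t1 + ... + tI."""
    counts = []
    total_prev = 0.0                               # T(I-1), held in delay 21
    for t_i in durations:                          # tI from speech synthesizer 1
        total = total_prev + t_i                   # adder 25: T(I) = T(I-1) + tI
        n = math.floor(total_prev / FRAME_PERIOD)  # time normalizer 30
        t = (n + 1) * FRAME_PERIOD                 # adder 26: first frame boundary
        count = 0
        while t <= total:                          # comparator 22 -> line 201
            count += 1
            t += FRAME_PERIOD                      # adder 26 renews t each frame
        counts.append(count)                       # gate 10 passes params count times
        total_prev = total                         # line 202: move to next phoneme
    return counts

# e.g. frames_per_phoneme([0.05, 0.12]) -> [1, 4] (5 frames in 0.17 s total)
```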
The above is the first embodiment of the present invention. In the first embodiment, when the I-th phoneme changes to the (I+1)th phoneme, the mouth-shape parameters of the former change discontinuously to those of the latter. In this instance, if the mouth-shape parameters of the two phonemes do not differ widely from each other, the synthesized moving picture will not be so unnatural. When a person utters vocal sounds, however, his mouth shape changes continuously; therefore, when the I-th phoneme changes to the (I+1)th phoneme, it is desirable that the mouth shape of the moving picture also change continuously. Fig. 4 is a block diagram explanatory of another embodiment of the present invention designed to meet the above requirement. In Fig. 4 reference numeral 7 indicates a mouth-shape parameter modifier, 8 a transition detector, 9 a memory, 40 a switch, and 910 and 911 terminals of the switch 40. Except for the above, this embodiment is identical in construction with the embodiment of Fig. 1. Now, a description will be given of the operations of the newly added units.
The transition detector 8 serves to detect the transition from a certain phoneme (the I-th phoneme, for example) to the next one (the (I+1)th phoneme). Fig. 5 is a block diagram explanatory of the operation of the transition detector 8 according to the present invention. Reference numeral 81 indicates a counter, 82 a decision circuit, and 210 and 211 output lines. The counter 81 is reset to zero when the comparator 22 provides a signal on the output line 202, and is incremented by one whenever the comparator 22 provides a signal on the output line 201. The decision circuit 82 determines whether the output of the counter 81 is in the state "1" and, when it is, provides a signal on the output line 210, because the state "1" indicates the occurrence of a transition from a certain phoneme to the next. When the counter output is in the state "2" or more, this means that the current phoneme still lasts, and the decision circuit 82 provides a signal on the output line 211.
The memory 9 is provided for storing, for at least one frame period, the mouth-shape parameters used for synthesizing the picture of the preceding frame. The mouth-shape parameter modifier 7 obtains, for instance, intermediate values between the mouth-shape parameters of the preceding frame stored in the memory 9 and the mouth-shape parameters for the current phoneme which are provided from the unit 5, and provides such intermediate values as the mouth-shape parameters for synthesizing the picture of the current frame. The switch 40 is connected to the terminal 910 or 911, depending on whether the transition detector 8 provides a signal on the output line 210 or 211. Consequently, either the intermediate values between the mouth-shape parameters for the two phonemes, available from the mouth-shape parameter modifier 7, or the mouth-shape parameters for the current phoneme are supplied to the picture generator 6, depending on whether the switch 40 is connected to the terminal 910 or 911. While in the above the intermediate values between the mouth-shape parameters of a certain phoneme and the next are produced for only one frame, it is also possible to implement smoother mouth-shape variations by producing such intermediate values over more steps in accordance with the counting state of the counter 81, for instance.
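The smoothing of this second embodiment likewise reduces to a short sketch; modeling the mouth-shape parameters as a flat dictionary of numbers is an assumption made for illustration.

```python
def modified_params(prev_params, curr_params, is_transition_frame):
    """Mouth-shape parameter modifier 7: on the first frame after a phoneme
    transition (counter state "1", switch 40 at terminal 910), output the
    midpoints between the preceding frame's parameters (memory 9) and the
    current phoneme's parameters; otherwise (terminal 911) pass the current
    parameters through unchanged."""
    if not is_transition_frame:
        return dict(curr_params)
    return {key: (prev_params[key] + curr_params[key]) / 2.0
            for key in curr_params}
```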
As described above, the present invention is directed to a system for synthesizing a moving picture of a person's face which has mouth-shape variations corresponding to a sentence input. However, if a speech recognition method is available by which input speech information can be divided into a train of phonemes and a voice feature for each phoneme and its duration can be output, then a moving picture with mouth-shape variations corresponding to the input speech information can also be synthesized, by replacing the speech synthesizer 1 in the present invention with a speech recognizer which performs the operations mentioned above.
As described above, the present invention permits the synthesis of a moving picture which has accurate correspondence between a sentence input and a speech output, and whose mouth-shape variations follow the duration of each phoneme, resulting in natural mouth-shape variations well matched with the speech output.
Whereas the prior art can only synthesize a speech output, the present invention makes it easy to produce not only such a speech output but also a moving picture having natural mouth-shape variations well matched with it.
Accordingly, the present invention is applicable to the production of a moving picture without the necessity of actual film shooting (the production of a television program or movie, for example), to an automatic response unit and a man-machine interface utilizing speech and pictures, and to the conversion of medium from a sentence to speech and a moving picture. Hence, the present invention is of great practical utility.
Claims (5)
1. A picture synthesizing method for synthesizing a moving picture of a person's face which has mouth-shape variations corresponding to a sentence input,
characterized by the steps of:
dividing the sentence input into a train of phonemes;
utilizing a speech synthesis technique capable of outputting a voice feature of each phoneme of the train of phonemes and its duration;
determining a mouth-shape feature corresponding to each phoneme on the basis of the voice feature;
determining the value of a mouth-shape parameter for representing a concrete mouth shape on the basis of the mouth-shape feature; and
controlling the value of the mouth-shape parameter for each phoneme for each frame of the moving picture in accordance with the duration of each phoneme, thereby synthesizing the moving picture having mouth-shape variations matched with a speech output.
2. A picture synthesizing apparatus comprising:
an input terminal for receiving a sentence input;
a speech synthesizer capable of dividing the input sentence into a train of phonemes and outputting a voice feature of each phoneme and its duration;
a converter for converting the voice feature for each phoneme into a mouth shape feature;
a conversion table having established correspondence between various mouth-shape features and mouth-shape parameters for representing concrete mouth shapes;
means for obtaining from the conversion table a mouth-shape parameter corresponding to the mouth-shape feature for each phoneme provided from the converter;
a time adjuster whereby the value of the mouth-shape parameter output from said means for obtaining is controlled in accordance with the duration of each phoneme from the speech synthesizer so as to produce a moving picture as a train of pictures spaced apart for a fixed period of time; and
a picture generator for generating the picture in accordance with the values of the mouth-shape parameters from said means for obtaining mouth-shape parameters under control of the time adjuster.
3. A picture synthesizing apparatus according to claim 2, characterized by a transition detector for detecting a transition from a certain phoneme to the next in accordance with the output of the time adjuster, a memory capable of storing for at least one frame period the values of the mouth-shape parameters used in the picture generator, and a mouth-shape parameter modifier for obtaining an intermediate value between the value of the mouth-shape parameter stored in the memory and the value of the mouth-shape parameter provided from said means for obtaining the mouth-shape parameters, whereby during the transition from the certain phoneme to the next an intermediate mouth shape is generated, producing the moving picture of a person's face with smooth mouth-shape variations.
4. A picture synthesizing method for synthesizing a moving picture of a person's face which has mouth-shape variations corresponding to a sentence input substantially as herein described with reference to Figure 1 with or without reference to any of Figures 2 to 5 of the accompanying drawings.
5. A picture synthesizing apparatus substantially as herein described with reference to Figure 1 with or without reference to any of Figures 2 to 5 of the accompanying drawings.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP1053899A JP2518683B2 (en) | 1989-03-08 | 1989-03-08 | Image combining method and apparatus thereof |
Publications (3)
Publication Number | Publication Date |
---|---|
GB9005142D0 GB9005142D0 (en) | 1990-05-02 |
GB2231246A true GB2231246A (en) | 1990-11-07 |
GB2231246B GB2231246B (en) | 1993-06-30 |
Family
ID=12955569
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB9005142A Expired - Fee Related GB2231246B (en) | 1989-03-08 | 1990-03-07 | Picture synthesizing method and apparatus |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2518683B2 (en) |
GB (1) | GB2231246B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2250405A (en) * | 1990-09-11 | 1992-06-03 | British Telecomm | Speech analysis and image synthesis |
EP0603809A2 (en) * | 1992-12-21 | 1994-06-29 | Casio Computer Co., Ltd. | Object image display devices |
EP0673170A2 (en) * | 1994-03-18 | 1995-09-20 | AT&T Corp. | Video signal processing systems and methods utilizing automated speech analysis |
EP0689362A2 (en) * | 1994-06-21 | 1995-12-27 | AT&T Corp. | Sound-synchronised video system |
WO1997036288A1 (en) * | 1996-03-26 | 1997-10-02 | British Telecommunications Plc | Image synthesis |
WO1997036297A1 (en) * | 1996-03-25 | 1997-10-02 | Interval Research Corporation | Automated synchronization of video image sequences to new soundtracks |
EP0860811A2 (en) * | 1997-02-24 | 1998-08-26 | Digital Equipment Corporation | Automated speech alignment for image synthesis |
WO1998043235A2 (en) * | 1997-03-25 | 1998-10-01 | Telia Ab (Publ) | Device and method for prosody generation at visual synthesis |
WO1998043236A2 (en) * | 1997-03-25 | 1998-10-01 | Telia Ab (Publ) | Method of speech synthesis |
AT404887B (en) * | 1994-06-08 | 1999-03-25 | Siemens Ag Oesterreich | READER |
WO1999046734A1 (en) * | 1998-03-11 | 1999-09-16 | Entropic, Inc. | Face synthesis system and methodology |
US6208356B1 (en) * | 1997-03-24 | 2001-03-27 | British Telecommunications Public Limited Company | Image synthesis |
US7342094B1 (en) * | 1997-12-30 | 2008-03-11 | Max-Delbrück-Centrum für Molekulare Medizin | Tumor vaccines for MUC1-positive carcinomas |
USRE42647E1 (en) | 1997-05-08 | 2011-08-23 | Electronics And Telecommunications Research Institute | Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same |
US8078466B2 (en) * | 1999-09-07 | 2011-12-13 | At&T Intellectual Property Ii, L.P. | Coarticulation method for audio-visual text-to-speech synthesis |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3070073B2 (en) | 1990-07-13 | 2000-07-24 | ソニー株式会社 | Shape control method based on audio signal |
WO1996027983A1 (en) | 1995-03-07 | 1996-09-12 | Interval Research Corporation | System and method for selective recording of information |
US5893062A (en) | 1996-12-05 | 1999-04-06 | Interval Research Corporation | Variable rate video playback with synchronized audio |
US6263507B1 (en) | 1996-12-05 | 2001-07-17 | Interval Research Corporation | Browser for use in navigating a body of information, with particular application to browsing information represented by audiovisual data |
KR100236974B1 (en) | 1996-12-13 | 2000-02-01 | 정선종 | Synchronization system between moving picture and text / voice converter |
US7366670B1 (en) | 1997-08-05 | 2008-04-29 | At&T Corp. | Method and system for aligning natural and synthetic video to speech synthesis |
US6567779B1 (en) * | 1997-08-05 | 2003-05-20 | At&T Corp. | Method and system for aligning natural and synthetic video to speech synthesis |
US7155735B1 (en) | 1999-10-08 | 2006-12-26 | Vulcan Patents Llc | System and method for the broadcast dissemination of time-ordered data |
US6757682B1 (en) | 2000-01-28 | 2004-06-29 | Interval Research Corporation | Alerting users to items of current interest |
JP2006021273A (en) * | 2004-07-08 | 2006-01-26 | Advanced Telecommunication Research Institute International | Text visual speech (TTVS) synthesis method and computer executable program |
JP2012103904A (en) * | 2010-11-10 | 2012-05-31 | Sysystem Co Ltd | Image processing device, method and program |
CN117275485B (en) * | 2023-11-22 | 2024-03-12 | 翌东寰球(深圳)数字科技有限公司 | Audio and video generation method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0056507A1 (en) * | 1981-01-19 | 1982-07-28 | Richard Welcher Bloomstein | Apparatus and method for creating visual images of lip movements |
EP0179701A1 (en) * | 1984-10-02 | 1986-04-30 | Yves Guinet | Television method for multilingual programmes |
EP0225729A1 (en) * | 1985-11-14 | 1987-06-16 | BRITISH TELECOMMUNICATIONS public limited company | Image encoding and synthesis |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4884972A (en) * | 1986-11-26 | 1989-12-05 | Bright Star Technology, Inc. | Speech synchronized animation |
-
1989
- 1989-03-08 JP JP1053899A patent/JP2518683B2/en not_active Expired - Fee Related
-
1990
- 1990-03-07 GB GB9005142A patent/GB2231246B/en not_active Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0056507A1 (en) * | 1981-01-19 | 1982-07-28 | Richard Welcher Bloomstein | Apparatus and method for creating visual images of lip movements |
EP0179701A1 (en) * | 1984-10-02 | 1986-04-30 | Yves Guinet | Television method for multilingual programmes |
EP0225729A1 (en) * | 1985-11-14 | 1987-06-16 | BRITISH TELECOMMUNICATIONS public limited company | Image encoding and synthesis |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2250405A (en) * | 1990-09-11 | 1992-06-03 | British Telecomm | Speech analysis and image synthesis |
EP0603809A2 (en) * | 1992-12-21 | 1994-06-29 | Casio Computer Co., Ltd. | Object image display devices |
EP0603809A3 (en) * | 1992-12-21 | 1994-08-17 | Casio Computer Co Ltd | Object image display devices. |
US5588096A (en) * | 1992-12-21 | 1996-12-24 | Casio Computer Co., Ltd. | Object image display devices |
US5608839A (en) * | 1994-03-18 | 1997-03-04 | Lucent Technologies Inc. | Sound-synchronized video system |
EP0673170A2 (en) * | 1994-03-18 | 1995-09-20 | AT&T Corp. | Video signal processing systems and methods utilizing automated speech analysis |
EP0673170A3 (en) * | 1994-03-18 | 1996-06-26 | At & T Corp | Video signal processing systems and methods utilizing automated speech analysis. |
AT404887B (en) * | 1994-06-08 | 1999-03-25 | Siemens Ag Oesterreich | READER |
EP0689362A3 (en) * | 1994-06-21 | 1996-06-26 | At & T Corp | Sound-synchronised video system |
EP0689362A2 (en) * | 1994-06-21 | 1995-12-27 | AT&T Corp. | Sound-synchronised video system |
WO1997036297A1 (en) * | 1996-03-25 | 1997-10-02 | Interval Research Corporation | Automated synchronization of video image sequences to new soundtracks |
US5880788A (en) * | 1996-03-25 | 1999-03-09 | Interval Research Corporation | Automated synchronization of video image sequences to new soundtracks |
WO1997036288A1 (en) * | 1996-03-26 | 1997-10-02 | British Telecommunications Plc | Image synthesis |
EP0860811A3 (en) * | 1997-02-24 | 1999-02-10 | Digital Equipment Corporation | Automated speech alignment for image synthesis |
EP0860811A2 (en) * | 1997-02-24 | 1998-08-26 | Digital Equipment Corporation | Automated speech alignment for image synthesis |
US6208356B1 (en) * | 1997-03-24 | 2001-03-27 | British Telecommunications Public Limited Company | Image synthesis |
WO1998043236A2 (en) * | 1997-03-25 | 1998-10-01 | Telia Ab (Publ) | Method of speech synthesis |
WO1998043236A3 (en) * | 1997-03-25 | 1998-12-23 | Telia Ab | Method of speech synthesis |
WO1998043235A3 (en) * | 1997-03-25 | 1998-12-23 | Telia Ab | Device and method for prosody generation at visual synthesis |
WO1998043235A2 (en) * | 1997-03-25 | 1998-10-01 | Telia Ab (Publ) | Device and method for prosody generation at visual synthesis |
US6385580B1 (en) | 1997-03-25 | 2002-05-07 | Telia Ab | Method of speech synthesis |
US6389396B1 (en) | 1997-03-25 | 2002-05-14 | Telia Ab | Device and method for prosody generation at visual synthesis |
USRE42647E1 (en) | 1997-05-08 | 2011-08-23 | Electronics And Telecommunications Research Institute | Text-to speech conversion system for synchronizing between synthesized speech and a moving picture in a multimedia environment and a method of the same |
US7342094B1 (en) * | 1997-12-30 | 2008-03-11 | Max-Delbrück-Centrum für Molekulare Medizin | Tumor vaccines for MUC1-positive carcinomas |
WO1999046734A1 (en) * | 1998-03-11 | 1999-09-16 | Entropic, Inc. | Face synthesis system and methodology |
US6449595B1 (en) | 1998-03-11 | 2002-09-10 | Microsoft Corporation | Face synthesis system and methodology |
US8078466B2 (en) * | 1999-09-07 | 2011-12-13 | At&T Intellectual Property Ii, L.P. | Coarticulation method for audio-visual text-to-speech synthesis |
Also Published As
Publication number | Publication date |
---|---|
JP2518683B2 (en) | 1996-07-24 |
GB9005142D0 (en) | 1990-05-02 |
JPH02234285A (en) | 1990-09-17 |
GB2231246B (en) | 1993-06-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
GB2231246A (en) | Converting text input into moving-face picture | |
US6332123B1 (en) | Mouth shape synthesizing | |
Chen et al. | Audio-visual integration in multimodal communication | |
US6208356B1 (en) | Image synthesis | |
CA1263187A (en) | Image encoding and synthesis | |
US6662161B1 (en) | Coarticulation method for audio-visual text-to-speech synthesis | |
US6097381A (en) | Method and apparatus for synthesizing realistic animations of a human speaking using a computer | |
USRE42000E1 (en) | System for synchronization between moving picture and a text-to-speech converter | |
US6778252B2 (en) | Film language | |
CA2375350C (en) | Method of animating a synthesised model of a human face driven by an acoustic signal | |
EP0890168B1 (en) | Image synthesis | |
US6014625A (en) | Method and apparatus for producing lip-movement parameters in a three-dimensional-lip-model | |
US8078466B2 (en) | Coarticulation method for audio-visual text-to-speech synthesis | |
WO2001001353A1 (en) | Post-synchronizing an information stream | |
JPH04331997A (en) | Accent component control system of speech synthesis device | |
KR950035447A (en) | Video Signal Processing System Using Speech Analysis Automation and Its Method | |
EP0710929A2 (en) | Acoustic-assisted image processing | |
KR20230172427A (en) | Talking face image synthesis system according to audio voice | |
JP5109038B2 (en) | Lip sync animation creation device and computer program | |
JP4617500B2 (en) | Lip sync animation creation device, computer program, and face model creation device | |
Kakumanu et al. | Speech driven facial animation | |
JP3059022B2 (en) | Video display device | |
US7392190B1 (en) | Coarticulation method for audio-visual text-to-speech synthesis | |
US3530248A (en) | Synthesis of speech from code signals | |
JP3298076B2 (en) | Image creation device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PCNP | Patent ceased through non-payment of renewal fee |
Effective date: 20050307 |