
EP0527529B1 - Method and apparatus for manipulating duration of a physical audio signal, and a storage medium containing a representation of such physical audio signal - Google Patents

Info

Publication number
EP0527529B1
EP0527529B1 EP19920202374 EP92202374A EP0527529B1 EP 0527529 B1 EP0527529 B1 EP 0527529B1 EP 19920202374 EP19920202374 EP 19920202374 EP 92202374 A EP92202374 A EP 92202374A EP 0527529 B1 EP0527529 B1 EP 0527529B1
Authority
EP
European Patent Office
Prior art keywords
audio
signal
equivalent signal
audio equivalent
duration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
EP19920202374
Other languages
German (de)
French (fr)
Other versions
EP0527529A2 (en)
EP0527529A3 (en)
Inventor
Leonardus Lambertus Maria Vogten
Chang Xue Ma
Werner Desiré Elisabeth Verhelst
Josephus Hubertus Eggen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koninklijke Philips NV
Original Assignee
Koninklijke Philips Electronics NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics NV
Priority to EP19920202374
Publication of EP0527529A2
Publication of EP0527529A3
Application granted
Publication of EP0527529B1
Anticipated expiration
Expired - Lifetime

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04: Time compression or expansion

Definitions

  • the indexing means address the segment slot numbers which must be read out and the counters address the position within the slots.
  • the counters and indexing means address three segments, which are output from the storage means 62 to the summing means 64 in order to produce the output signal.
  • the duration of the speech signal is controlled by a duration control input 68b to the indexing means. Without duration manipulation, the indexing means simply produce three successive segment slot numbers.
  • the values of the first and second output are copied to the second and third output respectively, and the first output is increased by one.
  • when the duration is increased, the first output is kept constant once every so many cycles, as determined by the duration control input 68b.
  • when the duration is decreased, the first output is increased by two every so many cycles. The change in duration is determined by the net number of skipped or repeated indices.
  • Figure 3 only provides one embodiment by way of example.
  • the principal point is the incremental placement of windows at the input side with a phase determined from the phase of a previous window.
  • the addresses may be generated using a computer program, and the starting addresses need not have the values given in the example.
  • Figure 3 can be implemented in various ways, for example using digital samples at input 60, where the sampling rate may have any convenient value, for example 10000 samples per second; conversely, it may use continuous signal techniques, where the counters 81, 82, 101, 102, 103 provide continuous ramp signals, and the storage means provide for continuously controlled access like a magnetic disk. Furthermore, in Figure 3 in practice segment slots may be reused after some time, as they are not needed permanently. Not all components of Figure 4 need to be implemented by discrete function blocks: often they may be implemented in whole or in part by a computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Description

FIELD OF THE INVENTION
The invention relates to a method for manipulating an audio equivalent signal, comprising positioning of a chain of mutually overlapping time windows with respect to the audio equivalent signal, as based on periodicity measurements on said audio equivalent signal, and wherein a positional displacement between adjacent windows substantially corresponds to a principal period of said periodicity, and synthesizing an audio output signal by chained superposition of segment signals, each deriving from the audio equivalent signal through weighting with the associated window function.
Such a method has been described in EP-A-363 233. The known method is used during speech synthesis for changing the prosody or pitch of synthesised speech, or for changing the duration of stretches of speech. The known method uses voice marks determined manually for placing the windows. It is preferred that such a manipulation method can be performed automatically, is robust against noise, and retains a high audio quality for the output signal.
The inventors of the present invention have realized that the manipulation of the duration can be used in various situations where there are external constraints to the total length of a self-contained unit of speech, which constraints may specify both the maximum and the minimum duration of such unit.
SUMMARY OF THE INVENTION
Accordingly, amongst other things, it is an object of the present invention to position the manipulated audio equivalent signal in a predetermined time length that differs from the original length, while on the one hand filling the interval more or less completely, and on the other hand keeping the impression of the eventual representation as natural as possible.
Now, according to one of its aspects, the object is realized in that the invention is characterized by manipulating a duration of said output signal through systematically repeating, maintaining, and/or suppressing said segment signals, to a resulting predetermined overall length that differs from a corresponding duration of said audio equivalent signal.
An advantage of the method of positioning windows according to the junior reference is that it can be machine-executed without any window-to-window human control being necessary. Furthermore, it has been found that the duration can be changed by a factor between 2 and ½ without seriously impairing the understandability of speech. For lesser degrees of duration manipulation, such as + or - 30%, not only does the understandability remain very good, but the natural quality of speech is also maintained, and a listener would hardly perceive the change of duration as unnatural. A prerequisite to applying the method is that the pitch can indeed be measured, which for human speech is a problem with various known solutions. Situations where the duration of speech should be manipulated are various, such as post-synchronizing of movies or other video material, adapting a spoken explanation or other matter to the physical motion of objects, such as the closing instant of a door, and many other instances. In movies, actor utterances should preferably coincide with their facial motions, or at least with their moving around in general. Typical time scales of the total duration of the utterance are 0.3 to several seconds. In this short time frame, the prior art had not succeeded in manipulating duration while also preserving naturalness. On a much longer time scale, the length of a pause can be manipulated, as is often done by human interpreters. If the available time is known beforehand, sometimes a different verbalization can be used, but all these methods require specialized human skills. The present method is easily applicable and just requires the setting of a speed-up or slow-down percentage. Of course, the present invention can also be used for amending durations longer than the seconds range.
In itself, the automatic placement of overlapping windows is used in the non-prepublished European Patent EP-B-0 527 527 for adjusting prosody during speech synthesis. The article "Simple pitch-dependent algorithm for high-quality speech rate changing", E.P. Neuburg, Journal of the Acoustical Society of America, Vol. 63, No. 2, February 1978, pages 624-625, describes a cut-and-splice method for speeding up or slowing down speech by removing or, respectively, repeating a stretch of the speech signal whose length is equal to the pitch period. WO-A-8 303 483 describes a system for replacing an original dialogue recorded at the time of shooting a picture by a similar signal recorded in the studio at a higher quality. The relative timing of the original recording is kept by comparing both signals on a frame basis and keeping or repeating a frame of the studio recording depending on how well the frames match.
The invention also relates to an apparatus for executing the method and to a storage medium containing a representation of the audio equivalent signal. The invention allows the available space for a unit of speech (sentence, partial sentence, exclamation, or other) to be filled well-nigh completely.
A particular application is Compact Disc Interactive (CD-I), especially in a multi-language environment. Editing CD-I is by itself a complicated task. Sizing the duration of speech utterances may now be performed by the machine, relieving the program editor of this tedium. By itself, CD-I is a well-published storage medium with an associated development platform, the storage itself being an extension of Compact Disc Audio.
Various advantageous aspects of the invention are recited in dependent Claims.
BRIEF DESCRIPTION OF THE FIGURES
These and other advantages will be described with reference to a preferred embodiment which is shown in a number of Figures, of which
  • Figure 1 shows editing of a CD-I program for storage on a CD-I disc;
    The following Figures especially show the technology of the junior reference:
  • Figures 2a,b,c show speech signals with windows placed according to the invention;
  • Figure 3 shows an apparatus for changing the pitch and/or duration of a signal;
  • Figure 4 shows multiplication means and window function value selection means for use in an apparatus for changing the pitch and/or duration of a signal;
  • Figure 5 shows window position selection means for implementing the invention;
  • Figure 6 shows a subsystem for combining several segment signals.
DESCRIPTION OF A PREFERRED EMBODIMENT
    As is commonly understood, the audio or speech equivalent signal may be direct analog speech, or it may be speech that is stored as a sequence of codes on the basis of which synthetic speech is generated. The length of the various windows may be non-uniform, and in a particular embodiment, the length of each window may be substantially equal to a local actual pitch period length. Within the window, the window function is uniform, which means that the window function scales linearly with the width of the window; generally, there may be an appreciable variation between the widths of successive windows. The systematic character of the repeating, maintaining, or suppressing implies that there is a certain prescription for the sequence of window positions that, first, restricts the manipulation to either repeating or suppressing, each possibly in combination with maintaining, and furthermore, that the repeating or maintaining is done under control of an actual or emulated recurrent cycle.
    Examples are:
  • each third window is repeated once, the others are maintained;
  • of each five successive windows, #2 and #4 are suppressed;
  • at each next window, a count is incremented by a particular amount and overflow controls actual suppression or repetition.
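The three example prescriptions above can be sketched as index schedules over the window sequence. This is a minimal illustration, not the patented implementation: the function names, the 0-based window indices and the integer accumulator (`step_num` out of `den` per window) are assumptions introduced here.

```python
def repeat_every(n_windows, period):
    """Repeat every `period`-th window once; maintain all the others."""
    out = []
    for i in range(n_windows):
        out.append(i)
        if (i + 1) % period == 0:
            out.append(i)  # this window is used a second time
    return out


def suppress_in_cycle(n_windows, cycle, dropped):
    """Within each cycle of `cycle` windows, drop the 1-based positions in `dropped`."""
    return [i for i in range(n_windows) if (i % cycle) + 1 not in dropped]


def accumulator_schedule(n_windows, step_num, den):
    """Overflow-controlled schedule (hypothetical integer accumulator).

    At each window the count grows by step_num; reaching +den repeats the
    window once, reaching -den suppresses it.
    """
    out, acc = [], 0
    for i in range(n_windows):
        acc += step_num
        if acc <= -den:      # negative overflow: suppress this window
            acc += den
            continue
        out.append(i)
        if acc >= den:       # positive overflow: repeat this window once
            acc -= den
            out.append(i)
    return out
```

For instance, `accumulator_schedule(8, 1, 4)` lengthens eight windows to ten, while `accumulator_schedule(6, -1, 2)` keeps every other window. The accumulator form is convenient when the target length is given as a percentage rather than as a fixed cycle.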
    It is noted that the systematic character need not be completely uniform. For example, in post-synchronization of a movie, it could be advantageous to amend the durations of various parts of a sentence somewhat differently from each other, as long as the natural character of the resulting speech is retained. In particular, the movement of a face while speaking could to a certain extent be followed by the dynamics of the audio speech. Also, different sentences in various places of the post-synchronization may now have uniform pitch among each other.
    The different representations in parallel may be different languages; it has been found that the same sentence, translated to another language, would have different length, counted for example, as a number of syllables: in particular, the German language caused a longer duration as compared with English and French.
    Other, in particular exotic, languages may lead to even more extreme situations. Similar situations may distinguish child voices from adult voices.
    In Figure 1, for a three-language CD-I track, the pictorial material 200 is shown with accompanying speech representations in French (202), German (204) and English (206) before editing. It is intended to lend each language representation (among which a user may choose) exactly the same duration as the pictorial material (movie, animation, etcetera). As shown, on line 202 a single window is suppressed, and on line 204 five windows are suppressed. On line 206, six windows are repeated once (crosses). The result after editing is not shown. It has been found that analysis of the results can prove infringement. Especially the occurrence of the repeated windows is well traceable. Moreover, the substantially equal lengths of the various representations, together with the high subjective quality of the rendering, are a clear indication of the use of the present technology.
    In certain situations, apart from changing the duration per se, the slowing down or speeding up may lend the speech a character, such as nervous (fast) or majestic (slow). Such use is also sometimes advantageous. Changing the duration of the audio equivalent signal may be combined with changing the pitch. The two types of manipulation may both be in the same direction, for example in that both effectively shorten the duration. In other circumstances, they could to some degree compensate each other's effects, so that the change in duration would be less or even zero. The change of duration may follow a time-varying pattern, whereby the overall change of duration is the integral or sum of the elementary changes of duration.
    DESCRIPTION OF A PREFERRED TECHNOLOGY
    Hereinafter, a description of the preferred technology according to the junior reference is given.
    Figures 2a, 2b and 2c show speech signals with marks 52 placed apart by distances determined with a pitch meter (that may be conventional), that is, without a fixed phase reference. In Figure 2a, two successive periods were marked as voiceless by placing their pitch period length indication outside the scale. The pitch marks (lower scale) were obtained by interpolating the period length. Although the pitch period lengths were determined without smoothing other than that inherent in determining spectra of the speech signal extending over several pitch periods, a very regular curve was obtained automatically.
    The incremental placement of windows also solves another problem. For unvoiced stretches that contain fricatives like the sound "ssss", in which the vocal cords are not excited, the windows are placed incrementally just like for voiced stretches. The pitch period length is interpolated between the lengths measured for the voiced stretches adjacent to the unvoiced stretch. This provides regularly spaced windows without audible artefacts.
    The placement of windows is easy if the input audio equivalent signal is monotonous, that is, has a constant pitch. In this case, the windows may be placed simply at fixed distances from each other. This may be effected by preprocessing the signal, so as to change its pitch to a single monotonous value. The final manipulation to obtain a desired pitch and/or duration can then be performed with windows at uniform spacing.
    An exemplary apparatus.
    Figure 3 shows an exemplary embodiment of an apparatus for changing the pitch and/or duration of an audible signal. The input audio equivalent signal arrives at an input 60, and the output signal leaves at an output 63. The input signal is multiplied by the window function in multiplication means 61, and stored segment signal by segment signal in segment slots in storage means 62. To synthesize the output signal on output 63, speech samples from various segment signals are summed in summing means 64. The manipulation of speech signals, in terms of pitch change and/or duration manipulation, is effected by addressing the storage means 62 and selecting window function values. Accordingly, selection of storage addresses for storing the segments is controlled by window position selection means 65, which also control window function value selection means 69; selection of readout addresses is controlled by combination means 66.
    To explain the operation of the components of the apparatus shown in Figure 3, it is briefly noted that signal segments $S_i$ are to be derived from the input signal $X$ (at 60), the segments being defined by

    $$S_i(t) = W(t/L_i)\,X(t + t_i) \qquad (-L_i < t < 0)$$
    $$S_i(t) = W(t/L_{i+1})\,X(t + t_i) \qquad (0 < t < L_{i+1})$$

    and these segments are to be superposed to produce the output signal $Y$ (at 63):

    $$Y(t) = {\sum_i}'\, S_i(t - T_i)$$

    (the sum being limited to indices $i$ for which $-L_i < t - T_i < L_{i+1}$).
    At any point in time $t'$ a signal $X(t')$ is supplied at the input 60, which contributes to two segments $i$, $i+1$ at respective $t$ values $t_a = t' - t_i$ and $t_b = t' - t_{i+1}$ (these being the only possibilities for which $-L_i < t < L_{i+1}$).
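The segment definition and the superposition can be sketched in a few lines of Python. The window shape is not given in this passage, so a raised cosine on (-1, 1) is assumed here; the function names and the dictionary representation of a segment are likewise illustrative only.

```python
import math


def window(x):
    """Assumed window W: raised cosine on (-1, 1), zero elsewhere."""
    return 0.5 + 0.5 * math.cos(math.pi * x) if -1.0 < x < 1.0 else 0.0


def segment(x, t_i, l_left, l_right):
    """Segment S_i as {t: value} for -l_left < t < l_right around mark t_i."""
    seg = {}
    for t in range(-l_left + 1, l_right):
        n = t_i + t
        if 0 <= n < len(x):
            # Left half is scaled by L_i, right half by L_{i+1}.
            scale = l_left if t < 0 else l_right
            seg[t] = window(t / scale) * x[n]
    return seg


def overlap_add(segments, centres, length):
    """Output Y(t) = sum over i of S_i(t - T_i), with T_i taken from `centres`."""
    y = [0.0] * length
    for seg, big_t in zip(segments, centres):
        for t, value in seg.items():
            if 0 <= big_t + t < length:
                y[big_t + t] += value
    return y
```

With this assumed window, placing the segments back at their original positions (T_i equal to t_i) reproduces the input away from the edges, because adjacent window halves sum to one. Pitch or duration changes then amount to choosing different T_i or a different selection of segments.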
    Figure 4 shows the multiplication means 61 and the window function value selection means 69. The respective $t$ values $t_a$, $t_b$ described above are multiplied by the inverse of the period length $L_i$ (determined from the period length in an invertor 74) in scaling multipliers 70a, 70b to determine the corresponding arguments of the window function $W$. These arguments are supplied to window function evaluators 71a, 71b (implemented, for example, as a lookup table in the case of discrete arguments), which output the corresponding values of the window function; these are multiplied with the input signal in two multipliers 72a, 72b. This produces the segment signal values $S_i$, $S_{i+1}$ at two inputs 73a, 73b to the storage means 62.
    These segment signal values are stored in the storage means 62 in segment slots, at addresses in the slots corresponding to their respective time point values $t_a$, $t_b$ and to the respective slot numbers. These addresses are controlled by window position selection means 65. Window position selection means suitable for implementing the invention are shown in Figure 5. The time point values $t_a$, $t_b$ are addressed by counters 81, 82; the segment slot numbers are addressed by indexing means 84 (which output the segment indices $i$, $i+1$). The counters 81, 82 and the indexing means 84 output addresses with a width appropriate to distinguish the various positions within the slots and the various slots respectively, but are shown symbolically only as single lines in Figure 5.
    The two counters 81, 82 are clocked at a fixed clock rate and count from an initial value loaded from a load input (L), upon a trigger signal at trigger input (T). The indexing means 84 increment the index values upon reception of this trigger signal. According to one embodiment, pitch measuring means 86 determine a pitch value from input 60, and control the scale factor for the scaling multipliers 70a, 70b, and provide the initial value of the first counter 81 (the initial count being minus the pitch value), whereas the trigger signal is generated internally in the window position selection means, once the counter reaches zero, as detected by a comparator 88. This means that successive windows are placed by incrementing the location of a previous window by the time needed by the first counter 81 to reach zero.
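    The incremental placement of windows described here can be sketched as a simple simulation. The function `place_windows` and its argument `pitch_at` (a stand-in for the pitch measuring means 86, returning a period in samples) are hypothetical names, and one counter step per sample is an assumption.

```python
def place_windows(pitch_at, n_samples):
    """Return the sample positions at which successive windows are placed.

    The counter is loaded with minus the measured pitch period and counts
    up at a fixed rate; reaching zero triggers the next window and a reload.
    """
    marks = [0]
    counter = -pitch_at(0)
    for n in range(1, n_samples):
        counter += 1                  # fixed clock rate
        if counter >= 0:              # role of the comparator 88
            marks.append(n)           # trigger: place the next window here
            counter = -pitch_at(n)    # reload with minus the current pitch
    return marks

# Hypothetical pitch track: 10 samples per period, dropping to 8 at sample 50.
marks = place_windows(lambda n: 10 if n < 50 else 8, 80)
```

    Each new position depends only on the previous position and the pitch value measured there, which is the incremental property the text describes.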
    In another embodiment, a monotonized signal is applied to the input 60 (this monotonized signal being obtained by prior processing in which the pitch is adjusted to a time-independent value). In this monotonized case, a constant value corresponding to the monotonized pitch is fed as the initial value to the first counter 81. The scaling multipliers 70a, 70b can then be omitted, since the windows have a fixed size.
    The combination means 66 of Figure 3 are shown in Figure 10. The purpose of the output side is to superpose segments from the storage means 62 according to
    $$Y(t) = {\sum_i}' \, S_i(t - T_i),$$
    the sum being limited to index values i for which $-L_i < t - T_i < L_{i+1}$;
    in principle, any number of index values may contribute to the sum at one time point t. But when the pitch is not changed by more than a factor of 3/2, at most 3 index values will contribute at a time. By way of example, therefore, Figures 3 and 7 show an apparatus which provides for only three active indices at a time; extension to more than three segments is straightforward.
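    The bound of three active indices can be checked numerically with a small sketch. It assumes a constant input period, constant output spacing and integer sample times; `max_active_segments` is a hypothetical helper, not part of the described apparatus.

```python
def max_active_segments(period, out_period):
    """Largest number of segments overlapping any integer output time.

    Segment i is centred at i * out_period and is non-zero for
    -period < t - i * out_period < period.
    """
    best = 0
    for t in range(out_period):   # one output period suffices by periodicity
        count = sum(1 for i in range(-period, period + 1)
                    if -period < t - i * out_period < period)
        best = max(best, count)
    return best

unchanged = max_active_segments(12, 12)   # pitch unchanged
raised = max_active_segments(12, 8)       # pitch raised by a factor 3/2
doubled = max_active_segments(12, 6)      # pitch raised by a factor 2
```

    Under these assumptions two segments overlap when the pitch is unchanged, three when the output period is two thirds of the input period, and four once the pitch is doubled, which is consistent with the 3/2 limit stated above.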
    For addressing the segments, the combination means 66 are quite similar to the input side: they comprise three counters 101, 102, 103 (clocked at a fixed rate), outputting the time point values t-Ti for the three segment signals. The three counters receive the same trigger signal, which triggers loading of minus the desired output pitch interval in the first of the three counters 101. Upon the trigger signal the last position of the first counter 101 is loaded into the second counter 102, and in the third counter 103 the last position of the second counter 102 is loaded. The trigger signal is generated by a comparator 104, which detects zero crossing of the first counter 101. The trigger signal also updates indexing means 106.
    The indexing means address the segment slot numbers which must be read out and the counters address the position within the slots. The counters and indexing means address three segments, which are output from the storage means 62 to the summing means 64 in order to produce the output signal.
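    The chained loading of the three output counters can be illustrated with the sketch below. The ordering of clock tick and trigger within one step, the use of `None` for a counter that has not yet been loaded, and the name `output_counters` are assumptions; the output pitch interval is held constant for simplicity.

```python
def output_counters(out_period, n_ticks):
    """Trace the values (t - T_i) held by counters 101, 102 and 103."""
    c1, c2, c3 = -out_period, None, None
    trace = []
    for _ in range(n_ticks):
        trace.append((c1, c2, c3))
        # Fixed-rate clock tick: every loaded counter advances by one.
        c1 += 1
        c2 = None if c2 is None else c2 + 1
        c3 = None if c3 is None else c3 + 1
        if c1 == 0:   # role of the comparator 104: zero crossing of counter 101
            # Trigger: shift the last positions down the chain and reload
            # the first counter with minus the desired output pitch interval.
            c3, c2, c1 = c2, c1, -out_period
    return trace

trace = output_counters(4, 10)
```

    After two triggers the three counters hold values that differ by exactly one output pitch interval, so each one addresses the same output instant relative to a different segment centre.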
    By applying desired pitch interval values at the pitch control input 68a, one can thus control the pitch value. The duration of the speech signal is controlled by a duration control input 68b to the indexing means. Without duration manipulation, the indexing means simply produce three successive segment slot numbers. At the trigger signal, the values of the first and second output are copied to the second and third output respectively, and the first output is increased by one. When the duration is increased, the first output is kept constant once every so many cycles, as determined by the duration control input 68b. To decrease the duration, the first output is increased by two every so many cycles. The change in duration is determined by the net number of skipped or repeated indices. When the apparatus is used to change the pitch and duration of a signal independently (for example changing the pitch and keeping the duration constant), the duration input 68b should be controlled to give a net frequency F at which indices should be skipped or repeated according to $F = (D\,t/T) - 1$ (D being the factor by which the duration is changed, t being the pitch period length of the input signal and T being the period length of the output signal; a negative value of F corresponds to skipping of indices, a positive value corresponds to repetition).
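    As a worked example of the formula for F, the sketch below computes the net frequency and derives one possible index sequence from it. It reads F as the net number of repetitions per input index, which is an interpretation; the accumulator scheme and the names `repeat_frequency` and `index_sequence` are hypothetical and only one of many ways to spread the repetitions or skips evenly.

```python
from fractions import Fraction

def repeat_frequency(duration_factor, in_period, out_period):
    """F = (D * t / T) - 1; positive means repeat, negative means skip."""
    return Fraction(duration_factor) * in_period / out_period - 1

def index_sequence(n_input, freq):
    """Segment indices read out on successive output cycles.

    Each input index is emitted 1 + F times on average (requires F > -1).
    """
    sequence = []
    acc = Fraction(0)
    for i in range(n_input):
        acc += 1 + freq
        while acc >= 1:
            sequence.append(i)
            acc -= 1
    return sequence

# Raise the pitch (period 10 -> 8 samples) while keeping the duration (D = 1):
f_up = repeat_frequency(1, 10, 8)      # 1/4: one repetition per four indices
seq_up = index_sequence(8, f_up)

# Lower the pitch (period 8 -> 10 samples) while keeping the duration:
f_down = repeat_frequency(1, 8, 10)    # -1/5: one skip per five indices
seq_down = index_sequence(5, f_down)
```

    In the first case eight input periods of 10 samples become ten output periods of 8 samples, and in the second five input periods of 8 samples become four output periods of 10 samples, so the total duration is unchanged in both.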
    Figure 3 only provides one embodiment by way of example. The principal point is the incremental placement of windows at the input side with a phase determined from the phase of a previous window. There are many ways of generating the addresses for the storage means 62, of which Figure 5 is but one. For example, the addresses may be generated using a computer program, and the starting addresses need not have the values given in the example.
    Figure 3 can be implemented in various ways, for example using digital samples at input 60, where the sampling rate has any convenient value, for example 10000 samples per second; alternatively, it may use continuous signal techniques, where the counters 81, 82, 101, 102, 103 provide continuous ramp signals, and the storage means provide for continuously controlled access, like a magnetic disk. Furthermore, in practice the segment slots in Figure 3 may be reused after some time, as they are not needed permanently. Not all components of Figure 4 need to be implemented by discrete function blocks: often they may be implemented in whole or in part by a computer.

    Claims (7)

    1. A method for manipulating an audio equivalent signal, comprising:
      positioning of a chain of mutually overlapping time windows with respect to the audio equivalent signal, wherein a positional displacement between adjacent windows substantially corresponds to a principal period as based on periodicity measurements on said audio equivalent signal,
      forming segment signals Si each deriving from the audio equivalent signal through weighting with a window function of the associated window Wi; and
      synthesizing an audio output signal by chained superposition of the segment signals, characterized:
      in that the step of positioning the chain of mutually overlapping time windows includes shifting each window Wi with respect to a previous window Wi-1 in the chain over an actual pitch period length Li of said audio equivalent signal, where the window Wi has a window function formed by linearly stretching a first half of a normalised window function by Li and a second half of the normalised window function by Li+1; and
      in manipulating a duration of said output signal through systematically repeating, maintaining, and/or suppressing said segment signals, to a predetermined length of pictorial material corresponding to said audio equivalent signal, where said length differs from a duration of said audio equivalent signal.
    2. A method as claimed in Claim 1, wherein said predetermined length applies to a plurality of speech equivalent signals in parallel that correspond in content but have differences in representation.
    3. A method as claimed in Claim 2 wherein said differences originate from said plurality of audio equivalent signals being in as many different languages.
    4. A method as claimed in Claim 1, 2 or 3, wherein said predetermined length pertains to an intermission between non-manipulated audio equivalent signals.
    5. A method as claimed in any of Claims 1 to 4 for post-synchronizing human speech as featured by a video representable item.
    6. A method for producing a software title from predetermined pictorial material and at least one corresponding audio equivalent signal; the method comprising:
      manipulating the audio equivalent signal, by positioning of a chain of mutually overlapping time windows with respect to the audio equivalent signal, as based on periodicity measurements on said audio equivalent signal, and wherein a positional displacement between adjacent windows substantially corresponds to a principal period of said periodicity; deriving segment signals from the audio equivalent signal through weighting with the associated window function; and synthesizing an audio output signal by chained superposition of said segment signals, wherein a duration of said audio output signal is manipulated to a predetermined length of the pictorial material through systematically repeating, maintaining, and/or suppressing said segment signals, where said length differs from a duration of said audio equivalent signal; and
      storing the pictorial material and the resulting audio output signal in a unitary storage medium for synchronised playback.
    7. An apparatus for manipulating an audio equivalent signal; the apparatus comprising:
      means for positioning a chain of mutually overlapping time windows with respect to the audio equivalent signal, as based on periodicity measurements on said audio equivalent signal, by shifting each window Wi with respect to a previous window Wi-1 in the chain over an actual pitch period length Li of said audio equivalent signal, where the window Wi has a window function formed by linearly stretching a first half of a normalised window function by Li and a second half of the normalised window function by Li+1; and
      means for deriving segment signals from the audio equivalent signal through weighting with the associated window function; and
      means for synthesizing an audio output signal by chained superposition of said segment signals by manipulating a duration of said output signal to a predetermined length of pictorial material corresponding to said audio equivalent signal through systematically repeating, maintaining, and/or suppressing said segment signals, where said length differs from a duration of said audio equivalent signal.
    EP19920202374 1991-08-09 1992-07-31 Method and apparatus for manipulating duration of a physical audio signal, and a storage medium containing a representation of such physical audio signal Expired - Lifetime EP0527529B1 (en)

    Priority Applications (1)

    Application Number Priority Date Filing Date Title
    EP19920202374 EP0527529B1 (en) 1991-08-09 1992-07-31 Method and apparatus for manipulating duration of a physical audio signal, and a storage medium containing a representation of such physical audio signal

    Applications Claiming Priority (5)

    Application Number Priority Date Filing Date Title
    EP91202044 1991-08-09
    EP92200521 1992-02-24
    EP19920202374 EP0527529B1 (en) 1991-08-09 1992-07-31 Method and apparatus for manipulating duration of a physical audio signal, and a storage medium containing a representation of such physical audio signal

    Publications (3)

    Publication Number Publication Date
    EP0527529A2 (en) 1993-02-17
    EP0527529A3 (en) 1993-05-05
    EP0527529B1 (en) 2000-07-19

    Family

    ID=27234119

    Family Applications (1)

    Application Number Title Priority Date Filing Date
    EP19920202374 Expired - Lifetime EP0527529B1 (en) 1991-08-09 1992-07-31 Method and apparatus for manipulating duration of a physical audio signal, and a storage medium containing a representation of such physical audio signal

    Country Status (1)

    Country Link
    EP (1) EP0527529B1 (en)

    Families Citing this family (4)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    DE69822618T2 (en) * 1997-12-19 2005-02-10 Koninklijke Philips Electronics N.V. REMOVING PERIODICITY IN A TRACKED AUDIO SIGNAL
    EP0995190B1 (en) 1998-05-11 2005-08-03 Koninklijke Philips Electronics N.V. Audio coding based on determining a noise contribution from a phase change
    EP0993674B1 (en) 1998-05-11 2006-08-16 Philips Electronics N.V. Pitch detection
    EP1628288A1 (en) * 2004-08-19 2006-02-22 Vrije Universiteit Brussel Method and system for sound synthesis

    Family Cites Families (4)

    * Cited by examiner, † Cited by third party
    Publication number Priority date Publication date Assignee Title
    JPS58102298A (en) * 1981-12-14 1983-06-17 キヤノン株式会社 Electronic appliance
    CA1204855A (en) * 1982-03-23 1986-05-20 Phillip J. Bloom Method and apparatus for use in processing signals
    US5055939A (en) * 1987-12-15 1991-10-08 Karamon John J Method system & apparatus for synchronizing an auxiliary sound source containing multiple language channels with motion picture film video tape or other picture source containing a sound track
    FR2636163B1 (en) * 1988-09-02 1991-07-05 Hamon Christian METHOD AND DEVICE FOR SYNTHESIZING SPEECH BY ADDING-COVERING WAVEFORMS

    Also Published As

    Publication number Publication date
    EP0527529A2 (en) 1993-02-17
    EP0527529A3 (en) 1993-05-05


    Legal Events

    Date Code Title Description
    PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

    Free format text: ORIGINAL CODE: 0009012

    AK Designated contracting states

    Kind code of ref document: A2

    Designated state(s): DE FR GB IT

    PUAL Search report despatched

    Free format text: ORIGINAL CODE: 0009013

    AK Designated contracting states

    Kind code of ref document: A3

    Designated state(s): DE FR GB IT

    17P Request for examination filed

    Effective date: 19931026

    17Q First examination report despatched

    Effective date: 19970221

    RAP3 Party data changed (applicant data changed or rights of an application transferred)

    Owner name: KONINKLIJKE PHILIPS ELECTRONICS N.V.

    GRAG Despatch of communication of intention to grant

    Free format text: ORIGINAL CODE: EPIDOS AGRA

    GRAG Despatch of communication of intention to grant

    Free format text: ORIGINAL CODE: EPIDOS AGRA

    GRAH Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOS IGRA

    RIC1 Information provided on ipc code assigned before grant

    Free format text: 7G 10L 21/04 A

    GRAH Despatch of communication of intention to grant a patent

    Free format text: ORIGINAL CODE: EPIDOS IGRA

    GRAA (expected) grant

    Free format text: ORIGINAL CODE: 0009210

    AK Designated contracting states

    Kind code of ref document: B1

    Designated state(s): DE FR GB IT

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: IT

    Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT;WARNING: LAPSES OF ITALIAN PATENTS WITH EFFECTIVE DATE BEFORE 2007 MAY HAVE OCCURRED AT ANY TIME BEFORE 2007. THE CORRECT EFFECTIVE DATE MAY BE DIFFERENT FROM THE ONE RECORDED.

    Effective date: 20000719

    REF Corresponds to:

    Ref document number: 69231266

    Country of ref document: DE

    Date of ref document: 20000824

    ET Fr: translation filed
    PLBE No opposition filed within time limit

    Free format text: ORIGINAL CODE: 0009261

    STAA Information on the status of an ep patent application or granted ep patent

    Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

    26N No opposition filed
    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: IF02

    REG Reference to a national code

    Ref country code: GB

    Ref legal event code: 732E

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: GB

    Payment date: 20031224

    Year of fee payment: 12

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: FR

    Payment date: 20031231

    Year of fee payment: 12

    PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

    Ref country code: DE

    Payment date: 20040115

    Year of fee payment: 12

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: TP

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: GB

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20040731

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: DE

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20050201

    GBPC Gb: european patent ceased through non-payment of renewal fee

    Effective date: 20040731

    PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

    Ref country code: FR

    Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

    Effective date: 20050331

    REG Reference to a national code

    Ref country code: FR

    Ref legal event code: ST