WO2004049307A1

WO2004049307A1 - Method for automatically matching audio segments with text elements

Info

Publication number: WO2004049307A1
Application number: PCT/AT2003/000356
Authority: WO
Inventors: Norbert Pfannerer; Gerhard Backfried
Original assignee: Sail Labs Technology Ag
Priority date: 2002-11-28
Filing date: 2003-11-28
Publication date: 2004-06-10
Also published as: AT6921U1; WO2004049307A8; AU2003285972A1

Abstract

The invention relates to a method for automatically matching audio segments with text elements in a transcript (MT) manually created from the audio recording, whereby an automatic transcript (AT) was created from the audio recording and contains, together with a time reference, the audio segments converted into text elements. The inventive method also involves: subdividing the automatic transcript (AT) and the manual transcript (MT) into passages (ati, mtj) of a defined length that each comprises a number of text elements; shifting each passage in the automatic transcript (AT) and in the manual transcript (MT) over the entire automatic and manual transcript, whereby each passage overlaps the preceding passage; determining a specific passage characteristic for each passage; comparing the passage characteristic of each passage (ati) in the automatic transcript (AT) with each passage (mtj) in the manual transcript (MT); assigning a respective passage in the automatic transcript (AT) to a corresponding passage in the manual transcript (MT) so that the optimal path of assignments results when looking at the sum of the passage comparisons.

Description

Method for automatic matching of audio segments with text elements

The invention relates to the automatic recognition of natural language. Specifically, it concerns a novel method for automatically matching audio segments contained in an audio recording with text elements in a transcript generated manually from that recording. An automatic transcript is first created from the audio recording, preferably by an automatic speech recognizer; it contains the audio segments converted into text elements, each together with a time reference indicating where in the audio recording the automatically created text element is located.

An automatic speech recognizer can generate an automatic transcript from input audio data which corresponds to the words occurring in the audio data. The audio data can come from a variety of sources, e.g. from video recordings or audio clips. The manual transcript is typically created by a transcriptionist who uses an audio recording or shorthand notes for reference. The automatic transcript is compared with the manual transcript by means of the dynamic alignment process, and corresponding passages are found.

In principle, however, the method according to the invention is equally suitable for texts that were not produced by an automatic speech recognizer.

The field of multimedia data processing has become increasingly important in recent years. The number of recordings available for processing has increased enormously, not least thanks to the rapid growth in processing and storage capacity. However, efficiently extracting the desired and relevant information from this huge amount of data is an increasing problem. Extracting relevant data is a particular challenge for recordings of court hearings, lectures or conferences. This challenge is met on the one hand by automating the transcription process using automatic speech processing; on the other hand, recordings are still transcribed manually, because the quality of the automatic procedure has so far been considered sufficient in only very few cases. Manual transcription allows information to be found reliably in textual form. However, since manual transcription rarely annotates exactly when a word or phrase was said, the temporal link from the text to the multimedia medium is missing. Thus, for example, to check the exact wording of a testimony or to view a testimony in the video, one has to search the medium sequentially (with the help of the text). This is cumbersome and extremely time-consuming for longer passages.

To solve this problem, methods have already been developed to create an exact link between the transcribed words and the multimedia medium. This link allows a precise connection between text and audio (or video), which allows direct access and makes tedious searching unnecessary.

The time information of the automatically recognized text (which assigns exactly one point in time to the underlying audio / video) is transferred to the manually transcribed words. This allows the corresponding audio or video sequences to be found efficiently, starting from the manually transcribed text. Fig. 1 shows the overall process schematically.

However, the transcription of multimedia data (or of the audio data contained in it) is still a technology area on the threshold between research and commercial use. Existing methods, such as those disclosed in EP 0 649 144 "Automatic indexing of audio using speech recognition", US 5,649,060 "Automatic indexing and aligning of audio and text using speech recognition" and US 6,076,059 "Method for aligning text with audio signals", address the problem described here. However, these methods are based on single words, which makes them more susceptible to poor recognition rates in automatic speech recognition. The methods known from the cited patent specifications have in common that they are based on recognizing and finding identical words (in context) and use the pairs found as "anchor points".

For a better understanding, it should be noted that in speech processing the term "forced alignment", as used in US Pat. No. 6,076,059, refers to a process that reconciles an already known text with a recording (i.e. an alignment between text and audio is to be produced). However, this process is fraught with a number of problems: the transcribed text almost never has the necessary accuracy. Superimposed disturbances in particular easily lead to problems with the speech recognizer. Longer pieces of audio may not contain the entire transcript, and the length of the windows to be used can be difficult to determine in advance. It should be noted that the term "window" in connection with "forced alignment" refers to windows of the audio file. For example, the next 20 seconds of the audio file (i.e. a window 20 s long) are taken together with the next ten words of the manual transcript that have not yet been used in the process, and the speech recognizer determines the assignment of these words to the audio contained in the window.

The aim of the present invention is to provide a novel method for establishing an automatic match between an automatically produced text or transcript (preferably produced by automatic speech recognition) and a manually generated text or transcript, the method being intended to be much more robust against errors and omissions in the automatically generated text than the known methods. Furthermore, the method according to the invention is intended to substantially reduce the effort in processing the automatically generated text.

To achieve this object, the invention provides a method for automatic correspondence between an automatically generated text or transcript and a manually generated text or transcript, as defined in claim 1.

Advantageous refinements and developments of this method are defined in the claims dependent on claim 1.

In contrast to the known methods, the method according to the invention does not use the individual words themselves, but rather entire text passages which are shifted in a window-like manner (sliding window) and overlap over the entire text. The passages are represented by properties of the words corresponding to them (similar to those that are also used in the field of “information retrieval”), which means that errors in speech recognition can be compensated for. The result is an association of passages and the words contained in the automatic transcript with those in the manual transcript. Since passages (text windows) are used according to the invention as a unit of the matching process and properties defined on the words contained in these passages, exact matching of words is no longer necessary, and the method thus becomes much more robust against errors in the automatic text creation. Through the use according to the invention of an approach based on text passages instead of the known approach based on individual words, the outlay on processing is also considerably reduced.

In contrast to methods based on the principle of "forced alignment", the approach according to the invention does not require the text to be available before the actual recognition. Furthermore, speech recognition takes place without the aid of forced alignment (and of the problems associated with it). The present method is restricted exclusively to the use of the text produced by the speech recognizer and its manually generated counterpart.

The present method thus allows a transcription (AT) generated, for example, by an automatic speech recognizer to be automatically and dynamically reconciled with a manual transcription (MT) of the same audio or video file (i.e. an alignment, an association between them).

In the present method, the automatic speech recognizer produces an automatic transcript which corresponds to the words occurring in the audio data entered. A time stamp is also generated for each word of the transcription. This time stamp indicates exactly when the word was detected in the audio stream (relative to the beginning of the file). The audio data itself can come from a variety of sources, such as video recordings or audio clips. The manual transcript is typically created by a transcriptionist who uses a recording or a stenogram as a reference. The quality of the transcription, and how exactly it reproduces the actual audio, varies greatly. Since the manual transcription focuses on intelligibility and is not meant to provide the most accurate possible transcription of the audio data, non-linguistic phenomena such as clearing the throat, coughing, breathing noises, smacking the lips, etc., and linguistic phenomena such as stuttering, slips of the tongue, self-corrections and multiple starts of a phrase (e.g. "I I want to make you the following offer and ...") are not taken into account. However, these are recognized and transcribed by the automatic speech recognizer (possibly also recognized and transcribed "wrongly"). They therefore pose a problem in the assignment of the two transcripts; however, taking them into account allows a more precise assignment of words and their time stamps.

The invention is explained in more detail below with reference to the drawings, in which FIG. 1 shows a general scheme of the assignment of text from a transcript automatically generated from an audio recording to text from a transcript generated manually from the audio recording, FIG. 2 schematically shows an overview of the method according to the invention, FIG. 3 shows an evaluation matrix used in the method according to the invention, FIG. 4 shows a word frequency vector created during the implementation of the method, FIG. 5 shows how the manually created text is provided word by word with time stamps of the automatically recognized words, and FIG. 6 shows the first steps of the result of the dynamic comparison in an exemplary embodiment of the method according to the invention.

The present method is based on the subdivision of the two texts into passages (windows), the length of which is determined by an adjustable parameter. Each passage is shifted onward in the text by a specified value (in words). This is done in both files in the same way and with overlap, each passage of AT being compared with every passage of MT (see FIG. 2). The length of the passages need not be the same. By varying the parameters and creating an assignment several times, the assignment that received the best overall rating can be selected. This is a special case of the dynamic programming method, based on text passages instead of single words. Dynamic programming in itself is a general programming technique that is often used when the search space of a problem can be represented as a sequence of states. The states must satisfy the following conditions:

- The initial state contains trivial solutions to sub-problems

- Each partial solution of a later state can be determined from a limited number of already calculated partial solutions of an earlier state

- The last state contains the solution to the overall problem

In our case, these requirements are met: the two sequences of text passages form the axes of a matrix (FIG. 3). The columns of the matrix represent the states, and each matrix entry represents a partial solution that is determined only from partial solutions in the previous column and the same column. For example, the entry $a_{i,j}$ represents the best possible arrangement of sequences up to and including a match of sequence $at_i$ and sequence $mt_j$. The last element $a_{n,m}$ contains the overall solution to the problem. By tracing the path back through the matrix, the best possible assignment is then obtained.

In our case, each step is a comparison of two text passages, which provides an assessment of the similarity of these passages (and which in turn feeds into the dynamic assignment). This comparison is carried out using properties defined on these passages (such as those used, for example, in the area of "information retrieval"). The passages are represented as vectors of these properties (defined by the words contained in the passage). In the preferred embodiment, each component of the vector is assigned a word and its frequency (FIG. 4).
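The word frequency vector of the preferred embodiment can be illustrated with a minimal Python sketch. The function name `frequency_vector` and the lower-casing of words are assumptions made for this illustration and are not taken from the description.

```python
from collections import Counter


def frequency_vector(passage_words):
    """Map each word of a passage to the number of times it occurs.

    Lower-casing is an assumption of this sketch, not part of the description.
    """
    return Counter(word.lower() for word in passage_words)


passage = "the words in the input data".split()
vector = frequency_vector(passage)
print(vector["the"], vector["words"], vector["speech"])  # 2 1 0
```

A sparse mapping is used here because a passage contains only a few of the words of the whole text; words that do not occur simply have frequency 0.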

The use of this representation allows a number of possibilities for representation. For example, the TF-IDF (term frequency / inverse document frequency) known in the field of "information retrieval" can be used, the most important terms of which are summarized below:

term frequency / inverse document frequency (tf / idf):

term frequency: $\mathrm{tf}_{i,j}$, how often word $w_i$ occurs in document $d_j$

document frequency: $\mathrm{df}_i$, the number of documents in the corpus in which $w_i$ occurs

collection frequency: $\mathrm{cf}_i$, the total number of occurrences of $w_i$ in the entire corpus

These quantities are usually defined over a corpus of words and documents. In our case, the passages (windows) can be regarded as documents and the entire document as the corpus. Alternatively, the entire document can be regarded as a single document and the corpus model built from a larger body of text. The two approaches can also be combined.

tf / idf merely states that the actual word frequency (term frequency) and the document frequency are combined to determine the value for a specific word. There are numerous variants of how these values are combined, e.g.

$$\mathrm{weight}(i,j) = \bigl(1 + \log(\mathrm{tf}_{i,j})\bigr) \cdot \log(N / \mathrm{df}_i)$$

where $N$ is the total number of documents.
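As a hedged illustration of the weighting variant given above, the following Python sketch treats each passage as a document and the list of passages as the corpus, which is one of the options mentioned in the text. The function name `tfidf_weights` and the data layout are assumptions of this sketch.

```python
import math
from collections import Counter


def tfidf_weights(passages):
    """Compute (1 + log tf) * log(N / df) for every word of every passage.

    `passages` is a list of word lists; each passage plays the role of a
    document and the whole list is the corpus (an assumption of this sketch).
    Returns one {word: weight} dict per passage.
    """
    n_docs = len(passages)
    doc_freq = Counter()
    for passage in passages:
        doc_freq.update(set(passage))  # count each word once per passage

    weighted = []
    for passage in passages:
        term_freq = Counter(passage)
        weighted.append({
            word: (1 + math.log(tf)) * math.log(n_docs / doc_freq[word])
            for word, tf in term_freq.items()
        })
    return weighted


weights = tfidf_weights([["a", "transcript", "a"], ["a", "word"]])
```

In this toy input the word "a" occurs in every passage, so its weight is 0, whereas "transcript" occurs in only one of the two passages and receives the weight log(2).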

Further possible representations use word stems (lemmas) instead of full forms; it is also possible to use the phonetic similarity of words, or stop word lists to exclude certain words. When normalized vectors are used (i.e. vectors of length 1), the comparison of two vectors amounts to determining the cosine of the angle between them. This serves as a measure of the similarity of the vectors and thus of the text passages they represent. Alternative measures are, for example, the distance between the end points of the vectors or the number of differing dimensions. The dynamic programming method yields the best possible chain of assignments between passages from AT and passages from MT (best possible in the sense of minimizing the cost of the assignment process, where identical passages receive a value of 0 and differing passages receive values between 0 and 1 according to their distance, i.e. the angle between them).
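The comparison of two normalized vectors can be sketched as follows. The description only states that identical passages receive the value 0 and differing passages a value between 0 and 1; taking the distance as 1 minus the cosine is an assumption of this sketch, as are the names `normalize` and `cosine_distance`.

```python
import math


def normalize(vector):
    """Scale a sparse {word: value} vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    if norm == 0:
        return dict(vector)
    return {word: v / norm for word, v in vector.items()}


def cosine_distance(vec_a, vec_b):
    """Return 1 - cos(angle): 0 for identical directions, 1 for no shared words.

    Using 1 - cosine as the distance is an assumption of this sketch.
    """
    a, b = normalize(vec_a), normalize(vec_b)
    cosine = sum(value * b.get(word, 0.0) for word, value in a.items())
    return min(1.0, max(0.0, 1.0 - cosine))  # clamp rounding noise


same = cosine_distance({"a": 1, "b": 1}, {"a": 2, "b": 2})
disjoint = cosine_distance({"a": 1}, {"b": 1})
```

Because word frequencies are never negative, the cosine lies between 0 and 1, so the resulting distance stays within the range described above.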

The possible assignments of two passages are:

1) Passage from AT is assigned to passage from MT

2) Passage from AT cannot be assigned to a passage from MT

3) Passage from MT cannot be assigned to a passage from AT

Case 2) is called an "insertion error" in the literature and corresponds to a text passage that was transcribed by the automatic speech recognizer. There is speech in the audio file at this point, but it was not transcribed manually (possibly voices and audio during a break in negotiations, interruptions, etc.) because the transcriptionist ignored it or considered it unimportant (which may be correct for a human reader or listener, but is a problem for automatic processing).

Case 3) is referred to as a "deletion error" and corresponds to a manually transcribed passage that was not transcribed by the automatic speech recognizer (e.g. additions or an inaccurate manual transcription).

Assignments of both types can be taken into account by the present method: "insertion errors" by displaying the "additional text" differently, and "deletion errors" by inserting the text that was not directly assigned at the appropriate place (see also the following description of a preferred embodiment).
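The three kinds of assignment can be combined in a dynamic programming sketch like the one below. It is a generic alignment routine written for illustration, not the implementation of the invention: the function name `align_passages`, the fixed penalty `gap_cost` for unassigned passages, and the choice to read the result from the last matrix cell are all assumptions.

```python
def align_passages(cost, gap_cost=0.6):
    """Align two passage sequences by dynamic programming.

    cost[i][j] is the distance between AT passage i and MT passage j
    (0 = identical, 1 = nothing in common). gap_cost is an assumed penalty
    for leaving a passage unassigned. Returns (total_cost, steps), where each
    step is ("match", i, j), ("insertion", i, None) or ("deletion", None, j).
    """
    n = len(cost)
    m = len(cost[0]) if n else 0
    # best[i][j]: cheapest alignment of the first i AT and first j MT passages
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], back[i][0] = best[i - 1][0] + gap_cost, "insertion"
    for j in range(1, m + 1):
        best[0][j], back[0][j] = best[0][j - 1] + gap_cost, "deletion"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (best[i - 1][j - 1] + cost[i - 1][j - 1], "match"),
                (best[i - 1][j] + gap_cost, "insertion"),
                (best[i][j - 1] + gap_cost, "deletion"),
            ]
            # On ties, the first option in the list (a match) is preferred.
            best[i][j], back[i][j] = min(options, key=lambda o: o[0])
    # Trace the cheapest path back from the last cell.
    steps, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            steps.append(("match", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "insertion":
            steps.append(("insertion", i - 1, None))
            i -= 1
        else:
            steps.append(("deletion", None, j - 1))
            j -= 1
    steps.reverse()
    return best[n][m], steps


# Three AT passages against two MT passages; AT passage 0 resembles nothing in MT.
total, steps = align_passages([[0.9, 1.0], [0.1, 0.8], [0.7, 0.2]])
```

With these made-up distances the first AT passage is reported as an insertion and the remaining two AT passages are matched to the two MT passages in order.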

The dynamic assignment process also provides an overall value that describes the overall quality of the assignment. If this value falls below a limit, the assignment can be rejected as not meaningful. If several assignments are available, the one with the best rating can be selected. If a successful (meaningful) assignment of passages has been created, a direct relationship between the words contained in these passages can be established. This allows the manually created text to be provided word by word with time stamps of the automatically recognized words (FIG. 5). This makes it possible to find the appropriate position for each word in the underlying audio or video file, so that audio passages can be located efficiently. The accuracy of this process is determined by the length of the window (i.e. the passage), the assignment of the words that occur in the window, and the interpolation between the windows.

In the following example of a currently preferred embodiment of the method according to the invention, an automatically generated transcript (AT), which is generated by an automatic speech recognizer, is compared with a manually created transcript (MT) of the same audio file and the text passages and words contained therein are brought into harmony with one another.

For this purpose, the two text files are divided into text passages of the same length. For each comparison step, these are shifted onward by a predetermined number of words, in such a way that adjacent passages overlap by a predetermined number of words. These two values are freely selectable and can be adapted to the specific text (e.g. knowledge of the nature of the automatic and / or manual transcription, or of its quality, can be incorporated).
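The subdivision into overlapping passages can be sketched as follows. The function name `make_passages`, its parameters, and the handling of the end of the text (the last passage may be shorter than the others) are assumptions of this sketch.

```python
def make_passages(words, length, shift):
    """Split a word list into overlapping passages (windows).

    `length` is the passage size in words and `shift` the step between
    passage starts; adjacent passages overlap by `length - shift` words.
    Returns (start_index, words_in_passage) pairs. The last passage may be
    shorter than `length` (an assumption of this sketch).
    """
    if length <= 0 or shift <= 0:
        raise ValueError("length and shift must be positive")
    if not words:
        return []
    passages = []
    start = 0
    while True:
        passages.append((start, words[start:start + length]))
        if start + length >= len(words):
            break  # this passage already reaches the end of the text
        start += shift
    return passages


mt_words = "the automatic speech recognizer produces an automatic transcript".split()
for start, passage in make_passages(mt_words, length=4, shift=2):
    print(start, passage)
```

With a length of 4 and a shift of 2, neighbouring passages share two words, so every word away from the edges of the text appears in two passages.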

All passages are compared using the dynamic programming method. The metric used in this comparison is the cosine of the angle (the distance) between the vectors representing the respective text passages. These vectors are generated from the words contained in the passage. In the preferred embodiment, the words themselves are used for this, each word and its frequency representing a component of the vector. However, the method is in no way limited to this representation of the vectors, but is equally suitable for other representations. For example, TF/IDF or methods based on phonetic similarity or other properties, such as the base form of a word or its phonetic representation, can also be used. Likewise, words can be joined into compounds, or compounds can be broken down into their components.

The result of this process is an assignment of passages from the two input texts. In a next step, this assignment is used for an assignment at word level. Starting from the last assigned pair of text passages (which correspond to the end of the two input files), the following procedure is used:

- If the passage from MT corresponds to that from AT, then each word from MT is assigned the time stamp of the corresponding word from AT;

- if the passage from AT is an "insertion" passage, i.e. if it corresponds to text recognized by the automatic speech recognizer, to which no passage in MT corresponds, this text is rejected;

- If the passage from MT is a "deletion" passage, i.e. if no passage in AT corresponds to it, then the words from MT are inserted before the last time stamp according to the nature of the words (e.g. their length in phonemes). For this purpose, time stamps are artificially generated that lie in the corresponding time interval.

- Words that lie in the overlap of two neighboring passages are treated separately. The time stamp of the passage that has the better comparison value (i.e. the smaller distance) is used. Exactly matching passages (with a distance of 0) are always given preference.

The above steps are applied until the first passage in each of the two input files is reached. At this point, a time stamp has been assigned to all words in all passages of the MT. Each time stamp is then output together with its word and can be used directly and efficiently to search for the word or to locate it in the media file.
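Two of the steps above, generating artificial time stamps for a "deletion" passage and choosing a time stamp in an overlap, can be sketched in Python. Both function names are hypothetical, and the use of the character count as a stand-in for the length in phonemes is an assumption of this sketch.

```python
def interpolate_timestamps(words, interval_start, interval_end):
    """Spread `words` over [interval_start, interval_end] (in 1/100 s).

    Each word receives a share of the interval proportional to its length in
    characters, a crude stand-in for its length in phonemes (an assumption).
    Returns (word, start, end) triples.
    """
    total = sum(len(word) for word in words)
    if total == 0 or interval_end < interval_start:
        return []
    span = interval_end - interval_start
    stamped, used = [], 0
    for word in words:
        start = interval_start + round(span * used / total)
        used += len(word)
        end = interval_start + round(span * used / total)
        stamped.append((word, start, end))
    return stamped


def pick_overlap_timestamp(candidates):
    """Choose among (distance, (start, end)) pairs the one with the smallest distance."""
    return min(candidates, key=lambda candidate: candidate[0])[1]


stamped = interpolate_timestamps(["in", "the"], 748, 798)
chosen = pick_overlap_timestamp([(0.3, (10, 20)), (0.0, (12, 22))])
```

A passage with distance 0 always wins in `pick_overlap_timestamp`, which reflects the preference for exactly matching passages stated above.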

We now want to consider these steps using a concrete example.

The following example should serve as a manually transcribed text:

"The automatic speech recognizer produces an automatic transcript which corresponds to the words occurring in the input data"

This text corresponds to the version (MT) created by a human transcriptionist.

The same text, as output by the speech recognizer (AT), could read, for example:

<time start="00000011" end="00000045">louder</time>
<time start="00000046" end="00000084">please</time>
<time start="00000085" end="00000101">one</time>
<time start="00000102" end="00000212">automatic</time>
<time start="00000213" end="00000253">language</time>
<time start="00000254" end="00000281">recognize</time>
<time start="00000282" end="00000325">produces</time>
<time start="00000326" end="00000370">thereby</time>
<time start="00000371" end="00000387">a</time>
<time start="00000388" end="00000410">[AHM]</time>
<time start="00000411" end="00000458">Transbipt</time>
<time start="00000459" end="00000607">which</time>
<time start="00000608" end="0000747">the</time>
<time start="00000748" end="00000772">in</time>
<time start="00000773" end="00000797">the</time>
<time start="00000798" end="00000825">Entry</time>
<time start="00000826" end="00000925">Files</time>
<time start="00000926" end="00000962">before</time>
<time start="00000963" end="00001053">coming</time>
<time start="00001054" end="00001074">den</time>
<time start="00001075" end="00001096">words</time>
<time start="00001097" end="00001160">end</time>
<time start="00001831" end="00001995">speaks</time>

The respective time stamp is given with each word, i.e. the time relative to the start of the audio input data (in 1/100 s) at which the speech recognizer recognized the respective word. In the example above, one hesitation was recognized in the audio stream ([AHM]). The speaker may actually have hesitated at this position before continuing the sentence. The first two words of the passage could, for example, come from an interjection that the speech recognizer transcribed but that the human transcriptionist did not take into account.
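For further processing, recognizer output of this kind has to be read into word and time stamp pairs. The following sketch assumes well-formed `<time start="..." end="...">word</time>` elements (optionally with spaces around `=` and around the word); the names `TIME_ELEMENT` and `parse_timed_words` are hypothetical.

```python
import re

# Matches one <time> element; whitespace around "=" and the word is tolerated.
TIME_ELEMENT = re.compile(
    r'<time\s+start\s*=\s*"(\d+)"\s+end\s*=\s*"(\d+)"\s*>\s*(.*?)\s*</time>'
)


def parse_timed_words(markup):
    """Return (word, start, end) triples; times are in 1/100 s from file start."""
    return [
        (word, int(start), int(end))
        for start, end, word in TIME_ELEMENT.findall(markup)
    ]


sample = (
    '<time start="00000011" end="00000045">louder</time>'
    '<time start = "00000046" end = "00000084"> please </time>'
    '<time start="00000388" end="00000410">[AHM]</time>'
)
timed_words = parse_timed_words(sample)
```

Hesitation markers such as [AHM] are kept as ordinary entries here; whether they are filtered out before the comparison is left open in this sketch.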

The comparison of these two texts (MT and AT) is now carried out using overlapping windows. In the example at hand, the windows have a length of 4 (words) and are shifted onward by 2 words. All windows from AT are compared with all windows from MT. The matrix defined by the text passages is filled in step by step, from left to right (progressing in time) and from top to bottom. Each matrix element corresponds to the best (least expensive) path up to it. These steps continue until all passages have been compared with each other and thus all elements of the matrix have been assigned a value. Then, starting from the least expensive element in the last column, the path leading to this element is traced back (back-tracking), which yields an unambiguous sequence of actions and assignments between passages.

FIG. 6 shows the first steps of the result of the dynamic comparison. The first window in AT was classified as an "insertion", i.e. as recognized but not corresponding to the manually transcribed text. The second window in AT was assigned to the first window from MT, which results in the associated time stamps being transferred to the result. The other windows were assigned to each other according to the rules described in this procedure.

The result of the method is an assignment of the words from MT to the time stamps of the words from AT according to the windows assigned to each other (see figure above):

<time start="00000102" end="00000212">automatic</time>
<time start="00000213" end="00000253">Speech Recognizer</time>
<time start="00000254" end="00000281">produces</time>
<time start="00000282" end="00000325">doing it</time>
<time start="00000371" end="00000387">automatic</time>
<time start="00000388" end="00000410">transcript</time>
<time start="00000411" end="00000458">which</time>
<time start="00000963" end="00001053">input data</time>
<time start="00001054" end="00001074">occurring</time>
<time start="00001097" end="00001160">words</time>
<time start="00001831" end="00001995">meets</time>

Claims

1. A method for the automatic matching of audio segments contained in an audio recording with text elements in a transcript (MT) generated manually from the audio recording, an automatic transcript (AT) being initially created from the audio recording, preferably by an automatic speech recognizer, which transcript contains the audio segments converted into text elements together with a time reference indicating at which point in the audio recording the respective automatically created text element is located, characterized in that the method comprises the following further steps:

dividing the automatic transcript (AT) and the manual transcript (MT) into passages ($at_i$, $mt_j$) of defined but not necessarily equal length, each comprising several text elements,

shifting each passage by a specified number of text elements in the automatic transcript (AT) and in the manual transcript (MT) over the entire automatic and manual transcript, each passage overlapping the previous passage, and determining a specific passage property for each passage,

comparing the passage property of each passage ($at_i$) in the automatic transcript (AT) with each passage ($mt_j$) in the manual transcript (MT),

assigning a respective passage in the automatic transcript (AT) to that passage in the manual transcript (MT) such that the optimal path of assignments results from the sum of the passage comparisons.

2. The method according to claim 1, characterized in that a text element comprises one or more words or components of words or stem words.

3. The method according to claim 1 or 2, characterized in that the passage property is the frequency of the occurrence of the text elements contained in the passage, or units similar in sound.

4. The method according to claim 1 or 2, characterized in that the passage property is the term frequency / inverse document frequency (TF-IDF).

5. The method according to any one of the preceding claims, characterized in that stop word lists are used to exclude certain words when determining the passage property.

6. The method according to any one of the preceding claims, characterized in that the passage property is represented by a vector, preferably a normalized vector of unit length, and the comparison of the passage properties of two passages is preferably carried out on the basis of the angle formed by the vectors, or the distance between the end points of the vectors, or the number of differing dimensions of the vectors, or a function of the above measures.

7. The method according to any one of the preceding claims, characterized in that the length of the passages and / or the width of their displacement are varied over several runs of the method, and in each run the passage property is determined and compared and a respective passage in the automatic transcript (AT) is assigned to that passage in the manual transcript (MT) such that the optimal path of assignments results from the sum of the passage comparisons, the final assignment selected being the one that achieves the best overall rating over all passages.

8. The method according to any one of the preceding claims, characterized in that the assignment is made by means of dynamic programming.