US20090048833A1 - Automated Extraction of Semantic Content and Generation of a Structured Document from Speech - Google Patents
- Publication number
- US20090048833A1 (U.S. application Ser. No. 12/253,241)
- Authority
- US
- United States
- Prior art keywords
- document
- language model
- sub
- audio stream
- candidate
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1815—Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H15/00—ICT specially adapted for medical reports, e.g. generation or transmission thereof
Definitions
- the present invention relates to automatic speech recognition and, more particularly, to techniques for automatically transcribing speech.
- transcripts in these and other fields typically need to be highly accurate (as measured in terms of the degree of correspondence between the semantic content (meaning) of the original speech and the semantic content of the resulting transcript) because of the reliance placed on the resulting transcripts and the harm that could result from an inaccuracy (such as providing an incorrect prescription drug to a patient).
- High degrees of reliability may, however, be difficult to obtain consistently for a variety of reasons, such as variations in: (1) features of the speakers whose speech is transcribed (e.g., accent, volume, dialect, speed); (2) external conditions (e.g., background noise); (3) the transcriptionist or transcription system (e.g., imperfect hearing or audio capture capabilities, imperfect understanding of language); or (4) the recording/transmission medium (e.g., paper, analog audio tape, analog telephone network, compression algorithms applied in digital telephone networks, and noises/artifacts due to cell phone channels).
- transcription was performed solely by human transcriptionists who would listen to speech, either in real-time (i.e., in person by “taking dictation”) or by listening to a recording.
- human transcriptionists may have domain-specific knowledge, such as knowledge of medicine and medical terminology, which enables them to interpret ambiguities in speech and thereby to improve transcript accuracy.
- Human transcriptionists have a variety of disadvantages.
- human transcriptionists produce transcripts relatively slowly and are subject to decreasing accuracy over time as a result of fatigue.
- Automated dictation systems typically attempt to produce a word-for-word transcript of speech. Such a transcript, in which there is a one-to-one mapping between words in the spoken audio stream and words in the transcript, is referred to herein as a “verbatim transcript.” Automated dictation systems are not perfect and may therefore fail to produce perfect verbatim transcripts.
- transcriptionists may intentionally introduce a variety of changes into the written transcription.
- a transcriptionist may, for example, filter out spontaneous speech effects (e.g., pause fillers, hesitations, and false starts), discard irrelevant remarks and comments, convert data into a standard format, insert headings or other explanatory materials, or change the sequence of the speech to fit the structure of a written report.
- the report 111 includes a variety of sections 112 - 138 which appear in a predetermined sequence when the report 111 is displayed.
- the report includes a header section 112 , a subjective section 122 , an objective section 134 , an assessment section 136 , and a plan section 138 .
- Sections may include text as well as sub-sections.
- the header section 112 includes a hospital name section 120 (containing the text “General Hospital”), a patient name section 114 (containing the text “Jane Doe”), a chart number section 116 (containing the text “851D”), and a report date section 118 (containing text “Oct. 1, 1993”).
- the subjective section 122 includes various subjective information about the patient, contained both in text and in a medical history section 124 , a medications section 126 , an allergies section 128 , a family history section 130 , and a social history section 132 .
- the objective section 134 includes various objective information about the patient, such as her weight and blood pressure.
- the objective section 134 may include sub-sections containing the illustrated information.
- the assessment section 136 includes a textual assessment of the patient's condition
- the plan section 138 includes a textual description of a plan of treatment.
- information may appear in a different form in the report 111 from the form in which such information was spoken by the dictating doctor.
- the date in the report date section 118 may have been spoken as “october first nineteen ninety three,” “the first of october ninety three,” or in some other form.
- the transcriptionist, however, transcribed such speech using the text “Oct. 1, 1993” in the report date section 118 , perhaps because the hospital specified in the hospital section 120 requires that dates in written reports be expressed in such a format.
- information in the medical report 111 may not appear in the same sequence as in the original audio recording, due to the need to conform to a required report format or for some other reason.
- the dictating physician may have dictated the objective section 134 first, followed by the subjective section 122 , and then by the header section 112 .
- the written report 111 , however, contains the header section 112 first, followed by the subjective section 122 , and then the objective section 134 .
- Such a report structure may, for example, be required for medical reports in the hospital specified in the hospital section 120 .
- the beginning of the report 111 may have been generated based on a spoken audio stream such as the following: “this is doctor smith on uh the first of october um nineteen ninety three patient ID eighty five one d um next is the patient's family history which i have reviewed . . . ” It should be apparent that a verbatim transcript of this speech would be difficult to understand and would not be particularly useful.
- the written report 111 organizes the original speech into the predefined sections 112 - 140 by re-ordering the speech. As these examples illustrate, the written report 111 is not a verbatim transcript of the dictating physician's speech.
- a report such as the report 111 may be more desirable than a verbatim transcript for a variety of reasons (e.g., because it organizes information in a way that facilitates understanding). It would, therefore, be desirable for an automatic transcription system to be capable of generating a structured report (rather than a verbatim transcript) based on unstructured speech.
- FIG. 1A a dataflow diagram is shown of a prior art system 100 for generating a structured document 110 based on a spoken audio stream 102 .
- Such a system produces the structured textual document 110 from the spoken audio stream 102 using a two-step process: (1) an automatic speech recognizer 104 generates a verbatim transcript 106 based on the spoken audio stream 102 ; and (2) a natural language processor 108 identifies structure in the transcript 106 and thereby creates the structured document 110 , which has the same content as the transcript 106 , but which is organized into the structure (e.g., report format) identified by the natural language processor 108 .
- some existing systems attempt to generate structured textual documents by: (1) analyzing the spoken audio stream 102 to identify and distinguish spoken content in the audio stream 102 from explicit or implicit structural hints in the audio stream 102 ; (2) converting the “content” portions of the spoken audio stream 102 into raw text; and (3) using the identified structural hints to convert the raw text into the structured report 110 .
- explicit structural hints include formatting commands (e.g., “new paragraph,” “new line,” “next item”) and paragraph identifiers (e.g., “findings,” “impression,” “conclusion”).
- Examples of implicit structural hints include long pauses that may denote paragraph boundaries, prosodic cues that indicate ends of enumerations, and the spoken content itself.
- the structured document 110 produced by the system 100 may be sub-optimal.
- the structured document 110 may contain incorrectly transcribed (i.e., misrecognized) words, the structure of the structured document 110 may fail to reflect the desired document structure, and content from the spoken audio stream 102 may be inserted into the wrong sub-structures (e.g., sections, paragraphs, or sentences) in the structured document.
- it may be desirable to extract semantic content, such as information about medications, allergies, or previous illnesses of the patient described in the audio stream 102 , from the spoken audio stream 102 . Such semantic content may be useful for generating the structured document 110 , but it may also be useful for other purposes, such as populating a database of patient information that can be analyzed independently of the document 110 .
- Prior art systems such as the system 100 shown in FIG. 1A , however, typically are designed to generate the structured document 110 based primarily or solely on syntactic information in the spoken audio stream 102 . Such systems, therefore, are not useful for extracting semantic content.
- Techniques are disclosed for automatically generating structured documents based on speech, including identification of relevant concepts and their interpretation.
- a structured document generator uses an integrated process to generate a structured textual document (such as a structured textual medical report) based on a spoken audio stream.
- the spoken audio stream may be recognized using a language model which includes a plurality of sub-models arranged in a hierarchical structure.
- Each of the sub-models may correspond to a concept that is expected to appear in the spoken audio stream.
- sub-models may correspond to document sections.
- Sub-models may, for example, be n-gram language models or context-free grammars.
- the resulting structured textual document may have a hierarchical structure that corresponds to the hierarchical structure of the language sub-models that were used to generate the structured textual document.
- a method which includes steps of: (A) identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of sub-structures of a document; and (B) using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into the plurality of sub-structures, wherein the content in each of the plurality of sub-structures is produced by recognizing speech using the probabilistic language model associated with the sub-structure.
- Another aspect of the present invention is directed to the probabilistic language model identified in step (A).
- a data structure which includes: a plurality of language models logically organized in a hierarchy, the plurality of language models including a first language model and a second language model; wherein the first language model is a parent of the second language model in the hierarchy; wherein the first language model is suitable for recognizing speech representing a first concept associated with a substructure of a document; and wherein the second language model is suitable for recognizing speech representing a second concept associated with a subset of the substructure of the document.
- a method which includes steps of: (A) identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of concepts logically organized in a first hierarchy; (B) using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into a plurality of sub-structures logically organized in a second hierarchy having a logical structure defined by a path through the first hierarchy.
- FIG. 1A is a dataflow diagram of a prior art system for generating a structured document based on a spoken audio stream;
- FIG. 1B illustrates a textual medical report generated based on a spoken report;
- FIG. 2 is a flowchart of a method that is performed in one embodiment of the present invention to generate a structured textual document based on a spoken document;
- FIG. 3 is a dataflow diagram of a system that performs the method of FIG. 2 in one embodiment of the present invention;
- FIG. 4 illustrates an example of a spoken audio stream in one embodiment of the present invention;
- FIG. 5 illustrates a structured textual document according to one embodiment of the present invention;
- FIG. 6 is an example of a rendered document that is rendered based on the structured textual document of FIG. 5 according to one embodiment of the present invention;
- FIG. 7 is a flowchart of a method that is performed by the structured document generator of FIG. 3 in one embodiment of the present invention to generate a structured textual document;
- FIG. 8 is a dataflow diagram illustrating a portion of the system of FIG. 3 in detail relevant to the method of FIG. 7 according to one embodiment of the present invention;
- FIG. 9 is a diagram illustrating mappings between language models, document sub-structures corresponding to the language models, and candidate contents produced using the language models according to one embodiment of the present invention;
- FIG. 10A is a diagram illustrating a hierarchical language model according to one embodiment of the present invention;
- FIG. 10B is a diagram illustrating a path through the hierarchical language model of FIG. 10A according to one embodiment of the present invention;
- FIG. 10C is a diagram illustrating a hierarchical language model according to another embodiment of the present invention;
- FIG. 11A is a flowchart of a method that is performed by the structured document generator of FIG. 3 to generate a structured textual document according to one embodiment of the present invention;
- FIG. 11B is a flowchart of a method which uses an integrated process to select a path through a hierarchical language model and to generate a structured textual document based on speech according to one embodiment of the present invention;
- FIGS. 11C-11D are flowcharts of methods that are performed in one embodiment of the present invention to calculate a fitness score for a candidate document;
- FIG. 12A is a dataflow diagram illustrating a portion of the system of FIG. 3 in detail relevant to the method of FIG. 11A according to one embodiment of the present invention;
- FIG. 12B is a dataflow diagram illustrating an embodiment of the structured document generator of FIG. 3 which performs the method of FIG. 11B in one embodiment of the present invention;
- FIG. 13 is a flowchart of a method that is used in one embodiment of the present invention to generate a hierarchical language model for use in generating structured textual documents;
- FIG. 14 is a flowchart of a method that is used in one embodiment of the present invention to generate a structured textual document using distinct speech recognition and structural parsing steps; and
- FIG. 15 is a dataflow diagram of a system that performs the method of FIG. 14 according to one embodiment of the present invention.
- Referring to FIG. 2 , a flowchart is shown of a method 200 that is performed in one embodiment of the present invention to generate a structured textual document based on a spoken document.
- Referring to FIG. 3 , a dataflow diagram is shown of a system 300 for performing the method 200 of FIG. 2 according to one embodiment of the present invention.
- the system 300 includes a spoken audio stream 302 , which may, for example, be a live or recorded spoken audio stream of a medical report dictated by a doctor.
- a textual representation of an example of the spoken audio stream 302 is shown.
- text between percentage signs represents spoken punctuation (e.g., “% comma %”, “% period %”, and “% colon %”) and explicit structural cues (e.g., “% new-paragraph %”) in the audio stream 302 .
- the system 300 also includes a probabilistic language model 304 .
- the term “probabilistic language model” as used herein refers to any language model which assigns probabilities to sequences of spoken words. (Probabilistic) context-free grammars and n-gram language models 306 a - e are both examples of “probabilistic language models” as that term is used herein.
- a context-free grammar specifies a plurality of spoken forms for a concept and associates probabilities with each of the spoken forms.
- a finite state grammar is an example of a context-free grammar.
- a finite state grammar for the date Oct. 1, 1993 might include the spoken form “october first nineteen ninety three” with a probability of 0.7, the spoken form “ten one ninety three” with a probability of 0.2, and the spoken form “first october ninety three” with a probability of 0.1.
- the probability associated with each spoken form is an estimated probability that the concept will be spoken in that spoken form in a particular audio stream.
- a finite state grammar therefore, is one kind of probabilistic language model.
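By way of illustration, such a grammar can be thought of as a table of spoken forms and estimated probabilities. The following sketch is illustrative only; the class and names are invented for this example and are not from the patent:

```python
# A minimal sketch of a probabilistic "grammar" for a single concept, reduced
# to an explicit table of spoken forms and their estimated probabilities.
class SpokenFormGrammar:
    def __init__(self, concept, spoken_forms):
        self.concept = concept            # e.g., the date (1993, 10, 1)
        self.spoken_forms = spoken_forms  # maps spoken form -> probability

    def probability(self, words):
        # Probability that the concept is spoken as this word sequence;
        # zero if the sequence is not a listed spoken form.
        return self.spoken_forms.get(" ".join(words), 0.0)

date_grammar = SpokenFormGrammar(
    concept=(1993, 10, 1),
    spoken_forms={
        "october first nineteen ninety three": 0.7,
        "ten one ninety three": 0.2,
        "first october ninety three": 0.1,
    },
)

print(date_grammar.probability("ten one ninety three".split()))  # 0.2
```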
- an n-gram language model specifies the probability that a particular sequence of n words will occur in a spoken audio stream.
- a “unigram” language model specifies, for each word, the probability that the word will occur in a spoken document.
- a “bigram” language model specifies probabilities that pairs of words will occur in a spoken document.
- a bigram model may specify the conditional probability that the word “cat” will occur in a spoken document given that the previous word in the document was “the”.
- a “trigram” language model specifies probabilities of three-word sequences, and so on.
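As an illustration of what a bigram model specifies, the sketch below estimates the conditional probability P(word | previous word) by maximum likelihood from raw counts; this is a textbook estimate offered for exposition, not the patent's training procedure:

```python
from collections import Counter

def bigram_probability(corpus_words, prev_word, word):
    # Maximum-likelihood estimate of P(word | prev_word) from a word list.
    bigrams = Counter(zip(corpus_words, corpus_words[1:]))
    contexts = Counter(corpus_words[:-1])
    if contexts[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / contexts[prev_word]

corpus = "the cat sat on the mat the cat slept".split()
# "the" occurs three times as a context and is followed by "cat" twice:
print(bigram_probability(corpus, "the", "cat"))  # 0.666...
```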
- the probabilities specified by n-gram language models and finite state grammars may be obtained by training such language models using training speech and training text, as described in more detail in the above-referenced patent application entitled, “Document Transcription System Training.”
- the probabilistic language model 304 includes a plurality of sub-models 306 a - e , each of which is a probabilistic language model.
- the sub-models 306 a - e may include n-gram language models and/or finite state grammars in any combination.
- each of the sub-models 306 a - e may contain further sub-models, and so on. Although five sub-models are shown in FIG. 3 , the probabilistic language model 304 may include any number of sub-models.
- the purpose of the system 300 shown in FIG. 3 is to produce a structured textual document 310 which includes content from the spoken audio stream 302 , in which the content is organized into a particular structure, and where concepts are identified and interpreted in a machine-readable form.
- the structured textual document 310 includes a plurality of sub-structures 312 a - f , such as sections, paragraphs, and/or sentences. Each of the sub-structures 312 a - f may include further sub-structures, and so on. Although six sub-structures are shown in FIG. 3 , the structured textual document 310 may include any number of sub-structures.
- the structured textual document 310 is an XML document.
- the structured textual document 310 may, however, be implemented in any form.
- the structured document 310 includes six sub-structures 312 a - f , each of which may represent a section of the document 310 .
- the structured document 310 includes a header section 312 a which includes meta-data about the document 310 , such as a title 314 of the document 310 (“CT scan of the chest without contrast”) and the date 316 on which the document 310 was dictated (“<date>22-APR-2003</date>”).
- the content in the header section 312 a was obtained from the beginning of the spoken audio stream 302 ( FIG. 4 ).
- the header section 312 a includes both flat text (i.e., the title 314 ) and a sub-structure (e.g., the date 316 ) representing a concept that has been interpreted in a machine-readable form as a triplet of values (day-month-year).
- Representing the date in a machine-readable form enables the date to be stored easily in a database and to be processed more easily than if the date were stored in a textual form. For example, if multiple dates in the audio stream 302 have been recognized and stored in machine-readable form, such dates may easily be compared to each other by a computer. As another example, statistical information about the content of the audio stream 302 , such as the average time between doctor's visits, may easily be generated if dates are stored in computer-readable form. This advantage of embodiments of the present invention applies generally not only to dates but to the recognition of any kind of semantic content and the storage of such content in machine-readable form.
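A small sketch of this advantage, using invented dates: once dates are stored as structured values rather than free text, comparisons and statistics such as the average time between visits reduce to ordinary arithmetic:

```python
from datetime import date

# Dates recognized into machine-readable form can be compared and aggregated
# directly, unlike free text such as "october first nineteen ninety three".
# The visit dates below are invented for illustration.
visits = sorted([date(2001, 4, 6), date(2002, 3, 26), date(2003, 4, 22)])

gaps_in_days = [(later - earlier).days
                for earlier, later in zip(visits, visits[1:])]
print(sum(gaps_in_days) / len(gaps_in_days))  # average days between visits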
- the structured document 310 further includes a comparison section 312 b , which includes content describing prior studies performed on the same patient as the patient who is the subject of the document (report) 310 .
- the content in the comparison section 312 b was obtained from the portion of the audio stream 302 beginning with “comparison to” and ending with “april six two thousand one”, but the comparison section 312 b does not include the text “comparison to,” which is an example of a section cue. The use of such cues to identify the beginning of a section or other document sub-structure will be described in more detail below.
- the structured document 310 also includes a technique section 312 c , which describes techniques that were performed in the procedures performed on the patient; a findings section 312 d , which describes the doctor's findings; and an impression section 312 e , which describes the doctor's impressions of the patient.
- XML documents, such as the example structured document 310 illustrated in FIG. 5 , typically are not intended for direct viewing by an end user. Rather, such documents typically are rendered in a form that is more easily readable before being presented to the end user.
- the system 300 includes a rendering engine 314 which renders the structured textual document 310 based on a stylesheet 316 to produce a rendered document 318 .
- Techniques for generating stylesheets and for rendering documents in accordance with stylesheets are well-known to those having ordinary skill in the art.
- the rendered document 318 includes five sections 602 a - e , each of which may correspond to one or more of the six sub-structures 312 a - f in the structured textual document 310 . More specifically, the rendered document 318 includes a header section 602 a , a comparison section 602 b , a technique section 602 c , a findings section 602 d , and an impression section 602 e . Note that there may or may not be a one-to-one mapping between sections in the rendered document 318 and sub-structures in the structured textual document 310 .
- each of the sub-structures 312 a - f need not represent a distinct type of document section. If, for example, two or more of the sub-structures 312 a - f represent the same type of section (such as a header section), the rendering engine 314 may render both of the sub-structures in the same section of the rendered document 318 .
- the system 300 includes a structured document generator 308 , which identifies the probabilistic language model 304 (step 202 ), and uses the language model 304 to recognize the spoken audio stream 302 and thereby to produce the structured textual document 310 (step 204 ).
- the structured document generator 308 may, for example, include an automatic speech recognition decoder 320 which produces each of the sub-structures 312 a - f in the structured textual document 310 using a corresponding one of the sub-models 306 a - e in the probabilistic language model 304 .
- a decoder is a component of a speech recognizer which converts audio into text.
- the decoder 320 may, for example, produce sub-structure 312 a by using sub-model 306 a to recognize a first portion of the spoken audio stream 302 . Similarly, the decoder 320 may produce sub-structure 312 b by using sub-model 306 b to recognize a second portion of the spoken audio stream 302 .
- the speech recognition decoder may use the sub-model 306 a to recognize a first portion of the spoken audio stream 302 and thereby produce sub-structure 312 a , and use the same sub-model 306 a to recognize a second portion of the spoken audio stream 302 and thereby produce sub-structure 312 b .
- multiple sub-structures in the structured textual document 310 may contain content for a single semantic structure (e.g., section or paragraph).
- Sub-model 306 a may, for example, be a “header” language model which is used to recognize portions of the spoken audio stream 302 containing content in the header section 312 a ; sub-model 306 b may, for example, be a “comparison” language model which is used to recognize portions of the spoken audio stream 302 containing content in the comparison section 312 b ; and so on.
- Each such language model may be trained using training text from the corresponding section of training documents.
- the header sub-model 306 a may be trained using text from the header sections of a plurality of training documents, and the comparison sub-model may be trained using text from the comparison sections of the plurality of training documents.
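The sketch below illustrates one plausible form of such per-section training; the document representation and names are assumptions for exposition, not the patent's implementation. Each section's sub-model sees only text drawn from the corresponding section of the training documents:

```python
from collections import Counter, defaultdict

# Invented miniature training corpus: each document maps section name -> text.
training_documents = [
    {"header": "ct scan of the chest without contrast",
     "comparison": "comparison to prior studies"},
    {"header": "ct scan of the abdomen",
     "comparison": "comparison to prior exam from last year"},
]

# Accumulate bigram counts separately per section type.
section_bigram_counts = defaultdict(Counter)
for document in training_documents:
    for section, text in document.items():
        words = text.split()
        section_bigram_counts[section].update(zip(words, words[1:]))

# The "header" sub-model now reflects header language only:
print(section_bigram_counts["header"].most_common(3))
```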
- Referring to FIG. 7 , a flowchart is shown of a method that is performed by the structured document generator 308 in one embodiment of the present invention to generate the structured textual document 310 ( FIG. 2 , step 204 ).
- Referring to FIG. 8 , a dataflow diagram is shown illustrating a portion of the system 300 in detail relevant to the method of FIG. 7 .
- the structured document generator 308 includes a segment identifier 814 which identifies a plurality of segments 802 a - c in the spoken audio stream 302 (step 701 ).
- the segments 802 a - c may, for example, represent concepts such as sections, paragraphs, sentences, words, dates, times, or codes. Although only three segments 802 a - c are shown in FIG. 8 , the spoken audio stream 302 may include any number of segments. Although, for ease of explanation, all of the segments 802 a - c are identified in step 701 of FIG. 7 before the subsequent steps are performed, the identification of the segments 802 a - c may be performed concurrently with recognizing the audio stream 302 and generating the structured document 310 , as will be described in more detail below with respect to FIGS. 11B and 12B .
- the structured document generator 308 enters a loop over each segment S in the spoken audio stream 302 (step 702 ).
- the structured document generator 308 includes speech recognition decoder 320 , which may, for example, include one or more conventional speech recognition decoders for recognizing speech using different kinds of language models.
- each of the sub-models 306 a - e may be an n-gram language model, a context-free grammar, or a combination of both.
- assume for purposes of example that the structured document generator 308 is currently processing segment 802 a of the spoken audio stream 302 .
- the structured document generator 308 selects a plurality 804 of the sub-models 306 a - e with which to recognize the current segment S.
- the sub-models 804 may, for example, be all of the language sub-models 306 a - e or a subset of the sub-models 306 a - e .
- the speech recognition decoder 320 recognizes the current segment S (e.g., segment 802 a ) with each of the selected sub-models 804 , thereby producing a plurality of candidate contents 808 corresponding to segment S (step 704 ).
- each of the candidate contents 808 is produced by using the speech recognition decoder 320 to recognize the current segment S using a distinct one of the sub-models 804 .
- each of the candidate contents 808 may include not only recognized text but also other kinds of content, such as concepts (e.g., dates, times, codes, medications, allergies, vitals, etc.) encoded in machine-readable form.
- the structured document generator 308 includes a final content selector 810 which selects one of the candidate contents 808 as a final content 812 for segment S (step 706 ).
- the final content selector 810 may use any of a variety of techniques that are well-known to those of ordinary skill in the art for selecting speech recognition output that most closely matches speech from which the output was derived.
- the structured document generator 308 keeps track of the sub-model that is used to produce each of the candidate contents 808 .
- if, as in this example, the selected sub-models 804 include all of the sub-models 306 a - e , the candidate contents 808 include five candidate contents per segment 802 a - c (one produced using each of the sub-models 306 a - e ).
- Referring to FIG. 9 , a diagram is shown illustrating mappings between the document sub-structures 312 a - f , the sub-models 306 a - e , and candidate contents 808 a - e .
- each of the sub-models 306 a - e may be associated with one or more corresponding sub-structures 312 a - f in the structured textual document 310 .
- FIG. 9 illustrates mappings 902 a - e between the sub-structures 312 a - e and the sub-models 306 a - e .
- the structured document generator 308 may maintain such mappings 902 a - e in a table or using other means.
- candidate content 808 a is the text that is produced when speech recognition decoder 320 recognizes segment 802 a with sub-model 306 a
- candidate content 808 b is the text that is produced when speech recognition decoder 320 recognizes segment 802 a with sub-model 306 b
- the structured document generator 308 may record the mapping between candidate contents 808 a - e and corresponding sub-models 306 a - e in a set of candidate model-content mappings 816 .
- a final mapping identifier 818 may use the mappings 816 and the selected final content 812 to identify the language sub-model that produced the candidate content that has been selected as the final content 812 (step 708 ). For example, if candidate content 808 c is selected as the final content 812 , it may be seen from FIG. 9 that the final mapping identifier 818 may identify the sub-model 306 c as the sub-model that produced candidate content 808 c .
- the final mapping identifier 818 may accumulate each identified sub-model in the set of mappings 820 , so that at any given time the mappings 820 identify the sequence of language sub-models that were used to generate the final contents that have been selected for inclusion in the structured textual document 310 .
- the structured document generator 308 may identify the document sub-structure associated with the identified sub-model (step 710 ). For example, if the sub-model 306 c has been identified in step 708 , it may be seen from FIG. 9 that document sub-structure 312 c is associated with sub-model 306 c.
- a structured content inserter 822 inserts the final content 812 into the identified sub-structure of the structured text document 310 (step 712 ). For example, if the sub-structure 312 c is identified in step 710 , the content inserter 822 inserts the final content 812 into sub-structure 312 c.
- the structured document generator repeats steps 704 - 712 for the remaining segments 802 b - c of the spoken audio stream 302 (step 714 ), thereby generating final content 812 for each of the remaining segments 802 b - c and inserting the final content 812 into the appropriate ones of the sub-structures 312 a - f of the textual document 310 .
- upon completion of the method 700 , the structured textual document 310 includes text corresponding to the spoken audio stream 302 , and the final model-content mappings 820 identify the sequence of language sub-models that were used by the speech recognition decoder 320 to generate the text in the structured textual document 310 .
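The loop of steps 702 - 714 can be summarized in the following sketch, in which recognize and score stand in for the speech recognition decoder 320 and the final content selector 810 ; all names are illustrative assumptions, not the patent's implementation:

```python
# Sketch of the FIG. 7 loop: recognize each segment with every sub-model,
# keep the best-scoring candidate, and insert it into the sub-structure
# mapped to the winning sub-model.
def generate_structured_document(segments, sub_models, model_to_section,
                                 recognize, score):
    document = {section: [] for section in model_to_section.values()}
    for segment in segments:
        # One candidate content per sub-model (step 704).
        candidates = {name: recognize(segment, model)
                      for name, model in sub_models.items()}
        # Select the best-fitting candidate as the final content (step 706).
        best = max(candidates, key=lambda name: score(candidates[name], segment))
        # Identify the sub-structure associated with the winning sub-model
        # and insert the final content into it (steps 708-712).
        document[model_to_section[best]].append(candidates[best])
    return document
```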
- the method 700 may not only generate text corresponding to the spoken audio, but may also identify semantic information represented by the audio and store such semantic information in a machine-readable form.
- the comparison section 312 b includes a date element in which a particular date is represented as a triplet containing individual values for the day (“06”), month (“APR”), and year (“2001”).
- Other examples for semantic concepts in the medical domain include vital signs, medications and their dosages, allergies, medical codes, etc. Extracting and representing semantic information in this way facilitates the process of performing automated processing on such information.
- FIG. 5 is merely an example and does not constitute a limitation of the present invention.
- recall from step 701 that the method 700 shown in FIG. 7 identifies the set of segments 802 a - c before identifying the sub-models to be used to recognize the segments 802 a - c .
- the structured document generator 308 may integrate the process of identifying the segments 802 a - c with the process of identifying the sub-models to be used to recognize the segments 802 a - c , and with the process of performing speech recognition on the segments 802 a - c . Examples of techniques that may be used to perform such integrated segmentation and recognition will be described in more detail below with respect to FIGS. 11B and 12B .
- the first portion of the spoken audio stream 302 is the spoken stream of utterances: “CT scan of the chest without contrast april twenty second two thousand three”. This portion may be selected in step 702 and recognized using all of the language sub-models 306 a - e in step 704 to produce a plurality of candidate contents 808 a - e .
- sub-model 306 a is a “header” language model
- sub-model 306 b is a “comparison” language model
- sub-model 306 c is a “technique” language model
- sub-model 306 d is a “findings” language model
- sub-model 306 e is an “impression” language model.
- because sub-model 306 a is a language model which has been trained to recognize speech in the “header” section of the document 310 (e.g., sub-structure 312 a ), it is likely that the candidate content 808 a produced using sub-model 306 a will match the words in the above-referenced audio portion more closely than the other candidate contents 808 b - e . Assuming that the candidate content 808 a is selected as the final content 812 for this audio portion, the content inserter 822 will insert the final content 812 produced by sub-model 306 a into the header section 312 a of the structured text document 310 .
- the second portion of the spoken audio stream is the spoken stream of utterances: “comparison to prior studies from march twenty six two thousand two and april six two thousand one”.
- This portion may be selected in step 702 and recognized using all of the language sub-models 306 a - e in step 704 to produce a plurality of candidate contents 808 a - e .
- because sub-model 306 b is a language model which has been trained to recognize speech in the “comparison” section of the document 310 (e.g., sub-structure 312 b ), it is likely that the candidate content 808 b produced using sub-model 306 b will match the words in the above-referenced audio portion more closely than the other candidate contents 808 a and 808 c - e . Assuming that the candidate content 808 b is selected as the final content 812 for this audio portion, the content inserter 822 will insert the final content 812 produced by sub-model 306 b into the comparison section 312 b of the structured text document 310 .
- the remainder of the audio stream 302 illustrated in FIG. 4 may be recognized and inserted into appropriate ones of the sub-structures 312 a - f in the structured textual document 310 in a similar manner.
- although content in the spoken audio stream 302 illustrated in FIG. 4 appears in the same sequence as the sections 312 a - f in the structured textual document 310 , this is not a requirement of the present invention. Rather, content may appear in the audio stream 302 in any order.
- Each of the segments 802 a - c of the audio stream 302 is recognized by the speech recognition decoder 320 , and the resulting final content 812 is inserted into the appropriate one of the sub-structures 312 a - f .
- the order of the textual content in the sub-structures 312 a - f may not be the same as the order of the content in the spoken audio stream. Note, however, that even if the order of textual content is the same in both the audio stream 302 and the structured textual document 310 , the rendering engine 314 ( FIG. 3 ) may render the textual content of the document 310 in any desired order.
- the probabilistic language model 304 is a hierarchical language model.
- the plurality of sub-models 306 a - e are organized in a hierarchy.
- the sub-models 306 a - e may further include additional sub-models, and so on, so that the hierarchy of the language model 304 may include multiple levels.
- the language model 304 includes a plurality of nodes 1002 , 306 a - e , 1006 a - e , and 1010 and 1012 .
- Square nodes 1002 , 306 b - e , 1006 e , and 1012 use probabilistic finite state grammars to model highly constrained concepts (such as report section order, section cues, dates, and times).
- Elliptical nodes 306 a , 1006 a - d , and 1010 use statistical (n-gram) language models to model less-constrained language.
- the term “concept” as used herein includes, for example, dates, times, numbers, codes, medications, medical history, diagnoses, prescriptions, phrases, enumerations, and section cues.
- a concept may be spoken in a plurality of ways. Each way of speaking a particular concept is referred to herein as a “spoken form” of the concept.
- a distinction is sometimes made between “semantic” concepts and “syntactic” concepts.
- the term “concept” as used herein includes both semantic concepts and syntactic concepts, but is not limited to either and does not rely on any particular definition of “semantic concept” or “syntactic concept” or on any distinction between the two.
- consider the sentence “John Jones has pneumonia.” This sentence, which is a concept as that term is used herein, may be spoken in a plurality of ways, such as the spoken phrases, “john jones has pneumonia,” “patient jones diagnosis pneumonia,” and “diagnosis pneumonia patient jones.”
- the written sentence “John Jones has pneumonia” is an example of a “written form” of the same concept.
- the hierarchical language model 304 may include sub-models for such low-level concepts.
- the n-gram sub-models 306 a , 1006 a - d , and 1010 may assign probabilities to sequences of words representing dates, times, and other low-level concepts.
- the language model 304 includes root node 1002 , which contains a finite state grammar representing the probabilities of occurrence of node 1002 's sub-nodes 306 a - e .
- the root node 1002 may, for example, indicate probabilities of the header, comparison, technique, findings, and impression sections of the document 310 appearing in particular orders in the spoken audio stream 302 .
- node 306 a is a “header” node, which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the header section 312 a of the structured textual document 310 .
- Node 306 b contains a “comparison” finite state grammar representing probabilities of occurrence of a variety of alternative spoken forms of cues for the comparison section 312 b of the textual document.
- the finite state grammar in the comparison node 306 b may, for example, include cues such as “comparison to”, “comparison for”, “prior is”, and “prior studies are”.
- the finite state grammar may include a probability for each of these cues. Such probabilities may, for example, be based on observed frequencies of use of the cues in a set of training speech for the same speaker or in the same domain as the spoken audio stream 302 . Such frequencies may be obtained, for example, using the techniques disclosed in the above-referenced patent application entitled “Document Transcription System Training.”
- the comparison node 306 b includes a “comparison content” sub-node 1006 a , which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the body of the comparison section 312 b of the textual document 310 .
- the comparison content node 1006 a has a date node 1012 as a child.
- the date node 1012 is a finite state grammar representing probabilities of the date being spoken in various ways.
- Nodes 306 c and 306 d may be understood similarly.
- Node 306 c contains a “technique” finite state grammar representing probabilities of occurrence of a variety of alternative spoken forms of cues for the technique section 312 c of the textual document 310 .
- the technique node 306 c includes a “technique content” sub-node 1006 b , which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the body of the technique section 312 c of the textual document 310 .
- node 306 d contains a “findings” finite state grammar representing probabilities of occurrence of a variety of alternative spoken forms of cues for the findings section 312 d of the textual document 310 .
- the findings node 306 d includes a “findings content” sub-node 1006 c , which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the body of the findings section 312 d of the textual document 310 .
- Impression node 306 e is similar to nodes 306 b - d , in that it includes a finite state grammar for recognizing section cues and a sub-node 1006 d including an n-gram language model for recognizing section content. In addition, however, the impression node 306 e includes an additional sub-node 1006 e , which in turn includes a sub-node 1010 . This indicates that the content of the impression section may be recognized using either the language model in the impression content node 1006 d or the “enum” node 1006 e , governed by the finite state grammar-based language model corresponding to impression node 306 e .
- the “enum” node 1006 e contains a finite state grammar indicating probabilities associated with different ways of speaking enumeration cues (such as “number one,” “number two,” “first,” “second,” “third,” and so on).
- the impression content node 1010 may include the same language model as the impression content node 1006 d.
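The hierarchy of FIG. 10A can be pictured as a simple node tree in which finite-state-grammar nodes model constrained concepts and n-gram nodes model free-form content. The sketch below is purely illustrative of that structure; it is not code from the patent:

```python
# Each node records its kind: "fsg" (probabilistic finite state grammar) for
# constrained concepts such as section order, cues, and dates, or "ngram" for
# free-form section content.
def node(name, kind, children=()):
    return {"name": name, "kind": kind, "children": list(children)}

date = node("date", "fsg")
language_model = node("root", "fsg", [
    node("header content", "ngram"),
    node("comparison cues", "fsg", [node("comparison content", "ngram", [date])]),
    node("technique cues", "fsg", [node("technique content", "ngram")]),
    node("findings cues", "fsg", [node("findings content", "ngram")]),
    node("impression cues", "fsg", [
        node("impression content", "ngram"),
        node("enum cues", "fsg", [node("impression content", "ngram")]),
    ]),
])
```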
- Referring to FIG. 11A , a flowchart is shown of a method that is performed by the structured document generator 308 in one embodiment of the present invention to generate the structured textual document 310 ( FIG. 2 , step 204 ).
- Referring to FIG. 12A , a dataflow diagram is shown illustrating a portion of the system 300 in detail relevant to the method of FIG. 11A .
- the structured document generator 308 includes a path selector 1202 which identifies a path 1204 through the hierarchical language model 304 (step 1102 ).
- the path 1204 is an ordered sequence of nodes in the hierarchical language model 304 . Nodes may be traversed multiple times in the path 1204 . Examples of techniques for generating the path 1204 will be described in more detail below with respect to FIGS. 11B and 12B .
- the path 1204 includes points 1020 a - j , which specify a sequence in which to traverse nodes in the language model 304 .
- Points 1020 a - j are referred to as “points” rather than “nodes” to distinguish them from nodes 1002 , 306 a - e , 1006 a - e , and 1010 in the language model 304 .
- path 1204 traverses the following nodes of language model 304 in sequence: (1) root node 1002 (point 1020 a ); (2) header content node 306 a (point 1020 b ); (3) comparison node 306 b (point 1020 c ); (4) comparison content node 1006 a (point 1020 d ); (5) technique node 306 c (point 1020 e ); (6) technique content node 1006 b (point 1020 f ); (7) findings node 306 d (point 1020 g ); (8) findings content node 1006 c (point 1020 h ); (9) impression node 306 e (point 1020 i ); and (10) impression content node 1006 d (point 1020 j ).
- the spoken audio stream 302 begins with speech that is best recognized by the header content language model 306 a (“CT scan of the chest without contrast april twenty second two thousand three”), followed by speech that is best recognized by the comparison language model 306 b (“comparison to”), followed by speech that is best recognized by the comparison content language model 1006 a (“prior studies from march twenty six two thousand two and april six two thousand one”), and so on.
- the structured document generator 308 recognizes the spoken audio stream 302 using the language models traversed by the path 1204 to produce the structured textual document 310 (step 1104 ). As described in more detail below with respect to FIGS. 11B and 12B , the speech recognition and structured textual document generation of step 1104 may be integrated with the path identification of step 1102 , rather than performed separately.
- the structured document generator 308 may include a node enumerator 1206 which iterates over each of the language model nodes N 1208 traversed by the selected path 1204 (step 1106 ). For each such node N, the speech recognition decoder 320 may recognize the portion of the audio stream 302 corresponding to the language model at node N to produce corresponding structured text T (step 1108 ). The structured document generator 308 may insert text T 1210 into the substructure of the structured textual document 310 corresponding to node N 1208 of the language model 304 (step 1110 ).
- comparison node 306 b may be used to recognize the text “comparison to” in the spoken audio stream 302 ( FIG. 4 ).
- because comparison node 306 b corresponds to a document sub-structure (e.g., the comparison section 312 b ) rather than to content, the result of the speech recognition performed in step 1108 in this case may be a document sub-structure, namely an empty “comparison” section.
- Such a section may be inserted into the structured document 310 in step 1110 , for example, in the form of matching “<comparison>” and “</comparison>” tags.
- the comparison content node 1006 a may be used to recognize the text “prior studies from march twenty six two thousand two and april six two thousand one” in the spoken audio stream 302 ( FIG. 4 ), thereby producing the structured text “Prior studies from <date>26-MAR-2002</date> and <date>06-APR-2001</date>”, as shown in FIG. 5 .
- This structured text may then be inserted into the comparison section 312 b in step 1110 (e.g., between the “<comparison>” and “</comparison>” tags, as shown in FIG. 5 ).
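As an illustrative sketch of that insertion (using a standard XML library as an assumption, not any mechanism prescribed by the patent), the nested comparison section might be built as follows:

```python
import xml.etree.ElementTree as ET

# Build the comparison section with dates as machine-readable sub-elements.
report = ET.Element("report")
comparison = ET.SubElement(report, "comparison")
comparison.text = "Prior studies from "
first_date = ET.SubElement(comparison, "date")
first_date.text = "26-MAR-2002"
first_date.tail = " and "
second_date = ET.SubElement(comparison, "date")
second_date.text = "06-APR-2001"

print(ET.tostring(report, encoding="unicode"))
# <report><comparison>Prior studies from <date>26-MAR-2002</date>
# and <date>06-APR-2001</date></comparison></report>
```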
- the structured document generator 308 repeats steps 1108 - 1110 for the remaining nodes N traversed by the path 1204 (step 1112 ), thereby inserting a plurality of structured texts 1210 into the structured textual document 310 .
- the end result of the method illustrated in FIG. 11A is the creation of the structured textual document 310 , which contains text having a structure that corresponds to the structure of the path 1204 through the language model 304 .
- the structure of the illustrated path traverses language model nodes corresponding to the header, comparison, technique, findings, and impression sections in sequence.
- the resulting structured textual document 310 (as illustrated, for example, in FIG. 5 ) similarly includes header, comparison, technique, findings, and impression sections in sequence.
- the structured textual document 310 therefore has the same structure as the language model path 1204 that was used to create the structured textual document 310 .
- the structured document generator 308 inserts recognized structured text 1210 into the appropriate sub-structures of the structured textual document 310 ( FIG. 11A , step 1110 ).
- the structured textual document 310 may be implemented as an XML document or other document which supports nested structures.
- the system illustrated in FIG. 12A includes path selector 1202 , which selects a path 1204 through the language model 304 .
- the method illustrated in FIG. 11A then uses the selected path 1204 to generate the structured textual document 310 .
- the steps of path selection and structured document creation are performed separately. This is not, however, a limitation of the present invention.
- Referring to FIG. 11B , a flowchart is shown of a method 1150 which integrates the steps of path selection and structured document generation.
- Referring to FIG. 12B , an embodiment of the structured document generator 308 is shown which performs the method 1150 of FIG. 11B in one embodiment of the present invention.
- the method 1150 of FIG. 11B searches for possible paths through the hierarchy of the language model 304 ( FIG. 10A ), beginning at the root node 1002 and expanding outward. Any of a variety of techniques, including techniques well-known to those of ordinary skill in the art, may be used to search through the language model hierarchy.
- the method 1150 uses the speech recognition decoder 320 to recognize increasingly large portions of the spoken audio stream 302 using the language models falling along the partial paths, thereby creating partial candidate structured documents.
- the method 1150 assigns fitness scores to each of the partial candidate structured documents.
- the fitness score for each candidate structured document is a measure of how well the path that produced the candidate structured document has performed.
- the method 1150 expands the partial paths, thereby continuing to search through the language model hierarchy, until the entire spoken audio stream 302 has been recognized.
- the structured document generator 308 selects the candidate structured document having the highest fitness score as the final structured textual document 310 .
- the method 1150 initializes one or more candidate paths 1224 through the language model 304 (step 1152 ).
- the candidate paths 1224 may be initialized to contain a single path consisting of the root node 1002 .
- the term “frame” refers herein to a short period of time, such as 10 milliseconds.
- the method 1150 initializes an audio stream pointer to point to the first frame in the audio stream 302 (step 1153 ).
- the structured document generator 308 contains an audio stream enumerator 1240 which provides a portion 1242 of the audio stream 302 to the speech recognition decoder 320 .
- the portion 1242 may solely contain the first frame of the audio stream 302 .
- the speech recognition decoder 320 recognizes the current portion 1242 of the audio stream 302 using the language sub-models in the candidate path(s) 1224 to generate one or more candidate structured partial documents 1232 (step 1154 ). Note that the documents 1232 are only partial documents 1232 because they have been generated based on only a portion of the audio stream 302 .
- the speech recognition decoder 320 may simply recognize the first frame of the audio stream 302 using the language model at the root node 1002 of the language model 304 .
- The techniques disclosed above with respect to FIG. 11A and FIG. 12A may be used by the speech recognition decoder 320 to generate the candidate structured partial documents 1232 using the candidate paths 1224. More specifically, the speech recognition decoder 320 may apply the methods illustrated in FIG. 11A to the audio stream portion 1242 using each of the candidate paths 1224 as the path identified in step 1102 (FIG. 11A).
- A fitness evaluator 1234 generates fitness scores 1236 for each of the candidate structured partial documents 1232 (step 1156).
- The fitness scores 1236 are measures of how well the candidate structured partial documents 1232 represent the corresponding portion of the audio stream 302.
- The fitness score for a single candidate document may be generated by: (1) generating fitness scores for each of the nodes in the corresponding one of the candidate paths 1224; and (2) using a synthesis function to synthesize the individual node fitness scores generated in step (1) into an overall fitness score for the candidate structured document. Examples of techniques that may be used to generate the candidate fitness scores 1236 will be described in more detail below with respect to FIG. 11C.
- If the structured document generator 308 were to attempt to search for all possible paths through the hierarchy of the language model 304, the computational resources required to evaluate each possible path might become prohibitively costly and/or time-consuming due to the exponential growth in the number of possible paths.
- Therefore, a path pruner 1230 uses the candidate fitness scores 1236 to remove poorly-fitting paths from the candidate paths 1224, thereby producing a set of pruned paths 1222 (step 1158).
- When the entire audio stream 302 has been recognized, a final document selector 1238 selects, from among the candidate structured partial documents 1232, the candidate structured document having the highest fitness score, and provides the selected document as the final structured textual document 310 (step 1164). If the entire audio stream 302 has not been recognized, a path extender 1220 extends the pruned paths 1222 within the language model 304 to produce a new set of candidate paths 1224 (step 1162). If, for example, the pruned paths 1222 consist of a single path containing the root node 1002, the path extender 1220 may extend this path by one node downward in the hierarchy illustrated in FIG. 10A, thereby producing a path from the root node 1002 to the header content node 306 a, a path from the root node 1002 to the comparison node 306 b, a path from the root node 1002 to the technique node 306 c, and so on.
- Various techniques for extending the paths 1224 to perform depth-first, breadth-first, or other kinds of hierarchical searches are well-known to those having ordinary skill in the art.
- After the paths have been extended, the audio stream enumerator 1240 extends the portion 1242 of the audio stream 302 to include the next frame in the audio stream 302 (step 1163). Steps 1154-1160 are then repeated, using the new candidate paths 1224 to recognize the new portion 1242 of the audio stream 302. In this way the entire audio stream 302 may be recognized using appropriate sub-models in the language model 304.
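- The search loop just described can be summarized in code. The following sketch is illustrative only and is not drawn from the patent: the caller-supplied helpers decode_with_path() and fitness(), the node objects, and the beam width are assumptions standing in for the speech recognition decoder 320, the fitness evaluator 1234, and the path pruner 1230.

```python
# Illustrative sketch of the integrated search of FIG. 11B (assumed helpers).
def generate_structured_document(audio_frames, root, decode_with_path, fitness,
                                 beam_width=10):
    # decode_with_path(portion, path) -> partial structured document (step 1154)
    # fitness(document) -> numeric fitness score (step 1156)
    candidate_paths = [[root]]                       # step 1152: start at the root
    best = None
    for num_frames in range(1, len(audio_frames) + 1):   # steps 1153 and 1163
        portion = audio_frames[:num_frames]
        scored = []
        for path in candidate_paths:                 # step 1154: recognize portion
            doc = decode_with_path(portion, path)
            scored.append((fitness(doc), path, doc)) # step 1156: score candidates
        scored.sort(key=lambda t: t[0], reverse=True)
        scored = scored[:beam_width]                 # step 1158: prune poor paths
        best = scored[0][2]
        # Step 1162: extend each surviving path one node deeper in the hierarchy.
        extended = [path + [child] for _, path, _ in scored
                    for child in getattr(path[-1], "children", [])]
        candidate_paths = extended or [path for _, path, _ in scored]
    return best                                      # step 1164: best final document
```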
- As described above, fitness scores 1236 may be generated for each of the candidate structured partial documents 1232 produced by the structured document generator 308 while evaluating candidate paths 1224 through the language model 304. Examples of techniques will now be described for generating fitness scores, either for the candidate structured partial documents 1232 illustrated in FIG. 12B or for structured documents more generally.
- Recall that the comparison content node 1006 a has a date node 1012 as a child.
- Suppose that the text "CT scan of the chest without contrast april twenty second two thousand three" has been recognized as text corresponding to the comparison content node 1006 a.
- Suppose further that the comparison content node 1006 a was used to recognize the text "CT scan of the chest without contrast" and that the date node 1012, which is a child of the comparison content node 1006 a, was used to generate the text "april twenty second two thousand three".
- The fitness score for this text may, therefore, be calculated by using the comparison content node 1006 a to calculate a first fitness score for the text "CT scan of the chest without contrast" followed by any date, calculating a second fitness score for the text "april twenty second two thousand three" based on the date node 1012, and multiplying the first and second fitness scores.
- Referring to FIG. 11C, a fitness score S is initialized to a value of one for the candidate structured document being evaluated (step 1172).
- The method assigns a current node pointer N to point to the root node in the candidate path corresponding to the candidate document (step 1174).
- The method calls a function named Fitness( ) with the values N and S (step 1176) and returns the result as the fitness score for the candidate document (step 1178).
- The Fitness( ) function generates the fitness score S using a hierarchical factorization, by traversing the candidate path corresponding to the candidate document.
- The Fitness( ) function 1180 identifies the probability P(W(N)) that the text W corresponding to the current node N has been recognized by the language model associated with that node, and multiplies that probability by the current value of S to produce a new value for S (step 1184).
- If node N has no children (step 1186), the value of S is returned (step 1194). If node N has children, then the Fitness( ) function 1180 is called recursively on each of the child nodes, with the results being multiplied by the value of S to produce new values of S (steps 1188-1192). The resulting value of S is returned (step 1194).
- Upon completion of the recursion, the value of S represents a fitness score for the entire candidate structured document, and the value of S is returned, e.g., for use in the method 1150 illustrated in FIG. 11B (step 1194).
- Returning to the example above, the fitness score (probability) of this text may be obtained as the probability of the text "CT scan of the chest without contrast <DATE>", where <DATE> denotes any date, multiplied by the conditional probability of the text "april twenty second two thousand three" occurring given that the text represents a date.
- The effect of the method illustrated in FIG. 11C is to hierarchically factor the probabilities of word sequences according to the hierarchy of the language model 304, allowing the individual probability estimates associated with each language model node to be seamlessly combined with the probability estimates associated with other nodes.
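- As a concrete rendering of this factorization, the recursion of FIGS. 11C-11D can be sketched as follows. This is an illustration only: the Node class and the probability values are assumptions made for the example, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    p_text: float                  # P(W(N)): probability of the text recognized
                                   # at this node under its own language model
    children: List["Node"] = field(default_factory=list)

def fitness(node: Node, score: float = 1.0) -> float:
    """Hierarchically factor probabilities down a candidate path."""
    score *= node.p_text           # step 1184: multiply in P(W(N))
    for child in node.children:    # steps 1188-1192: recurse on the children
        score = fitness(child, score)
    return score                   # step 1194: return the accumulated score

# The date example above: P("CT scan of the chest without contrast <DATE>")
# times P("april twenty second two thousand three" | <DATE>), with made-up values.
doc = Node(p_text=0.001, children=[Node(p_text=0.05)])
assert fitness(doc) == 0.001 * 0.05
```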
- This probabilistic framework allows the system to model and use statistical language models with embedded probabilistic finite state grammars and finite state grammars with embedded statistical language models.
- As described above, nodes in the language model 304 represent language sub-models which specify the probabilities of occurrence of sequences of words in the spoken audio stream 302.
- The discussion so far has assumed that probabilities have already been assigned in such language models. Examples of techniques will now be disclosed for assigning probabilities to the language sub-models (such as n-gram language models and context-free grammars) in the language model 304.
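- As an illustration of the kind of probabilities an n-gram sub-model assigns once trained, the sketch below scores a word sequence with a toy bigram table; the probability values are invented for the example.

```python
bigram = {("<s>", "the"): 0.2, ("the", "cat"): 0.01}   # toy probability table

def sequence_probability(words, bigram, floor=1e-6):
    """Multiply P(word | previous word) over the sequence ("<s>" = start)."""
    p = 1.0
    for prev, word in zip(["<s>"] + words[:-1], words):
        p *= bigram.get((prev, word), floor)   # small floor for unseen pairs
    return p

assert sequence_probability(["the", "cat"], bigram) == 0.2 * 0.01
```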
- Referring to FIG. 13, a flowchart is shown of a method 1300 that is used in one embodiment of the present invention to generate the language model 304.
- A plurality of nodes are selected for use in the language model (step 1302).
- The nodes may, for example, be selected by a transcriptionist or other person skilled in the relevant domain.
- The nodes may be selected in an attempt to capture all of the types of concepts that may occur in the spoken audio stream 302.
- For example, nodes (such as those shown in FIG. 10A) may be selected which represent the sections of a medical report and the concepts (such as dates, times, medications, allergies, vital signs, and medical codes) which are expected to occur in a medical report.
- A concept and a language model type may be assigned to each of the nodes selected in step 1302 (steps 1304-1306).
- For example, node 306 b (FIG. 10A) may be assigned the concept "comparison section cue" and the language model type "finite state grammar."
- Similarly, node 1006 a may be assigned the concept "comparison content" and the language model type "n-gram language model."
- The nodes selected in step 1302 may then be arranged into a hierarchical structure (step 1308).
- For example, the nodes 1002, 306 a-e, 1006 a-e, and 1010 may be arranged into the hierarchical structure illustrated in FIG. 10A to represent and enforce structural dependencies between the nodes.
- Each of the nodes selected in step 1302 may then be trained using text representing a corresponding concept (step 1310 ).
- To perform such training, a set of training documents may first be identified.
- The set of training documents may, for example, be a set of existing medical reports or other documents in the same domain as the spoken audio stream 302.
- The training documents may be marked up to indicate the existence and location of structures in the documents, such as sections, sub-sections, dates, times, codes, and other concepts. Such markup may, for example, be performed automatically on formatted documents, or manually by a transcriptionist or other person skilled in the relevant domain. Examples of techniques for training the nodes selected in step 1302 are described in the above-referenced patent application entitled "Document Transcription System Training."
- Conventional language model training techniques may be used in step 1310 to train concept-specific language models for each of the concepts that is marked up in the training documents. For example, the text from all of the marked-up “header” sections in the training documents may be used to train the language model node 306 a representing the header section. In this way, language models for each of the nodes 1002 , 306 a - e , 1006 a - e , and 1010 in the language model 304 illustrated in FIG. 10A may be trained.
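- A minimal sketch of this training step follows, under the assumption that the marked-up training documents tag each concept with simple XML-style tags (e.g., <header>...</header>); the tag format and helper names are illustrative assumptions, not the patent's conventions.

```python
import re
from collections import Counter

def section_texts(documents, concept):
    """Collect all text marked up with the given concept tag."""
    pattern = re.compile(rf"<{concept}>(.*?)</{concept}>", re.DOTALL)
    return [text for doc in documents for text in pattern.findall(doc)]

def train_bigram_model(texts):
    """Estimate conditional bigram probabilities P(w2 | w1) from the texts."""
    bigrams, contexts = Counter(), Counter()
    for text in texts:
        words = text.split()
        contexts.update(words[:-1])          # count each word used as a context
        bigrams.update(zip(words, words[1:]))
    return {pair: n / contexts[pair[0]] for pair, n in bigrams.items()}

# One concept-specific model per node; e.g., for the header node 306 a:
# header_model = train_bigram_model(section_texts(training_docs, "header"))
```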
- The result of the method 1300 illustrated in FIG. 13 is a hierarchical language model having trained probabilities, which can be used to generate the structured textual document 310 in the manner described above.
- This hierarchical language model may then be used, for example, to iteratively re-segment the training text, such as by using the techniques disclosed above in conjunction with FIGS. 11B and 12B .
- The resegmented training text may then be used to retrain the hierarchical language model.
- This process of re-segmenting and re-training may be performed iteratively to repeatedly improve the quality of the language model.
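- Expressed as a loop, the iteration might look like the following sketch; resegment() and retrain() are hypothetical stand-ins for the techniques of FIGS. 11B/12B and FIG. 13, respectively.

```python
def refine_language_model(model, training_texts, resegment, retrain, iterations=3):
    """Alternate re-segmentation and re-training to improve the model."""
    for _ in range(iterations):
        segmented = [resegment(text, model) for text in training_texts]
        model = retrain(model, segmented)
    return model
```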
- In the embodiments described above, the structured document generator 308 both recognizes the spoken audio stream 302 and generates the structured textual document 310 using an integrated process, without generating an intermediate non-structured transcript. Such techniques, however, are disclosed merely for purposes of example and do not constitute limitations of the present invention.
- FIG. 14 a flowchart is shown of a method 1400 that is used in another embodiment of the present invention to generate the structured textual document 310 using distinct speech recognition and structural parsing steps.
- FIG. 15 a dataflow diagram is shown of a system 1500 that performs the method 1400 of FIG. 14 according to one embodiment of the present invention.
- In the method 1400, the speech recognition decoder 320 recognizes the spoken audio stream 302 using a language model 1506 to produce a transcript 1502 of the spoken audio stream 302.
- The language model 1506 may be a conventional language model that is distinct from the language model 304. More specifically, the language model 1506 may be a conventional monolithic language model.
- The language model 1506 may, for example, be generated using the same training corpus as is used to train the language model 304. While portions of the training corpus may be used to train individual nodes of the language model 304, the entire corpus may be used to train the language model 1506.
- The speech recognition decoder 320 may, therefore, use conventional speech recognition techniques to recognize the spoken audio stream 302 using the language model 1506 and thereby to produce the transcript 1502.
- The transcript 1502 may be a "flat" transcript of the spoken audio stream 302, rather than a structured document as in the examples disclosed above.
- The transcript 1502 may, for example, include a sequence of flat text resembling the text illustrated in FIG. 4 (which illustrates the spoken audio stream 302 in textual form).
- The system 1500 also includes a structural parser 1504, which uses the hierarchical language model 304 to parse the transcript 1502 and thereby to produce the structured textual document 310 (step 1404).
- For example, the structural parser 1504 may use the techniques disclosed above with respect to FIGS. 11C and 12B to: (1) produce multiple candidate structured documents having the same content as the transcript 1502 but having structures corresponding to different paths through the language model 304; (2) generate fitness scores for each of the candidate structured documents; and (3) select the candidate structured document having the highest fitness score as the final structured textual document.
- Note that step 1404 may be performed without performing speech recognition to generate each of the candidate structured documents. Rather, once the transcript 1502 has been produced using the speech recognition decoder 320, candidate structured documents may be generated based on the transcript 1502 without performing additional speech recognition.
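- The two-step process of FIGS. 14-15 therefore reduces to the outline below; decode() and structural_parse() are hypothetical caller-supplied helpers standing in for the speech recognition decoder 320 and the structural parser 1504.

```python
def two_step_generation(audio_stream, flat_lm, hierarchical_lm,
                        decode, structural_parse):
    # Conventional recognition with the monolithic language model 1506
    # produces a flat transcript (transcript 1502).
    transcript = decode(audio_stream, flat_lm)
    # The structural parser then works on the text alone (step 1404);
    # no additional speech recognition is performed.
    return structural_parse(transcript, hierarchical_lm)
```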
- Note further that the structural parser 1504 need not use the full language model 304 to produce the structured textual document 310. Rather, the structural parser 1504 may use a scaled-down "skeletal" language model, such as the language model 1030 illustrated in FIG. 10C.
- The example language model 1030 shown in FIG. 10C is the same as the language model 304 shown in FIG. 10A, except that in the skeletal language model 1030 the content language model nodes 306 a, 1006 a-d, and 1010 have been replaced with universally-accepting language models 1032 a-f, also referred to as "don't care" language models.
- In other words, the language models 1032 a-f will accept any text that is provided to them as input.
- The heading cue language models 306 b-e in the skeletal language model 1030 enable the structural parser 1504 to parse the transcript 1502 into the correct sub-structures in the structured document 310.
- The use of the universally-accepting language models 1032 a-f enables the structural parser 1504 to perform such structural parsing without incurring the (typically significant) expense of training content language models, such as the models 306 a, 1006 a-d, and 1010 shown in FIG. 10A.
- Note that the skeletal language model 1030 may still include language models, such as the date language model 1012, corresponding to lower-level concepts. As a result, the skeletal language model 1030 may be used to generate the structured document 310 from the transcript 1502 without incurring the overhead of training content language models, while retaining the ability to parse lower-level concepts into the structured document 310.
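- One way to realize a universally-accepting content model is as a uniform distribution over the vocabulary, so that it never favors one content word over another and only the cue and concept models influence segmentation; the class below is an illustrative assumption, not the patent's design.

```python
class DontCareModel:
    """A "don't care" content model: accepts any text with equal probability."""

    def __init__(self, vocabulary_size: int):
        self.vocabulary_size = vocabulary_size

    def probability(self, word: str, context: tuple = ()) -> float:
        # Uniform over the vocabulary, regardless of word or context.
        return 1.0 / self.vocabulary_size
```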
- In general, the techniques disclosed herein replace the traditional global language model with a combination of specialized local language models, each of which is better suited to a particular section of a document than a single generic language model would be. Such a language model has a variety of advantages.
- For example, the use of a language model which contains sub-models, each of which corresponds to a particular concept, is advantageous because it allows the most appropriate language model to be used to recognize speech corresponding to each concept.
- If each of the sub-models corresponds to a different concept, then each of the sub-models may be used to perform speech recognition on speech representing the corresponding concept. Because the characteristics of speech may vary from concept to concept, the use of such concept-specific language models may produce better recognition results than would be produced by using a monolithic language model for all concepts.
- Each sub-model in the language model may correspond to any concept, such as a section, paragraph, sentence, date, time, or ICD9 code.
- As a result, sub-models in the language model may be matched to particular concepts with a higher degree of precision than would be possible if only section-specific language models were employed.
- The use of such concept-specific language models for a wide variety of concepts may further improve speech recognition accuracy.
- Furthermore, hierarchical language models designed in accordance with embodiments of the present invention may have multi-level hierarchical structures, with the effect of nesting sub-models inside of each other.
- As a result, sub-models in the language model may be applied to portions of the spoken audio stream 302 at various levels of granularity, with the most appropriate language model being applied at each level of granularity.
- For example, a "header section" language model may be applied generally to speech inside of the header section of a document, while a "date" language model may be applied specifically to speech representing dates in the header section.
- This ability to nest language models and to apply nested language models to different portions of speech may further improve recognition accuracy by enabling the most appropriate language model to be applied to each portion of a spoken audio stream.
- Another advantage of using a language model which includes a plurality of sub-models is that the techniques disclosed herein may use such a language model to generate a structured textual document from a spoken audio stream using a single integrated process, rather than the prior art two-step process 100 illustrated in FIG. 1A in which a speech recognition step is followed by a natural language processing step.
- In the prior art system 100, the steps performed by the speech recognizer 104 and the natural language processor 108 are completely decoupled. Because the automatic speech recognizer 104 and the natural language processor 108 operate independently of each other, the output 106 of the automatic speech recognizer 104 is a literal transcript of the spoken content in the audio stream 102.
- The literal transcript 106 therefore contains text corresponding to all spoken utterances in the audio stream 102, whether or not such utterances are relevant to the final desired structured textual document.
- Such utterances may include, for example, hesitations, extraneous words or repetitions, as well as structural hints or task-related words.
- Furthermore, the natural language processor 108 relies on the successful detection and transcription of certain key words and/or key phrases, such as structural hints. If these key words/phrases are misrecognized by the automatic speech recognizer 104, the identification of structural entities by the natural language processor 108 may be negatively affected.
- In the techniques disclosed herein, in contrast, speech recognition and natural language processing are integrated, thereby enabling the language model to influence both the recognition of words in the audio stream 302 and the generation of structure in the structured textual document 310, and thereby improving the overall quality of the structured document 310.
- The techniques disclosed herein may also be used to extract and interpret semantic content from the audio stream 302.
- For example, using the date language model 1012 (FIGS. 10A-10B), the techniques disclosed herein may be used to identify portions of the audio stream 302 that represent dates, and to store representations of such dates in a computer-readable form.
- Storing such concepts in a computer-readable form allows the content of such concepts to be easily processed by a computer, such as by sorting document sections by date or identifying medications prescribed prior to a given date.
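- For example (with illustrative data only; the field names are assumptions), once dates are stored in machine-readable form, such processing is ordinary computation:

```python
from datetime import date

sections = [
    {"title": "comparison", "date": date(2001, 4, 6)},
    {"title": "header", "date": date(2003, 4, 22)},
]
sections.sort(key=lambda s: s["date"])        # sort document sections by date

medications = [{"name": "example drug", "prescribed": date(2002, 1, 15)}]
earlier = [m for m in medications if m["prescribed"] < date(2003, 1, 1)]
```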
- The techniques disclosed herein also enable the user to define different portions (e.g., sections) of the document, and to choose which concepts are to be extracted in each section.
- The techniques disclosed herein, therefore, facilitate the recognition and processing of semantic content in spoken audio streams.
- Such techniques may be applied instead of or in addition to storing extracted information in a structured document.
- Domains such as the medical and legal domains, in which there are large bodies of pre-existing transcribed documents to use as training text, may find particular benefit in the techniques disclosed herein.
- Such training text may be used to train the language model 304 using the techniques disclosed above with respect to FIG. 13 .
- Because documents in such domains may be required to have well-defined structures, and because such structures may be readily identifiable in existing documents, it may be relatively easy (albeit time-consuming) to correctly identify the portions of such existing documents to use in training each of the concept-specific language model nodes in the language model 304.
- As a result, each of the language model nodes may be well-trained to recognize the corresponding concept, thereby increasing recognition accuracy and increasing the ability of the system to generate documents having the required structure.
- Furthermore, the techniques disclosed herein may be used to generate documents having the desired structure regardless of the manner in which the spoken audio stream is dictated.
- Individual sub-models 306 a-e in the language model 304 may be updated easily without affecting the remainder of the language model.
- For example, the header content sub-model 306 a may be replaced with a different header content sub-model which accounts differently for the way in which the document header is dictated.
- The modular structure of the language model 304 enables such modification/replacement of sub-models to be performed without the need to modify any other part of the language model 304.
- As a result, parts of the language model 304 may easily be updated to reflect different document dictation conventions.
- The structured textual document 310 that is produced by various embodiments of the present invention may be used to train a language model.
- The training techniques described in the above-referenced patent application entitled "Document Transcription System Training" may use the structured textual document 310 to retrain and thereby improve the language model 304.
- The retrained language model 304 may then be used to produce subsequent structured textual documents, which may in turn be used to retrain the language model 304. This iterative process may be employed to improve the quality of the structured documents that are produced over time.
- The spoken audio stream 302 may be any audio stream, such as a live audio stream received directly or indirectly (such as over a telephone or IP connection), or an audio stream recorded on any medium and in any format.
- In distributed speech recognition (DSR), a client performs preprocessing on an audio stream to produce a processed audio stream that is transmitted to a server, which performs speech recognition on the processed audio stream.
- The audio stream 302 may, for example, be a processed audio stream produced by a DSR client.
- Although each node in the language model 304 is described herein as containing a language model that corresponds to a particular concept, this is not a requirement of the present invention.
- For example, a node may include a language model that results from interpolating a concept-specific language model associated with the node with one or more of: (1) global background language models, or (2) concept-specific language models associated with other nodes.
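- Such interpolation can be sketched as a weighted mixture; the weight and the probability() interface below are assumptions made for illustration:

```python
def interpolated_probability(word, context, concept_lm, background_lm, lam=0.8):
    """P(word | context) = lam * P_concept + (1 - lam) * P_background."""
    return (lam * concept_lm.probability(word, context)
            + (1.0 - lam) * background_lm.probability(word, context))
```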
- In the examples above, a distinction may be made between "grammars" and "text." It should be appreciated that text may be represented as a grammar in which there is a single spoken form having a probability of one. Therefore, documents which are described herein as including both text and grammars may be implemented solely using grammars if desired. Furthermore, a finite state grammar is merely one kind of context-free grammar, which is a kind of language model that allows multiple alternative spoken forms of a concept to be represented. Therefore, any description herein of techniques that are applied to finite state grammars may be applied more generally to any other kind of grammar.
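- A minimal data representation of these ideas is sketched below (the dict encoding is an illustration, not the patent's format): plain text is the degenerate case of a grammar with one spoken form of probability one, while the date grammar reuses the spoken forms and probabilities given elsewhere in this specification for Oct. 1, 1993.

```python
# Text as a degenerate grammar: a single spoken form with probability one.
text_as_grammar = {"ct scan of the chest without contrast": 1.0}

# A finite state grammar for a date, mapping spoken forms to probabilities.
date_grammar = {
    "october first nineteen ninety three": 0.7,
    "ten one ninety three": 0.2,
    "first october ninety three": 0.1,
}

# In both cases the spoken-form probabilities sum to one.
for grammar in (text_as_grammar, date_grammar):
    assert abs(sum(grammar.values()) - 1.0) < 1e-9
```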
- Embodiments of the present invention are not limited to use in conjunction with any particular kind(s) of language model(s).
- The invention is not limited to any of the described fields (such as medical and legal reports), but generally applies to any kind of structured document.
- the techniques described above may be implemented, for example, in hardware, software, firmware, or any combination thereof.
- the techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor, a storage medium readable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
- Program code may be applied to input entered using the input device to perform the functions described and to generate output.
- The output may be provided to one or more output devices.
- Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language.
- The programming language may, for example, be a compiled or interpreted programming language.
- Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor.
- Method steps of the invention may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output.
- Suitable processors include, by way of example, both general and special purpose microprocessors.
- The processor receives instructions and data from a read-only memory and/or a random access memory.
- Storage devices suitable for tangibly embodying computer program instructions include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays).
- A computer can generally also receive programs and data from a storage medium such as an internal disk (not shown) or a removable disk.
Description
- This application is a continuation of copending U.S. patent application Ser. No. 10/923,517, filed on Aug. 20, 2004, entitled “Automated Extraction of Semantic Content and Generation of a Structured Document from Speech,” which is hereby incorporated by reference herein.
- 1. Field of the Invention
- The present invention relates to automatic speech recognition and, more particularly, to techniques for automatically transcribing speech.
- 2. Related Art
- It is desirable in many contexts to generate a written document based on human speech. In the legal profession, for example, transcriptionists transcribe testimony given in court proceedings and in depositions to produce a written transcript of the testimony. Similarly, in the medical profession, transcripts are produced of diagnoses, prognoses, prescriptions, and other information dictated by doctors and other medical professionals. Transcripts in these and other fields typically need to be highly accurate (as measured in terms of the degree of correspondence between the semantic content (meaning) of the original speech and the semantic content of the resulting transcript) because of the reliance placed on the resulting transcripts and the harm that could result from an inaccuracy (such as providing an incorrect prescription drug to a patient). High degrees of reliability may, however, be difficult to obtain consistently for a variety of reasons, such as variations in: (1) features of the speakers whose speech is transcribed (e.g., accent, volume, dialect, speed); (2) external conditions (e.g., background noise); (3) the transcriptionist or transcription system (e.g., imperfect hearing or audio capture capabilities, imperfect understanding of language); or (4) the recording/transmission medium (e.g., paper, analog audio tape, analog telephone network, compression algorithms applied in digital telephone networks, and noises/artifacts due to cell phone channels).
- At first, transcription was performed solely by human transcriptionists who would listen to speech, either in real-time (i.e., in person by “taking dictation”) or by listening to a recording. One benefit of human transcriptionists is that they may have domain-specific knowledge, such as knowledge of medicine and medical terminology, which enables them to interpret ambiguities in speech and thereby to improve transcript accuracy. Human transcriptionists, however, have a variety of disadvantages.
- For example, human transcriptionists produce transcripts relatively slowly and are subject to decreasing accuracy over time as a result of fatigue.
- Various automated speech recognition systems exist for recognizing human speech generally and for transcribing speech in particular. Speech recognition systems which create transcripts are referred to herein as “automated transcription systems” or “automated dictation systems.” Off-the-shelf dictation software, for example, may be used by personal computer users to dictate documents in a word processor as an alternative to typing such documents using a keyboard.
- Automated dictation systems typically attempt to produce a word-for-word transcript of speech. Such a transcript, in which there is a one-to-one mapping between words in the spoken audio stream and words in the transcript, is referred to herein as a “verbatim transcript.” Automated dictation systems are not perfect and may therefore fail to produce perfect verbatim transcripts.
- In some circumstances, however, a verbatim transcript is not desired. In fact, transcriptionists may intentionally introduce a variety of changes into the written transcription. A transcriptionist may, for example, filter out spontaneous speech effects (e.g., pause fillers, hesitations, and false starts), discard irrelevant remarks and comments, convert data into a standard format, insert headings or other explanatory materials, or change the sequence of the speech to fit the structure of a written report.
- In the medical domain, for example, spoken reports produced by doctors are frequently transcribed into written reports having standard formats. For example, referring to
FIG. 1B, an example of a structured and formatted medical report 111 is shown. The report 111 includes a variety of sections 112-138 which appear in a predetermined sequence when the report 111 is displayed. In the particular example shown in FIG. 1B, the report includes a header section 112, a subjective section 122, an objective section 134, an assessment section 136, and a plan section 138. Sections may include text as well as sub-sections. For example, the header section 112 includes a hospital name section 120 (containing the text "General Hospital"), a patient name section 114 (containing the text "Jane Doe"), a chart number section 116 (containing the text "851D"), and a report date section 118 (containing the text "Oct. 1, 1993").
- Similarly, the subjective section 122 includes various subjective information about the patient, included both in text and in a medical history section 124, a medications section 126, an allergies section 128, a family history section 130, and a social history section 132. The objective section 134 includes various objective information about the patient, such as her weight and blood pressure.
- Although not illustrated in FIG. 1B, the information in the objective section may include sub-sections for containing the illustrated information. The assessment section 136 includes a textual assessment of the patient's condition, and the plan section 138 includes a textual description of a plan of treatment.
- Note that information may appear in a different form in the report 111 from the form in which such information was spoken by the dictating doctor. For example, the date in the report date section 118 may have been spoken as "october first nineteen ninety three," "the first of october ninety three," or in some other form. The transcriptionist, however, transcribed such speech using the text "Oct. 1, 1993" in the report date section 118, perhaps because the hospital specified in the hospital name section 120 requires that dates in written reports be expressed in such a format.
- Similarly, information in the medical report 111 may not appear in the same sequence as in the original audio recording, due to the need to conform to a required report format or for some other reason. For example, the dictating physician may have dictated the objective section 134 first, followed by the subjective section 122, and then by the header section 112. The written report 111, however, contains the header section 112 first, followed by the subjective section 122, and then the objective section 134. Such a report structure may, for example, be required for medical reports in the hospital specified in the hospital name section 120.
- The beginning of the report 111 may have been generated based on a spoken audio stream such as the following: "this is doctor smith on uh the first of october um nineteen ninety three patient ID eighty five one d um next is the patient's family history which i have reviewed . . . " It should be apparent that a verbatim transcript of this speech would be difficult to understand and would not be particularly useful.
- Note, for example, that certain words, such as "next is the," do not appear in the written report 111. Similarly, pause-filling utterances such as "uh" do not appear in the written report 111. In addition, the written report 111 organizes the original speech into the predefined sections 112-140 by re-ordering the speech. As these examples illustrate, the written report 111 is not a verbatim transcript of the dictating physician's speech.
- In summary, a report such as the report 111 may be more desirable than a verbatim transcript for a variety of reasons (e.g., because it organizes information in a way that facilitates understanding). It would, therefore, be desirable for an automatic transcription system to be capable of generating a structured report (rather than a verbatim transcript) based on unstructured speech.
- Referring to FIG. 1A, a dataflow diagram is shown of a prior art system 100 for generating a structured document 110 based on a spoken audio stream 102. Such a system produces the structured textual document 110 from the spoken audio stream 102 using a two-step process: (1) an automatic speech recognizer 104 generates a verbatim transcript 106 based on the spoken audio stream 102; and (2) a natural language processor 108 identifies structure in the transcript 106 and thereby creates the structured document 110, which has the same content as the transcript 106, but which is organized into the structure (e.g., report format) identified by the natural language processor 108.
- For example, some existing systems attempt to generate structured textual documents by: (1) analyzing the spoken audio stream 102 to identify and distinguish spoken content in the audio stream 102 from explicit or implicit structural hints in the audio stream 102; (2) converting the "content" portions of the spoken audio stream 102 into raw text; and (3) using the identified structural hints to convert the raw text into the structured report 110. Examples of explicit structural hints include formatting commands (e.g., "new paragraph," "new line," "next item") and paragraph identifiers (e.g., "findings," "impression," "conclusion"). Examples of implicit structural hints include long pauses that may denote paragraph boundaries, prosodic cues that indicate ends of enumerations, and the spoken content itself.
- For various reasons described in more detail below, the structured document 110 produced by the system 100 may be sub-optimal. For example, the structured document 110 may contain incorrectly transcribed (i.e., misrecognized) words, the structure of the structured document 110 may fail to reflect the desired document structure, and content from the spoken audio stream 102 may be inserted into the wrong sub-structures (e.g., sections, paragraphs, or sentences) in the structured document.
- Furthermore, in addition to or instead of generating the structured document 110 based on the spoken audio stream 102, it may be desirable to extract semantic content (such as information about medications, allergies, or previous illnesses of the patient described in the audio stream 102) from the spoken audio stream 102. Although such semantic content may be useful for generating the structured document 110, such content may also be useful for other purposes, such as populating a database of patient information that can be analyzed independently of the document 110. Prior art systems, such as the system 100 shown in FIG. 1A, however, typically are designed to generate the structured document 110 based primarily or solely on syntactic information in the spoken audio stream 102. Such systems, therefore, are not useful for extracting semantic content.
- What is needed, therefore, are improved techniques for generating structured documents based on spoken audio streams.
- Techniques are disclosed for automatically generating structured documents based on speech, including identification of relevant concepts and their interpretation.
- In one embodiment, a structured document generator uses an integrated process to generate a structured textual document (such as a structured textual medical report) based on a spoken audio stream. The spoken audio stream may be recognized using a language model which includes a plurality of sub-models arranged in a hierarchical structure. Each of the sub-models may correspond to a concept that is expected to appear in the spoken audio stream. For example, sub-models may correspond to document sections. Sub-models may, for example, be n-gram language models or context-free grammars.
- Different portions of the spoken audio stream may be recognized using different sub-models. The resulting structured textual document may have a hierarchical structure that corresponds to the hierarchical structure of the language sub-models that were used to generate the structured textual document.
- For example, in one aspect of the present invention, a method is provided which includes steps of: (A) identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of sub-structures of a document; and (B) using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into the plurality of sub-structures, wherein the content in each of the plurality of sub-structures is produced by recognizing speech using the probabilistic language model associated with the sub-structure. Another aspect of the present invention is directed to the probabilistic language model identified in step (A).
- In yet another aspect of the present invention, a data structure is provided which includes: a plurality of language models logically organized in a hierarchy, the plurality of language models including a first language model and a second language model; wherein the first language model is a parent of the second language model in the hierarchy; wherein the first language model is suitable for recognizing speech representing a first concept associated with a substructure of a document; and wherein the second language model is suitable for recognizing speech representing a second concept associated with a subset of the substructure of the document.
- In a further aspect of the present invention, a method is provided which includes steps of: (A) identifying a probabilistic language model including a plurality of probabilistic language models associated with a plurality of concepts logically organized in a first hierarchy; (B) using a speech recognition decoder to apply the probabilistic language model to a spoken audio stream to produce a document including content organized into a plurality of sub-structures logically organized in a second hierarchy having a logical structure defined by a path through the first hierarchy.
- Other features and advantages of various aspects and embodiments of the present invention will become apparent from the following description and from the claims.
- FIG. 1A is a dataflow diagram of a prior art system for generating a structured document based on a spoken audio stream;
- FIG. 1B illustrates a textual medical report generated based on a spoken report;
- FIG. 2 is a flowchart of a method that is performed in one embodiment of the present invention to generate a structured textual document based on a spoken document;
- FIG. 3 is a dataflow diagram of a system that performs the method of FIG. 2 in one embodiment of the present invention;
- FIG. 4 illustrates an example of a spoken audio stream in one embodiment of the present invention;
- FIG. 5 illustrates a structured textual document according to one embodiment of the present invention;
- FIG. 6 is an example of a rendered document that is rendered based on the structured textual document of FIG. 5 according to one embodiment of the present invention;
- FIG. 7 is a flowchart of a method that is performed by the structured document generator of FIG. 3 in one embodiment of the present invention to generate a structured textual document;
- FIG. 8 is a dataflow diagram illustrating a portion of the system of FIG. 3 in detail relevant to the method of FIG. 7 according to one embodiment of the present invention;
- FIG. 9 is a diagram illustrating mappings between language models, document sub-structures corresponding to the language models, and candidate contents produced using the language models according to one embodiment of the present invention;
- FIG. 10A is a diagram illustrating a hierarchical language model according to one embodiment of the present invention;
- FIG. 10B is a diagram illustrating a path through the hierarchical language model of FIG. 10A according to one embodiment of the present invention;
- FIG. 10C is a diagram illustrating a hierarchical language model according to another embodiment of the present invention;
- FIG. 11A is a flowchart of a method that is performed by the structured document generator of FIG. 3 to generate a structured textual document according to one embodiment of the present invention;
- FIG. 11B is a flowchart of a method which uses an integrated process to select a path through a hierarchical language model and to generate a structured textual document based on speech according to one embodiment of the present invention;
- FIGS. 11C-11D are flowcharts of methods that are performed in one embodiment of the present invention to calculate a fitness score for a candidate document;
- FIG. 12A is a dataflow diagram illustrating a portion of the system of FIG. 3 in detail relevant to the method of FIG. 11A according to one embodiment of the present invention;
- FIG. 12B is a dataflow diagram illustrating an embodiment of the structured document generator of FIG. 3 which performs the method of FIG. 11B in one embodiment of the present invention;
- FIG. 13 is a flowchart of a method that is used in one embodiment of the present invention to generate a hierarchical language model for use in generating structured textual documents;
- FIG. 14 is a flowchart of a method that is used in one embodiment of the present invention to generate a structured textual document using distinct speech recognition and structural parsing steps; and
- FIG. 15 is a dataflow diagram of a system that performs the method of FIG. 14 according to one embodiment of the present invention.
FIG. 2 , a flowchart is shown of amethod 200 that is performed in one embodiment of the present invention to generate a structured textual document based on a spoken document. Referring toFIG. 3 , a dataflow diagram is shown of asystem 300 for performing themethod 200 ofFIG. 2 according to one embodiment of the present invention. - The
system 300 includes a spokenaudio stream 302, which may, for example, be a live or recorded spoken audio stream of a medical report dictated by a doctor. - Referring to
FIG. 4 , a textual representation of an example of the spokenaudio stream 302 is shown. InFIG. 4 , text between percentage signs represents spoken punctuation (e.g., “% comma %”, “% period %”, and “% colon %”) and explicit structural cues (e.g., “% new-paragraph %”) in theaudio stream 302. It may be seen from theaudio stream 302 illustrated inFIG. 4 that a verbatim transcript of theaudio stream 302 would not be particularly useful for purposes of understanding the diagnosis, prognosis, or other information contained in the medical report represented by theaudio stream 302. - The
system 300 also includes aprobabilistic language model 304. The term “probabilistic language model” as used herein refers to any language model which assigns probabilities to sequences of spoken words. (Probabilistic) context-free grammars and n-gram language models 306 a-e are both examples of “probabilistic language models” as that term is used herein. - In general, a context-free grammar specifies a plurality of spoken forms for a concept and associates probabilities with each of the spoken forms. A finite state grammar is an example of a context-free grammar. For example, a finite state grammar for the date Oct. 1, 1993, might include the spoken form “october first nineteen ninety three” with a probability of 0.7, the spoken form “ten one ninety three” with a probability of 0.2, and the spoken form “first october ninety three” with a probability of 0.1. The probability associated with each spoken form is an estimated probability that the concept will be spoken in that spoken form in a particular audio stream. A finite state grammar, therefore, is one kind of probabilistic language model.
- In general, an n-gram language model specifies the probability that a particular sequence of n words will occur in a spoken audio stream. Consider, for example, a “unigram” language model, for which n=1. For each word in a language, a unigram specifies the probability that the word will occur in a spoken document. A “bigram” language model (for which n=2) specifies probabilities that pairs of words will occur in a spoken document. For example, a bigram model may specify the conditional probability that the word “cat” will occur in a spoken document given that the previous word in the document was “the”. Similarly, a “trigram” language model specifies probabilities of three-word sequences, and so on. The probabilities specified by n-gram language models and finite state grammars may be obtained by training such documents using training speech and training text, as described in more detail in the above-referenced patent application entitled, “Document Transcription System Training.”
- The
probabilistic language model 304 includes a plurality of sub-models 306 a-e, each of which is a probabilistic language model. The sub-models 306 a-e may include n-gram language models and/or finite state grammars in any combination. Furthermore, as described in more detail below, each of the sub-models 306 a-e may contain further sub-models, and so on. Although five sub-models are shown inFIG. 3 , theprobabilistic language model 304 may include any number of sub-models. - The purpose of the
system 300 shown inFIG. 3 is to produce a structuredtextual document 310 which includes content from the spokenaudio stream 302, in which the content is organized into a particular structure, and where concepts are identified and interpreted in a machine-readable form. - The structured
textual document 310 includes a plurality of sub-structures 312 a-f, such as sections, paragraphs, and/or sentences. Each of the sub-structures 312 a-f may include further sub-structures, and so on. Although six sub-structures are shown inFIG. 3 , the structuredtextual document 310 may include any number of sub-structures. - For example, referring to
FIG. 5 , an example of the structuredtextual document 310 is shown. In the example illustrated inFIG. 5 , the structuredtextual document 310 is an XML document. The structuredtextual document 310 may, however, be implemented in any form. As shown inFIG. 5 , the structureddocument 310 includes six sub-structures 312 a-f, each of which may represent a section of thedocument 310. - For example, the structured
document 310 includesheader section 312 a which includes meta-data about thedocument 310, such as atitle 314 of the document 310 (“CT scan of the chest without contrast”) and thedate 316 on which thedocument 310 was dictated (“<date>22-APR-2003</date>”). Note that the content in theheader section 312 a was obtained from the beginning of the spoken audio stream 302 (FIG. 4 ). Furthermore, note that theheader section 312 a includes both flat text (i.e., the title 314) and a sub-structure (e.g., the date 316) representing a concept that has been interpreted in a machine-readable form as a triplet of values (day-month-year). - Representing the date in a machine-readable form enables the date to be stored easily in a database and to be processed more easily than if the date were stored in a textual form. For example, if multiple dates in the
audio stream 302 have been recognized and stored in machine-readable form, such dates may easily be compared to each other by a computer. As another example, statistical information about the content of theaudio stream 302, such as the average time between doctor's visits, may easily be generated if dates are stored in computer-readable form. This advantage of embodiments of the present invention applies generally not only to dates but to the recognition of any kind of semantic content and the storage of such content in machine-readable form. - The structured
document 310 further includes acomparison section 312 b, which includes content describing prior studies performed on the same patient as the patient who is the subject of the document (report) 310. Note that the content in thecomparison section 312 b was obtained from the portion of theaudio stream 302 beginning with “comparison to” and ending with “april six two thousand one”, but that thecomparison section 312 b does not include the text “comparison to,” which is an example of a section cue. The use of such cues to identify the beginning of a section or other document sub-structure will be described in more detail below. - In brief, the structured
document 310 also includes a technique section 312 c, which describes techniques that were performed in the procedures performed on the patient; afindings section 312 d, which describes the doctor's findings; and animpression section 312 e, which describes the doctor's impressions of the patient. - XML documents, such as the example structured
document 310 illustrated inFIG. 5 , typically are not intended for direct viewing by an end user. Rather, such documents typically are rendered in a form that is more easily readable before being presented to the end user. Thesystem 300, for example, includes arendering engine 314 which renders the structuredtextual document 310 based on astylesheet 316 to produce a rendereddocument 318. Techniques for generating stylesheets and for rendering documents in accordance with stylesheets are well-known to those having ordinary skill in the art. - Referring to
FIG. 6 , an example of the rendereddocument 318 is shown. The rendereddocument 318 includes five sections 602 a-e, each of which may correspond to one or more of the six sub-structures 312 a-f in the structuredtextual document 310. More specifically, the rendereddocument 318 includes aheader section 602 a, acomparison section 602 b, atechnique section 602 c, afindings section 602 d, and animpression section 602 e. Note that there may or may not be a one-to-one mapping between sections in the rendereddocument 318 and sub-structures in the structuredtextual document 310. For example, each of the sub-structures 312 a-f need not represent a distinct type of document section. If, for example, two or more of the sub-structures 312 a-f represent the same type of section (such as a header section), therendering engine 314 may render both of the sub-structures in the same section of the rendereddocument 318. - The
system 300 includes a structureddocument generator 308, which identifies the probabilistic language model 304 (step 202), and uses thelanguage model 304 to recognize the spokenaudio stream 302 and thereby to produce the structured textual document 310 (step 204). The structureddocument generator 308 may, for example, include an automaticspeech recognition decoder 320 which produces each of the sub-structures 312 a-f in the structuredtextual document 310 using a corresponding one of the sub-models 306 a-e in theprobabilistic language model 304. As is well-known to those having ordinary skill in the art, a decoder is a component of a speech recognizer which converts audio into text. Thedecoder 320 may, for example, produce sub-structure 312 a by using sub-model 306 a to recognize a first portion of the spokenaudio stream 302. Similarly, thedecoder 320 may produce sub-structure 312 b by using sub-model 306 b to recognize a second portion of the spokenaudio stream 302. - Note that there need not be a one-to-one mapping between sub-models 306 a-e in the
language model 304 and sub-structures 312 a-f in the structureddocument 310. For example, the speech recognition decoder may use the sub-model 306 a to recognize a first portion of the spokenaudio stream 302 and thereby produce sub-structure 312 a, and use thesame sub-model 306 a to recognize a second portion of the spokenaudio stream 302 and thereby produce sub-structure 312 b. In such a case, multiple sub-structures in the structuredtextual document 310 may contain content for a single semantic structure (e.g., section or paragraph). -
Sub-model 306 a may, for example, be a “header” language model which is used to recognize portions of the spokenaudio stream 302 containing content in theheader section 312 a; sub-model 306 b may, for example, be a “comparison” language model which is used to recognize portions of the spokenaudio stream 302 containing content in thecomparison section 312 b; and so on. Each such language model may be trained using training text from the corresponding section of training documents. For example, theheader sub-model 306 a may be trained using text from the header sections of a plurality of training documents, and the comparison sub-model may be trained using text from the comparison sections of the plurality of training documents. - Having generally described features of various embodiments of the present invention, embodiments of the present invention will now be described in more detail.
- Referring to
FIG. 7, a flowchart is shown of a method that is performed by the structured document generator 308 in one embodiment of the present invention to generate the structured textual document 310 (FIG. 2, step 204). Referring to FIG. 8, a dataflow diagram is shown illustrating a portion of the system 300 in detail relevant to the method of FIG. 7.
- In the example illustrated in FIG. 8, the structured document generator 308 includes a segment identifier 814 which identifies a plurality of segments S 802 a-c in the spoken audio stream 302 (step 701). The segments 802 a-c may, for example, represent concepts such as sections, paragraphs, sentences, words, dates, times, or codes. Although only three segments 802 a-c are shown in FIG. 8, the spoken audio stream 302 may include any number of segments. Although for ease of explanation, all of the segments 802 a-c are identified in step 701 of FIG. 7 prior to performing the remainder of the method 700, the identification of the segments 802 a-c may be performed concurrently with recognizing the audio stream 302 and generating the structured document 310, as will be described in more detail below with respect to FIGS. 11B and 12B.
- The structured document generator 308 enters a loop over each segment S in the spoken audio stream 302 (step 702). As described above, the structured document generator 308 includes the speech recognition decoder 320, which may, for example, include one or more conventional speech recognition decoders for recognizing speech using different kinds of language models. As further described above, each of the sub-models 306 a-e may be an n-gram language model, a context-free grammar, or a combination of both.
- Assume for purposes of example that the structured document generator 308 is currently processing segment 802 a of the spoken audio stream 302. The structured document generator 308 selects a plurality 804 of the sub-models 306 a-e with which to recognize the current segment S.
- The sub-models 804 may, for example, be all of the language sub-models 306 a-e or a subset of the sub-models 306 a-e. The speech recognition decoder 320 recognizes the current segment S (e.g., segment 802 a) with each of the selected sub-models 804, thereby producing a plurality of candidate contents 808 corresponding to segment S (step 704). In other words, each of the candidate contents 808 is produced by using the speech recognition decoder 320 to recognize the current segment S using a distinct one of the sub-models 804. Note that each of the candidate contents 808 may include not only recognized text but also other kinds of content, such as concepts (e.g., dates, times, codes, medications, allergies, vitals, etc.) encoded in machine-readable form.
- The structured document generator 308 includes a final content selector 810 which selects one of the candidate contents 808 as a final content 812 for segment S (step 706). The final content selector 810 may use any of a variety of techniques that are well-known to those of ordinary skill in the art for selecting speech recognition output that most closely matches speech from which the output was derived.
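- The per-segment loop of steps 702-712 can be summarized in the following Python sketch. It is a minimal illustration assuming a decode(segment, submodel) helper that returns recognized text together with a score; that helper and the maximum-score selection rule stand in for the speech recognition decoder 320 and the final content selector 810, which the text above leaves abstract:

```python
# Sketch of the per-segment loop (steps 702-714). decode() is a
# placeholder for the speech recognition decoder 320; the score
# semantics are an assumption for illustration.

def decode(segment, submodel):
    """Placeholder: recognize one segment with one sub-model."""
    raise NotImplementedError

def generate_structured_document(segments, submodels, section_of):
    document = {section: [] for section in section_of.values()}
    model_content_mappings = []            # corresponds to mappings 816/820
    for segment in segments:               # step 702
        candidates = []                    # step 704: one candidate per sub-model
        for name, lm in submodels.items():
            text, score = decode(segment, lm)
            candidates.append((score, text, name))
        score, final_content, best = max(candidates)   # step 706
        model_content_mappings.append(best)            # step 708
        section = section_of[best]                     # step 710
        document[section].append(final_content)        # step 712
    return document, model_content_mappings            # step 714
```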
- The structured document generator 308 keeps track of the sub-model that is used to produce each of the candidate contents 808. Assume, for purposes of example, that the sub-models 804 include all of the sub-models 306 a-e, and that the candidate contents 808 therefore include five candidate contents per segment 802 a-c (one produced using each of the sub-models 306 a-e). For example, referring to FIG. 9, a diagram is shown illustrating mappings between the document sub-structures 312 a-f, the sub-models 306 a-e, and candidate contents 808 a-e. As described above, each of the sub-models 306 a-e may be associated with one or more corresponding sub-structures 312 a-f in the structured textual document 310.
- These correspondences are indicated in FIG. 9 by mappings 902 a-e between the sub-structures 312 a-e and the sub-models 306 a-e. The structured document generator 308 may maintain such mappings 902 a-e in a table or using other means.
- When the speech recognition decoder 320 recognizes segment S (e.g., segment 802 a) with each of the sub-models 306 a-e, it produces corresponding candidate contents 808 a-e. For example, candidate content 808 a is the text that is produced when the speech recognition decoder 320 recognizes segment 802 a with sub-model 306 a, candidate content 808 b is the text that is produced when the speech recognition decoder 320 recognizes segment 802 a with sub-model 306 b, and so on. The structured document generator 308 may record the mapping between candidate contents 808 a-e and corresponding sub-models 306 a-e in a set of candidate model-content mappings 816.
- Therefore, when the structured document generator 308 selects one of the candidate contents 808 a-e as the final content 812 for segment S (step 706), a final mapping identifier 818 may use the mappings 816 and the selected final content 812 to identify the language sub-model that produced the candidate content that has been selected as the final content 812 (step 708). For example, if candidate content 808 c is selected as the final content 812, it may be seen from FIG. 9 that the final mapping identifier 818 may identify the sub-model 306 c as the sub-model that produced candidate content 808 c. The final mapping identifier 818 may accumulate each identified sub-model in the set of mappings 820, so that at any given time the mappings 820 identify the sequence of language sub-models that were used to generate the final contents that have been selected for inclusion in the structured textual document 310.
- Once the sub-model corresponding to the final content 812 has been identified, the structured document generator 308 may identify the document sub-structure associated with the identified sub-model (step 710). For example, if the sub-model 306 c has been identified in step 708, it may be seen from FIG. 9 that document sub-structure 312 c is associated with sub-model 306 c.
- A structured content inserter 822 inserts the final content 812 into the identified sub-structure of the structured text document 310 (step 712). For example, if the sub-structure 312 c is identified in step 710, the content inserter 822 inserts the final content 812 into sub-structure 312 c.
- The structured document generator repeats steps 704-712 for the remaining segments 802 b-c of the spoken audio stream 302 (step 714), thereby generating final content 812 for each of the remaining segments 802 b-c and inserting the final content 812 into the appropriate ones of the sub-structures 312 a-f of the textual document 310. Upon conclusion of the method 700, the structured textual document 310 includes text corresponding to the spoken audio stream 302, and the final model-content mappings 820 identify the sequence of language sub-models that were used by the speech recognition decoder 320 to generate the text in the structured textual document 310.
- Note that in the process of recognizing the spoken audio stream 302, the method 700 may not only generate text corresponding to the spoken audio, but may also identify semantic information represented by the audio and store such semantic information in a machine-readable form. For example, referring again to FIG. 5, the comparison section 312 b includes a date element in which a particular date is represented as a triplet containing individual values for the day ("06"), month ("APR"), and year ("2001"). Other examples of semantic concepts in the medical domain include vital signs, medications and their dosages, allergies, and medical codes. Extracting and representing semantic information in this way facilitates automated processing of such information. Note that the particular form in which semantic information is represented in FIG. 5 is merely an example and does not constitute a limitation of the present invention.
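- Such a machine-readable date triplet might be represented as follows. Only the comparison and date elements and the day/month/year values are taken from the description of FIG. 5; the surrounding code is an illustrative sketch:

```python
import xml.etree.ElementTree as ET

# Build the kind of structured content described above: a comparison
# section whose date is stored as a machine-readable day/month/year
# triplet rather than as flat text.
comparison = ET.Element("comparison")
comparison.text = "Prior studies from "
date = ET.SubElement(comparison, "date")
ET.SubElement(date, "day").text = "06"
ET.SubElement(date, "month").text = "APR"
ET.SubElement(date, "year").text = "2001"

print(ET.tostring(comparison, encoding="unicode"))
# <comparison>Prior studies from <date><day>06</day>...</date></comparison>
```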
- Recall from step 701 that the method 700 shown in FIG. 7 identifies the set of segments 802 a-c before identifying the sub-models to be used to recognize the segments 802 a-c. Note, however, that the structured document generator 308 may integrate the process of identifying the segments 802 a-c with the process of identifying the sub-models to be used to recognize the segments 802 a-c, and with the process of performing speech recognition on the segments 802 a-c. Examples of techniques that may be used to perform such integrated segmentation and recognition will be described in more detail below with respect to FIGS. 11B and 12B.
- Having generally described the operation of the method illustrated in FIG. 7, consider now the application of the method of FIG. 7 to the example audio stream 302 shown in FIG. 4. Assume that the first portion of the spoken audio stream 302 is the spoken stream of utterances: "CT scan of the chest without contrast april twenty second two thousand three". This portion may be selected in step 702 and recognized using all of the language sub-models 306 a-e in step 704 to produce a plurality of candidate contents 808 a-e. As described above, assume that sub-model 306 a is a "header" language model, that sub-model 306 b is a "comparison" language model, that sub-model 306 c is a "technique" language model, that sub-model 306 d is a "findings" language model, and that sub-model 306 e is an "impression" language model.
- Because sub-model 306 a is a language model which has been trained to recognize speech in the "header" section of the document 310 (e.g., sub-structure 312 a), it is likely that the candidate content 808 a produced using sub-model 306 a will match the words in the above-referenced audio portion more closely than the other candidate contents 808 b-e. Assuming that the candidate content 808 a is selected as the final content 812 for this audio portion, the content inserter 822 will insert the final content 812 produced by sub-model 306 a into the header section 312 a of the structured text document 310.
- Assume that the second portion of the spoken audio stream is the spoken stream of utterances: "comparison to prior studies from march twenty six two thousand two and april six two thousand one". This portion may be selected in step 702 and recognized using all of the language sub-models 306 a-e in step 704 to produce a plurality of candidate contents 808 a-e. Because sub-model 306 b is a language model which has been trained to recognize speech in the "comparison" section of the document 310 (e.g., sub-structure 312 b), it is likely that the candidate content 808 b produced using sub-model 306 b will match the words in the above-referenced audio portion more closely than the other candidate contents 808 a and 808 c-e. Assuming that the candidate content 808 b is selected as the final content 812 for this audio portion, the content inserter 822 will insert the final content 812 produced by sub-model 306 b into the comparison section 312 b of the structured text document 310.
- The remainder of the audio stream 302 illustrated in FIG. 4 may be recognized and inserted into appropriate ones of the sub-structures 312 a-f in the structured textual document 310 in a similar manner. Note that although content in the spoken audio stream 302 illustrated in FIG. 4 appears in the same sequence as the sections 312 a-f in the structured textual document 310, this is not a requirement of the present invention. Rather, content may appear in the audio stream 302 in any order. Each of the segments 802 a-c of the audio stream 302 is recognized by the speech recognition decoder 320, and the resulting final content 812 is inserted into the appropriate one of the sub-structures 312 a-f. As a result, the order of the textual content in the sub-structures 312 a-f may not be the same as the order of the content in the spoken audio stream. Note, however, that even if the order of textual content is the same in both the audio stream 302 and the structured textual document 310, the rendering engine 314 (FIG. 3) may render the textual content of the document 310 in any desired order.
- In another embodiment of the present invention, the probabilistic language model 304 is a hierarchical language model. In particular, in this embodiment the plurality of sub-models 306 a-e are organized in a hierarchy. As described above, the sub-models 306 a-e may further include additional sub-models, and so on, so that the hierarchy of the language model 304 may include multiple levels.
- Referring to FIG. 10A, a diagram is shown illustrating an example of the language model 304 in hierarchical form. The language model 304 includes a plurality of nodes 1002, 306 a-e, 1006 a-e, 1010, and 1012. Square nodes (namely the root node 1002, the section-cue nodes 306 b-e, the "enum" node 1006 e, and the date node 1012) contain finite state grammars which model highly-constrained language, while elliptical nodes 306 a, 1006 a-d, and 1010 use statistical (n-gram) language models to model less-constrained language.
- The term "concept" as used herein includes, for example, dates, times, numbers, codes, medications, medical history, diagnoses, prescriptions, phrases, enumerations and section cues. A concept may be spoken in a plurality of ways. Each way of speaking a particular concept is referred to herein as a "spoken form" of the concept. A distinction is sometimes made between "semantic" concepts and "syntactic" concepts. The term "concept" as used herein includes both semantic concepts and syntactic concepts, but is not limited to either and does not rely on any particular definition of "semantic concept" or "syntactic concept" or on any distinction between the two.
- Consider, for example, the date Oct. 1, 1993, which is an example of a concept as that term is used herein. Spoken forms of this concept include the spoken phrases, "october first nineteen ninety three," "one october ninety three," and "ten dash one dash ninety three." Text such as "Oct. 1, 1993" and "10-1-93" provides examples of "written forms" of this concept.
- Now consider the sentence “John Jones has pneumonia.” This sentence, which is a concept as that term is used herein, may be spoken in a plurality of ways, such as the spoken phrases, “john jones has pneumonia,” “patient jones diagnosis pneumonia,” and “diagnosis pneumonia patient jones.” The written sentence “John Jones has pneumonia” is an example of a “written form” of the same concept.
- Although language models for low-level concepts such as dates and times are not shown in
FIG. 10A (except for sub-model 1012), the hierarchical language model 304 may include sub-models for such low-level concepts. For example, the n-gram sub-models 306 a, 1006 a-d, and 1010 may assign probabilities to sequences of words representing dates, times, and other low-level concepts.
- The language model 304 includes root node 1002, which contains a finite state grammar representing the probabilities of occurrence of node 1002's sub-nodes 306 a-e. The root node 1002 may, for example, indicate probabilities of the header, comparison, technique, findings, and impression sections of the document 310 appearing in particular orders in the spoken audio stream 302.
- Moving down one level in the hierarchy of language model 304, node 306 a is a "header" node, which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the header section 312 a of the structured textual document 310.
- Node 306 b contains a "comparison" finite state grammar representing probabilities of occurrence of a variety of alternative spoken forms of cues for the comparison section 312 b of the textual document. The finite state grammar in the comparison node 306 b may, for example, include cues such as "comparison to", "comparison for", "prior is", and "prior studies are". The finite state grammar may include a probability for each of these cues. Such probabilities may, for example, be based on observed frequencies of use of the cues in a set of training speech for the same speaker or in the same domain as the spoken audio stream 302. Such frequencies may be obtained, for example, using the techniques disclosed in the above-referenced patent application entitled "Document Transcription System Training."
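- Such a cue grammar amounts to a small table of spoken forms and probabilities estimated from observed frequencies. In the following sketch the counts are invented for illustration:

```python
# Estimate cue probabilities for the "comparison" finite state grammar
# from observed frequencies in training speech. The counts are made up.
cue_counts = {
    "comparison to": 41,
    "comparison for": 7,
    "prior is": 12,
    "prior studies are": 20,
}
total = sum(cue_counts.values())
comparison_cue_grammar = {cue: n / total for cue, n in cue_counts.items()}
print(comparison_cue_grammar["comparison to"])  # 0.5125
```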
- The comparison node 306 b includes a "comparison content" sub-node 1006 a, which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the body of the comparison section 312 b of the textual document 310. The comparison content node 1006 a has a date node 1012 as a child. As will be described in more detail below, the date node 1012 is a finite state grammar representing probabilities of the date being spoken in various ways.
- Nodes 306 c and 306 d may be understood similarly. Node 306 c contains a "technique" finite state grammar representing probabilities of occurrence of a variety of alternative spoken forms of cues for the technique section 312 c of the textual document 310. The technique node 306 c includes a "technique content" sub-node 1006 b, which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the body of the technique section 312 c of the textual document 310. Similarly, node 306 d contains a "findings" finite state grammar representing probabilities of occurrence of a variety of alternative spoken forms of cues for the findings section 312 d of the textual document 310. The findings node 306 d includes a "findings content" sub-node 1006 c, which is an n-gram language model representing probabilities of occurrence of words in portions of the spoken audio stream 302 intended for inclusion in the body of the findings section 312 d of the textual document 310.
- Impression node 306 e is similar to nodes 306 b-d, in that it includes a finite state grammar for recognizing section cues and a sub-node 1006 d including an n-gram language model for recognizing section content. In addition, however, the impression node 306 e includes an additional sub-node 1006 e, which in turn includes a sub-node 1010. This indicates that the content of the impression section may be recognized using either the language model in the impression content node 1006 d or the "enum" node 1006 e, governed by the finite state grammar-based language model corresponding to impression node 306 e. The "enum" node 1006 e contains a finite state grammar indicating probabilities associated with different ways of speaking enumeration cues (such as "number one," "number two," "first," "second," "third," and so on). The impression content node 1010 may include the same language model as the impression content node 1006 d.
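- One possible in-memory representation of this hierarchy is sketched below. The two node kinds mirror the finite state grammar and n-gram nodes of FIG. 10A, but the class design and names are assumptions:

```python
from dataclasses import dataclass, field

# Minimal data model for a hierarchical language model. "grammar" nodes
# stand for finite state grammars, "ngram" nodes for statistical models.
@dataclass
class LMNode:
    name: str
    kind: str                 # "grammar" or "ngram"
    children: list = field(default_factory=list)

date = LMNode("date", "grammar")                       # node 1012
root = LMNode("root", "grammar", [                     # node 1002
    LMNode("header_content", "ngram"),                 # 306a
    LMNode("comparison", "grammar", [                  # 306b
        LMNode("comparison_content", "ngram", [date]), # 1006a
    ]),
    LMNode("technique", "grammar", [                   # 306c
        LMNode("technique_content", "ngram"),          # 1006b
    ]),
    LMNode("findings", "grammar", [                    # 306d
        LMNode("findings_content", "ngram"),           # 1006c
    ]),
    LMNode("impression", "grammar", [                  # 306e
        LMNode("impression_content", "ngram"),         # 1006d
        LMNode("enum", "grammar",                      # 1006e
               [LMNode("impression_content", "ngram")]),  # 1010
    ]),
])
```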
- Having described the hierarchical structure of the language model 304 in one embodiment of the present invention, examples of techniques that may be used to generate the structured document 310 using the language model 304 will now be described. Referring to FIG. 11A, a flowchart is shown of a method that is performed by the structured document generator 308 in one embodiment of the present invention to generate the structured textual document 310 (FIG. 2, step 204). Referring to FIG. 12A, a dataflow diagram is shown illustrating a portion of the system 300 in detail relevant to the method of FIG. 11A.
- The structured document generator 308 includes a path selector 1202 which identifies a path 1204 through the hierarchical language model 304 (step 1102). The path 1204 is an ordered sequence of nodes in the hierarchical language model 304. Nodes may be traversed multiple times in the path 1204. Examples of techniques for generating the path 1204 will be described in more detail below with respect to FIGS. 11B and 12B.
- Referring to FIG. 10B, an example of the path 1204 is illustrated. The path 1204 includes points 1020 a-j, which specify a sequence in which to traverse nodes in the language model 304. Points 1020 a-j are referred to as "points" rather than "nodes" to distinguish them from nodes 1002, 306 a-e, 1006 a-e, and 1010 in the language model 304.
- In the example illustrated in FIG. 10B, path 1204 traverses the following nodes of language model 304 in sequence: (1) root node 1002 (point 1020 a); (2) header content node 306 a (point 1020 b); (3) comparison node 306 b (point 1020 c); (4) comparison content node 1006 a (point 1020 d); (5) technique node 306 c (point 1020 e); (6) technique content node 1006 b (point 1020 f); (7) findings node 306 d (point 1020 g); (8) findings content node 1006 c (point 1020 h); (9) impression node 306 e (point 1020 i); and (10) impression content node 1006 d (point 1020 j).
- As may be seen by reference to FIG. 4, recognizing the spoken audio stream 302 using the language sub-models falling along the path 1204 illustrated in FIG. 10B would result in optimal speech recognition, since speech in the audio stream 302 occurs in the same sequence as the language sub-models in the path 1204 illustrated in FIG. 10B. For example, the spoken audio stream 302 begins with speech that is best recognized by the header content language model 306 a ("CT scan of the chest without contrast april twenty second two thousand three"), followed by speech that is best recognized by the comparison language model 306 b ("comparison to"), followed by speech that is best recognized by the comparison content language model 1006 a ("prior studies from march twenty six two thousand two and april six two thousand one"), and so on.
- Having identified the path 1204, the structured document generator 308 recognizes the spoken audio stream 302 using the language models traversed by the path 1204 to produce the structured textual document 310 (step 1104). As described in more detail below with respect to FIGS. 11B and 12B, the speech recognition and structured textual document generation of step 1104 may be integrated with the path identification of step 1102, rather than performed separately.
- More specifically, the structured document generator 308 may include a node enumerator 1206 which iterates over each of the language model nodes N 1208 traversed by the selected path 1204 (step 1106). For each such node N, the speech recognition decoder 320 may recognize the portion of the audio stream 302 corresponding to the language model at node N to produce corresponding structured text T (step 1108). The structured document generator 308 may insert text T 1210 into the substructure of the structured textual document 310 corresponding to node N 1208 of the language model 304 (step 1110).
- For example, when node N is the comparison node 306 b (FIG. 10A), the comparison node 306 b may be used to recognize the text "comparison to" in the spoken audio stream 302 (FIG. 4). Because comparison node 306 b corresponds to a document sub-structure (e.g., the comparison section 312 b) rather than to content, the result of the speech recognition performed in step 1108 in this case may be a document substructure, namely an empty "comparison" section. Such a section may be inserted into the structured document 310 in step 1110, for example, in the form of matching "<comparison>" and "</comparison>" tags.
- When node N is the comparison content node 1006 a (FIG. 10A), the comparison content node 1006 a may be used to recognize the text "prior studies from march twenty six two thousand two and april six two thousand one" in the spoken audio stream 302 (FIG. 4), thereby producing the structured text "Prior studies from <date>26-MAR-2002</date> and <date>06-APR-2001</date>", as shown in FIG. 5. This structured text may then be inserted into the comparison section 312 b in step 1110 (e.g., between the "<comparison>" and "</comparison>" tags, as shown in FIG. 5).
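- Steps 1108-1110 for these two nodes might be sketched as follows, assuming the section cue and the structured content have already been recognized; the use of Python's ElementTree here is an illustrative choice, not a disclosed implementation:

```python
import xml.etree.ElementTree as ET

# Sketch of step 1110: create an empty section when a cue node is
# recognized, then fill it with structured content from the content node.
report = ET.Element("report")

# Cue node recognized ("comparison to") -> empty <comparison> section.
section = ET.SubElement(report, "comparison")

# Content node recognized -> structured text with embedded <date> elements.
section.text = "Prior studies from "
d1 = ET.SubElement(section, "date"); d1.text = "26-MAR-2002"
d1.tail = " and "
d2 = ET.SubElement(section, "date"); d2.text = "06-APR-2001"

print(ET.tostring(report, encoding="unicode"))
```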
- The structured document generator 308 repeats steps 1108-1110 for the remaining nodes N traversed by the path 1204 (step 1112), thereby inserting a plurality of structured texts 1210 into the structured textual document 310. The end result of the method illustrated in FIG. 11A is the creation of the structured textual document 310, which contains text having a structure that corresponds to the structure of the path 1204 through the language model 304. For example, it can be seen from FIG. 10B that the structure of the illustrated path traverses language model nodes corresponding to the header, comparison, technique, findings, and impression sections in sequence. The resulting structured textual document 310 (as illustrated, for example, in FIG. 5) similarly includes header, comparison, technique, findings, and impression sections in sequence. The structured textual document 310 therefore has the same structure as the language model path 1204 that was used to create the structured textual document 310.
- It was stated above that the structured document generator 308 inserts recognized structured text 1210 into the appropriate sub-structures of the structured textual document 310 (FIG. 11A, step 1110). As shown in FIG. 5, the structured textual document 310 may be implemented as an XML document or other document which supports nested structures.
- In such a case, it is necessary to insert each of the recognized structured texts 1210 inside of the appropriate substructure so that the final structured textual document 310 has a structure that corresponds to the structure of the path 1204. Those having ordinary skill in the art will understand how to use the final model-content mappings 820 (FIG. 8) to use the path 1204 to traverse the structure of the language model 304 and thereby to create such a structured document.
- The system illustrated in FIG. 12A includes path selector 1202, which selects a path 1204 through the language model 304. The method illustrated in FIG. 11A then uses the selected path 1204 to generate the structured textual document 310. In other words, in FIGS. 11A and 12A, the steps of path selection and structured document creation are performed separately. This is not, however, a limitation of the present invention.
- Rather, referring to FIG. 11B, a flowchart is shown of a method 1150 which integrates the steps of path selection and structured document generation. Referring to FIG. 12B, an embodiment of the structured document generator 308 is shown which performs the method 1150 of FIG. 11B in one embodiment of the present invention. In overview, the method 1150 of FIG. 11B searches for possible paths through the hierarchy of the language model 304 (FIG. 10A), beginning at the root node 1002 and expanding outward. Any of a variety of techniques, including techniques well-known to those of ordinary skill in the art, may be used to search through the language model hierarchy. As the method 1150 identifies partial paths through the language model hierarchy, the method 1150 uses the speech recognition decoder 320 to recognize increasingly large portions of the spoken audio stream 302 using the language models falling along the partial paths, thereby creating partial candidate structured documents. The method 1150 assigns fitness scores to each of the partial candidate structured documents. The fitness score for each candidate structured document is a measure of how well the path that produced the candidate structured document has performed. The method 1150 expands the partial paths, thereby continuing to search through the language model hierarchy, until the entire spoken audio stream 302 has been recognized. The structured document generator 308 selects the candidate structured document having the highest fitness score as the final structured textual document 310.
- More specifically, the method 1150 initializes one or more candidate paths 1224 through the language model 304 (step 1152). For example, the candidate paths 1224 may be initialized to contain a single path consisting of the root node 1002. The term "frame" refers herein to a short period of time, such as 10 milliseconds. The method 1150 initializes an audio stream pointer to point to the first frame in the audio stream 302 (step 1153). For example, in the embodiment illustrated in FIG. 12B, the structured document generator 308 contains an audio stream enumerator 1240 which provides a portion 1242 of the audio stream 302 to the speech recognition decoder 320. Upon initiation of the method 1150, the portion 1242 may solely contain the first frame of the audio stream 302.
- The speech recognition decoder 320 recognizes the current portion 1242 of the audio stream 302 using the language sub-models in the candidate path(s) 1224 to generate one or more candidate structured partial documents 1232 (step 1154). Note that the documents 1232 are only partial documents 1232 because they have been generated based on only a portion of the audio stream 302. When step 1154 is first performed, the speech recognition decoder 320 may simply recognize the first frame of the audio stream 302 using the language model at the root node 1002 of the language model 304.
- Note that the techniques disclosed above with respect to FIG. 11A and FIG. 12A may be used by the speech recognition decoder 320 to generate the candidate structured partial documents 1232 using the candidate paths 1224. More specifically, the speech recognition decoder 320 may apply the methods illustrated in FIG. 11A to the audio stream portion 1242 using each of the candidate paths 1224 as the path identified in step 1102 (FIG. 11A).
- Returning to FIGS. 11B and 12B, a fitness evaluator 1234 generates fitness scores 1236 for each of the candidate structured partial documents 1232 (step 1156). The fitness scores 1236 are measures of how well the candidate structured partial documents 1232 represent the corresponding portion of the audio stream 302. In general, the fitness score for a single candidate document may be generated by: (1) generating fitness scores for each of the nodes in the corresponding one of the candidate paths 1224; and (2) using a synthesis function to synthesize the individual node fitness scores generated in step (1) into an overall fitness score for the candidate structured document. Examples of techniques that may be used to generate the candidate fitness scores 1236 will be described in more detail below with respect to FIG. 11C.
- If the structured document generator 308 were to attempt to search for all possible paths through the hierarchy of the language model 304, the computational resources required to evaluate each possible path might become prohibitively costly and/or time-consuming due to the exponential growth in the number of possible paths.
- Therefore, in the embodiment illustrated in FIG. 12B, a path pruner 1230 uses the candidate fitness scores 1236 to remove poorly-fitting paths from the candidate paths 1224, thereby producing a set of pruned paths 1222 (step 1158).
- If the entire audio stream 302 has been recognized (step 1160), a final document selector 1238 selects, from among the candidate structured partial documents 1232, the candidate structured document having the highest fitness score, and provides the selected document as the final structured textual document 310 (step 1164). If the entire audio stream 302 has not been recognized, a path extender 1220 extends the pruned paths 1222 within the language model 304 to produce a new set of candidate paths 1224 (step 1162). If, for example, the pruned paths 1222 consist of a single path containing the root node 1002, the path extender 1220 may extend this path by one node downward in the hierarchy illustrated in FIG. 10A to produce a plurality of candidate paths extending from the root node 1002, such as a path from the root node 1002 to the header content node 306 a, a path from the root node 1002 to the comparison node 306 b, a path from the root node 1002 to the technique node 306 c, and so on. Various techniques for extending the paths 1224 to perform depth-first, breadth-first, or other kinds of hierarchical searches are well-known to those having ordinary skill in the art.
- The audio stream enumerator 1240 extends the portion 1242 of the audio stream 302 to include the next frame in the audio stream 302 (step 1163). Steps 1154-1160 are then repeated by using the new candidate paths 1224 to recognize the portion 1242 of the audio stream 302. In this way the entire audio stream 302 may be recognized using appropriate sub-models in the language model 304.
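- The overall loop of steps 1152-1164 can be summarized in the following skeleton. The helpers extend, recognize_partial, and fitness stand in for the path extender 1220, the speech recognition decoder 320, and the fitness evaluator 1234, and the beam width is an assumed pruning parameter; none of these specifics are disclosed herein:

```python
# Skeleton of the integrated search (method 1150). Helper functions
# are placeholders for components the text leaves abstract.
BEAM = 16  # assumed beam width for pruning (step 1158)

def search(audio_frames, root, extend, recognize_partial, fitness):
    candidate_paths = [[root]]                        # step 1152
    portion = []                                      # step 1153
    for i, frame in enumerate(audio_frames):
        portion.append(frame)                         # step 1163
        scored = []
        for path in candidate_paths:                  # step 1154
            doc = recognize_partial(portion, path)
            scored.append((fitness(doc), path, doc))  # step 1156
        scored.sort(key=lambda s: s[0], reverse=True)
        if i == len(audio_frames) - 1:                # step 1160: audio exhausted
            return scored[0][2]                       # step 1164: best document
        pruned = [path for _, path, _ in scored[:BEAM]]                # step 1158
        candidate_paths = [p2 for p in pruned for p2 in extend(p)]     # step 1162
    return None  # empty audio stream
```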
- As described above with respect to FIGS. 11B and 12B, fitness scores 1236 may be generated for each of the candidate structured partial documents 1232 produced by the structured document generator 308 while evaluating candidate paths 1224 through the language model 304. Examples of techniques will now be described for generating fitness scores, either for the candidate structured partial documents 1232 illustrated in FIG. 12B or for structured documents more generally.
- For example, referring to FIG. 10A, note that the comparison content node 1006 a has a date node 1012 as a child. Assume that the text "CT scan of the chest without contrast april twenty second two thousand three" has been recognized as text corresponding to the comparison content node 1006 a. Note that the comparison content node 1006 a was used to recognize the text "CT scan of the chest without contrast" and that the date node 1012, which is a child of the comparison content node 1006 a, was used to generate the text "april twenty second two thousand three". The fitness score for this text may, therefore, be calculated by using the comparison content node 1006 a to calculate a first fitness score for the text "CT scan of the chest without contrast" followed by any date, calculating a second fitness score for the text "april twenty second two thousand three" based on the date node 1012, and multiplying the first and second fitness scores.
- Referring to FIG. 11C, a flowchart is shown of a method that is performed in one embodiment of the present invention to calculate a fitness score for a candidate document, and which may therefore be used to implement step 1156 of the method 1150 illustrated in FIG. 11B. A fitness score S is initialized to a value of one for the candidate structured document being evaluated (step 1172). The method assigns a current node pointer N to point to the root node in the candidate path corresponding to the candidate document (step 1174).
- Referring to
FIG. 1D , a flowchart is shown of the Fitness( )function 1180 according to one embodiment of the present invention. Thefunction 1180 identifies the probability P(W(N)) that the text W corresponding to the current node N has been recognized by the language model associated with that node, and multiplies the probability by the current value of S to produce a new value for S (step 1184). - If node N has no children (step 1186), the value of S is returned (step 1194). If node N has children, then the Fitness( )
function 1180 is called recursively on each of the child nodes, with the results being multiplied by the value of S to produce new values of S (steps 1188-1192). The resulting value of S is returned (step 1194). - Upon completion of the method illustrated in
- Upon completion of the method illustrated in FIG. 11C, the value of S represents a fitness score for the entire candidate structured document, and the value of S is returned, e.g., for use in the method 1150 illustrated in FIG. 11B (step 1194).
- More generally, the effect of the method illustrated in
FIG. 11C is to hierarchically factor probabilities of word sequences according to the hierarchy of thelanguage model 304, allowing the individual probability estimates associated with each language model node to be seamlessly combined with the probability estimates associated with other nodes. This probabilistic framework allows the system to model and use statistical language models with embedded probabilistic finite state grammars and finite state grammars with embedded statistical language models. - As described above, nodes in the
language model 304 represent language sub-models which specify the probabilities of occurrence of sequences of words in the spokenaudio stream 302. In the preceding discussion, it has been assumed that the probabilities have already been assigned in such language models. Examples of techniques will now be disclosed for assigning probabilities to the language sub-models (such as n-gram language models and context-free grammars) in thelanguage model 304. - Referring to
- Referring to FIG. 13, a flowchart is shown of a method 1300 that is used in one embodiment of the present invention to generate the language model 304. A plurality of nodes are selected for use in the language model (step 1302). The nodes may, for example, be selected by a transcriptionist or other person skilled in the relevant domain. The nodes may be selected in an attempt to capture all of the types of concepts that may occur in the spoken audio stream 302. For example, in the medical domain, nodes (such as those shown in FIG. 10A) may be selected which represent the sections of a medical report and the concepts (such as dates, times, medications, allergies, vital signs and medical codes) which are expected to occur in a medical report.
- A concept and language model type may be assigned to each of the nodes selected in step 1302 (steps 1304-1306). For example, node 306 b (FIG. 10A) may be assigned the concept "comparison section cue" and be assigned the language model type "finite state grammar." Similarly, node 1006 a may be assigned the concept "comparison content" and the language model type "n-gram language model."
- The nodes selected in step 1302 may be arranged into a hierarchical structure (step 1308). For example, the nodes 1002, 306 a-e, 1006 a-e, and 1010 may be arranged into the hierarchical structure illustrated in FIG. 10A to represent and enforce structural dependencies between the nodes.
- Each of the nodes selected in step 1302 may then be trained using text representing a corresponding concept (step 1310). For example, a set of training documents may be identified. The set of training documents may, for example, be a set of existing medical reports or other documents in the same domain as the spoken audio stream 302. The training documents may be marked up manually to indicate the existence and location of structures in the document, such as sections, sub-sections, dates, times, codes, and other concepts. Such markup may, for example, be performed automatically on formatted documents, or manually by a transcriptionist or other person skilled in the relevant domain. Examples of techniques for training the nodes selected in step 1302 are described in the above-referenced patent application entitled "Document Transcription System Training."
- Conventional language model training techniques may be used in step 1310 to train concept-specific language models for each of the concepts that is marked up in the training documents. For example, the text from all of the marked-up "header" sections in the training documents may be used to train the language model node 306 a representing the header section. In this way, language models for each of the nodes 1002, 306 a-e, 1006 a-e, and 1010 in the language model 304 illustrated in FIG. 10A may be trained. The result of the method 1300 illustrated in FIG. 13 is a hierarchical language model having trained probabilities, which can be used to generate the structured textual document 310 in the manner described above. This hierarchical language model may then be used, for example, to iteratively re-segment the training text, such as by using the techniques disclosed above in conjunction with FIGS. 11B and 12B. The re-segmented training text may then be used to retrain the hierarchical language model.
- In the examples described above, the structured
document generator 308 both recognizes the spokenaudio stream 302 and generates the structuredtextual document 310 using an integrated process, within generating an intermediate non-structured transcript. Such techniques, however, are disclosed merely for purposes of example and do not constitute limitations of the present invention. - Referring to
FIG. 14 , a flowchart is shown of amethod 1400 that is used in another embodiment of the present invention to generate the structuredtextual document 310 using distinct speech recognition and structural parsing steps. Referring toFIG. 15 , a dataflow diagram is shown of asystem 1500 that performs themethod 1400 ofFIG. 14 according to one embodiment of the present invention. - The
- The speech recognition decoder 320 recognizes the spoken audio stream 302 using a language model 1506 to produce a transcript 1502 of the spoken audio stream 302. Note that the language model 1506 may be a conventional language model that is distinct from the language model 304. More specifically, the language model 1506 may be a conventional monolithic language model. The language model 1506 may, for example, be generated using the same training corpus as is used to train the language model 304. While portions of the training corpus may be used to train nodes of the language model 304, the entire corpus may be used to train the language model 1506. The speech recognition decoder 320 may, therefore, use conventional speech recognition techniques to recognize the spoken audio stream 302 using the language model 1506 and thereby to produce the transcript 1502.
- Note that the transcript 1502 may be a "flat" transcript 1502 of the spoken audio stream 302, rather than a structured document as in the previous examples disclosed above. The transcript 1502 may, for example, include a sequence of flat text resembling the text illustrated in FIG. 4 (which illustrates the spoken audio stream 302 in textual form).
- The system 1500 also includes a structural parser 1504, which uses the hierarchical language model 304 to parse the transcript 1502 and thereby to produce the structured textual document 310 (step 1404). The structural parser 1504 may use the techniques disclosed above with respect to FIGS. 11C and 12B to: (1) produce multiple candidate structured documents having the same content as the transcript 1502 but having structures corresponding to different paths through the language model 304; (2) generate fitness scores for each of the candidate structured documents; and (3) select the candidate structured document having the highest fitness score as the final structured textual document. In contrast to the techniques disclosed above with respect to FIGS. 11C and 12B, however, step 1404 may be performed without performing speech recognition to generate each of the candidate structured documents. Rather, once the transcript 1502 has been produced using the speech recognition decoder 320, candidate structured documents may be generated based on the transcript 1502 without performing additional speech recognition.
- Furthermore, the structural parser 1504 need not use the full language model 304 to produce the structured textual document 310. Rather, the structural parser 1504 may use a scaled-down "skeletal" language model, such as the language model 1030 illustrated in FIG. 10C. Note that the example language model 1030 shown in FIG. 10C is the same as the language model 304 shown in FIG. 10A, except that in the skeletal language model 1030 the content language model nodes 306 a, 1006 a-d, and 1010 have been replaced with universally-accepting language models 1032 a-f, also referred to as "don't care" language models. The language models 1032 a-f will accept any text that is provided to them as input. The heading cue language models 306 b-e in the skeletal language model 1030 enable the structural parser 1504 to parse the transcript 1502 into the correct sub-structures in the structured document 310. The use of the universally-accepting language models 1032 a-f, however, enables the structural parser 1504 to perform such structural parsing without incurring the (typically significant) expense of training content language models, such as the models 306 a, 1006 a-d, and 1010 shown in FIG. 10A.
- Note that the skeletal language model 1030 may still include language models, such as the date language model 1012, corresponding to lower-level concepts. As a result, the skeletal language model 1030 may be used to generate the structured document 310 from the transcript 1502 without incurring the overhead of training content language models, while retaining the ability to parse lower-level concepts into the structured document 310.
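- A skeletal model can be approximated with cue matching alone: the section-cue grammars segment the flat transcript, and the "don't care" models accept whatever falls between cues. The cue phrases and the greedy matching strategy below are illustrative simplifications:

```python
# Toy structural parse with a skeletal model: cue phrases mark section
# boundaries; everything between cues is accepted verbatim ("don't care").
SECTION_CUES = {
    "comparison to": "comparison",
    "technique": "technique",
    "findings": "findings",
    "impression": "impression",
}

def skeletal_parse(transcript: str) -> dict:
    sections = {"header": []}
    current = "header"
    words = transcript.split()
    i = 0
    while i < len(words):
        # Try to match a cue at the current position (longest cue first).
        for cue, section in sorted(SECTION_CUES.items(),
                                   key=lambda kv: -len(kv[0].split())):
            n = len(cue.split())
            if " ".join(words[i:i + n]) == cue:
                current = section
                i += n
                break
        else:
            sections.setdefault(current, []).append(words[i])
            i += 1
    return {k: " ".join(v) for k, v in sections.items()}

doc = skeletal_parse("ct scan of the chest comparison to prior studies "
                     "findings no acute disease impression normal exam")
print(doc["findings"])  # "no acute disease"
```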
- For example, the use of a language model which contains sub-models, each of which corresponds to a particular concept, is advantageous because it allows the most appropriate language model to be used to recognize speech corresponding to each concept. In other words, if each of the sub-models corresponds to a different concept, then each of the sub-models may be used to perform speech recognition on speech representing the corresponding concept. Because the characteristics of speech may vary from concept to concept, the use of such concept-specific language models may produce better recognition results than those which would be produced using a monolithic language model for all concepts.
- Although the sub-models of a language model may correspond to sections of a document, this is not a limitation of the present invention. Rather, each sub-model in the language model may correspond to any concept, such as a section, paragraph, sentence, date, time or ICD9 code. As a result, sub-models in the language model may be matched to particular concepts with a higher degree of precision than would be possible if only section-specific language models were employed. The use of such concept-specific language models for a wide variety of concepts may further improve speech recognition accuracy.
- Furthermore, hierarchical language models designed in accordance with embodiments of the present invention may have multi-level hierarchical structures, with the effect of nesting sub-models inside of each other. As a result, sub-models in the language model may be applied to portions of the spoken
audio stream 302 at various levels of granularity, with the most appropriate language model being applied at each level of granularity. For example, a “header section” language model may be applied generally to speech inside of the header section of a document, while a “date” language model may be applied specifically to speech representing dates in the header section. This ability to nest language models and to apply nested language models to different portions of speech may further improve recognition accuracy by enabling the most appropriate language model to be applied to each portion of a spoken audio stream. - Another advantage of using a language model which includes a plurality of sub-models is that the techniques disclosed herein may use such a language model to generate a structured textual document from a spoken audio stream using a single integrated process, rather than the prior art two-
step process 100 illustrated in FIG. 1A, in which a speech recognition step is followed by a natural language processing step. In the two-step process 100 illustrated in FIG. 1A, the steps performed by the speech recognizer 104 and the natural language processor 108 are completely decoupled. Because the automatic speech recognizer 104 and natural language processor 108 operate independently from each other, the output 106 of the automatic speech recognizer 104 is a literal transcript of the spoken content in the audio stream 102. The literal transcript 106 therefore contains text corresponding to all spoken utterances in the audio stream 102, whether or not such utterances are relevant to the final desired structured textual document. Such utterances may include, for example, hesitations, extraneous words or repetitions, as well as structural hints or task-related words. Furthermore, the natural language processor 108 relies on the successful detection and transcription of certain key words and/or key phrases, such as structural hints. If these key words/phrases are misrecognized by the automatic speech recognizer 104, the identification of structural entities by the natural language processor 108 may be negatively affected. In contrast, in the method 200 illustrated in FIG. 2, speech recognition and natural language processing are integrated, thereby enabling the language model to influence both the recognition of words in the audio stream 302 and the generation of structure in the structured textual document 310, thereby improving the overall quality of the structured document 310.
- In addition to generating the structured
document 310, the techniques disclosed herein may also be used to extract and interpret semantic content from the audio stream 302. For example, the date language model 1012 (FIGS. 10A-10B) may be used to identify portions of the audio stream 302 that represent dates, and to store representations of such dates in a computer-readable form. For example, the techniques disclosed herein may be used to identify the spoken phrase "october first nineteen ninety three" as a date and to store the date in a computer-readable form, such as "month=10, day=1, year=1993". Storing such concepts in a computer-readable form allows the content of such concepts to be easily processed by a computer, such as by sorting document sections by date or identifying medications prescribed prior to a given date. Furthermore, the techniques disclosed herein enable the user to define different portions (e.g., sections) of the document, and to choose which concepts are to be extracted in each section. The techniques disclosed herein, therefore, facilitate the recognition and processing of semantic content in spoken audio streams. Such techniques may be applied instead of or in addition to storing extracted information in a structured document.
- Domains, such as the medical and legal domains, in which there are large bodies of pre-existing recorded audio streams to use as training text, may find particular benefit in the techniques disclosed herein. Such training text may be used to train the language model 304 using the techniques disclosed above with respect to FIG. 13. Because documents in such domains may be required to have well-defined structures, and because such structures may be readily identifiable in existing documents, it may be relatively easy (albeit time-consuming) to correctly identify the portions of such existing documents to use in training each of the concept-specific language model nodes in the language model 304. As a result, each of the language model nodes may be well-trained to recognize the corresponding concept, thereby increasing recognition accuracy and increasing the ability of the system to generate documents having the required structure.
- The techniques disclosed herein may be used to generate documents having the desired structure regardless of the manner in which the spoken audio stream is dictated.
- Alternative techniques requiring changes in workflow, such as techniques which require speakers to enroll (by reading training text), which require speakers to modify their manner of speaking (such as by always speaking particular concepts using predetermined spoken forms), or which require transcripts to be generated in a particular format, may be prohibitively costly to implement in domains such as the medical and legal domains. Such changes might, in fact, be inconsistent with institutional or legal requirements related to report structure (such as those imposed by insurance reporting requirements). The techniques disclosed herein, in contrast, allow the
audio stream 302 to be generated in any manner and to have any form. - Additionally, individual sub-models 306 a-e in the
language model 304 may be updated easily without affecting the remainder of the language model. For example, the header content sub-model 306 a may be replaced with a different header content sub-model which accounts differently for the way in which the document header is dictated. The modular structure of the language model 304 enables such modification/replacement of sub-models to be performed without the need to modify any other part of the language model 304. As a result, parts of the language model 304 may easily be updated to reflect different document dictation conventions.
- Furthermore, the structured textual document 310 that is produced by various embodiments of the present invention may be used to train a language model. For example, the training techniques described in the above-referenced patent application entitled "Document Transcription System Training" may use the structured textual document 310 to retrain and thereby improve the language model 304. The retrained language model 304 may then be used to produce subsequent structured textual documents, which may in turn be used to retrain the language model 304. This iterative process may be employed to improve the quality of the structured documents that are produced over time.
- The spoken
audio stream 302 may be any audio stream, such as a live audio stream received directly or indirectly (such as over a telephone or IP connection), or an audio stream recorded on any medium and in any format. In distributed speech recognition (DSR), a client performs preprocessing on an audio stream to produce a processed audio stream that is transmitted to a server, which performs speech recognition on the processed audio stream. The audio stream 302 may, for example, be a processed audio stream produced by a DSR client.
- Although in the examples above each node in the
language model 304 is described as containing a language model that corresponds to a particular concept, this is not a requirement of the present invention. For example, a node may include a language model that results from interpolating a concept-specific language model associated with the node with one or more of: (1) global background language models, or (2) concept-specific language models associated with other nodes. - In the examples above, a distinction may be made between “grammars” and “text.” It should be appreciated that text may be represented as a grammar, in which there is a single spoken form having a probability of one. Therefore, documents which are described herein as including both text and grammars may be implemented solely using grammars if desired. Furthermore, a finite state grammar is merely one kind of context-free grammar, which is a kind of language model that allows multiple alternative spoken forms of a concept to be represented. Therefore, any description herein of techniques that are applied to finite state grammars may be applied more generally to any other kind of grammar.
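- Such interpolation is a weighted mixture of the two models' probability estimates. A minimal sketch, assuming each model exposes a prob(word, history) method and that the interpolation weight is tuned elsewhere:

```python
# Linear interpolation of a concept-specific model with a background
# model: P(w|h) = lam * P_concept(w|h) + (1 - lam) * P_background(w|h).
# The 0.7 weight is an arbitrary illustrative value.
def interpolated_prob(word, history, concept_lm, background_lm, lam=0.7):
    return (lam * concept_lm.prob(word, history)
            + (1.0 - lam) * background_lm.prob(word, history))
```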
- Furthermore, although the description above may refer to finite state grammars and n-gram language models, these are merely examples of kinds of language models that may be used in conjunction with embodiments of the present invention.
- Embodiments of the present invention are not limited to use in conjunction with any particular kind(s) of language model(s).
- The invention is not limited to any of the described fields (such as medical and legal reports), but generally applies to any kind of structured documents.
- The techniques described above may be implemented, for example, in hardware, software, firmware, or any combination thereof. The techniques described above may be implemented in one or more computer programs executing on a programmable computer including a processor, a storage medium readable by the processor (including, for example, volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code may be applied to input entered using the input device to perform the functions described and to generate output. The output may be provided to one or more output devices.
- Each computer program within the scope of the claims below may be implemented in any programming language, such as assembly language, machine language, a high-level procedural programming language, or an object-oriented programming language. The programming language may, for example, be a compiled or interpreted programming language.
- Each such computer program may be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a computer processor. Method steps of the invention may be performed by a computer processor executing a program tangibly embodied on a computer-readable medium to perform functions of the invention by operating on input and generating output. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, the processor receives instructions and data from a read-only memory and/or a random access memory. Storage devices suitable for tangibly embodying computer program instructions include, for example, all forms of non-volatile memory, such as semiconductor memory devices, including EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROMs. Any of the foregoing may be supplemented by, or incorporated in, specially-designed ASICs (application-specific integrated circuits) or FPGAs (Field-Programmable Gate Arrays). A computer can generally also receive programs and data from a storage medium such as an internal disk (not shown) or a removable disk. These elements will also be found in a conventional desktop or workstation computer as well as other computers suitable for executing computer programs implementing the methods described herein, which may be used in conjunction with any digital print engine or marking engine, display monitor, or other raster output device capable of producing color or gray scale pixels on paper, film, display screen, or other output medium.
Claims (55)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/253,241 US20090048833A1 (en) | 2004-08-20 | 2008-10-17 | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/923,517 US7584103B2 (en) | 2004-08-20 | 2004-08-20 | Automated extraction of semantic content and generation of a structured document from speech |
US12/253,241 US20090048833A1 (en) | 2004-08-20 | 2008-10-17 | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/923,517 Continuation US7584103B2 (en) | 2004-08-20 | 2004-08-20 | Automated extraction of semantic content and generation of a structured document from speech |
Publications (1)
Publication Number | Publication Date |
---|---|
US20090048833A1 true US20090048833A1 (en) | 2009-02-19 |
Family
ID=35910687
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/923,517 Expired - Fee Related US7584103B2 (en) | 2004-08-20 | 2004-08-20 | Automated extraction of semantic content and generation of a structured document from speech |
US12/253,241 Abandoned US20090048833A1 (en) | 2004-08-20 | 2008-10-17 | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/923,517 Expired - Fee Related US7584103B2 (en) | 2004-08-20 | 2004-08-20 | Automated extraction of semantic content and generation of a structured document from speech |
Country Status (8)
Country | Link |
---|---|
US (2) | US7584103B2 (en) |
EP (1) | EP1787288B1 (en) |
JP (1) | JP4940139B2 (en) |
CA (1) | CA2577721C (en) |
DK (1) | DK1787288T3 (en) |
ES (1) | ES2394726T3 (en) |
PL (1) | PL1787288T3 (en) |
WO (1) | WO2006023622A2 (en) |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060025670A1 (en) * | 2004-07-07 | 2006-02-02 | Young Kim | System and method for efficient diagnostic analysis of ophthalmic examinations |
US20060041427A1 (en) * | 2004-08-20 | 2006-02-23 | Girija Yegnanarayanan | Document transcription system training |
US20060074895A1 (en) * | 2004-09-29 | 2006-04-06 | International Business Machines Corporation | Method and system for extracting and utilizing metadata to improve accuracy in speech to text conversions |
US20070299665A1 (en) * | 2006-06-22 | 2007-12-27 | Detlef Koll | Automatic Decision Support |
US20080059173A1 (en) * | 2006-08-31 | 2008-03-06 | At&T Corp. | Method and system for providing an automated web transcription service |
US20080177623A1 (en) * | 2007-01-24 | 2008-07-24 | Juergen Fritsch | Monitoring User Interactions With A Document Editing System |
US20080273774A1 (en) * | 2007-05-04 | 2008-11-06 | Maged Mikhail | System and methods for capturing a medical drawing or sketch for generating progress notes, diagnosis and billing codes |
US7793217B1 (en) * | 2004-07-07 | 2010-09-07 | Young Kim | System and method for automated report generation of ophthalmic examinations from digital drawings |
US20100299135A1 (en) * | 2004-08-20 | 2010-11-25 | Juergen Fritsch | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech |
US20100318347A1 (en) * | 2005-07-22 | 2010-12-16 | Kjell Schubert | Content-Based Audio Playback Emphasis |
US20110131486A1 (en) * | 2006-05-25 | 2011-06-02 | Kjell Schubert | Replacing Text Representing a Concept with an Alternate Written Form of the Concept |
US20120159316A1 (en) * | 2007-01-24 | 2012-06-21 | Cerner Innovation, Inc. | Multi-modal entry for electronic clinical documentation |
US20130030793A1 (en) * | 2011-07-28 | 2013-01-31 | Microsoft Corporation | Linguistic error detection |
US8504372B2 (en) | 2008-08-29 | 2013-08-06 | Mmodal Ip Llc | Distributed speech recognition using one way communication |
US8666742B2 (en) | 2005-11-08 | 2014-03-04 | Mmodal Ip Llc | Automatic detection and application of editing patterns in draft documents |
US20140136197A1 (en) * | 2011-07-31 | 2014-05-15 | Jonathan Mamou | Accuracy improvement of spoken queries transcription using co-occurrence information |
US20140278553A1 (en) * | 2013-03-15 | 2014-09-18 | Mmodal Ip Llc | Dynamic Superbill Coding Workflow |
US8959102B2 (en) | 2010-10-08 | 2015-02-17 | Mmodal Ip Llc | Structured searching of dynamic structured document corpuses |
US9009025B1 (en) * | 2011-12-27 | 2015-04-14 | Amazon Technologies, Inc. | Context-based utterance recognition |
US9262397B2 (en) | 2010-10-08 | 2016-02-16 | Microsoft Technology Licensing, Llc | General purpose correction of grammatical and word usage errors |
US9275643B2 (en) | 2011-06-19 | 2016-03-01 | Mmodal Ip Llc | Document extension in dictation-based document generation workflow |
US9477662B2 (en) | 2011-02-18 | 2016-10-25 | Mmodal Ip Llc | Computer-assisted abstraction for reporting of quality measures |
US9679077B2 (en) | 2012-06-29 | 2017-06-13 | Mmodal Ip Llc | Automated clinical evidence sheet workflow |
US10567850B2 (en) | 2016-08-26 | 2020-02-18 | International Business Machines Corporation | Hierarchical video concept tagging and indexing system for learning content orchestration |
US10950329B2 (en) | 2015-03-13 | 2021-03-16 | Mmodal Ip Llc | Hybrid human and computer-assisted coding workflow |
US11043306B2 (en) | 2017-01-17 | 2021-06-22 | 3M Innovative Properties Company | Methods and systems for manifestation and transmission of follow-up notifications |
US11062704B1 (en) | 2018-12-21 | 2021-07-13 | Cerner Innovation, Inc. | Processing multi-party conversations |
US11282596B2 (en) | 2017-11-22 | 2022-03-22 | 3M Innovative Properties Company | Automated code feedback system |
US11455497B2 (en) * | 2018-07-23 | 2022-09-27 | Accenture Global Solutions Limited | Information transition management platform |
Families Citing this family (122)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004049192A2 (en) | 2002-11-28 | 2004-06-10 | Koninklijke Philips Electronics N.V. | Method to assign word class information |
US8666725B2 (en) * | 2004-04-16 | 2014-03-04 | University Of Southern California | Selection and use of nonstatistical translation components in a statistical machine translation framework |
WO2006015169A2 (en) * | 2004-07-30 | 2006-02-09 | Dictaphone Corporation | A system and method for report level confidence |
US7584103B2 (en) * | 2004-08-20 | 2009-09-01 | Multimodal Technologies, Inc. | Automated extraction of semantic content and generation of a structured document from speech |
US8412521B2 (en) * | 2004-08-20 | 2013-04-02 | Multimodal Technologies, Llc | Discriminative training of document transcription system |
JP5452868B2 (en) * | 2004-10-12 | 2014-03-26 | ユニヴァーシティー オブ サザン カリフォルニア | Training for text-to-text applications that use string-to-tree conversion for training and decoding |
US7502741B2 (en) * | 2005-02-23 | 2009-03-10 | Multimodal Technologies, Inc. | Audio signal de-identification |
US20060212452A1 (en) * | 2005-03-18 | 2006-09-21 | Cornacchia Louis G Iii | System and method for remotely inputting and retrieving records and generating reports |
US7430715B2 (en) * | 2005-05-31 | 2008-09-30 | Sap, Aktiengesellschaft | Interface for indicating the presence of inherited values in a document |
US7640255B2 (en) | 2005-05-31 | 2009-12-29 | Sap, Ag | Method for utilizing a multi-layered data model to generate audience specific documents |
US8886517B2 (en) | 2005-06-17 | 2014-11-11 | Language Weaver, Inc. | Trust scoring for language translation systems |
US7693713B2 (en) * | 2005-06-17 | 2010-04-06 | Microsoft Corporation | Speech models generated using competitive training, asymmetric training, and data boosting |
US8676563B2 (en) * | 2009-10-01 | 2014-03-18 | Language Weaver, Inc. | Providing human-generated and machine-generated trusted translations |
US8577684B2 (en) | 2005-07-13 | 2013-11-05 | Intellisist, Inc. | Selective security masking within recorded speech utilizing speech recognition techniques |
US8700404B1 (en) | 2005-08-27 | 2014-04-15 | At&T Intellectual Property Ii, L.P. | System and method for using semantic and syntactic graphs for utterance classification |
US8032372B1 (en) * | 2005-09-13 | 2011-10-04 | Escription, Inc. | Dictation selection |
US20070081428A1 (en) * | 2005-09-29 | 2007-04-12 | Spryance, Inc. | Transcribing dictation containing private information |
US20070078806A1 (en) * | 2005-10-05 | 2007-04-05 | Hinickle Judith A | Method and apparatus for evaluating the accuracy of transcribed documents and other documents |
US10319252B2 (en) * | 2005-11-09 | 2019-06-11 | Sdl Inc. | Language capability assessment and training apparatus and techniques |
US8473296B2 (en) | 2005-12-08 | 2013-06-25 | Nuance Communications, Inc. | Method and system for dynamic creation of contexts |
US8036889B2 (en) * | 2006-02-27 | 2011-10-11 | Nuance Communications, Inc. | Systems and methods for filtering dictated and non-dictated sections of documents |
US8301448B2 (en) * | 2006-03-29 | 2012-10-30 | Nuance Communications, Inc. | System and method for applying dynamic contextual grammars and language models to improve automatic speech recognition accuracy |
US7756708B2 (en) * | 2006-04-03 | 2010-07-13 | Google Inc. | Automatic language model update |
US8943080B2 (en) * | 2006-04-07 | 2015-01-27 | University Of Southern California | Systems and methods for identifying parallel documents and sentence fragments in multilingual document collections |
US8433915B2 (en) | 2006-06-28 | 2013-04-30 | Intellisist, Inc. | Selective security masking within recorded speech |
US20080027726A1 (en) * | 2006-07-28 | 2008-01-31 | Eric Louis Hansen | Text to audio mapping, and animation of the text |
US8886518B1 (en) | 2006-08-07 | 2014-11-11 | Language Weaver, Inc. | System and method for capitalizing machine translated text |
US9122674B1 (en) | 2006-12-15 | 2015-09-01 | Language Weaver, Inc. | Use of annotations in statistical machine translation |
US8433576B2 (en) * | 2007-01-19 | 2013-04-30 | Microsoft Corporation | Automatic reading tutoring with parallel polarized language modeling |
US20080221882A1 (en) * | 2007-03-06 | 2008-09-11 | Bundock Donald S | System for excluding unwanted data from a voice recording |
US8615389B1 (en) * | 2007-03-16 | 2013-12-24 | Language Weaver, Inc. | Generation and exploitation of an approximate language model |
EP2130167A1 (en) * | 2007-03-29 | 2009-12-09 | Nuance Communications Austria GmbH | Method and system for generating a medical report and computer program product therefor |
US8831928B2 (en) * | 2007-04-04 | 2014-09-09 | Language Weaver, Inc. | Customizable machine translation service |
JP5145751B2 (en) * | 2007-04-06 | 2013-02-20 | コニカミノルタエムジー株式会社 | Medical information processing system |
US8825466B1 (en) | 2007-06-08 | 2014-09-02 | Language Weaver, Inc. | Modification of annotated bilingual segment pairs in syntax-based machine translation |
US8306822B2 (en) * | 2007-09-11 | 2012-11-06 | Microsoft Corporation | Automatic reading tutoring using dynamically built language model |
US20090216532A1 (en) * | 2007-09-26 | 2009-08-27 | Nuance Communications, Inc. | Automatic Extraction and Dissemination of Audio Impression |
US8301633B2 (en) * | 2007-10-01 | 2012-10-30 | Palo Alto Research Center Incorporated | System and method for semantic search |
US20100017293A1 (en) * | 2008-07-17 | 2010-01-21 | Language Weaver, Inc. | System, method, and computer program for providing multilingual text advertisements |
US20100125450A1 (en) | 2008-10-27 | 2010-05-20 | Spheris Inc. | Synchronized transcription rules handling |
US20100145720A1 (en) * | 2008-12-05 | 2010-06-10 | Bruce Reiner | Method of extracting real-time structured data and performing data analysis and decision support in medical reporting |
JP5377430B2 (en) * | 2009-07-08 | 2013-12-25 | 本田技研工業株式会社 | Question answering database expansion device and question answering database expansion method |
US8990064B2 (en) * | 2009-07-28 | 2015-03-24 | Language Weaver, Inc. | Translating documents based on content |
WO2011100474A2 (en) | 2010-02-10 | 2011-08-18 | Multimodal Technologies, Inc. | Providing computable guidance to relevant evidence in question-answering systems |
US10417646B2 (en) | 2010-03-09 | 2019-09-17 | Sdl Inc. | Predicting the cost associated with translating textual content |
US8463673B2 (en) | 2010-09-23 | 2013-06-11 | Mmodal Ip Llc | User feedback in semi-automatic question answering systems |
US10460288B2 (en) | 2011-02-18 | 2019-10-29 | Nuance Communications, Inc. | Methods and apparatus for identifying unspecified diagnoses in clinical documentation |
US8768723B2 (en) | 2011-02-18 | 2014-07-01 | Nuance Communications, Inc. | Methods and apparatus for formatting text for clinical fact extraction |
US10032127B2 (en) | 2011-02-18 | 2018-07-24 | Nuance Communications, Inc. | Methods and apparatus for determining a clinician's intent to order an item |
US9904768B2 (en) | 2011-02-18 | 2018-02-27 | Nuance Communications, Inc. | Methods and apparatus for presenting alternative hypotheses for medical facts |
US11003838B2 (en) | 2011-04-18 | 2021-05-11 | Sdl Inc. | Systems and methods for monitoring post translation editing |
US8694303B2 (en) | 2011-06-15 | 2014-04-08 | Language Weaver, Inc. | Systems and methods for tuning parameters in statistical machine translation |
US9412369B2 (en) | 2011-06-17 | 2016-08-09 | Microsoft Technology Licensing, Llc | Automated adverse drug event alerts |
US8886515B2 (en) | 2011-10-19 | 2014-11-11 | Language Weaver, Inc. | Systems and methods for enhancing machine translation post edit review processes |
US9569593B2 (en) | 2012-03-08 | 2017-02-14 | Nuance Communications, Inc. | Methods and apparatus for generating clinical reports |
US9569594B2 (en) | 2012-03-08 | 2017-02-14 | Nuance Communications, Inc. | Methods and apparatus for generating clinical reports |
CA2865280A1 (en) * | 2012-03-08 | 2013-09-12 | Nuance Communications, Inc. | Methods and apparatus for generating clinical reports |
US8942973B2 (en) | 2012-03-09 | 2015-01-27 | Language Weaver, Inc. | Content page URL translation |
US8612261B1 (en) | 2012-05-21 | 2013-12-17 | Health Management Associates, Inc. | Automated learning for medical data processing system |
US10261994B2 (en) | 2012-05-25 | 2019-04-16 | Sdl Inc. | Method and system for automatic management of reputation of translators |
WO2014028529A2 (en) | 2012-08-13 | 2014-02-20 | Mmodal Ip Llc | Maintaining a discrete data representation that corresponds to information contained in free-form text |
US9710431B2 (en) * | 2012-08-18 | 2017-07-18 | Health Fidelity, Inc. | Systems and methods for processing patient information |
US8762134B2 (en) | 2012-08-30 | 2014-06-24 | Arria Data2Text Limited | Method and apparatus for situational analysis text generation |
US9336193B2 (en) | 2012-08-30 | 2016-05-10 | Arria Data2Text Limited | Method and apparatus for updating a previously generated text |
US8762133B2 (en) | 2012-08-30 | 2014-06-24 | Arria Data2Text Limited | Method and apparatus for alert validation |
US9405448B2 (en) | 2012-08-30 | 2016-08-02 | Arria Data2Text Limited | Method and apparatus for annotating a graphical output |
US9135244B2 (en) | 2012-08-30 | 2015-09-15 | Arria Data2Text Limited | Method and apparatus for configurable microplanning |
US9600471B2 (en) | 2012-11-02 | 2017-03-21 | Arria Data2Text Limited | Method and apparatus for aggregating with information generalization |
WO2014076525A1 (en) | 2012-11-16 | 2014-05-22 | Data2Text Limited | Method and apparatus for expressing time in an output text |
WO2014076524A1 (en) | 2012-11-16 | 2014-05-22 | Data2Text Limited | Method and apparatus for spatial descriptions in an output text |
US9152622B2 (en) | 2012-11-26 | 2015-10-06 | Language Weaver, Inc. | Personalized machine translation via online adaptation |
WO2014102569A1 (en) | 2012-12-27 | 2014-07-03 | Arria Data2Text Limited | Method and apparatus for motion description |
US10115202B2 (en) | 2012-12-27 | 2018-10-30 | Arria Data2Text Limited | Method and apparatus for motion detection |
US10776561B2 (en) | 2013-01-15 | 2020-09-15 | Arria Data2Text Limited | Method and apparatus for generating a linguistic representation of raw input data |
US11024406B2 (en) | 2013-03-12 | 2021-06-01 | Nuance Communications, Inc. | Systems and methods for identifying errors and/or critical results in medical reports |
US9819798B2 (en) | 2013-03-14 | 2017-11-14 | Intellisist, Inc. | Computer-implemented system and method for efficiently facilitating appointments within a call center via an automatic call distributor |
WO2014165837A1 (en) * | 2013-04-04 | 2014-10-09 | Waterhouse Jonathan | Displaying an action vignette while text of a passage is correctly read aloud |
US10496743B2 (en) | 2013-06-26 | 2019-12-03 | Nuance Communications, Inc. | Methods and apparatus for extracting facts from a medical text |
WO2015028844A1 (en) | 2013-08-29 | 2015-03-05 | Arria Data2Text Limited | Text generation from correlated alerts |
US9244894B1 (en) | 2013-09-16 | 2016-01-26 | Arria Data2Text Limited | Method and apparatus for interactive reports |
US9396181B1 (en) | 2013-09-16 | 2016-07-19 | Arria Data2Text Limited | Method, apparatus, and computer program product for user-directed reporting |
US9213694B2 (en) | 2013-10-10 | 2015-12-15 | Language Weaver, Inc. | Efficient online domain adaptation |
US10324966B2 (en) | 2014-03-21 | 2019-06-18 | Mmodal Ip Llc | Search by example |
WO2015159133A1 (en) | 2014-04-18 | 2015-10-22 | Arria Data2Text Limited | Method and apparatus for document planning |
US10152532B2 (en) | 2014-08-07 | 2018-12-11 | AT&T Interwise Ltd. | Method and system to associate meaningful expressions with abbreviated names |
US10169826B1 (en) * | 2014-10-31 | 2019-01-01 | Intuit Inc. | System and method for generating explanations for tax calculations |
US10387970B1 (en) | 2014-11-25 | 2019-08-20 | Intuit Inc. | Systems and methods for analyzing and generating explanations for changes in tax return results |
WO2016090010A1 (en) * | 2014-12-03 | 2016-06-09 | Hakman Labs LLC | Workflow definition, orchestration and enforcement via a collaborative interface according to a hierarchical checklist |
US20170116194A1 (en) | 2015-10-23 | 2017-04-27 | International Business Machines Corporation | Ingestion planning for complex tables |
US10747947B2 (en) * | 2016-02-25 | 2020-08-18 | Nxgn Management, Llc | Electronic health record compatible distributed dictation transcription system |
JP2017167433A (en) * | 2016-03-17 | 2017-09-21 | 株式会社東芝 | Summary generation device, summary generation method, and summary generation program |
US10754978B2 (en) | 2016-07-29 | 2020-08-25 | Intellisist Inc. | Computer-implemented system and method for storing and retrieving sensitive information |
US10445432B1 (en) | 2016-08-31 | 2019-10-15 | Arria Data2Text Limited | Method and apparatus for lightweight multilingual natural language realizer |
US12020334B2 (en) | 2016-10-26 | 2024-06-25 | Intuit Inc. | Methods, systems and computer program products for generating and presenting explanations for tax questions |
US10467347B1 (en) | 2016-10-31 | 2019-11-05 | Arria Data2Text Limited | Method and apparatus for natural language document orchestrator |
US10860685B2 (en) | 2016-11-28 | 2020-12-08 | Google Llc | Generating structured text content using speech recognition models |
WO2018152352A1 (en) | 2017-02-18 | 2018-08-23 | Mmodal Ip Llc | Computer-automated scribe tools |
US20190051395A1 (en) | 2017-08-10 | 2019-02-14 | Nuance Communications, Inc. | Automated clinical documentation system and method |
US11316865B2 (en) | 2017-08-10 | 2022-04-26 | Nuance Communications, Inc. | Ambient cooperative intelligence system and method |
US10579716B2 (en) | 2017-11-06 | 2020-03-03 | Microsoft Technology Licensing, Llc | Electronic document content augmentation |
US11250382B2 (en) | 2018-03-05 | 2022-02-15 | Nuance Communications, Inc. | Automated clinical documentation system and method |
WO2019173333A1 (en) | 2018-03-05 | 2019-09-12 | Nuance Communications, Inc. | Automated clinical documentation system and method |
US20190272902A1 (en) | 2018-03-05 | 2019-09-05 | Nuance Communications, Inc. | System and method for review of automated clinical documentation |
US10891436B2 (en) * | 2018-03-09 | 2021-01-12 | Accenture Global Solutions Limited | Device and method for voice-driven ideation session management |
US10664662B2 (en) * | 2018-04-18 | 2020-05-26 | Microsoft Technology Licensing, Llc | Multi-scale model for semantic matching |
US11836454B2 (en) | 2018-05-02 | 2023-12-05 | Language Scientific, Inc. | Systems and methods for producing reliable translation in near real-time |
KR20190136578A (en) * | 2018-05-31 | 2019-12-10 | 삼성전자주식회사 | Method and apparatus for speech recognition |
US11094322B2 (en) | 2019-02-07 | 2021-08-17 | International Business Machines Corporation | Optimizing speech to text conversion and text summarization using a medical provider workflow model |
US10522138B1 (en) * | 2019-02-11 | 2019-12-31 | Groupe Allo Media SAS | Real-time voice processing systems and methods |
US11227679B2 (en) | 2019-06-14 | 2022-01-18 | Nuance Communications, Inc. | Ambient clinical intelligence system and method |
US11043207B2 (en) | 2019-06-14 | 2021-06-22 | Nuance Communications, Inc. | System and method for array data simulation and customized acoustic modeling for ambient ASR |
US11216480B2 (en) | 2019-06-14 | 2022-01-04 | Nuance Communications, Inc. | System and method for querying data points from graph data structures |
US11531807B2 (en) | 2019-06-28 | 2022-12-20 | Nuance Communications, Inc. | System and method for customized text macros |
US11670408B2 (en) | 2019-09-30 | 2023-06-06 | Nuance Communications, Inc. | System and method for review of automated clinical documentation |
WO2021111374A1 (en) * | 2019-12-04 | 2021-06-10 | Rajanna Pooran Prasad | A system and method for providing contextual information and actions to make a conversation meaningful and engaging |
US10805665B1 (en) | 2019-12-13 | 2020-10-13 | Bank Of America Corporation | Synchronizing text-to-audio with interactive videos in the video framework |
US11350185B2 (en) | 2019-12-13 | 2022-05-31 | Bank Of America Corporation | Text-to-audio for interactive videos using a markup language |
JP6818916B2 (en) * | 2020-01-08 | 2021-01-27 | 株式会社東芝 | Summary generator, summary generation method and summary generation program |
US11222103B1 (en) | 2020-10-29 | 2022-01-11 | Nuance Communications, Inc. | Ambient cooperative intelligence system and method |
US11429780B1 (en) * | 2021-01-11 | 2022-08-30 | Suki AI, Inc. | Systems and methods to briefly deviate from and resume back to amending a section of a note |
US20220383874A1 (en) | 2021-05-28 | 2022-12-01 | 3M Innovative Properties Company | Documentation system based on dynamic semantic templates |
US20230395063A1 (en) * | 2022-06-03 | 2023-12-07 | Nuance Communications, Inc. | System and Method for Secure Transcription Generation |
Citations (78)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5526407A (en) * | 1991-09-30 | 1996-06-11 | Riverrun Technology | Method and apparatus for managing information |
US5594638A (en) * | 1993-12-29 | 1997-01-14 | First Opinion Corporation | Computerized medical diagnostic system including re-enter function and sensitivity factors |
US5701469A (en) * | 1995-06-07 | 1997-12-23 | Microsoft Corporation | Method and system for generating accurate search results using a content-index |
US5823948A (en) * | 1996-07-08 | 1998-10-20 | Rlis, Inc. | Medical records, documentation, tracking and order entry system |
US5835893A (en) * | 1996-02-15 | 1998-11-10 | Atr Interpreting Telecommunications Research Labs | Class-based word clustering for speech recognition using a three-level balanced hierarchical similarity |
US5870706A (en) * | 1996-04-10 | 1999-02-09 | Lucent Technologies, Inc. | Method and apparatus for an improved language recognition system |
US5983187A (en) * | 1995-12-15 | 1999-11-09 | Hewlett-Packard Company | Speech data storage organizing system using form field indicators |
US5995936A (en) * | 1997-02-04 | 1999-11-30 | Brais; Louis | Report generation system and method for capturing prose, audio, and video by voice command and automatically linking sound and image to formatted text locations |
US6061675A (en) * | 1995-05-31 | 2000-05-09 | Oracle Corporation | Methods and apparatus for classifying terminology utilizing a knowledge catalog |
US6182039B1 (en) * | 1998-03-24 | 2001-01-30 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus using probabilistic language model based on confusable sets for speech recognition |
US6345249B1 (en) * | 1999-07-07 | 2002-02-05 | International Business Machines Corp. | Automatic analysis of a speech dictated document |
US20020099717A1 (en) * | 2001-01-24 | 2002-07-25 | Gordon Bennett | Method for report generation in an on-line transcription system |
WO2002071391A2 (en) * | 2001-03-01 | 2002-09-12 | International Business Machines Corporation | Hierarchical language models |
US20030018470A1 (en) * | 2001-04-13 | 2003-01-23 | Golden Richard M. | System and method for automatic semantic coding of free response data using Hidden Markov Model methodology |
US6526380B1 (en) * | 1999-03-26 | 2003-02-25 | Koninklijke Philips Electronics N.V. | Speech recognition system having parallel large vocabulary recognition engines |
US20030065503A1 (en) * | 2001-09-28 | 2003-04-03 | Philips Electronics North America Corp. | Multi-lingual transcription system |
US20030069760A1 (en) * | 2001-10-04 | 2003-04-10 | Arthur Gelber | System and method for processing and pre-adjudicating patient benefit claims |
US20030101054A1 (en) * | 2001-11-27 | 2003-05-29 | Ncc, Llc | Integrated system and method for electronic speech recognition and transcription |
US20030181790A1 (en) * | 2000-05-18 | 2003-09-25 | Daniel David | Methods and apparatus for facilitated, hierarchical medical diagnosis and symptom coding and definition |
US20030191627A1 (en) * | 1998-05-28 | 2003-10-09 | Lawrence Au | Topological methods to organize semantic network data flows for conversational applications |
US6662168B1 (en) * | 2000-05-19 | 2003-12-09 | International Business Machines Corporation | Coding system for high data volume |
US6684188B1 (en) * | 1996-02-02 | 2004-01-27 | Geoffrey C Mitchell | Method for production of medical records and other technical documents |
US20040019482A1 (en) * | 2002-04-19 | 2004-01-29 | Holub John M. | Speech to text system using controlled vocabulary indices |
US20040030688A1 (en) * | 2000-05-31 | 2004-02-12 | International Business Machines Corporation | Information search using knowledge agents |
US20040030556A1 (en) * | 1999-11-12 | 2004-02-12 | Bennett Ian M. | Speech based learning/training system using semantic decoding |
US20040064317A1 (en) * | 2002-09-26 | 2004-04-01 | Konstantin Othmer | System and method for online transcription services |
US20040078215A1 (en) * | 2000-11-22 | 2004-04-22 | Recare, Inc. | Systems and methods for documenting medical findings of a physical examination |
US6738784B1 (en) * | 2000-04-06 | 2004-05-18 | Dictaphone Corporation | Document and information processing system |
US20040102957A1 (en) * | 2002-11-22 | 2004-05-27 | Levin Robert E. | System and method for speech translation using remote devices |
US20040111265A1 (en) * | 2002-12-06 | 2004-06-10 | Forbes Joseph S | Method and system for sequential insertion of speech recognition results to facilitate deferred transcription services |
US20040117189A1 (en) * | 1999-11-12 | 2004-06-17 | Bennett Ian M. | Query engine for processing voice based queries including semantic decoding |
US20040148170A1 (en) * | 2003-01-23 | 2004-07-29 | Alejandro Acero | Statistical classifiers for spoken language understanding and command/control scenarios |
US6785651B1 (en) * | 2000-09-14 | 2004-08-31 | Microsoft Corporation | Method and apparatus for performing plan-based dialog |
US20040243614A1 (en) * | 2003-05-30 | 2004-12-02 | Dictaphone Corporation | Method, system, and apparatus for validation |
US20040249667A1 (en) * | 2001-10-18 | 2004-12-09 | Oon Yeong K | System and method of improved recording of medical transactions |
US6834264B2 (en) * | 2001-03-29 | 2004-12-21 | Provox Technologies Corporation | Method and apparatus for voice dictation and document production |
US20050065774A1 (en) * | 2003-09-20 | 2005-03-24 | International Business Machines Corporation | Method of self enhancement of search results through analysis of system logs |
US20050086056A1 (en) * | 2003-09-25 | 2005-04-21 | Fuji Photo Film Co., Ltd. | Voice recognition system and program |
US20050091059A1 (en) * | 2003-08-29 | 2005-04-28 | Microsoft Corporation | Assisted multi-modal dialogue |
US20050114122A1 (en) * | 2003-09-25 | 2005-05-26 | Dictaphone Corporation | System and method for customizing speech recognition input and output |
US20050114129A1 (en) * | 2002-12-06 | 2005-05-26 | Watson Kirk L. | Method and system for server-based sequential insertion processing of speech recognition results |
US20050120300A1 (en) * | 2003-09-25 | 2005-06-02 | Dictaphone Corporation | Method, system, and apparatus for assembly, transport and display of clinical data |
US20050154690A1 (en) * | 2002-02-04 | 2005-07-14 | Celestar Lexico-Sciences, Inc | Document knowledge management apparatus and method |
US20050216443A1 (en) * | 2000-07-06 | 2005-09-29 | Streamsage, Inc. | Method and system for indexing and searching timed media information based upon relevance intervals |
US20050228815A1 (en) * | 2004-03-31 | 2005-10-13 | Dictaphone Corporation | Categorization of information using natural language processing and predefined templates |
US20050234891A1 (en) * | 2004-03-15 | 2005-10-20 | Yahoo! Inc. | Search systems and methods with integration of user annotations |
US20050240439A1 (en) * | 2004-04-15 | 2005-10-27 | Artificial Medical Intelligence, Inc, | System and method for automatic assignment of medical codes to unformatted data |
US20050288930A1 (en) * | 2004-06-09 | 2005-12-29 | Vaastek, Inc. | Computer voice recognition apparatus and method |
US20060007188A1 (en) * | 2004-07-09 | 2006-01-12 | Gesturerad, Inc. | Gesture-based reporting method and system |
US20060020466A1 (en) * | 2004-07-26 | 2006-01-26 | Cousineau Leo E | Ontology based medical patient evaluation method for data capture and knowledge representation |
US20060020886A1 (en) * | 2004-07-15 | 2006-01-26 | Agrawal Subodh K | System and method for the structured capture of information and the generation of semantically rich reports |
US20060041428A1 (en) * | 2004-08-20 | 2006-02-23 | Juergen Fritsch | Automated extraction of semantic content and generation of a structured document from speech |
US20060041836A1 (en) * | 2002-09-12 | 2006-02-23 | Gordon T J | Information documenting system with improved speed, completeness, retrievability and granularity |
US20060074656A1 (en) * | 2004-08-20 | 2006-04-06 | Lambert Mathias | Discriminative training of document transcription system |
US7031908B1 (en) * | 2000-06-01 | 2006-04-18 | Microsoft Corporation | Creating a language model for a language processing system |
US20060089857A1 (en) * | 2004-10-21 | 2006-04-27 | Zimmerman Roger S | Transcription data security |
US7054812B2 (en) * | 2000-05-16 | 2006-05-30 | Canon Kabushiki Kaisha | Database annotation and retrieval |
US20060129435A1 (en) * | 2004-12-15 | 2006-06-15 | Critical Connection Inc. | System and method for providing community health data services |
US20070043761A1 (en) * | 2005-08-22 | 2007-02-22 | The Personal Bee, Inc. | Semantic discovery engine |
US7197460B1 (en) * | 2002-04-23 | 2007-03-27 | At&T Corp. | System for handling frequently asked questions in a natural language dialog service |
US7216073B2 (en) * | 2001-03-13 | 2007-05-08 | Intelligate, Ltd. | Dynamic natural language understanding |
US20070179777A1 (en) * | 2005-12-22 | 2007-08-02 | Rakesh Gupta | Automatic Grammar Generation Using Distributedly Collected Knowledge |
US20070226211A1 (en) * | 2006-03-27 | 2007-09-27 | Heinze Daniel T | Auditing the Coding and Abstracting of Documents |
US20070239445A1 (en) * | 2006-04-11 | 2007-10-11 | International Business Machines Corporation | Method and system for automatic transcription prioritization |
US20070237427A1 (en) * | 2006-04-10 | 2007-10-11 | Patel Nilesh V | Method and system for simplified recordkeeping including transcription and voting based verification |
US20070288212A1 (en) * | 2002-08-19 | 2007-12-13 | General Electric Company | System And Method For Optimizing Simulation Of A Discrete Event Process Using Business System Data |
US20080059232A1 (en) * | 1997-03-13 | 2008-03-06 | Clinical Decision Support, Llc | Disease management system and method including question version |
US20080168343A1 (en) * | 2007-01-05 | 2008-07-10 | Doganata Yurdaer N | System and Method of Automatically Mapping a Given Annotator to an Aggregate of Given Annotators |
US20090055168A1 (en) * | 2007-08-23 | 2009-02-26 | Google Inc. | Word Detection |
US7519529B1 (en) * | 2001-06-29 | 2009-04-14 | Microsoft Corporation | System and methods for inferring informational goals and preferred level of detail of results in response to questions posed to an automated information-retrieval or question-answering service |
US20090228126A1 (en) * | 2001-03-09 | 2009-09-10 | Steven Spielberg | Method and apparatus for annotating a line-based document |
US20090228299A1 (en) * | 2005-11-09 | 2009-09-10 | The Regents Of The University Of California | Methods and apparatus for context-sensitive telemedicine |
US7610192B1 (en) * | 2006-03-22 | 2009-10-27 | Patrick William Jamieson | Process and system for high precision coding of free text documents against a standard lexicon |
US20100076761A1 (en) * | 2008-09-25 | 2010-03-25 | Fritsch Juergen | Decoding-Time Prediction of Non-Verbalized Tokens |
US7716040B2 (en) * | 2006-06-22 | 2010-05-11 | Multimodal Technologies, Inc. | Verification of extracted data |
US20100185685A1 (en) * | 2009-01-13 | 2010-07-22 | Chew Peter A | Technique for Information Retrieval Using Enhanced Latent Semantic Analysis |
US20100299135A1 (en) * | 2004-08-20 | 2010-11-25 | Juergen Fritsch | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech |
US7869998B1 (en) * | 2002-04-23 | 2011-01-11 | At&T Intellectual Property Ii, L.P. | Voice-enabled dialog system |
Family Cites Families (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS62221775A (en) | 1986-03-20 | 1987-09-29 | Fujitsu Ltd | Natural language processing system |
US5434962A (en) | 1990-09-07 | 1995-07-18 | Fuji Xerox Co., Ltd. | Method and system for automatically generating logical structures of electronic documents |
JPH0769921B2 (en) | 1990-11-09 | 1995-07-31 | 株式会社日立製作所 | Document logical structure generation method |
JPH06168267A (en) | 1992-11-30 | 1994-06-14 | Itec:Kk | Structural document preparing method and structural document preparation supporting device |
US5384892A (en) | 1992-12-31 | 1995-01-24 | Apple Computer, Inc. | Dynamic language model for speech recognition |
DE4397100T1 (en) | 1992-12-31 | 1995-11-23 | Apple Computer | Recursive grammar with a finite number of states |
NZ248751A (en) | 1994-03-23 | 1997-11-24 | Ryan John Kevin | Text analysis and coding |
JP2618832B2 (en) | 1994-06-16 | 1997-06-11 | 日本アイ・ビー・エム株式会社 | Method and system for analyzing logical structure of document |
US6041292A (en) | 1996-01-16 | 2000-03-21 | Jochim; Carol | Real time stenographic system utilizing vowel omission principle |
US5797123A (en) | 1996-10-01 | 1998-08-18 | Lucent Technologies Inc. | Method of key-phrase detection and verification for flexible speech understanding |
US6055494A (en) | 1996-10-28 | 2000-04-25 | The Trustees Of Columbia University In The City Of New York | System and method for medical language extraction and encoding |
US6182029B1 (en) | 1996-10-28 | 2001-01-30 | The Trustees Of Columbia University In The City Of New York | System and method for language extraction and encoding utilizing the parsing of text data in accordance with domain parameters |
US5839106A (en) | 1996-12-17 | 1998-11-17 | Apple Computer, Inc. | Large-vocabulary speech recognition using an integrated syntactic and semantic statistical language model |
US6122613A (en) | 1997-01-30 | 2000-09-19 | Dragon Systems, Inc. | Speech recognition using multiple recognizers (selectively) applied to the same input sample |
US5970449A (en) | 1997-04-03 | 1999-10-19 | Microsoft Corporation | Text normalization using a context-free grammar |
US6490561B1 (en) | 1997-06-25 | 2002-12-03 | Dennis L. Wilson | Continuous speech voice transcription |
US5926784A (en) | 1997-07-17 | 1999-07-20 | Microsoft Corporation | Method and system for natural language parsing using podding |
EP0903727A1 (en) | 1997-09-17 | 1999-03-24 | Istituto Trentino Di Cultura | A system and method for automatic speech recognition |
US6292771B1 (en) | 1997-09-30 | 2001-09-18 | Ihc Health Services, Inc. | Probabilistic method for natural language processing and for encoding free-text data into a medical database by utilizing a Bayesian network to perform spell checking of words |
US6112168A (en) | 1997-10-20 | 2000-08-29 | Microsoft Corporation | Automatically recognizing the discourse structure of a body of text |
US6304870B1 (en) | 1997-12-02 | 2001-10-16 | The Board Of Regents Of The University Of Washington, Office Of Technology Transfer | Method and apparatus of automatically generating a procedure for extracting information from textual information sources |
US6154722A (en) | 1997-12-18 | 2000-11-28 | Apple Computer, Inc. | Method and apparatus for a speech recognition system language model that integrates a finite state grammar probability and an N-gram probability |
DE19809563A1 (en) | 1998-03-05 | 1999-09-09 | Siemens Ag | Medical work station for treating patient |
US7043426B2 (en) | 1998-04-01 | 2006-05-09 | Cyberpulse, L.L.C. | Structured speech recognition |
US6915254B1 (en) | 1998-07-30 | 2005-07-05 | A-Life Medical, Inc. | Automatically assigning medical codes using natural language processing |
US6304848B1 (en) | 1998-08-13 | 2001-10-16 | Medical Manager Corp. | Medical record forming and storing apparatus and medical record and method related to same |
US6122614A (en) | 1998-11-20 | 2000-09-19 | Custom Speech Usa, Inc. | System and method for automating transcription services |
US6249765B1 (en) | 1998-12-22 | 2001-06-19 | Xerox Corporation | System and method for extracting data from audio messages |
US6243669B1 (en) | 1999-01-29 | 2001-06-05 | Sony Corporation | Method and apparatus for providing syntactic analysis and data structure for translation knowledge in example-based language translation |
US6278968B1 (en) | 1999-01-29 | 2001-08-21 | Sony Corporation | Method and apparatus for adaptive speech recognition hypothesis construction and selection in a spoken language translation system |
WO2000054180A1 (en) | 1999-03-05 | 2000-09-14 | Cai Co., Ltd. | System and method for creating formatted document on the basis of conversational speech recognition |
JP2000259175A (en) | 1999-03-08 | 2000-09-22 | Mitsubishi Electric Corp | Voice recognition device |
US6609087B1 (en) | 1999-04-28 | 2003-08-19 | Genuity Inc. | Fact recognition system |
US6434547B1 (en) * | 1999-10-28 | 2002-08-13 | Qenm.Com | Data capture and verification system |
CN1254787C (en) | 1999-12-02 | 2006-05-03 | 汤姆森许可贸易公司 | Method and device for speech recognition with disjoint language models |
US6535849B1 (en) | 2000-01-18 | 2003-03-18 | Scansoft, Inc. | Method and system for generating semi-literal transcripts for speech recognition systems |
JP4108948B2 (en) * | 2000-09-25 | 2008-06-25 | 富士通株式会社 | Apparatus and method for viewing a plurality of documents |
CN1529861B (en) | 2000-11-07 | 2010-12-29 | 阿斯科瑞帕兹公司 | System for creation of database and structured information from verbal input |
US20020087315A1 (en) | 2000-12-29 | 2002-07-04 | Lee Victor Wai Leung | Computer-implemented multi-scanning language method and system |
US20020087311A1 (en) | 2000-12-29 | 2002-07-04 | Leung Lee Victor Wai | Computer-implemented dynamic language model generation method and system |
US6714939B2 (en) | 2001-01-08 | 2004-03-30 | Softface, Inc. | Creation of structured data from plain text |
US20020156817A1 (en) | 2001-02-22 | 2002-10-24 | Volantia, Inc. | System and method for extracting information |
JP2003022091A (en) | 2001-07-10 | 2003-01-24 | Nippon Hoso Kyokai <Nhk> | Method, device, and program for voice recognition |
US20030105638A1 (en) | 2001-11-27 | 2003-06-05 | Taira Rick K. | Method and system for creating computer-understandable structured medical data from natural language reports |
US20030144885A1 (en) | 2002-01-29 | 2003-07-31 | Exscribe, Inc. | Medical examination and transcription method, and associated apparatus |
US7028038B1 (en) | 2002-07-03 | 2006-04-11 | Mayo Foundation For Medical Education And Research | Method for generating training data for medical text abbreviation and acronym normalization |
JP4415546B2 (en) * | 2003-01-06 | 2010-02-17 | 三菱電機株式会社 | Spoken dialogue processing apparatus and program thereof |
US7958443B2 (en) | 2003-02-28 | 2011-06-07 | Dictaphone Corporation | System and method for structuring speech recognized text into a pre-selected document format |
US20040243545A1 (en) | 2003-05-29 | 2004-12-02 | Dictaphone Corporation | Systems and methods utilizing natural language medical records |
US20050144184A1 (en) | 2003-10-01 | 2005-06-30 | Dictaphone Corporation | System and method for document section segmentation |
US7996223B2 (en) | 2003-10-01 | 2011-08-09 | Dictaphone Corporation | System and method for post processing speech recognition output |
US20050273365A1 (en) | 2004-06-04 | 2005-12-08 | Agfa Corporation | Generalized approach to structured medical reporting |
US7502741B2 (en) | 2005-02-23 | 2009-03-10 | Multimodal Technologies, Inc. | Audio signal de-identification |
- 2004
  - 2004-08-20 US US10/923,517 patent/US7584103B2/en not_active Expired - Fee Related
- 2005
  - 2005-08-18 ES ES05789851T patent/ES2394726T3/en active Active
  - 2005-08-18 JP JP2007528000A patent/JP4940139B2/en not_active Expired - Fee Related
  - 2005-08-18 DK DK05789851.2T patent/DK1787288T3/en active
  - 2005-08-18 PL PL05789851T patent/PL1787288T3/en unknown
  - 2005-08-18 WO PCT/US2005/029354 patent/WO2006023622A2/en active Application Filing
  - 2005-08-18 CA CA2577721A patent/CA2577721C/en not_active Expired - Fee Related
  - 2005-08-18 EP EP05789851A patent/EP1787288B1/en not_active Not-in-force
- 2008
  - 2008-10-17 US US12/253,241 patent/US20090048833A1/en not_active Abandoned
Patent Citations (83)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5526407A (en) * | 1991-09-30 | 1996-06-11 | Riverrun Technology | Method and apparatus for managing information |
US5594638A (en) * | 1993-12-29 | 1997-01-14 | First Opinion Corporation | Computerized medical diagnostic system including re-enter function and sensitivity factors |
US6061675A (en) * | 1995-05-31 | 2000-05-09 | Oracle Corporation | Methods and apparatus for classifying terminology utilizing a knowledge catalog |
US5701469A (en) * | 1995-06-07 | 1997-12-23 | Microsoft Corporation | Method and system for generating accurate search results using a content-index |
US5983187A (en) * | 1995-12-15 | 1999-11-09 | Hewlett-Packard Company | Speech data storage organizing system using form field indicators |
US6684188B1 (en) * | 1996-02-02 | 2004-01-27 | Geoffrey C Mitchell | Method for production of medical records and other technical documents |
US5835893A (en) * | 1996-02-15 | 1998-11-10 | Atr Interpreting Telecommunications Research Labs | Class-based word clustering for speech recognition using a three-level balanced hierarchical similarity |
US5870706A (en) * | 1996-04-10 | 1999-02-09 | Lucent Technologies, Inc. | Method and apparatus for an improved language recognition system |
US5823948A (en) * | 1996-07-08 | 1998-10-20 | Rlis, Inc. | Medical records, documentation, tracking and order entry system |
US5995936A (en) * | 1997-02-04 | 1999-11-30 | Brais; Louis | Report generation system and method for capturing prose, audio, and video by voice command and automatically linking sound and image to formatted text locations |
US20080059232A1 (en) * | 1997-03-13 | 2008-03-06 | Clinical Decision Support, Llc | Disease management system and method including question version |
US6182039B1 (en) * | 1998-03-24 | 2001-01-30 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus using probabilistic language model based on confusable sets for speech recognition |
US20030191627A1 (en) * | 1998-05-28 | 2003-10-09 | Lawrence Au | Topological methods to organize semantic network data flows for conversational applications |
US6526380B1 (en) * | 1999-03-26 | 2003-02-25 | Koninklijke Philips Electronics N.V. | Speech recognition system having parallel large vocabulary recognition engines |
US6345249B1 (en) * | 1999-07-07 | 2002-02-05 | International Business Machines Corp. | Automatic analysis of a speech dictated document |
US20040117189A1 (en) * | 1999-11-12 | 2004-06-17 | Bennett Ian M. | Query engine for processing voice based queries including semantic decoding |
US20050086059A1 (en) * | 1999-11-12 | 2005-04-21 | Bennett Ian M. | Partial speech processing device & method for use in distributed systems |
US20040030556A1 (en) * | 1999-11-12 | 2004-02-12 | Bennett Ian M. | Speech based learning/training system using semantic decoding |
US7624007B2 (en) * | 1999-11-12 | 2009-11-24 | Phoenix Solutions, Inc. | System and method for natural language processing of sentence based queries |
US7555431B2 (en) * | 1999-11-12 | 2009-06-30 | Phoenix Solutions, Inc. | Method for processing speech using dynamic grammars |
US6738784B1 (en) * | 2000-04-06 | 2004-05-18 | Dictaphone Corporation | Document and information processing system |
US7054812B2 (en) * | 2000-05-16 | 2006-05-30 | Canon Kabushiki Kaisha | Database annotation and retrieval |
US20030181790A1 (en) * | 2000-05-18 | 2003-09-25 | Daniel David | Methods and apparatus for facilitated, hierarchical medical diagnosis and symptom coding and definition |
US6662168B1 (en) * | 2000-05-19 | 2003-12-09 | International Business Machines Corporation | Coding system for high data volume |
US20040030688A1 (en) * | 2000-05-31 | 2004-02-12 | International Business Machines Corporation | Information search using knowledge agents |
US7031908B1 (en) * | 2000-06-01 | 2006-04-18 | Microsoft Corporation | Creating a language model for a language processing system |
US20050216443A1 (en) * | 2000-07-06 | 2005-09-29 | Streamsage, Inc. | Method and system for indexing and searching timed media information based upon relevance intervals |
US6785651B1 (en) * | 2000-09-14 | 2004-08-31 | Microsoft Corporation | Method and apparatus for performing plan-based dialog |
US20040078215A1 (en) * | 2000-11-22 | 2004-04-22 | Recare, Inc. | Systems and methods for documenting medical findings of a physical examination |
US20020099717A1 (en) * | 2001-01-24 | 2002-07-25 | Gordon Bennett | Method for report generation in an on-line transcription system |
WO2002071391A2 (en) * | 2001-03-01 | 2002-09-12 | International Business Machines Corporation | Hierarchical language models |
US20090228126A1 (en) * | 2001-03-09 | 2009-09-10 | Steven Spielberg | Method and apparatus for annotating a line-based document |
US7216073B2 (en) * | 2001-03-13 | 2007-05-08 | Intelligate, Ltd. | Dynamic natural language understanding |
US6834264B2 (en) * | 2001-03-29 | 2004-12-21 | Provox Technologies Corporation | Method and apparatus for voice dictation and document production |
US20030018470A1 (en) * | 2001-04-13 | 2003-01-23 | Golden Richard M. | System and method for automatic semantic coding of free response data using Hidden Markov Model methodology |
US7519529B1 (en) * | 2001-06-29 | 2009-04-14 | Microsoft Corporation | System and methods for inferring informational goals and preferred level of detail of results in response to questions posed to an automated information-retrieval or question-answering service |
US20030065503A1 (en) * | 2001-09-28 | 2003-04-03 | Philips Electronics North America Corp. | Multi-lingual transcription system |
US20030069760A1 (en) * | 2001-10-04 | 2003-04-10 | Arthur Gelber | System and method for processing and pre-adjudicating patient benefit claims |
US20040249667A1 (en) * | 2001-10-18 | 2004-12-09 | Oon Yeong K | System and method of improved recording of medical transactions |
US7555425B2 (en) * | 2001-10-18 | 2009-06-30 | Oon Yeong K | System and method of improved recording of medical transactions |
US20030101054A1 (en) * | 2001-11-27 | 2003-05-29 | Ncc, Llc | Integrated system and method for electronic speech recognition and transcription |
US20050154690A1 (en) * | 2002-02-04 | 2005-07-14 | Celestar Lexico-Sciences, Inc | Document knowledge management apparatus and method |
US20040019482A1 (en) * | 2002-04-19 | 2004-01-29 | Holub John M. | Speech to text system using controlled vocabulary indices |
US7197460B1 (en) * | 2002-04-23 | 2007-03-27 | At&T Corp. | System for handling frequently asked questions in a natural language dialog service |
US7869998B1 (en) * | 2002-04-23 | 2011-01-11 | At&T Intellectual Property Ii, L.P. | Voice-enabled dialog system |
US20070288212A1 (en) * | 2002-08-19 | 2007-12-13 | General Electric Company | System And Method For Optimizing Simulation Of A Discrete Event Process Using Business System Data |
US20060041836A1 (en) * | 2002-09-12 | 2006-02-23 | Gordon T J | Information documenting system with improved speed, completeness, retrievability and granularity |
US20040064317A1 (en) * | 2002-09-26 | 2004-04-01 | Konstantin Othmer | System and method for online transcription services |
US20040102957A1 (en) * | 2002-11-22 | 2004-05-27 | Levin Robert E. | System and method for speech translation using remote devices |
US20040111265A1 (en) * | 2002-12-06 | 2004-06-10 | Forbes Joseph S | Method and system for sequential insertion of speech recognition results to facilitate deferred transcription services |
US20050114129A1 (en) * | 2002-12-06 | 2005-05-26 | Watson Kirk L. | Method and system for server-based sequential insertion processing of speech recognition results |
US20040148170A1 (en) * | 2003-01-23 | 2004-07-29 | Alejandro Acero | Statistical classifiers for spoken language understanding and command/control scenarios |
US20040243614A1 (en) * | 2003-05-30 | 2004-12-02 | Dictaphone Corporation | Method, system, and apparatus for validation |
US20050091059A1 (en) * | 2003-08-29 | 2005-04-28 | Microsoft Corporation | Assisted multi-modal dialogue |
US20050065774A1 (en) * | 2003-09-20 | 2005-03-24 | International Business Machines Corporation | Method of self enhancement of search results through analysis of system logs |
US20050114122A1 (en) * | 2003-09-25 | 2005-05-26 | Dictaphone Corporation | System and method for customizing speech recognition input and output |
US20050120300A1 (en) * | 2003-09-25 | 2005-06-02 | Dictaphone Corporation | Method, system, and apparatus for assembly, transport and display of clinical data |
US20050086056A1 (en) * | 2003-09-25 | 2005-04-21 | Fuji Photo Film Co., Ltd. | Voice recognition system and program |
US20050234891A1 (en) * | 2004-03-15 | 2005-10-20 | Yahoo! Inc. | Search systems and methods with integration of user annotations |
US20050228815A1 (en) * | 2004-03-31 | 2005-10-13 | Dictaphone Corporation | Categorization of information using natural language processing and predefined templates |
US20050240439A1 (en) * | 2004-04-15 | 2005-10-27 | Artificial Medical Intelligence, Inc, | System and method for automatic assignment of medical codes to unformatted data |
US20050288930A1 (en) * | 2004-06-09 | 2005-12-29 | Vaastek, Inc. | Computer voice recognition apparatus and method |
US20060007188A1 (en) * | 2004-07-09 | 2006-01-12 | Gesturerad, Inc. | Gesture-based reporting method and system |
US20060020886A1 (en) * | 2004-07-15 | 2006-01-26 | Agrawal Subodh K | System and method for the structured capture of information and the generation of semantically rich reports |
US20060020466A1 (en) * | 2004-07-26 | 2006-01-26 | Cousineau Leo E | Ontology based medical patient evaluation method for data capture and knowledge representation |
US7584103B2 (en) * | 2004-08-20 | 2009-09-01 | Multimodal Technologies, Inc. | Automated extraction of semantic content and generation of a structured document from speech |
US20060074656A1 (en) * | 2004-08-20 | 2006-04-06 | Lambert Mathias | Discriminative training of document transcription system |
US20100299135A1 (en) * | 2004-08-20 | 2010-11-25 | Juergen Fritsch | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech |
US20060041428A1 (en) * | 2004-08-20 | 2006-02-23 | Juergen Fritsch | Automated extraction of semantic content and generation of a structured document from speech |
US20060089857A1 (en) * | 2004-10-21 | 2006-04-27 | Zimmerman Roger S | Transcription data security |
US20060129435A1 (en) * | 2004-12-15 | 2006-06-15 | Critical Connection Inc. | System and method for providing community health data services |
US20070043761A1 (en) * | 2005-08-22 | 2007-02-22 | The Personal Bee, Inc. | Semantic discovery engine |
US20090228299A1 (en) * | 2005-11-09 | 2009-09-10 | The Regents Of The University Of California | Methods and apparatus for context-sensitive telemedicine |
US20070179777A1 (en) * | 2005-12-22 | 2007-08-02 | Rakesh Gupta | Automatic Grammar Generation Using Distributedly Collected Knowledge |
US7610192B1 (en) * | 2006-03-22 | 2009-10-27 | Patrick William Jamieson | Process and system for high precision coding of free text documents against a standard lexicon |
US20070226211A1 (en) * | 2006-03-27 | 2007-09-27 | Heinze Daniel T | Auditing the Coding and Abstracting of Documents |
US20070237427A1 (en) * | 2006-04-10 | 2007-10-11 | Patel Nilesh V | Method and system for simplified recordkeeping including transcription and voting based verification |
US20070239445A1 (en) * | 2006-04-11 | 2007-10-11 | International Business Machines Corporation | Method and system for automatic transcription prioritization |
US7716040B2 (en) * | 2006-06-22 | 2010-05-11 | Multimodal Technologies, Inc. | Verification of extracted data |
US20080168343A1 (en) * | 2007-01-05 | 2008-07-10 | Doganata Yurdaer N | System and Method of Automatically Mapping a Given Annotator to an Aggregate of Given Annotators |
US20090055168A1 (en) * | 2007-08-23 | 2009-02-26 | Google Inc. | Word Detection |
US20100076761A1 (en) * | 2008-09-25 | 2010-03-25 | Fritsch Juergen | Decoding-Time Prediction of Non-Verbalized Tokens |
US20100185685A1 (en) * | 2009-01-13 | 2010-07-22 | Chew Peter A | Technique for Information Retrieval Using Enhanced Latent Semantic Analysis |
Cited By (52)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060025670A1 (en) * | 2004-07-07 | 2006-02-02 | Young Kim | System and method for efficient diagnostic analysis of ophthalmic examinations |
US7793217B1 (en) * | 2004-07-07 | 2010-09-07 | Young Kim | System and method for automated report generation of ophthalmic examinations from digital drawings |
US7818041B2 (en) | 2004-07-07 | 2010-10-19 | Young Kim | System and method for efficient diagnostic analysis of ophthalmic examinations |
US20060041427A1 (en) * | 2004-08-20 | 2006-02-23 | Girija Yegnanarayanan | Document transcription system training |
US8335688B2 (en) | 2004-08-20 | 2012-12-18 | Multimodal Technologies, Llc | Document transcription system training |
US20100299135A1 (en) * | 2004-08-20 | 2010-11-25 | Juergen Fritsch | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech |
US7908141B2 (en) * | 2004-09-29 | 2011-03-15 | International Business Machines Corporation | Extracting and utilizing metadata to improve accuracy in speech to text conversions |
US20060074895A1 (en) * | 2004-09-29 | 2006-04-06 | International Business Machines Corporation | Method and system for extracting and utilizing metadata to improve accuracy in speech to text conversions |
US8768706B2 (en) | 2005-07-22 | 2014-07-01 | Multimodal Technologies, Llc | Content-based audio playback emphasis |
US20100318347A1 (en) * | 2005-07-22 | 2010-12-16 | Kjell Schubert | Content-Based Audio Playback Emphasis |
US8666742B2 (en) | 2005-11-08 | 2014-03-04 | Mmodal Ip Llc | Automatic detection and application of editing patterns in draft documents |
US20110131486A1 (en) * | 2006-05-25 | 2011-06-02 | Kjell Schubert | Replacing Text Representing a Concept with an Alternate Written Form of the Concept |
US20100211869A1 (en) * | 2006-06-22 | 2010-08-19 | Detlef Koll | Verification of Extracted Data |
US8321199B2 (en) | 2006-06-22 | 2012-11-27 | Multimodal Technologies, Llc | Verification of extracted data |
US20070299665A1 (en) * | 2006-06-22 | 2007-12-27 | Detlef Koll | Automatic Decision Support |
US9892734B2 (en) | 2006-06-22 | 2018-02-13 | Mmodal Ip Llc | Automatic decision support |
US8560314B2 (en) | 2006-06-22 | 2013-10-15 | Multimodal Technologies, Llc | Applying service levels to transcripts |
US9070368B2 (en) | 2006-08-31 | 2015-06-30 | At&T Intellectual Property Ii, L.P. | Method and system for providing an automated web transcription service |
US8775176B2 (en) | 2006-08-31 | 2014-07-08 | At&T Intellectual Property Ii, L.P. | Method and system for providing an automated web transcription service |
US20080059173A1 (en) * | 2006-08-31 | 2008-03-06 | At&T Corp. | Method and system for providing an automated web transcription service |
US8521510B2 (en) * | 2006-08-31 | 2013-08-27 | At&T Intellectual Property Ii, L.P. | Method and system for providing an automated web transcription service |
US20080177623A1 (en) * | 2007-01-24 | 2008-07-24 | Juergen Fritsch | Monitoring User Interactions With A Document Editing System |
US20120159316A1 (en) * | 2007-01-24 | 2012-06-21 | Cerner Innovation, Inc. | Multi-modal entry for electronic clinical documentation |
US9069746B2 (en) * | 2007-01-24 | 2015-06-30 | Cerner Innovation, Inc. | Multi-modal entry for electronic clinical documentation |
US20080273774A1 (en) * | 2007-05-04 | 2008-11-06 | Maged Mikhail | System and methods for capturing a medical drawing or sketch for generating progress notes, diagnosis and billing codes |
US8504372B2 (en) | 2008-08-29 | 2013-08-06 | Mmodal Ip Llc | Distributed speech recognition using one way communication |
US9262397B2 (en) | 2010-10-08 | 2016-02-16 | Microsoft Technology Licensing, Llc | General purpose correction of grammatical and word usage errors |
US8959102B2 (en) | 2010-10-08 | 2015-02-17 | Mmodal Ip Llc | Structured searching of dynamic structured document corpuses |
US9477662B2 (en) | 2011-02-18 | 2016-10-25 | Mmodal Ip Llc | Computer-assisted abstraction for reporting of quality measures |
US9275643B2 (en) | 2011-06-19 | 2016-03-01 | Mmodal Ip Llc | Document extension in dictation-based document generation workflow |
US9836447B2 (en) * | 2011-07-28 | 2017-12-05 | Microsoft Technology Licensing, Llc | Linguistic error detection |
US20150006159A1 (en) * | 2011-07-28 | 2015-01-01 | Microsoft Corporation | Linguistic error detection |
US8855997B2 (en) * | 2011-07-28 | 2014-10-07 | Microsoft Corporation | Linguistic error detection |
US20130030793A1 (en) * | 2011-07-28 | 2013-01-31 | Microsoft Corporation | Linguistic error detection |
US9330661B2 (en) * | 2011-07-31 | 2016-05-03 | Nuance Communications, Inc. | Accuracy improvement of spoken queries transcription using co-occurrence information |
US20140136197A1 (en) * | 2011-07-31 | 2014-05-15 | Jonathan Mamou | Accuracy improvement of spoken queries transcription using co-occurrence information |
US9633653B1 (en) | 2011-12-27 | 2017-04-25 | Amazon Technologies, Inc. | Context-based utterance recognition |
US9009025B1 (en) * | 2011-12-27 | 2015-04-14 | Amazon Technologies, Inc. | Context-based utterance recognition |
US9679077B2 (en) | 2012-06-29 | 2017-06-13 | Mmodal Ip Llc | Automated clinical evidence sheet workflow |
US20140343963A1 (en) * | 2013-03-15 | 2014-11-20 | Mmodal Ip Llc | Dynamic Superbill Coding Workflow |
US20140278553A1 (en) * | 2013-03-15 | 2014-09-18 | Mmodal Ip Llc | Dynamic Superbill Coding Workflow |
US10950329B2 (en) | 2015-03-13 | 2021-03-16 | Mmodal Ip Llc | Hybrid human and computer-assisted coding workflow |
US10567850B2 (en) | 2016-08-26 | 2020-02-18 | International Business Machines Corporation | Hierarchical video concept tagging and indexing system for learning content orchestration |
US11095953B2 (en) | 2016-08-26 | 2021-08-17 | International Business Machines Corporation | Hierarchical video concept tagging and indexing system for learning content orchestration |
US11043306B2 (en) | 2017-01-17 | 2021-06-22 | 3M Innovative Properties Company | Methods and systems for manifestation and transmission of follow-up notifications |
US20210296010A1 (en) * | 2017-01-17 | 2021-09-23 | 3M Innovative Properties Company | Methods and Systems for Manifestation and Transmission of Follow-Up Notifications |
US11699531B2 (en) * | 2017-01-17 | 2023-07-11 | 3M Innovative Properties Company | Methods and systems for manifestation and transmission of follow-up notifications |
US11282596B2 (en) | 2017-11-22 | 2022-03-22 | 3M Innovative Properties Company | Automated code feedback system |
US12131810B2 (en) | 2017-11-22 | 2024-10-29 | Solventum Intellectual Properties Company | Automated code feedback system |
US11455497B2 (en) * | 2018-07-23 | 2022-09-27 | Accenture Global Solutions Limited | Information transition management platform |
US11062704B1 (en) | 2018-12-21 | 2021-07-13 | Cerner Innovation, Inc. | Processing multi-party conversations |
US11869501B2 (en) | 2018-12-21 | 2024-01-09 | Cerner Innovation, Inc. | Processing multi-party conversations |
Also Published As
Publication number | Publication date |
---|---|
ES2394726T3 (en) | 2013-02-05 |
JP2008511024A (en) | 2008-04-10 |
PL1787288T3 (en) | 2013-01-31 |
EP1787288A4 (en) | 2008-10-08 |
EP1787288A2 (en) | 2007-05-23 |
US20060041428A1 (en) | 2006-02-23 |
CA2577721A1 (en) | 2006-03-02 |
US7584103B2 (en) | 2009-09-01 |
WO2006023622A2 (en) | 2006-03-02 |
DK1787288T3 (en) | 2012-10-29 |
EP1787288B1 (en) | 2012-08-15 |
WO2006023622A3 (en) | 2007-04-12 |
JP4940139B2 (en) | 2012-05-30 |
CA2577721C (en) | 2015-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7584103B2 (en) | 2009-09-01 | Automated extraction of semantic content and generation of a structured document from speech
US20100299135A1 (en) | 2010-11-25 | Automated Extraction of Semantic Content and Generation of a Structured Document from Speech
US9552809B2 (en) | | Document transcription system training
US9520124B2 (en) | | Discriminative training of document transcription system
US8666742B2 (en) | 2014-03-04 | Automatic detection and application of editing patterns in draft documents
US7383172B1 (en) | | Process and system for semantically recognizing, correcting, and suggesting domain specific speech
US7668718B2 (en) | | Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
US20120296639A1 (en) | | Verification of Extracted Data
WO2006034152A2 (en) | | Discriminative training of document transcription system
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AS | Assignment | Owner name: MULTIMODAL TECHNOLOGIES, INC., PENNSYLVANIA. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FRITSCH, JUERGEN;FINKE, MICHAEL;KOLL, DETLEF;AND OTHERS;REEL/FRAME:022074/0624;SIGNING DATES FROM 20081210 TO 20081223 |
| | AS | Assignment | Owner name: MULTIMODAL TECHNOLOGIES, LLC, PENNSYLVANIA. Free format text: CHANGE OF NAME;ASSIGNOR:MULTIMODAL TECHNOLOGIES, INC.;REEL/FRAME:027061/0492. Effective date: 20110818 |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
| | AS | Assignment | Owner name: ROYAL BANK OF CANADA, AS ADMINISTRATIVE AGENT, ONT. Free format text: SECURITY AGREEMENT;ASSIGNORS:MMODAL IP LLC;MULTIMODAL TECHNOLOGIES, LLC;POIESIS INFOMATICS INC.;REEL/FRAME:028824/0459. Effective date: 20120817 |
| | AS | Assignment | Owner name: MULTIMODAL TECHNOLOGIES, LLC, PENNSYLVANIA. Free format text: RELEASE OF SECURITY INTEREST;ASSIGNOR:ROYAL BANK OF CANADA, AS ADMINISTRATIVE AGENT;REEL/FRAME:033459/0987. Effective date: 20140731 |
| | AS | Assignment | Owner name: WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT, NEW YORK. Free format text: SECURITY AGREEMENT;ASSIGNOR:MMODAL IP LLC;REEL/FRAME:034047/0527. Effective date: 20140731 |
| | AS | Assignment | Owner name: CORTLAND CAPITAL MARKET SERVICES LLC, ILLINOIS. Free format text: PATENT SECURITY AGREEMENT;ASSIGNOR:MULTIMODAL TECHNOLOGIES, LLC;REEL/FRAME:033958/0511. Effective date: 20140731 |
| | AS | Assignment | Owner name: MULTIMODAL TECHNOLOGIES, LLC, PENNSYLVANIA. Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CORTLAND CAPITAL MARKET SERVICES LLC, AS ADMINISTRATIVE AGENT;REEL/FRAME:048210/0792. Effective date: 20190201 |
| | AS | Assignment | Owner names: MMODAL IP LLC, TENNESSEE; MEDQUIST OF DELAWARE, INC., TENNESSEE; MMODAL MQ INC., TENNESSEE; MEDQUIST CM LLC, TENNESSEE; MULTIMODAL TECHNOLOGIES, LLC, TENNESSEE. Free format text: TERMINATION AND RELEASE OF SECURITY INTEREST IN PATENTS;ASSIGNOR:WELLS FARGO BANK, NATIONAL ASSOCIATION, AS AGENT;REEL/FRAME:048411/0712. Effective date: 20190201 |