MXPA04012865A - Metadata preparing device, preparing method therefor and retrieving device. - Google Patents
Metadata preparing device, preparing method therefor and retrieving device.
- Publication number
- MXPA04012865A
- Authority
- MX
- Mexico
- Prior art keywords
- content
- metadata
- file
- speech recognition
- voice
- Prior art date
Links
- 238000000034 method Methods 0.000 title description 4
- 238000004519 manufacturing process Methods 0.000 claims description 53
- 230000001419 dependent effect Effects 0.000 claims description 6
- 230000010365 information processing Effects 0.000 claims description 5
- 230000001360 synchronised effect Effects 0.000 claims description 5
- 239000000284 extract Substances 0.000 claims description 2
- 238000012544 monitoring process Methods 0.000 abstract 1
- 238000010586 diagram Methods 0.000 description 19
- 244000236655 Diospyros kaki Species 0.000 description 6
- 235000008597 Diospyros kaki Nutrition 0.000 description 5
- 230000015654 memory Effects 0.000 description 5
- 230000005236 sound signal Effects 0.000 description 5
- 241000237502 Ostreidae Species 0.000 description 4
- 238000005304 joining Methods 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 235000020636 oyster Nutrition 0.000 description 4
- 150000003839 salts Chemical class 0.000 description 4
- 239000004065 semiconductor Substances 0.000 description 4
- 238000011084 recovery Methods 0.000 description 3
- 230000015572 biosynthetic process Effects 0.000 description 2
- 238000012790 confirmation Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000002360 preparation method Methods 0.000 description 2
- 238000003786 synthesis reaction Methods 0.000 description 2
- 238000012876 topography Methods 0.000 description 2
- 235000011511 Diospyros Nutrition 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 238000010411 cooking Methods 0.000 description 1
- 230000008878 coupling Effects 0.000 description 1
- 238000010168 coupling process Methods 0.000 description 1
- 238000005859 coupling reaction Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000003756 stirring Methods 0.000 description 1
- 235000013311 vegetables Nutrition 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/235—Processing of additional data, e.g. scrambling of additional data or processing content descriptors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/41—Structure of client; Structure of client peripherals
- H04N21/422—Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
- H04N21/42203—Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS] sound input device, e.g. microphone
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/41—Structure of client; Structure of client peripherals
- H04N21/422—Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
- H04N21/4223—Cameras
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/433—Content storage operation, e.g. storage operation in response to a pause request, caching operations
- H04N21/4334—Recording operations
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440236—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by media transcoding, e.g. video is transformed into a slideshow of still pictures, audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/84—Generation or processing of descriptive data, e.g. content descriptors
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Library & Information Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Indexing, Searching, Synchronizing, And The Amount Of Synchronization Travel Of Record Carriers (AREA)
Abstract
A metadata preparing device comprising a content reproducing unit (1) for reproducing and outputting content, a monitor (3) for monitoring the content reproduced by the content reproducing unit, a voice input unit (4), a voice recognition unit (5) for recognizing a voice signal input from the voice input unit, a metadata generation unit (6) for converting information recognized by the voice recognition unit into metadata, and an identification information imparting unit (7) for acquiring, from the reproduced content supplied by the content reproducing unit, identification information that identifies respective parts of the content and imparting it to the metadata, whereby the generated metadata is associated with the respective parts of the content.
Description
WO 2004/002144 A1
METADATA PRODUCTION DEVICE, PRODUCTION METHOD THEREFOR, AND RETRIEVAL DEVICE
TECHNICAL FIELD
The present invention relates to metadata production devices and metadata production methods for producing metadata related to created video or audio content or the like. The present invention also relates to retrieval devices that search content using the produced metadata.
BACKGROUND ART
In recent years, created audio or video content or the like has come to be provided with metadata related to that content. However, in the conventional task of attaching metadata, it was common to confirm the information that is supposed to serve as metadata while reproducing the created audio or video content, based on a scenario or script of that content, and to produce the metadata by entering it manually into a computer. Consequently, the production of metadata required considerable effort. JP H09-130736A discloses a system that attaches tags using speech recognition while shooting with a camera. However, that system is used at the same time the images are taken, and cannot be applied to attach metadata to content that has already been created.
BRIEF DESCRIPTION OF THE INVENTION
It is therefore an object of the present invention to solve the problems described above, and to provide a metadata production device and a metadata production method with which metadata can easily be produced through voice input for content that has already been created. It is another object of the present invention to provide a search device with which content can easily be searched using the produced metadata.
A metadata production device according to the present invention includes: a content reproduction part that reproduces and outputs content; a voice input part; a voice recognition part that recognizes voice signals input from the voice input part; a metadata generation part that converts information recognized by the voice recognition part into metadata; and an identification information attaching part that obtains, from the reproduced content supplied from the content reproduction part, identification information for identifying positions within the content and attaches the identification information to the metadata; whereby the generated metadata is associated with positions in the content.
A metadata production method of the present invention includes: inputting, by voice, information related to given content; subjecting the input voice signal to speech recognition with a speech recognition device; converting the recognized voice information into metadata; and attaching identification information supplied with the content for identifying positions in the content to the metadata, thereby associating the generated metadata with the positions in the content.
A metadata search device according to the present invention includes: a content database that reproduces and outputs content; a voice input part that converts voice signals of entered keywords into data with a counter signal that is synchronized with a synchronization signal of the reproduced content; a voice recognition part that recognizes the keywords from the voice signal data that have been converted into data by the voice input part; a file processing part that produces a metadata file by combining the keywords output from the voice recognition part with time codes indicating a time position of an image signal included in the content; a content information file processing part that generates a control file that controls a relationship between the metadata file and the recording positions of the content file; a recording part that records the content file, the metadata file and the control file; and a search part that extracts a recording position corresponding to a keyword in the content file by specifying the metadata files in which an entered search keyword is included and by referring to the control file, the recording position of the content file corresponding to the recording position in the recording part.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a block diagram showing the configuration of a metadata production device according to Embodiment 1 of the present invention. Fig. 2 is a diagram showing an example of metadata to which a time code is attached, according to Embodiment 1 of the present invention. Fig. 3 is a block diagram showing the configuration of a metadata production device according to Embodiment 2 of the present invention. Fig. 4 is a diagram showing an example of a still image content/metadata display part in that device. Fig. 5 is a block diagram showing another configuration of a metadata production device according to Embodiment 2 of the present invention. Fig. 6 is a block diagram showing the configuration of a metadata production device according to Embodiment 3 of the present invention. Fig. 7 is a diagram showing an example of the dictionary DB in the device of that embodiment. Fig. 8 is a diagram showing a recipe, that is, an example of a content scenario to which the device of this embodiment can be applied. Fig. 9 is a diagram of data in text format showing an example of a metadata file produced with the device of this embodiment. Fig. 10 is a block diagram showing the configuration of a metadata production device according to Embodiment 4 of the present invention. Fig. 11 is a diagram showing an example of a control file produced with the device of this embodiment. Fig. 12 is a block diagram showing the configuration of a metadata search device according to Embodiment 5 of the present invention. Fig. 13 is a block diagram showing the configuration of a metadata production device according to Embodiment 6 of the present invention.
BEST MODE FOR CARRYING OUT THE INVENTION
With the metadata production device according to the present invention, metadata or tags are produced through voice input using speech recognition in order to produce metadata or attach tags related to content, and the metadata or tags are associated with scenes or times in the content. Therefore, metadata that conventionally had to be produced through keyboard input can be produced automatically through voice input. It should be noted that "metadata" ordinarily means a set of tags, and that what is referred to as "metadata" throughout this specification also includes individual tags themselves. In addition, "content" is used to mean everything that is ordinarily referred to as content, such as still image content, audio content, created video, or audio or video content in a database or the like.
It is preferable that the metadata production device further comprises a dictionary related to the content, wherein, when the voice signals input from the voice input part are recognized by the voice recognition part, the recognition is carried out in association with the dictionary. With this configuration, it is possible to input, as voice signals, keywords that have been extracted in advance from the scenario or the like of the created content, to set a dictionary file based on the scenario, and to assign a recognition priority to the keywords, so that the metadata can be generated efficiently and accurately by the speech recognition means. In addition, the voice signals may be recognized by the voice recognition part word by word, in association with the dictionary.
It is also preferable that the metadata production device further comprises an information processing part that includes a keyboard, so that the metadata can be corrected through the information processing part by keyboard input.
Time code information that is attached to the content can be used as the identification information. Alternatively, content addresses, numbers or frame numbers attached to the content can be used as the identification information. In addition, the content may be still image content, and the addresses of the still image content may be used as the identification information.
As an example of an application of the present invention, the metadata production device can be configured as follows: The content reproduction part is configured by a content database, and the voice input part supplies to the voice recognition part voice signals of entered keywords that are converted into data with a counter signal that is synchronized with a synchronization signal supplied from the content database. The voice recognition part is configured to recognize the keywords from the voice signal data that have been converted into data by the voice input part. And the metadata generation part is configured as a file processing part that produces a metadata file by using, as the identification information, a time code indicating a time position of an image signal included in the content, and by combining the keywords output from the voice recognition part with that time code. With this configuration, metadata can be attached efficiently, even at intervals of several seconds. Consequently, it is possible to produce metadata at short time intervals, which is difficult with conventional keyboard input.
In this configuration, it is preferable that the metadata production device further comprises a recording part that records the content supplied from the content database, together with the metadata file, as a content file. It is also preferable that the metadata production device further comprises a content information file processing part that generates a control file which controls the relationship between the metadata file and the recording positions at which the content file is to be recorded, and that the control file is recorded in the recording part together with the content file and the metadata file. It is also preferable that the metadata production device further comprises a dictionary database, wherein the voice recognition part can choose a dictionary of a genre corresponding to the content from a plurality of genre-dependent dictionaries. It is further preferable that keywords related to the content can be supplied to the voice recognition part, and that the voice recognition part is configured to recognize those keywords with higher priority.
In the method for producing metadata, it is preferable that the information related to the content is input by voice while the content is displayed on a reproduction monitor. It is also preferable that a dictionary related to the content is used, and that the input voice signals are recognized by the speech recognition device in association with the dictionary. It is also preferable that time code information attached to the content is used as the identification information. It is also preferable that the content is still image content, and that addresses of the still image content are used as the identification information.
With the metadata search device of the present invention, it is possible to quickly search for a desired location in the content based on metadata, using a control file that indicates the recording positions of the content and a metadata file that indicates time codes and metadata.
In the metadata search device of the present invention, it is preferable that the control file output from the content information file processing part is conceived as a table listing the recording positions of the content in the recording part according to a recording time of the content, and that the recording position of the content can be searched by time code. It is further preferable that the metadata search device further comprises a dictionary database and a keyword supply part that supplies keywords related to the content to the voice recognition part, that the voice recognition part can choose a dictionary of a genre corresponding to the content from a plurality of genre-dependent dictionaries, and that the voice recognition part is configured to recognize those keywords with higher priority. It is moreover preferable that the metadata search device further comprises a dictionary database, that the voice recognition part can choose a dictionary of a genre corresponding to the content from a plurality of genre-dependent dictionaries, and that the search part is configured to search with keywords that are chosen from a common dictionary used by the voice recognition part. The following is a more detailed explanation of the invention, with reference to the accompanying drawings.
Embodiment 1
Fig. 1 is a block diagram showing the configuration of a metadata production device according to Embodiment 1 of the present invention. A content reproduction part 1 is an element for confirming the created content during the production of metadata. The output of the content reproduction part 1 is supplied to a video monitor 2, an audio monitor 3 and a time code attaching part 7. A microphone 4 is provided as a voice input part for the production of metadata. The voice that is input with the microphone 4 is supplied to a voice recognition part 5. The voice recognition part 5 is connected to a dictionary 8 for speech recognition, and can refer to the data in the dictionary 8. The recognition output of the voice recognition part 5 is supplied to a metadata generation part 6, and the produced metadata are supplied to the time code attaching part 7, from which they can be output.
The content reproduction part 1 can be configured with a video/audio signal reproduction device such as a VTR, a hard disk device or an optical disc device, a video/audio signal reproduction device using a memory medium such as a semiconductor memory as a recording medium, or a video/audio signal reproduction device that reproduces video/audio signals provided through communication or broadcasting. The reproduced video signals are supplied from a video signal output terminal 1a of the content reproduction part 1 to the video monitor 2. The reproduced voice signals are supplied from a voice signal output terminal 1b to the audio monitor 3. The reproduced time codes are supplied from a time code output terminal 1c to the time code attaching part 7. It should be noted that the video monitor 2 and the audio monitor 3 are not necessarily required as elements of the metadata production device, and it is sufficient if they can be connected and used when necessary.
When the metadata is produced, the operator utters the metadata to be input into the microphone 4 while checking either the video monitor 2 or the audio monitor 3 or both, and, if necessary, referring to the scenario or script. The voice signals output from the microphone 4 are supplied to the voice recognition part 5. Further, if necessary, the data of the dictionary 8 for speech recognition are referred to by the voice recognition part 5. The voice data that have been recognized by the voice recognition part 5 are supplied to the metadata generation part 6 and converted into metadata. The metadata thus generated are provided, by the time code attaching part 7, with the time code information captured from the reproduced content supplied from the content reproduction part 1, in order to attach information that associates the time or scene of each part of the content with the metadata.
In order to explain the above operation in more detail, let us imagine, for example, a scenario in which the content is a recipe. In this case, when the operator says "salt: one tablespoon" into the microphone 4 while checking the display screen of the video monitor 2, then "salt" and "one tablespoon" are recognized by the voice recognition part 5 referring to the dictionary 8, and the data "salt" and "one tablespoon" are converted by the metadata generation part 6.
It should be noted that there is no particular limitation on the configuration of the voice recognition part 5; it is sufficient if speech recognition is performed using any commonly used speech recognition means and the data "salt" and "one tablespoon" can be recognized. It should be noted that, ordinarily, "metadata" means a set of such tags. As shown in FIG. 2, as a result of this speech recognition, metadata 9a is output from the metadata generation part 6 and supplied to the time code attaching part 7. In the time code attaching part 7, packet data are generated that consist of time-code-attached metadata 10 having a time code attached thereto, based on the time code signal 9b supplied from the content reproduction part 1. The generated metadata can be output as it is, or it can be stored in a recording medium such as a hard disk or the like. It should be noted that in this example the metadata is generated in packet form, but there is no limitation to this.
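The flow of Embodiment 1 — recognized words becoming metadata packets tagged with the current time code — can be illustrated with a minimal sketch. This is not the patent's implementation: the speech recognizer is stubbed out, and the names (`MetadataPacket`, `recognize_keywords`, `attach_time_code`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MetadataPacket:
    time_code: str        # e.g. "00:12:34:06" (HH:MM:SS:FF), as attached by part 7
    keywords: List[str]   # recognized words, e.g. ["salt", "tablespoon"]

def recognize_keywords(utterance: str, dictionary: set) -> List[str]:
    """Stand-in for voice recognition part 5: keep only words found in the
    scenario-limited dictionary 8 (a real recognizer works on audio)."""
    return [word for word in utterance.split() if word in dictionary]

def attach_time_code(keywords: List[str], time_code: str) -> MetadataPacket:
    """Corresponds to the time code attaching part 7: bundle the recognized
    metadata with the time code reproduced along with the content."""
    return MetadataPacket(time_code=time_code, keywords=keywords)

# The operator says "salt tablespoon" while the recipe scene is on screen.
dictionary = {"salt", "tablespoon", "oyster"}
packet = attach_time_code(recognize_keywords("salt tablespoon", dictionary), "00:12:34:06")
print(packet)   # MetadataPacket(time_code='00:12:34:06', keywords=['salt', 'tablespoon'])
```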
Embodiment 2
Fig. 3 is a block diagram showing the configuration of a metadata production device according to Embodiment 2 of the present invention. This embodiment is an example in which still image content is the subject of metadata production. In order to identify the still image content, this configuration correlates the generated metadata and the still image content using content addresses, which correspond to the time code in the case of moving images. In FIG. 3, a camera 11 is an element for creating still image content. The output of the camera 11 is recorded by a still image content recording part 12 with address information attached to it. Here, the recorded still image content and the address information are supplied to a still image content/metadata recording part 13 for the creation of metadata. The address information is further supplied to a metadata address attaching part 19. A microphone 16 is used for voice input of information related to the still images, and the output of the microphone 16 is supplied to a voice recognition part 17. The voice recognition part 17 is connected to a dictionary 20 for speech recognition, and can refer to the data in the dictionary 20. The recognition output of the voice recognition part 17 is supplied to a metadata generation part 18, and the produced metadata are supplied to the metadata address attaching part 19. The still image content and the metadata recorded by the still image content/metadata recording part 13 are reproduced by a still image content/metadata reproduction part 14 and displayed by a still image content/metadata display part 15.
The following is a more detailed description of the operation of a metadata production device with the configuration described above. The still image content taken with the camera 11 is recorded by the still image content recording part 12 on a recording medium (not shown in the drawings), and address information is attached to it, which is also recorded on the recording medium. The recording medium is ordinarily configured as a semiconductor memory, but there is no limitation to semiconductor memories, and it is possible to use any other recording medium, for example a magnetic memory, an optical recording medium or a magneto-optical recording medium. The recorded still image content is supplied through an output terminal 12a and an input terminal 13a, as well as through an output terminal 12b and an input terminal 13b, to the still image content/metadata recording part 13. The address information is further supplied through the output terminal 12b and an input terminal 19b to the metadata address attaching part 19. On the other hand, the information relating to the still images taken with the camera 11 is input through the microphone 16 into the voice recognition part 17. The information relating to the still images can be, for example, the title, the date and time at which the image was taken, the camera operator, the location of the image (where), people in the image (who), objects in the image (what), or the like. In addition, the data of the dictionary 20 for speech recognition are also supplied to the voice recognition part 17, if necessary. The voice data recognized by the voice recognition part 17 are supplied to the metadata generation part 18 and converted into metadata or tags.
It should be noted that ordinarily "metadata" is information related to the content, and means a set of tags, such as the title, the date and time at which the image was taken, the camera operator, the location of the image (where), people in the image (who), objects in the image (what), or the like. The tags or metadata thus generated are supplied to the metadata address attaching part 19, in order to attach information that associates them with the scenes or the still image content. In the metadata address attaching part 19, the address information supplied from the still image content recording part 12 is attached to the metadata. The address-attached metadata, to which the address information has thus been attached, are supplied to the still image content/metadata recording part 13 through an output terminal 19c and an input terminal 13c. The still image content with a given address is associated by the still image content/metadata recording part 13 with the metadata of the same address and is recorded.
In order to explain the address-attached metadata more specifically, FIG. 4 shows an example of reproducing, with the still image content/metadata reproduction part 14, the still image content and the metadata recorded by the still image content/metadata recording part 13, and displaying them with the still image content/metadata display part 15. The screen of the still image content/metadata display part 15 in FIG. 4, which is merely an example, is configured by a still image content display part 21, an address display part 22, and a metadata display region 23. The metadata display region 23 is configured by, for example, 1) a title display part 23a, 2) a date/time display part 23b, 3) a camera operator display part 23c, 4) a shooting location display part 23d, etc. These metadata are created from the voice data recognized by the voice recognition part 17 described above.
The operation described above relates to cases such as before the still image content is taken, at approximately the same time as the taking, or immediately after the still image content is taken, in which the creation of the metadata does not necessarily require confirmation of the still image content that has been taken. With reference to FIG. 5, the following is an explanation of the case in which the still image content is reproduced, and the metadata is created for the monitored still image content, in order to later attach the created metadata to the still image content. It should be noted that elements that are the same as in FIG. 3 are denoted by the same numerals, and additional explanations of their functions and the like have been omitted. In this case, a still image content/address reproduction part 24 is arranged between the still image content recording part 12 and the still image content/metadata recording part 13. In addition, a monitor 25 is provided, to which the output of the still image content/address reproduction part 24 is supplied. The still image content that is taken with the camera 11 and supplied to the still image content recording part 12 is recorded on a recording medium (not shown in the drawings), and an address is attached to it, which is also recorded on the recording medium. This recording medium is provided to the still image content/address reproduction part 24.
Consequently, still image content that has already been created can be reproduced, and the camera 11 and the still image content recording part 12 are not indispensable elements of the metadata production device used to create metadata for the still image content monitored on the monitor. The still image content reproduced with the still image content/address reproduction part 24 is supplied to the monitor 25. The address information that is reproduced in a similar manner is supplied through an output terminal 24b and the input terminal 19b to the metadata address attaching part 19. The user who creates the metadata utters the words necessary for the creation of the metadata into the microphone 16, after confirming the still image content displayed on the monitor 25. In this way, the information related to the still images taken with the camera 11 is input through the microphone 16 into the voice recognition part 17. The information related to the still images can be, for example, the title, the date and time the image was taken, the camera operator, the location of the image (where), people in the image (who), objects in the image (what), or the like. The subsequent operations are the same as those explained for the configuration of FIG. 3.
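A hypothetical sketch of Embodiment 2's address binding follows: each still image is identified by an address on the recording medium, and spoken annotations are stored under the same address so that image and metadata can be recorded and displayed together. The dictionaries and function names are illustrative, not the patent's data structures.

```python
still_image_store = {}   # address -> image data (stands in for recording part 13)
metadata_store = {}      # address -> tags       (same address keys the metadata)

def record_still_image(address: int, image_bytes: bytes) -> None:
    """Corresponds to the still image content recording part 12."""
    still_image_store[address] = image_bytes

def attach_metadata(address: int, recognized_tags: dict) -> None:
    """Corresponds to the metadata address attaching part 19: the address supplied
    with the image is attached to the tags produced by speech recognition."""
    metadata_store[address] = recognized_tags

record_still_image(42, b"...jpeg data...")
attach_metadata(42, {"title": "Harbor at dawn", "who": "K. Sato", "where": "Kobe"})

# Display side (parts 14/15): look up image and metadata by the shared address.
print(still_image_store[42] is not None, metadata_store[42]["where"])
```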
Mode 3 FIG. 6 is a block diagram showing the configuration of a metadata production device, according to Modality 3 of the present invention. This modality is an example in which the ordinary digital data content is the subject for the production of metadata. In order to identify the digital data content, this configuration correlates the generated metadata and the digital data content, using addresses or numbers of the content. In FIG. 6, the number 31 denotes a content database (referred to in the following as "DB content"). The reproducing output of the content DB 31 is supplied to a voice input part 32, a file processing part 35 and a recording part 37. The output of the speech input part 32 is supplied to a part of the speech input part 32. speech recognition 33. The data of a dictionary database (referred to as "DB dictionary" in the following) 34, can be supplied to the voice recognition part 33. The metadata is produced from the speech recognition part 33 and enter into the file processing portion 35. Using a time code value of the DB 31 content, the predetermined data is appended to the metadata output of the speech-recognition part-21, which is processed within a file with this format, through the file processing part 35. The file. metadata that is produced from the file processing part 35, is supplied to the recording part 37, and is recorded along with the content that is produced from the DB 31 content. The voice input part 32 is supplied with a voice input terminal 39, and the DB dictionary 34 is supplied with a dictionary file selection input terminal 40. The playback output of the DB 31 content and the output of the playback of the recording part 37, can be displayed with a video monitor 41. The content DB 31 has a configuration to provide a function for reproducing created content, while assigning a time code adapted to the content, which may be, for example, an audio / video signal reproduction device, such as a VTR. , a hard disk device, or an optical disk device, a video / audio signal reproduction device that uses a memory medium, such as a semiconductor memory as a recording medium, or a signal reproduction device of video / audio that temporarily records and reproduces audio / video signals, which are delivered through broadcasting or broadcasting. The following is an explanation of the operation of this metadata production device. A video signal with coded time code that is reproduced from the DB 31 content, is supplied to the video monitor 41 and is displayed. When the operator inputs a narration voice signal using the microphone, according to the content displayed through the video monitor 41, the voice signal is input through the voice input terminal 39, within the voice input 32. It is preferable that during this, the operator confirms the content displayed on the video monitor 41 or the time code, and utters the keywords for handling content that is extracted based on the argument, script or the video content, or similar. It is possible to improve the recognition ratio with the voice recognition part 33 using, as the thus input voice signals, only keywords that have been previously limited, according to the argument or the like. In the voice input part 32, the speech signal that is input from the speech input terminal 39 is converted to data with a counter that is synchronized with a vertical synchronization signal that is produced from the DB 31 content. 
The voice signal data that have been converted into data by the voice input part 32 are input into the voice recognition part 33, while at the same time the dictionary necessary for speech recognition is supplied from the dictionary DB 34. The dictionary used for speech recognition in the dictionary DB 34 can be set from the dictionary field selection input terminal 40. As shown in FIG. 7, for example, when the dictionary DB 34 is configured to have separate dictionaries for different fields, the field to be used is set from the dictionary field selection input terminal 40 (for example, a keyboard input terminal). For example, in the case of a cooking program, it is possible to set the field of the dictionary DB 34 from the terminal 40 to: Japanese cuisine - cooking - cooking methods - stir-fried vegetables. By setting the dictionary DB 34 in this manner, the terms used and the terms to be recognized by voice can be limited, and the recognition rate of the voice recognition part 33 can be improved.
In addition, it is possible to enter, from the dictionary field selection input terminal 40 in FIG. 6, keywords extracted from the scenario, the script or the content. For example, if the content is a cooking program, it is possible to enter a recipe as shown in FIG. 8 from the terminal 40. Considering the content of the program, there is a high possibility that the words appearing in the recipe will be entered as voice signals, so the recognition priority of the terms entered from the terminal 40 is clearly specified in the dictionary DB 34, and speech recognition of these terms is prioritized. For example, if homonyms such as "KAKI", which can mean either "persimmon" or "oyster" in Japanese, are included in the dictionary, and if the terms in the recipe entered from the terminal 40 include only the term "KAKI" meaning "oyster", then a priority rank of 1 is assigned to "KAKI" (meaning "oyster"). If the expression "KAKI" is then recognized by the voice recognition part 33, it is recognized as "KAKI" (meaning "oyster"), to which the priority rank of 1 has been set in the dictionary DB 34. Therefore, it is possible to improve the recognition rate of the voice recognition part 33 by limiting the terms in the dictionary DB 34 with the field entered from the terminal 40, and also by entering a scenario from the terminal 40 and clearly specifying the priority of the terms.
The voice recognition part 33 in FIG. 6 recognizes the voice signal data input from the voice input part 32 according to the dictionary supplied from the dictionary DB 34, and the metadata are created. The metadata output from the voice recognition part 33 are input into the file processing part 35. As described above, the voice input part 32 converts the voice signals into data in synchronization with a vertical synchronization signal reproduced from the content DB 31. Accordingly, the file processing part 35 produces a metadata file in text format, as shown in FIG. 9 in the case of the above-mentioned cooking program, for example, using synchronization information from the voice input part 32 and time code values supplied from the content DB 31.
That is, TM_ENT (sec), which is a reference time measured in seconds from the start of the file, TM_OFFSET, which indicates the frame offset number from the reference time, and a time code are appended by the file processing part 35 to the metadata output from the voice recognition part 33, and the metadata are processed into a file with this format. The recording part 37 records the metadata file output from the file processing part 35 and the content output from the content DB 31. The recording part 37 is configured by an HDD, a memory, an optical disc or the like, and records the content output from the content DB 31 also in file format.
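The text-format metadata file of Embodiment 3 can be sketched as below. The exact field layout of FIG. 9 is not reproduced in the text, so the line format used here is an assumption; only the fields named above (TM_ENT, TM_OFFSET, the time code and the recognized keyword) are carried over.

```python
# Hedged sketch of the file processing part 35: write a text-format metadata
# file whose entries carry TM_ENT (seconds from the start of the file),
# TM_OFFSET (frame offset from that reference time), a time code and the
# recognized keyword.
def make_entry(tm_ent_sec: int, tm_offset_frames: int, time_code: str, keyword: str) -> str:
    return f"TM_ENT={tm_ent_sec}\tTM_OFFSET={tm_offset_frames}\tTC={time_code}\t{keyword}"

entries = [
    make_entry(754, 6, "00:12:34:06", "salt"),
    make_entry(754, 6, "00:12:34:06", "one tablespoon"),
    make_entry(810, 12, "00:13:30:12", "oyster"),
]

with open("metadata.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(entries) + "\n")
```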
Embodiment 4
FIG. 10 is a block diagram showing the configuration of a metadata production device according to Embodiment 4 of the present invention. In the device of this embodiment, a content information file processing part 36 is added to the configuration of Embodiment 3. The content information file processing part 36 creates a control file indicating the recording positions of the content recorded by the recording part 37, and this control file is recorded by the recording part 37. That is, based on the content output from the content DB 31 and the recording position information of the content output from the recording part 37, the content information file processing part 36 generates time axis information for that content, as well as information indicating the relationship with the addresses at which the content is recorded in the recording part 37, and converts the time axis information into data to be output as a control file. For example, as shown in FIG. 11, TM_ENT #j, which indicates a reference on the time axis of the content, is pointed at equal time-axis intervals to the addresses of the recording medium, which indicate the recording position of the content. For example, TM_ENT #j is pointed to an address of the recording medium every second (every 30 frames in the case of an NTSC signal). By pointing in this way, even when the content is recorded dispersedly in units of one second, it is possible to identify the recording address in the recording part 37 unambiguously via TM_ENT #j.
In a metadata file, as shown in FIG. 9, TM_ENT (sec), which is a reference time measured in seconds from the start of the file, TM_OFFSET, which indicates the frame offset number from the reference time, the time code, and the metadata are recorded in text format. Consequently, if metadata is specified in the metadata file, then the time code, the reference time and the frame offset value are known, so that the recording position in the recording part 37 can be determined immediately from the control file shown in FIG. 11.
It should be noted that the equal time-axis intervals of TM_ENT #j are not limited to every second as noted above; it is also possible to point them according to GOP units used in MPEG-2 compression or the like. Furthermore, in NTSC television signals the vertical synchronization signal is 60/1.001 Hz, so it is possible to use two types of time codes, namely a drop-frame time code adapted to absolute time, or a non-drop-frame time code according to the vertical synchronization signal (60/1.001 Hz). In this case, the non-drop-frame time code can be expressed by TM_ENT #j, and a time code corresponding to the drop-frame mode can be expressed by TC_ENT #j. In addition, the conversion of the control file into data can be performed using an existing language such as SMIL 2. If the functionality of SMIL 2 is used, it is also possible to convert the related content and the file name of the metadata file into data and store them in the control file. Further, although FIG. 11 shows a configuration in which the recording address of the recording part is indicated directly, it is also possible to indicate, in place of the recording address, the amount of data from the start of the content file to the relevant time code, and to calculate and find the recording address corresponding to the time code in the recording part based on that amount of data and the recording address of the file system.
In addition, a similar effect can be achieved when the correspondence table of TM_ENT #j and the time codes is not stored in the metadata file, but is instead stored in the control file.
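A minimal sketch of the control file idea follows: the control file lists a recording address for each one-second time-axis reference TM_ENT #j, so a (TM_ENT, TM_OFFSET) pair from the metadata file maps straight to a position in the recording part. The field names, addresses and the fixed bytes-per-frame stride are illustrative assumptions, not data from FIG. 11.

```python
FRAMES_PER_SECOND = 30  # NTSC: one TM_ENT entry every 30 frames

# Control table: index j (seconds from the start) -> recording address.
control_table = {0: 0x0000, 1: 0x8400, 2: 0x10C00, 754: 0x3F2A000}

def recording_address(tm_ent_sec: int, tm_offset_frames: int, bytes_per_frame: int = 1120) -> int:
    """Look up the base address for that second, then step forward by the frame
    offset. The constant stride per frame is an assumption for the sketch."""
    base = control_table[tm_ent_sec]
    return base + tm_offset_frames * bytes_per_frame

# A metadata entry "TM_ENT=754, TM_OFFSET=6" resolves to an address immediately:
print(hex(recording_address(754, 6)))
```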
Embodiment 5
FIG. 12 is a block diagram showing the configuration of a metadata search device according to Embodiment 5 of the present invention. In the device of this embodiment, a search part 38 is added to the configuration of Embodiment 4. With the search part 38, the keywords for the scenes to be searched for are chosen from the dictionary DB 34, which is identical to the one used to produce the metadata through speech recognition, and those keywords are set. Then, the search part 38 searches the entries in the metadata files and displays a list of title names corresponding to the keywords, as well as the positions (time codes) of the scenes in the content. If a particular scene is selected from the displayed list, then the address on the recording medium is automatically found in the control file from the reference time TM_ENT (sec) and the frame offset number TM_OFFSET of the metadata file and is set in the recording part 37, and the content scene recorded at that recording address is reproduced by the recording part 37 and displayed on the monitor 41. With this configuration, the scene to be viewed can be found immediately once the metadata has been found. It should be noted that if reduced-size image files linked to the content are prepared in advance, then it is possible to reproduce and display small images representative of the content when the above-noted list of content names corresponding to the keywords is displayed.
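The behaviour of the search part 38 can be sketched as follows: find the metadata entries containing a keyword, show their time codes, and resolve a chosen hit to a recording address through the control table. The data structures follow the earlier sketches and are assumptions, not the patent's file formats.

```python
metadata_entries = [
    {"tm_ent": 754, "tm_offset": 6,  "tc": "00:12:34:06", "keyword": "oyster"},
    {"tm_ent": 810, "tm_offset": 12, "tc": "00:13:30:12", "keyword": "salt"},
]
control_table = {754: 0x3F2A000, 810: 0x41B8000}   # second -> recording address

def search(keyword: str):
    """List every metadata entry whose keyword matches the search keyword."""
    return [e for e in metadata_entries if e["keyword"] == keyword]

def resolve(entry, bytes_per_frame: int = 1120) -> int:
    """Control-file lookup: base address for the second, plus the frame offset."""
    return control_table[entry["tm_ent"]] + entry["tm_offset"] * bytes_per_frame

hits = search("oyster")
print([h["tc"] for h in hits])   # time codes of the matching scenes
print(hex(resolve(hits[0])))     # address to hand to the recording part 37
```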
Embodiment 6
The aforementioned Embodiments 3 to 5 explained devices in which metadata is attached to content that has been recorded in advance, while the present embodiment relates to an example in which the present invention is extended to a system that attaches metadata while taking pictures with a camera or the like, and in particular to a device that attaches metadata about shooting locations when taking scenes whose content has been determined in advance.
FIG. 13 is a block diagram showing the configuration of a metadata production device according to Embodiment 6 of the present invention. The output image of a camera 51 is recorded as video content in a content DB 54. At the same time, a GPS 52 detects the location at which the camera takes the images, this position information (geographic coordinates) is converted into voice signals by a voice synthesis part 53, and it is recorded as position information on a voice channel of the content DB 54. The camera 51, the GPS 52, the voice synthesis part 53 and the content DB 54 can be configured in an integrated manner as a camera 50 with recording part. The content DB 54 inputs the position information carried as voice signals on the audio channel into a voice recognition part 56. Also, dictionary data from a dictionary DB 55 are supplied to the voice recognition part 56. The dictionary DB 55 can be configured such that place names, landmarks or the like can be chosen or restricted through keyboard input from a terminal 59. The voice recognition part 56 finds the place names or landmarks corresponding to the recognized geographic coordinates using the data of the dictionary DB 55, and outputs them to a file processing part 57. The file processing part 57 converts the time codes output from the content DB 54, as well as the place names and landmarks output from the voice recognition part 56, into metadata in text form, thus generating a metadata file. The metadata file is supplied to a recording part 58, which records this metadata file as well as the content data output from the content DB 54. With this configuration, metadata of place names and landmarks can be automatically attached to each scene that is taken.
In the aforementioned embodiments, configurations were described in which the keywords recognized by a voice recognition part are converted into metadata files together with time codes, but it is also possible to add keywords related to the keywords recognized by the voice recognition part and to include them in the files. For example, when "Yodogawa River" has been recognized by voice, then ordinary attributive keywords such as "topography" or "river" can be added. It then becomes possible to use the added keywords "topography" or "river" when searching, so that the ease of searching is increased.
It should be noted that with the voice recognition part of the present invention, it is possible to improve the speech recognition rate by using a word-based recognition method that recognizes individual words, and by limiting the number of words in the voice input
and the number of words in the recognition dictionary used. In addition, there is generally a possibility that false recognitions occur in speech recognition. In the embodiments described above, it is possible to provide an information processing part, such as a computer including a keyboard, so that when a false recognition has occurred, the produced tag or metadata can be corrected through a keyboard operation.
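The related-keyword idea mentioned above can be illustrated with a short sketch: when a recognized keyword has known attribute terms, they are added to the metadata entry so that searches on broader words ("river", "topography") also find the scene. The attribute table is a made-up example, not data from the patent.

```python
attribute_keywords = {
    "Yodogawa River": ["river", "topography"],
    "oyster": ["seafood", "ingredient"],
}

def expand(keywords):
    """Append ordinary attributive keywords for each recognized keyword."""
    expanded = list(keywords)
    for word in keywords:
        expanded.extend(attribute_keywords.get(word, []))
    return expanded

entry = {"tc": "00:05:10:00", "keywords": expand(["Yodogawa River"])}
print(entry["keywords"])   # ['Yodogawa River', 'river', 'topography']
```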
INDUSTRIAL APPLICABILITY
With the metadata production device of the present invention, metadata or tags related to content are produced through voice input using speech recognition and are associated with predetermined positions in the content, so that the production of metadata or the attaching of tags can be carried out more efficiently than with conventional keyboard input.
Claims (1)
CLAIMS
1. A metadata production device, comprising: a content reproduction part that reproduces and outputs content; a voice input part; a voice recognition part that recognizes voice signals input from the voice input part; a metadata generation part that converts information recognized by the voice recognition part into metadata; an identification information attaching part that obtains identification information for identifying positions within the content, and attaches the identification information to the metadata; and a dictionary that is limited according to the content; whereby the generated metadata are associated with the positions in the content; and when the voice signals input from the voice input part are recognized with the voice recognition part, the recognition is carried out in association with the dictionary.
2. The metadata production device according to claim 1, wherein the voice signals are recognized by the voice recognition part word by word, in association with the dictionary.
3. The metadata production device according to claim 1 or 2, further comprising an information processing part that includes a keyboard, wherein the metadata can be corrected through the information processing part by keyboard input.
4. The metadata production device according to any one of claims 1 to 3, wherein time code information that is attached to the content is used as the identification information.
5. The metadata production device according to any one of claims 1 to 4, wherein content addresses, numbers or frame numbers attached to the content are used as the identification information.
6. The metadata production device according to claim 1, wherein the content is still image content and addresses of the still image content are used as the identification information.
7. The metadata production device according to claim 1, wherein the content reproduction part is configured by a content database; wherein the voice input part supplies to the voice recognition part voice signals of entered keywords that have been converted into data with a counter signal that is synchronized with a synchronization signal supplied from the content database; wherein the voice recognition part is configured to recognize the keywords from the voice signal data that have been converted into data by the voice input part; and wherein the metadata generation part is configured as a file processing part that produces a metadata file by using, as the identification information, a time code indicating a time position of an image signal included in the content, and by combining the keywords output from the voice recognition part with that time code.
8. The metadata production device according to claim 7, further comprising a recording part that records the content supplied from the content database, together with the metadata file, as a content file.
9. The metadata production device according to claim 8, further comprising a content information file processing part that generates a control file which controls the relationship between the metadata file and the recording positions at which the content file is to be recorded; wherein the control file is recorded in the recording part together with the content file and the metadata file.
10. The metadata production device according to claim 7, further comprising a dictionary database, wherein the voice recognition part can choose a dictionary of a genre corresponding to the content from a plurality of genre-dependent dictionaries.
11. The metadata production device according to claim 10, wherein keywords related to the content can be supplied to the voice recognition part; and wherein the voice recognition part is configured to recognize those keywords with higher priority.
12. A method for producing metadata, comprising: inputting, by voice, information related to given content while displaying the content on a monitor; subjecting the input voice signal to speech recognition with a speech recognition device using a dictionary that is limited according to the content; converting the recognized voice information into metadata; and attaching identification information supplied with the content for identifying positions in the content to the metadata, thereby associating the generated metadata with the positions in the content.
13. The method for producing metadata according to claim 12, wherein time code information that is attached to the content is used as the identification information.
14. The method for producing metadata according to claim 12, wherein the content is still image content and addresses of the still image content are used as the identification information.
15. A metadata search device, comprising: a content database that reproduces and outputs content; a voice input part that converts voice signals of entered keywords into data with a counter signal that is synchronized with a synchronization signal supplied from the reproduced content; a voice recognition part that recognizes the keywords from the voice signal data that have been converted into data by the voice input part; a file processing part that produces a metadata file by combining the keywords output from the voice recognition part with time codes indicating a time position of an image signal included in the content; a content information file processing part that generates a control file that controls a relationship between the metadata file and recording positions of the content file; a recording part that records the content file, the metadata file and the control file; and a search part that extracts a recording position corresponding to a keyword in the content file by specifying the metadata files in which an entered search keyword is included, and by referring to the control file; wherein the recording position of the content file is the recording position in the recording part.
16. The metadata search device according to claim 15, wherein the control file output from the content information file processing part is conceived as a table listing the recording positions of the content in the recording part according to a recording time of the content, and the recording position of the content can be searched by time code.
17. The metadata search device according to claim 15, further comprising a dictionary database, and a keyword supply part that supplies keywords related to the content to the voice recognition part; wherein the voice recognition part can choose a dictionary of a genre corresponding to the content from a plurality of genre-dependent dictionaries, and the voice recognition part is configured to recognize those keywords with higher priority.
The metadata search device according to claim 15, further comprising a dictionary database; characterized in that the speech recognition part can choose a dictionary of a gender corresponding to the content of a plurality of gender-dependent dictionaries; and characterized in that the search part is configured to search through keywords that are chosen from a common dictionary, used by the speech recognition part.
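As an informal illustration of the data flow recited in claims 12 and 15 (spoken keywords converted into a metadata file keyed by time code, a control file relating time codes to recording positions, and a keyword search that resolves to a recording position in the content file), the following minimal Python sketch mirrors that flow under stated assumptions. It is not the claimed implementation: the speech recognition step is stubbed out, and every name, sample value, and data structure below is illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MetadataEntry:
    keyword: str    # keyword produced by the speech recognition part
    time_code: str  # time position of the image signal within the content


def recognize_keyword(voice_sample: bytes, dictionary: List[str]) -> str:
    """Stand-in for the speech recognition part.

    A real device would match the voice signal against a genre-specific
    dictionary (claims 10 and 17); this stub simply returns the first entry.
    """
    return dictionary[0]


def build_metadata_file(samples: List[Tuple[str, bytes]],
                        dictionary: List[str]) -> List[MetadataEntry]:
    """File processing part: combine recognized keywords with time codes."""
    return [MetadataEntry(recognize_keyword(audio, dictionary), tc)
            for tc, audio in samples]


def build_control_file(metadata: List[MetadataEntry],
                       recording_positions: Dict[str, int]) -> Dict[str, int]:
    """Content information file processing part: relate each time code in the
    metadata file to a recording position of the content file."""
    return {entry.time_code: recording_positions[entry.time_code]
            for entry in metadata}


def search(keyword: str,
           metadata: List[MetadataEntry],
           control: Dict[str, int]) -> List[int]:
    """Search part: return recording positions whose metadata holds the keyword."""
    return [control[entry.time_code]
            for entry in metadata if entry.keyword == keyword]


if __name__ == "__main__":
    # Illustrative data only: two voice annotations made while the content plays.
    samples = [("00:01:10:00", b"..."), ("00:05:42:00", b"...")]
    dictionary = ["goal", "penalty"]  # hypothetical genre-specific dictionary
    positions = {"00:01:10:00": 2100, "00:05:42:00": 10260}  # time code -> recording position

    metadata = build_metadata_file(samples, dictionary)
    control = build_control_file(metadata, positions)
    print(search("goal", metadata, control))  # -> [2100, 10260] with this stub recognizer
```

In this sketch the control file provides the indirection described in claim 16: a keyword resolves first to a time code in the metadata file, and the time code resolves to a recording position, so the metadata file itself never needs to store where the content is recorded.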
Applications Claiming Priority (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2002182506 | 2002-06-24 | ||
JP2002319756A JP3781715B2 (en) | 2002-11-01 | 2002-11-01 | Metadata production device and search device |
JP2002319757A JP2004153765A (en) | 2002-11-01 | 2002-11-01 | Meta-data production apparatus and production method |
JP2002334831A JP2004086124A (en) | 2002-06-24 | 2002-11-19 | Device and method for creating metadata |
PCT/JP2003/007908 WO2004002144A1 (en) | 2002-06-24 | 2003-06-23 | Metadata preparing device, preparing method therefor and retrieving device |
Publications (1)
Publication Number | Publication Date |
---|---|
MXPA04012865A true MXPA04012865A (en) | 2005-03-31 |
Family
ID=30003905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
MXPA04012865A MXPA04012865A (en) | 2002-06-24 | 2003-06-23 | Metadata preparing device, preparing method therefor and retrieving device. |
Country Status (5)
Country | Link |
---|---|
US (1) | US20050228665A1 (en) |
EP (1) | EP1536638A4 (en) |
CN (1) | CN1663249A (en) |
MX (1) | MXPA04012865A (en) |
WO (1) | WO2004002144A1 (en) |
Families Citing this family (156)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8645137B2 (en) | 2000-03-16 | 2014-02-04 | Apple Inc. | Fast, language-independent method for user authentication by voice |
JP4127668B2 (en) * | 2003-08-15 | 2008-07-30 | 株式会社東芝 | Information processing apparatus, information processing method, and program |
US20060080286A1 (en) * | 2004-08-31 | 2006-04-13 | Flashpoint Technology, Inc. | System and method for storing and accessing images based on position data associated therewith |
US7818350B2 (en) | 2005-02-28 | 2010-10-19 | Yahoo! Inc. | System and method for creating a collaborative playlist |
JP2006311462A (en) * | 2005-05-02 | 2006-11-09 | Toshiba Corp | Apparatus and method for retrieval contents |
US7467147B2 (en) * | 2005-06-01 | 2008-12-16 | Groundspeak, Inc. | System and method for facilitating ad hoc compilation of geospatial data for on-line collaboration |
JP4659681B2 (en) * | 2005-06-13 | 2011-03-30 | パナソニック株式会社 | Content tagging support apparatus and content tagging support method |
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US7844820B2 (en) * | 2005-10-10 | 2010-11-30 | Yahoo! Inc. | Set of metadata for association with a composite media item and tool for creating such set of metadata |
EP2421183B1 (en) | 2005-10-21 | 2014-12-17 | Nielsen Media Research, Inc. | Audience metering in a portable media player using frame tags inserted at intervals for counting the number of presentings of content. Index with offset to closest I-pictures entry for random access in a bitstream. |
US7822746B2 (en) * | 2005-11-18 | 2010-10-26 | Qurio Holdings, Inc. | System and method for tagging images based on positional information |
US7884860B2 (en) * | 2006-03-23 | 2011-02-08 | Panasonic Corporation | Content shooting apparatus |
EP2011002B1 (en) | 2006-03-27 | 2016-06-22 | Nielsen Media Research, Inc. | Methods and systems to meter media content presented on a wireless communication device |
WO2007115224A2 (en) * | 2006-03-30 | 2007-10-11 | Sri International | Method and apparatus for annotating media streams |
JP4175390B2 (en) * | 2006-06-09 | 2008-11-05 | ソニー株式会社 | Information processing apparatus, information processing method, and computer program |
KR100856407B1 (en) * | 2006-07-06 | 2008-09-04 | 삼성전자주식회사 | Data recording and reproducing apparatus for generating metadata and method therefor |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
JP2008118232A (en) * | 2006-11-01 | 2008-05-22 | Hitachi Ltd | Video image reproducing unit |
WO2008111308A1 (en) * | 2007-03-12 | 2008-09-18 | Panasonic Corporation | Content imaging device |
US8204359B2 (en) * | 2007-03-20 | 2012-06-19 | At&T Intellectual Property I, L.P. | Systems and methods of providing modified media content |
US8977255B2 (en) | 2007-04-03 | 2015-03-10 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US9330720B2 (en) | 2008-01-03 | 2016-05-03 | Apple Inc. | Methods and apparatus for altering audio output signals |
US8793256B2 (en) | 2008-03-26 | 2014-07-29 | Tout Industries, Inc. | Method and apparatus for selecting related content for display in conjunction with a media |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US10496753B2 (en) | 2010-01-18 | 2019-12-03 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US8364721B2 (en) | 2008-06-12 | 2013-01-29 | Groundspeak, Inc. | System and method for providing a guided user interface to process waymark records |
US20100030549A1 (en) | 2008-07-31 | 2010-02-04 | Lee Michael M | Mobile device having human language translation capability with positional feedback |
KR101479079B1 (en) * | 2008-09-10 | 2015-01-08 | 삼성전자주식회사 | Broadcast receiver for displaying description of terminology included in digital captions and method for processing digital captions applying the same |
US8712776B2 (en) * | 2008-09-29 | 2014-04-29 | Apple Inc. | Systems and methods for selective text to speech synthesis |
KR20100061078A (en) * | 2008-11-28 | 2010-06-07 | 삼성전자주식회사 | Method and apparatus to consume contents using metadata |
WO2010067118A1 (en) | 2008-12-11 | 2010-06-17 | Novauris Technologies Limited | Speech recognition involving a mobile device |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US9858925B2 (en) | 2009-06-05 | 2018-01-02 | Apple Inc. | Using context information to facilitate processing of commands in a virtual assistant |
US10706373B2 (en) | 2011-06-03 | 2020-07-07 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
US8935204B2 (en) * | 2009-08-14 | 2015-01-13 | Aframe Media Services Limited | Metadata tagging of moving and still image content |
GB2472650A (en) * | 2009-08-14 | 2011-02-16 | All In The Technology Ltd | Metadata tagging of moving and still image content |
JP5257330B2 (en) | 2009-11-06 | 2013-08-07 | 株式会社リコー | Statement recording device, statement recording method, program, and recording medium |
US10276170B2 (en) | 2010-01-18 | 2019-04-30 | Apple Inc. | Intelligent automated assistant |
US10553209B2 (en) | 2010-01-18 | 2020-02-04 | Apple Inc. | Systems and methods for hands-free notification summaries |
US10705794B2 (en) | 2010-01-18 | 2020-07-07 | Apple Inc. | Automatically adapting user interfaces for hands-free interaction |
US10679605B2 (en) | 2010-01-18 | 2020-06-09 | Apple Inc. | Hands-free list-reading by intelligent automated assistant |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
KR20120045582A (en) * | 2010-10-29 | 2012-05-09 | 한국전자통신연구원 | Apparatus and method for creating acoustic model |
US10762293B2 (en) | 2010-12-22 | 2020-09-01 | Apple Inc. | Using parts-of-speech tagging and named entity recognition for spelling correction |
US9262612B2 (en) | 2011-03-21 | 2016-02-16 | Apple Inc. | Device access using voice authentication |
US10057736B2 (en) | 2011-06-03 | 2018-08-21 | Apple Inc. | Active transport based notifications |
US20120310642A1 (en) | 2011-06-03 | 2012-12-06 | Apple Inc. | Automatically creating a mapping between text data and audio data |
US8994660B2 (en) | 2011-08-29 | 2015-03-31 | Apple Inc. | Text correction processing |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9483461B2 (en) | 2012-03-06 | 2016-11-01 | Apple Inc. | Handling speech synthesis of content for multiple languages |
US9280610B2 (en) | 2012-05-14 | 2016-03-08 | Apple Inc. | Crowd sourcing information to fulfill user requests |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
US9495129B2 (en) | 2012-06-29 | 2016-11-15 | Apple Inc. | Device, method, and user interface for voice-activated navigation and browsing of a document |
US9576574B2 (en) | 2012-09-10 | 2017-02-21 | Apple Inc. | Context-sensitive handling of interruptions by intelligent digital assistant |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
KR102516577B1 (en) | 2013-02-07 | 2023-04-03 | 애플 인크. | Voice trigger for a digital assistant |
US9368114B2 (en) | 2013-03-14 | 2016-06-14 | Apple Inc. | Context-sensitive handling of interruptions |
US10652394B2 (en) | 2013-03-14 | 2020-05-12 | Apple Inc. | System and method for processing voicemail |
WO2014144579A1 (en) | 2013-03-15 | 2014-09-18 | Apple Inc. | System and method for updating an adaptive speech recognition model |
KR101759009B1 (en) | 2013-03-15 | 2017-07-17 | 애플 인크. | Training an at least partial voice command system |
US9325381B2 (en) | 2013-03-15 | 2016-04-26 | The Nielsen Company (Us), Llc | Methods, apparatus and articles of manufacture to monitor mobile devices |
US9559651B2 (en) | 2013-03-29 | 2017-01-31 | Apple Inc. | Metadata for loudness and dynamic range control |
WO2014197336A1 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for detecting errors in interactions with a voice-based digital assistant |
US9582608B2 (en) | 2013-06-07 | 2017-02-28 | Apple Inc. | Unified ranking with entropy-weighted information for phrase-based semantic auto-completion |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
WO2014197335A1 (en) | 2013-06-08 | 2014-12-11 | Apple Inc. | Interpreting and acting upon commands that involve sharing information with remote devices |
US10176167B2 (en) | 2013-06-09 | 2019-01-08 | Apple Inc. | System and method for inferring user intent from speech inputs |
JP6259911B2 (en) | 2013-06-09 | 2018-01-10 | アップル インコーポレイテッド | Apparatus, method, and graphical user interface for enabling conversation persistence across two or more instances of a digital assistant |
WO2014200731A1 (en) | 2013-06-13 | 2014-12-18 | Apple Inc. | System and method for emergency calls initiated by voice command |
CN105453026A (en) | 2013-08-06 | 2016-03-30 | 苹果公司 | Auto-activating smart responses based on activities from remote devices |
US9942396B2 (en) * | 2013-11-01 | 2018-04-10 | Adobe Systems Incorporated | Document distribution and interaction |
US9544149B2 (en) | 2013-12-16 | 2017-01-10 | Adobe Systems Incorporated | Automatic E-signatures in response to conditions and/or events |
US10182280B2 (en) | 2014-04-23 | 2019-01-15 | Panasonic Intellectual Property Management Co., Ltd. | Sound processing apparatus, sound processing system and sound processing method |
US9620105B2 (en) | 2014-05-15 | 2017-04-11 | Apple Inc. | Analyzing audio input for efficient speech and music recognition |
US10592095B2 (en) | 2014-05-23 | 2020-03-17 | Apple Inc. | Instantaneous speaking of content on touch devices |
US9502031B2 (en) | 2014-05-27 | 2016-11-22 | Apple Inc. | Method for supporting dynamic grammars in WFST-based ASR |
US9760559B2 (en) | 2014-05-30 | 2017-09-12 | Apple Inc. | Predictive text input |
US10289433B2 (en) | 2014-05-30 | 2019-05-14 | Apple Inc. | Domain specific language for encoding assistant dialog |
US9734193B2 (en) | 2014-05-30 | 2017-08-15 | Apple Inc. | Determining domain salience ranking from ambiguous words in natural speech |
US10078631B2 (en) | 2014-05-30 | 2018-09-18 | Apple Inc. | Entropy-guided text prediction using combined word and character n-gram language models |
US9785630B2 (en) | 2014-05-30 | 2017-10-10 | Apple Inc. | Text prediction using combined word N-gram and unigram language models |
US9633004B2 (en) | 2014-05-30 | 2017-04-25 | Apple Inc. | Better resolution when referencing to concepts |
US9715875B2 (en) | 2014-05-30 | 2017-07-25 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
CN110797019B (en) | 2014-05-30 | 2023-08-29 | 苹果公司 | Multi-command single speech input method |
US10170123B2 (en) | 2014-05-30 | 2019-01-01 | Apple Inc. | Intelligent assistant for home automation |
US9842101B2 (en) | 2014-05-30 | 2017-12-12 | Apple Inc. | Predictive conversion of language input |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US10659851B2 (en) | 2014-06-30 | 2020-05-19 | Apple Inc. | Real-time digital assistant knowledge updates |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US10446141B2 (en) | 2014-08-28 | 2019-10-15 | Apple Inc. | Automatic speech recognition based on user feedback |
US9818400B2 (en) | 2014-09-11 | 2017-11-14 | Apple Inc. | Method and apparatus for discovering trending terms in speech requests |
US10789041B2 (en) | 2014-09-12 | 2020-09-29 | Apple Inc. | Dynamic thresholds for always listening speech trigger |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
US9886432B2 (en) | 2014-09-30 | 2018-02-06 | Apple Inc. | Parsimonious handling of word inflection via categorical stem + suffix N-gram language models |
US9646609B2 (en) | 2014-09-30 | 2017-05-09 | Apple Inc. | Caching apparatus for serving phonetic pronunciations |
US10074360B2 (en) | 2014-09-30 | 2018-09-11 | Apple Inc. | Providing an indication of the suitability of speech recognition |
US10127911B2 (en) | 2014-09-30 | 2018-11-13 | Apple Inc. | Speaker identification and unsupervised speaker adaptation techniques |
US9703982B2 (en) | 2014-11-06 | 2017-07-11 | Adobe Systems Incorporated | Document distribution and interaction |
US10552013B2 (en) | 2014-12-02 | 2020-02-04 | Apple Inc. | Data detection |
US9711141B2 (en) | 2014-12-09 | 2017-07-18 | Apple Inc. | Disambiguating heteronyms in speech synthesis |
US9865280B2 (en) | 2015-03-06 | 2018-01-09 | Apple Inc. | Structured dictation using intelligent automated assistants |
US9886953B2 (en) | 2015-03-08 | 2018-02-06 | Apple Inc. | Virtual assistant activation |
US9721566B2 (en) | 2015-03-08 | 2017-08-01 | Apple Inc. | Competing devices responding to voice triggers |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
US9899019B2 (en) | 2015-03-18 | 2018-02-20 | Apple Inc. | Systems and methods for structured stem and suffix language models |
US9842105B2 (en) | 2015-04-16 | 2017-12-12 | Apple Inc. | Parsimonious continuous-space phrase representations for natural language processing |
US10083688B2 (en) | 2015-05-27 | 2018-09-25 | Apple Inc. | Device voice control for selecting a displayed affordance |
US10127220B2 (en) | 2015-06-04 | 2018-11-13 | Apple Inc. | Language identification from short strings |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
US10101822B2 (en) | 2015-06-05 | 2018-10-16 | Apple Inc. | Language input correction |
US11025565B2 (en) | 2015-06-07 | 2021-06-01 | Apple Inc. | Personalized prediction of responses for instant messaging |
US10255907B2 (en) | 2015-06-07 | 2019-04-09 | Apple Inc. | Automatic accent detection using acoustic models |
US10186254B2 (en) | 2015-06-07 | 2019-01-22 | Apple Inc. | Context-based endpoint detection |
CN106409295B (en) * | 2015-07-31 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Method and device for recognizing time information from natural voice information |
US9935777B2 (en) | 2015-08-31 | 2018-04-03 | Adobe Systems Incorporated | Electronic signature framework with enhanced security |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US9626653B2 (en) | 2015-09-21 | 2017-04-18 | Adobe Systems Incorporated | Document distribution and interaction with delegation of signature authority |
US9697820B2 (en) | 2015-09-24 | 2017-07-04 | Apple Inc. | Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11587559B2 (en) | 2015-09-30 | 2023-02-21 | Apple Inc. | Intelligent device identification |
CN105389350B (en) * | 2015-10-28 | 2019-02-15 | 浪潮(北京)电子信息产业有限公司 | A kind of metadata of distributed type file system information acquisition method |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US10347215B2 (en) | 2016-05-27 | 2019-07-09 | Adobe Inc. | Multi-device electronic signature framework |
EP3467823B1 (en) | 2016-05-30 | 2024-08-21 | Sony Group Corporation | Video sound processing device, video sound processing method, and program |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
JP6530357B2 (en) * | 2016-09-06 | 2019-06-12 | 株式会社日立ビルシステム | Maintenance work management system and maintenance work management device |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
US10503919B2 (en) | 2017-04-10 | 2019-12-10 | Adobe Inc. | Electronic signature framework with keystroke biometric authentication |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK179549B1 (en) | 2017-05-16 | 2019-02-12 | Apple Inc. | Far-field extension for digital assistant services |
US11652656B2 (en) * | 2019-06-26 | 2023-05-16 | International Business Machines Corporation | Web conference replay association upon meeting completion |
Family Cites Families (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3337798B2 (en) * | 1993-12-24 | 2002-10-21 | キヤノン株式会社 | Apparatus for processing image data and audio data, data processing apparatus, and data processing method |
US5546145A (en) * | 1994-08-30 | 1996-08-13 | Eastman Kodak Company | Camera on-board voice recognition |
US5835667A (en) * | 1994-10-14 | 1998-11-10 | Carnegie Mellon University | Method and apparatus for creating a searchable digital video library and a system and method of using such a library |
JPH09130736A (en) * | 1995-11-02 | 1997-05-16 | Sony Corp | Image pickup device and edit device |
DE19645716A1 (en) * | 1995-11-06 | 1997-05-07 | Ricoh Kk | Digital individual image camera with image data transaction function |
JPH09149365A (en) * | 1995-11-20 | 1997-06-06 | Ricoh Co Ltd | Digital still video camera |
US6336093B2 (en) * | 1998-01-16 | 2002-01-01 | Avid Technology, Inc. | Apparatus and method using speech recognition and scripts to capture author and playback synchronized audio and video |
JP2000069442A (en) * | 1998-08-24 | 2000-03-03 | Sharp Corp | Moving picture system |
JP3166725B2 (en) * | 1998-08-28 | 2001-05-14 | 日本電気株式会社 | Information recording apparatus, information recording method, and recording medium |
JP2000306365A (en) * | 1999-04-16 | 2000-11-02 | Sony Corp | Editing support system and control apparatus therefor |
GB2354105A (en) * | 1999-09-08 | 2001-03-14 | Sony Uk Ltd | System and method for navigating source content |
GB2359918A (en) * | 2000-03-01 | 2001-09-05 | Sony Uk Ltd | Audio and/or video generation apparatus having a metadata generator |
US7051048B2 (en) * | 2000-09-29 | 2006-05-23 | Canon Kabushiki Kaisha | Data management system, data management method, and program |
JP2002157112A (en) * | 2000-11-20 | 2002-05-31 | Teac Corp | Voice information converting device |
JP2002171481A (en) * | 2000-12-04 | 2002-06-14 | Ricoh Co Ltd | Video processing apparatus |
JP2002207753A (en) * | 2001-01-10 | 2002-07-26 | Teijin Seiki Co Ltd | Multimedia information recording, forming and providing system |
JP2002374494A (en) * | 2001-06-14 | 2002-12-26 | Fuji Electric Co Ltd | Generation system and retrieving method for video contents file |
JP2003018505A (en) * | 2001-06-29 | 2003-01-17 | Toshiba Corp | Information reproducing device and conversation scene detection method |
JP4240867B2 (en) * | 2001-09-28 | 2009-03-18 | 富士フイルム株式会社 | Electronic album editing device |
JP3768915B2 (en) * | 2002-04-26 | 2006-04-19 | キヤノン株式会社 | Digital camera and digital camera data processing method |
- 2003
- 2003-06-23 US US10/519,089 patent/US20050228665A1/en not_active Abandoned
- 2003-06-23 CN CN038149028A patent/CN1663249A/en active Pending
- 2003-06-23 EP EP03733537A patent/EP1536638A4/en not_active Withdrawn
- 2003-06-23 WO PCT/JP2003/007908 patent/WO2004002144A1/en active Application Filing
- 2003-06-23 MX MXPA04012865A patent/MXPA04012865A/en unknown
Also Published As
Publication number | Publication date |
---|---|
US20050228665A1 (en) | 2005-10-13 |
EP1536638A1 (en) | 2005-06-01 |
CN1663249A (en) | 2005-08-31 |
EP1536638A4 (en) | 2005-11-09 |
WO2004002144B1 (en) | 2004-04-08 |
WO2004002144A1 (en) | 2003-12-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
MXPA04012865A (en) | Metadata preparing device, preparing method therefor and retrieving device. | |
JP4794740B2 (en) | Audio / video signal generation apparatus and audio / video signal generation method | |
US9837077B2 (en) | Enhanced capture, management and distribution of live presentations | |
JP4591982B2 (en) | Audio signal and / or video signal generating apparatus and audio signal and / or video signal generating method | |
US6789228B1 (en) | Method and system for the storage and retrieval of web-based education materials | |
US7093191B1 (en) | Video cataloger system with synchronized encoders | |
US7295752B1 (en) | Video cataloger system with audio track extraction | |
US6567980B1 (en) | Video cataloger system with hyperlinked output | |
JP3657206B2 (en) | A system that allows the creation of personal movie collections | |
US20020036694A1 (en) | Method and system for the storage and retrieval of web-based educational materials | |
JP2005341015A (en) | Video conference system with minute creation support function | |
KR20060132595A (en) | Storage system for retaining identification data to allow retrieval of media content | |
US20120257869A1 (en) | Multimedia data recording method and apparatus for automatically generating/updating metadata | |
US7924325B2 (en) | Imaging device and imaging system | |
JP3781715B2 (en) | Metadata production device and search device | |
US20070201864A1 (en) | Information processing apparatus, information processing method, and program | |
Ronfard et al. | Capturing and indexing rehearsals: the design and usage of a digital archive of performing arts | |
US7675827B2 (en) | Information processing apparatus, information processing method, and program | |
JP2004023661A (en) | Recorded information processing method, recording medium, and recorded information processor | |
US20140078331A1 (en) | Method and system for associating sound data with an image | |
JPH0991928A (en) | Method for editing image | |
US7444068B2 (en) | System and method of manual indexing of image data | |
KR102376646B1 (en) | Automatic System for moving picture in relation to goods | |
US7720798B2 (en) | Transmitter-receiver system, transmitting apparatus, transmitting method, receiving apparatus, receiving method, and program | |
JP2002262225A (en) | Contents mediating device and method for processing contents mediation |