VIDEO INDEXING METHOD, AND VIDEO INDEXING DEVICE
FIELD OF THE INVENTION
The invention relates to a video indexing method and a video indexing device.
BACKGROUND OF THE INVENTION
Several picture processing applications use the detection of regions of interest (ROI) to improve picture quality. For example, coding applications often detect regions of interest and deploy more resources for coding these regions.
Different methods enable detection of regions of interest in a picture.
Particularly, methods are known that are based on the establishment of salience maps of a picture or a video; these maps take into account visual parameters and enable the definition of the regions on which the human eye lingers when viewing a picture or a video.
The detection of regions of interest is today principally used prior to coding, in such a manner as to privilege the regions of interest during coding by according them more bandwidth, for example by reducing the quantization step for these regions.
The emergence of mobile terminals, such as mobile telephones, PDAs, game consoles and portable DVD players, the development of display and screen techniques and the emergence of new services have all combined to render necessary the display of video on terminals with a low display capacity. For example, the possibility to receive television on a mobile telephone raises display problems for dense pictures on screens of small dimensions.
The present invention is principally concerned not with the detection of regions of interest, but rather with the transmission of these regions of interest to the devices or applications that take them into account for different applications and can at least resolve the picture display problem on a terminal with a low display capacity, whether mobile or not.
SUMMARY OF THE INVENTION
For this purpose, the present invention proposes a method for indexing a coded video data stream. According to the invention, the video data stream
comprises information relative to the location of regions of interest of each picture, the method comprises steps of:
- reception of coded video stream,
- recording the coded video stream on a recording support,
- decoding location information of regions of interest,
- selection of a region of interest per picture,
- decoding of video data,
- selecting a predetermined number of regions of interest for the video data stream from among the regions of interest selected per picture,
- recording of the selected regions of interest.
According to a preferred embodiment, during the recording step,
- the selected regions of interest are recorded in a temporary memory as they are being selected and decoded,
- when all the selected regions of interest are recorded in the temporary memory, the selected regions of interest are transferred to a permanent memory support (503).
Preferentially, prior to their recording the regions of interest are formatted in order to obtain a homogenous size for all the selected regions of interest.
Preferentially, the method comprises a step of encrypting the location of the regions of interest by means of an encryption key.
Preferentially, the method comprises a step of obtaining a decryption key upon payment by the user.
Preferentially, the video data stream is coded according to the coding standard H.264/AVC and the location information is contained in a Supplemental Enhancement Information (SEI) type message.
According to a preferred embodiment, the SEI messages are encapsulated into Real-time Transport Protocol (RTP) packets, the RTP packets being encrypted.
Preferentially, the Supplemental Enhancement Information type messages relative to regions of interest location information are inserted in the coded data before or after each picture to which they refer.
According to a preferred embodiment, the location information comprises information chosen from:
- the number of regions of interest in each picture,
- the coordinates of each region of interest for each of the picture dimensions,
- the surface of each region of interest,
- a weight relative to the importance of the region of interest with respect to other regions of interest of the picture,
- information relating to the content of each region of interest, and any combination of this information.
Preferentially, the selection step of a region of interest per picture selects a region of interest according to the weight relative to the importance of the region of interest.
Preferentially, the video coding standard uses Flexible Macroblock Ordering, the regions of interest being coded into slice groups, independently from the other picture data, the location information of the regions of interest comprising the slice group numbers in which the regions of interest are coded.
Preferentially, the Supplemental Enhancement Information message comprises an identifier indicating, for each slice group, whether it relates to a region of interest.
Preferentially, the method comprises a further step of reading the SEI messages, and the step of decoding of video data decodes only the slice groups containing the regions of interest.
The invention concerns also a device for indexing a coded video data stream. According to the invention, the video data stream comprises information relative to the location of regions of interest of each picture, the device comprises means for:
- receiving the coded video stream,
- recording the coded video stream on a recording support (503),
- decoding (501) location information of the regions of interest,
- decoding (501) video data,
- selecting (502) a region of interest per picture,
- selecting (502) a predetermined number of regions of interest for the video data stream from among the regions of interest selected per picture,
- recording (503) the selected regions of interest.
The detection of the regions of interest of a picture is generally made prior to coding. This data is then used to facilitate the encoding. The inventors realized that the location of regions of interest can also be of interest during the decoding of a picture, and particularly during display on a device whose display capacity is limited. The reception terminal can in fact choose to display only the regions of interest, which provides better visibility of these regions compared with the display of the complete picture.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will be better understood and illustrated by means of embodiments and implementations, by no means limiting, with reference to the figures attached in the appendix, wherein:
- figure 1 shows a coding device according to a preferred embodiment of the invention,
- figure 2 shows a coding method according to a preferred embodiment of the invention,
- figure 3 shows a decoding device according to a preferred embodiment of the invention,
- figure 4 shows a decoding method according to another embodiment of the invention,
- figure 5 shows a personal recording type device according to another embodiment of the invention,
- figure 6 shows an indexing method in a personal recording type device implementing an embodiment of the invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Figure 1 shows a coding device in accordance with the coding standard H.264/AVC implementing a preferred embodiment of the invention. In this preferred embodiment, a video stream is coded.
A current frame Fn is presented at the coder input to be coded by it.
This frame is coded in the form of slices, namely it is divided into sub-units which each contain a certain number of macroblocks corresponding to groups of 16x16 pixels. Each macroblock is coded in intra or inter mode. Whether in intra mode or inter mode, a macroblock is coded based on a reconstructed frame. A module 109 decides the coding mode, intra or inter, of the current picture according to the content of the picture. In intra mode, P (shown in figure 1) comprises samples of the current frame Fn that were previously coded, decoded and reconstructed (uF'n in figure 1, u meaning non-filtered). In inter mode, P is obtained from a motion estimation based on one or more frames F'n-1.
A motion estimation module 101 establishes an estimation of motion between the current frame Fn and at least one preceding frame F'n-1. From this motion estimation, a motion compensation module 102 produces a frame P when the current picture Fn must be coded in inter mode. A subtractor 103 produces a signal Dn, the difference between the picture Fn to be coded and the picture P. This difference is then transformed by a DCT transform in a module 104. The transformed signal is then quantized by a quantization module 105. The quantized coefficients are then reordered by a module 111. A CABAC (Context-based Adaptive Binary Arithmetic Coding) type entropy coding module 112 then codes each picture.
The modules 106 and 107, respectively of inverse quantization and inverse transformation, enable a difference D'n to be reconstituted after transformation and quantization, then inverse quantization and inverse transformation.
When a picture is coded in intra mode, according to the decision of module 109, an intra prediction module 108 codes the picture. A picture uF'n is obtained at the output of the adder 114, as the sum of the signal D'n and the signal P. This module 108 also receives as input the reconstructed, non-filtered picture uF'n. A filter module 110 can obtain a reconstructed and filtered picture F'n from the picture uF'n.
The entropy coding module 112 transmits the coded slices encapsulated in NAL type units. In addition to the slices, the NAL units contain, for example, information relating to the headers. The NAL type units are transmitted to a module 113.
A module 116 enables the regions of interest to be determined. Several methods now enable regions of interest to be located in a picture. Particularly known are methods based on the establishment of salience maps. For example, the patent application WO2006/07263, filed in the name of Thomson Licensing on 10th January 2006 and published on 13th July 2006, discloses an effective method for establishing a salience map.
The means 116 then establish a salience map for each picture of the video. To establish this salience map, parameters entered by the user can also be taken into account. For example, it is possible to define, according to the event to which the video is related, certain important objects of the filmed scene, and particularly, for sporting events, to specify that it concerns a football match. Advantageously, this allows a salience map to be obtained that weights the salience zones according to the event. In a football match, it would be preferable to focus on the ball rather than on the terraces.
The region of interest module therefore enables one or more salient zones to be extracted, also referred to as regions of interest. These regions of interest are then geographically located on the picture.
They are identified by their coordinates according to the height and width of the picture. The size of each region of interest can also be extracted. It is also possible to associate an element of semantic information with each region of interest. For a football match, for example, such information on a region of interest is useful if the user is to select the regions of interest to be displayed from a choice of several regions of interest.
The module 115 receives information relating to the regions of interest in order to code them into an SEI ("Supplemental Enhancement Information") type message.
The SEI message is coded as indicated in the table below:
Table 1
uuid_iso_iec_11578: a single word of 128 bits to indicate our message type to the decoder.
user_data_payload_byte: 8 bits comprising a part of the SEI message.
Typically in this case: payloadSize = 17 (bytes), thus 16 for the UUID and 1 for the proprietary data. The user_data_payload_byte field is structured as follows:
Table 2
Where:
- number_of_ROI: number of regions of interest present in the picture (or in the following pictures).
- roi_x_16: X position of the region of interest in the picture, in multiples of 16 pixels.
- roi_y_16: Y position of the region of interest in the picture, in multiples of 16 pixels.
- roi_w_16: width of the region of interest in the picture, in multiples of 16 pixels.
- roi_h_16: height of the region of interest in the picture, in multiples of 16 pixels.
- semantic_information: title characterizing the region of interest.
- relative_weight: gives the weight of each region of interest of the picture, so as to indicate which region of interest is in principle the most interesting.
- macroblock_alignment: gives the number of the starting macroblock in which the region of interest is found, as well as the size of the region of interest in number of macroblocks, in width and in height.
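For illustration, the fields described above can be gathered into a simple in-memory structure, as in the following C sketch. The field names follow the description above; the bit widths, the maximum number of regions per picture and the length of the semantic title are assumptions made for the example, not values imposed by the SEI format.

#include <stdint.h>

#define MAX_ROI_PER_PICTURE 8   /* assumed maximum, not imposed by the format */
#define MAX_SEMANTIC_LEN   32   /* assumed length of the semantic title       */

typedef struct {
    uint16_t roi_x_16;               /* X position, in multiples of 16 pixels  */
    uint16_t roi_y_16;               /* Y position, in multiples of 16 pixels  */
    uint16_t roi_w_16;               /* width,      in multiples of 16 pixels  */
    uint16_t roi_h_16;               /* height,     in multiples of 16 pixels  */
    char     semantic_information[MAX_SEMANTIC_LEN]; /* title, e.g. "ball"     */
    uint8_t  relative_weight;        /* higher value = more salient region     */
    uint16_t macroblock_alignment;   /* number of the starting macroblock      */
    uint8_t  macroblock_width;       /* size in macroblocks, width             */
    uint8_t  macroblock_height;      /* size in macroblocks, height            */
} roi_descriptor_t;

typedef struct {
    uint8_t          uuid_iso_iec_11578[16];  /* identifies this SEI type       */
    uint8_t          number_of_ROI;           /* regions present in the picture */
    roi_descriptor_t roi[MAX_ROI_PER_PICTURE];
} roi_sei_message_t;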
When regions of interest are detected using the salience maps, a salience score is obtained for each region of interest; the regions are classified as salient if their salience is higher than a certain threshold predetermined by the method for obtaining the salience maps. Hence, in the SEI messages, the regions of interest whose salience is higher than this fixed threshold are listed in increasing order of salience.
The module 113 inserts the SEI message into the data stream and sends the video stream thus coded to the transmission network.
An SEI message is transmitted before each picture to which it refers. In other embodiments, it is also possible to transmit the SEI message only when the location of at least one region of interest changes between two or more pictures. Hence, during decoding, the decoder takes into account the last SEI message received, whether it immediately precedes the picture to be decoded or whether it relates to a previously received picture, when the current picture is not preceded by such an SEI message.
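A minimal sketch of this decoder-side rule, reusing the roi_sei_message_t structure sketched above, could look as follows: the decoder simply keeps the last region of interest SEI message received and applies it to every subsequent picture until a new one arrives.

/* Keeps the last region of interest SEI message received; it applies to every
 * picture decoded until a new message replaces it. */
static roi_sei_message_t g_last_roi_sei;
static int               g_have_roi_sei = 0;

void on_roi_sei_received(const roi_sei_message_t *sei)
{
    g_last_roi_sei = *sei;      /* replaces any previously received message */
    g_have_roi_sei = 1;
}

/* Returns the SEI message to use for the current picture, or NULL if no
 * region of interest SEI message has been received yet. */
const roi_sei_message_t *roi_sei_for_current_picture(void)
{
    return g_have_roi_sei ? &g_last_roi_sei : (const roi_sei_message_t *)0;
}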
Figure 2 shows a coding method in accordance with the coding standard H.264/AVC implementing a preferred embodiment of the invention.
During a step E1, the salience map associated with the video to be broadcast is determined. In order to determine this salience map, which shows the regions of interest, information relating to the video content can also be received and taken into account during the establishment of the salience map. Particularly, during a sporting event, it can be considered that the position of the ball corresponds to a region of interest for the user, and in this case the zones of the picture in which the ball is situated are privileged. When the video corresponds to the broadcast of a televised report, it can also be assumed that the presenter corresponds to a region of interest, and in this case the regions of interest are determined by privileging the zones containing the presenter, by detecting for example the face using known picture processing techniques.
At the end of step E1, one or more regions of interest relating to the video content are thus obtained.
During a step E2, the coordinates of the regions of interest in the pictures are determined. The size of the regions of interest can also be determined in pixels and semantic information on the content can be associated with each region of interest.
In parallel, during a step E3, the video stream is coded according to the coding standard H.264. During the coding, the zones that were detected as regions of interest are privileged. In order to privilege the regions of interest at the coding level, a lower quantization step is applied to them.
Following step E2, during a step E4, an SEI message is created from location and semantic information associated with the regions of interest. The SEI message thus created is in accordance with the SEI message previously described in tables 1 and 2.
During a step E5, the stream is constituted by inserting SEI messages into the stream to obtain a coded stream according to the H.264 standard.
The video stream thus coded is transmitted to decoding devices, in real time or in a deferred manner, during a step E6; the decoding devices can be local or remote.
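The way the regions of interest are privileged during step E3 can be illustrated by the following sketch, which lowers the quantization parameter of the macroblocks covered by a region of interest. It reuses the roi_descriptor_t structure sketched earlier; the base quantization parameter and the offset are arbitrary illustrative values, not values prescribed by the method.

#define BASE_QP      32   /* arbitrary illustrative base quantization parameter */
#define ROI_QP_GAIN   6   /* arbitrary QP reduction inside regions of interest  */

/* Returns 1 when macroblock (mb_x, mb_y) lies inside the given region of
 * interest; positions and sizes in multiples of 16 pixels are directly
 * macroblock units. */
static int mb_in_roi(int mb_x, int mb_y, const roi_descriptor_t *roi)
{
    return mb_x >= roi->roi_x_16 && mb_x < roi->roi_x_16 + roi->roi_w_16 &&
           mb_y >= roi->roi_y_16 && mb_y < roi->roi_y_16 + roi->roi_h_16;
}

/* A macroblock belonging to a region of interest is quantized more finely,
 * which gives it more bandwidth, as described above. */
int qp_for_macroblock(int mb_x, int mb_y,
                      const roi_descriptor_t *rois, int number_of_ROI)
{
    for (int i = 0; i < number_of_ROI; i++)
        if (mb_in_roi(mb_x, mb_y, &rois[i]))
            return BASE_QP - ROI_QP_GAIN;
    return BASE_QP;
}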
Figure 3 represents a preferred embodiment of a decoding device according to the invention, in accordance with the coding standard H.264/AVC.
A module 209 receives the coded stream at its input and extracts the different SEI messages. The NALs of useful data are transmitted to an entropy decoding module 201. The SEI messages are analyzed by a module 210. This module enables decoding of the content of the SEI messages representative of the regions of interest. The regions of interest of each picture are thus identified at the level of the decoding device in a simple manner, prior to the decoding of each picture, using the information contained in the field macroblock_alignment.
The macroblocks are transmitted to a re-ordering module 202 to obtain a set of coefficients. These coefficients undergo an inverse quantization in the module 203 and an inverse DCT transformation in the module 204, at the output of which macroblocks D'n are obtained, D'n being a distorted version of Dn. A predictive block P is added to D'n by an adder 205 to reconstruct a macroblock uF'n. The block P is obtained after motion compensation of the preceding decoded frame, carried out by a module 208, in the case of coding in inter mode, or after intra prediction of the macroblock uF'n by the module 207 in the case of coding in intra mode. A filter 206 is applied to the signal uF'n to reduce the effects of the distortion, and the reconstructed frame F'n is created from a series of macroblocks.
Using the information relating to the regions of interest contained in the SEI messages, the blocks representative of regions of interest are identified in the stream. Prior to display, these blocks can be cropped according to the choice of the user and transmitted for display to a device such as a PDA or a mobile telephone.
It is also possible to let the user choose which regions of interest he wants to display, for example by entering semantic information. If he enters, for example, "ball", the regions of interest containing a ball are displayed. If no region of interest is associated with this semantic information, then all the regions of interest can be displayed. The different regions of interest can be displayed in the form of a mosaic on the screen.
When a single region of interest is displayed, this region of interest is zoomed so as to take up the full screen.
The decoding device thus only decodes the macroblocks likely to contain information of interest to the user. In this way the decoding is faster and requires fewer resources at the level of the decoding device, and therefore at reception. This is particularly advantageous when the receiving device is a mobile terminal with limited processing capacity.
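As an illustration of this selective decoding, the following sketch decides, for each macroblock, whether it belongs to one of the regions of interest retained for display. It reuses the roi_descriptor_t structure and the mb_in_roi() helper sketched earlier; note that skipping the other macroblocks is only safe when they are coded independently of the retained ones, as in the FMO embodiment described further on.

/* A macroblock is fully decoded only if it belongs to one of the regions of
 * interest retained for display; the others can be skipped, provided they are
 * coded independently of the retained ones. */
int should_decode_macroblock(int mb_x, int mb_y,
                             const roi_descriptor_t *selected_rois, int count)
{
    for (int i = 0; i < count; i++)
        if (mb_in_roi(mb_x, mb_y, &selected_rois[i]))
            return 1;   /* inside a retained region of interest: decode */
    return 0;           /* outside every retained region: skip          */
}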
Figure 4 shows a decoding method in accordance with the coding standard H.264/AVC implementing a preferred embodiment of the invention.
Such a method can be implemented in a mobile terminal having a limited display capacity.
During a step S1, the type of display required is selected. The selection is made by means of the user interface present on the mobile terminal. Either it is decided to function in full picture mode, and in this case the entire video stream is displayed as it is transmitted by the transmitter; or it is decided to display only the regions of interest of the picture. This latter mode constitutes the particularity of the invention. When it is decided to display the regions of interest, the method passes to step S2; if not, it passes to step S8. It is understood that different types of SEI messages can be inserted into the
video stream for other applications and in this case, prior to step S8 or during step S8, there can be a step of SEI message analysis.
During a step S2, the user selects the use that he wants to make of the regions of interest. In particular, he can select:
- the maximum number of regions of interest that he wants to display,
- the manner in which he wants to display the various regions of interest on the screen, for example in the form of a mosaic,
- the degree of zoom that he wants on the region of interest,
- using a keyword, the regions of interest whose "semantic information" field comprises the keyword. In this case, for each picture, it is also possible to specify whether a single region of interest comprising the keyword is to be displayed (in this case the one for which the salience is maximum) or several regions of interest comprising the keyword.
During a step S3, the SEI messages present in the stream are analyzed as they are received. The SEI message is used to code the location of the regions of interest of the picture as they were detected prior to the picture coding. Hence, for each picture, there can be one or more regions of interest, according to the visual properties of the picture, to the picture content, or both. The SEI message is coded according to tables 1 and 2 previously described. The information relating to the SEI messages is recorded temporarily until the display of the corresponding picture.
During a step S4, the pictures are all decoded in conformance with the decoding standard. During a step S5, the decoded regions of interest are processed according to the choices that the user made during step S2. If the user has selected a zoom of the principal region of interest of the picture, then during step S6 the zone is magnified so as to reach the maximum display size. If the user has selected a mosaic of regions of interest, then the picture is recomposed from the regions of interest, each being magnified according to the screen size and the number of regions of interest selected for display. If the user has specified a keyword, then the regions of interest comprising the keyword are displayed and zoomed.
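The processing of step S5 can be illustrated by the following sketch, which retains, among the regions of interest signalled for the current picture, those matching the choices made by the user at step S2 (keyword and maximum number). It reuses the roi_sei_message_t structure sketched earlier; the substring test on the semantic_information field is an assumption made for the example.

#include <string.h>

/* Retains, among the regions of interest of the current picture, those whose
 * semantic_information field contains the keyword, up to max_count regions.
 * When a keyword is given but nothing matches, all regions are retained, as
 * described for the decoding device above. */
int select_rois_for_display(const roi_sei_message_t *sei,
                            const char *keyword, int max_count,
                            int *selected_indices)
{
    int count = 0;
    for (int i = 0; i < sei->number_of_ROI && count < max_count; i++) {
        if (keyword && !strstr(sei->roi[i].semantic_information, keyword))
            continue;                      /* keyword given but not matched */
        selected_indices[count++] = i;
    }
    if (keyword && count == 0)
        for (int i = 0; i < sei->number_of_ROI && count < max_count; i++)
            selected_indices[count++] = i;
    return count;   /* 1 region => full-screen zoom, several => mosaic */
}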
During a step S7, the regions of interest are displayed on the screen of the mobile terminal, according to the user's desire.
During a step S8, if the user has not selected to display only the regions of interest, the entire video stream is decoded for display.
Figure 5 shows a video indexing application of the invention.
Figure 5 partially shows a personal recorder (PVR) type device 500.
The PVR 500 receives a compressed video stream at its input. According to the embodiment described, this video data stream is in accordance with the coding standard H.264. The compressed video stream comprises in particular SEI messages as previously described in tables 1 and 2.
This video data stream is partly transmitted to a recording support 503. The recording support can be understood as a hard disk, a holographic support, a memory card or a Blu-ray disc. This recording support can be remote in other embodiments.
The video data stream is transmitted, in another part, to a decoder 501 to be decoded in real time, for example in order to be displayed on a television set. In the known devices, the stream is transmitted to the decoder 501 when the user wants to view it in real time. If not, it is not decoded but simply recorded, when recording is requested.
According to this aspect, the present invention proposes to decode a part of the video data stream even when viewing in real time is not requested. By a part of the video stream is understood in particular the regions of interest, or certain regions of interest.
When the decoder 501 receives a video stream for which a recording is requested, the data is transmitted to the recording support 503. The recording support 503 records the data as it is received. Simultaneously, the decoder 501 receives the video data stream and progressively decodes the SEI messages. The decoded regions of interest are transmitted to the video indexing module 502, which is responsible for their temporary recording before transmitting them to the recording support 503.
Figure 6 illustrates the method implemented by the decoder 501 and the indexing module 502.
During a step T1, the video data stream is received by the decoder 501. During a step T2, the decoder 501 decodes the SEI messages present in the video data stream. The decoded SEI messages are SEI messages as previously described in tables 1 and 2. The decoder can also decode other SEI messages, but that is not the object of the present invention. Each SEI message can describe one or more regions of interest per picture, as described in tables 1 and 2. During a step T3, the decoder 501 analyzes each SEI message and decodes each picture. During this step, the weight indicated in the SEI message is used to select which region of interest will be recorded for each picture. In a preferred embodiment, the region of interest with the maximum salience, i.e. the one having the highest weight, is kept.
Once the region of interest has been decoded, during a step T4, it is transmitted to the indexing module 502. Recording one region of interest per picture, and this for all the pictures, is of little interest, as it represents a large volume of information and does not enable an efficient indexing of the video. Hence the indexing module decides which pictures are used to index the video. According to the preferred embodiment described here, only about 10 pictures are selected for a video of one and a half hours. It can be imagined that in other embodiments the number of pictures will be greater. These 10 pictures are taken at regular intervals. The selected pictures are recorded temporarily in a RAM type memory comprised in the indexing module 502 and not shown. In order to display them in the best manner, the pictures are zoomed during a step T5, that is they are enlarged so that they are all the same size. According to a preferred embodiment, this size can be the size of the picture. For that, they are read in the temporary memory and re-recorded after their enlargement. According to another embodiment, the pictures are enlarged prior to their recording in the temporary memory.
According to another embodiment, the images are presented as a mosaic on the display. Therefore, instead of being enlarged, the images are reduced to one single size, the same for all of them.
When the entire video has been received and thus recorded on the recording support 503, during a step T6 the indexing pictures are also transferred from the temporary memory to the recording support 503 and recorded in a file.
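The selection performed by the indexing module can be illustrated by the following sketch, which retains about 10 pictures at regular intervals and, for each retained picture, the region of interest with the highest weight. It reuses the roi_sei_message_t structure sketched earlier and assumes that the total number of pictures is known in advance, for example from the duration of the scheduled recording; the resizing of step T5 is not shown.

#define NUM_INDEX_PICTURES 10   /* about 10 pictures for a one-and-a-half-hour video */

/* Returns 1 if picture number pic_num (counted from 0) is one of the pictures
 * retained for indexing, taken at regular intervals over the whole video. */
int is_index_picture(long pic_num, long total_pictures)
{
    long interval = total_pictures / NUM_INDEX_PICTURES;
    if (interval == 0)
        interval = 1;            /* very short video: keep every picture */
    return (pic_num % interval) == 0;
}

/* Returns the index of the region of interest with the highest relative
 * weight in the SEI message of a retained picture; this is the region kept
 * for indexing. */
int most_salient_roi(const roi_sei_message_t *sei)
{
    int best = 0;
    for (int i = 1; i < sei->number_of_ROI; i++)
        if (sei->roi[i].relative_weight > sei->roi[best].relative_weight)
            best = i;
    return best;
}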
Then, according to the desired use, the regions of interest are used for indexing, or can also be used for display on a PVR type device when the user wants to consult the content of the database.
According to another aspect of the invention, it is also possible to encrypt the location data of the regions of interest during the coding of the SEI messages. Hence, only users having the decryption key can access the regions of interest, and thus access the display of regions of interest or the indexing of video streams based on the location information of the regions of interest. With respect to figure 2, this encryption step would be a step E4' (not shown) inserted after step E4.
Obtaining the decryption key could, for example, be the object of a paid service from the programme broadcaster.
To do this, the SEI messages relating to regions of interest are encapsulated in RTP (Real-time Transport Protocol) type packets and transmitted on a port different from that of the video. Temporal labels of CTS type can link the SEI messages relating to regions of interest with the corresponding pictures. Advantageously, this transmission mode enables encryption of only the RTP packets containing the SEI messages, and not the video.
The decryption is carried out at the level of the receiver terminal. In the case of an MPEG-2 TS encapsulation, the encryption standard used is DVB-CSA, and the SEI messages relating to regions of interest are encapsulated in a PID different from that of the video. The SEI messages relating to regions of interest are linked to the corresponding pictures via the PTS (Presentation Time Stamp) of the PES packet header. This transmission mode allows encryption of only the PIDs that contain SEI messages relating to regions of interest, and not the video PID.
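The association between the region of interest SEI messages, carried on their own RTP port or PID, and the pictures they refer to can be illustrated by the following sketch, which stores each decrypted SEI message with its time label and looks it up by the timestamp of the picture being decoded. The table size, the timestamp type and the structure names extend the earlier sketches and are assumptions made for the example.

#include <stdint.h>

#define SEI_TABLE_SIZE 64   /* assumed size of the pending SEI table */

typedef struct {
    uint64_t          timestamp;   /* CTS or PTS value of the related picture */
    roi_sei_message_t sei;
    int               valid;
} roi_sei_entry_t;

static roi_sei_entry_t g_sei_table[SEI_TABLE_SIZE];
static int             g_sei_next = 0;

/* Called when a region of interest SEI message is received on its own RTP
 * port or PID, together with its time label. */
void store_roi_sei(uint64_t timestamp, const roi_sei_message_t *sei)
{
    g_sei_table[g_sei_next].timestamp = timestamp;
    g_sei_table[g_sei_next].sei       = *sei;
    g_sei_table[g_sei_next].valid     = 1;
    g_sei_next = (g_sei_next + 1) % SEI_TABLE_SIZE;   /* small circular table */
}

/* Called for each decoded picture: returns the SEI message carrying the same
 * timestamp, or NULL when no region of interest information was signalled. */
const roi_sei_message_t *lookup_roi_sei(uint64_t picture_timestamp)
{
    for (int i = 0; i < SEI_TABLE_SIZE; i++)
        if (g_sei_table[i].valid && g_sei_table[i].timestamp == picture_timestamp)
            return &g_sei_table[i].sei;
    return (const roi_sei_message_t *)0;
}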
According to another embodiment, the video data stream is coded in accordance with the coding standard H.264/AVC using FMO (Flexible Macroblock Ordering), which enables different parts of the picture to be coded independently and therefore decoded independently. The FMO mode uses "slice groups". The "slice groups" are defined in the standard. In this embodiment, the regions of interest are coded in slice groups different from the rest of the picture. A PPS type NAL comprises a map of the "slice groups". SEI messages such as those described hereafter are inserted, indicating in which "slice groups" the regions of interest are coded.
The tables below illustrate the format of the SEI message used according to this embodiment:
Table 3
uuid_iso_iec_11578: a single word of 128 bits to indicate our message type to the decoder.
user_data_payload_byte: 8 bits comprising a part of the SEI message.
Typically in this case:
• payloadSize = 17 (bytes) thus 16 for the UUID and 1 for the proprietary data.
• user_data_payload_byte: structured as follows:
Table 4
- Slice_group(i)_id: if the slice_group_id equals "1", then the slice_group represents a region of interest; if it equals "0", then the slice_group represents the rest of the picture.
For each slice_group representing a region of interest, semantic information, a relative weight and the macroblocks it concerns can be specified.
Hence, only the macroblocks corresponding to the regions of interest can be decoded during reception, as they are identified and coded independently.
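The slice group based signalling of Tables 3 and 4 can be illustrated by the following sketch: for each slice group, an identifier indicates whether it carries a region of interest, and a slice is decoded only if its slice group does. The structure layout and field sizes are assumptions made for the example; H.264/AVC allows at most eight slice groups per picture.

#include <stdint.h>

#define MAX_SLICE_GROUPS 8   /* H.264/AVC allows at most eight slice groups */

typedef struct {
    uint8_t  slice_group_id;            /* 1: region of interest, 0: rest of
                                           the picture                        */
    char     semantic_information[32];  /* optional title, e.g. "ball"        */
    uint8_t  relative_weight;           /* optional salience weight           */
    uint16_t first_macroblock;          /* optional: macroblock it concerns   */
} slice_group_roi_t;

typedef struct {
    uint8_t           num_slice_groups;
    slice_group_roi_t group[MAX_SLICE_GROUPS];
} fmo_roi_sei_t;

/* A slice belonging to slice group g needs to be decoded only if that slice
 * group carries a region of interest. */
int should_decode_slice_group(const fmo_roi_sei_t *sei, int g)
{
    return g < sei->num_slice_groups && sei->group[g].slice_group_id == 1;
}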