
AU2016277643A1 - Using face detection metadata to select video segments - Google Patents


Info

Publication number
AU2016277643A1
Authority
AU
Australia
Prior art keywords
video segment
face
video
motion type
operator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2016277643A
Inventor
David Ian Johnston
Mark Ronald Tainsh
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc
Priority to AU2016277643A
Publication of AU2016277643A1

Landscapes

  • Studio Devices (AREA)
  • Image Analysis (AREA)

Abstract

Using Face Detection Metadata to Select Video Segments A method of determining how a video segment is processed, the method comprising: receiving data points indicating a position of a face within frames of the video segment; determining a periodic position change of the face within the video segment based on the received data points; determining whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change; and, upon a positive determination that the operator moved according to the defined motion type, marking the video segment based on the defined motion type to modify processing of the video segment.

Description

USING FACE DETECTION METADATA TO SELECT VIDEO SEGMENTS
TECHNICAL FIELD
The present disclosure relates generally to digital video metadata processing and, in particular, to a method, system and apparatus for selecting sections of a video sequence using face detection metadata. The present invention also relates to a computer program product including a computer readable medium having recorded thereon a computer program for selecting sections of a video sequence using face detection metadata.
BACKGROUND
Video is an effective way to capture a scene or an unfolding event. People often capture videos for birthday parties, weddings, travel and sports events. Unlike still images, video has an advantage of capturing evolving, unstructured events, such as particular natural facial expressions and human interactions (e.g. talking, mutual smiling, kissing, hugging, handshakes). It is often desirable to select segments of a video sequence to generate a shorter version of the video sequence which contains the highlight sections and remove dull, repetitive or poor quality sections of the sequence. It is desirable to select video highlights automatically.
The conventional approach to automatically selecting video highlights is to post-process the video sequence on a personal computer. This has the advantage that considerable processing power is available, so that techniques which use image analysis of the individual frames of the video sequence are possible. It is also possible to use additional video metadata that the camera may capture and record along with the video (such as data from gyro sensors, face detection hardware and exposure sensors). However, the standard video formats do not include this data, so proprietary video formats may be required to exchange data between the camera and the PC.
It would be desirable to select highlights on the camera itself for several reasons. One reason is to simplify the video production process, in order to make it more likely camera operators would use the feature. Furthermore, if the camera is network connected it would also allow the selected highlights to be directly uploaded from the camera to internet video sharing services. Also additional metadata about the video sequence is easily accessible from software running on the camera itself.
The most significant problem with selecting video highlights automatically on the camera is that available processing power is severely limited. Available general purpose CPU processing power and memory capacity are usually limited, and the requirement to maximise battery life further limits the amount of processing that can be done. These limitations largely rule out using video frame image processing techniques for highlight selection on cameras. It would be preferable to perform highlight selection using only the video metadata, which typically contains much less information than the video frame data, so processing can be performed much more quickly and cheaply.
There are many possible techniques for automatically selecting highlights, two of which will now be discussed. A fully fledged video selection system would need to use many additional techniques.
One aspect that lowers the perceived quality of a section of video is camera shake. While a certain amount of camera shake is acceptable, it is generally desirable to exclude or remove sections of the video when camera shake becomes excessive. To do this, the camera shake must be detected, and a threshold chosen at which the camera shake becomes objectionable.
One existing technique for detecting camera shake is to use either gyro or accelerometer data to detect high frequency movement of the camera. Another is to use image analysis to track objects in the video foreground or background to detect high frequency movements of the camera.

A technique for selecting sections of the video for highlights is to identify an important subject in the video, and then preferentially select sequences involving that subject. One common marker of an important subject is when the video operator walks towards the subject while videoing (the subject in this case might be another person, or an object like a statue). To detect this situation, it is necessary to detect that the operator is walking, and to detect that there is a centred object in the video frame that the camera is approaching.
There are several existing techniques for detecting that the operator is walking. One method is to use accelerometer data, and to look in the accelerometer data for periodic motion that is consistent with walking. Another method is to use GPS tracking data, and look for movement in the data at a rate that is consistent with walking. Detecting a centred subject can be performed by using frame image analysis to look for an unmoving object in the centre of the video frame. However, to simplify hardware architecture of a camera and to minimise costs of manufacture of the camera, it is desirable to determine whether the operator is walking, even in absence of the accelerometer data or when the accelerometer data is noisy.
SUMMARY
In accordance with an aspect of the present disclosure, there is provided a method of determining how a video segment is processed, the method comprising: receiving data points indicating a position of a face within frames of the video segment; determining a periodic position change of the face within the video segment based on the received data points; determining whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change; and, upon a positive determination that the operator moved according to the defined motion type, marking the video segment based on the defined motion type to modify processing of the video segment.
In accordance with another aspect of the present disclosure, there is provided an apparatus for determining how a video segment is processed, the apparatus being configured to: receive data points indicating a position of a face within frames of the video segment; determine a periodic position change of the face within the video segment based on the received data points; determine whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change; and, upon a positive determination that the operator moved according to the defined motion type, mark the video segment based on the defined motion type to modify processing of the video segment.
In accordance with a further aspect of the present disclosure, there is provided a system for determining how a video segment is processed, the system comprising: a memory comprising data and a computer program; a processor coupled to the memory for executing the computer program comprising instructions for: receiving data points indicating a position of a face within frames of the video segment, determining a periodic position change of the face within the video segment based on the received data points, determining whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change, and, upon a positive determination that the operator moved according to the defined motion type, marking the video segment based on the defined motion type to modify processing of the video segment.
In accordance with yet a further aspect of the present disclosure, there is provided a non-transitory computer readable medium having a program stored on the medium for determining how a video segment is processed, the program comprising: code for receiving data points indicating a position of a face within frames of the video segment, code for determining a periodic position change of the face within the video segment based on the received data points, code for determining whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change, and, upon a positive determination that the operator moved according to the defined motion type, code for marking the video segment based on the defined motion type to modify processing of the video segment.
BRIEF DESCRIPTION OF THE DRAWINGS
One or more embodiments of the invention will now be described with reference to the following drawings, in which:
Fig. 1A is a diagram of a digital camera capable of shooting both still pictures and video.
Fig. 1B is a schematic block diagram of a digital camera capable of shooting both still pictures and video.
Fig. 2 is a schematic flow diagram illustrating a method of rejecting video segments with objectionable camera shake.
Fig. 3 is a schematic flow diagram illustrating a method of determining camera shake and low frequency variability using gyro metadata.
Fig. 4 is a schematic flow diagram illustrating a method of detecting walking using face location metadata.
Fig. 5 is a schematic flow diagram illustrating a method of splitting a list of gyro data local minima and maxima into non-walking and potential-walking segments.
Fig. 6 is a schematic flow diagram illustrating a method of processing potential-walking segments to determine if the camera operator was walking.
Fig. 7 is a schematic flow diagram illustrating a method of determining camera shake and low frequency variability using face location metadata.
Fig. 8 is a schematic flow diagram illustrating a method of determining important video frames by detecting when the camera operator is walking towards a subject.
Fig. 9 is a schematic flow diagram illustrating a method of detecting a centred subject with relative movement towards the camera.
Fig. 10A is a diagram illustrating a video frame and associated face detection data.
Fig. 10B is a diagram illustrating a plot of smoothed face location data.
DETAILED DESCRIPTION INCLUDING BEST MODE
Fig. 1A is a cross-section diagram of an exemplary image capture system 100, upon which the various arrangements described can be practiced. In the general case, the image capture system 100 may be a digital video camera (also referred to as a camcorder).
As seen in Fig. 1A, the camera system 100 comprises an optical system 102 which receives light from a scene 101 and forms an image on a sensor 121. The sensor 121 comprises a 2D array of pixel sensors which measure the intensity of the image formed on it by the optical system as a function of position. The operation of the camera, including user interaction and all aspects of reading, processing and storing image data from the sensor 121, is coordinated by a main controller 122 which comprises a special purpose computer system. This system is considered in detail below. The user is able to communicate with the controller 122 via a set of buttons including a shutter release button 128, used to initiate focus and capture of image data, and other general and special purpose buttons 124, 125, 126 which may provide direct control over specific camera functions such as flash operation or support interaction with a graphical user interface presented on a display device 123. The display device may also have a touch screen capability to further facilitate user interaction. Using the buttons and controls it is possible to control or modify the behaviour of the camera. Typically it is possible to control capture settings such as the priority of shutter speed or aperture size when achieving a required exposure level, or the area used for light metering, use of flash, ISO speed, options for automatic focusing and many other photographic control functions. Further, it is possible to control processing options such as the colour balance or compression quality. The display 123 is typically also used to review the captured image or video data and to provide a live preview of the scene.
The optical system comprises an arrangement of lens groups 110, 112, 113 and 117 which can be moved relative to each other along a line 131 parallel to an optical axis 103 under control of a lens controller 118 to achieve a range of magnification levels and focus distances for the image formed at the sensor 121. The lens controller 118 may also control a mechanism 111 to vary the position, on any line 132 in the plane perpendicular to the optical axis 103, of a corrective lens group 112, in response to input from one or more motion sensors 115, 116 or the controller 122 so as to shift the position of the image formed by the optical system on the sensor. Typically the corrective optical element 112 is used to effect an optical image stabilisation by correcting the image position on the sensor for small movements of the camera such as those caused by handshake. The optical system may further comprise an adjustable aperture 114 and a shutter mechanism 120 for restricting the passage of light through the optical system. Although both the aperture and shutter are typically implemented as mechanical devices they may also be constructed using materials, such as liquid crystal, whose optical properties can be modified under the control of an electrical control signal. Such electro-optical devices have the advantage of allowing both shape and the opacity of the aperture to be varied continuously under control of the controller 122.
Fig. 1B is a schematic block diagram for the controller 122 of Fig. 1A, in which other components of the camera system which communicate with the controller are depicted as functional blocks. In particular, the image sensor 190, lens controller 197 and gyro sensor 199 are depicted without reference to their physical organisation or the image forming process and are treated only as devices which perform specific pre-defined tasks and to which data and control signals can be passed. Fig. 1B also depicts a flash controller 197 which is responsible for operation of a strobe light that can be used during image capture in low light. Auxiliary sensors may include orientation sensors that detect if the camera is in a landscape or portrait orientation during image capture, and other sensors that detect the colour of the ambient illumination or assist with autofocus and so on. Although these are depicted as part of the controller 122, they may in some implementations be implemented as separate components within the camera system. The gyro sensor 199 detects angular motion of the camera. The gyro sensor may form part of sensors 115 and/or 116 as shown in Fig. 1A, or it may be a separate sensor.
The controller comprises a processing unit 150 for executing program code, Read Only Memory (ROM) 160 and Random Access Memory (RAM) 170 as well as non-volatile mass data storage 191. Optionally, there may be a dedicated face detection unit 180. In addition, at least one communications interface 192 is provided for communication with other electronic devices such as printers, displays and general purpose computers. Examples of communication interfaces include USB, IEEE 1394, HDMI and Ethernet. An audio interface 193 comprises one or more microphones and speakers for capture and playback of digital audio data. A display controller 194 and button interface 195 are also provided to interface the controller to the physical display and controls present on the camera body. The components are interconnected by a data bus 181 and control bus 182.
In a capture mode, the controller 122 operates to read data from the image sensor 190 and audio interface 193 and manipulate that data to form a digital representation of the scene that can be stored to a non-volatile mass data storage 191. In the case of a still image camera, image data may be stored using a standard image file format such as JPEG or TIFF, or it may be encoded using a proprietary raw data format that is designed for use with a complementary software product that would provide conversion of the raw format data into a standard image file format. Such software would typically be run on a general purpose computer. For a video camera, the sequences of images that comprise the captured video are stored using a standard format such as DV, MPEG or H.264. Some of these formats are organised into files such as AVI or QuickTime, referred to as container files, while other formats such as DV, which are commonly used with tape storage, are written as a data stream. The non-volatile mass data storage 191 is used to store the image or video data captured by the camera system and has a large number of realisations including but not limited to removable flash memory such as a compact flash (CF) or secure digital (SD) card, memory stick, multimedia card, miniSD or microSD card; optical storage media such as writable CD, DVD or Blu-ray disk; or magnetic media such as magnetic tape or hard disk drive (HDD) including very small form-factor HDDs such as microdrives. The choice of mass storage depends on the capacity, speed, usability, power and physical size requirements of the particular camera system.

When a video frame or still image has been captured, face detection is performed. This face detection may be performed by a dedicated face detection module 180, if it exists, or it may be performed by a library executing on the processing unit 150. The face detection module or library processes the input still image or video frame as it is captured and detects one or more human faces present in the still photo or video frame. The exact number of faces that can be detected may be limited by the camera design. Information determined about each face, which may be a unique face identifier, the pixel location of the centre of the face and the size of the bounding box enclosing the face, is stored as metadata alongside the captured still photo or video sequence. Camera motion information from the gyro sensor 199, if it exists, may also be stored alongside the still photo or video sequence as metadata. When a video sequence is being captured, the metadata is associated with frames in the video sequence. Thus for each frame in the video sequence there will be face information and camera motion information stored in the metadata which is particular to that frame.
In a playback or preview mode, the controller 122 operates to read data from the mass storage 191 and present that data using the display 194 and audio interface 193.
The processor 150 is able to execute programs stored in one or both of the connected memories 160 and 170. When the camera system 100 is initially powered up, system program code 161, resident in ROM memory 160, is executed. This system program permanently stored in the camera system’s ROM is sometimes referred to as firmware. Execution of the firmware by the processor fulfils various high level functions, including processor management, memory management, device management, storage management and user interface.
The processor 150 includes a number of functional modules including a control unit (CU) 151, an arithmetic logic unit (ALU) 152, a digital signal processing engine (DSP) 153 and a local or internal memory comprising a set of registers 154 which typically contain atomic data elements 156, 157, along with internal buffer or cache memory 155. One or more internal buses 159 interconnect these functional modules. The processor 150 typically also has one or more interfaces 158 for communicating with external devices via the system data 181 and control 182 buses.
The system program 161 includes a sequence of instructions 162 through 163 that may include conditional branch and loop instructions. The program 161 may also include data which is used in execution of the program. This data may be stored as part of the instruction or in a separate location 164 within the ROM 160 or RAM 170.
In general, the processor 150 is given a set of instructions which are executed therein. This set of instructions may be organised into blocks which perform specific tasks or handle specific events that occur in the camera system. Typically the system program will wait for events and subsequently execute the block of code associated with that event. This may involve setting into operation separate threads of execution running on independent processors in the camera system such as the lens controller 197 that will subsequently execute in parallel with the program running on the processor. Events may be triggered in response to input from a user as detected by the button interface 195. Events may also be triggered in response to other sensors and interfaces in the camera system.
The execution of a set of the instructions may require numeric variables to be read and modified. Such numeric variables are stored in RAM 170. The disclosed method uses input variables 171, that are stored in known locations 172, 173 in the memory 170. The input variables are processed to produce output variables 177, that are stored in known locations 178, 179 in the memory 170. Intermediate variables 174 may be stored in additional memory locations in locations 175, 176 of the memory 170. Alternatively, some intermediate variables may only exist in the registers 154 of the processor 150.
The execution of a sequence of instructions is achieved in the processor 150 by repeated application of a fetch-execute cycle. The control unit 151 of the processor maintains a register called the program counter which contains the address in memory 160 of the next instruction to be executed. At the start of the fetch-execute cycle, the contents of the memory address indexed by the program counter are loaded into the control unit. The instruction thus loaded controls the subsequent operation of the processor, causing for example, data to be loaded from memory into processor registers, the contents of a register to be arithmetically combined with the contents of another register, the contents of a register to be written to the location stored in another register and so on. At the end of the fetch-execute cycle the program counter is updated to point to the next instruction in the program. Depending on the instruction just executed this may involve incrementing the address contained in the program counter or loading it with a new address in order to achieve a branch operation.
Each step or sub-process in the processes of flow charts are associated with one or more segments of the program 161, and is performed by repeated execution of a fetch-execute cycle in the processor 150 or similar programmatic operation of other independent processor blocks in the camera system.
Overview of the Invention
As discussed above, to automatically select highlight segments from a video sequence using software running on the camera, it is advantageous that the software only use the video metadata to make the highlight selection. Using image processing techniques on the video frame data would require too much processing power.
Also as discussed above, there are two aspects of selecting video segments for which solutions will be offered:
The first aspect is to detect camera shake. To do this, three steps are performed: (1) detect camera shake, (2) determine a threshold above which camera shake is undesirable and (3) test the camera shake for each frame against the threshold.
In the case of a camera which can provide both face detection metadata and gyro sensor metadata, embodiments may use both to detect camera shake. The high frequency component of the gyro sensor metadata over time is extracted and smoothed to produce a value of camera shake for every frame. The face location data is then used in combination with the gyro data to detect if the camera operator was walking at the time the video was captured. To do this, the centre position information of one of the detected faces is examined for a periodic signal in one or other axis. The centre position information includes changes in the x and y position of the centre of the face detection bounding box across a number of frames. The period, quality and magnitude of this periodic signal are examined to see if they are consistent with walking, i.e. a period of oscillation of the centre position in the range 0.5 to 2.0 seconds, and a magnitude greater than an experimentally determined threshold. The low frequency component of the gyro data is used to confirm the detection of walking.
Therefore, the processing unit of the camera receives data points that indicate a position of a face within frames of the video segment, where those frames may represent a particular scene.
The processing unit then determines a periodic position change of the face within the video segment based on those received data points. Further details of how the periodic position change is determined are provided herein.
Once it has been determined if the operator was walking for any particular frame, a camera shake threshold is chosen. It has been observed that the apparent effect of camera shake is lessened when the operator is walking, so a higher camera shake threshold is used for frames where the camera operator was detected to be walking. Finally, segments of the video where the camera shake is greater than the determined threshold are marked as rejected.
Therefore, upon the processing unit of the camera making a positive determination that the camera operator moved according to a defined motion type (such as walking), a threshold value is determined. This threshold value is then compared against a camera shake value to determine the marking of the video segment, e.g. whether the video segment is to be marked in memory to be included in a highlight portion of a video.
This approach is also applicable to cameras and other devices that perform face detection but have no gyro sensor. For such a camera, the camera shake may be detected from the face location data. The high frequency component of the face centre position data is used to detect camera shake. The size of the bounding box is used to scale this result, as the distance of the face from the camera will affect the amount of movement of the face centre due to a certain amount of camera shake. In this case walking is again detected using the face location data.
The second aspect to be considered is detecting when the camera operator is walking towards a subject while capturing video. The face location camera metadata will be used to detect this situation as follows. The walking motion of the camera operator is detected using the method described above. If the camera detects that the operator is walking, the bounding box and centre of the face location data are then examined. If there is a detected face whose bounding box is steadily growing for a segment of the video and whose face centre point is roughly centred in the video frame for the duration of the segment, then walking motion of the camera operator towards a subject has been detected, and the video section is marked as important.
Therefore, the processing unit of the camera determines whether the operator of an apparatus that captured the video segment (e.g. the camera operator) moved according to a defined motion type (e.g. walking, running etc.) based on the determined periodic position change.
System Implementation Example 1
Fig. 2 is a schematic flow diagram showing a method 200 of rejecting video sequences with undesirable camera shake. The method is implemented as program code stored in memory and executed by processing unit 150 described herein. The input to the method is the metadata for every frame in the input video sequence. The output of the method is a list of video frames where the camera shake is undesirable. For example, if it is determined for a video segment that the camera shake is undesirable, the video segment may be rejected, i.e. not included in a video highlight section.
The method 200 begins at step 201 where the input video metadata is processed to produce a camera shake value and a low frequency variability value of the motion signal for every frame. The steps undertaken to perform this processing will be described later with reference to Fig. 3.
Once step 201 is complete, the process moves to step 202. In this step, the face detection metadata, in combination with the low frequency variation data produced in step 201, is processed to detect if the camera operator is walking for each frame in the input video. How this step is performed will be described later with reference to Fig. 4.
Once step 202 is complete, the current frame is set to be the first frame in the input video, and the method moves to the decision point 203.
At this decision point a different path is taken depending on whether step 202 determined that the camera operator was walking when the current frame was captured. If the operator was not walking, then the processing moves to step 204. If the camera operator was walking then the processing moves to step 205.
In step 204, a low camera shake threshold is selected for the current video frame. The processing then moves to step 206.
In step 205, a high camera shake threshold is selected for the current video frame. The processing then continues on to step 206.
In step 206, if the camera shake for the current video frame, as determined in step 201, is greater than the camera shake threshold determined in step 204 or 205, then the video frame is marked as rejected. In other words, upon a positive determination that the camera operator moved according to a defined motion type, such as walking, running or the like, the method 200 in step 206 marks the video segment based on the defined motion type. The marking comprises an indication of whether the video segment is of any interest, e.g. for generating a video highlight section. The indication can be further used to modify processing of the video segment. For example, if the segment is marked as rejected, such a video segment will be ignored when a video highlight section is generated. In contrast, if the video segment is marked as important or of interest, such a segment would be selected to generate a video highlight section. In one implementation, such an indication can be stored in memory 170 in association with the corresponding video segment. For example, such an indication can be stored as part of the frame metadata for each frame of the video segment. Additionally, the indication may be used to adjust further processing of the video segment. For example, a video segment marked as rejected can be skipped during display of the video, rendered at a lower resolution or deleted if required.
The processing then moves to the decision point 207. If there are unprocessed frames in the video, then the current frame is set to the next frame and the processing moves back to the decision point 203.
If there are no more remaining unprocessed frames, then the processing moves to the end point 208, and the method 200 is complete. The result of this method is a list of all video frames which have been marked as rejected in step 206.
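A minimal Python sketch of the per-frame loop in steps 203 to 206 is shown below. The per-frame shake values and walking flags are assumed to have been computed by steps 201 and 202; the function name, list-based inputs and threshold parameters are illustrative assumptions, and the actual threshold values would be determined experimentally for a given camera.

def reject_shaky_frames(shake, walking, low_threshold, high_threshold):
    # shake: camera shake value per frame (step 201)
    # walking: True/False per frame (step 202)
    rejected = []
    for frame, (shake_value, operator_walking) in enumerate(zip(shake, walking)):
        # Steps 204/205: a higher shake threshold is tolerated while walking.
        threshold = high_threshold if operator_walking else low_threshold
        # Step 206: mark the frame as rejected if shake exceeds the threshold.
        if shake_value > threshold:
            rejected.append(frame)
    return rejected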
Having described the high level processing done for method 200, the detailed processing performed for steps 201 and 202 will now be described.
Figure 3 describes the method 300 for the processing performed by step 201. The method is implemented as program code stored in memory and executed by processing unit 150 described herein. In this method the gyro metadata is processed to determine the camera shake and low frequency variability for every frame in the input video.
Method 300 starts at step 301, where the gyro metadata is obtained from the mass storage device 191. This metadata is the angular acceleration information from the gyro module, and is composed of at least two numbers for every frame in the input video. One number is the pitch angular acceleration and the other is the yaw angular acceleration for the frame in question. This metadata exists for every frame in the input video. Once the gyro data is obtained, the processing moves on to step 302.
In step 302, a low pass filter operation is performed on the gyro metadata in order to separate out the low frequency component of the signal. One possible method is to smooth the signal using a weighted moving average. For every frame, a weighted average is taken of all frames within n frames of the frame in question, where the contribution of a particular frame to the weighted average depends on the inverse of the distance of the frame from the current frame.
The value of n should be chosen so as to clearly distinguish camera shake from operator walking. A value of around 0.5 seconds worth of frames is appropriate. This smoothing operation is performed separately for the yaw gyro data and for the pitch gyro data.
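As a rough sketch of the weighted moving average described above, the following Python function smooths a per-frame signal. The exact weighting used by the camera firmware is not specified, so the inverse-distance weight of 1/(distance + 1) and the function name smooth() are illustrative assumptions only.

def smooth(values, n):
    # Weighted moving average over all frames within n frames of the current
    # frame; each neighbour is weighted by the inverse of its distance
    # (1/(distance + 1) is one possible interpretation, assumed here).
    smoothed = []
    for i in range(len(values)):
        total, weight_sum = 0.0, 0.0
        for j in range(max(0, i - n), min(len(values), i + n + 1)):
            weight = 1.0 / (abs(i - j) + 1)
            total += weight * values[j]
            weight_sum += weight
        smoothed.append(total / weight_sum)
    return smoothed

# Example: at 30 frames per second, n for a 0.5 second window is 15.
# pitch_smoothed = smooth(pitch_gyro, n=15)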
Once step 302 is complete, the processing moves to step 303. In this step, for every frame in the input video, and for both the pitch and yaw axis, the gyro metadata value for the axis in question is subtracted from the smoothed value calculated in step 302 and the absolute value taken (using the standard function abs()). This is the high frequency variability of the pitch or yaw gyro data: var_hf = abs(gyro_smoothed - gyro)
The processing then moves to step 304, where the camera shake is determined for each frame.
To do this, the pitch and yaw high frequency variability calculated in step 303 are further smoothed (using the same method described in step 302), again using a value of n equivalent to 0.5 seconds worth of frames. Then, for every frame in the input video, the larger of either the yaw or pitch smoothed high frequency variability value is taken to be the camera shake value for that frame.
It can therefore be seen that the camera shake value is determined based on metadata related to the video segment, and that the metadata may be at least gyro metadata.
The processing then moves to step 305, where the low frequency variability is calculated for every frame. This value will be later used to help determine that the camera operator is walking. In this step, for every frame in the input video, the absolute value is taken of both the pitch and yaw low frequency components (as determined in step 302). Each of these new pitch and yaw data values is then further smoothed, using the method of step 302, this time using a value of n equivalent to 2 seconds worth of frames. Finally, for every frame in the input video, the larger of these smoothed pitch and yaw values is taken to be the low frequency variability of the frame in question.
Once step 305 is complete, the processing moves to the end point 306, and the processing for method 300 is complete.
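A minimal Python sketch of steps 302 to 305 is given below, reusing the smooth() helper sketched earlier. Only the window sizes (0.5 and 2 seconds worth of frames) come from the text; the function and variable names are assumptions.

def gyro_shake_and_variability(pitch, yaw, fps):
    n_half = int(0.5 * fps)   # 0.5 seconds worth of frames
    n_two = int(2 * fps)      # 2 seconds worth of frames

    # Step 302: low frequency component of each axis.
    pitch_lf = smooth(pitch, n_half)
    yaw_lf = smooth(yaw, n_half)

    # Step 303: high frequency variability = abs(smoothed - raw).
    pitch_hf = [abs(s - p) for p, s in zip(pitch, pitch_lf)]
    yaw_hf = [abs(s - y) for y, s in zip(yaw, yaw_lf)]

    # Step 304: smooth again and take the larger axis value as camera shake.
    pitch_hf_sm = smooth(pitch_hf, n_half)
    yaw_hf_sm = smooth(yaw_hf, n_half)
    shake = [max(p, y) for p, y in zip(pitch_hf_sm, yaw_hf_sm)]

    # Step 305: low frequency variability from the smoothed absolute
    # low frequency components, using the 2 second window.
    pitch_var = smooth([abs(v) for v in pitch_lf], n_two)
    yaw_var = smooth([abs(v) for v in yaw_lf], n_two)
    low_freq_var = [max(p, y) for p, y in zip(pitch_var, yaw_var)]

    return shake, low_freq_var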
Method 400, which is the processing performed for step 202, will now be described. The method is implemented as program code stored in memory and executed by processing unit 150 described herein. Figure 4 is a schematic flow diagram of this method.
Method 400 starts at step 401. In this step, every frame in the input video is marked as 'Not walking'.
The processing then moves to the next step, 402. In this step, the face location data is acquired for every frame in the input video. The face location data consists of a list of detected faces. For each face there are five data values: a unique face ID, the x and y coordinates of the pixel corresponding to the centre of the face, and the width and height of the bounding box enclosing the face.
Figure 10A is a representation of a video frame which can be used to illustrate the face detection data. The box numbered 1000 is the boundary of the video frame. The frame shows the scene background (numbered 1002 in the figure), and a person (numbered 1001) in the scene foreground. The face detection module in the camera has detected the face of the person 1001, and has assigned a face ID to the detected face. The centre pixel of the face is marked 1004 in the figure, and the bounding box of the face is marked 1003.
Note that the same face in separate frames will have an identical face ID, so this number can be used to track faces between frames.
The processing moves to step 403. In this step, a list is created of all different face IDs in all of the frames in the video sequence. One of these faces is selected using any suitable method. For example, the face that appears in the greatest number of video frames may be selected. This is the primary selection criterion. If there are several faces which meet this criterion, then among those, the face with the largest bounding box (when averaged over all the frames the face appears in) may be chosen.
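The face selection of step 403 can be sketched in Python as follows; the per-face field names ('id', 'w', 'h') are assumptions about how the face metadata might be represented, not the camera's actual data layout.

from collections import defaultdict

def select_primary_face(frames):
    # frames: one list of detected faces per video frame, each face a dict
    # with at least 'id', 'w' and 'h' (assumed field names).
    counts = defaultdict(int)
    areas = defaultdict(list)
    for faces in frames:
        for face in faces:
            counts[face['id']] += 1
            areas[face['id']].append(face['w'] * face['h'])
    if not counts:
        return None
    # Primary criterion: most frames; tie-break: largest average bounding box.
    return max(counts, key=lambda fid: (counts[fid],
                                        sum(areas[fid]) / len(areas[fid])))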
The processing step moves to step 404 where a low pass filter operation is performed separately on the face location x and y data of the face selected in step 403 to remove high frequency components. The same method as used in step 302 may be used here, again using a value of n equivalent to 0.5 seconds worth of frames.
The processing next moves to step 405, where the smoothed face location data calculated in step 404 is scanned to identify all frames which are local minima or local maxima. Figure 10B illustrates this step. Fig. 10B shows an example plot of the smoothed y axis face location data generated in step 404 against the frame number. 1020 is the plotted list of data points (each data point shown as a small cross). On this plot, the data points marked 1021 and 1022 are local maxima - the points on either side of these points have y values less than these points. Similarly the data points marked 1023 and 1024 are local minima - the points on either side have y values greater than these points. Both of the x and y axes are scanned separately. The results of this step are two lists of frames (one for each of x and y) which are the alternating local maxima and minima. For the data points shown in Figure 10B, the result for the y axis would be the list of frames associated with data points 1023, 1021, 1024 and 1022 in that order.
The method then moves to step 406. In this step one of the x and y axes is selected. To do this, the following is done for each axis:
  • For every consecutive pair of data points associated with the frames in the list of alternating local maxima and minima, a value is calculated which is the difference in the x (or y, depending on which axis is being examined) value of the pair of points. In the example plot shown in Figure 10B, the value for the pair of points 1023 and 1021 is the distance indicated by the arrow marked 1025.
  • These values are averaged together to produce a single number for the list, which is the magnitude of the list.
The magnitude of the x and y lists are compared, and if the x axis list has a greater magnitude, then the x axis is chosen, otherwise the y axis is chosen.
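Steps 405 and 406 can be sketched as follows; the helper names are assumptions, and the smoothed face location signals are those produced in step 404.

def local_extrema(values):
    # Step 405: indices of local maxima and minima in the smoothed signal.
    extrema = []
    for i in range(1, len(values) - 1):
        if values[i] > values[i - 1] and values[i] > values[i + 1]:
            extrema.append(i)   # local maximum
        elif values[i] < values[i - 1] and values[i] < values[i + 1]:
            extrema.append(i)   # local minimum
    return extrema

def axis_magnitude(values, extrema):
    # Step 406: average absolute difference between consecutive extrema.
    if len(extrema) < 2:
        return 0.0
    diffs = [abs(values[b] - values[a]) for a, b in zip(extrema, extrema[1:])]
    return sum(diffs) / len(diffs)

# The axis with the larger magnitude is the one examined for walking:
# chosen = 'x' if axis_magnitude(x_sm, local_extrema(x_sm)) > \
#                 axis_magnitude(y_sm, local_extrema(y_sm)) else 'y'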
The processing then moves to step 407, where the list of frames associated with the local maxima and local minima for the axis chosen in step 406 is split into potential-walking segments (these are sections of the video where the camera operator may be walking). The result of this step is a list of zero or more potential-walking segments. How this step operates is detailed below in reference to Fig. 5.
Once step 407 is complete, the processing moves on to step 408. In this step, the list of potential-walking segments produced in step 407 is further processed to determine when the camera operator is walking. When this step is finished, every frame in the input video is marked as either 'Walking' or 'Not walking'. The details of this step are explained below in reference to Fig. 6.
The processing then moves to the end step 409 and the method 400 is complete.
It can be seen above that the periodic position change is determined by the processing unit of the camera based on variability of a time period between local minima and maxima values that are associated with a change in the position of the face in the video segment on a defined axis.
Method 500 is now described in relation to Fig. 5. The method is implemented as program code stored in memory and executed by processing unit 150 described herein. This is the processing performed for step 407. This method takes the axis list (a list of frames which are local minima or maxima) selected in step 406 and breaks the axis list into zero or more potential-walking segments. The result of this method is a list of potential-walking segments, where each segment is a sub-section of the input axis list.
Initially all elements in the input axis list are unprocessed, and no list segment is active.
This method begins at decision point 501. If there are unprocessed frame elements in the input axis list, then the processing moves to step 502, otherwise it moves to step 508.
In step 502, the next unprocessed frame element in the input axis list is selected. The processing moves to the decision point 503.
In the decision point 503, the low frequency variability for the selected frame (as computed in step 201) is examined. If the low frequency variability is greater than a threshold (experimentally determined to be consistent with walking), then the processing moves to the decision point 504, otherwise the processing moves to step 507. It will be understood that the threshold value may be changed for different types of movement other than walking. For example, the herein described method and associated apparatus may be used to determine operator movement types such as running, skipping, swimming, cycling, horse riding etc.
In the decision point 504, if a potential-walking segment is active, then the processing moves to 506, otherwise the processing moves to step 505.
In step 505 a new empty potential-walking segment is created and made active. The processing then moves to step 506.
In step 506, the frame element selected in step 502 is added to the currently active potential-walking segment. The processing then moves back to the decision point 501.
In step 507 any currently active potential-walking segment is ended and made not active. The currently selected frame element is discarded and the processing moves back to the decision point 501.
In step 508, any currently active potential-walking segment is ended and the processing moves to the end point 509.
Once 509 is reached, the method 500 is complete.
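Method 500 can be summarised by the following Python sketch; the list of extremum frames and the per-frame low frequency variability come from steps 405-406 and step 201 respectively, while the function name and threshold parameter are illustrative assumptions.

def split_potential_walking(extrema_frames, low_freq_var, threshold):
    segments = []
    active = None
    for frame in extrema_frames:
        if low_freq_var[frame] > threshold:   # decision point 503
            if active is None:
                active = []                   # step 505: start a new segment
            active.append(frame)              # step 506
        else:
            if active:
                segments.append(active)       # step 507: close the segment
            active = None
    if active:
        segments.append(active)               # step 508: close any open segment
    return segments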
Figure 6 describes method 600, the method of step 408. The method is implemented as program code stored in memory and executed by processing unit 150 described herein. This method is performed for every potential-walking segment produced in step 407. This method takes the potential-walking segment and determines if the camera operator was walking for the duration of the segment.
This method begins at step 601 where the dominant period of the potential-walking segment is computed. The following is done:
  • For every consecutive pair of points (alternating local minima and maxima) in the segment, a value is calculated which is the time difference between the two points. In Figure 10B, the value for the pair of points 1023 and 1021 is indicated by the arrow marked 1026.
  • The average of these time difference values is then calculated.
The dominant period of the face location oscillation for the potential-walking segment is determined to be twice the calculated average. If it is later decided that the camera operator is walking, then this dominant period will be the period of the operator's steps.
The processing then moves to step 602. The average distance between consecutive points in the segment is calculated. This is the magnitude of the face location oscillation for the segment.
The processing then moves to step 603 where the period variability is calculated. The following is performed for the potential-walking segment:
  • For every consecutive pair of points in the segment, the time difference between the two points is calculated. This calculated value is then subtracted from half of the dominant period (as calculated in step 601), and the absolute value taken.
  • This absolute value is then averaged for all of the consecutive pairs of elements in the segment.
This averaged number is then the period variability. Once this average is computed, the processing then moves to the decision point 604.
Decision point 604 uses the three numbers determined in the steps before it. If the dominant period is within a normal walking range (a period between 0.5 seconds and 2 seconds), the magnitude of the face location oscillation for the segment is above a threshold (the threshold varies depending on the camera and gyro sensor) and the period variability is below a threshold (again varying depending on the camera and gyro sensor), then the processing moves to step 605. If not, the processing moves to the end point 606, where the processing for method 600 is complete.
In step 605 the camera operator is deemed to be walking for the duration of the potential-walking segment. Every frame in the input video between the time of the first and last points in the potential-walking segment is marked as 'Walking'.
As such, upon a positive determination that the camera operator moved according to a defined motion type, such as walking, running or the like, the method 600 in step 605 marks the video segment based on the defined motion type, i.e. ‘Walking’, ‘Running’ or the like. The marking comprises an indication of a video segment type or metadata about the video segment, which can be used to modify processing of the segment. In one implementation, such an indication can be stored in memory 170 in association with the corresponding video segment. For example, such an indication can be stored as a part of frame metadata for each frame of the video segment.
Processing of the segment can be modified based on the indication of the video segment type.
For example, if the segment is marked as ‘Walking’, different camera shake thresholds can be used to determine whether the camera shake is acceptable or not, for example, as discussed in relation to steps 203-206. Alternatively or additionally, different image stabilization parameters can be used for segments marked as ‘Walking’ as opposed to unmarked segments or segments marked as ‘Running’, e.g. different assumptions can be made about camera movement in light of video segment type.
The processing then moves to the end point 606, where the processing for method 600 is complete.
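The calculations and decision of steps 601 to 604 can be sketched as follows. The magnitude and variability thresholds are camera dependent and are left as parameters; the function name and the representation of the segment as a list of extremum frame numbers are assumptions for illustration.

def segment_is_walking(segment_frames, positions, fps,
                       magnitude_threshold, variability_threshold):
    if len(segment_frames) < 2:
        return False
    times = [f / fps for f in segment_frames]
    half_periods = [b - a for a, b in zip(times, times[1:])]

    # Step 601: dominant period is twice the average time between extrema.
    dominant_period = 2.0 * sum(half_periods) / len(half_periods)

    # Step 602: magnitude of the face location oscillation.
    diffs = [abs(positions[b] - positions[a])
             for a, b in zip(segment_frames, segment_frames[1:])]
    magnitude = sum(diffs) / len(diffs)

    # Step 603: average deviation from half the dominant period.
    variability = sum(abs(h - dominant_period / 2.0)
                      for h in half_periods) / len(half_periods)

    # Step 604: all three conditions must hold for walking.
    return (0.5 <= dominant_period <= 2.0
            and magnitude > magnitude_threshold
            and variability < variability_threshold)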
Example 2

A second example considers the case where a video camera does not have a gyro sensor, or the gyro sensor data is not available to be used for video selection. In this case, the face location data can be used to determine that camera shake is occurring.
The process in this case is similar to example 1, and only the differences will be described here. A different process is performed for step 201. The new method for this step is method 700 which is described in reference to Fig. 7. The method is implemented as program code stored in memory and executed by processing unit 150 described herein. This method uses the face location data to determine the camera shake and low frequency variability for every frame in the input video.
In accordance with this method, the camera shake value is again determined based on metadata related to the video segment, where the metadata is at least face location metadata. It will be understood that a combination of both gyro metadata and face location metadata may be used to determine the camera shake value.
Method 700 starts at step 701, where the face location metadata is obtained from the mass storage device 191. As explained above, the face location data contains data for a number of faces and for each, the face data includes a face ID, x and y coordinates of the face centre and the width and height of the face bounding box.
The processing moves to step 702, where one of the faces is selected. A list is created of all different face IDs in all of the frames in the video sequence. One of these faces is selected. The selected face should preferably appear in the majority of frames, and it should preferably have a large face bounding box when averaged over all the frames the face appears in.
The processing then moves to step 703. In this step, the x and y axis face location data are separately smoothed to separate out the low and high frequency components of the two signals. The method described in step 302 can be used for this. Again a value of n equivalent to 0.5 seconds should be used.
The processing then moves to step 704. In this step, for every frame in the input video, for both of the x and y axes, the original x and y values from the face location metadata are subtracted from the corresponding x and y smoothed values calculated in step 703, and the absolute value of each taken. These values are the high frequency variability of the x and y axes.
The process then moves to step 705, where the camera shake is calculated. This is similar to step 304, only this time the face location data is used. The x and y face location high frequency variability data calculated above are smoothed again, once again using a value of n equivalent to 0.5 seconds worth of frames. Then the camera shake is taken to be the larger of either the x or y smoothed high frequency variability values for each frame in the input video.
The processing then moves to step 706. In this step, for every frame in the input video, the absolute value of both the x and y face location data low frequency components determined above in step 703 is taken. These values are further smoothed, using a value of n equivalent to 2 seconds worth of frames. Then, for each frame in the input video, the larger of these smoothed x and y values is taken to be the low frequency variability of the frame.
The processing then moves to step 707, where the camera shake and low frequency variabilities are scaled to the bounding box to account for relative movement between the camera operator and the person whose face is being used to determine these values. The height of the bounding box is preferably used as a reference. The width may be unreliable because the angle of the face to the camera may not necessarily be known.
The face location data is scanned for every frame in the input video to find the frame in which the face bounding box height is the largest. This is chosen as the reference frame. Then, for every other frame in the input video, the camera shake and low frequency variability values for that frame are multiplied by the bounding box height of the reference frame divided by the bounding box height of the frame in question.
Once all frames have been scaled, the processing moves to the end point 708 and the method 700 is complete.
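Step 707 can be sketched as follows; the function assumes one bounding box height per frame for the selected face and uses the frame with the largest height as the reference.

def scale_to_reference_face(values, bbox_heights):
    # values: per-frame camera shake or low frequency variability
    # bbox_heights: per-frame face bounding box height of the selected face
    reference_height = max(bbox_heights)
    # Frames where the face is smaller (further away) are scaled up so that
    # the same physical shake produces a comparable value.
    return [v * reference_height / h for v, h in zip(values, bbox_heights)]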
All other steps for example 2, apart from step 201, are identical to those in example 1.
Example 3

A third example uses changes in the size of the face detection bounding box, along with the face detection position, to determine subject importance. It is common to shoot video when walking towards a subject with a detectable face (such as a person or a statue). As a subject approaches the camera, the apparent size of the subject's face will increase. This example looks for changes in the size of the face detection bounding box to determine relative movement. If the camera operator is detected as walking towards the subject for a video segment, then the segment is considered an important segment and should preferably be selected as a highlighted video section.
Method 800 describes a process of using face detection data along with gyro metadata to determine important video segments. This method will be described in reference to Fig. 8. The input to the method is the metadata for every frame in the input video sequence. The output of the method is a list of video frames which have been marked as important.
The method shares some steps with example 1, and the steps which are identical will not be described here.
This method begins at step 801. The method is implemented as program code stored in memory and executed by processing unit 150 described herein. This step determines the camera shake and low frequency variability, and is the same as step 201. Once this step is complete the process moves to step 802.
Step 802 detects when the camera operator is walking. This step is identical to step 202.
The process then moves to step 803. This step uses the face detection data to determine if there are sections of the input video where there is a face that is centred in the video frame and is constantly growing larger. How this step is performed is described below with reference to Fig. 9. That is, the processing unit determines whether the face is getting bigger during the video segment, and when the processing unit makes a positive determination, the processing unit then determines the marking of the video segment, for example marking the video segment as an important video segment via a flag in memory.
The current frame is set to be the first frame in the input video and the process then moves to the decision point 804, where a different path is taken depending on the test performed here. If, for the current frame, the camera operator is determined to be walking and a face has been determined to be centred and growing larger, then the process moves to step 805. Otherwise, the process moves to the decision point 806.
Step 805 marks the frame in the input video as important upon a positive determination that the camera operator moved according to a defined motion type, such as walking, running or the like, towards a subject. In one implementation, a flag can be stored in memory 170 in association with the corresponding video segment to indicate that the camera operator moves towards the subject according to a defined motion type. For example, the flag can be stored as a part of frame metadata for each frame of the video segment.
The flag can be further used to modify processing of the video segment. For example, if the segment is marked as important, that video segment can be selected to generate a video highlight section. Additionally, the flag may be used to adjust further processing of the video segment. For example, the video segment marked as important can be skipped during display of the video, rendered at a lower resolution or deleted if required.
The processing then moves to the decision point 806.
At decision point 806, if there are no unprocessed frames remaining, then the processing moves to the end point 807. Otherwise the current frame is set to be the next frame in the input video and the processing returns to the decision point 804.
Once the process has reached the end point 807, method 800 is complete. The output of this method is a list of frames that have been marked as important.
Figure 9 describes method 900, the process performed for step 803. The method is implemented as program code stored in memory and executed by the processing unit 150 described herein. This method takes the face detection metadata as input and determines in which frames of the input video, if any, there is a centred face whose bounding box is constantly growing.
The method starts at step 901, where the face location metadata is obtained from the mass storage device 191 and the face location data for all frames in the input video are scanned to compile a list of all faces detected in the video. Initially all of the faces are marked as unprocessed and all frames are marked as Subject Not Moving Towards Camera.
The process then moves to the decision point 902. If there are no unprocessed faces remaining, then the processing moves to the end point 908. Otherwise, the face data for the next unprocessed face is obtained. The data obtained is the x and y position of the centre of the face, and the height of the face bounding box. The processing then moves to step 903.
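The exact layout in which this per-face data is stored is not specified above; purely as an illustrative assumption, the sketches that follow treat it as a per-face track of per-frame samples, for example:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FaceSample:
    frame_index: int     # frame the detection belongs to
    centre_x: float      # x position of the face centre, in pixels
    centre_y: float      # y position of the face centre, in pixels
    bbox_height: float   # height of the face bounding box, in pixels

@dataclass
class FaceTrack:
    face_id: int
    samples: List[FaceSample]

# Example: a two-frame track for one detected face (values are made up).
track = FaceTrack(face_id=0, samples=[FaceSample(0, 640.0, 360.0, 100.0),
                                      FaceSample(1, 642.0, 358.0, 104.0)])
print(len(track.samples))  # 2
```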
In step 903, for every frame in the input video, the rate of change of the bounding box height is calculated and smoothed. To do this, for every frame, the value of the bounding box height in the previous frame is subtracted from the bounding box height in the current frame. The calculated values are then smoothed. The same smoothing method as used in step 302 could be used here, again using a value of n equivalent to 0.5 seconds worth of frames.
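Step 903 can be sketched in Python as below. The smoothing filter of step 302 is not reproduced in this section, so, as an assumption for illustration only, a simple moving average over n frames is used, with n corresponding to roughly 0.5 seconds of video.

```python
def smoothed_height_rate_of_change(bbox_heights, fps=30.0, window_seconds=0.5):
    """Per-frame rate of change of bounding box height, then smoothed (step 903)."""
    n = max(1, int(round(fps * window_seconds)))  # n ~ 0.5 seconds' worth of frames
    # Frame-to-frame difference; the first frame has no predecessor, so use 0.
    deltas = [0.0] + [bbox_heights[i] - bbox_heights[i - 1]
                      for i in range(1, len(bbox_heights))]
    # Assumed smoothing: moving average over the last n difference values.
    smoothed = []
    for i in range(len(deltas)):
        window = deltas[max(0, i - n + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Example with a short, hypothetical sequence of bounding box heights.
print(smoothed_height_rate_of_change([100, 102, 105, 103, 110, 118], fps=4))
```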
The process then moves to step 904, where segments of the input video are identified where, for the duration of the segment, the face in question is both approximately centred in the video frame and the bounding box is always growing. To do this, a sequential scan of every frame in the input is performed. A new segment is created when a frame is reached where the face location x value is within the middle 33% of the frame and the smoothed bounding box height rate of change value calculated in step 903 is positive. While this condition remains true, the segment is extended. When this condition is no longer true, the segment is closed and the end of the segment is the previous frame. The scan continues until the last frame is reached, creating a new segment whenever the condition becomes true again.
When the scan is complete, all segments are marked as unprocessed, and the processing moves to the decision point 905.
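The scan of step 904 can be sketched as follows, assuming the per-frame face centre x positions and the smoothed growth values from step 903 are available as lists; the frame width parameter (in pixels) is used to express the middle 33% of the frame. The function and variable names are illustrative only.

```python
def find_centred_growing_segments(face_centre_x, smoothed_growth, frame_width):
    """Return (start, end) frame ranges where the face is centred and growing (step 904)."""
    lo, hi = frame_width / 3.0, 2.0 * frame_width / 3.0  # middle 33% of the frame
    segments, start = [], None
    for i, (x, growth) in enumerate(zip(face_centre_x, smoothed_growth)):
        if lo <= x <= hi and growth > 0:
            if start is None:
                start = i                       # open a new segment
        elif start is not None:
            segments.append((start, i - 1))     # close at the previous frame
            start = None
    if start is not None:                       # condition still true at the last frame
        segments.append((start, len(face_centre_x) - 1))
    return segments

print(find_centred_growing_segments(
    face_centre_x=[100, 640, 650, 660, 640, 1200],
    smoothed_growth=[0.0, 1.2, 0.8, 0.5, -0.3, 0.4],
    frame_width=1280))  # -> [(1, 3)]
```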
In decision point 905, if there are no unprocessed segments then the processing returns to the decision point 902. Otherwise one of the unprocessed segments is chosen as the current segment and the processing moves to decision point 906.
In decision point 906, if the time difference between the start and end of the current segment is greater than 4 seconds, then the processing moves to step 907. Otherwise the processing returns to the decision point 905.
In step 907, all frames in the input video between the start and end of the current segment are marked as Subject Moving Towards Camera. The processing then returns to the decision point 905.
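Decision point 906 and step 907 could be sketched as below, assuming a known, fixed frame rate so that the 4 second minimum duration can be expressed as a frame count; the frame rate value is an assumption for illustration.

```python
def mark_subject_moving_towards_camera(num_frames, segments, fps=30.0,
                                       min_seconds=4.0):
    """Label frames inside sufficiently long segments (decision point 906 / step 907)."""
    labels = ["Subject Not Moving Towards Camera"] * num_frames
    for start, end in segments:
        # Decision point 906: only segments longer than the minimum duration count.
        if (end - start) / fps > min_seconds:
            for i in range(start, end + 1):
                labels[i] = "Subject Moving Towards Camera"  # step 907
    return labels

# Example: a 1 second segment is ignored, a ~5.3 second segment is marked.
labels = mark_subject_moving_towards_camera(300, [(10, 40), (60, 220)], fps=30.0)
print(labels[20], "|", labels[100])
```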
The marking comprises a flag indicating that a video segment captures a subject moving towards a camera. The flag can be used to modify processing of the video segment. The flag can be stored in memory 170 in association with the corresponding video segment. For example, a flag can be stored as metadata for each frame of the video segment.
Processing of the segment can be modified based on the stored flag. For example, if the video segment is marked as ‘Subject Moving Towards Camera’, a frame from the video segment can be marked as important. Alternatively, different image stabilization parameters can be used for segments marked as ‘Subject Moving Towards Camera’ as opposed to unmarked segments by making assumptions about camera motion characteristics based on the detected movement.
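One possible way, not mandated by the description above, of using the stored per-frame flags is to collapse them into frame ranges that downstream processing (for example highlight selection or different stabilization settings) can operate on:

```python
def flagged_ranges(flags):
    """Collapse a per-frame boolean flag list into (start, end) frame ranges."""
    ranges, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            ranges.append((start, i - 1))
            start = None
    if start is not None:
        ranges.append((start, len(flags) - 1))
    return ranges

# Example usage with hypothetical flags for a five-frame clip.
print(flagged_ranges([False, True, True, False, True]))  # -> [(1, 2), (4, 4)]
```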
When the processing reaches the end point 908, the method 900 is complete.
In each of the examples above, where the processing unit determines that the video segment is to be marked based on the defined motion type, the processing unit may then determine whether the video segment is to be selected, such as by way of a flag in memory, for a defined purpose, such as an indication that the segment is suitable for a highlight section of video.
Although the embodiments described above have been described with reference to a camera, it will be understood that the apparatus used to capture the video segments and/or process the video segments in accordance with the herein described embodiments may be any suitable electronic device in which processing resources are limited, such as, for example, a mobile phone, a portable media player or a digital camera. Nevertheless, the methods described may also be performed on higher-level devices such as desktop computers, server computers, and other such devices with significantly larger processing resources.
Further, it will be understood that a computer system would incorporate a memory that has data stored thereon as well as a computer program, and a processor coupled to the memory so that the processor can execute the computer program. The computer program would have instructions that enable the computer system to carry out the herein described methods. Also, a non-transitory computer readable medium may be used to store the program.

Claims (16)

1. A method of determining how a video segment is processed, the method comprising: receiving data points indicating a position of a face within frames of the video segment; determining a periodic position change of the face within the video segment based on the received data points; determining whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change; and, upon a positive determination that the operator moved according to the defined motion type, marking the video segment based on the defined motion type to modify processing of the video segment.
2. The method of claim 1, wherein the periodic position change is determined based on variability of a time period between local minima and maxima values that are associated with a change in the position of the face in the video segment on a defined axis.
3. The method of claim 1, wherein, upon the positive determination that the operator moved according to the defined motion type, a threshold value is determined, wherein the threshold value is compared against a camera shake value to determine the marking of the video segment.
4. The method of claim 3, wherein the camera shake value is determined based on metadata related to the video segment.
5. The method of claim 4, wherein the metadata is one or both of gyro metadata and face location metadata.
6. The method of claim 1 further comprising the step of determining whether the face is getting bigger during the video segment, and upon a positive determination determining the marking of the video segment.
7. The method of claim 1, wherein the step of determining the marking of the video segment based on the defined motion type comprises the step of determining whether the video segment is to be selected for a defined purpose.
8. Apparatus for determining how a video segment is processed, the apparatus being configured to: receive data points indicating a position of a face within frames of the video segment; determine a periodic position change of the face within the video segment based on the received data points; determine whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change; and, upon a positive determination that the operator moved according to the defined motion type, mark the video segment based on the defined motion type to modify processing of the video segment.
9. The apparatus of claim 8, wherein the periodic position change is determined based on variability of a time period between local minima and maxima values that are associated with a change in the position of the face in the video segment on a defined axis.
10. The apparatus of claim 8, wherein, upon the positive determination that the operator moved according to the defined motion type, a threshold value is determined, wherein the threshold value is compared against a camera shake value to determine the marking of the video segment.
11. The apparatus of claim 10, wherein the camera shake value is determined based on metadata related to the video segment.
12. The apparatus of claim 11, wherein the metadata is one or both of gyro metadata and face location metadata.
13. The apparatus of claim 8 further configured to determine whether the face is getting bigger during the video segment, and upon a positive determination determine the marking of the video segment.
14. The apparatus of claim 8, further configured, when determining the marking of the video segment based on the defined motion type, to determine whether the video segment is to be selected for a defined purpose.
15. A system for determining how a video segment is processed, the system comprising: a memory comprising data and a computer program; a processor coupled to the memory for executing the computer program comprising instructions for: receiving data points indicating a position of a face within frames of the video segment, determining a periodic position change of the face within the video segment based on the received data points, determining whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change, and, upon a positive determination that the operator moved according to the defined motion type, marking the video segment based on the defined motion type to modify processing of the video segment.
16. A non-transitory computer readable medium having a program stored on the medium for determining how a video segment is processed, the program comprising: code for receiving data points indicating a position of a face within frames of the video segment, code for determining a periodic position change of the face within the video segment based on the received data points, code for determining whether an operator of an apparatus that captured the video segment moved according to a defined motion type based on the determined periodic position change, and, upon a positive determination that the operator moved according to the defined motion type, code for marking the video segment based on the defined motion type to modify processing of the video segment.

Canon Kabushiki Kaisha
Patent Attorneys for the Applicant/Nominated Person
SPRUSON & FERGUSON
AU2016277643A 2016-12-21 2016-12-21 Using face detection metadata to select video segments Abandoned AU2016277643A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2016277643A AU2016277643A1 (en) 2016-12-21 2016-12-21 Using face detection metadata to select video segments

Publications (1)

Publication Number Publication Date
AU2016277643A1 true AU2016277643A1 (en) 2018-07-05

Family

ID=62748591

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2016277643A Abandoned AU2016277643A1 (en) 2016-12-21 2016-12-21 Using face detection metadata to select video segments

Country Status (1)

Country Link
AU (1) AU2016277643A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10885346B2 (en) 2017-10-10 2021-01-05 Canon Kabushiki Kaisha Method, system and apparatus for selecting frames of a video sequence
CN110866440A (en) * 2019-09-27 2020-03-06 惠州市德赛西威汽车电子股份有限公司 Pointer instrument deep learning network building method, detection method and system
CN110866440B (en) * 2019-09-27 2024-06-07 惠州市德赛西威汽车电子股份有限公司 Pointer instrument deep learning network construction method, pointer instrument deep learning network detection method and pointer instrument deep learning network construction system
CN114245229A (en) * 2022-01-29 2022-03-25 北京百度网讯科技有限公司 Short video production method, device, equipment and storage medium
CN114245229B (en) * 2022-01-29 2024-02-06 北京百度网讯科技有限公司 Short video production method, device, equipment and storage medium

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application