1 Introduction

In today's digital age, the reliability of digital videos as legal evidence is seriously compromised by the availability of advanced technologies that enable even non-expert users to produce manipulated content. This ease of manipulation of multimedia content has significantly contributed to the spread of misinformation and fake news, making it extremely difficult for common users to establish the authenticity of visual content [1, 2]. For decades, researchers in multimedia forensics have focused on developing algorithms designed to address multiple forensic tasks, including source characterization, integrity verification, and manipulation detection [3,4,5]. All of these methods operate on the fundamental assumption that any form of processing leaves distinct traces within the digital footprint of the media. Consequently, by examining these traces, it becomes feasible to gather insights into the processing history of a given content. Over the years, researchers have highlighted that, when analyzing digital images, such traces can be found not only in what is represented by the media (i.e., pixels), but also in how that data is encoded and stored. Indeed, coding- and compression-related information (quantization tables, discrete cosine transform coefficients, chroma sub-sampling parameters, ...), together with file format structure, has been exploited to obtain useful clues for multiple forensic tasks both by itself [6,7,8] and when combined with container-based information [9, 10]. Given the usefulness and versatility of this information, numerous extraction tools have been developed to enable forensic experts to design and update their algorithms [11,12,13]. Recently, this kind of information has also been studied for the forensic analysis of videos [14,15,16,17,18]. However, encoding-related information remains underutilized for this kind of media due to the difficulty of its extraction. Indeed, as videos involve a much larger amount of data than images [19], the codecs used to store this kind of content usually rely on intricate solutions that exploit spatial and temporal redundancy to achieve a high compression factor. This, in turn, leads to a complex encoded representation that cannot be easily extracted, interpreted, and used.

In this paper, we present CoFFEE, a tool for efficient extraction and analysis of codec-based information for videos in H.264/AVC format. The software we developed enables the extraction of the entire set of information related to how the data is stored, including the macroblock partitioning, the internal structure of each macroblock, motion vector data, and quantized residuals. CoFFEE is organized in an information extraction module and an analysis module. The first module is an upgrade of the tool proposed by Tourapis et al. [20], capable of saving the extracted information in a custom binary format designed to make the information storage efficient in terms of both time and space. The second module is a Python library that provides easy and fast access to all data contained in the previously extracted binary file for further analysis. The entire software is released freely as open-source to allow researchers and professionals to use it in any context and adapt it to their needs. To demonstrate the versatility of such an approach, we show how the extracted information can be used in a straightforward manner to address three different forensic scenarios:

  (i) Social network identification,

  (ii) Brand identification, and

  (iii) Double compression detection.

The paper is organized as follows. In Section 2, we discuss existing feature extraction tools and their limitations. In Section 3, we review the main aspects of H.264 encoding, highlighting the type of information that can be extracted from it. Subsequently, in Section 4, we describe the proposed tool and the two software modules. Then, in Section 5, we introduce three distinct experimental scenarios that utilize the extracted features, and in Section 6, we present and discuss the results obtained from these scenarios, along with an evaluation of the tool’s efficiency. Finally, in Section 7, we draw the final conclusions.

2 Related work

There are several video analysis tools on the market, such as StreamEye by Elecard [21], Authenticate by Amped Software [13], and VQ Analyzer by VicueSoft [22], that allow for a detailed codec analysis. However, they are primarily accessible through paid subscriptions, inevitably excluding a large number of researchers and professionals from accessing this kind of information. Although some software for analyzing codec-based information has been released as open source, most of it only allows for the visualization of the features of the video under examination, without the possibility of extracting and storing such data [23, 24], thus preventing the development of tools based on them. Regrettably, the few publicly accessible software tools capable of storing the extracted codec-based information are usually restricted in the kind of data they can extract [25]. The most comprehensive tools currently available for this purpose are two variants of the JM Reference Software for H.264/AVC, by Tourapis et al. [20] and by Blair et al. [26] (also known as Trestles). Even though both tools can be used to extract codec-based information, they differ in the kind of data that can be extracted and in the output format used. The tool developed by Tourapis et al. [20] has the capability to extract a complete set of encoding details. However, its usability is hindered by both the output format used to represent the extracted data and the considerable time required to generate this information. Indeed, the XML format used to save codec-based information is not particularly suitable for storing the vast amount of extracted data: the saving process is notably slow, and it results in excessively large files that require lengthy reading times for subsequent analyses. The tool by Blair et al. [26], on the other hand, is markedly faster; however, it only captures a limited subset of the available features rather than the full range present in the data stream.

3 H.264/AVC video coding standard

H.264/AVC [27] is a widely adopted video coding standard designed for an efficient representation of video data. In this standard, videos are encoded as sequences of coded pictures, referred to as frames. Each frame within an H.264 bitstream is partitioned into one or more slices, each of which is further subdivided into fixed-size macroblocks. Each macroblock's content is represented using the YCbCr color space, where the luminance component Y represents the brightness and the chrominance components Cb and Cr represent the blue and red channels, respectively. To reduce the amount of space required to store information, the H.264/AVC standard adopts a 4:2:0 sampling approach in which the chrominance components are saved at one fourth of the resolution of the luminance component. This strategy is informed by the fact that the human visual system is more sensitive to changes in brightness. Therefore, each macroblock consists of \(16 \times 16\) samples for the luminance component and \(8 \times 8\) samples for each of the chrominance components.

Fig. 1

Example of predictions in a Group of Pictures (GOP) structure. While P-frames can leverage only previously encoded frames, B-frames can refer both to the past and to the future

Each macroblock, however, is not represented by simply storing these samples directly. Indeed, given that the majority of a frame's content remains relatively consistent across consecutive frames, H.264 employs a technique that stores only the differences between macroblocks instead of the complete data as a way to save space. The decoder will then be able to reproduce the original content by combining a prediction, derived from the spatial or temporal neighborhood of the macroblock being examined, with a residual representing these differences. More specifically, frames are arranged in groups of pictures (GOPs) containing three different kinds of frames: I, P, and B. The first picture of each GOP is always encoded as an I-frame (intra-coded frame), in which macroblock predictions are produced only from their spatial neighborhoods. These frames can be independently decoded, and they serve as anchors for temporal predictions of other pictures. Subsequent pictures in a GOP can be encoded either as P-frames (predicted frames) or B-frames (bidirectionally predicted frames). Macroblocks belonging to either of those frame types can leverage temporal redundancy by using macroblocks from previous frames (P) or from both past and future frames (B) as the basis for their predictions. A pictorial representation of a GOP structure is reported in Fig. 1.

Fig. 2

The four Intra_16\(\times\)16 prediction modes. The entire 16\(\times\)16 block (in blue) is predicted by referring only to data inside the current frame, using one of the four available modes. Previously decoded blocks, which are used to predict the current one, are depicted in gray. The arrows specify the direction of prediction in each mode

Fig. 3

The nine Intra_4\(\times\)4 prediction modes. Each 4\(\times\)4 block (in lilac) is predicted from spatially neighboring samples that have been previously encoded (in gray), by using one of the nine prediction modes. The arrows indicate the direction of prediction in each mode

Fig. 4

Multiframe motion compensation in the H.264/AVC video coding standard. The green macroblock in the current frame is predicted from a single macroblock in the second frame preceding the current one, while the pink macroblock leverages two different macroblocks in two distinct previous frames. In addition to the motion vector, picture reference parameters (\(\Delta\)) are also transmitted

Fig. 5

P and B macroblock and sub-macroblock partitions. On the left, a 16\(\times\)16 macroblock; in the middle, the four blue blocks show the four possible ways in which the 16\(\times\)16 macroblock can be divided into smaller sub-macroblocks; on the right, the four pink blocks represent the four possible ways in which an 8\(\times\)8 sub-macroblock can be further divided

Intra-frame-predicted macroblocks, which can appear in all three kinds of frames, can either be predicted in a single pass (Intra16x16 prediction, depicted in Fig. 2) or be further subdivided into sixteen \(4 \times 4\) sub-blocks that are predicted separately (Intra4x4 prediction, depicted in Fig. 3). In both cases, the chroma components are predicted as a whole for the entire macroblock area, as color distribution is usually smoother over large areas. On the contrary, inter-frame-predicted macroblocks, which can only appear in P- and B-frames, capitalize on the temporal consistency across consecutive frames. These macroblocks are linked with one or more motion vectors pointing to macroblocks from different pictures where the content originated. This approach, called motion compensation, minimizes redundancy within video data, resulting in improved compression efficiency. H.264 supports multi-picture prediction, thus allowing the use of multiple frames as references for motion prediction, as illustrated in Fig. 4. Unlike intra-predicted macroblocks, inter-predicted macroblocks can be subdivided into numerous different configurations for prediction purposes, thus allowing finer control over the origin of the content (Fig. 5). Finally, all three kinds of frames can contain PCM macroblocks, for which no prediction step is performed.
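To make the intra prediction step more concrete, the following Python sketch reproduces the three simplest Intra_4\(\times\)4 predictors (vertical, horizontal, and DC). It is only a simplified illustration of the modes described above: the standard additionally defines six directional modes and special rules for unavailable neighbors, which are not shown here, and the function names are ours.

import numpy as np

def intra4x4_vertical(top):
    # Mode 0: every row copies the four reconstructed samples above the block.
    return np.tile(np.asarray(top), (4, 1))

def intra4x4_horizontal(left):
    # Mode 1: every column copies the four reconstructed samples left of the block.
    return np.tile(np.asarray(left).reshape(4, 1), (1, 4))

def intra4x4_dc(top, left):
    # Mode 2 (DC): every sample is the rounded mean of the eight neighboring samples.
    mean = (int(np.sum(top)) + int(np.sum(left)) + 4) >> 3
    return np.full((4, 4), mean)

top = [120, 122, 125, 130]   # reconstructed samples above the current block
left = [118, 119, 121, 124]  # reconstructed samples to its left
print(intra4x4_vertical(top))
print(intra4x4_dc(top, left))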

Additionally, each macroblock is associated with a prediction residual, which is the difference between the predicted and the original content. These residuals are encoded using a separable integer transform applied to each \(4 \times 4\) sub-block. This technique bears resemblance to a \(4 \times 4\) discrete cosine transform (DCT) but is more efficient in its application. As a result of this encoding, each residual is described by a matrix where low-frequency coefficients describe the main content of the scene while high-frequency ones represent the details. To further reduce the space required to store a video file, residual coefficients are quantized by choosing one of the 52 possible quantization parameters and then stored using either the CAVLC or CABAC entropy coding method.
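The core of this transform step can be sketched in a few lines of Python. The snippet below applies the \(4 \times 4\) forward integer transform used by H.264 to a residual block; the post-scaling that the standard folds into the quantization stage is omitted for brevity, and the function names are ours.

import numpy as np

# H.264/AVC 4x4 forward core transform matrix (normalization is folded into quantization).
C_F = np.array([[1,  1,  1,  1],
                [2,  1, -1, -2],
                [1, -1, -1,  1],
                [1, -2,  2, -1]])

def forward_core_transform(residual):
    # Separable integer transform Y = Cf * X * Cf^T applied to a 4x4 residual block.
    return C_F @ np.asarray(residual) @ C_F.T

residual = np.array([[5, 11,  8, 10],
                     [9,  8,  4, 12],
                     [1, 10, 11,  4],
                     [19, 6, 15,  7]])
# Coefficient (0, 0) is the low-frequency (DC) term; higher indices capture finer details.
print(forward_core_transform(residual))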

4 The CoFFEE framework

In this section, we introduce CoFFEE, a software solution specifically designed to extract codec features from H.264/AVC videos. CoFFEE is structured around two primary modules:

  (i) The CoFFEE roaster, which is responsible for efficiently extracting codec information from H.264 bitstreams;

  (ii) The CoFFEE grinder, which is designed to provide effortless access to the extracted features.

In the following, we will provide a detailed description of the two components of the proposed framework.

4.1 CoFFEE roaster: the optimized JM reference software

To aid the development and evaluation of the H.264/AVC codec, the Video Coding Experts Group (VCEG) and the Moving Picture Experts Group (MPEG) released an open-source framework for the H.264/AVC standard: the Joint Model (JM) Reference Software [20]. It serves as a benchmark for video coding research, as it provides a comprehensive set of tools and algorithms for video compression, including encoding and decoding functionalities, making it possible to evaluate and improve H.264 video compression performance. JM [28] is able to generate an XML-based trace file during the decoding process of an H.264 bitstream, which contains detailed information about the encoded video. However, this trace file poses significant challenges due to the substantial storage space it occupies (approximately 1.63 GB for a 1080p video lasting less than 4 s, i.e., 112 frames) and the extended decoding runtime required to produce it (around 6 min). This becomes especially problematic when longer videos need to be processed.

CoFFEE roaster optimizes the Joint Model reference software by storing the extracted features in a binary format, which allows us to reduce both its runtime and its storage requirements. Compared to the XML-based file, we discarded some redundant data and we extended the extracted features to include the quantization parameters of the chroma components. We modified the XML generation process by removing the tags and saving the corresponding values directly as bytes in a pre-defined order. To optimize the output binary file as much as possible, instead of storing the DCT coefficients of each macroblock individually, we adopted a variant of run-length encoding: non-zero values are stored as they are, while each run of zeros is stored as a zero followed by the number of consecutive zeros. Moreover, when all the coefficients of a macroblock are zero, we store only a single zero value to represent the entire matrix. As a result, the binary output file significantly reduces the occupied disk space as well as the runtime (details are reported in Section 6.1).
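The following Python sketch illustrates the run-length convention described above on a single coefficient vector. It is only an illustration of the scheme, not the actual roaster code (which lives inside the modified JM decoder), and the function names are ours.

def pack_coefficients(coeffs):
    # Non-zero values are stored as they are; each run of zeros is stored as a 0
    # followed by the run length. An all-zero block is stored as a single 0.
    if all(c == 0 for c in coeffs):
        return [0]
    packed, i = [], 0
    while i < len(coeffs):
        if coeffs[i] != 0:
            packed.append(coeffs[i])
            i += 1
        else:
            run = 0
            while i < len(coeffs) and coeffs[i] == 0:
                run += 1
                i += 1
            packed.extend([0, run])
    return packed

def unpack_coefficients(packed, length):
    # Inverse operation for a block whose original length is known.
    if packed == [0]:
        return [0] * length
    coeffs, i = [], 0
    while i < len(packed):
        if packed[i] == 0:
            coeffs.extend([0] * packed[i + 1])
            i += 2
        else:
            coeffs.append(packed[i])
            i += 1
    return coeffs

print(pack_coefficients([3, 0, 0, 0, -1, 0, 2]))  # [3, 0, 3, -1, 0, 1, 2]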

Fig. 6

Hierarchical structure of the output binary file. Feature attributes are represented by pins. Each macroblock can hold up to four motion vector structures and three DCT coefficient structures: one for the luma component and two for the chroma components

The features extracted from an H.264 video sequence present a hierarchical structure, as illustrated in Fig. 6 and mirrored by the data-structure sketch reported after the following list. More specifically:

  • For each picture, the file specifies a unique integer identifier (PID), a value indicating the display order of the access units in the encoded bitstream (PoC), the GOP to which the picture belongs, and its subpictures;

  • For each subpicture, it indicates the subpicture structure (i.e., whether the subpicture contains frame or field data) and the slices contained in it;

  • For each slice, it provides the numerical identifier, the slice type (I, P, B, SI, or SP, where the latter two are particular switching slices), and the macroblocks;

  • For each macroblock (MB), it indicates the identifier of the macroblock with respect to the total number of macroblocks in the frame, the type of macroblock, the X and Y coordinates (indicating the position in pixels of the macroblock within the frame), the quantization parameters for the luma component (QP_Y) and for the chroma ones (QP_U, QP_V), the motion vectors (MV), if present, and the DCT coefficients;

  • For each motion vector, it provides the prediction list, i.e., list 0 or list 1 (lists of reference frames for motion compensation), the index within the list to be used for prediction, the reference frame id, the horizontal and vertical motion vector component differences (Diff_X and Diff_Y), and the horizontal and vertical motion vector components (Abs_X and Abs_Y);

  • For each DCT coefficients element, it specifies the component (luma, chroma blue, or chroma red), the vector of DCT values (Values), and the length of this vector (NumValues).
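For reference, the hierarchy above can be mirrored by a small set of data structures. The Python sketch below is purely illustrative: the field names follow the list above, but the actual classes exposed by the CoFFEE grinder may differ.

from dataclasses import dataclass, field
from typing import List

@dataclass
class MotionVector:
    list_prediction: int   # list 0 or list 1
    list_index: int        # index within the list used for prediction
    ref_frame_id: int      # reference frame identifier
    diff_x: int            # motion vector component differences
    diff_y: int
    abs_x: int             # absolute motion vector components
    abs_y: int

@dataclass
class DCTCoefficients:
    plane: int             # luma, chroma blue, or chroma red
    num_values: int        # length of the packed vector of values
    values: List[int]

@dataclass
class Macroblock:
    mb_id: int             # identifier within the frame
    mb_type: int
    x: int                 # position in pixels within the frame
    y: int
    qp_y: int              # quantization parameters
    qp_u: int
    qp_v: int
    motion_vectors: List[MotionVector] = field(default_factory=list)       # up to four
    dct_coefficients: List[DCTCoefficients] = field(default_factory=list)  # up to three

@dataclass
class Slice:
    slice_id: int
    slice_type: int        # I, P, B, SI, or SP
    macroblocks: List[Macroblock] = field(default_factory=list)

@dataclass
class SubPicture:
    structure: int         # frame, top field, or bottom field
    slices: List[Slice] = field(default_factory=list)

@dataclass
class Picture:
    pid: int               # unique picture identifier
    poc: int               # display order of the access units
    gop: int               # GOP the picture belongs to
    subpictures: List[SubPicture] = field(default_factory=list)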

Blair et al. [26] introduced Trestles, an open-source tool built on JM for optimizing feature extraction in terms of both runtime and disk space utilization. However, it only captures a limited subset of features compared to CoFFEE roaster. Trestles, in particular, does not provide access to essential forensic features such as QP values, type B macroblocks, and intra prediction modes. Moreover, it generates six separate CSV files containing the extracted features at the expense of high memory usage (see Section 6.1).

4.2 CoFFEE grinder: the H.264/AVC library

This library allows easy access to the H.264 video features extracted by the CoFFEE roaster module, bridging the gap between the optimized binary file and the information users might require about a video sequence. However, for longer videos, the binary files generated by the roaster module remain relatively large despite the optimization efforts. To tackle this challenge, we implemented the library with Cython [29], which allows source code to be translated into optimized C/C++ and compiled as Python extension modules. This combination provides a faster runtime together with ease of programming.

The CoFFEE grinder module provides two key methods: read_picture and read_video. The former handles the task of correctly reading a single picture from a video sequence. This entails processing all subpictures, slices, and macroblocks within each frame, along with their associated attributes. The latter is designed to automatically read all frames in the video sequence by repeatedly invoking the read_picture method, as shown in Listing 1.


Listing 1 Example of CoFFEE grinder library usage to get all the pictures from an H.264 video stream
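A minimal sketch of this usage pattern is given below. The import path and reader constructor are illustrative assumptions; read_video and read_picture are the methods described above.

# Illustrative sketch: module path and constructor name are assumptions;
# read_video()/read_picture() are the methods described in the text.
from coffee_grinder import CoffeeReader  # hypothetical import path

reader = CoffeeReader("video_features.bin")  # binary file produced by the CoFFEE roaster

# Read every picture of the sequence in a single call ...
pictures = reader.read_video()

# ... or iterate picture by picture.
picture = reader.read_picture()
while picture is not None:
    # each picture exposes its subpictures, slices, and macroblocks
    picture = reader.read_picture()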

Given an H.264/AVC video, CoFFEE grinder allows easy access to its encoding features. For instance, if a user wants to extract the DCT coefficients of the third macroblock of the fifth frame, they only need to navigate to the picture of interest and then access the desired features following the hierarchical structure illustrated in Fig. 6. Listing 2 demonstrates how to accomplish this using the implemented library.


Listing 2 Example of CoFFEE grinder library usage to get the DCT coefficients of the third macroblock within the fifth frame of the video
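A sketch of the corresponding access pattern is shown below; attribute names such as subpictures, slices, and macroblocks mirror the hierarchy of Fig. 6 but, like the import path, are illustrative assumptions.

# Illustrative sketch: attribute names follow the hierarchy of Fig. 6 but may
# differ from the actual library; indices are zero-based.
from coffee_grinder import CoffeeReader  # hypothetical import path

reader = CoffeeReader("video_features.bin")
pictures = reader.read_video()

fifth_frame = pictures[4]                                        # fifth frame
third_mb = fifth_frame.subpictures[0].slices[0].macroblocks[2]   # third macroblock

for dct in third_mb.dct_coefficients:  # luma, chroma blue, chroma red
    print(dct.plane, dct.num_values, dct.values)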

We also equipped the library with an additional method, unpack_dct_coefficients, which reorganizes the DCT coefficients into a 16\(\times\)16 matrix for the luma component and two separate 8\(\times\)8 matrices for the chroma components, facilitating a better understanding of their values. Similarly, other features in the library lack direct interpretability, appearing as numerical values without inherent meaning. We aid the interpretation of such data by providing a set of additional methods that help the end user understand feature values. More specifically, the following methods are provided (a short usage sketch follows the list):

unpack_dct_coefficients()

Input: the integer referring to the DCT plane and the vector of packed DCT values.

Output: the matrix of unpacked DCT values: a 16\(\times\)16 matrix for the luma component and two 8\(\times\)8 matrices for the chroma components.

get_subpicture_structure_type()

Input: the integer referring to the subpicture structure.

Output: the string “Frame” if the coded subpicture represents an entire frame, “TopField” if it represents the top field of the frame (i.e., the even-numbered rows 0, 2, ..., H-2, where H is the number of rows in the frame), or “BottomField” if it represents the bottom field of the frame (i.e., the odd-numbered rows).

get_slice_type()

Input: the integer referring to the slice type.

Output: a string indicating whether the slice is of type “I,” “P,” or “B.”

get_macroblock_type()

Input: the integer (or the string) referring to the slice type, and the integer referring to the macroblock type.

Output: a tuple whose first element specifies the type of the macroblock and whose second element, if present, specifies the prediction mode of the macroblock. For instance, calling get_macroblock_type(‘SliceI’, 1) returns the tuple (“i_16x16”, “0_0_0”). The first element indicates that the macroblock in question has been predicted as an Intra_16\(\times\)16; the second element indicates that the macroblock has been predicted with the vertical prediction described as mode 0, that all chroma transform coefficient levels are equal to zero, and that all transform coefficient levels of the four 4\(\times\)4 luma blocks in the 8\(\times\)8 luma block are equal to zero. For more details regarding the meaning of these values, please refer to Section 7.4.5 of the ITU-T H.264 specification [30].

get_dct_plane()

Input: the integer referring to the DCT plane.

Output: a string specifying whether the component of the considered macroblock is the “luma,” “chroma blue,” or “chroma red” one.
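The short sketch below shows how these helpers might be combined when inspecting a macroblock. The helper names are those listed above, while the reader object and the attributes it exposes are, as in the previous listings, illustrative assumptions.

# Illustrative sketch combining the helper methods listed above.
from coffee_grinder import (CoffeeReader, get_slice_type, get_macroblock_type,
                            get_dct_plane, unpack_dct_coefficients)  # hypothetical import path

reader = CoffeeReader("video_features.bin")
picture = reader.read_picture()
current_slice = picture.subpictures[0].slices[0]
current_mb = current_slice.macroblocks[0]

slice_type = get_slice_type(current_slice.slice_type)                        # e.g., "I"
mb_type = get_macroblock_type(current_slice.slice_type, current_mb.mb_type)  # e.g., ("i_16x16", "0_0_0")

for dct in current_mb.dct_coefficients:
    plane = get_dct_plane(dct.plane)                         # "luma", "chroma blue", or "chroma red"
    matrix = unpack_dct_coefficients(dct.plane, dct.values)  # 16x16 luma or 8x8 chroma matrix
    print(plane, mb_type, matrix)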

5 Experimental setting

To assess the usefulness of the extracted information for forensic purposes, we conducted an experimental campaign involving multiple forensic scenarios. To do so, we devised a pipeline in which, as a first step, we extracted the binary file containing the information related to the H.264 bitstream using the CoFFEE roaster. Then, the CoFFEE grinder was used to organize the collected data into feature vectors, which were subsequently used to train three random forest classifiers addressing three forensic scenarios: social network identification, device brand identification, and double compression detection. These scenarios were selected to showcase CoFFEE's flexibility as a tool for different forensic needs. The whole experimental pipeline is depicted in Fig. 7.

In the following subsections, we describe how the extracted information was organized in feature vectors, and we introduce the three forensic scenarios analyzed in our experiments.

5.1 Features organization

By leveraging CoFFEE, we can collect a set of features from an H.264 bitstream that can be efficiently used for many different forensic applications. For each video, we extract a feature vector which is subsequently used to train a classifier. The feature vector includes the AC DCT coefficients for both the luma and chroma components, statistics on macroblock types, and the quantization parameters for both the luma and chroma components.

Fig. 7

Pipeline of the procedure followed to evaluate the relevance of feature analysis in forensic scenarios

More specifically, the AC DCT coefficients are represented as a 2000 \(\times\) 18 matrix. The coefficients are differentiated according to the type of macroblock transform, i.e., the first 9 columns encapsulate the 9 AC values of 8\(\times\)8 macroblocks, while the following 9 columns contain the AC values of both 16\(\times\)16 and 4\(\times\)4 macroblocks. Each matrix row corresponds to a specific AC value within the range [− 1000, 1000], excluding zero. To allow accurate comparisons between histograms, regardless of their original scales or sizes, data normalization is applied so that all bins sum to one. In the end, each video is characterized by 36000 features for each of the luma and chroma components.
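The following numpy sketch shows how one such normalized histogram can be built from a stream of AC coefficients. Variable names and the exact bin layout are illustrative; the actual feature extraction code may differ in detail.

import numpy as np

def ac_histogram(ac_values, lo=-1000, hi=1000):
    # Normalized histogram of AC coefficients over [lo, hi], excluding zero:
    # 2000 bins whose entries sum to one.
    bins = [v for v in range(lo, hi + 1) if v != 0]
    index = {v: i for i, v in enumerate(bins)}
    hist = np.zeros(len(bins))
    for v in ac_values:
        if v != 0 and lo <= v <= hi:
            hist[index[v]] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

# Toy example: AC values collected from, e.g., all 4x4-transformed macroblocks of a video.
print(ac_histogram([3, -1, -1, 2, 7, -1]).sum())  # 1.0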

Regarding the macroblock type statistics, we represented them as a 3\(\times\)7 matrix: each row stands for a frame type (I, P, and B), while each column stands for a distinct macroblock type (I, P, B, SI, SKIP, PCM, and Others (Footnote 1)). Each element of the matrix thus contains the number of macroblocks of a specific category contained in frames of a specific type. Each row has been divided by the number of frames of the corresponding type within the video, as each video may have a different resolution and number of frames.

Furthermore, we considered the histograms of the quantization parameters, which are represented as a 2\(\times\)52 matrix. In this matrix, the luma and chroma component (Footnote 2) histograms are located in the rows, while the columns correspond to the 52 distinct possible values of the quantization parameter (ranging from 0 to 51).
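A compact sketch of these last two feature groups is given below, under the same illustrative assumptions about how the per-video counts are gathered.

import numpy as np

FRAME_TYPES = ["I", "P", "B"]
MB_TYPES = ["I", "P", "B", "SI", "SKIP", "PCM", "Other"]

def macroblock_statistics(mb_counts, frame_counts):
    # 3x7 matrix of macroblock-type counts per frame type, with each row divided
    # by the number of frames of that type.
    stats = np.zeros((len(FRAME_TYPES), len(MB_TYPES)))
    for i, ft in enumerate(FRAME_TYPES):
        for j, mt in enumerate(MB_TYPES):
            stats[i, j] = mb_counts.get((ft, mt), 0)
        if frame_counts.get(ft, 0) > 0:
            stats[i] /= frame_counts[ft]
    return stats

def qp_histograms(qp_luma, qp_chroma):
    # 2x52 matrix of quantization parameter histograms (QP in 0..51):
    # first row luma, second row chroma.
    hists = np.zeros((2, 52))
    for row, qps in enumerate((qp_luma, qp_chroma)):
        for qp in qps:
            hists[row, qp] += 1
    return hists

print(macroblock_statistics({("I", "I"): 40, ("P", "SKIP"): 120}, {"I": 2, "P": 10}))
print(qp_histograms([26, 26, 28], [30, 30]).shape)  # (2, 52)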

5.2 Social network identification

Identifying the social media platform to which a video under investigation was uploaded is a relevant task in a forensic scenario, as it allows the processing history of the video to be partially reconstructed. Some techniques rely on CNNs working at the frame level, as in [31], where a CNN aims to detect the double compression traces left by social networks by separately processing I-frames and P-frames, and in [32], where two different solutions based on transfer learning and multitask learning are presented. Other methods, such as those proposed in [17, 18], leverage container-based data to attribute videos to specific social media platforms. Meanwhile, more recent approaches like [33,34,35] underscore the importance of leveraging codec-based features of digital videos to recognize the originating social media platform.

Here, we aim to recognize the originating social media platform of a video by leveraging the set of features extracted with CoFFEE. The evaluation is conducted on the same dataset used in [33], which was designed for the PREMIER project (Footnote 3). The considered collections are the following:

  • PREMIER-N1: is a native data collection that includes 26 videos of flat, indoor, and outdoor scenery and 987 flat and natural images, originating from 8 different smartphones;

  • PREMIER-N2: is a native data collection that includes 352 flat and natural images and 58 videos of flat, indoor, and outdoor scenery with and without movement, with some additional videos used to evaluate the H.265/HEVC codec and other non-default resolutions. These collections are produced by 5 different smartphones;

  • PREMIER-A3: is an altered data collection that includes 400 videos shared through four social networks and a messaging application (Facebook, Instagram, Telegram, Twitter, and YouTube). The original videos, belonging to 20 devices from the Video-ACID [36] and NYUAD-MMD [37] datasets, were uploaded to and downloaded from each platform.

The dataset that we built for this task consists of 392 videos belonging to 28 different smartphones. In particular, we have 136 camera-native videos collected from the PREMIER-N1, PREMIER-N2, and Video-ACID datasets and 256 videos shared through social networks (Facebook, Instagram, YouTube, and Twitter) collected from the PREMIER-A3 dataset.

5.3 Device brand identification

The device brand identification task involves determining the origin of a bitstream, specifically identifying the brand of the device used, i.e., whether the video under investigation has been captured by an Apple device, a Samsung device, and so on. In this context, for instance, Iuliani et al. [14] introduced a method that leverages the video file container to identify the brand of the device that acquired the video under analysis.

In this study, we employ the CoFFEE features to identify the brand of the device that captured a video. The evaluation is based on the recently released forensic benchmark dataset FloreView [38], which consists of 9k media samples collected from 45 modern handheld devices of 11 major brands. Of the total number of available brands, we consider only those represented by more than one device, resulting in a subset of the original dataset that includes 7 different brands (Apple, Google, Huawei, LG, Motorola, Samsung, Xiaomi) with 1582 videos from 39 smartphones/tablets.

5.4 Double compression detection

Since videos are typically stored in compressed formats, a manipulation process typically involves three primary steps: decoding the video stream, making alterations for specific purposes, and then re-compressing the video. Thus, detecting double compression is a valid way to reveal the presence of a manipulation in a video or to prove that the video's integrity has been compromised. Previous research has focused on examining macroblock statistics within each encoded frame [39, 40].

Our dataset for double compression detection comprises the 136 camera-native videos from the social network identification scenario. Double compressed videos are generated by means of a constant rate factor (CRF) encoding mode, which allows different frames to be compressed to varying degrees by adjusting the quantization parameter as needed to maintain a desired level of perceived quality. We consider two CRF values to produce high-quality (CRF 10) and low-quality (CRF 40) double compressed videos.
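As an example, double compressed copies at the two quality levels can be generated with a standard H.264 encoder; the snippet below uses ffmpeg with libx264, which is only one possible choice, as the exact encoder invocation is not specified here.

import subprocess

# Re-encode a native video at CRF 10 (high quality) and CRF 40 (low quality)
# to obtain the double compressed versions. The use of ffmpeg/libx264 here is
# an illustrative assumption.
for crf in (10, 40):
    subprocess.run(
        ["ffmpeg", "-y", "-i", "native_video.mp4",
         "-c:v", "libx264", "-crf", str(crf),
         f"double_compressed_crf{crf}.mp4"],
        check=True,
    )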

6 Results and discussion

In this section, we present a detailed analysis of the performance and effectiveness of CoFFEE across various forensic scenarios. First, we compare CoFFEE with other existing tools, such as JM [28] and Trestles [26], demonstrating significant improvements in both runtime efficiency and memory usage. Subsequently, we analyze the usage of CoFFEE-extracted features in different forensic contexts, including social network identification, device brand identification, and double compression detection. Through rigorous experiments and comparisons, we highlight the strengths and potential areas for further development of our approach.

6.1 CoFFEE roaster efficiency

First, we evaluated the performance of the proposed tool in terms of both runtime efficiency and memory storage requirements. Tables 1 and 2 report an example of feature extraction with JM [28], Trestles [26], and CoFFEE on the dataset built for the social network identification scenario. Specifically, we consider the video with the overall shortest length (M17), the longest video shared on social media platforms (M16), and the video with the longest overall length (D38), as they differ in video duration, number of frames, and resolution.

As shown in Table 1, CoFFEE significantly reduces the processing time across all video types and resolutions, with reductions ranging from 75 to 99% when compared to Trestles and JM, respectively. The largest time reductions are observed with the highest resolution videos, such as M17 and D38, where CoFFEE achieves up to a 98.7% reduction compared to JM and up to 83.6% compared to Trestles. Moreover, CoFFEE shows consistent time savings across different resolutions and social media versions of the same video (e.g., M16).

Similar to its impact on runtime efficiency, CoFFEE also achieves substantial reductions in memory storage requirements across all videos, as shown in Table 2. For example, CoFFEE reduces the file size by up to 99.9% compared to JM and by up to 49.5% compared to Trestles for high-resolution videos. Lower resolution and social media-formatted videos (e.g., Twitter and YouTube formats of M16) also show consistent storage efficiency, with reductions of up to 97.0% from JM.

Finally, when compared to the more recent Trestles, CoFFEE can save up to \(83\%\) in runtime and \(49\%\) in memory storage, while simultaneously extracting a larger amount of information (Footnote 4). This comparison underscores CoFFEE's superior memory efficiency, making it an ideal tool for handling large datasets in video forensic tasks, particularly when storage resources are limited.

The results indicate that CoFFEE provides a substantial reduction in processing times, making it well-suited for scenarios that require handling large volumes of data. Specifically, processing a video takes approximately twice its playback time, which, although not achieving real-time feature extraction, comes remarkably close.

Table 1 JM, Trestles, and CoFFEE time efficiency comparison. Video resolution is expressed in pixels. The Reduction column shows the percentage decrease of CoFFEE with respect to JM and Trestles, respectively. Videos are sampled from the dataset built for the social network identification task. Specifically, M17 corresponds to video M17_DA_E0006.mp4; M16 corresponds to video M16_DA_E0008.mp4; D38 corresponds to video D38_V_indoor_panrot_0001.mp4
Table 2 JM, Trestles, and CoFFEE memory storage efficiency comparison. Video resolution is expressed in pixels. The file sizes of the JM (XML), Trestles (CSV), and CoFFEE roaster outputs are expressed in bytes. The Reduction column shows the percentage decrease of CoFFEE with respect to JM and Trestles, respectively. Videos are sampled from the dataset built for the social network identification task. Specifically, M17 corresponds to video M17_DA_E0006.mp4, M16 corresponds to video M16_DA_E0008.mp4, and D38 corresponds to video D38_V_indoor_panrot_0001.mp4

6.2 Forensics scenario effectiveness

For each scenario, we begin by extracting the video features using CoFFEE. We then employ a random forest classifier, with the optimal parameters selected via grid search over [100, 200, 300, 500] estimators. To ensure a balanced distribution of videos across folds, we applied 10-fold stratified cross-validation. In the social network identification and double compression detection scenarios, the models that perform best utilize the top 100 features, identified through feature pruning. Finally, all experiments are repeated 10 times to obtain a more reliable estimate of the classification effectiveness.
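A minimal scikit-learn sketch of this evaluation protocol is reported below. The feature matrix X and the labels y are assumed to have been built from the CoFFEE features as described in Section 5.1 (random data is used here only to keep the sketch self-contained), and the feature pruning step is not reproduced.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Placeholder data standing in for per-video CoFFEE feature vectors and labels.
rng = np.random.default_rng(0)
X = rng.random((200, 100))         # e.g., top 100 features after pruning
y = rng.integers(0, 5, size=200)   # e.g., 5 classes for social network identification

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Grid search over the number of estimators, as described in the text.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200, 300, 500]},
    scoring="balanced_accuracy",
    cv=cv,
)
grid.fit(X, y)

# 10-fold stratified cross-validation with the selected model.
scores = cross_val_score(grid.best_estimator_, X, y, scoring="balanced_accuracy", cv=cv)
print(grid.best_params_, scores.mean())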

We assessed the performance of the proposed framework in the three forensic scenarios mentioned earlier in this section. For social network identification, we conducted an ablation study to evaluate the performance of each feature individually and in combination, as described in Table 3. Our analysis shows that these features possess significant discriminative power individually, with Macroblock types achieving a balanced accuracy of 93%. However, when we use all the features together, we achieve the highest accuracy of 97%, significantly surpassing the random guess rate of 20%.

In the device brand identification scenario, feature performance differs notably from that observed in the previous task. As reported in Table 4, Macroblock types are the least effective feature on their own, with an accuracy of 58%, whereas DCT coefficients consistently perform well, achieving 87% accuracy. Moreover, our findings show that combining DCT, QP, and MB features is crucial for optimal performance, resulting in a balanced accuracy of 93%, well above the random guess rate of 14%.

Finally, we evaluated the impact of CoFFEE-extracted features in the double compression detection scenario, with results reported in Table 5. In this context, all features demonstrate high accuracy levels for both single and double compressed videos. Our experiments reveal no significant differences in performance when detecting double compressed videos, regardless of high- or low-quality encoding.

We did not compare our results with those obtained using the existing tools proposed by Tourapis et al. [20] and Blair et al. [26]. However, it is important to note that these tools only extract a strict subset of the features available in CoFFEE. As a result, classifiers built with these tools cannot achieve performance levels higher than ours.

The experimental results have yielded promising outcomes, suggesting that incorporating additional features could be particularly beneficial for device brand identification. Moreover, it is evident that no single set of features is best for all scenarios; instead, different features are more effective for different tasks. Furthermore, while the current approach is tailored to a specific codec, it would be valuable to adopt the same information extraction strategy for other codecs as well (e.g., AV1 or H.265/HEVC). This could enhance the robustness and applicability of the forensic analysis across a broader range of video formats.

Table 3 Ablation study for the social network identification task: achieved performance in terms of balanced accuracy by using the features extracted, i.e., luma coefficients and chroma coefficients (DCT), quantization parameters (QPs), and macroblock type statistics (MB)
Table 4 Ablation study for device brand identification task: achieved performance in terms of balanced accuracy by using the features extracted, i.e., luma coefficients and chroma coefficients (DCT), quantization parameters (QPs), and macroblock type statistics (MB)
Table 5 Ablation study for double compression detection task: achieved performance in terms of balanced accuracy by using the features extracted, i.e., luma coefficients and chroma coefficients (DCT), quantization parameters (QPs), and macroblock type statistics (MB)

7 Conclusions

In this paper, we presented CoFFEE, a new open-source tool that efficiently extracts and evaluates video compression information (including macroblock structure, prediction residuals, and motion vectors) from H.264 encoded files. Compared to available state-of-the-art software, our tool provides a time-efficient extraction of information from the bitstream while requiring minimal disk space resources. Moreover, to allow its use in the widest possible range of applications, our tool extracts the entire set of available information from within the H.264 bitstream. Additionally, the library offers user-friendly methods that enable users to easily extract codec-based information within the stream. We provided a detailed description of the developed software, which is released free of charge to enable its use by the research community for the development of new video forensic tools. Furthermore, we demonstrated how the features extracted by CoFFEE can be effectively employed in important forensic tasks, including social network identification, camera brand classification, and double compression detection. In the future, it would be interesting to expand this work by applying CoFFEE to a broader range of scenarios, leveraging codec-based features for modern forensic challenges such as identifying synthetically generated videos or detecting AI-based manipulations. A further direction for future research could be to extend our approach to examine the features of more advanced codecs, such as H.265/HEVC or AV1, which are gaining increasing adoption across various fields.