Introduction

In recent years, there has been growing enthusiasm for creating computer vision applications and services that enhance the lives of citizens. The swift advancement of deep learning (DL) techniques, combined with the widespread presence of video surveillance cameras in modern cities, has given rise to intelligent applications designed for various purposes. These encompass face recognition [1, 2], crowd counting [3, 4], intelligent parking systems [5, 6], and pedestrian detection and tracking [7, 8], among others. Such smart applications are now widely deployed worldwide, becoming pivotal in effectively managing public spaces and deterring criminal activities, and they are gradually replacing human supervision in monitoring tasks.

However, state-of-the-art performance of DL algorithms is usually achieved through supervised learning, which relies on two key assumptions [9]. Firstly, it assumes the availability of extensive labeled datasets, which are crucial for accurately training the models. Secondly, it assumes that the training (or source) and test (or target) datasets are i.i.d., i.e., independent and identically distributed. While abundant annotated data may be available for certain predefined domains, such as ImageNet [10] for image classification or COCO [11] for object detection, obtaining manual annotations for every specific target domain or task is often impractical and costly. Consequently, models are often applied to target domains not encountered in the existing labeled training data and exhibit a drop in performance at inference time due to domain shift, i.e., the gap between source and target data distributions [12].

Unsupervised domain adaptation (UDA) provides one possible solution to tackle this issue. Its primary goal is to mitigate domain gaps by leveraging labeled data from the source domain and unlabeled data from the target domain. In essence, UDA techniques use annotated data from the source domain, along with non-annotated data from the target domain, which can be easily collected without requiring human effort for labeling. The key challenge here lies in automatically extracting knowledge from this latter data stream to narrow the gap between the two domains. Specifically, the objective is to learn feature representations that are (i) discriminative for the primary learning task in the source domain and (ii) robust to the domain shift.

This paper focuses on detecting violent actions within trimmed videos, where the goal is to distinguish between violent and non-violent behaviors in clips that capture a single, precise action. Essentially, this task is a subcategory of human action recognition, with the objective of classifying video clips into binary categories encoding the presence or absence of violent actions. Despite its significance in real-world scenarios involving the warning/prevention of criminal activities and public space management, this specific task has received relatively limited exploration compared to other action recognition tasks. Although some annotated datasets exist for general video violence detection, they suffer from shortcomings in terms of size and scenario diversity. Consequently, existing deep learning-based solutions trained on these datasets often exhibit decreased performance when applied to specific contexts, such as violence detection in public transport [13]. To tackle these limitations, we introduce an end-to-end deep learning-based UDA approach designed to improve performance in target scenarios where annotated data is scarce or absent. Our baseline takes inspiration from [14], a video violence detection technique that performs single-image classification by randomly selecting frames from the video. We improve this straightforward approach by exploiting a multiple instance learning (MIL) technique that instead considers the frame with the best classification score. Then, upon this baseline, we integrate a set of UDA techniques into the training process to automatically acquire knowledge from unlabeled data that pertains to the target domain. To the best of our knowledge, this is the first attempt at using a UDA scheme for video violence detection. In our experiments, as the source domain, we utilize several annotated datasets for video violence detection in broad contexts. As the target domain, we exploit video clips from specific scenarios: the Hockey Fight dataset [15], containing violent/non-violent actions from hockey matches of the National Hockey League (NHL), and the recently introduced Bus Violence benchmark, focusing on identifying violent behaviors occurring within a moving bus [13]. Figure 1 illustrates this experimental scenario. The outcomes indicate that our UDA pipeline substantially improves performance for the examined models, suggesting that these models exhibit enhanced generalization capabilities when adapting to the new scenario, without requiring any new labels.

Fig. 1

The considered scenario. Our proposal introduces an unsupervised domain adaptation approach for video violence detection, aiming to bridge the domain gap that separates the source domain (depicted on the left) from the target domain (shown on the right). The source domain encompasses three datasets containing annotated videos portraying both violent and non-violent scenarios in broad contexts. In contrast, the target domain comprises two sets of unlabeled clips capturing instances of violent and non-violent actions within different and very specific scenarios, e.g., hockey matches and public transport

This research extends our previous work [16] by modifying the baseline architecture with a MIL approach, considering additional state-of-the-art models for performance comparison, and experimenting with an additional target dataset. The main contributions are summarized below:

  • we present a novel UDA scheme for video violence detection to effectively reduce the domain gap between a labeled source dataset and an unlabeled target dataset;

  • we perform an empirical assessment, using general violent/non-violent videos as the source domain and clips designed for detecting violent behaviors in specific scenarios (such as hockey matches and public transport) as the target domain;

  • the outcomes reveal that our UDA approach enhances the performance of the examined models, enabling them to achieve improved generalization in situations where labels are unavailable, accommodating novel scenarios.

The paper’s structure is as follows: In Sect. “Related Works”, we review related works. Section “Methodology” outlines our proposed methodology. Section “Experimental Analysis” presents the results from our experimental evaluation. Finally, Sect. “Conclusion” provides the paper’s conclusion.

Related Works

Numerous methods and datasets are explicitly designed for video violence detection within the existing literature. Many of these approaches are tailored to analyze trimmed clips [14, 17,18,19,20,21,22,23,24], which capture precise actions, whether violent or non-violent. Consequently, this task falls under the umbrella of action recognition, where the goal is to classify videos, predicting the presence or absence of violent human behaviors. However, a few studies also delve into the realm of untrimmed videos [25,26,27]. In this case, the objective expands beyond action recognition to include action localization, which entails identifying the temporal boundaries of the actions. This distinction is also reflected in the datasets used for model training: trimmed video datasets typically provide annotations at the video level, while untrimmed video datasets require frame-level annotations. Our effort focuses explicitly on video violence detection in trimmed videos, where we introduce a UDA scheme to tackle a scenario characterized by scarce annotated data. In the subsequent sections, we will explore some collections of trimmed clips and prominent techniques in the literature. Finally, we will conclude this section by reviewing some existing UDA approaches.

Video Violence Detection Datasets

Over the past few years, several benchmarks comprising trimmed video clips suitable for video violence detection have been introduced. These collections pose several challenges that consequently impact the performance of video violence detection models, such as the small amount of available data, low video quality, and limited video size. Some notable examples in the literature are (i) the Surveillance Camera Fight (SCF) [24] dataset, a collection of 300 videos from surveillance camera footage, 150 of which describe fight sequences and 150 depict non-fight scenes, (ii) the Real-Life Violence Situations (RLVS) [17] dataset, a set of 2000 video clips of violent/non-violent actions in general real-world scenarios, and (iii) the RWF-2000 [28] dataset, which includes 2000 trimmed video clips from YouTube capturing violent/non-violent scenes from surveillance cameras. Furthermore, some other datasets focus on specific environments, such as the Hockey Fight [15] dataset, containing 1000 trimmed clips of actions from National Hockey League (NHL) matches labeled as “violent” or “non-violent”, and the Bus Violence [13] benchmark, which instead includes 1400 videos of violent/non-violent scenes from several cameras located inside a moving bus, representing the first public dataset for human violence detection in public transport. We report all these benchmarks in Table 1, where details and sample frames are also given. In this work, we point out the difficulties concerning the generalization capabilities of DL-based techniques for video violence detection when trained with general-context data [17, 24, 28] and applied to specific scenarios [13, 15], providing a solution to mitigate this issue without using further annotations.

Table 1 Summary of datasets. We report statistics and sample frames of some collections of trimmed clips in the literature suitable for video violence detection. SCF, RLVS, and RWF-2000 comprise general-context data, while Bus Violence and Hockey Fight focus on specific environments

Video Violence Detection Approaches

Many of the existing video violence detection methods follow an architecture comprising a series of convolutional layers to extract spatial features, one or more long short-term memory (LSTM) layers [29] (or variants thereof) to encode long-term frame-level changes from a temporal perspective, and, finally, a sequence of fully connected layers for the final video classification. Some notable works are [17, 18], where the authors proposed a VGG-16 pre-trained on ImageNet as the spatial feature extractor followed by an LSTM as the temporal feature extractor, or [19, 22, 23], which instead exploited ImageNet pre-trained AlexNet, VGG16, and ResNet50 as backbones, respectively, and convolutional LSTM (ConvLSTM) [30] for temporal feature encoding. Similarly, in [20], a spatio-temporal encoder built on a pre-trained VGG13 combined with bidirectional convolutional LSTM (BiConvLSTM) has been introduced, while the authors in [24] proposed a combination of Xception and bidirectional LSTM (BiLSTM) layers. It is also worth noting that these latter three works [19, 20, 23] do not use the raw RGB video stream as input, but instead employ the frame-difference video stream, i.e., the difference of adjacent frames; frame-difference represents a computationally efficient alternative to optical flow, and it has been successfully exploited to capture short-term frame-level changes. On the other hand, in [31], the authors used a different architecture relying on 3D convolutional layers [32] to handle both spatial and temporal dimensions, while in [14], videos have been classified using single frames randomly sampled within the clips.

Alternatively, methods designed for general human action recognition can also be exploited. In this case, fine-tuning is required to classify videos into two classes: violence and non-violence. For instance, the ResNet 2+1D network [21] treats actions as spatio-temporal objects using factorized 3D convolutions obtained by decomposing the convolutions into separate 2D spatial and 1D temporal filters [33]. Another widely used model is SlowFast [34]. This architecture incorporates two branches: the first aims to capture semantic information through images or a few sparse frames, operating at a low frame rate, while the second captures fast-changing motion by operating at a higher frame rate. Lastly, recent advancements have introduced architectures based on Transformer attention modules. An example is the Video Swin Transformer [35], which extends the sliding-window Transformers proposed for image processing [36] to the temporal axis, achieving an excellent balance between efficiency and effectiveness.

Unsupervised Domain Adaptation Approaches

Traditional unsupervised domain adaptation (UDA) methods have predominantly focused on solving image classification problems by aligning features between two domains. Prominent examples of these methods include [37, 38]. However, extending these approaches to different applications is not straightforward, as underscored by [39]. Consequently, the existing literature provides a limited number of UDA techniques that are suitable for diverse tasks. Recent advancements have expanded the scope of UDA techniques to encompass areas like semantic segmentation [40, 41] and visual counting [42, 43]. This research introduces a UDA framework tailored specifically for video violence detection. To the best of our knowledge, this represents the first attempt to leverage UDA for this task.

Methodology

Background

In line with the notation introduced in [44, 45], we define a domain \({\mathcal {D}}\) consisting of two main components: a d-dimensional feature space \({\mathcal {X}} \subset {\mathbb {R}}^d\), and a marginal probability distribution P(X), where \(X = \{x_1, \dots , x_n\} \subset {\mathcal {X}}\) represents the set of feature samples. For a specific domain \({\mathcal {D}} = \{{\mathcal {X}}, P(X)\}\), we formulate a task \({\mathcal {T}}\), which is defined by a label space \({\mathcal {Y}}\) and the conditional probability distribution P(Y|X), where \(Y = \{y_1, \dots , y_n\} \subset {\mathcal {Y}}\) corresponds to the set of labels associated with X. In a supervised setting, P(Y|X) can be learned from the provided feature-label pairs \(\langle x_i, y_i \rangle\).

In the context of unsupervised domain adaptation, we are presented with two distinctive domains:

(i) a source domain \({\mathcal {D}}_S = \{{\mathcal {X}}_S, P(X_S)\}\), with task \({\mathcal {T}}_S = \{{\mathcal {Y}}_S, P(Y_S|X_S)\}\);

(ii) a target domain \({\mathcal {D}}_T = \{{\mathcal {X}}_T, P(X_T)\}\), with task \({\mathcal {T}}_T = \{{\mathcal {Y}}_T, P(Y_T|X_T)\}\),

where \({\mathcal {Y}}_T\) is unknown, meaning that we do not have any labels available for the target samples. Due to the inherent differences between the two domains, the distributions are assumed to be distinct, i.e., \(P(X_S) \ne P(X_T)\) and \(P(Y_S|X_S) \ne P(Y_T|X_T)\). The main goal of UDA is to train a model that exhibits decreased generalization error in the target domain, achieved through the effective reduction of domain discrepancy.

UDA for Video Violence Detection

In this work, the source domain \({\mathcal {D}}_S\) comprises a labeled collection of highly diverse videos depicting both violent and non-violent everyday-life situations. Here, \({\mathcal {Y}}_S = \{0, 1\}\) indicates whether violent actions are absent or present in these clips, respectively. In contrast, the target domain \({\mathcal {D}}_T\) comprises an entirely separate set of videos lacking any annotations. These videos capture instances of violent or non-violent actions occurring in a distinct and unique context compared to the scenarios observed in the source domain. The main objective is to leverage knowledge from the unlabeled target domain in the training process, aiming to minimize the dissimilarity between the source and target domains. This adaptation enhances the model’s capacity to generalize effectively to scenarios where annotations are not available.

Our approach relies on a deep learning-based model, trained end-to-end with an attached UDA module. A distinctive feature of our UDA scheme is that it is based on single-image classification. We transform the task of video classification into image classification, as scenes with violent actions can be distinguished from non-violent scenes by classifying a sampled image from the entire video clip [14, 16]. Specifically, we improve the idea introduced in [14, 16], where the authors picked a frame from each clip at random, and we propose a multiple instance learning (MIL) technique that instead considers the frame with the best classification score. MIL is a type of weakly supervised learning in which training instances are organized into groups, referred to as "bags", and a single label is assigned to the entire bag [46]; in our context, bags are represented by the trimmed videos, while instances are the frames composing the clips themselves. A straightforward MIL approach involves applying a max pooling operator over the classification scores associated with the instances, therefore obtaining a single score associated with the bag, as sketched below. Building on this baseline, we incorporate into the training pipeline two UDA techniques initially designed for image classification, feeding them with images sampled from the target domain to facilitate inter-domain knowledge transfer.
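The following minimal PyTorch sketch illustrates this MIL max-pooling step; the `frame_classifier` module and the tensor shapes are illustrative assumptions rather than the exact implementation.

```python
import torch

def mil_bag_logit(frame_classifier: torch.nn.Module,
                  frames: torch.Tensor) -> torch.Tensor:
    """Score a trimmed clip (the MIL "bag") by its best-scoring frame.

    frames: tensor of shape (T, C, H, W) holding the T frames of one clip.
    Returns a single violence logit for the whole clip.
    """
    frame_logits = frame_classifier(frames)      # (T, 1) frame-level scores
    bag_logit, _ = frame_logits.max(dim=0)       # max pooling over instances
    return bag_logit

# During training, the bag logit is compared with the video-level label,
# e.g., via binary cross-entropy:
# loss = torch.nn.functional.binary_cross_entropy_with_logits(bag_logit, label)
```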

More in detail, we utilize several convolutional neural networks (CNNs) as backbones for extracting features, excluding the final classification layers. We substitute the last classification head with a binary classification layer, which provides the probability of the presence (or absence) of violent actions in the given video. Additionally, we introduce an extra linear layer followed by a ReLU activation function to transform the feature maps originating from the feature extractor into a fixed-dimensional representation. This fixed-dimensional feature map is subsequently input into a UDA module.
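A minimal sketch of this architecture is given below, assuming a ResNet50 backbone; the 256-dimensional projection size is an illustrative assumption, not a value reported in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class ViolenceClassifier(nn.Module):
    """Sketch: a CNN backbone without its original head, a binary
    classification layer, and a linear + ReLU projection whose output
    feeds the UDA module."""

    def __init__(self, feat_dim: int = 256):       # feat_dim is an assumption
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        in_feats = backbone.fc.in_features          # 2048 for ResNet50
        backbone.fc = nn.Identity()                 # drop the original classifier
        self.backbone = backbone
        self.classifier = nn.Linear(in_feats, 1)    # violence logit
        self.projection = nn.Sequential(            # fixed-dimensional features
            nn.Linear(in_feats, feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)                    # (B, 2048) features
        return self.classifier(feats), self.projection(feats)
```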

We have explored two distinct UDA approaches. The first is the Domain-Adversarial Neural Network (DANN) [37], which involves a domain regressor engaged in an adversarial competition with the classifier. This method achieves UDA by connecting the domain classifier to the feature extractor through a gradient reversal layer. During training, this layer introduces an adversarial loss by reversing the gradient and scaling it by a negative constant. Apart from this, training proceeds as usual: the label prediction loss is minimized on source examples, while the domain classification loss is computed on all samples. The adversarial loss pushes the feature distributions of the two domains to become as similar as possible, resulting in domain-invariant features. The second approach is Minimum Class Confusion (MCC) [38], a method that performs UDA without explicitly aligning domains. MCC is grounded in the concept of class confusion, i.e., the tendency of the classifier to confuse predictions between correct and ambiguous classes. Specifically, MCC operates on the class predictions made by the classifier for the target data, given the feature extractor. During training, MCC is optimized using standard backpropagation to reduce class confusion and enhance feature generalization.
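As an illustration of the DANN branch, the sketch below shows a gradient reversal layer in PyTorch; the surrounding domain classifier and the value of the reversal constant `lambd` are assumptions made for the example.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Gradient reversal: identity in the forward pass, gradient scaled
    by -lambd in the backward pass (the core mechanism of DANN)."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Illustrative use: the domain classifier receives reversed gradients, so
# the feature extractor is pushed towards domain-invariant features.
# domain_logits = domain_classifier(grad_reverse(features, lambd))
# domain_loss = torch.nn.functional.cross_entropy(domain_logits, domain_labels)
```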

Experimental Analysis

Experimental Setting

Experimental scenario. We exploited three datasets from the existing literature as the source domain \({\mathcal {D}}_S\): Surveillance Camera Fight (SCF) [24], Real-Life Violence Situations (RLVS) [17], and RWF-2000 [28], previously mentioned in Sect. “Related Works”. These datasets comprise annotated videos captured by stationary security cameras, featuring a diverse array of trimmed violent and non-violent scenes that span various real-life situations. In contrast, we adopted the Hockey Fight [15] dataset and the Bus Violence dataset [13] as the target domains \({\mathcal {D}}_T\). The former consists of trimmed clips from National Hockey League (NHL) matches, while the latter includes trimmed clips recorded within a moving bus, featuring actors simulating both violent and non-violent actions. These scenarios are notably more specific, involving instances of violence within the context of hockey matches or public transportation. More in detail, we divided the Hockey Fight and the Bus Violence datasets into two distinct splits: one was used as the unlabeled set from which to infer domain-specific knowledge, while the second served as the testing ground for evaluating the generalization capabilities of all the considered deep learning models.

We employed two popular CNNs, ResNet50 [47] and VGG16 [48], as the primary feature extraction backbones. We replaced their final classification head with a binary classification layer to adapt them to our video violence detection task. These networks served as our baseline models, as already done in our previous work [16] and in [14], and we used them both without any UDA module and as feature extractors and classifiers within our proposed UDA approaches. We also exploited these two feature extractors in our modified setting with the added MIL strategy, again both with and without our UDA modules. Furthermore, to compare with the existing literature, we considered other established methods tailored for video violence detection and video action recognition. Specifically, we leveraged the architectures introduced in [17, 19, 20, 24], which utilize LSTM, BiLSTM, ConvLSTM, and BiConvLSTM as spatio-temporal encoders, and the network proposed in [31] that exploits 3D convolutional layers. We also considered popular video action classifiers, including the ResNet 2+1D network [21], the SlowFast [34] architecture, and the Video Swin Transformer [35]. More information about these models can be found in Sect. “Related Works” and their respective papers. We initialized these models with weights pre-trained on the ImageNet [10] or Kinetics-400 [49] datasets, with no additional external data.

Performance metrics. Consistent with previous research on video violence detection, we employed accuracy as the key metric to assess the performance of the methods being examined. It is defined as:

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \end{aligned}$$
(1)

where TP represents true positives, TN stands for true negatives, FP denotes false positives, and FN represents false negatives. For a more comprehensive comparison of the obtained results, we also incorporated additional metrics, including the F1-score, false alarm (FA), and missing alarm (MA), which are defined as follows:

$$\begin{aligned} F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}, \end{aligned}$$
(2)
$$\begin{aligned} \text {False Alarm} = \frac{FP}{TN + FP}, \end{aligned}$$
(3)
$$\begin{aligned} \text {Missing Alarm} = \frac{FN}{TP + FN}, \end{aligned}$$
(4)

where precision and recall are defined as \(\frac{TP}{TP + FP}\) and \(\frac{TP}{TP + FN}\), respectively. Finally, we incorporated the area under the receiver operating characteristics (ROC AUC) metric to account for the probabilities associated with the detections. It is computed by measuring the area under the curve obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
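For clarity, the metrics in Eqs. (1)-(4) can be computed from the confusion-matrix counts as in the short helper below; the function name and the example counts are purely illustrative.

```python
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Metrics of Eqs. (1)-(4) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # true positive rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
        "false_alarm": fp / (tn + fp),          # false positive rate
        "missing_alarm": fn / (tp + fn),        # 1 - recall
    }

# Example: detection_metrics(tp=80, tn=75, fp=25, fn=20)
# -> accuracy 0.775, F1 ~0.78, false alarm 0.25, missing alarm 0.20
```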

Implementation details and evaluation protocol. We implemented our models using PyTorch. Training and inference of all models were performed on an NVIDIA GeForce RTX4090. We used SGD for training, setting the initial learning rate to 0.005, the momentum to 0.9, and the weight decay to 0.001. The number of epochs was set to 60, 75, and 100 for the SCF, RLVS, and RWF-2000 datasets, respectively. We consistently employed a uniform data augmentation strategy throughout the training phase, which included horizontal flipping with a probability of 0.5 and resizing to \(256\times 256\) pixels.
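The reported optimizer and augmentation settings correspond to a setup along the lines of the sketch below; `ViolenceClassifier` refers to the illustrative model sketched in Sect. “Methodology”, and the use of `ToTensor()` is an implementation assumption.

```python
import torch
from torchvision import transforms

# Augmentation reported above: resize to 256x256 and horizontal flip (p=0.5).
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),                 # assumption for PIL-image inputs
])

# SGD with the reported hyper-parameters.
model = ViolenceClassifier()               # illustrative model sketched earlier
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.005,
    momentum=0.9,
    weight_decay=0.001,
)
```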

To ensure the robustness of our results, we implemented the following evaluation protocol. Within each of the three selected source (training) domains, i.e., SCF, RLVS, and RWF-2000, we randomly partitioned the training and validation subsets three times. We then selected the best-performing model based on accuracy and tested it on the target (test) domain, i.e., the splits of Hockey Fight and Bus Violence benchmarks selected as the performance testing ground. Our reported results represent the mean and standard deviation of these three independent runs. We repeated the experiments five times instead of three times for some of the results obtained in our previous work [16] that showed a high standard deviation.

Results and Discussion

The results concerning the Hockey Fight target dataset are shown in Table 2. In general, all the examined models demonstrate only moderate performance, highlighting the challenges of effectively adapting their capabilities to detect violent actions in videos from the target domain. Specifically, they particularly struggle when the source domain is the SCF dataset, while they achieve better results with the RLVS data collection. However, our modified VGG16 architecture combining our MIL-based approach with the MCC UDA module stands out as the top performer in terms of the key metric, i.e., accuracy, in all the considered scenarios. More in detail, when compared to the same architecture without UDA, our proposed technique attains a gain of about \(6\%\), \(4\%\), and \(3\%\) in accuracy for the SCF, RWF-2000, and RLVS source domains, respectively. More importantly, it outperforms all the other state-of-the-art methods considered in the literature, gaining about \(35\%\), \(8\%\), and \(9\%\) in accuracy compared with the single-image classification-based method proposed in [14], which constitutes the baseline of our previous work [16].

Regarding the Bus Violence target dataset, we illustrate the results in Table 3. In general, all the models exhibit very poor performance, indicating that this recently established scenario is even more challenging than the Hockey Fight dataset. However, even in this case, our UDA scheme can mitigate the difficulties arising from limited generalization capabilities. In this setting, the best performer in terms of accuracy is our modified ResNet50 architecture with our MIL-based technique and the MCC UDA module. Specifically, compared with the same architecture without UDA, we gain about \(9\%\), \(5\%\), and \(14\%\) in accuracy for the SCF, RWF-2000, and RLVS source domains, respectively. Furthermore, it is worth noting that, even in this case, we outperform the other state-of-the-art techniques considered, gaining about \(11\%\), \(13\%\), and \(16\%\) in accuracy compared against our baseline in [16].

Considering missing alarms, it is noticeable that our UDA module improves performance compared with the same architecture without UDA on both target domains. MAs are particularly critical in video violence detection, as they correspond to violent actions that occurred but went undetected. Approaches that struggle with this metric are significantly limited as violence detection systems; therefore, this behavior represents an added value of our proposal.

Table 2 Performance evaluation over the Hockey Fight dataset [15]. We report the obtained results considering the Hockey Fight benchmark as the target domain and three sets of clips for video violence detection in general contexts
Table 3 Performance evaluation over the Bus Violence dataset [13]. We report the obtained results considering the Bus Violence benchmark as the target domain and three sets of clips for video violence detection in general contexts

Conclusion

In this research, we addressed the challenge of video violence detection within the context of limited data availability. The current state of deep learning solutions heavily relies on abundant labeled data for effective supervised learning. However, these models tend to struggle when applied to new, previously unseen scenarios that were not part of their training data. Consequently, a model trained on one domain, referred to as the source, often experiences a significant performance decline when deployed in another domain, known as the target. To address this issue, we introduced an unsupervised domain adaptation (UDA) approach for identifying violent and non-violent actions within trimmed videos. Our method combines supervised learning in the source domain with the utilization of an unlabeled target dataset, aiming to reduce the domain shift between the two datasets. Our proposed solution is based on single-image classification, where a simple multiple instance learning (MIL) approach selects, from each video clip, the frame with the maximum classification score. The feature representations extracted from the target images are passed through a UDA module, in charge of making them domain-invariant by minimizing the shift between the domains. To the best of our knowledge, this is the first attempt to employ a UDA framework for video violence detection. Our experiments used three source datasets comprising videos depicting violent and non-violent scenes in various general contexts. In contrast, the target domains consisted of collections of clips capturing violent and non-violent actions in very specific environments, such as hockey matches and public transport. The obtained results indicate that our UDA scheme can enhance the generalization capabilities of the considered models by mitigating the domain gap.