Abstract
In the realm of computer vision, Group Activity Recognition (GAR) plays a vital role, finding applications in sports video analysis, surveillance, and social scene understanding. This paper introduces Recognize Every Action Everywhere All At Once (REACT), a novel architecture designed to model complex contextual relationships within videos. REACT leverages advanced transformer-based models for encoding intricate contextual relationships, enhancing understanding of group dynamics. Integrated Vision-Language Encoding facilitates efficient capture of spatiotemporal interactions and multi-modal information, enabling comprehensive scene understanding. The model’s action localization module refines the joint understanding of text and video data, enabling precise bounding box retrieval and strengthening the semantic links between textual descriptions and visual content. Actor-Specific Fusion strikes a balance between actor-specific details and contextual information, improving the model’s specificity and robustness in recognizing group activities. Experimental results demonstrate REACT’s superiority over state-of-the-art GAR approaches, achieving higher accuracy in recognizing and understanding group activities across diverse datasets. This work significantly advances group activity recognition, offering a robust framework for nuanced scene comprehension.
1 Introduction
Group activity recognition (GAR) has emerged as a crucial problem in computer vision, with numerous applications in sports video analysis, video monitoring, and social scene understanding. Unlike conventional action recognition methods that focus on identifying individual actions, GAR aims to classify the actions of a group of people in a given video clip as a whole (Fig. 1). It requires a deeper understanding of the interactions between multiple actors, including accurate localization of actors and modeling their spatiotemporal relationships [1,2,3,4,5,6,7,8]. As a result, GAR poses fundamental challenges that must be addressed to develop practical solutions for this problem. In this context, developing novel techniques for group activity recognition has become an active area of research in computer vision.
Existing GAR methods require ground-truth bounding boxes and action class labels for training and testing [9,10,11,12,13,14,15,16,17]. We compare our method with these previous methods in Fig. 2. Bounding box labels are used to extract actor features and their spatio-temporal relations, which are then aggregated into a group-level video representation for classification. However, the reliance on bounding boxes and the substantial annotation effort they entail severely limits their applications. To address these limitations, some methods jointly train person detection and group activity recognition using bounding box labels [18, 19]. Another approach is weakly supervised learning (WSGAR) [20, 21], which does not require individual actor-level labels for training and inference.
Yan et al. [20] proposed a WSGAR learning approach that uses a pre-trained detector to generate actor box proposals and learns to prune irrelevant ones. However, this method suffers from missing detections when actors are occluded. Kim et al. [21] introduced a detector-free method that captures actor information using partial contexts of token embeddings, but it can only learn when there is motion across consecutive frames. Moreover, Kim et al. [21] did not consider the consistency of temporal information among different tokens. Hence, there is a need for a GAR approach that captures temporal information accurately without relying on bounding box annotations or detector-based pipelines.
1.1 Our contributions in this work
Inspired by attention mechanisms in video contexts, we introduce a transformer-based architecture, depicted in Fig. 3, to effectively model complex interactions within videos, incorporating temporal, spatial, and multi-modal features. The Vision-Language (VL) Encoder block (Sect. 3.1) is a pivotal element, proficient in encoding sparse spatial and multi-modal interactions, with a dedicated fast branch for recovering temporal details. Our architecture integrates an Action Decoder Block (Sect. 3.2) that refines the joint understanding of text and video data. It balances context and actor-specific details by merging the Encoder and Actor Fusion Block outputs. Temporal-spatial attention captures temporal and spatial dependencies, while temporal cross-attention, aligned with ground-truth bounding boxes, ensures precise retrieval of bounding boxes and bridges semantic and visual contexts. The Actor Fusion Block (Sect. 3.3), positioned at the core, orchestrates a harmonious blend of actor-specific information and textual features. Through concatenation and averaging operations, it achieves a holistic representation enriched by convolution operations that extract local patterns. This refined representation contributes to contextually relevant output generation in the Decoder Block. In practical terms, our model’s efficacy is demonstrated in Fig. 1 through an example response to a user’s action query. The results showcase superior performance in accurately recognizing and understanding group activities compared to state-of-the-art methods. Our experiments provide empirical evidence of the significant performance gains achieved by our framework, affirming its potential for diverse real-world applications.
2 Related work
2.1 Group activity recognition (GAR)
In the field of action recognition, group action recognition has become an increasingly popular research topic due to its wide range of applications in various fields, such as video surveillance, human-robot interaction, and sports analysis. GAR aims to identify the actions performed by a group of individuals and the interactions between them.
Initially, researchers in the field of GAR used probabilistic graphical methods and AND-OR grammar methods to process the extracted features [22,23,24,25,26,27,28,29]. However, with the advancement of deep learning techniques, methods involving convolutional neural networks (CNN) and recurrent neural networks (RNN) achieved outstanding performance due to their ability to learn high-level information and temporal context [9, 18, 30,31,32,33,34,35,36].
Recent methods for identifying group actions mainly utilize attention-based models and require explicit actor representations to model spatial-temporal relations in group activities [10,11,12,13,14, 17, 20, 37]. For example, graph convolution networks learn spatial and temporal information of actors by constructing relational graphs, and spatial and temporal relation graphs are used to infer actor links. Clustered attention is used to capture contextual spatial-temporal information, and transformer encoder-based techniques with different backbone networks extract features for learning actor interactions from multimodal inputs [12]. Additionally, MAC-Loss [38], a combination of spatial and temporal transformers in two complementary orders, has been proposed to enhance the learning effectiveness of actor interactions and preserve actor consistency at the frame and video levels. Tamura et al. [39] introduced a framework for recognizing social group activities and identifying group members without heuristic features; the relevant information is embedded directly into the features, allowing for easy identification. These recent advancements in GAR have made significant progress toward recognizing complex actions performed by a group of individuals in various settings.
2.2 Weakly supervised group activity recognition (WSGAR)
Various techniques have been developed to address GAR with limited supervision, such as using bounding boxes to train built-in detectors or activity maps. WSGAR does not rely on bounding box annotations during training or inference and instead incorporates an off-the-shelf object detector into the model. Traditional GAR approaches require accurate annotations of individual actors and their actions, which can be challenging and time-consuming to obtain. Weakly supervised methods aim to relax these requirements by learning from more readily available data such as activity labels, bounding boxes, or video-level labels. Zhang et al. [40] proposed a technique that employs activity-specific features to enhance WSGAR, although it is not primarily designed for GAR. Kim et al. [21] proposed a detector-free approach that uses transformer encoders to extract motion features. In contrast, we propose a self-supervised training method specialized for WSGAR that does not require actor-level annotations, object detectors, or labels.
2.3 Transformers in vision
The transformer architecture was first introduced by Vaswani et al. [41] for sequence-to-sequence machine translation, and since then, it has been widely applied to various natural language processing tasks. Dosovitskiy et al. [42] introduced a transformer architecture not based on convolution for image recognition tasks. Several works [43,44,45,46] used transformer architecture as a general backbone for various downstream computer vision tasks, achieving remarkable performance progress. In the video domain, many approaches [47,48,49,50,51,52,53] utilize spatial and temporal self-attention to learn video representations effectively. Bertasius et al. [50] explored different mechanisms of space and time attention to learn spatiotemporal features efficiently. Fan et al. [51] used multiscale feature aggregation to improve the learning performance of features. Patrick et al. [52] introduced a self-attention block that focuses on the trajectory, which tracks the patches of space and time in a video transformer.
3 The proposed method
As shown in Fig. 3, given the input video \(\textbf{X}\) and the textual input \(\textbf{t}\), the REACT model aims to localize all the positions \(\textbf{b}\) in the video \(\textbf{X}\) that contain the group actions of interest specified by the input prompt \(\textbf{t}\). By formulating the REACT model, we seek to address the challenge of understanding complex group dynamics and activities within videos, which is crucial for applications such as video surveillance, activity recognition, and human behavior analysis. Formally, the REACT model can be formulated as follows:
where \(\hat{\textbf{b}} \in [0,1]^{T\times 4}\) is the normalized predicted bounding box coordinates, VE and TE are the visual and textual encoders to extract the feature representations of the video input \(\textbf{X}\), i.e., \(\varvec{v_f}\) = VE(\(\textbf{X}\)), and the textual prompt \(\textbf{t}\), i.e., \(\varvec{t_f}\) = TE(\(\textbf{t}\)), respectively. The motivation behind employing visual and textual encoders is to capture both visual and semantic information, facilitating a comprehensive understanding of group actions in videos.
\(\mathbb {F}\) is the correlation model to exploit the contextual relationship between visual features \(\varvec{v_f}\) and the textual features \(\varvec{t_f}\). The motivation behind using a correlation model is to integrate information from both modalities effectively, enhancing the model’s ability to localize group actions accurately.
Then, the Action Decoder (AD) decodes the contextual features produced by the correlation model \(\mathbb {F}\) to localize all the group action regions of interest. The motivation behind the Action Decoder is to transform the learned features into interpretable predictions of group action positions, enabling actionable insights from video data. Fig. 3 illustrates the proposed framework of REACT.
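To make the composition above concrete, the following is a minimal PyTorch-style sketch of how the visual encoder VE, textual encoder TE, correlation model \(\mathbb {F}\), and action decoder AD fit together. The sub-modules are placeholders rather than the released implementation, and the exact argument list of the decoder is an assumption.

```python
# Sketch only: every sub-module below is a placeholder, not the authors' code.
import torch.nn as nn

class REACT(nn.Module):
    """Composition sketch: b_hat = AD(F(VE(X), TE(t)))."""
    def __init__(self, visual_encoder, textual_encoder, correlation, action_decoder):
        super().__init__()
        self.VE = visual_encoder   # video backbone (e.g., a ViT over sampled frames)
        self.TE = textual_encoder  # text backbone (e.g., RoBERTa)
        self.F = correlation       # cross-modal correlation model (Sect. 3.1)
        self.AD = action_decoder   # action decoder (Sect. 3.2)

    def forward(self, video, prompt_tokens):
        v_f = self.VE(video)          # visual features, shape (T, HW, d)
        t_f = self.TE(prompt_tokens)  # textual features, shape (L, d)
        vt_f = self.F(v_f, t_f)       # shared spatio-temporal, multi-modal representation
        return self.AD(vt_f, t_f)     # normalized boxes b_hat in [0, 1]^(T x 4)
```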
Contextual Relationship Modeling in GAR involves modeling multi-modality and spatio-temporal features. To cover these correlations efficiently, we model the correlation function \(\mathbb {F}\) via the attention mechanism. In particular, the correlation model \(\mathbb {F}\) is designed as a Transformer network. It comprises the vital components for integrated modeling of temporal, spatial, and multi-modal interactions, adeptly encoding spatial and multi-modal interactions while efficiently processing sparsely sampled frames. \(\mathbb {F}\) is a transformer block that mixes text and video representations using cross-attention. Formally, \(\mathbb {F}\) can be formed as follows:
where \(\varvec{\widehat{vt_f}}\) is the shared representation of video and text, FFN is a feed-forward neural network, and T2V and V2T are the two cross-attention directions between the modalities. T2V (text-to-video) lets the textual features attend to the visual features, grounding each prompt token in the relevant spatial and temporal regions of the video. V2T (video-to-text) lets the visual features attend to the textual features, enriching each video token with the linguistic context and semantics of the query. The attention mechanism thus empowers our correlation model \(\mathbb {F}\) to produce a shared representation that captures the contextual relationships between text and video in both spatial and temporal dimensions, encapsulating a joint understanding of both modalities and enabling the model to bridge the semantic gap between them effectively.
From the shared contextual feature representations produced by the correlation model \(\mathbb {F}\), the action decoder aims to localize all the group action positions of interest \(\textbf{b}\). In practice, the group action positions \(\textbf{b}\) could consist of a list of positions of interest \(\textbf{b} = \{\mathbf {b_1}, \mathbf {b_2}, \ldots , \mathbf {b_N}\}\), where N is the number of group action positions. For example, in a military mission, the group action of enemies could include the positions of many army groups and adversarial armed devices. Additionally, predicting a particular group action position requires a broader understanding of the contextual feature representation and the surrounding predicted group action positions. Therefore, to efficiently model the group action decoder, in addition to the shared feature representation, the positions of the currently predicted group actions are also taken into account. Formally, the Group Action Decoder can be modeled as follows:
where AF (Actor Fusion) is the model that exploits the correlation between the currently predicted positions of group actions and the textual features. Intuitively, the list of group action positions \(\{\mathbf {b_1}, \ldots , \mathbf {b_{i-1}}\}\) provides the spatial context of subjects in the video. Meanwhile, the textual features \(\varvec{t_f}\) represent linguistic aspects of context and semantics related to the group actions and offer a complementary view to the visual information. The AF model distinguishes between capturing detailed spatial information from the group action positions and emphasizing the broader action-related information from the textual features. Finally, the Action Decoder (AD) model predicts the position of group action \(\mathbf {b_i}\) based on the contextual feature representation \(\varvec{\widehat{vt_f}}\) and the action fusion feature \(\varvec{\widehat{bt_f}}\) produced by the AF model. In essence, the AD model uses the wealth of information from the correlation model \(\mathbb {F}\) and the fusion features from the AF model to associate textual queries \(\varvec{t_f}\) with specific regions or subjects within the video frames. To efficiently exploit the correlation among these features in AD, we design AD via the attention mechanism, enabling the AD model to precisely identify and highlight the regions within the video that correspond to the described actions.
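The decoding procedure implied above can be sketched as a simple loop in which each new position is conditioned on the boxes predicted so far. The function signatures, the zero box used as initial context, and the fixed number of iterations are assumptions for illustration only.

```python
# Hedged sketch of the sequential decoding described in Sect. 3.2 (not the authors' code).
import torch

def decode_group_actions(vt_f, t_f, actor_fusion, action_decoder, num_actions):
    boxes = []                                   # predicted positions b_1 .. b_N
    for _ in range(num_actions):
        prev = torch.stack(boxes) if boxes else torch.zeros(1, 4)  # boxes so far
        bt_f = actor_fusion(prev, t_f)           # AF: fuse spatial and linguistic context
        b_i = action_decoder(vt_f, bt_f)         # AD: attend over the shared features
        boxes.append(b_i)
    return torch.stack(boxes)                    # (N, 4) group action positions
```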
3.1 Vision-language (VL) encoder
The VL Encoder receives input features from the respective backbone networks. These backbones are specialized models designed to extract meaningful information from a set of L text tokens, \(\varvec{t_f} \in \mathbb {R}^{L \times d}\), and from the video, \(\varvec{v_f} \in \mathbb {R}^{T \times HW \times d}\), over T frames with spatial resolution HW, independently optimizing their representations for their specific modalities.
The VL Encoder receives the features \(\varvec{t_f}\) and \(\varvec{v_f}\). It also uses visual features derived from the video, including frames, motion patterns, bounding boxes, and object presence.
The VL Encoder utilizes attention mechanisms to comprehend these features effectively. It begins with self-attention on text, allowing it to weigh the importance of words and understand the context within the text. Simultaneously, it performs temporal self-attention on video features to capture temporal dependencies and dynamics.
The model then progresses to cross-attention, connecting the video and text modalities. The "video-to-text" step focuses on video features and aligns them with relevant text. The "text-to-video" step attends to video elements based on the textual context. This bidirectional interaction ensures a holistic understanding of both modalities.
Finally, the information from self-attention and cross-attention is combined, creating multi-modal features. These features merge enhanced text and video information, bridging the gap between the two modalities. The VL Encoder facilitates joint text and video analysis, benefiting tasks like video captioning and action recognition. It empowers models to grasp contextual relationships between text and video, enhancing multi-modal understanding.
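The stages above can be summarized in a compact sketch: text self-attention, temporal self-attention on video tokens, bidirectional cross-attention, and a feed-forward fusion. Layer counts, dimensions, and the single-layer design are assumptions; the paper's encoder may stack several such blocks.

```python
# Minimal single-block sketch of the VL Encoder stages (assumed configuration).
import torch
import torch.nn as nn

class VLEncoder(nn.Module):
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.text_self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.video_temp_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.t2v_cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.v2t_cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, v_f, t_f):
        # v_f: (B, T*HW, d) flattened video tokens; t_f: (B, L, d) text tokens
        t, _ = self.text_self_attn(t_f, t_f, t_f)      # contextualize the prompt
        v, _ = self.video_temp_attn(v_f, v_f, v_f)     # temporal/spatial dependencies
        t2v, _ = self.t2v_cross(t, v, v)               # text queries attend to video
        v2t, _ = self.v2t_cross(v, t, t)               # video queries attend to text
        return self.ffn(torch.cat([v2t, t2v], dim=1))  # fused multi-modal features vt_f
```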
3.2 Action decoder (AD) block
The AD Block is a crucial part of the architecture responsible for producing meaningful bounding box outputs. It combines information from the encoder and the actor fusion blocks.
First, it concatenates outputs from these sources. The Encoder Block provides a broad multi-modal understanding of text and video, while the Actor Fusion Block adds actor-specific details like normalized text features and actor bounding boxes. This merging process combines contextual understanding with fine-grained actor-specific information.
Then, the AD Block employs temporal spatial attention. This mechanism allows the model to consider temporal (sequential) and spatial (object relationships) aspects in video data. It is particularly valuable for tasks like action recognition and actor localization.
Next, it uses temporal cross-attention, incorporating ground-truth bounding box data. Ground-truth bounding boxes precisely locate objects or actors in video frames. This step refines the model’s understanding of object and actor movements over time, aligning textual descriptions with the video content.
Finally, the AD Block generates bounding boxes as grounding outputs based on text queries. It leverages the information gathered from previous stages, aligning textual descriptions with specific regions or objects in video frames. This output is vital for tasks like actor localization and action recognition, connecting text with visual content to pinpoint regions corresponding to described actions or objects.
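A minimal sketch of this data flow is given below: the encoder and actor-fusion outputs are concatenated, passed through temporal-spatial self-attention, cross-attended with box-derived tokens, and regressed to coordinates. The single-layer layout, the box embedding, and the sigmoid normalization are assumptions rather than the exact published design.

```python
# Illustrative sketch of the AD Block (assumed layer choices, not the authors' code).
import torch
import torch.nn as nn

class ActionDecoderBlock(nn.Module):
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.temporal_spatial_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.temporal_cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.box_embed = nn.Linear(4, d)                 # lift box coordinates to features
        self.box_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))

    def forward(self, vt_f, bt_f, ref_boxes):
        # vt_f: (B, S, d) encoder output; bt_f: (B, S', d) actor-fusion output
        # ref_boxes: (B, T, 4) temporal box references (ground truth during training)
        x = torch.cat([vt_f, bt_f], dim=1)               # merge contextual and actor cues
        x, _ = self.temporal_spatial_attn(x, x, x)       # temporal and spatial dependencies
        box_tokens = self.box_embed(ref_boxes)           # (B, T, d)
        out, _ = self.temporal_cross_attn(box_tokens, x, x)  # align box tokens with context
        return self.box_head(out).sigmoid()              # (B, T, 4) normalized boxes
```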
3.3 Actor fusion (AF) block
The AF block begins by combining two vital sources of information, including the bounding box data \(\hat{\textbf{b}}\) and the normalized text features \(\mathbf {t_f}\). The bounding box data provides precise location details for objects or actors in video frames, enhancing the model’s spatial understanding.
Simultaneously, normalized text features offer textual descriptions and contextual information related to actions or objects in the video, providing a linguistic perspective.
After combining these sources, the AF block calculates an average over the concatenated features. This averaging serves multiple purposes. It creates a balanced representation that includes actor-specific details (from the bounding box) and semantic context (from the text features). Additionally, it emphasizes action-related information, aligning with the action recognition task.
The AF block uses convolution operations to refine the feature representation further (presented as \(\widehat{\mathbf {bt_f}}\)). Convolutions are essential in deep learning for capturing local data patterns. They extract and highlight relevant spatial and temporal features from the combined actor-specific information in this context. These convolutional operations are the key to improving the representation before passing it back to the AD Block.
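The operations described in this block can be sketched as follows; the projection of box coordinates to the feature dimension and the 1-D convolution size are assumptions made to keep the example self-contained.

```python
# Minimal sketch of the AF block: concatenate, average, then convolve (assumed shapes).
import torch
import torch.nn as nn

class ActorFusionBlock(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.box_proj = nn.Linear(4, d)                  # lift box coordinates to d dims
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)

    def forward(self, boxes, t_f):
        # boxes: (B, N, 4) actor bounding boxes; t_f: (B, L, d) normalized text features
        b = self.box_proj(boxes)                         # (B, N, d) actor-specific features
        fused = torch.cat([b, t_f], dim=1)               # combine both information sources
        fused = fused.mean(dim=1, keepdim=True)          # balanced actor/semantic summary
        fused = self.conv(fused.transpose(1, 2))         # extract local patterns
        return fused.transpose(1, 2)                     # (B, 1, d) fused feature bt_f
```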
3.4 Training loss
The training data consists of sets of frames, where each set belongs to a video annotated with a text label representing the group activity and the individual actions \(s\), together with the corresponding set of bounding boxes \(b\).
We train our architecture with a linear combination of four losses
where \(b \in [0,1]\) denotes the normalized ground-truth box coordinates and \(\hat{b}\) the predicted bounding boxes, and the \(\lambda _{\bullet }\) are scalar weights of the individual losses. \(\mathcal {L}_{\mathcal {L}_1}\) is an \(\mathcal {L}_1\) loss on the bounding box coordinates, and \(\mathcal {L}_{gIoU}\) is a generalized intersection-over-union (gIoU) loss [54] on the bounding boxes. Both \(\mathcal {L}_1\) and \(\mathcal {L}_{gIoU}\) are used for spatial and temporal grounding. Losses are computed at each layer of the decoder, following [55].
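For the two box-regression terms, a minimal sketch of how they could be computed is shown below, using the weights reported later (\(\lambda _{\mathcal {L}_1}\) = 5, \(\lambda _{gIoU}\) = 2). The corner-coordinate box format and the omission of the classification terms are simplifications for illustration.

```python
# Sketch of the L1 + generalized IoU box losses (classification terms omitted).
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou

def box_losses(pred_boxes, gt_boxes, lambda_l1=5.0, lambda_giou=2.0):
    # pred_boxes, gt_boxes: (T, 4) normalized (x1, y1, x2, y2) coordinates in [0, 1]
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))  # matched pairs only
    return lambda_l1 * l1 + lambda_giou * (1.0 - giou).mean()
```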
3.5 Inference
After training on the labelled dataset, the model is used for inference, i.e., making predictions on new, unseen videos based on the patterns learned during training. During the testing phase, we employ distinct operations for individual-action and social-group predictions. For individual actions, a softmax is applied to predictions from heads trained with cross-entropy, and a sigmoid is applied to predictions from heads trained with binary cross-entropy. Our approach follows a hierarchical strategy, initiating label selection from the first partition and progressing to subsequent partitions in the hierarchy only if the "Other" class is predicted. In the context of social group prediction, we leverage the predicted counts of individual actions and social groups. To deduce the activity of each predicted social group, the social activity label is determined by the most frequent action label among its members.
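The social-group rule at the end of this procedure reduces to a majority vote over member actions, as in the hedged sketch below; the data structures are illustrative only.

```python
# Sketch of social group activity inference: majority vote over members' action labels.
from collections import Counter

def social_group_activity(group_members, member_action_labels):
    # group_members: indices of actors in one predicted social group
    # member_action_labels: dict mapping actor index -> predicted individual action label
    actions = [member_action_labels[m] for m in group_members]
    return Counter(actions).most_common(1)[0][0]     # most frequent action wins

# Example: two of three members are predicted as "walking", so the group label is "walking"
print(social_group_activity([0, 1, 2], {0: "walking", 1: "walking", 2: "standing"}))
```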
3.6 Discussion
Need for concatenated attention? The method described earlier, which we call "concatenation-based fusion," combines features by performing self-attention on features from individual sources and then computing cross-attention across different sources; it contrasts with "cross-attention-based fusion." With flexibility and fewer assumptions about the data’s spatial pattern, the Transformer model offers significant modeling freedom. Compared to cross-attention-based fusion, concatenation-based fusion is more efficient due to operation sharing and reduces model parameters through weight sharing. Weight sharing is a critical design aspect for symmetric metric learning between two data branches; in concatenation-based fusion, we implement this property in both feature extraction and feature fusion. In a nutshell, concatenation-based fusion enhances both efficiency and performance.
What if we use a query-based decoder? Many transformer-based models in vision tasks draw inspiration from the vanilla Transformer decoder. They incorporate a learnable query to extract specific target features from the encoder. For instance, in [55], they use object queries; in [56], they refer to target queries. However, our experimental findings reveal that a query-based decoder faces challenges, including slower convergence and subpar performance. The vanilla Transformer decoder, which is fundamentally a generative model, may not be the ideal choice for classification tasks. Furthermore, employing a single universal target query for all types of objects could potentially create a performance bottleneck. It is worth emphasizing that REACT primarily functions as an "encoder" model within the conventional Transformer encoder-decoder framework.
4 Experiment results
4.1 Experiment settings
Upstream task We consistently used the RoBERTa [57] model throughout all experiments to extract textual features, whereas the visual backbone varied across experiments to allow fair comparisons. We use the hyper-parameters T = 15 (JRDB-PAR) and T = 8 (Volleyball), N = 6 (JRDB-PAR) and N = 8 (Volleyball), \(\lambda _{\mathcal {L}_1}\) = 5, \(\lambda _{gIoU}\) = 2, and d = 256. We initialized temporal attention weights randomly, while spatial attention weights were initialized from a ViT model trained self-supervised on ImageNet-1K [58]. This initialization scheme facilitated faster convergence of the space-time ViT, as observed in the supervised setting [59]. We trained using the Adam optimizer [60] with a learning rate of \(5\times 10^{-4}\), decayed with a cosine schedule after a linear warm-up over five epochs [61, 62]. Additionally, we applied a weight decay scaled from 0.04 to 0.1 during training.
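The optimization recipe above (Adam, learning rate \(5\times 10^{-4}\), cosine decay after a five-epoch linear warm-up) can be sketched as follows; the model and the total epoch count are placeholders.

```python
# Sketch of the assumed pre-training schedule: linear warm-up then cosine decay.
import math
import torch

model = torch.nn.Linear(256, 256)                  # placeholder for the REACT model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
warmup_epochs, total_epochs = 5, 100               # total epoch count is a placeholder

def lr_scale(epoch):
    if epoch < warmup_epochs:                      # linear warm-up
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
# Call scheduler.step() once per training epoch.
```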
Downstream tasks We trained a linear classifier on our pretrained backbone. During training, the backbone was frozen, and we trained the classifier for 100 epochs with a batch size of 32 on a single NVIDIA-V100 GPU using SGD with an initial learning rate of 1e-3 and a cosine decay schedule. We also set the momentum to 0.9.
4.2 Dataset details
Volleyball Dataset [9]: With 55 videos and 4,830 labeled clips (3,493 training, 1,337 testing), this dataset includes annotations for individual actions and group activities with bounding boxes. In WSGAR experiments, we focus solely on group activity labels, excluding individual action annotations. Evaluation metrics include Multi-class Classification Accuracy (MCA) and Merged MCA, ensuring fair comparisons with existing methods.
JRDB-PAR Dataset [63]: Featuring 27 action categories, 11 social group activity categories, and 7 global activity categories, this dataset comprises 27 videos (20 training, 7 testing) with 27,920 frames and 628k human bounding boxes. Uniformly sampled keyframes (every 15 frames) are used for annotation and evaluation. Precision (\(\mathcal {P}_g\)), recall (\(\mathcal {R}_g\)), and F1-score (\(\mathcal {F}_g\)) are the evaluation metrics for social group activity recognition, treated as a multi-label classification problem [63].
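Treating social group activity recognition as multi-label classification, the per-sample metrics can be computed as in this small sketch (set-based precision, recall, and F1); the exact averaging protocol follows [63] and is not reproduced here.

```python
# Sketch of per-sample multi-label precision/recall/F1 for JRDB-PAR-style evaluation.
def multilabel_prf(pred_labels, gt_labels):
    pred, gt = set(pred_labels), set(gt_labels)
    tp = len(pred & gt)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gt) if gt else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Example: two of the three predicted activities match the ground truth
print(multilabel_prf({"walking", "talking", "queuing"}, {"walking", "talking"}))
```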
4.3 Classification task evaluation
Volleyball dataset. We evaluate our approach against the latest GAR and WSGAR methods under two supervision levels, fully supervised and weakly supervised, which differ in whether actor-level labels such as ground-truth bounding boxes and individual action class labels are used in training and inference. To ensure a fair comparison, we report the results of previous methods and reproduce results using only the RGB input and a ResNet-18 backbone, respectively. In the weakly supervised setting, we replace the group action classification labels with ground-truth bounding boxes of the actors without their corresponding actions, so that actor localization is learned during the pre-training stage. Table 1 presents the results, with the first and second sections showing the results of earlier techniques in fully supervised and weakly supervised settings, respectively. Our model trained on the ResNet-18 backbone outperforms most fully supervised frameworks, significantly improving the MCA and MPCA metrics. With the ViT-Base backbone, our approach significantly outperforms all GAR and WSGAR models in weakly supervised conditions, surpassing them by 2.4% MCA and 1.2% Merged MCA by leveraging spatiotemporal features through the transformer architecture. Moreover, our approach is better than current GAR methods that employ less thorough actor-level supervision, such as [12, 13, 18, 33, 36].
JRDB-PAR dataset We conducted a comparative study to evaluate our proposed approach against state-of-the-art GAR and WSGAR methods on the JRDB-PAR dataset, covering both fully supervised and weakly supervised settings. The comparison results are presented in Table 2. Our method significantly outperforms existing social group activity recognition frameworks in the fully supervised setting on all metrics. In the weakly supervised setting, our proposed method outperformed existing GAR and WSGAR methods by a considerable margin, with gains of 1.2 in \(\mathcal {P}_g\), 1.7 in \(\mathcal {R}_g\), and 2.3 in \(\mathcal {F}_g\). Additionally, we evaluated this dataset using ResNet-18, ViT-B/16, and ViT-B/32 backbones, with ViT-B/16 proving the best (results presented in Table 2), as further analyzed in the Appendix. Despite the strong WSGAR performance of prior methods, our approach outperformed them all.
4.4 Human action retrieval evaluation
Our experimental results in Table 3 show that our proposed method outperforms existing state-of-the-art approaches in terms of the R@K metric. Specifically, our method achieved a significantly higher R@1 score than the other methods, by a margin of 4.8. Additionally, our method achieved higher R@5 and R@10 scores than all other methods, demonstrating its superior performance in video object retrieval. Furthermore, our method demonstrated more consistent performance across the two datasets and experimental setups. We attribute this to our method’s ability to effectively capture the global and local features of the video frames and its incorporation of text embeddings for improved semantic understanding of the video content.
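For reference, the R@K metric reported in Table 3 measures the fraction of queries whose correct item appears among the top-K retrieved results, as in this illustrative sketch.

```python
# Sketch of Recall@K: fraction of queries whose ground-truth item is in the top-K results.
def recall_at_k(ranked_results, ground_truth, k):
    hits = sum(1 for query, gt in ground_truth.items()
               if gt in ranked_results[query][:k])
    return hits / len(ground_truth)

# Example with two queries; "q2" misses at K=1 but would hit at K=2
ranked = {"q1": ["v3", "v1"], "q2": ["v5", "v2"]}
gt = {"q1": "v3", "q2": "v2"}
print(recall_at_k(ranked, gt, k=1))   # 0.5
```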
4.5 Ablation study
In this section, we verify the effectiveness of each component in the REACT framework on downstream group activity recognition tasks and examine the significance of the actor fusion block in the proposed method.
4.5.1 Effectiveness of individual components
Table 4 presents the results of the ablation study, focusing on the impact of different components on group activity recognition tasks. Regardless of the visual backbone used, we observe notable improvements in performance when incorporating various components such as feature extraction, encoding, and decoding. Specifically, experiments demonstrate that the inclusion of these components leads to substantial gains in both the JRDB-PAR and Volleyball datasets.
For instance, when using ResNet-18 as the visual backbone, the addition of feature extraction, encoding, and decoding modules results in performance improvements of 4.8% and 4.1% on the JRDB-PAR and Volleyball datasets, respectively. Similarly, employing ViT-B/32 and ViT-B/16 as visual backbones leads to significant performance enhancements across both datasets when incorporating these components.
These results underscore the importance of each component in the REACT framework, highlighting their collective contribution to improving group activity recognition performance.
4.5.2 Significance of actor fusion block
Table 5 specifically focuses on the impact of the actor fusion block on the architecture’s performance. Utilizing ViT-B/32 as the visual backbone, we observe a substantial performance boost across both datasets when incorporating the actor fusion block. In particular, the inclusion of the actor fusion block leads to a remarkable improvement of 13.5% and 20.6% in group activity recognition performance on the JRDB-PAR and Volleyball datasets, respectively.
These results demonstrate the crucial role played by the actor fusion block in enhancing the REACT framework’s effectiveness. By effectively fusing actor-specific details and contextual information, the actor fusion block enables the model to achieve more accurate and robust predictions, thereby improving overall performance on group activity recognition tasks.
In summary, the ablation study results provide empirical evidence supporting the efficacy of individual components and the significance of the actor fusion block in the REACT framework. These findings further validate the design choices and architectural decisions made in our proposed approach.
4.6 Visualization
In order to gain insights into the visualization capabilities of our proposed method and demonstrate its significance, we conducted an analysis that involved examining the visualizations based on the input action query. The visualizations are presented in Fig. 4, which showcases the localization of actions based on the input query for the JRDB-PAR and Volleyball datasets. The results of the decoder’s generated proposals during the inference stage indicate that our method effectively localizes the specified actions. It demonstrates the ability of our framework to accurately recognize and pinpoint specific actions within a group activity context. Furthermore, the t-SNE plots are visualized in the Appendix. The results illustrated in these plots demonstrate that the proposed framework effectively recognizes all actions within a given video. Moreover, the framework enhances the overall performance in group activity recognition.
5 Conclusion
We introduce REACT, a novel video transformer-based model for group activity recognition using contrastive learning. Our approach generates diverse spatio-temporal views from a single video, leveraging different scales and frame rates. Correspondence learning tasks capture motion properties and cross-view relationships between sampled clips and textual information. The core of our approach is a contrastive learning objective that reconstructs video-text modalities in the latent space. The input–output pair in the V2T architecture is reversed from that of the T2V network topology. Using the fundamental cGAN architecture, our network is made up of a Discriminator D and a Generator G. REACT effectively models long-range spatio-temporal dependencies and performs dynamic inference within a single architecture. REACT outperforms state-of-the-art models in group activity recognition, showcasing its superior performance.
Limitations While our methods excel in group activity classification and human action-retrieval tasks, they may need to be optimized for other visual-language problems due to the lack of textual descriptions in the datasets. Future research directions could focus on incorporating additional data involving question-answering, phrases for video grounding, and other related textual information to enhance the versatility of our methods.
Data availability
No datasets were generated or analysed during the current study.
References
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV, pp. 20–36. Springer (2016)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: CVPR, pp. 6299–6308 (2017)
Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)
Ranasinghe, K., Naseer, M., Khan, S., Khan, F.S., Ryoo, M.S.: Self-supervised video transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2874–2884 (2022)
Nguyen, T.-T., Nguyen, P., Luu, K.: Hig: Hierarchical interlacement graph approach to scene graph generation in video understanding. arXiv preprint arXiv:2312.03050 (2023)
Nguyen, P., Quach, K.G., Duong, C.N., Phung, S.L., Le, N., Luu, K.: Multi-camera multi-object tracking on the move via single-stage global association approach. arXiv preprint arXiv:2211.09663 (2022)
Nguyen, P., Quach, K.G., Kitani, K., Luu, K.: Type-to-track: Retrieve any object via prompt-based tracking. In: Advances in Neural Information Processing Systems 36 (2024)
Quach, K.G., Le, N., Duong, C.N., Jalata, I., Roy, K., Luu, K.: Non-volume preserving-based fusion to group-level emotion recognition on crowd videos. Pattern Recogn. 128, 108646 (2022)
Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: CVPR, pp. 1971–1980 (2016)
Wu, J., Wang, L., Wang, L., Guo, J., Wu, G.: Learning actor relation graphs for group activity recognition. In: CVPR, pp. 9964–9974 (2019)
Hu, G., Cui, B., He, Y., Yu, S.: Progressive relation learning for group activity recognition. In: CVPR, pp. 980–989 (2020)
Gavrilyuk, K., Sanford, R., Javan, M., Snoek, C.G.: Actor-transformers for group activity recognition. In: CVPR, pp. 839–848 (2020)
Pramono, R.R.A., Chen, Y.T., Fang, W.H.: Empowering relational network by self-attention augmented conditional random fields for group activity recognition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, August 23–28, 2020, Proceedings, Part I 16, pp. 71–90. Springer (2020)
Ehsanpour, M., Abedin, A., Saleh, F., Shi, J., Reid, I., Rezatofighi, H.: Joint learning of social groups, individuals action and sub-group activities in videos. In: ECCV, pp. 177–195. Springer (2020)
Yan, R., Xie, L., Tang, J., Shu, X., Tian, Q.: HIGCIN: hierarchical graph-based cross inference network for group activity recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2020). https://doi.org/10.1109/TPAMI.2020.3034233
Yuan, H., Ni, D.: Learning visual context for group activity recognition. In: AAAI, vol. 35, pp. 3261–3269 (2021)
Li, S., Cao, Q., Liu, L., Yang, K., Liu, S., Hou, J., Yi, S.: Groupformer: Group activity recognition with clustered spatial-temporal transformer. In: ICCV (2021)
Bagautdinov, T., Alahi, A., Fleuret, F., Fua, P., Savarese, S.: Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In: CVPR, pp. 4315–4324 (2017)
Zhang, P., Tang, Y., Hu, J.-F., Zheng, W.-S.: Fast collective activity recognition under weak supervision. IEEE Trans. Image Process. 29, 29–43 (2019)
Yan, R., Xie, L., Tang, J., Shu, X., Tian, Q.: Social adaptive module for weakly-supervised group activity recognition. In: ECCV, pp. 208–224. Springer (2020)
Kim, D., Lee, J., Cho, M., Kwak, S.: Detector-free weakly supervised group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20083–20093 (2022)
Amer, M.R., Xie, D., Zhao, M., Todorovic, S., Zhu, S.-C.: Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In: ECCV, pp. 187–200. Springer (2012)
Amer, M.R., Todorovic, S., Fern, A., Zhu, S.-C.: Monte carlo tree search for scheduling activity recognition. In: ICCV, pp. 1353–1360 (2013)
Amer, M.R., Lei, P., Todorovic, S.: HIRF: Hierarchical random field for collective activity recognition in videos. In: ECCV, pp. 572–585. Springer (2014)
Amer, M.R., Todorovic, S.: Sum product networks for activity recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38(4), 800–813 (2015)
Lan, T., Wang, Y., Yang, W., Robinovitch, S.N., Mori, G.: Discriminative latent models for recognizing contextual group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34(8), 1549–1562 (2011)
Lan, T., Sigal, L., Mori, G.: Social roles in hierarchical models for human activity recognition. In: CVPR, pp. 1354–1361. IEEE (2012)
Shu, T., Xie, D., Rothrock, B., Todorovic, S., Chun Zhu, S.: Joint inference of groups, events and human roles in aerial videos. In: CVPR, pp. 4576–4584 (2015)
Wang, Z., Shi, Q., Shen, C., Van Den Hengel, A.: Bilinear programming for human activity recognition with unknown mrf graphs. In: CVPR, pp. 1690–1697 (2013)
Deng, Z., Vahdat, A., Hu, H., Mori, G.: Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In: CVPR, pp. 4772–4781 (2016)
Ibrahim, M.S., Mori, G.: Hierarchical relational networks for group activity recognition and retrieval. In: ECCV, pp. 721–736 (2018)
Li, X., Choo Chuah, M.: Sbgar: Semantics based group activity recognition. In: ICCV, pp. 2876–2885 (2017)
Qi, M., Qin, J., Li, A., Wang, Y., Luo, J., Van Gool, L.: stagnet: An attentive semantic RNN for group activity recognition. In: ECCV, pp. 101–117 (2018)
Shu, X., Tang, J., Qi, G., Liu, W., Yang, J.: Hierarchical long short-term concurrent memory for human interaction recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2019). https://doi.org/10.1109/TPAMI.2019.2942030
Wang, M., Ni, B., Yang, X.: Recurrent modeling of interaction context for collective activity recognition. In: CVPR, pp. 3048–3056 (2017)
Yan, R., Tang, J., Shu, X., Li, Z., Tian, Q.: Participation-contributed temporal dynamic model for group activity recognition. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1292–1300 (2018)
Yuan, H., Ni, D., Wang, M.: Spatio-temporal dynamic inference network for group activity recognition. In: ICCV (2021)
Han, M., Zhang, D.J., Wang, Y., Yan, R., Yao, L., Chang, X., Qiao, Y.: Dual-ai: Dual-path actor interaction learning for group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2990–2999 (2022)
Tamura, M., Vishwakarma, R., Vennelakanti, R.: Hunting group clues with transformers for social group activity recognition. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, pp. 19–35. Springer (2022)
Zhang, Y., Li, X., Marsic, I.: Multi-label activity recognition using activity-specific features and activity correlations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14625–14635 (2021)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Li, M., Cai, W., Liu, R., Weng, Y., Zhao, X., Wang, C., Chen, X., Liu, Z., Pan, C., Li, M., et al. Ffa-ir: Towards an explainable and reliable medical report generation benchmark. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F.E., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv preprint arXiv:2101.11986 (2021)
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122 (2021)
Han, M., Wang, Y., Chang, X., Qiao, Y.: Mining inter-video proposal relations for video object detection. In: European Conference on Computer Vision, pp. 431–446 (2020). Springer
Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., Schmid, C.: Vivit: A video vision transformer. arXiv preprint arXiv:2103.15691 (2021)
Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: UniFormer: Unifying Convolution and Self-attention for Visual Recognition (2022)
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? ICML 2, 4 (2021)
Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. arXiv preprint arXiv:2104.11227 (2021)
Patrick, M., Campbell, D., Asano, Y.M., Metze, I.M.F., Feichtenhofer, C., Vedaldi, A., Henriques, J., et al.: Keeping your eye on the ball: Trajectory attention in video transformers. In: NeurIPS (2021)
Yang, A., Miech, A., Sivic, J., Laptev, I., Schmid, C.: Tubedetr: Spatio-temporal video grounding with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16442–16453 (2022)
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666 (2019)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV, pp. 213–229 (2020). Springer
Yan, B., Peng, H., Fu, J., Wang, D., Lu, H.: Learning spatio-temporal transformer for visual tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10448–10457 (2021)
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge. In: IJCV (2015)
Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? ICML 2, 4 (2021)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., Beyer, L.: How to train your vit? Data, augmentation, and regularization in vision transformers (2021)
Chen*, X., Xie*, S., He, K.: An empirical study of training self-supervised vision transformers (2021)
Han, R., Yan, H., Li, J., Wang, S., Feng, W., Wang, S.: Panoramic human activity recognition. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, pp. 244–261. Springer (2022)
Azar, S.M., Atigh, M.G., Nickabadi, A., Alahi, A.: Convolutional relational machine for group activity recognition. In: CVPR, pp. 7892–7901 (2019)
Chappa, N.V., Nguyen, P., Nelson, A.H., Seo, H.-S., Li, X., Dobbs, P.D., Luu, K.: Sogar: Self-supervised spatiotemporal attention-based social group activity recognition. arXiv preprint arXiv:2305.06310 (2023)
Chappa, N.V., Nguyen, P., Nelson, A.H., Seo, H.-S., Li, X., Dobbs, P.D., Luu, K.: Spartan: Self-supervised spatiotemporal transformers approach to group activity recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5157–5167 (2023)
Ehsanpour, M., Saleh, F., Savarese, S., Reid, I., Rezatofighi, H.: Jrdb-act: A large-scale dataset for spatio-temporal action, social group and activity detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20983–20992 (2022)
Author information
Contributions
NC and PN conceived of the presented idea. NC developed the theory, prepared the figures and performed the experiments. PN verified the experimental results. KL and PD encouraged NC to investigate vision-language modeling and supervised the findings of this work. All authors discussed the results and contributed to the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no Conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.