1 Introduction

Group activity recognition (GAR) has emerged as a crucial problem in computer vision, with numerous applications in sports video analysis, video monitoring, and social scene understanding. Unlike conventional action recognition methods that focus on identifying individual actions, GAR aims to classify the actions of a group of people in a given video clip as a whole (Fig. 1). It requires a deeper understanding of the interactions between multiple actors, including accurate localization of actors and modeling their spatiotemporal relationships [1,2,3,4,5,6,7,8]. As a result, GAR poses fundamental challenges that must be addressed to develop practical solutions for this problem. In this context, developing novel techniques for group activity recognition has become an active area of research in computer vision.

Fig. 1

An example of the REACT model's response to a user's input query. The user provides a video sequence and an action prompt; the model then localizes the actors corresponding to each requested action in the scene and outputs the overall group activity. Best viewed in color and zoomed in

Existing GAR methods require ground-truth bounding boxes and action class labels for training and testing [9,10,11,12,13,14,15,16,17]. We compare our method with these previous methods in Fig. 2. Bounding box labels are used to extract actor features and their spatio-temporal relations, which are then aggregated to form a group-level video representation for classification. However, the reliance on bounding boxes and the substantial annotation effort they require severely limits the applicability of these methods. To address this limitation, some methods jointly train person detection and group activity recognition using bounding box labels [18, 19]. Another approach is weakly supervised (WSGAR) learning [20, 21], which does not require individual actor-level labels for training and inference.

Fig. 2

Comparison between prior methods and our approach. Prior methods perform a single classification/detection task under full supervision, whereas our approach performs group activity classification and query-based action detection simultaneously. Best viewed in color and zoomed in

Yan et al. [20] proposed a WSGAR learning approach that uses a pre-trained detector to generate actor box proposals and learns to prune irrelevant ones. However, this method suffers from missing detections when actors are occluded. Kim et al. [21] introduced a detector-free method that captures actor information using partial contexts of token embeddings, but this method can only learn when there is motion across consecutive frames. Moreover, Kim et al. [21] did not consider the consistency of temporal information among different tokens. Hence, there is a need for a GAR approach that can capture temporal information accurately without relying on bounding box annotations or detector-based pipelines.

1.1 Our contributions in this work

Inspired by attention mechanisms in video contexts, we introduce a transformer-based architecture, depicted in Fig. 3, to model complex interactions within videos, incorporating temporal, spatial, and multi-modal features. The Vision-Language (VL) Encoder block (Sect. 3.1) is a pivotal element that encodes sparse spatial and multi-modal interactions and includes a dedicated fast branch for recovering temporal details. Our architecture integrates an Action Decoder Block (Sect. 3.2) that refines the understanding of text and video data; it balances context and actor-specific details by merging the outputs of the Encoder and Actor Fusion Blocks. It applies temporal-spatial attention to capture temporal and spatial dependencies, and temporal cross-attention, aligned with ground-truth bounding boxes during training, to retrieve bounding boxes precisely, bridging semantic and visual contexts. The Actor Fusion Block (Sect. 3.3), positioned at the core, blends actor-specific information and textual features. Through concatenation and averaging operations, it builds a holistic representation, further enriched by convolution operations that extract local patterns. This refined representation contributes to contextually relevant outputs in the Decoder Block. In practical terms, our model's efficacy is demonstrated in Fig. 1 through an example response to a user's action query. The results show superior performance in accurately recognizing and understanding group activities compared to state-of-the-art methods. Our experiments provide empirical evidence of the significant performance gains achieved by our framework, affirming its potential for diverse real-world applications.

2 Related work

2.1 Group activity recognition (GAR)

In the field of action recognition, group action recognition has become an increasingly popular research topic due to its wide range of applications in various fields, such as video surveillance, human-robot interaction, and sports analysis. GAR aims to identify the actions performed by a group of individuals and the interactions between them.

Initially, researchers in the field of GAR used probabilistic graphical methods and AND-OR grammar methods to process the extracted features [22,23,24,25,26,27,28,29]. However, with the advancement of deep learning techniques, methods involving convolutional neural networks (CNN) and recurrent neural networks (RNN) achieved outstanding performance due to their ability to learn high-level information and temporal context [9, 18, 30,31,32,33,34,35,36].

Recent methods for identifying group actions mainly utilize attention-based models and require explicit actor representations to model spatial-temporal relations in group activities [10,11,12,13,14, 17, 20, 37]. For example, graph convolution networks are used to learn spatial and temporal information of actors by constructing relational graphs, and spatial and temporal relation graphs are used to infer actor links. Clustered attention is used to capture contextual spatio-temporal information, and transformer encoder-based techniques with different backbone networks extract features for learning actor interactions from multimodal inputs [12]. Additionally, MAC-Loss [38], a combination of spatial and temporal transformers in two complementary orders, has been proposed to enhance the learning effectiveness of actor interactions and preserve actor consistency at the frame and video levels. Tamura et al. [39] introduced a framework that recognizes social group activities and identifies group members without heuristic features; the relevant information is embedded into the features, allowing group members to be identified easily. These recent advancements in GAR have made significant progress toward recognizing complex actions performed by a group of individuals in various settings.

Fig. 3

Overall architecture of the proposed REACT network. The visual and textual representation learning components of our approach incorporate multi-level feature representations. The extracted features are passed through the contextual relationship modeling block to obtain the concatenated multi-modality features, which are then passed through the prompt action retrieval block to obtain the bounding boxes detected according to the prompt

2.2 Weakly supervised group activity recognition (WSGAR)

Various techniques have been developed to address GAR with limited supervision, such as using bounding boxes to train built-in detectors or activity maps. WSGAR is one such approach: it does not rely on bounding box annotations during training or inference and instead includes an off-the-shelf object detector in the model. Traditional GAR approaches require accurate annotations of individual actors and their actions, which can be challenging and time-consuming to obtain. Weakly supervised methods aim to relax these requirements by learning from more readily available data such as activity labels, bounding boxes, or video-level labels. Zhang et al. [40] proposed a technique that employs activity-specific characteristics to enhance WSGAR, although it is not mainly designed for GAR. Kim et al. [21] proposed a detector-free approach that uses transformer encoders to extract motion features. In contrast, we propose a self-supervised training method specialized for WSGAR that does not require actor-level annotations, object detectors, or labels.

2.3 Transformers in vision

The transformer architecture was first introduced by Vaswani et al. [41] for sequence-to-sequence machine translation and has since been widely applied to various natural language processing tasks. Dosovitskiy et al. [42] introduced a convolution-free transformer architecture for image recognition. Several works [43,44,45,46] used the transformer architecture as a general backbone for various downstream computer vision tasks, achieving remarkable progress. In the video domain, many approaches [47,48,49,50,51,52,53] utilize spatial and temporal self-attention to learn video representations effectively. Bertasius et al. [50] explored different mechanisms of space and time attention to learn spatiotemporal features efficiently. Fan et al. [51] used multiscale feature aggregation to improve feature learning. Patrick et al. [52] introduced a self-attention block that focuses on trajectories, tracking space-time patches in a video transformer.

3 The proposed method

As shown in Fig. 3, given the input video \(\textbf{X}\) and the textual input \(\textbf{t}\), the REACT model aims to localize all the positions \(\textbf{b}\) in the video \(\textbf{X}\) that consists of the group actions of interest based on the input prompt \(\textbf{t}\). By formulating the REACT model, we seek to address the challenge of understanding complex group dynamics and activities within videos, which is crucial for various applications such as video surveillance, activity recognition, and human behavior analysis. Formally, the REACT model can be formulated as follows:

$$\begin{aligned} \hat{\textbf{b}} = \textit{AD}\Big (\mathbb {F}\big (\textit{VE}(\textbf{X}), \textit{TE}(\textbf{t})\big )\Big ) \end{aligned}$$
(1)

where \(\hat{\textbf{b}} \in [0,1]^{T\times 4}\) is the normalized predicted bounding box coordinates, VE and TE are the visual and textual encoders to extract the feature representations of the video input \(\textbf{X}\), i.e., \(\varvec{v_f}\) = VE(\(\textbf{X}\)), and the textual prompt \(\textbf{t}\), i.e., \(\varvec{t_f}\) = TE(\(\textbf{t}\)), respectively. The motivation behind employing visual and textual encoders is to capture both visual and semantic information, facilitating a comprehensive understanding of group actions in videos.

\(\mathbb {F}\) is the correlation model to exploit the contextual relationship between visual features \(\varvec{v_f}\) and the textual features \(\varvec{t_f}\). The motivation behind using a correlation model is to integrate information from both modalities effectively, enhancing the model’s ability to localize group actions accurately.

Then, the Action Decoder (AD) decodes the contextual features produced by the correlation model \(\mathbb {F}\) to localize all the group action regions of interest. The motivation behind the Action Decoder is to transform the learned features into interpretable predictions of group action positions, enabling actionable insights from video data. Fig. 3 illustrates the proposed REACT framework.
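To make the composition in Eq. (1) concrete, the following PyTorch-style sketch shows how the modules could be wired together; the module names and interfaces (the visual encoder VE, textual encoder TE, correlation model, and decoder) are placeholders for illustration, not the released implementation.

```python
import torch.nn as nn

class REACTSketch(nn.Module):
    """Minimal sketch of Eq. (1): b_hat = AD(F(VE(X), TE(t))).
    All sub-modules are assumed to be nn.Module instances."""

    def __init__(self, visual_encoder, textual_encoder, correlation, action_decoder):
        super().__init__()
        self.ve = visual_encoder    # VE: video X -> v_f, shape (B, T, HW, d)
        self.te = textual_encoder   # TE: prompt t -> t_f, shape (B, L, d)
        self.f = correlation        # F: cross-modal correlation model
        self.ad = action_decoder    # AD: contextual features -> boxes

    def forward(self, video, prompt):
        v_f = self.ve(video)        # visual features
        t_f = self.te(prompt)       # textual features
        vt_f = self.f(v_f, t_f)     # shared video-text representation
        return self.ad(vt_f)        # normalized boxes in [0, 1]^{T x 4}
```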

Contextual relationship modeling in GAR involves modeling multi-modality and spatio-temporal features. To cover these correlations efficiently, we model the correlation function \(\mathbb {F}\) with an attention mechanism. In particular, the correlation model \(\mathbb {F}\) is designed as a Transformer network. It is the centerpiece of our proposed approach and comprises the components needed for integrated modeling of temporal, spatial, and multi-modal interactions: it encodes spatial and multi-modal interactions while efficiently processing sparsely sampled frames. Concretely, \(\mathbb {F}\) is a transformer block that mixes text and video representations using cross-attention. Formally, \(\mathbb {F}\) can be formed as follows:

$$\begin{aligned} \varvec{\widehat{vt_f}}&= \mathbb {F}\big (\varvec{v_f},\varvec{t_f}\big ) \nonumber \\&= \textit{FFN}\Big (\textit{T2V}\big (\varvec{t_f}, \varvec{v_f}\big ), \textit{V2T}\big (\varvec{v_f}, \varvec{t_f}\big )\Big ) \end{aligned}$$
(2)

where \(\varvec{\widehat{vt_f}}\) is the shared representation of video and text, FFN is a feed-forward neural network, and T2V and V2T are the cross-attention operations between the visual and textual features. T2V (text-to-video) attention uses the textual features as queries and attends over the video features, grounding each word or phrase in the relevant spatial and temporal regions of the video. V2T (video-to-text) attention works in the opposite direction, using the video features as queries over the textual features so that each video token is enriched with the semantics of the prompt. Thanks to the attention mechanism, the correlation model \(\mathbb {F}\) produces a shared representation that captures the contextual relationships between text and video in both spatial and temporal dimensions and encapsulates a joint understanding of both modalities, enabling the model to bridge the semantic gap between them effectively.
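A minimal sketch of how the correlation block in Eq. (2) could be realized with standard attention layers is shown below; the use of `nn.MultiheadAttention`, the concatenation of the T2V and V2T outputs before the FFN, and the residual/normalization placement are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossModalCorrelation(nn.Module):
    """Sketch of the correlation model F in Eq. (2):
    bidirectional cross-attention (T2V, V2T) followed by an FFN."""

    def __init__(self, d=256, heads=8):
        super().__init__()
        self.t2v = nn.MultiheadAttention(d, heads, batch_first=True)  # text queries attend to video
        self.v2t = nn.MultiheadAttention(d, heads, batch_first=True)  # video queries attend to text
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm = nn.LayerNorm(d)

    def forward(self, v_f, t_f):
        # v_f: (B, T*HW, d) flattened video tokens, t_f: (B, L, d) text tokens
        t2v_out, _ = self.t2v(query=t_f, key=v_f, value=v_f)  # text enriched by video
        v2t_out, _ = self.v2t(query=v_f, key=t_f, value=t_f)  # video enriched by text
        vt_f = torch.cat([t2v_out, v2t_out], dim=1)            # shared token sequence (assumed concat)
        return self.norm(self.ffn(vt_f) + vt_f)                # residual FFN over the joint tokens
```

Keeping both directions of cross-attention gives the shared representation access to word-conditioned video evidence and video-conditioned word semantics at the same time.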

From the shared contextual feature representations produced by the correlation model \(\mathbb {F}\), the action decoder aims to localize all the group action positions of interest \(\textbf{b}\). In practice, the group action positions \(\textbf{b}\) may consist of a list of positions of interest \(\textbf{b} = \{\mathbf {b_1}, \mathbf {b_2}, \ldots , \mathbf {b_N}\}\), where N is the number of group action positions. For example, in a military mission, the group action of enemies could include the positions of several army groups and adversarial armed devices. Additionally, predicting a particular group action position requires a broader understanding of the contextual feature representation and of the surrounding predicted group action positions. Therefore, to model the group action decoder efficiently, the positions of the currently predicted group actions are taken into account in addition to the shared feature representation. Formally, the Group Action Decoder can be modeled as follows:

$$\begin{aligned} \mathbf {\widehat{b_i}}&= \textit{AD}\Big (\varvec{\widehat{vt_f}}, \textit{AF}\big (\varvec{t_f}, \{\mathbf {b_1}, \mathbf {b_2}, \ldots , \mathbf {b_N}\}\big )\Big ) \nonumber \\&= \textit{AD}\big (\varvec{\widehat{vt_f}}, \varvec{\widehat{bt_f}}\big ) \end{aligned}$$
(3)

where AF (Actor Fusion) is the model that exploits the correlation between the currently predicted positions of group actions and the textual features. Intuitively, the list of group action positions \(\{\mathbf {b_1}, \ldots , \mathbf {b_{i-1}}\}\) provides the spatial context of subjects in the video. Meanwhile, the textual features \(\varvec{t_f}\) represent the linguistic context and semantics related to the group actions and offer a view complementary to the visual information. The AF model distinguishes between capturing detailed spatial information from the group action positions and emphasizing the broader action-related information from the textual features. Finally, the Action Decoder (AD) model predicts the position of the group action \(\mathbf {b_i}\) based on the contextual feature representation \(\varvec{\widehat{vt_f}}\) and the action fusion feature \(\varvec{\widehat{bt_f}}\) produced by the AF model. In essence, the AD model uses the rich information from the correlation model \(\mathbb {F}\) and the fusion features from the AF model to associate the textual query \(\varvec{t_f}\) with specific regions or subjects within the video frames. To exploit the correlation of these features efficiently, we design AD with an attention mechanism, which enables the AD model to precisely identify and highlight the regions within the video that correspond to the described actions.

3.1 Vision-language (VL) encoder

The VL Encoder receives input features from the respective backbone networks. These backbones are specialized models designed to extract meaningful information, producing text features \(\varvec{t_f} \in \mathbb {R}^{L \times d}\) for a sequence of L tokens and video features \(\varvec{v_f} \in \mathbb {R}^{T \times HW \times d}\) for T frames with HW spatial locations, each independently optimized for its own modality.

The VL Encoder receives the features \(\varvec{t_f}\) and \(\varvec{v_f}\). It also uses visual features derived from the video, including frames, motion patterns, bounding boxes, and object presence.

The VL Encoder utilizes attention mechanisms to comprehend these features effectively. It begins with self-attention on text, allowing it to weigh the importance of words and understand the context within the text. Simultaneously, it performs temporal self-attention on video features to capture temporal dependencies and dynamics.

The model then progresses to cross-attention, connecting the video and text modalities. The "video-to-text" step focuses on video features and aligns them with relevant text. The "text-to-video" step attends to video elements based on the textual context. This bidirectional interaction ensures a holistic understanding of both modalities.

Finally, the information from self-attention and cross-attention is combined, creating multi-modal features. These features merge enhanced text and video information, bridging the gap between the two modalities. The VL Encoder facilitates joint text and video analysis, benefiting tasks like video captioning and action recognition. It empowers models to grasp contextual relationships between text and video, enhancing multi-modal understanding.
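The sketch below illustrates the divided temporal-then-spatial self-attention applied to the video tokens inside the VL Encoder; the factorized (TimeSformer-style) attention order, the layer-normalization placement, and the tensor layout are assumptions made for illustration.

```python
import torch.nn as nn

class DividedSpaceTimeSelfAttention(nn.Module):
    """Sketch of the video branch of the VL Encoder: temporal self-attention
    across frames followed by spatial self-attention within each frame."""

    def __init__(self, d=256, heads=8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(d)
        self.norm_s = nn.LayerNorm(d)

    def forward(self, v_f):
        # v_f: (B, T, HW, d) video tokens
        B, T, HW, d = v_f.shape
        # temporal attention: each spatial location attends across the T frames
        x = v_f.permute(0, 2, 1, 3).reshape(B * HW, T, d)
        t_out, _ = self.temporal(x, x, x)
        x = self.norm_t(x + t_out).reshape(B, HW, T, d).permute(0, 2, 1, 3)
        # spatial attention: each frame attends across its HW locations
        y = x.reshape(B * T, HW, d)
        s_out, _ = self.spatial(y, y, y)
        return self.norm_s(y + s_out).reshape(B, T, HW, d)
```

The output video tokens, together with the self-attended text tokens, would then feed the bidirectional cross-attention described above.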

3.2 Action decoder (AD) block

The AD Block is a crucial part of the architecture responsible for producing meaningful bounding box outputs. It combines information from the encoder and the actor fusion blocks.

First, it concatenates outputs from these sources. The Encoder Block provides a broad multi-modal understanding of text and video, while the Actor Fusion Block adds actor-specific details like normalized text features and actor bounding boxes. This merging process combines contextual understanding with fine-grained actor-specific information.

Then, the AD Block employs temporal spatial attention. This mechanism allows the model to consider temporal (sequential) and spatial (object relationships) aspects in video data. It is particularly valuable for tasks like action recognition and actor localization.

Next, it uses temporal cross-attention, incorporating ground-truth bounding box data. Ground-truth bounding boxes precisely locate objects or actors in video frames. This step refines the model’s understanding of object and actor movements over time, aligning textual descriptions with the video content.

Finally, the AD Block generates bounding boxes as grounding outputs based on text queries. It leverages the information gathered from previous stages, aligning textual descriptions with specific regions or objects in video frames. This output is vital for tasks like actor localization and action recognition, connecting text with visual content to pinpoint regions corresponding to described actions or objects.
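As a rough illustration of the steps above, the sketch below fuses encoder tokens with actor-fusion tokens, applies self-attention and cross-attention, and regresses normalized boxes; the specific layer arrangement, the use of the actor tokens as cross-attention queries, and the box head are assumptions, not the exact published design.

```python
import torch
import torch.nn as nn

class ActionDecoderBlock(nn.Module):
    """Sketch of the AD block: fuse encoder and actor-fusion features,
    attend over the fused tokens, and regress normalized boxes."""

    def __init__(self, d=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)   # temporal-spatial attention
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # temporal cross-attention
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.box_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))

    def forward(self, vt_f, bt_f):
        # vt_f: (B, N_ctx, d) shared video-text tokens from the encoder
        # bt_f: (B, N_act, d) actor-fusion tokens from the AF block
        x = torch.cat([vt_f, bt_f], dim=1)                   # merge context and actor details
        a, _ = self.self_attn(x, x, x)
        x = self.norm1(x + a)
        c, _ = self.cross_attn(query=bt_f, key=x, value=x)   # actor tokens gather temporal context
        q = self.norm2(bt_f + c)
        return self.box_head(q).sigmoid()                    # normalized boxes in [0, 1]^4 per token
```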

3.3 Actor fusion (AF) block

The AF block begins by combining two vital sources of information, including the bounding box data \(\hat{\textbf{b}}\) and the normalized text features \(\mathbf {t_f}\). The bounding box data provides precise location details for objects or actors in video frames, enhancing the model’s spatial understanding.

Simultaneously, normalized text features offer textual descriptions and contextual information related to actions or objects in the video, providing a linguistic perspective.

After combining these sources, the AF block calculates an average over the concatenated features. This averaging serves multiple purposes. It creates a balanced representation that includes actor-specific details (from the bounding box) and semantic context (from the text features). Additionally, it emphasizes action-related information, aligning with the action recognition task.

The AF block uses convolution operations to refine the feature representation further (presented as \(\widehat{\mathbf {bt_f}}\)). Convolutions are essential in deep learning for capturing local data patterns. They extract and highlight relevant spatial and temporal features from the combined actor-specific information in this context. These convolutional operations are the key to improving the representation before passing it back to the AD Block.
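A compact sketch of this block is given below; the linear box embedding, the broadcasting of a pooled text context before averaging, and the 1-D convolution kernel size are assumptions chosen to illustrate the combine-average-convolve pattern described above.

```python
import torch
import torch.nn as nn

class ActorFusionBlock(nn.Module):
    """Sketch of the AF block: embed boxes, blend with normalized text
    features, and refine with a 1-D convolution."""

    def __init__(self, d=256):
        super().__init__()
        self.box_embed = nn.Linear(4, d)                       # lift box coordinates into feature space
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)  # extract local patterns over actor tokens

    def forward(self, boxes, t_f):
        # boxes: (B, N, 4) normalized actor boxes, t_f: (B, L, d) normalized text features
        b_f = self.box_embed(boxes)                                          # (B, N, d)
        t_ctx = t_f.mean(dim=1, keepdim=True).expand(-1, b_f.size(1), -1)    # broadcast text context
        fused = torch.stack([b_f, t_ctx], dim=0).mean(dim=0)                 # average actor and text views
        return self.conv(fused.transpose(1, 2)).transpose(1, 2)              # refined bt_f, (B, N, d)
```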

3.4 Training loss

The training data consists of sets of frames, where each set belongs to a video annotated with a text label s representing the group activity and individual actions, together with the corresponding set of bounding boxes b.

We train our architecture with a linear combination of losses

$$\begin{aligned} \mathcal {L}&= \lambda _{\mathcal {L}_1}\mathcal {L}_{\mathcal {L}_1}(\hat{b}, b) + \lambda _{gIoU}\mathcal {L}_{gIoU}(\hat{b}, b) \nonumber \\&\quad + \lambda _{gIoU}\mathcal {L}_{gIoU}(\hat{n}, n) \end{aligned}$$
(4)

where \(b\) denotes the normalized ground-truth box coordinates, \(\hat{b}\) the predicted bounding boxes, and the \(\lambda _{\bullet }\) are scalar weights of the individual losses. \(\mathcal {L}_{\mathcal {L}_1}\) is an \(\mathcal {L}_1\) loss on the bounding box coordinates. \(\mathcal {L}_{gIoU}\) is a generalized intersection-over-union (IoU) loss [54] on the bounding boxes. Both \(\mathcal {L}_1\) and \(\mathcal {L}_{gIoU}\) are used for spatial and temporal grounding. Losses are computed at each layer of the decoder, following [55].
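A self-contained sketch of the combined objective is shown below; it assumes boxes are given in normalized (x1, y1, x2, y2) format, omits the third term of Eq. (4), whose arguments (\(\hat{n}\), n) are not detailed here, and uses the \(\lambda\) values reported in Sect. 4.1.

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """Generalized IoU loss for boxes given as (x1, y1, x2, y2) in [0, 1]."""
    lt = torch.max(pred[..., :2], target[..., :2])          # intersection top-left
    rb = torch.min(pred[..., 2:], target[..., 2:])          # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    lt_c = torch.min(pred[..., :2], target[..., :2])         # smallest enclosing box
    rb_c = torch.max(pred[..., 2:], target[..., 2:])
    wh_c = (rb_c - lt_c).clamp(min=0)
    area_c = wh_c[..., 0] * wh_c[..., 1]
    giou = iou - (area_c - union) / (area_c + eps)
    return (1.0 - giou).mean()

def total_loss(pred_boxes, gt_boxes, lambda_l1=5.0, lambda_giou=2.0):
    """Weighted combination of the L1 and gIoU terms of Eq. (4)."""
    l1 = torch.nn.functional.l1_loss(pred_boxes, gt_boxes)
    return lambda_l1 * l1 + lambda_giou * giou_loss(pred_boxes, gt_boxes)
```

In a deep-supervision setup following [55], this loss would be evaluated on the box predictions of every decoder layer and summed.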

3.5 Inference

During training, the model is presented with labeled input data, makes predictions, compares those predictions against the ground-truth labels, and iteratively (epoch by epoch) adjusts its weights to minimize the prediction error. After training, the model moves to the inference stage, where it predicts on new, unseen data based on the patterns learned during training. At test time, we employ distinct operations for individual action and social group predictions. For individual actions, a softmax operation is applied to predictions trained with the cross-entropy loss, and a sigmoid operation is applied to predictions trained with the binary cross-entropy loss. Our approach follows a hierarchical strategy, selecting a label from the first partition and progressing to subsequent partitions in the hierarchy only if the "Other" class is predicted. For social group prediction, we leverage the predicted counts of individual actions and social groups: the activity of each predicted social group is determined by the most frequent action label among its members.
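The hierarchical selection and group-label voting described above can be summarized by the following sketch; the partitioned output heads, the literal "Other" class name, and the per-member label lists are assumptions used purely for illustration.

```python
import torch
from collections import Counter

def predict_individual_action(logits_by_partition, class_names_by_partition):
    """Hierarchical label selection: take the argmax of the first partition and
    only move to the next partition when the "Other" class is predicted."""
    for logits, names in zip(logits_by_partition, class_names_by_partition):
        probs = torch.softmax(logits, dim=-1)
        idx = int(probs.argmax())
        if names[idx] != "Other":
            return names[idx]
    return "Other"

def predict_group_activity(member_action_labels):
    """Social group activity = most frequent action label among the group's members."""
    return Counter(member_action_labels).most_common(1)[0][0]
```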

3.6 Discussion

Need for concatenated attention? The fusion scheme described earlier, which we call "concatenation-based fusion," combines features by performing self-attention on the features of each source and then computing cross-attention across sources, in contrast to purely "cross-attention-based fusion." With its flexibility and few assumptions about the data's spatial structure, the Transformer offers significant modeling freedom. Compared to cross-attention-based fusion, concatenation-based fusion is more efficient due to operation sharing and reduces model parameters through weight sharing. Weight sharing is a critical design aspect for symmetric metric learning between two data branches; in concatenation-based fusion, we apply it to both feature extraction and feature fusion. In a nutshell, concatenation-based fusion improves both efficiency and performance.

What if we use a query-based decoder? Many transformer-based models in vision tasks draw inspiration from the vanilla Transformer decoder. They incorporate a learnable query to extract specific target features from the encoder. For instance, in [55], they use object queries; in [56], they refer to target queries. However, our experimental findings reveal that a query-based decoder faces challenges, including slower convergence and subpar performance. The vanilla Transformer decoder, which is fundamentally a generative model, may not be the ideal choice for classification tasks. Furthermore, employing a single universal target query for all types of objects could potentially create a performance bottleneck. It is worth emphasizing that REACT primarily functions as an "encoder" model within the conventional Transformer encoder-decoder framework.

4 Experiment results

4.1 Experiment settings

Upstream task We consistently used the RoBERTa [57] model throughout all experiments to extract the textual features, whereas the visual backbone varied across experiments to allow fair comparisons. We use the hyper-parameters T = 15 (JRDB-PAR) and 8 (Volleyball), N = 6 (JRDB-PAR) and 8 (Volleyball), \(\lambda _{\mathcal {L}_1}\) = 5, \(\lambda _{gIoU}\) = 2, and d = 256. We initialized the temporal attention weights randomly, while the spatial attention weights were initialized from a ViT model trained with self-supervision on ImageNet-1K [58]. This initialization scheme facilitates faster convergence of the space-time ViT, as observed in the supervised setting [59]. We trained with the Adam optimizer [60] using a learning rate of \(5\times 10^{-4}\), decayed with a cosine schedule after a linear warm-up over five epochs [61, 62]. Additionally, we applied a weight decay scaled from 0.04 to 0.1 during training.
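The optimization recipe above (Adam, linear warm-up into a cosine learning-rate decay, and a weight decay ramped from 0.04 to 0.1) can be sketched as follows; the per-epoch scheduling granularity, the cosine shape of the weight-decay ramp, and the 100-epoch default are assumptions.

```python
import math
import torch

def build_optimizer_and_schedule(params, base_lr=5e-4, warmup_epochs=5,
                                 total_epochs=100, wd_start=0.04, wd_end=0.1):
    """Adam with linear warm-up, cosine learning-rate decay, and a
    weight decay ramped from wd_start to wd_end over training."""
    optimizer = torch.optim.Adam(params, lr=base_lr, weight_decay=wd_start)

    def lr_at(epoch):
        if epoch < warmup_epochs:                       # linear warm-up
            return base_lr * (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine decay

    def wd_at(epoch):                                   # cosine ramp from wd_start to wd_end
        progress = epoch / max(1, total_epochs - 1)
        return wd_end + 0.5 * (wd_start - wd_end) * (1 + math.cos(math.pi * progress))

    def step(epoch):                                    # call once per epoch
        for group in optimizer.param_groups:
            group["lr"] = lr_at(epoch)
            group["weight_decay"] = wd_at(epoch)

    return optimizer, step
```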

Downstream tasks We trained a linear classifier on our pretrained backbone. During training, the backbone was frozen, and we trained the classifier for 100 epochs with a batch size of 32 on a single NVIDIA-V100 GPU using SGD with an initial learning rate of 1e-3 and a cosine decay schedule. We also set the momentum to 0.9.

4.2 Dataset details

Volleyball Dataset [9]: With 55 videos and 4,830 labeled clips (3,493 training, 1,337 testing), this dataset includes annotations for individual actions and group activities with bounding boxes. In WSGAR experiments, we focus solely on group activity labels, excluding individual action annotations. Evaluation metrics include Multi-class Classification Accuracy (MCA) and Merged MCA, ensuring fair comparisons with existing methods.

JRDB-PAR Dataset [63]: Featuring 27 action, 11 social group, and 7 global activity categories, this dataset comprises 27 videos (20 for training, 7 for testing) with 27,920 frames and 628k human bounding boxes. Uniformly sampled keyframes (every 15 frames) are used for annotation and evaluation. Precision (\(\mathcal {P}_g\)), recall (\(\mathcal {R}_g\)), and F1-score (\(\mathcal {F}_g\)) are the evaluation metrics for social group activity recognition, which is treated as a multi-label classification problem [63].

4.3 Classification task evaluation

Volleyball dataset. We evaluate our approach against the latest GAR and WSGAR methods under two supervision levels: fully supervised and weakly supervised. These differ in whether actor-level labels, such as ground-truth bounding boxes and individual action class labels, are used in training and inference. To ensure a fair comparison, we report the results of previous methods and reproduce results using only RGB input and a ResNet-18 backbone. In the weakly supervised setting, we replace the group action classification labels with ground-truth bounding boxes of the actors without their corresponding actions, so that actor localization is learned during the pre-training stage. Table 1 presents the results, with the first and second sections showing earlier techniques in the fully supervised and weakly supervised settings, respectively. Our model trained on the ResNet-18 backbone outperforms most fully supervised frameworks, significantly improving the MCA and MPCA metrics. With the ViT-Base backbone, our approach significantly outperforms all GAR and WSGAR models under weakly supervised conditions, beating them by 2.4% MCA and 1.2% Merged MCA by leveraging spatiotemporal features through the transformer architecture. Moreover, our approach is better than current GAR methods that employ less thorough actor-level supervision, such as [12, 13, 18, 33, 36].

Table 1 Comparison with the state-of-the-art methods on the Volleyball dataset [9]

JRDB-PAR dataset We conducted a comparative study to evaluate our proposed approach alongside state-of-the-art GAR and WSGAR methods on the JRDB-PAR dataset, considering both fully supervised and weakly supervised settings. The comparison results are presented in Table 2. Our method significantly outperforms existing social group activity recognition frameworks in the fully supervised setting on all metrics. In the weakly supervised setting, our method outperforms existing GAR and WSGAR methods by a considerable margin, with gains of 1.2 in \(\mathcal {P}_g\), 1.7 in \(\mathcal {R}_g\), and 2.3 in \(\mathcal {F}_g\). Additionally, we evaluated this dataset using ResNet-18, ViT-B/16, and ViT-B/32 backbones, with ViT-B/16 performing best (its results are presented in Table 2); this is analyzed further in the Appendix. Despite the impressive WSGAR performance of prior methods, our approach outperforms them all.

Table 2 Comparative results of the group activity recognition on JRDB-PAR dataset [63]

4.4 Human action retrieval evaluation

Our experimental results in Table 3 show that our proposed method outperforms existing state-of-the-art approaches in terms of the R@K metric. Specifically, our method achieves a significantly higher R@1 score than the other methods, with a margin of 4.8. Additionally, our method achieves higher R@5 and R@10 scores than all other methods, demonstrating its superior performance in human action retrieval. Furthermore, our method shows more consistent performance across the two datasets and experimental setups. We attribute this to its ability to effectively capture the global and local features of the video frames and its incorporation of text embeddings for improved semantic understanding of the video content.

Table 3 Human Action-retrieval task results on transformer-based methods

4.5 Ablation study

In this section, we verify the effectiveness of each component in the REACT framework on downstream group activity recognition tasks and examine the significance of the actor fusion block in the proposed method.

4.5.1 Effectiveness of individual components

Table 4 presents the results of the ablation study, focusing on the impact of different components on group activity recognition tasks. Regardless of the visual backbone used, we observe notable improvements in performance when incorporating various components such as feature extraction, encoding, and decoding. Specifically, experiments demonstrate that the inclusion of these components leads to substantial gains in both the JRDB-PAR and Volleyball datasets.

For instance, when using ResNet-18 as the visual backbone, the addition of feature extraction, encoding, and decoding modules results in performance improvements of 4.8% and 4.1% on the JRDB-PAR and Volleyball datasets, respectively. Similarly, employing ViT-B/32 and ViT-B/16 as visual backbones leads to significant performance enhancements across both datasets when incorporating these components.

These results underscore the importance of each component in the REACT framework, highlighting their collective contribution to improving group activity recognition performance.

4.5.2 Significance of actor fusion block

Table 5 specifically focuses on the impact of the actor fusion block on the architecture’s performance. Utilizing ViT-B/32 as the visual backbone, we observe a substantial performance boost across both datasets when incorporating the actor fusion block. In particular, the inclusion of the actor fusion block leads to a remarkable improvement of 13.5% and 20.6% in group activity recognition performance on the JRDB-PAR and Volleyball datasets, respectively.

These results demonstrate the crucial role played by the actor fusion block in enhancing the REACT framework’s effectiveness. By effectively fusing actor-specific details and contextual information, the actor fusion block enables the model to achieve more accurate and robust predictions, thereby improving overall performance on group activity recognition tasks.

In summary, the ablation study results provide empirical evidence supporting the efficacy of individual components and the significance of the actor fusion block in the REACT framework. These findings further validate the design choices and architectural decisions made in our proposed approach.

Table 4 Ablation study of each Group Activity Recognition task component
Table 5 Ablation study results on the Actor Fusion component in the architecture

4.6 Visualization

To gain insight into the visualization capabilities of our proposed method and demonstrate its significance, we examined visualizations conditioned on the input action query. Fig. 4 shows the localization of actions based on the input query for the JRDB-PAR and Volleyball datasets. The proposals generated by the decoder during inference indicate that our method effectively localizes the specified actions, demonstrating the ability of our framework to accurately recognize and pinpoint specific actions within a group activity context. Furthermore, t-SNE plots are visualized in the Appendix; these plots show that the proposed framework effectively recognizes all actions within a given video and enhances overall group activity recognition performance.

Fig. 4

Visualization based on the input action query. The top two rows show results from the JRDB-PAR dataset, and the bottom row is from the Volleyball dataset. Best viewed in color and zoomed in

5 Conclusion

We introduce REACT, a novel video transformer-based model for group activity recognition using contrastive learning. Our approach generates diverse spatio-temporal views from a single video, leveraging different scales and frame rates. Correspondence learning tasks capture motion properties and cross-view relationships between sampled clips and textual information. The core of our approach is a contrastive learning objective that reconstructs video-text modalities in the latent space. The input–output pair in the V2T architecture is reversed from that of the T2V network topology. Using the fundamental cGAN architecture, our network is made up of a Discriminator D and a Generator G. REACT effectively models long-range spatio-temporal dependencies and performs dynamic inference within a single architecture. REACT outperforms state-of-the-art models in group activity recognition, showcasing its superior performance.

Limitations While our methods excel in group activity classification and human action-retrieval tasks, they may not transfer directly to other vision-language problems due to the lack of rich textual descriptions in the datasets. Future research directions could focus on incorporating additional data involving question answering, phrases for video grounding, and other related textual information to enhance the versatility of our methods.