JointContrast: Skeleton-Based Interaction Recognition with New Representation and Contrastive Learning†
Round 1
Reviewer 1 Report
The authors propose an interaction-information-embedding skeleton graph representation, using graph convolution operations to represent intra-subject spatial structure information and inter-subject interaction information. The article is well structured and deserves to be accepted for publication after the following minor concerns are addressed:
- Figures 4 and 6: please insert units on the axes.
- The authors highlight the novelty of the work, but what is the gap in the research field that motivated the authors to propose this approach? Please provide a comparative analysis.
- In the conclusion, the authors mention: "Our future works will focus on the more flexible and efficient representation of skeleton data and more effective pre-training pretext tasks." What does "more flexible and efficient representation of skeleton data" mean? Does the new technique leave some gaps? What is the strategy to fill them? Please comment on this in the conclusion.
- Why did the proposed work not achieve "more effective pre-training pretext tasks," as the authors write in the conclusion?
Author Response
Please see the attachment. Responses for Reviewer 1.
Author Response File: Author Response.pdf
Reviewer 2 Report
The related work section provides a thorough overview of the existing literature on skeleton-based action recognition. It effectively summarizes the main approaches to handling the irregular structure of skeleton data, including the conversion of 3D skeletal data to pseudo-images, graph convolutional networks (GCNs), and attention-based methods. It then smoothly reviews interaction recognition and contrastive learning in the context of skeleton-based tasks.
One potential improvement to this section would be to provide more detailed explanations of some of the technical concepts introduced, such as GCNs and self-attentive mechanisms. This could help readers who are not familiar with these concepts to better understand the approaches being discussed.
Here I provide one potential paper. Please read it and add at least a paragraph about GNNs and GATs:
“Deep learning approaches have been exploited for graph data modeling and representation. An essential step in performing graph-structured data tasks is learning better representations. Conventional neural networks are limited to handling only Euclidean data. By leveraging representation learning, graph neural networks have generalized deep learning models to perform on structural graph data with good performance. In GNNs, an iterative process propagates the entity state until equilibrium. This idea was extended by [42] to use gated recurrent units in the propagation step. Graph neural network (GNN) models have proven to be a powerful family of networks that learn the representation of an entity by aggregating the features of the entity and its neighbors [43]. In traditional GNNs, multiple layers are stacked to aggregate information throughout the knowledge graph and output learned entity embeddings. Some GNN models can also learn relation embeddings.
Several attempts have been made to apply neural networks to structured graphs. Recursive neural networks were an early approach to processing data in acyclic graphs [45]. This idea was extended to graph neural networks in [46], generalizing recursive graph neural networks to directed and undirected graphs. They generally learn the target node’s representation from its neighbors’ information iteratively until it reaches an equilibrium point. With the growth of high-dimensional data, graph neural networks have been widely studied and applied to learn representations from complex graph-structured data, with remarkable performance in different domains.
Unlike GCNs, in which all neighbors share fixed weights and contribute equally during information passing, graph attention networks assign different levels of importance to each neighbor of a specific node. In graph attention networks, layers are stacked, and nodes can attend over their neighbors. Each node holds a different weight within a neighborhood. The advantage of these networks is that they do not require any knowledge of the graph structure or any costly matrix operations [49]. The graph attention network (GAT) [49] uses multi-head attention, concatenating n attention heads, to stabilize the learning process and boost performance. However, multi-head attention can substantially increase the number of parameters.”
Zamini, Mohamad, Hassan Reza, and Minou Rabiei. "A Review of Knowledge Graph Completion." Information 13, no. 8 (2022): 396.
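To make the quoted GAT mechanism concrete, a minimal single-head attention layer might be sketched as follows. This is an illustrative NumPy sketch with random weights, not code from the paper under review or from [49]; a practical GAT would use multiple heads and learned parameters.

```python
import numpy as np

def gat_layer(H, A, W, a, alpha=0.2):
    """Single-head graph attention layer: each node attends over its
    neighbors with learned per-edge weights, unlike a GCN where the
    neighbor weights are fixed by the graph structure."""
    Wh = H @ W                      # (N, F') projected node features
    N = Wh.shape[0]
    # e[i, j] = LeakyReLU(a^T [Wh_i || Wh_j]) for every ordered node pair
    pairs = np.concatenate(
        [np.repeat(Wh, N, axis=0), np.tile(Wh, (N, 1))], axis=1
    )
    e = pairs @ a
    e = np.where(e > 0, e, alpha * e).reshape(N, N)   # LeakyReLU
    # mask non-edges, then softmax over each node's neighborhood
    e = np.where(A > 0, e, -1e9)
    attn = np.exp(e - e.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ Wh                # attention-weighted aggregation

# toy 3-node graph with self-loops (weights random, purely illustrative)
rng = np.random.default_rng(0)
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
H = rng.standard_normal((3, 4))                       # 4 input features
W = rng.standard_normal((4, 2))                       # project to 2 dims
a = rng.standard_normal(4)                            # attention vector
out = gat_layer(H, A, W, a)
print(out.shape)  # (3, 2)
```

The key contrast with a GCN is the `attn` matrix: in a GCN it would be a fixed normalized adjacency, while here it is computed from the node features themselves.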
Additionally, it may be helpful to highlight the limitations or drawbacks of some of the existing approaches, in order to emphasize the need for further research and innovation in this field. For example, the section briefly mentions that self-attentive models are difficult to train, but it may be beneficial to elaborate on why this is the case and how it can be addressed.
Overall, this paper seems to be well written and well organized. The authors provide a clear problem statement and background information on skeleton-based action recognition, and they propose a novel approach that effectively utilizes interaction information between subjects. The paper provides detailed descriptions of the proposed method, including the backbone model, interaction information embedding, and the pre-training and fine-tuning strategy. The experimental results on benchmark datasets show that it outperforms existing state-of-the-art methods, demonstrating the effectiveness of the proposed approach. The authors also provide insightful analysis and ablation experiments to validate their approach. Overall, this paper contributes to the field of skeleton-based action recognition and provides a promising direction for future research. However, I noticed that only two of the cited papers are from 2022 and three from 2021; I expect more recent references to be added. One could be the paper I mentioned earlier.
Author Response
Please see the attachment. Responses for Reviewer 2.
Author Response File: Author Response.pdf
Reviewer 3 Report
This is an interesting and well written paper. The methodology is sound. I have the following minor remarks:
1) "the skeleton data has an irregular structure" - why? Skeleton data is composed of a certain number of body joints organized in a hierarchy. Skeleton data can be derived from RGB images and has lower dimensionality than RGB data. You also define it in Section 3.1.
2) "In this paper, we use the Graph Convolutional embedding LSTM (GC-LSTM) network as the backbone, which was originally designed in [37]" - please summarize the architecture of that backbone network.
3) What optimization algorithm has been used in your research?
4) "Despite the large number of samples in this dataset, we performed the data augmentation due to the satisfactory performance we achieved with this on the SBU dataset" - what augmentation methods have been used for the dataset?
5) Figure 5 - the image is too small. It would be better to change it from a raster to a vector image.
6) "We notice that the recognition accuracy does not improve with more extended training, probably because both baselines suffer from over-fitting." How can that overfitting be prevented? This seems to be among the biggest drawbacks of the proposed method.
7) Please report the proposed method's accuracy in the abstract.
8) How has the proposed solution been implemented? Please publish the source code and make the experiments reproducible.
Author Response
Please see the attachment. Responses for Reviewer 3.
Author Response File: Author Response.pdf
Reviewer 4 Report
The present study deals with a new method for skeleton-based action recognition and its validation on the SBU and NTU RGB-D datasets. The topic is in line with current research and can be of interest in several applications and fields. Nevertheless, some improvements must be applied to obtain a better description and comprehension of the study.
Here are some comments on each section:
- Abstract: too much space is dedicated to the introduction of interaction recognition, while no results are reported. Try to modify the abstract, adding the most meaningful results and their explanation.
- Introduction and Related Work: these two sections could be merged into a single chapter (called Introduction) that sums up the most important information from the literature and current research. Indeed, in the present form, the information is not clearly reported. I suggest describing previous works in terms of methodology, applications, and limitations. The principal aims of the study, the main innovation, and the reason why it has been proposed must be clearly stated at the end of the introduction.
- Methodology: this section must be revised because the description of the steps is somewhat hard to follow. The section is too long; try to concentrate only on the description of the adopted algorithm. At the beginning of each subsection you report a brief introduction and description of previous works: if these parts are necessary for the comprehension of the study, they should be moved to the introduction. In this section, only the current algorithm should be described.
- Experiments: I have no comments on this section, the experimental procedure is well presented.
- Results: in this section, the authors report not only the results but also the discussion. I suggest changing the title to "Results and Discussion".
- Hyperparameters: it is not clear why this is a standalone chapter instead of a subsection of the Results.
- Conclusion: the limitations of the current study are missing.
The References heading is reported twice.
Tables and Figures must be reported after they are mentioned in the text.
Author Response
Please see the attachment: Responses for Reviewer 4.
Author Response File: Author Response.pdf
Round 2
Reviewer 4 Report
The authors fulfilled the suggested revisions.