1 Introduction

Event Detection (ED) is a fundamental subtask of Information Extraction (IE) [12, 27] that aims to identify the event type mentioned in a sentence from a predefined set of event types. ED is critical for many downstream tasks, including event extraction [18] and event relation extraction [41]. For example, given a sentence such as "They were caught within an hour," the ED task is to identify the trigger word "caught" and classify it into the predefined event type "Arrest-Jail". Event triggers are words and phrases that most clearly indicate the occurrence of events [22]. Figure 2 shows the mapping of words to event labels. While deep neural networks have achieved significant progress in ED [10, 24, 30, 33, 35, 44], they heavily rely on labeled data, which can be costly and time-consuming to obtain. Moreover, in real-world scenarios, new event types may emerge with limited available data, leading to poor performance in detecting these newly emerged events. Thus, there is a need to transform the conventional supervised event detection task into a few-shot event detection (FSED) problem, which aims to accurately detect new event types from limited data.

Most existing few-shot event detection (FSED) methods adopt a meta-learning paradigm [39] and generate N-way-K-shot meta-tasks. For example, Table 1 shows a 3-way-2-shot FSED task with a support set and a query set. The support set comprises two sentences for each of the three event types, providing meta-knowledge to the model for predicting labels for sentences in the query set. Specifically, in each meta-task, the model generates a prototype for every class using the support set and then predicts labels for instances in the query set by matching them with the closest prototype in the metric space [48]. Currently, FSED approaches can be classified into two categories: (1) Pipeline methods [4, 20, 21], which involve sequential stages of trigger identification and event type classification; (2) Joint methods [48], which perform few-shot event detection with a single model, without task decomposition, by directly classifying each word in a sentence. It is noteworthy that joint FSED models tend to outperform pipeline models, mainly because pipeline models suffer from error propagation caused by trigger misidentification.

Table 1 An example of a 3-way-2-shot FSED task; highlighted words indicate the trigger words to be recognized

The FSED task can take several forms, encompassing event identification, event classification, and event detection. Event identification [1] determines whether a word in a sentence is a trigger for a specific event type. Event classification [22] involves selecting the event type associated with an identified trigger. Event detection [3, 36, 48] combines event identification and event classification and performs both steps jointly. Recently, joint models for few-shot event detection have generally been formulated as sequence labeling tasks, such as PA-CRF [3, 36] and HCL-TAT [48]. PA-CRF improves upon the Conditional Random Field (CRF) method by treating the transition score as a random variable and using a Gaussian distribution to approximate event type dependencies. Although PA-CRF is more powerful than CRF-based methods in exploring event type dependencies, it still struggles to learn label dependencies from limited data. The work of [36] studies different options for selecting and combining the 12 layers of BERT to obtain more relevant token representations for FSED; Tuo et al. [36] proposed a BERT layer weighting strategy combined with PA-CRF, yielding improvements. Like PA-CRF and [36], these models must generate 2N+1 prototypes from only K instances when the BIO sequence annotation scheme is adopted, and the prototypes generated in this way are not well-discriminative for each label.

HCL-TAT proposes support-support and prototype-query contrastive learning based on supervised contrastive learning [13, 19] to improve the representations of instances in both the support and query sets. Contrastive learning encourages instances of the same type to be closer and instances of different types to be farther apart. However, when dealing with very long sentences, the numbers of triggers and non-triggers can be extremely unbalanced, leading to category imbalance and many easy negative samples. Specifically, the number of non-trigger words ("O" tag) far exceeds that of any other tag. When calculating the loss, the non-trigger ("O" category) words account for the majority of the total loss, preventing the model from being trained effectively. This issue is visually reflected in Fig. 1, despite the relatively short sentence length. Therefore, current research encounters two main challenges. Firstly, the generation of discriminative prototypes is hindered by the limited availability of training data per batch. Secondly, the presence of a multitude of semantically heterogeneous words in lengthy sentences makes it difficult to distinguish between trigger and non-trigger words.

Fig. 1
figure 1

Visualization of triggers of 3-way-2-shot support set based on the example of Table 1

To address the issues mentioned above, we propose a network called Multi-channels Prototype and Contrastive learning method with Conditional Adversarial attack (MPC-CA). MPC-CA proposes improved multi-channels prototype and contrastive networks to alleviate the category and hard-easy sample imbalance and devises conditional adversarial learning to address the problem of limited training data in an episode. Specifically, we introduce the multi-channels prototype (MP) network, inspired by the multi-head attention concept [38], which allows the model to jointly attend to information from different representation spaces at different positions. MP can effectively generate prototype representations of each category from multiple perspectives and alleviate the category imbalance by aggregating different representation spaces of the "O" class simultaneously. Moreover, we introduce multi-channels support contrastive learning (MSS-CL) based on the HCL-TAT model, which encourages contrastive learning of instances across different feature representation spaces within the support set. MSS-CL can enhance the comparison of relationships between classes and alleviate the hard-easy sample imbalance caused by too many "O"-class tokens.

Besides, we use the focal loss [23] to address the issues of category imbalance and hard-easy sample imbalance; the focal loss replaces the cross-entropy loss used to calculate the loss between the prototypes and the query set. However, a more significant issue in few-shot learning is the limited supervised data in an episode, which can lead to inadequate learning of prototype representations. To address this problem, we propose Condition Adversarial (CA) Attacks based on the Projected Gradient Descent (PGD) algorithm [28]. CA effectively augments the training data in a disguised form by adding perturbations to the model input. Specifically, CA performs adversarial attacks only when a predefined condition is met, rather than from the beginning to the end of training, which saves time and performs better than unconditional adversarial attacks. Moreover, we conducted data cleanup on the FewEvent [4] dataset to address problems such as labeling errors and conducted comparative experiments on the cleaned dataset, FewEvent++.

Our contributions are summarized as follows:

(1) We propose a network, MPC-CA, which addresses the challenges of limited training data and category imbalance in the FSED sequence labeling scenario. MPC-CA consists of a multi-channels prototype network, multi-channels contrastive learning, and condition adversarial learning.

(2) We introduce the focal loss in place of the cross-entropy loss to mitigate the category and hard-easy sample imbalance. We performed data analysis and clean-up on the FewEvent dataset and renamed the cleaned dataset FewEvent++.

(3) Experimental results demonstrate that our proposed MPC-CA outperforms other competitive baselines on FewEvent and FewEvent++. Further analysis shows the effectiveness of our proposed model.

(4) Further comparative experiments conducted on the GLUE benchmark validate the excellent adaptability and robustness of the proposed conditional adversarial method.

2 Related Work

2.1 Traditional Event Detection

Event Detection (ED) aims to identify specific event types by recognizing the trigger word in a sentence. With the rapid development of deep learning, various models have been proposed and have achieved promising performance in ED [7, 16, 24,25,26, 35, 43, 44]. However, these methods achieve promising results only with sufficient supervised training data and may not adapt well to Few-Shot Event Detection (FSED) scenarios with limited data in real-world contexts.

2.2 Few-Shot Event Detection

FSED is a relatively new sub-task of event detection, and only a limited number of methods have been proposed for it. However, significant progress has been made in few-shot learning for the Named Entity Recognition (NER) task [14, 17, 45, 49], from which FSED can draw ideas and knowledge. Deng et al. [4] first proposed the benchmark FSED dataset, FewEvent, and designed the Dynamic-Memory Bi-Partite Propagation Network (DMBPN), which leverages dynamic memory to preserve contextual information of event mentions. Lai et al. [20, 21] proposed the LoLoss and two regularization matching losses to train the model. PA-CRF [3] improved the Conditional Random Field (CRF) method by treating the transition score as a random variable and utilizing a Gaussian distribution to approximate event type dependencies. Deng et al. [5] proposed the OntoED model, which formulates ED as a process of event ontology population in the low-resource sense. OntoED links event instances to pre-defined event types in an event ontology and enriches the ontology with linkages among event types. Zhang et al. [47] proposed to leverage the semantic meaning of event type labels, using label representations induced by pre-trained language models and mapping identified events to the target types via representation similarity. Lai et al. [22] proposed to exploit the relationship between training tasks for FSED, computing prototypes based on cross-task modeling and presenting a regularization that enforces prediction consistency of classifiers across tasks. Shen et al. [32] proposed AKE-BML, which leverages the external knowledge base FrameNet to learn prototype representations for event types and alleviates the insufficient sample diversity problem in few-shot learning. Besides, AKE-BML presented a novel knowledge adaptation mechanism to tackle the uncertainty and incompleteness issues in knowledge coverage. Chen et al. [1] found that the trigger is a confounder of the context and the result, which makes previous FSED methods prone to overfitting triggers. To resolve this problem, they propose to intervene on the context via backdoor adjustment during training. Tuo et al. [36] studied different ways to better exploit the information contained in the BERT pre-trained model for FSED. HCL-TAT [48] proposed a hybrid contrastive learning approach to generate more discriminative representations for both the support and query sets. However, these methods still suffer from category imbalance, hard-easy sample imbalance and poor representation learning. OUTFIT [37] formulates FSED as an out-of-domain detection task using prototypical networks; it avoids building a prototype for the inherently heterogeneous null class and provides a dynamic threshold to decide whether a word is a trigger or not. Xia et al. [42] proposed a domain-aware few-shot generative model that can generate domain-based training data from a relatively small amount of labeled data. In this paper, we propose a model that leverages improved prototype and contrastive learning and adversarial learning to address these problems.

2.3 Contrastive Learning

In recent years, contrastive learning has emerged as a promising approach in various domains. In the self-supervised domain, specific strategies have been designed to generate positive and negative pairs from unsupervised data [2, 8, 15, 41]. In the supervised learning domain, several works [13, 19] have leveraged label information to improve unsupervised contrastive learning. Furthermore, contrastive learning has been increasingly used in few-shot learning tasks to enhance performance, encouraging query samples to be closer to the prototype of the same class and farther away from those of different classes [9, 48, 49]. In this paper, we propose the multi-channels support contrastive learning (MSS-CL) approach, which encourages contrastive learning of instances across different feature representation spaces and alleviates the category imbalance caused by too many "O"-class tokens.

2.4 Adversarial Attack

Adversarial attack is a method that improves the robustness and generalization ability of a model by introducing perturbations to the model input. During the training stage, the model input undergoes a gradient ascent (increasing the loss), while the model parameters undergo a gradient descent (decreasing the loss). Adversarial training was first introduced in FGSM [11], and subsequently FGM [29] was proposed based on specific gradients. Madry et al. [28] found that training against a multi-step method (PGD) can find the optimal perturbation. However, the cost of PGD-based adversarial training is much higher than conventional training. To mitigate this cost, [31] proposed a "free" adversarial training algorithm that simultaneously updates both model parameters and adversarial perturbations on a single backward pass. Zhu et al. [50] proposed FreeLB, which promotes higher invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. In this paper, we propose condition adversarial attack (CA) learning, which applies adversarial attacks only when a predefined condition is met, not throughout the entire training process. CA not only saves training time but also improves the generalization and robustness of the model relative to the vanilla adversarial method.

In this paper, we propose Multi-channels Prototype Contrastive Learning with Condition Adversarial Attacks, which improves the prototype and feature representation from multiple different representation sub-spaces. Furthermore, we present condition adversarial attacks, a strategy that mitigates the challenge of limited supervised data during the training phase. By introducing these methods, we aim to improve the robustness and generalization capabilities of the model, which can lead to better performance with limited data.

3 Background

In this section, we first present the basic concepts of FSED and then provide a detailed introduction to N-way-K-shot learning.

3.1 Problem Definition

In this paper, we formulate few-shot event detection (FSED) as a sequence labeling task. A typical few-shot task comprises a training set \(D_{train}\), a development set \(D_{dev}\) and a test set \(D_{test}\), whose three label sets \(Y_{train}\), \(Y_{dev}\) and \(Y_{test}\) are disjoint, i.e., \(Y_{train} \cap Y_{dev} \cap Y_{test} = \emptyset \), ensuring that the model sees no more than K examples of a novel class. The training set \(D_{train}\) consists of word sequences, trigger words and their corresponding label sequences, \(D_{train} = \{X^{i}, a^{i},Y^{i}\}_{i=1}^{|D_{train}|}\). Given a word sequence \(X = \{x_1, x_2,\dots , x_n\}\), a trigger word \(a\) and the corresponding label sequence \(Y = \{y_1, y_2,\dots , y_n\}\), each word \(x_{i}\) corresponds to an event type label \(y_{i}\). Figure 2 illustrates the mapping of words to event labels; a minimal code illustration follows the figure. The "O" type is used to mark words that do not belong to any event type, referred to as non-triggers. Additionally, in this paper, each sentence is restricted to having only one trigger word and one associated event type.

Fig. 2
figure 2

Label mapping from X to Y of three sentences
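To make the task format concrete, the following minimal sketch reuses the example sentence from the introduction and shows the token-to-label mapping described above; the assertion simply checks that only the trigger carries an event label:

```python
# Hypothetical illustration of the token-level label mapping (cf. Fig. 2).
# Every non-trigger token is labeled "O"; the single trigger token
# carries its event type.
X = ["They", "were", "caught", "within", "an", "hour"]
Y = ["O",    "O",    "Arrest-Jail", "O", "O", "O"]
trigger = "caught"  # the trigger word a

assert len(X) == len(Y) and Y[X.index(trigger)] == "Arrest-Jail"
```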

3.2 N-Way-K-Shot Learning

Few-shot learning includes two stages: training and testing. During the training stage, the model is trained on N-way-K-shot tasks or episodes sampled from the training set \(D_{train}\). Each N-way-K-shot task or episode includes a support set and a query set. To create an N-way-K-shot task, we randomly sample N event types from \(Y_{train}\). Then, for each of the N event types, we randomly select K instances as the support set \(S = \{X^{i}, a^{i},Y^{i} \}_{i=1}^{N \times K}\) and M instances as the query set \(Q = \{X^{i}, a^{i},Y^{i} \}_{i=1}^{N \times M}\), and \(S \cap Q = \emptyset \). The support set is similar to the training set in traditional supervised learning but contains only a few samples. The query set acts as the test set but is used to compute gradients for updating model parameters [14]. Similarly, we sample episodes from \(D_{test}\) during the testing stage to test whether the model can generalize well to new classes. Finally, we evaluate the average performance across all testing episodes.

Episodic training is commonly employed in few-shot learning, where an episode \(T = \{ S , Q \} \) can be viewed as a mini-batch. Hence, the training process of an episode is a batch training process. The training stage can be formulated as \(T_{train} = \{ T_{i}\}_{i=1}^{M_{train}}\), and the testing stage can be formulated as \(T_{test} = \{ T_{i}\}_{i=1}^{M_{test}}\), where \(M_{train}\) and \(M_{test}\) denote the number of training episodes and testing episodes, respectively.
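For concreteness, here is a minimal sketch of episode sampling under these definitions; it assumes `dataset` maps each event type to its list of instances, and the paper's actual sampler may differ in details:

```python
import random

def sample_episode(dataset, n_way, k_shot, m_query):
    """Sample one N-way-K-shot episode (support + query) from a
    label -> list-of-instances mapping."""
    classes = random.sample(list(dataset.keys()), n_way)
    support, query = [], []
    for c in classes:
        picked = random.sample(dataset[c], k_shot + m_query)
        support.extend(picked[:k_shot])   # K instances per class
        query.extend(picked[k_shot:])     # M instances per class
    return support, query

# Usage: a 5-way-5-shot episode with one query instance per class (M=1).
# support, query = sample_episode(train_data, n_way=5, k_shot=5, m_query=1)
```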

4 Method

An overview of the proposed MPC-CA method is illustrated in Fig. 3. In Sect. 4.1, MPC-CA uses the BERT text encoder to generate textual feature representations. In Sect. 4.2, we introduce the prototypical network, and in Sect. 4.2.1 our proposed multi-channels prototypical network. In Sect. 4.3, we describe the focal loss for solving the class and hard-easy sample imbalance problems. In Sect. 4.4, we describe contrastive learning and, in Sect. 4.4.2, our proposed multi-channels support contrastive learning. Section 4.5 describes the improved condition adversarial attack learning, and the training procedure is given in Sect. 4.6. In particular, Sects. 4.2.1, 4.4.2 and 4.5 provide a detailed description of the methods proposed in this article.

Fig. 3
figure 3

The framework of Multi-channels Prototype Contrastive Learning with Condition Adversarial Attacks (MPC-CA). MPC-CA is based on a prototypical network and is composed of four components: (1) the multi-channels prototypical network, (2) multi-channels support contrastive learning, (3) the focal loss, and (4) condition adversarial learning (not shown in the figure). The green, purple, and blue arrows represent calculating the focal loss, the multi-channels support contrastive loss, and the query-prototype contrastive loss, respectively

4.1 Text Encoder

As pre-trained language models (PLMs) have shown superior performance across multiple NLP tasks, we use BERT [6] as our text encoder to generate textual representations, following [48]. Specifically, given a word sequence X, we input it into BERT and take the output of the last hidden layer as the token representations \({\textbf{H}}\):

$$\begin{aligned} {\textbf{H}} =\left[ {\textbf{h}}_1,\ldots , {\textbf{h}}_n\right] ={\text {BERT}}\left( \left[ x_1, \ldots , x_n\right] \right) \end{aligned}$$
(1)

Here, \({\textbf{h}}_i \in {\textbf{R}}^{d_h}\) represents the hidden representation of each token \(x_i\), \(d_h\) is the dimension of the hidden representation and n is the maximum sentence length.
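The following is a minimal sketch of formula 1 with the Hugging Face transformers library (which Sect. 5.3 states is used), padding to the maximum sentence length of 128; during actual training the encoder would of course be fine-tuned rather than run under `no_grad`:

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sentence = "They were caught within an hour"
inputs = tokenizer(sentence, return_tensors="pt",
                   padding="max_length", truncation=True, max_length=128)
with torch.no_grad():
    # Last hidden layer: (1, n, d_h) token representations H of formula 1.
    H = encoder(**inputs).last_hidden_state
```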

4.2 Prototypical Network

Prototypical networks learn a metric space in which classification can be performed by computing distances to prototype representations of each class [34]. Specifically, prototypical networks first project samples into a vector space and generate each prototype by computing the center of each category in that space. Each prototype is the mean vector of the support-set instances belonging to its class, as follows:

$$\begin{aligned} {\textbf{p}}_c=\frac{1}{\left| S_c\right| } \sum _{\left( X_i, Y_i\right) \in S_c} {\textbf{H}}_{i} \end{aligned}$$
(2)

where \({\textbf{p}}_c\) is the prototype for class c and \(|S_c|\) denotes the number of samples labeled with class c. Figure 4a shows the architecture of the traditional prototypical network. However, in this paper, FSED is formulated as a sequence labeling task, so the prototype is constructed at the token level rather than the sentence level. The formula for calculating the prototype is therefore changed to the following:

$$\begin{aligned} {\textbf{p}}_c=\frac{1}{\left| S_c\right| } \sum _{i \in {\mathscr {S}}(c)} {\textbf{h}}_i,\quad c=0,1, \ldots , N \end{aligned}$$
(3)

where \(S_c\) represents the trigger words of class c in the support set S. When \(c = 0\), \(S_c\) represents all non-trigger words. Prototypical networks produce a distribution over classes for a query point \(x_{i}\) based on a softmax over distances to the prototypes:

$$\begin{aligned} p_{t}(y_i\mid x_i )=\frac{\exp \left( -d\left( {\textbf{h}}_i, \;{\textbf{p}}_{c=y_{i}}\right) \right) }{\sum _{c^{\prime }\in {\mathscr {C}}} \exp \left( -d\left( {\textbf{h}}_i,\; {\textbf{p}}_{c^{\prime }}\right) \right) } \end{aligned}$$
(4)

where \(d(\cdot )\) denotes a distance function and \(-d(\cdot ) \) is the similarity metric: the greater the distance, the smaller the similarity, and vice versa. Learning proceeds by minimizing the negative log-probability \( p_{t}(y_i\mid x_i )\), i.e., the cross-entropy loss on the query set Q; a minimal code sketch of this computation follows Fig. 4.

$$\begin{aligned} {\mathscr {L}}_{C E}=-\sum _{\left( x_i, y_i\right) \in Q }\log p_{t}(y_i\mid x_i ) \end{aligned}$$
(5)
Fig. 4
figure 4

The architecture of the traditional prototypical network and our proposed multi-channels prototypical network; the number of channels is 3
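Below is a minimal sketch of formulas 3-5 at the token level, assuming class index 0 denotes the "O" type and using the negative squared Euclidean distance as the similarity metric of the Proto backbone:

```python
import torch

def token_prototypes(H_support, labels, n_classes):
    """Formula 3: token-level prototypes; class 0 is assumed to be the
    "O" (non-trigger) type. H_support: (T, d_h); labels: (T,)."""
    protos = [H_support[labels == c].mean(dim=0) for c in range(n_classes + 1)]
    return torch.stack(protos)                      # (N+1, d_h)

def query_distribution(H_query, protos):
    """Formula 4: softmax over negative squared Euclidean distances."""
    d = torch.cdist(H_query, protos) ** 2           # (Tq, N+1)
    return torch.softmax(-d, dim=-1)

# Formula 5: cross-entropy on the query set, e.g.
# loss = -torch.log(query_distribution(H_q, protos)[range(Tq), y_q]).sum()
```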

4.2.1 Multi-Channels Prototypical Network

We find that the prototypes (excluding the "O" type) calculated by formula 3 suffer from inadequate event type information and cannot generalize well to detect trigger words that the model has not seen before. This is because, under the N-way-K-shot and sequence labeling conditions, the prototype of each event type is aggregated from only K tokens' information, since each sentence contains a single trigger word. In addition, there are many non-triggers in a sentence, so the prototype of the "O" type mixes information from miscellaneous words, which exacerbates category imbalance in the few-shot learning scenario.

To tackle the above problems, we propose the multi-channels prototypical (MP) network, inspired by multi-head attention [38], which allows models to jointly attend to information from different representation sub-spaces at different positions. Thus, MP is a prototypical network with a multi-channel mechanism that calculates the prototype of each class from different feature representation sub-spaces. Figure 4b shows the architecture of our proposed multi-channels prototypical network.

Specifically, we first initialize \((N+1)\) prototype vectors with 0. Then, to build the multi-channel setting, each prototype vector \({\textbf{p}}_c\) and each token representation \({\textbf{h}}_{i}\) is split into m sub-vectors along the dimension, as follows:

$$\begin{aligned} {\textbf{P}}= \{\;{\textbf{p}}_0, \ {\textbf{p}}_1, \ \dots , \ {\textbf{p}}_c \; \}, c=0,1, \ldots , N \end{aligned}$$
(6)
$$\begin{aligned} {\textbf{p}}_c= \{\; {\textbf{p}}_c^{1} \ | \ {\textbf{p}}_c^{2} \ | \ \dots \ | \ {\textbf{p}}_c^{m} \;\} \end{aligned}$$
(7)
$$\begin{aligned} {\textbf{h}}_i= \{\; {\textbf{h}}_i^{1} \ | \ {\textbf{h}}_i^{2} \ | \dots | \ {\textbf{h}}_i^{m} \; \} \end{aligned}$$
(8)

Here, \(\{ \;| \; \}\) denotes the split operation along the dimension and m is the number of channels. Then, each prototype sub-vector \({\textbf{p}}_c^k \) of a specific event label is filled with the token feature sub-representations \({\textbf{h}}_i^k \) of that label. Finally, the m sub-prototypes of each class are aggregated into a unified feature space, as follows:

$$\begin{aligned} {\textbf{p}}_c^{k}=\frac{\iota }{\left| S_c\right| } \sum _{i \in {\mathscr {S}}(c)} {\textbf{h}}_{i}^{k} \end{aligned}$$
(9)
$$\begin{aligned} {\textbf{p}}_c= \{\; {\textbf{p}}_c^{1} \;; \; {\textbf{p}}_c^{2} \;; \; \dots \;; \; {\textbf{p}}_c^{m} \;\} \end{aligned}$$
(10)

where the dimensions of \({\textbf{p}}_c^k \) and \({\textbf{h}}_i^k \) are both \({\textbf{R}}^{(d_h/m)}\), \(\iota \) is a scalar temperature and \(\{ \;; \; \}\) denotes the concatenation operation along the dimension. On the one hand, MP can enrich the prototype feature information of each event type from multiple perspectives.

The token features of trigger words belonging to the same event type, after encoding, not only exhibit proximity in the language space but also encapsulate distinct representations for each trigger word. A well-generalized prototype representation should have both a common representation, indicating the similarity of features among various trigger words, and specific representations for each trigger word. We extend this principle to the dimensionality of the feature representation. The MP model divides the token features of the K trigger words into m channels along the dimension. Different channels of different trigger words undergo linear combinations in spatial dimensions. Aggregating the feature representations of different trigger words through multi-channel fusion yields m sub-prototype representations. Some of these m sub-prototype representations may include a common representation describing the event type, while others may contain specific representations of different trigger words. Assuming \({\textbf{h}}_1^{1}\), \({\textbf{h}}_2^{1}\), and \({\textbf{h}}_3^{1}\) contain common representations for the same event type, the aggregated sub-prototype representation \({\textbf{p}}^{1}\) exhibits generality. Assuming \({\textbf{h}}_1^{2}\) and \({\textbf{h}}_2^{2}\) represent common features of an event, while \({\textbf{h}}_3^{2}\) captures the specificity of a trigger word, the aggregated sub-prototype representation \({\textbf{p}}^{2}\) exhibits both generality and specificity.

Traditional prototype representations predominantly incorporate general representations of a particular event type. MP prototype representations not only include general representations but also amplify the specificity of different trigger words, resulting in superior generalization capabilities for multi-channels prototype representations. On the other hand, MP also alleviates the category imbalance caused by too many non-trigger words ("O" class) by aggregating different feature sub-spaces to extract more representative features. A minimal sketch of the multi-channel prototype computation follows.
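The sketch below follows formulas 6-10 as written, with `iota` the scalar temperature of formula 9; it is an illustration under those equations, not the authors' released implementation:

```python
import torch

def multi_channel_prototypes(H_support, labels, n_classes, m, iota=1.0):
    """Sketch of formulas 6-10: split each token representation into m
    channel sub-vectors (formula 8), average each channel over the tokens
    of a class (formula 9), and concatenate the m sub-prototypes back
    into one vector (formula 10)."""
    T, d_h = H_support.shape
    H_split = H_support.view(T, m, d_h // m)       # formula 8: m channels
    protos = []
    for c in range(n_classes + 1):                 # class 0 is the "O" type
        mask = labels == c
        sub = iota * H_split[mask].mean(dim=0)     # formula 9: per-channel mean
        protos.append(sub.reshape(d_h))            # formula 10: concatenation
    return torch.stack(protos)                     # (N+1, d_h)
```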

4.3 Focal Loss

When calculating the cross-entropy loss between the query set and the prototypes, we noticed that a significant portion of the loss \({\mathscr {L}}_{C E}\) is accounted for by tokens \(x_{i}\) labeled with the "O" tag (non-trigger words). This causes \({\mathscr {L}}_{C E}\) to be dominated by the easy examples (non-trigger words), so that the model cannot be trained effectively with the \({\mathscr {L}}_{C E}\) loss.

To solve this issue, we utilize the focal loss [23] in place of \({\mathscr {L}}_{C E}\), which suppresses easy samples and allows more positive and negative hard samples to play a larger role in the loss, better solving the problem of category and hard-easy sample imbalance. The focal loss is a modification of the cross-entropy loss of formulas 4 and 5, as follows:

$$\begin{aligned} p_{t}=\frac{\exp \left( sim\left( {\textbf{h}}_i , \;{\textbf{p}}_c\right) \right) }{\sum _{c^{\prime }\in {\mathscr {C}}} \exp \left( sim\left( {\textbf{h}}_i , \; {\textbf{p}}_{c^{\prime }}\right) \right) } \end{aligned}$$
(11)
$$\begin{aligned} \textrm{FL}\left( p_{\textrm{t}}\right) =-\alpha _{t}\left( 1-p_{\textrm{t}}\right) ^\gamma \log \left( p_{\textrm{t}}\right) \end{aligned}$$
(12)
$$\begin{aligned} {\mathscr {L}}^{focal}=\sum _{\left( x_i, y_i\right) \in Q } \textrm{FL}\left( p_{\textrm{t}}\right) \end{aligned}$$
(13)

The green arrows in Fig. 3 indicate this process. Here, \({\textbf{h}}_i\) is the token representation of the query set and sim() denotes the similarity calculator. The \(\gamma > 0 \) controls the contribution of hard and easy samples, reducing the impact of easy samples and focusing more on positive and negative hard samples. The \(\alpha _{t}\) controls the weight of positive and negative samples.
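A minimal sketch of formulas 11-13, with \(\gamma =2\) and \(\alpha _t=1\) as in Sect. 5.3; the logits are assumed to be the similarity scores of formula 11:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha_t=1.0):
    """Sketch of formulas 11-13: logits are similarity scores between each
    query token and the N+1 prototypes, shape (Tq, N+1); targets holds the
    gold label indices. gamma=2 and alpha_t=1 follow Sect. 5.3."""
    log_p = F.log_softmax(logits, dim=-1)                 # formula 11
    log_pt = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    fl = -alpha_t * (1.0 - pt) ** gamma * log_pt          # formula 12
    return fl.sum()                                       # formula 13
```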

4.4 Contrastive Learning

In the self-supervised domain, contrastive learning aims to pull together an anchor and a "positive" sample and push the anchor apart from many "negative" samples in the embedding space. To model the differences between samples from different sources, we need to construct a contrastive loss. We refer to a set of N samples as a "batch" and to the 2N augmented samples of the batch as I. The formula is as follows:

$$\begin{aligned} {\mathscr {L}}^{\text {self}}=-\sum _{i\in I}\log \frac{\exp \left( {\varvec{h}}_i \cdot {\varvec{h}}_{j(i)} \,/\tau \right) }{\sum _{a\in A(i)} \exp \left( {\varvec{h}}_i \cdot {\varvec{h}}_a \,/\tau \right) } \end{aligned}$$
(14)

The contrastive loss encourages the similarity of positive pairs to be large enough (the numerator in formula 14) and the similarity of negative pairs to be small enough (the denominator). Here, \(h_{i}\) and \(h_{j(i)}\) come from the same sample, with \(h_{i}\) called the anchor and \(h_{j(i)}\) the positive instance. The symbol \(\cdot \) is generally the dot product operation, \(\tau \) is a scalar temperature parameter and \(A(i) \equiv I \backslash \{i\}\). Note that each anchor \(h_{i}\) has one positive pair and \(2N-2\) negative pairs.

4.4.1 Supervised Contrastive Loss

Khosla et al. [19] proposed the supervised contrastive loss, which allows the model to leverage label information to split the positive and negative samples of an anchor, so that each anchor can have multiple positive samples. In supervised contrastive learning, clusters of points belonging to the same class are pulled together in the embedding space while clusters of samples from different classes are simultaneously pushed apart. Figure 5a shows the architecture of supervised contrastive learning, and the formula of the supervised contrastive loss is as follows:

$$\begin{aligned} {\mathscr {L}}^{\text{ sup } }=-\sum _{i \in I} {\mathscr {L}}_{i}^{\text{ sup } } \end{aligned}$$
(15)
$$\begin{aligned} {\mathscr {L}}_{i}^{\text{ sup } }= \sum _{p \in P(i)} \log \frac{\exp \left( {\varvec{h}}_i \cdot {\varvec{h}}_p \,/ \tau \right) }{\sum _{a \in A(i)} \exp \left( {\varvec{h}}_i \cdot {\varvec{h}}_a \,/ \tau \right) } \end{aligned}$$
(16)
Fig. 5
figure 5

The architecture of supervised contrastive network and our proposed multi-channels contrastive network

4.4.2 Multi-Channels Support Contrastive Learning

In the few-shot learning scenario, contrastive learning faces the same problems mentioned in Sects. 4.2.1 and 4.3. Specifically, the number of tokens labeled "O" (non-trigger words) is far greater than that of trigger words, which leads to severe category and hard-easy sample imbalance. When calculating the contrastive loss on the support set, a significant portion of the loss is accounted for by "O"-tag tokens, which prevents the model from being trained effectively with an appropriate loss.

To address these issues, we propose an extension to supervised contrastive learning, MSS-CL, which allows the model to calculate the supervised contrastive loss in multiple feature sub-spaces. The purple arrows in Fig. 3 represent this process. Following [48], the token feature representation \({\textbf{h}}_{i}\) of the support set is first projected into a latent space by a 2-layer MLP, which improves the token representation quality for contrastive learning. Then, the token representation is projected into multiple feature spaces along the dimension, as in formula 8.

$$\begin{aligned} \tilde{{\textbf{h}}}_i= \textbf{MLP}({\textbf{h}}_{i}) \end{aligned}$$
(17)

MSS-CL calculates the contrastive loss on the support set, and the formula is as follows:

$$\begin{aligned} {\mathscr {L}}_{i,k}^{\text{ mss-cl }}= \sum _{p \in P(i)} \log \frac{\exp \left( \tilde{{\varvec{h}}_{i}^{k}} \cdot \tilde{{\varvec{h}}_{p}^{k}} \,/ \tau \right) }{\sum _{a \in A(i)} \exp \left( \tilde{{\varvec{h}}_i^{k}} \cdot \tilde{{\varvec{h}}_a^{k}} \,/ \tau \right) } \end{aligned}$$
(18)
$$\begin{aligned} {\mathscr {L}}^{\text{ mss-cl }}=- \frac{1}{m}\sum _{i \in I} \sum _{k \in m} {\mathscr {L}}_{i,k}^{\text{ mss-cl } } \end{aligned}$$
(19)

Here, Fig. 5b shows the architecture of the multi-channels contrastive network, and the figure corresponds to the \(\frac{1}{m}\sum _{k \in m} {\mathscr {L}}_{i,k}^{\text{ mss-cl } }\) term of formula 19. \(I\) denotes the set of tokens \(x_{i}\) and their labels \(y_{i}\) in the support set S. The m is the number of channels, which is the same as the m in Sect. 4.2.1.

MSS-CL and MP employ a similar mechanism, dividing the token features of the N \(\times \) K trigger words into m channels along the dimension, enabling the N \(\times \) K \(\times \) m channel representations of trigger words from the N event categories to undergo contrastive learning in spatial dimensions. In comparison to traditional supervised contrastive learning, MSS-CL introduces additional contrastive interactions between trigger words of the same class and those of different classes within a single episode. This strengthens the model's ability to pull together trigger words of the same category in the feature representation space while simultaneously pushing apart clusters of samples from different categories. As described in Sect. 4.2.1, within different channels of token representations, some may contain a generic representation describing the event type, while others may capture the specificity of trigger words. Assuming \({\textbf{h}}_1^{1}\) and \({\textbf{h}}_2^{1}\) contain generic representations of different trigger words from the same event type, and \({\textbf{h}}_1^{2}\) and \({\textbf{h}}_2^{2}\) capture the specificity of different trigger words from the same event type, MSS-CL not only possesses the capability to bring together the generic representations of different trigger words from the same category but also learns the similarity between the specific representations of different trigger words from the same category. This significantly contributes to enhancing the model's generalization. A minimal code sketch of the MSS-CL loss follows.
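The following sketch implements formulas 18-19, assuming the projected support representations are stacked into a matrix; the temperature value \(\tau =0.1\) is an illustrative assumption, not a value reported in the paper:

```python
import torch

def mss_cl_loss(H_tilde, labels, m, tau=0.1):
    """Sketch of formulas 18-19: the supervised contrastive loss computed
    independently in each of the m channel sub-spaces of the projected
    support representations H_tilde (T, d_h)."""
    T, d_h = H_tilde.shape
    H = H_tilde.view(T, m, d_h // m).transpose(0, 1)   # (m, T, d_h/m)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-label mask
    eye = torch.eye(T, dtype=torch.bool, device=H_tilde.device)
    pos = same & ~eye                                  # P(i): positives of i
    loss = 0.0
    for k in range(m):                                 # one term per channel
        sim = H[k] @ H[k].T / tau                      # pairwise dot products
        sim = sim.masked_fill(eye, float("-inf"))      # A(i) = I \ {i}
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        loss = loss - log_prob[pos].sum()              # formula 18, summed over i
    return loss / m                                    # formula 19: average over k
```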

In addition, we add the prototype-query contrastive loss (pqcl) proposed by [48] to MPC-CA; the formula is as follows:

$$\begin{aligned} {\mathscr {L}}^{\text{ pqcl }}=-\sum _{\left( x_{i}, y_{i}\right) \in Q } {\mathscr {L}}_{i}^{\text{ pqcl } } \end{aligned}$$
(20)
$$\begin{aligned} {\mathscr {L}}_{i}^{\text{ pqcl } }= \log \frac{\exp \left( \tilde{{\varvec{h}}_i} \cdot {\textbf{p}}_{c=y_{i}} \,/ \tau \right) }{\sum _{c^{\prime }\in {\mathscr {C}}} \exp \left( \tilde{{\varvec{h}}_i} \cdot {\textbf{p}}_{c^{\prime }} \,/ \tau \right) } \end{aligned}$$
(21)

The blue arrows in Fig. 3 represent this process. Here, the token representation \(\tilde{{\varvec{h}}_i} \in Q \) and \({\textbf{p}}_{c=y_{i}}\) has the same label as \(\tilde{{\varvec{h}}_i}\). We also ran experiments combining pqcl with the multi-channel operation, but the experimental results were not promising.

4.5 Condition Adversarial Attacks

In the few-shot setting, models generally suffer from poor generalization caused by limited supervised data. To address this problem, we introduce adversarial learning. Adversarial learning adds a certain perturbation to the model input during the training stage. The model thereby acquires strong recognition of different attack samples, which effectively improves its robustness and also its generalization ability.

For models that incorporate various input representations, including word embeddings, segment embeddings and position embeddings, adversarial learning only modifies the concatenated word embeddings, leaving other components of the sentence representation unchanged [50]. Given a language model (encoder) as a function \({\varvec{y}}=f_{\varvec{\theta }}({\varvec{X}})\), where \({\varvec{X}}\) is the word embeddings and \(\varvec{\theta }\) denotes all the learnable parameters, adversarial learning adds a perturbation \(\varvec{\delta }\) to \({\varvec{X}}\), giving \({\varvec{y}}^{\prime }=f_{\varvec{\theta }}({\varvec{X}}+\varvec{\delta })\), and seeks to leave the model's prediction unchanged while minimizing the maximum risk under the constrained perturbation.

In this paper, we propose an extension to adversarial learning, CA, which only conducts adversarial training when a predefined condition is met. The condition adversarial attack can be applied to any adversarial training method, and we select PGD [28] as the basic adversarial learning method. PGD is an iterative attack: compared with regular FGM, which performs only one iteration, PGD performs multiple iterations, taking a small step each time and projecting the perturbation onto a norm ball at each iteration. Specifically, for the constraint \(\Vert \varvec{\delta }\Vert _F \le \epsilon \), PGD takes the following step (with step size \(\alpha \)) in each iteration:

$$\begin{aligned} g\left( \varvec{\delta }_t\right) =\nabla _{\varvec{\delta }} L\left( f_{\varvec{\theta }}\left( {\varvec{X}}+\varvec{\delta }_t\right) , y\right) \end{aligned}$$
(22)
$$\begin{aligned} \varvec{\delta }_{t+1}=\Pi _{\Vert \varvec{\delta }\Vert _F \le \epsilon }\left( \varvec{\delta }_t+\alpha g\left( \varvec{\delta }_t\right) /\left\| g\left( \varvec{\delta }_t\right) \right\| _F\right) \end{aligned}$$
(23)

where \(\Pi _{\Vert \varvec{\delta }\Vert _F \le \epsilon }\) projects the perturbation back into the specified range. The pseudocode of the condition adversarial learning is listed in Algorithm 1. Specifically, the predefined condition is that the evaluation metric, the F1 score on the dev set, begins to decrease. When the model reaches a stationary stage, it begins adversarial learning, which effectively avoids introducing too much perturbation and causing noise. It is worth noting that this predefined condition can be flexibly chosen based on the evaluation criteria of the task. Conditional adversarial learning not only enhances the model's adversarial robustness but also indirectly increases the diversity of training samples. Suppose that during training an episode \(i\) employs adversarial learning and perturbations are applied to all examples in its support set, while a subsequent episode \(j\) does not apply adversarial learning but, by chance, samples some of the same examples. The model is then trained on the same samples both with and without perturbations, thereby improving both its generalization and its robustness. Additionally, selectively engaging in adversarial learning not only enhances model performance but also saves training time compared to traditional adversarial learning. A code sketch of the conditional attack is given after Algorithm 1.

In this paper, we conducted comparative experiments in Sect. 6.4 with different tasks and evaluation metrics, demonstrating the generalizability of our proposed conditional adversarial training for multiple tasks. We also compared the performance of the conditional adversarial mechanism combined with different adversarial models, showing that conditional adversarial training can be adapted to various adversarial models.

Algorithm 1
figure a

Condition Adversarial Attacks (based on PGD)
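To make Algorithm 1 concrete, here is a minimal sketch of the conditional PGD attack on word embeddings, assuming a latched condition (adversarial training switches on once the dev F1 drops, and stays on) and illustrative hyper-parameter values; `loss_fn` is a hypothetical closure that computes the episode loss from (perturbed) embeddings, and the step count loosely mirrors the adversarial frequency of 3 mentioned in Sect. 5.3:

```python
import torch

class ConditionMonitor:
    """Latch that enables adversarial training once the dev-set F1
    begins to decrease (the predefined condition of Algorithm 1)."""
    def __init__(self):
        self.prev_f1, self.adv_on = 0.0, False

    def update(self, dev_f1):
        if dev_f1 < self.prev_f1:
            self.adv_on = True
        self.prev_f1 = dev_f1
        return self.adv_on

def pgd_perturbation(embeds, loss_fn, steps=3, alpha=0.3, epsilon=1.0):
    """Formulas 22-23: iterative ascent on the perturbation delta with
    projection back onto the Frobenius-norm epsilon-ball."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(embeds + delta)               # formula 22: forward pass
        grad, = torch.autograd.grad(loss, delta)
        delta = delta + alpha * grad / grad.norm()   # gradient ascent step
        if delta.norm() > epsilon:                   # formula 23: projection
            delta = delta * (epsilon / delta.norm())
        delta = delta.detach().requires_grad_(True)
    return delta.detach()

# Usage sketch: perturb only when the condition monitor has latched on.
# monitor = ConditionMonitor()
# if monitor.update(dev_f1):
#     delta = pgd_perturbation(word_embeds, loss_fn)
#     loss = loss_fn(word_embeds + delta)   # train on the attacked input
```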

4.6 Training Procedure

In this section, we provide a detailed explanation of the training process. Assuming an N-way-K-shot setting, for each episode we randomly select N classes from the training set event types and pick N\(\times \)K instances as the support set and N\(\times \)Q (where Q=1) instances as the query set. Firstly, the support set and query set enter the encoder layer to obtain vector representations S and Q, respectively. Then, we input S into the multi-channels prototypical network module to obtain prototype representations of the N event types, conduct similarity calculations between Q and the N prototype representations, and calculate the loss \({\mathscr {L}}^{focal}\) based on the true labels of the query set. Meanwhile, S is fed into the multi-channels support contrastive learning module to calculate \({\mathscr {L}}^{mss-cl}\), while Q and the N prototype representations are input into the prototype-query contrastive learning module to calculate \({\mathscr {L}}^{pqcl}\). In the training process of each episode, we also add the condition adversarial attack mechanism, based on the model's performance on the validation set in the previous episode; please refer to Algorithm 1 for the detailed process. Finally, the total loss of the model consists of the focal loss, the multi-channels support contrastive loss and the prototype-query contrastive loss, as follows:

$$\begin{aligned} {\mathscr {L}}={\mathscr {L}}^{focal}+\alpha {\mathscr {L}}^{mss-cl}+\beta {\mathscr {L}}^{pqcl} \end{aligned}$$
(24)

where \(\alpha \) and \(\beta \) are trade-off parameters balancing the total loss \({\mathscr {L}}\). A compact sketch of one training step follows.
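As a summary, the following minimal sketch ties one training episode together, using formula 24 with \(\alpha =\beta =0.5\) as in Sect. 5.3; `model.encode`, `model.project`, and `pqcl_loss` are hypothetical helpers standing in for Sect. 4.1, the 2-layer MLP of Sect. 4.4.2, and formulas 20-21, while the other functions are the sketches given above:

```python
import torch

def train_episode_step(model, optimizer, support, query, N, m):
    # Encode support and query sets (Sect. 4.1); hypothetical helpers.
    H_s, y_s = model.encode(support)
    H_q, y_q = model.encode(query)
    # Multi-channel prototypes (Sect. 4.2.1); negative squared Euclidean
    # distance as the similarity, following the Proto backbone.
    protos = multi_channel_prototypes(H_s, y_s, N, m)
    logits = -torch.cdist(H_q, protos) ** 2
    # Total loss, formula 24: focal + weighted contrastive terms.
    alpha, beta = 0.5, 0.5
    loss = (focal_loss(logits, y_q)
            + alpha * mss_cl_loss(model.project(H_s), y_s, m)
            + beta * pqcl_loss(model.project(H_q), y_q, protos))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```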

5 Experimental Setup

5.1 Datasets

5.1.1 FewEvent

We conduct experiments on the benchmark dataset FewEvent proposed by [4], which is currently the largest few-shot dataset for event detection. FewEvent is built from ACE2005, KBP2017 and external knowledge bases such as Freebase and Wikipedia, and it contains 70,852 instances covering 19 event types graded into 100 event subtypes in total. Following [48], we use 80 event types as the training set, 10 event types as the dev set, and the remaining 10 event types as the test set, where the event types of each subset are disjoint. In addition, based on the above data preprocessing, we have also carried out further data processing for FewEvent and renamed the result FewEvent++. Table 2 shows the statistics of the FewEvent and FewEvent++ datasets. For FewEvent++, we performed the following data processing operations; more details are listed in Appendix A.

(1) We utilize Stanford CoreNLP to perform re-tokenization.

(2) We eliminated duplicate instances that have the same text and trigger word position.

(3) We removed instances where the trigger word position exceeds the maximum sentence length (128).

Table 2 The statistics of FewEvent and FewEvent++ dataset

5.1.2 GLUE

We conducted ablation experiments to validate the effectiveness of conditional adversarial learning on the GLUE benchmark [40]. The GLUE benchmark comprises 9 natural language understanding tasks. Among these, one is a regression task, while the remaining 8 are classification problems. For our comparative experiments, we selected six tasks: QQP, MNLI, QNLI, MRPC, CoLA, and SST-2. All of these tasks involve classification, and the evaluation metric is either accuracy or F1 score. As the test set for GLUE is not publicly available, we partitioned a portion of the training data to create our own test sets. The statistics of the six selected tasks are presented in Table 3.

Table 3 Task descriptions and statistics. All tasks are binary classification, except MNLI (three classes)

5.2 Evaluation Metric

For the FSED task, following previous works [49], we adopt the micro F1 score as the evaluation metric and report averages and standard deviations over 5 runs for a fair comparison. Following [3], we use the same episodic evaluation method to evaluate our model in few-shot settings by randomly selecting episodes containing N-way-K-shot samples from the test set. For the GLUE benchmark, we adopt the evaluation criteria corresponding to each corpus; the specific information can be found in Table 3.

5.3 Implementation Details

We utilize the bert-base-uncased model implemented through the Hugging Face library as the text encoder to obtain 768-dimensional token-level embeddings. The maximum sentence length is 128, and we select AdamW with a 1e-5 learning rate as the optimizer. The weight decay rate is 0.01, the warmup step is 100 and the batch size is 1. During the training stage, the model is trained on 20,000 episodes of the training set, the adversarial frequency K is set to 3, and hyper-parameters are tuned by running the model on 1,000 episodes of the dev set. During the testing stage, model performance is evaluated using 3,000 episodes of the test set. For the multi-channels prototypical network and the multi-channels support contrastive loss, the number of channels m is set to 4. The hyper-parameters \(\alpha \) and \(\beta \) are both set to 0.5. The \(\gamma \) value of the focal loss is set to 2, and the \(\alpha _{t}\) is 1.

For the few-shot learning experiments, we evaluate the model's performance in the 5-way-5-shot, 5-way-10-shot, 10-way-5-shot, and 10-way-10-shot settings. Specifically, in the N-way-K-shot setting, the number of instances of the support set is \(N \times K\) and that of the query set is \(N \times 1\) in each episode. For the first three settings, we execute the experiments using PyTorch 1.8.1 on an Nvidia 3080Ti GPU with an Intel i5-12400KF CPU. For the 10-way-10-shot setting, the experiments are run on an Nvidia RTX A5000 GPU with an Intel(R) Xeon(R) Platinum 8358P CPU. To ensure reproducibility and fairness, we use five seeds (6369, 6667, 7457, 6672, 7238) to execute experiments on the FewEvent and FewEvent++ datasets and five seeds (42, 63, 77, 86, 95) on the GLUE dataset. Our code and dataset can be obtained from our code repository.

5.4 Baselines

To investigate the effectiveness of our proposed model (MPC-CA), we compare it with a range of baselines on FewEvent, following [48]. These models can be classified into two categories: Pipeline methods and Joint methods.

For Pipeline models, this type of method first performs trigger word identification and then classifies the trigger word into an event type.

(1) DMBPN [4] was the first to propose the benchmark FewEvent dataset and used dynamic memory to preserve contextual information of event mentions. For a fair comparison, we replace its encoder with BERT.

(2) LoLoss [20] proposed to exploit the matching information between the examples in the support set as additional training signals based on metric learning.

(3) MatchLoss [21] extends LoLoss to consider intra-cluster matching and inter-cluster information. In addition, the re-implemented versions of LoLoss and MatchLoss add a trigger identification module before their methods [3].

For Joint models, this type of method formulates FSED into a sequence labeling task.

(1) We re-implement three typical few-shot classification baselines: Proto [34], which uses the negative value of the square of the Euclidean distance as the similarity metric between prototype and query points; Match [39], which uses the cosine function as the similarity metric; and Proto-Dot, which uses the dot product as the similarity metric.

(2) Besides, we also compare four CRF-based methods with BIO tagging: Vanilla CRF adds a simple CRF layer behind the baseline models; CDT [17] is a few-shot NER method, which we adapt to the FSED task to replace the transition module; PA-CRF [3] improved the conditional random field (CRF) method by treating the transition score as a random variable and utilizing a Gaussian distribution to approximate event type dependencies; and [36] proposed different ways to better exploit the information contained in the BERT pre-trained model for FSED, from which we selected the best-performing strategy for comparison, namely PA-CRF with the weighted BERT-layer strategy.

(3) Then, we also compare ProAcT [22], HCL-TAT [48] and Meta-Event [46] on the FSED task on the FewEvent dataset. (1) ProAcT proposes to model the relations between two consecutive training episodes by introducing cross-task prototypes. Essentially, during the training process, ProAcT utilizes K\(\times \)2 and Q\(\times \)2 instances to construct prototypes in the N-way-K-shot setting. In addition, ProAcT formulates FSED as an event classification task, in which, unlike in event detection, the labels of the query set are exposed when constructing the prototypes during training. Therefore, we have re-implemented it as an event detection task. (2) HCL-TAT proposed a hybrid contrastive learning method to generate more discriminative representations for both the support and query sets. (3) Meta-Event proposed a unified meta-learning framework for both zero- and few-shot event detection, designed to exploit prompt tuning and contrastive learning for quick adaptation to unseen tasks. In the Meta-Event article, the support set and query set both contain N \(\times \) K instances in each episode, which differs from our setting, where the query set contains N \(\times \) 1 instances. Another important point is that, from the code provided by the authors of Meta-Event, during model validation and testing the model still participates in gradient updates when the support set is input, which differs from our validation and testing strategy. For a fair comparison, we have modified the validation and testing code so that it does not participate in gradient updates.

6 Results and Discussion

6.1 Main Results

Following the HCL-TAT, we also utilize Proto as the backbone of our proposed model, MPC-CA. Table 4 shows the experimental results on the FewEvent test set.

Table 4 F1 scores (\(10^{-2}\)) of different models on the FewEvent test set

Compared with Pipeline models    (1) The majority of Joint models outperform the three pipeline models, as the trigger word identification stage in pipeline models may not possess sufficient adaptability to novel event triggers. As a result, cascading errors can occur, leading to poor performance in the trigger word identification stage and limiting the performance of pipeline models [3]. (2) Although DMBPN performs better than the other two pipeline models, it still struggles to handle FSED due to cascading errors. Compared with MPC-CA, DMBPN exhibits a large performance gap of approximately 32%. These findings demonstrate the effectiveness of the Joint framework and the MPC-CA model.

Compared with Metric-based models    (1) Proto achieves the best results among the three metric-based few-shot classification models, indicating that the negative value of the square of the Euclidean distance is more suitable as the similarity metric between prototype and query points for FSED tasks. (2) Additionally, the results in the 10-shot settings outperform those in the 5-shot settings, suggesting that more training samples benefit FSED tasks. Thus, the condition adversarial attacks of MPC-CA can effectively improve model performance by indirectly increasing the number of training samples.

Compared with CRF-based models    (1) The four CRF-based models' performance shows further improvement over the metric-based models, indicating the effectiveness of modeling label dependency with CRF. (2) Our proposed MPC-CA model outperforms the four CRF-based models by approximately 7%, 8%, 7%, and 8% in the four few-shot settings, respectively. We believe that an excessive number of labels can significantly impact model performance. In few-shot learning scenarios, CRF-based models using BIO tags may not efficiently model label dependency due to the limited training data in an episode. From the results, we notice that CRF-based models show a larger gap between the 10-way-K-shot and 5-way-K-shot settings than other models.

Compared with ProAcT, HCL-TAT and Meta-Event    (1) ProAcT achieves 61.10% and 61.18% in the 5-way-5-shot and 5-way-10-shot settings, respectively, but performs poorly in the 10-way settings. This indicates that ProAcT struggles to cope with the few-shot setting when the number of categories is large. Besides, our proposed MPC-CA outperforms ProAcT by 6.17%, 8.96%, 23.59%, and 29.06% in the four few-shot settings, respectively. (2) Based on the HCL-TAT model, MPC-CA proposes the multi-channels prototypical network, multi-channels support contrastive loss, and condition adversarial attack methods, achieving better results in the four few-shot settings on the FewEvent dataset. Specifically, MPC-CA outperforms HCL-TAT by 0.31%, 1.34%, 1.5%, and 1.82% in the four few-shot settings, respectively, demonstrating the effectiveness of these proposed methods. (3) Similar to ProAcT, Meta-Event performs well in the 5-way-5-shot and 5-way-10-shot settings, but not in the 10-way-5-shot and 10-way-10-shot settings. In addition, we found that Meta-Event is unstable across all few-shot settings, with significant differences in F1 scores under different random seeds, indicating that its robustness is not as good as that of the model proposed in this paper. Besides, MPC-CA outperforms Meta-Event by 1.27%, 9.48%, 21.06%, and 29.06% in the four few-shot settings, respectively.

MPC-CA achieves the best results, effectively addressing the problem of insufficient prototype representation learning by introducing conditional adversarial attacks. The multi-channels prototypical and multi-channels support contrastive learning solve the categories and hard-easy sample imbalance resulting from excessive "O" class tokens.

In addition, we also perform experiments on the FewEvent++ dataset. The results demonstrate that all models achieve significant improvements, particularly previously underperforming models. Moreover, MPC-CA again outperforms all other models in all four few-shot settings. Table 5 and Fig. 6 show all the experimental results on the FewEvent++ dataset.

Table 5 F1 scores (\(10^{-2}\)) of different models on the FewEvent++ test set
Fig. 6
figure 6

Visualization results of models on FewEvent and FewEvent++ among four few-shot learning settings. PA-CRF-LW denotes the model of [36]

FewEvent++ Compared with FewEvent    (1) We observe that the metric-based models Proto and Proto-dot achieve significant improvements of approximately 5% and 9% across the four few-shot learning settings, respectively. PA-CRF and [36] also show slight improvements in the 5-way settings. ProAcT achieves 64.34%, 68.10%, 48.46% and 52.29% in the four settings, respectively; in particular, for the 10-way settings, ProAcT gains larger improvements of 6.36% and 13.53%. HCL-TAT brings improvements of 4.01%, 2.37%, 4.01%, and 4.67% in the four settings, respectively. Meta-Event achieves significant improvements of 6%, 10.67%, 10.26% and 13.48% across the four few-shot learning settings, respectively. These improvements indicate that our data processing approaches effectively enhance data quality. (2) In addition, we observed a decrease in the standard deviation of the F1 scores of the baselines on FewEvent++, such as PA-CRF, the model of [36], HCL-TAT and Meta-Event. This indicates that the quality of the FewEvent++ dataset is higher and that the data processing performed in this article is effective. (3) Moreover, MPC-CA brings improvements of 5.39%, 4.26%, 4.41% and 5.33%, and all results exceed 70% in the four few-shot settings. On the enhanced dataset, MPC-CA achieves the best results among the compared models and shows a greater improvement than HCL-TAT. MPC-CA outperforms HCL-TAT by 1.69%, 3.23%, 1.93%, and 2.48% in the four few-shot settings, with a larger gap than on FewEvent. These results demonstrate the effectiveness of MPC-CA in learning better feature representations, achieving better classification results, and having stronger generalization ability.

6.2 Ablation Study

To investigate the effectiveness of each component in the MPC-CA model, we conduct ablation studies in four few-shot settings on FewEvent++. The results of these experiments are presented in Table 6.

Table 6 F1 scores (\(10^{-2}\)) of ablation study results on FewEvent++ test set

Effectiveness of MP   To further demonstrate the effectiveness of the multi-channels prototypical network (MP), we replace it with the vanilla prototypical network (Sect. 4.2) and conduct experiments in the four few-shot settings. The results show that without MP, the F1 scores decrease by 0.91%, 2.47%, and 0.37% in the 5-5, 10-5, and 10-10 few-shot settings, respectively. In particular, the F1 score declines significantly in the 10-way-5-shot setting, where there are more categories and fewer samples, and the vanilla prototypical network is unable to acquire adequate prototype feature information. These results suggest that MP can effectively address the issue of inadequate sample information by enriching prototype feature information from multiple perspectives. Furthermore, MP aggregates different feature sub-spaces to extract more representative features, which helps alleviate the impact of the category imbalance caused by too many "O"-class words.

Effectiveness of MMS-CL   To investigate the usefulness of multi-channel support contrastive learning (MMS-CL), we replace it with SSQL from HCL-TAT. The F1 scores decrease by 1.7%, 0.17%, 1.59%, and 0.58% in the four few-shot settings, respectively. Compared with SSQL, MMS-CL strengthens the comparison of class relationships and alleviates the category and hard-easy sample imbalance caused by the large number of "O"-class tokens. MMS-CL promotes contrastive learning of support-set instances across different feature spaces, which enriches trigger feature representations from multiple sub-spaces and reduces the proportion of the contrastive loss contributed by non-trigger words.
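The sketch below shows what a per-channel MMS-CL objective might look like: a supervised contrastive loss over support tokens in one channel's sub-space, with anchors labeled "O" down-weighted so that non-trigger tokens contribute a smaller share of the loss. The o_weight parameter and this particular weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
# A hedged sketch of one channel's support contrastive loss, assuming a
# SupCon-style objective with "O" anchors down-weighted (o_weight is a
# hypothetical hyperparameter). Applied per channel, then averaged.
import torch
import torch.nn.functional as F

def support_contrastive_loss(z, labels, o_label=0, o_weight=0.1, tau=0.1):
    # z: (n, d) channel features of support tokens; labels: (n,) class ids
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau  # temperature-scaled pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = ((labels[:, None] == labels[None, :]) & ~self_mask).float()
    # Log-softmax over all non-self pairs for each anchor.
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-probability of positives per anchor (0 if no positives).
    loss_i = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    # Down-weight "O"-labeled anchors to curb their share of the loss.
    w = torch.where(labels == o_label,
                    torch.full_like(loss_i, o_weight),
                    torch.ones_like(loss_i))
    return (w * loss_i).sum() / w.sum()
```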

Effectiveness of MPC   To investigate the effectiveness of the proposed multi-channel prototype contrastive (MPC) learning, we remove MP and MMS-CL simultaneously. The F1 scores decrease by 0.97%, 0.64%, 2.29%, and 0.89% in the four few-shot settings, respectively. The drop is largest in the 10-way-5-shot setting, which has more categories and fewer samples. MP and MMS-CL complement each other across multiple feature sub-spaces, jointly addressing the problems of limited training samples and category imbalance. In addition, we find that the F1 scores remain higher overall than those of HCL-TAT in the four few-shot settings even when FL or CA is removed, which demonstrates that MPC effectively enriches prototype feature representations and alleviates category imbalance.

Effectiveness of FL   To verify that focal loss effectively suppresses easy samples and lets hard positive and negative samples contribute more to the loss, we replace the focal loss with cross-entropy loss. The model drops by 1.26%, 0.2%, 1.48%, and 1.53% in the four settings, respectively. These results indicate that focal loss effectively addresses the problem of \({\mathscr {L}}_{CE}\) being dominated by the loss of the many easy examples (tokens labeled with the "O" tag) in the FSED setting.
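For reference, the standard focal loss of Lin et al. can replace token-level cross-entropy as in the minimal sketch below; gamma = 2 is the common default from that work, not necessarily the value used in MPC-CA.

```python
# A minimal focal-loss sketch for token classification, assuming the
# standard formulation FL(p_t) = -(1 - p_t)^gamma * log(p_t).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (n_tokens, n_classes); targets: (n_tokens,) gold class ids
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    # (1 - p_t)^gamma vanishes for confident, easy examples (e.g. most
    # "O" tokens), so hard examples dominate the averaged loss.
    return -(((1.0 - pt) ** gamma) * log_pt).mean()
```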

Effectiveness of CA   We investigate the impact of conditional adversarial (CA) learning on the performance of the proposed model. Removing the condition constraints and using only traditional adversarial learning decreases the F1 score by 3.43%, 1.37%, 0.59%, and 0.51% in the four few-shot settings. These results indicate that CA benefits the model by indirectly adding training samples. Moreover, compared with vanilla adversarial learning, CA's conditional constraints avoid introducing excessive perturbation on the model input and thereby avoid noise. The F1 scores decrease most notably in the 5-way settings, indicating that CA effectively improves the robustness and generalization ability of the model when there are fewer categories and limited supervised data.
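One plausible reading of the condition constraint, sketched below, is a mask that restricts the adversarial step to inputs satisfying some condition, so that only part of the batch is perturbed. The token-level mask, the L2-normalized single step, and the eps value are illustrative assumptions; the paper's actual condition and update rule may differ.

```python
# A hedged sketch of a condition-constrained adversarial step on
# embeddings; the condition mask is a hypothetical example.
import torch

def conditional_adv_perturbation(embeddings, loss_fn, condition_mask, eps=1e-2):
    # embeddings: (n_tokens, d), must have requires_grad=True
    # loss_fn: maps embeddings to a scalar, differentiable training loss
    # condition_mask: (n_tokens,) bool, True where perturbation is allowed
    loss = loss_fn(embeddings)
    grad, = torch.autograd.grad(loss, embeddings)
    # One normalized ascent step, as in FGM-style adversarial training.
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    # The condition constraint: zero out the perturbation elsewhere,
    # limiting how much noise is injected into the input.
    delta = delta * condition_mask.unsqueeze(-1).float()
    return embeddings + delta  # adversarial embeddings for a second pass
```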

6.3 Analysis of the Numbers of Multiple Channels

In this section, we analyze the performance of MPC-CA with different numbers of channels. We conduct experiments with five channel numbers, namely 1, 2, 4, 6, and 8, where the channel number \(m\) applies to both MP and MMS-CL. The detailed results are listed in Table 7, and the visualization is illustrated in Fig. 7.

Table 7 F1 scores (\(10^{-2}\)) of MPC-CA with different numbers of multiple channels on the FewEvent++ test set
Fig. 7 Visualization results of MPC-CA with different numbers of multiple channels on FewEvent++

(1) Under the 5-way-5-shot and 5-way-10-shot settings, the overall F1 scores gradually increase with the number of channels \(m\), and the model achieves its best result when \(m=8\). Under the 10-way-5-shot and 10-way-10-shot settings, the F1 scores first increase and then decline, reaching their maximum at \(m=4\). Weighing these results, we choose \(m=4\) as the basic configuration of MPC-CA. (2) In addition, we find that MPC-CA with \(m>1\) outperforms the variant without the multi-channel mechanism (\(m=1\)) in the 5-5, 5-10, and 10-5 settings, which indicates that MPC learning fully exploits information from multiple feature sub-spaces and effectively remedies the inadequate learning of prototype representations caused by limited supervised data.

Fig. 8 Visualization results of conditional FreeLB and FreeLB on the GLUE benchmark

Table 8 Results on the GLUE test sets over 5 runs with the same hyperparameters but different random seeds. Bold marks the highest number among all models and ± marks the standard deviation

6.4 Effectiveness of Condition Adversarial Learning

To further validate the effectiveness of conditional adversarial learning, we run adversarial learning experiments on the GLUE benchmark, comparing the conditional adversarial mechanism against FreeLB [50]. For comparability, we use the same step size and number of steps for conditional FreeLB and FreeLB. In addition, the encoder is bert-base-cased and the code is based on the official FreeLB implementation. We summarize the results on GLUE in Table 8, and Fig. 8 shows the visualization results.

Based on the experimental results, conditional adversarial learning with FreeLB outperforms the traditional FreeLB method across all six GLUE tasks. Specifically, for the similarity and paraphrase tasks (which determine whether two sentences are semantically equivalent), conditional FreeLB achieves a 1.58% improvement on QQP and a 0.5% improvement on MRPC over FreeLB. For the single-sentence classification tasks covering sentiment and grammatical acceptability, conditional FreeLB performs 0.83% better on CoLA and 0.44% better on SST-2. For the natural language inference tasks, conditional FreeLB achieves a 0.21% improvement on MNLI and a 0.5% improvement on QNLI. Hence, our proposed conditional adversarial learning adapts well not only to multiple tasks but also to different datasets. Moreover, it can easily accommodate various adversarial models by selectively reducing the number of adversarial examples, thereby improving training speed.

6.5 Analysis of Different Adversarial Methods

This section evaluates the performance of MPC-CA with different adversarial methods in the four few-shot settings. We compare the widely used PGD method with FreeLB, which is generally reported to be stronger. Specifically, for MPC-CA with FreeLB, we apply adversarial perturbation only to the support set due to the limited few-shot setting. Moreover, to ensure fairness, the adversarial frequency is set to 3 for both FreeLB and PGD in all experiments.
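For clarity, the sketch below shows a standard PGD perturbation in embedding space with 3 ascent steps, matching the adversarial frequency used in this comparison; the step size alpha and radius eps are illustrative values, not tuned hyperparameters from the paper.

```python
# A minimal PGD-on-embeddings sketch: repeated gradient ascent on the loss
# with projection back onto an L2 ball of radius eps around the input.
import torch

def pgd_perturb(embeddings, loss_fn, steps=3, alpha=1e-2, eps=3e-2):
    # embeddings: (n_tokens, d); loss_fn maps embeddings to a scalar loss
    delta = torch.zeros_like(embeddings, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(embeddings + delta)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend the loss along the normalized gradient direction.
        delta = delta + alpha * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
        # Project back onto the L2 eps-ball (the PGD constraint).
        norm = delta.norm(dim=-1, keepdim=True)
        delta = (delta * (eps / norm).clamp(max=1.0)).detach().requires_grad_(True)
    return (embeddings + delta).detach()
```

FreeLB differs mainly in accumulating gradients over the inner ascent steps into a single parameter update, which is what makes it cheaper per effective epoch.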

Table 9 F1 scores (\(10^{-2}\)) of MPC-CA with different adversarial methods on the FewEvent++ test set
Fig. 9 Visualization results of MPC-CA with different adversarial methods on FewEvent++

(1) From Table 9 and Fig. 9, we observe that MPC-CA(PGD) outperforms MPC-CA(FreeLB) by 0.24% and 1.15% under the 5-way-5-shot and 10-way-5-shot settings, respectively, and is slightly lower by 0.02% and 0.17% under the 5-way-10-shot and 10-way-10-shot settings. (2) Overall, MPC-CA(PGD) performs better than MPC-CA(FreeLB), especially in the 5-shot settings. These results indicate that MPC-CA with PGD is better suited to scenarios with limited samples and generalizes more strongly to new categories. However, the training time of MPC-CA(FreeLB) is shorter than that of MPC-CA(PGD), so FreeLB remains a reasonable choice as the basic adversarial method when a shorter running time is preferred. Further details are listed in Table 10.

Table 10 Training time (hours) of the ablation study on the FewEvent++ test set
Fig. 10 Visualization results of training time (hours) of models on FewEvent++

6.6 Analysis of Training Time

To further demonstrate the efficiency of our proposed MPC-CA model, we present the average training time in the four few-shot settings, as shown in Table 10 and Fig. 10. For reference, the average training time of the CRF-based models is about 2 hours across all four few-shot settings.

The results demonstrate that MPC-CA outperforms HCL-TAT in F1 score while remaining comparable in training time, with MPC-CA(FreeLB) even running much faster than HCL-TAT. This finding suggests that MPC-CA quickly acquires important feature information during training and generalizes better to new event types. This advantage is primarily due to MPC, as MP and MMS-CL explore multiple feature representation sub-spaces and rapidly identify the information that best matches the event type.

7 Conclusion and Further Work

In this paper, we propose a network (MPC-CA) to address inadequate learning of prototype representations, hard-easy sample imbalance, and category imbalance in the FSED task. Specifically, we propose multi-channel prototype contrastive learning, which consists of a multi-channel prototypical network (MP) and multi-channel support contrastive learning (MMS-CL) and extracts important feature information from multiple feature sub-spaces. Additionally, we replace the cross-entropy loss with focal loss to alleviate the hard-easy sample imbalance problem. Furthermore, we introduce conditional adversarial learning, which indirectly adds moderate training samples through adversarial attacks under a condition constraint. Finally, comparative experiments on the FewEvent and FewEvent++ datasets and the GLUE benchmark demonstrate the effectiveness of our proposed method. In the future, we plan to apply part-of-speech information to few-shot event detection and extraction tasks.