SLAN: Self-Locator Aided Network For Vision-Language Understanding
Jiang-Tian Zhai1 * Qi Zhang2 * Tong Wu2 Xing-Yu Chen2 Jiang-Jiang Liu1† Ming-Ming Cheng1†
1 VCIP, CS, Nankai University    2 Tencent Youtu Lab
{jtzhai30,j04.liu}@gmail.com, townswu@tencent.com, cmm@nankai.edu.cn
region proposals (e.g., 100) for an image. Directly selecting all regions will lead to unnecessary computational cost and may also cause the model to learn from some meaningless region-to-word pairs. The strategy to control the maximum number of selected regions has three steps. (a) Normalize all saliency scores of the regions; after this process, the scores are represented as S = {S_1, ..., S_k}, S_i ∈ [0, 1]. (b) Sort these regions in descending order according to their saliency scores. (c) Pick no more than the top T regions with saliency scores above a threshold h. Finally, we weight the region embeddings by the scores. The saliency score of each proposed region is updated with gradients from downstream vision-language supervision, which will be described in Sec. 3.3.
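To make the selection procedure concrete, the sketch below implements the three steps in plain PyTorch. The top-T budget, the threshold h, and the min-max normalization are illustrative assumptions; the text only states that scores are normalized to [0, 1], sorted, and thresholded.

```python
import torch

def select_regions(region_embs, saliency_scores, top_t=32, threshold=0.5):
    # (a) Normalize all saliency scores to [0, 1] (min-max here; the exact
    #     normalization used by the paper is an assumption).
    s = saliency_scores
    s = (s - s.min()) / (s.max() - s.min() + 1e-6)

    # (b) Sort the regions by normalized score in descending order.
    order = torch.argsort(s, descending=True)

    # (c) Keep no more than the top-T regions whose score exceeds the threshold h.
    keep = order[:top_t]
    keep = keep[s[keep] > threshold]

    # Weight the kept region embeddings by their scores, so the saliency scores
    # receive gradients from the downstream vision-language losses.
    return region_embs[keep] * s[keep].unsqueeze(-1), keep

# 100 proposals with 256-d embeddings (illustrative shapes).
embs, scores = torch.randn(100, 256), torch.rand(100)
selected, kept_idx = select_regions(embs, scores)
```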
3.2.3 Region Adaptor: Progressive Region Regression

The region adaptor aims at adjusting the coordinates of proposed regions to align them with words of the same semantics. The difficulty is that no annotated text-referenced regions are available as ground truths. We turn this problem into an L-level cascaded coarse-to-fine progressive regression process, with L = 3 by default. As shown in Fig. 4, the i-th level of the region regression process receives three inputs: word embeddings E_i^T ∈ R^{N^T × D}, region embeddings E_i^G ∈ R^{N^G × D} with their coordinates G_i ∈ R^{N^G × 4}, and a global decoder feature map F_i ∈ R^{H_i × W_i × D}, where N^T and N^G denote the number of words and selected regions, respectively, and D denotes the dimension of the embeddings.

The detailed procedure of progressive region regression is described in Algorithm 1. The vision-language multi-head attention layers fuse region and word embeddings and model their interactions as follows:

A_i = \frac{E_i^G (E_i^T)^\top}{\sqrt{D}},
E_{i+1}^G = \mathrm{Softmax}(A_i)\, E_i^T,        (1)
E_{i+1}^T = \mathrm{Softmax}(A_i^\top)\, E_i^G.

Algorithm 1: Progressive region regression
4:  for i ∈ {1, 2, ..., L} do
5:      E_{i+1}^G, E_{i+1}^T ← CrossAttention(E_i^G, E_i^T)
6:      E_i^N ← NeighbourEmbedding(N_i^h, N_i^w, G_i)
7:      Δx_i, Δy_i ← Offset(Similarity(E_i^N, E_{i+1}^T))
8:      G_{i+1} ← Update(G_i, Δx_i, Δy_i, p_i^w, p_i^h)
9:      E_{i+1}^G ← Embedding(G_{i+1}, F_i)
10: end for
11: T_v, T_t ← ExtractCLS(E_{L+1}^G, E_{L+1}^T)
12: G_out ← G_{L+1}
13: Ḡ ← (Σ_{i=2}^{L+1} G_i) / L
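A minimal single-head sketch of the fusion in Equ. (1), i.e., the CrossAttention step of Algorithm 1, is given below; the actual model uses multi-head attention layers, and the shapes are illustrative.

```python
import torch

def cross_attention_fusion(E_g, E_t):
    """E_g: (N_G, D) region embeddings; E_t: (N_T, D) word embeddings."""
    D = E_g.shape[-1]
    A = E_g @ E_t.T / D ** 0.5                    # affinity matrix A_i, (N_G, N_T)
    E_g_next = torch.softmax(A, dim=-1) @ E_t     # regions attend to words
    E_t_next = torch.softmax(A.T, dim=-1) @ E_g   # words attend to regions
    return E_g_next, E_t_next

E_g_next, E_t_next = cross_attention_fusion(torch.randn(36, 256), torch.randn(20, 256))
```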
With vision-language semantics, the updated vision-aware word embeddings E_i^T are able to guide region coordinate updates by searching for highly correlated regions around the original one. Specifically, the neighborhood of a region g = (x, y, w, h) is defined as a region of size (N_i^h, N_i^w) centered on it, where N_i^h and N_i^w are pre-defined parameters for the i-th level of the region regression process. The neighborhood is split into K × K regions to compute region-word similarities. As shown in Fig. 4, each region embedding is extracted from F_i with RoIAlign followed by average pooling.

With different response scores to the words, neighbor regions aggregate context information into the central one. The coordinate update for the central region takes the form of a weighted summation of the coordinates of its neighbors' center points, as shown in Equ. (2):

\Delta x = \sum_{j=0}^{K^2-1} M_j N_j^h \left( \lfloor j/K \rfloor - \lfloor K/2 \rfloor \right),
\Delta y = \sum_{j=0}^{K^2-1} M_j N_j^w \left( j \bmod K - \lfloor K/2 \rfloor \right),        (2)
x' = x + \Delta x, \quad y' = y + \Delta y,
w' = p^w w, \quad h' = p^h h,

where ⌊·⌋ is the round-down operation. Every region in all levels of the region adaptor has its own p^w and p^h, which are set as learnable parameters. M_j is the maximum cosine similarity between the embedding of the j-th neighbor region and all word embeddings.
The purpose of the last term in the first two lines of Equ. (2) is to map the 1D index to a 2D index (e.g., from {0, 1, ..., 8} to {(0, 0), (0, 1), ..., (2, 2)}).
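The offset computation of Equ. (2) can be sketched as follows. Treating N^h and N^w as a single per-level cell size (rather than per-neighbor values) and fixing K = 3 are simplifying assumptions.

```python
import torch

def neighbour_offset(M, N_h, N_w, K=3):
    """M: (K*K,) max cosine similarity of each neighbor region to any word."""
    j = torch.arange(K * K, dtype=torch.float32)
    row = torch.floor(j / K) - K // 2      # 1D index j -> vertical grid offset
    col = j % K - K // 2                   # 1D index j -> horizontal grid offset
    dx = (M * N_h * row).sum()             # weighted sum over neighbor cells
    dy = (M * N_w * col).sum()
    return dx, dy

M = torch.rand(9)                          # K = 3 gives 9 neighbor cells
dx, dy = neighbour_offset(M, N_h=32.0, N_w=32.0)
x_new, y_new = 100.0 + dx.item(), 80.0 + dy.item()   # x' = x + dx, y' = y + dy
```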
For each original region g, let g_i denote its updated version after the i-th level of region regression. We take the average of them as the ground truth and apply the L1 and GIoU regression losses:

\bar{g} = \frac{\sum_{i=2}^{L+1} g_i}{L},        (3)
L_{reg}(g) = L_{L1}(\bar{g}, g) + L_{GIoU}(\bar{g}, g).
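A sketch of this self-supervised regression constraint is shown below. Pulling each level's boxes toward their detached average, with boxes in (x1, y1, x2, y2) format, is one reading of Equ. (3); treat these details as assumptions. It relies on torchvision (>= 0.13) for the GIoU loss.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def progressive_regression_loss(boxes_per_level):
    """boxes_per_level: list of (N, 4) refined boxes g_2, ..., g_{L+1}."""
    g_bar = torch.stack(boxes_per_level).mean(dim=0).detach()  # pseudo ground truth
    loss = boxes_per_level[0].new_zeros(())
    for g in boxes_per_level:
        loss = loss + F.l1_loss(g, g_bar)
        loss = loss + generalized_box_iou_loss(g, g_bar, reduction="mean")
    return loss

# Three levels of 5 boxes each, (x1, y1, x2, y2) with positive width/height.
levels = [torch.rand(5, 2).repeat(1, 2) + torch.tensor([0.0, 0.0, 10.0, 10.0])
          for _ in range(3)]
loss_reg = progressive_regression_loss(levels)
```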
3.3. Pre-training Objectives with SLAN

SLAN is pre-trained on image-text pairs and learns fine-grained region-word alignments with the supervision from three common losses.

Image-Text Matching Loss (ITM) predicts whether a given image-text pair is positive or not, which can be viewed as a binary classification problem. The visual and textual tokens (T_v, T_t) are concatenated and sent to a linear layer f_c. The ITM loss is formalized as follows:

L_{itm}(I, T) = H(f_c(\mathrm{cat}(T_v, T_t)), y_{v,t}),        (4)

where y_{v,t} denotes the matching relation (1 for matched and 0 for unmatched), and H is the cross-entropy loss for classification. We directly select positive pairs from the dataset and build hard negative samples with batch sampling, following ALBEF [24].
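A compact sketch of the ITM head in Equ. (4) is given below; the embedding dimension and batch size are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 2)                  # linear layer f_c

    def forward(self, T_v, T_t, y):
        logits = self.fc(torch.cat([T_v, T_t], dim=-1))  # cat(T_v, T_t)
        return F.cross_entropy(logits, y)                # H(., y_{v,t})

head = ITMHead()
T_v, T_t = torch.randn(8, 256), torch.randn(8, 256)
y = torch.randint(0, 2, (8,))                            # 1 = matched, 0 = unmatched
loss_itm = head(T_v, T_t, y)
```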
Image-Text Contrastive Loss (ITC) ensures that visual and textual embeddings share the same semantic space and that positive (matched) image-text pairs are pulled closer than negative (unmatched) ones. We use two queues I_q, T_q to save the latest visited image and text samples. For each image-text pair (I, T), the softmax-normalized vision-language similarity is computed as:

p_{i2t}(I, T, T_q) = \frac{\exp(\mathrm{sim}(T_v, T_t)/\tau)}{\sum_{T_t' \in T_q} \exp(\mathrm{sim}(T_v, T_t')/\tau)},        (5)
p_{t2i}(T, I, I_q) = \frac{\exp(\mathrm{sim}(T_t, T_v)/\tau)}{\sum_{T_v' \in I_q} \exp(\mathrm{sim}(T_t, T_v')/\tau)},

where τ is a temperature parameter and sim(·) measures vision-language similarity, implemented as the dot product between the image and text embeddings. Following ALBEF [24], we compute the ITC loss as:

L_{itc}(I, T) = -\log(p_{i2t}(I, T, T_q)) - \log(p_{t2i}(T, I, I_q)).        (6)
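The sketch below computes Equ. (5)-(6) for one image-text pair. Including the positive pair in the softmax denominator, L2-normalizing the embeddings, and τ = 0.07 are assumptions in the spirit of ALBEF-style implementations.

```python
import torch
import torch.nn.functional as F

def itc_loss(T_v, T_t, image_queue, text_queue, tau=0.07):
    """T_v, T_t: (D,) embeddings of the current pair; *_queue: (Q, D)."""
    T_v, T_t = F.normalize(T_v, dim=-1), F.normalize(T_t, dim=-1)
    image_queue = F.normalize(image_queue, dim=-1)
    text_queue = F.normalize(text_queue, dim=-1)

    # Image-to-text: the paired text competes with texts in the queue T_q.
    sim_i2t = torch.cat([(T_v @ T_t).view(1), text_queue @ T_v]) / tau
    p_i2t = torch.softmax(sim_i2t, dim=0)[0]

    # Text-to-image: the paired image competes with images in the queue I_q.
    sim_t2i = torch.cat([(T_t @ T_v).view(1), image_queue @ T_t]) / tau
    p_t2i = torch.softmax(sim_t2i, dim=0)[0]

    return -(torch.log(p_i2t) + torch.log(p_t2i))

loss_itc = itc_loss(torch.randn(256), torch.randn(256),
                    torch.randn(1024, 256), torch.randn(1024, 256))
```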
Language Modeling Loss (LM) encourages the model to predict masked words with context information. We randomly mask 15% of the text tokens and apply the masked language modeling loss as follows:

L_{lm}(I, T) = H(p_{mask}(T_v, T_t), y_{mask}),        (7)

where y_{mask} denotes the masked word to predict and p_{mask}(I, T) is its predicted probability.
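A sketch of this masked language modeling loss: cross-entropy over the masked positions only. The vocabulary size and the -100 ignore-index convention follow common BERT-style implementations and are assumptions here.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, ignore_index=-100):
    """logits: (N, V) predictions for N text tokens; labels: (N,) original token
    ids at the masked positions and ignore_index everywhere else."""
    return F.cross_entropy(logits, labels, ignore_index=ignore_index)

logits = torch.randn(12, 30522)                    # 30522 = BERT vocabulary size
labels = torch.full((12,), -100)                   # unmasked tokens are ignored
labels[3] = 2057                                   # one masked position to predict
loss_lm = masked_lm_loss(logits, labels)
```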
L_{ds} is the downstream loss, computed as the sum of the previous three losses:

L_{ds}(I, T) = L_{itm}(I, T) + L_{itc}(I, T) + L_{lm}(I, T).        (8)

The full pre-training objective combines the downstream loss with our constraint on progressive region regression:

L = L_{ds} + L_{reg},        (9)

where L_{reg} denotes the summation of the regression loss in Equ. (3) over all regions. The model is supervised by L during training.
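Putting Equ. (8)-(9) together, one pre-training step simply sums the three vision-language losses with the regression constraint; the scalar values below are placeholders standing in for the terms defined above.

```python
import torch

loss_itm = torch.tensor(0.7, requires_grad=True)   # placeholder for L_itm
loss_itc = torch.tensor(1.2, requires_grad=True)   # placeholder for L_itc
loss_lm  = torch.tensor(2.3, requires_grad=True)   # placeholder for L_lm
loss_reg = torch.tensor(0.4, requires_grad=True)   # Equ. (3), summed over regions

loss_ds = loss_itm + loss_itc + loss_lm            # downstream loss L_ds, Equ. (8)
loss = loss_ds + loss_reg                          # full objective L, Equ. (9)
loss.backward()
```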
4. Experiments

SLAN is first pre-trained on a combined dataset of 14M image-text pairs from five datasets: COCO [28], Visual Genome [19] (excluding COCO images), Conceptual Captions [5], Conceptual 12M, and SBU Captions [29]. We evaluate SLAN by comparing it to other state-of-the-art cross-modal methods on several downstream tasks. We also conduct extensive ablation studies to investigate how each component of SLAN influences the performance.

4.1. Implementation Details

We choose BERTbase [18] as our text encoder, which is initialized from HuggingFace [39]. For the vision encoder, we explore four design choices: one CNN-based model (i.e., ResNet50) and three transformer-based models (i.e., ViT-Base, ViT-Large, and ViT-Huge), all randomly initialized. For the neighborhood size of each region adaptor, we use a ratio r_i: (N_i^h, N_i^w) = (r_i H_i, r_i W_i), where r_1, r_2, r_3 = 1, 0.5, 0.25, respectively. We pre-train SLAN for 20 epochs. Depending on the vision encoder, the batch size is set to 1280, 960, 640, and 640 for ResNet50, ViT-Base, ViT-Large, and ViT-Huge, respectively. The AdamW optimizer is adopted with an initial learning rate of 3e-4, and the learning rate is linearly decayed to 0. We resize the input images to 224×224.
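The stated optimization recipe (AdamW, initial learning rate 3e-4, linear decay to 0 over 20 epochs) could be set up as below; the model and the number of steps per epoch are placeholders.

```python
import torch

model = torch.nn.Linear(256, 256)                  # placeholder for SLAN
epochs, steps_per_epoch = 20, 1000                 # steps_per_epoch is illustrative
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0,
    total_iters=epochs * steps_per_epoch)          # linear decay of the LR to 0
```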
4.2. Comparison on Downstream Tasks

We compare SLAN with other state-of-the-art methods on five challenging vision-language understanding tasks, including image-text retrieval, image captioning, visual question answering, natural language visual reasoning, and zero-shot video-text retrieval. We also generalize SLAN to two localization tasks: object detection and phrase grounding. The default vision encoder is ViT-Huge, if not specified.
Method       Backbone      Pre-training Data | Zero-shot I→T R@1/5/10 | Zero-shot T→I R@1/5/10 | Fine-tune I→T R@1/5/10 | Fine-tune T→I R@1/5/10
ALIGN [16]   EfficientNet  1.8B | 88.6 / 98.7 / 99.7  | 75.7 / 93.8 / 96.8 | 95.3 / 99.8 / 100.0  | 84.9 / 97.4 / 98.6
FILIP [44]   ViT-Large     300M | 89.8 / 99.2 / 99.8  | 75.0 / 93.4 / 96.3 | 96.6 / 100.0 / 100.0 | 87.1 / 97.7 / 99.1
BLIP [23]    ViT-Large     14M  | 94.8 / 99.7 / 100.0 | 84.9 / 96.7 / 98.3 | 96.6 / 99.8 / 100.0  | 87.2 / 97.5 / 98.8
Beit-3 [37]  ViT-Giant     21M  | 94.9 / 99.9 / 100.0 | 81.5 / 95.6 / 97.8 | 98.0 / 100.0 / 100.0 | 90.3 / 98.7 / 99.5
Ours         ViT-Huge      14M  | 96.0 / 100.0 / 100.0 | 86.1 / 97.0 / 98.5 | 98.1 / 100.0 / 100.0 | 90.2 / 99.0 / 99.6

Table 2. Comparison with state-of-the-art image-text retrieval methods on Flickr30k. We use Recall@k scores as the evaluation metrics under both zero-shot and fine-tuning settings.
Table 3. Comparison on more downstream tasks. For COCO retrieval, I2T and T2I represent the image-to-text and text-to-image retrieval tasks, respectively. For COCO image captioning, we report BLEU@4 (B@4), METEOR (M), CIDEr (C), and SPICE (S) scores on the Karpathy test split. For VQA, we evaluate the VQA score on the VQAv2 test-dev and test-standard (test-std) splits. For NLVR, we report accuracy on the NLVR2 development set (dev) and public test set (test-P).
4.2.2 Image Captioning

Given an input image, the captioning task generates a sentence description that describes the image in detail. We use the COCO Karpathy split to fine-tune and evaluate. SLAN outperforms most existing methods under this efficient setting, as shown in Tab. 3.

4.2.3 Visual Question Answering

Visual Question Answering (VQA) [1] requires the model to predict an answer from an image-question pair. We follow [23] and treat VQA as an open-ended question-generation task. We fuse the image embedding with the question embedding and send them to the question decoder to get the result. As shown in Tab. 3, SLAN achieves higher performance on the VQAv2 test-dev and test-std sets than Beit-3, which adopts a larger vision backbone and requires more pre-training data.

4.2.4 Natural Language Visual Reasoning

Natural Language Visual Reasoning (NLVR2) [33] measures whether a sentence describes a pair of images. We extract the image and text embeddings from the image-text input, which are then fused with a cross-attention layer. We use a binary classification module to predict their relations. SLAN surpasses most existing methods by a large margin and achieves comparable performance with Beit-3, showing the importance of learning fine-grained vision-language alignments.
Method                  Backbone    Pretrain Data (M): Image-Text / Region-Word | Object Detection (COCO): Zero-shot / Fine-tune | Phrase Grounding (Flickr30k): R@1 / R@5 / R@10
DETR [4] ECCV'20        ResNet50    0 / 0    | -    / 42.0 | -    / -    / -
MDETR [17] ICCV'21      ResNet101   0 / 0.2  | -    / -    | 84.3 / 93.9 / 95.8
GLIP [25] CVPR'22       Swin-Large  24 / 3   | 49.8 / 60.8 | 87.1 / 96.9 / 98.1
GLIPv2 [46] NeurIPS'22  Swin-Huge   16 / 3   | -    / 60.2 | 87.7 / 97.3 / 98.5
Beit-3 [37] CVPR'23     ViT-Giant   21 / 0   | -    / 63.7 | -    / -    / -
Ours                    ResNet50    14 / 0   | 46.9 / 59.2 | 86.8 / 96.6 / 97.4
Ours                    ViT-Base    14 / 0   | 47.0 / 59.6 | 87.4 / 96.9 / 98.2
Ours                    ViT-Large   14 / 0   | 48.5 / 60.5 | 89.1 / 98.0 / 98.9
Ours                    ViT-Huge    14 / 0   | 50.1 / 63.5 | 90.6 / 98.6 / 99.3
Table 5. Comparison on two localization tasks: object detection on COCO and phrase grounding on Flickr30k. The pre-training data
includes image-text pairs and word-specific region annotations. We evaluate both the zero-shot and fine-tune settings on object detection.
We use Recall@k scores to evaluate the phrase grounding task.