SLAN: Self-Locator Aided Network For Vision-Language Understanding
Jiang-Tian Zhai1 * Qi Zhang2 * Tong Wu2 Xing-Yu Chen2 Jiang-Jiang Liu1† Ming-Ming Cheng1†
1 VCIP, CS, Nankai University    2 Tencent Youtu Lab
{jtzhai30,j04.liu}@gmail.com, townswu@tencent.com, cmm@nankai.edu.cn
region proposals (e.g., 100) for an image. Directly selecting all regions will lead to unnecessary computational cost and may also cause the model to learn from some meaningless region-to-word pairs. The strategy to control the maximum number of selected regions has three steps. (a) Normalize all saliency scores of the regions; after this process, the scores are represented as S = {S_1, ..., S_k}, S_i ∈ [0, 1]. (b) Sort these regions in descending order according to their saliency scores. (c) Pick no more than the top T regions with saliency scores above a threshold h. Finally, we weight the region embeddings by the scores. The saliency score of each proposed region is updated with gradients from downstream vision-language supervision, which will be described in Sec. 3.3.
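To make the selection procedure concrete, the sketch below implements the three steps in plain PyTorch. The top-T budget, the threshold h, and the min-max normalization are illustrative assumptions; the text only states that scores are normalized to [0, 1], sorted, and thresholded.

```python
import torch

def select_regions(region_embs, saliency_scores, top_t=32, threshold=0.5):
    # (a) Normalize all saliency scores to [0, 1] (min-max here; the exact
    #     normalization used by the paper is an assumption).
    s = saliency_scores
    s = (s - s.min()) / (s.max() - s.min() + 1e-6)

    # (b) Sort the regions by normalized score in descending order.
    order = torch.argsort(s, descending=True)

    # (c) Keep no more than the top-T regions whose score exceeds the threshold h.
    keep = order[:top_t]
    keep = keep[s[keep] > threshold]

    # Weight the kept region embeddings by their scores, so the saliency scores
    # receive gradients from the downstream vision-language losses.
    return region_embs[keep] * s[keep].unsqueeze(-1), keep

# 100 proposals with 256-d embeddings (illustrative shapes).
embs, scores = torch.randn(100, 256), torch.rand(100)
selected, kept_idx = select_regions(embs, scores)
```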
3.2.3 Region Adaptor: Progressive Region Regression

The region adaptor aims at adjusting the coordinates of proposed regions to align them with words of the same semantics. The difficulty is that no annotated text-referenced regions are available as ground truths. We turn this problem into an L-level cascaded coarse-to-fine progressive regression process, with L = 3 by default. As shown in Fig. 4, the i-th level of the region regression process receives three inputs: word embeddings E_i^T ∈ R^{N^T × D}, region embeddings E_i^G ∈ R^{N^G × D} with their coordinates G_i ∈ R^{N^G × 4}, and a global decoder feature map F_i ∈ R^{H_i × W_i × D}, where N^T and N^G denote the number of words and selected regions, respectively, and D denotes the dimension of the embeddings.

The detailed procedure of progressive region regression is described in Algorithm 1. The vision-language multi-head attention layers fuse region and word embeddings and model their interactions as follows:

A_i = \frac{E_i^G (E_i^T)^\top}{\sqrt{D}},
E_{i+1}^G = \mathrm{Softmax}(A_i)\, E_i^T,        (1)
E_{i+1}^T = \mathrm{Softmax}(A_i^\top)\, E_i^G.

Algorithm 1: Progressive region regression
4:  for i ∈ {1, 2, ..., L} do
5:      E_{i+1}^G, E_{i+1}^T ← CrossAttention(E_i^G, E_i^T)
6:      E_i^N ← NeighbourEmbedding(N_i^h, N_i^w, G_i)
7:      Δx_i, Δy_i ← Offset(Similarity(E_i^N, E_{i+1}^T))
8:      G_{i+1} ← Update(G_i, Δx_i, Δy_i, p_i^w, p_i^h)
9:      E_{i+1}^G ← Embedding(G_{i+1}, F_i)
10: end for
11: T_v, T_t ← ExtractCLS(E_{L+1}^G, E_{L+1}^T)
12: G_out ← G_{L+1}
13: Ḡ ← (Σ_{i=2}^{L+1} G_i) / L
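A minimal single-head sketch of the fusion in Equ. (1), i.e., the CrossAttention step of Algorithm 1, is given below; the actual model uses multi-head attention layers, and the shapes are illustrative.

```python
import torch

def cross_attention_fusion(E_g, E_t):
    """E_g: (N_G, D) region embeddings; E_t: (N_T, D) word embeddings."""
    D = E_g.shape[-1]
    A = E_g @ E_t.T / D ** 0.5                    # affinity matrix A_i, (N_G, N_T)
    E_g_next = torch.softmax(A, dim=-1) @ E_t     # regions attend to words
    E_t_next = torch.softmax(A.T, dim=-1) @ E_g   # words attend to regions
    return E_g_next, E_t_next

E_g_next, E_t_next = cross_attention_fusion(torch.randn(36, 256), torch.randn(20, 256))
```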
With vision-language semantics, the updated vision-aware word embeddings E_i^T are able to guide region coordinate updates by searching for highly correlated regions around the original one. Specifically, the neighborhood of a region g = (x, y, w, h) is defined as a region of size (N_i^h, N_i^w) centered on it, where N_i^h and N_i^w are pre-defined parameters for the i-th level of the region regression process. The neighborhood is split into K × K regions to compute region-word similarities. As shown in Fig. 4, each region embedding is extracted from F_i with RoIAlign followed by average pooling.

With different response scores to the words, neighbor regions aggregate context information into the central one. The coordinate update for the central region takes the form of a weighted summation of the coordinates of its neighbors' center points, as shown in Equ. (2):

\Delta x = \sum_{j=0}^{K^2-1} M_j N_j^h \left( \lfloor j/K \rfloor - \lfloor K/2 \rfloor \right),
\Delta y = \sum_{j=0}^{K^2-1} M_j N_j^w \left( j \bmod K - \lfloor K/2 \rfloor \right),        (2)
x' = x + \Delta x, \quad y' = y + \Delta y,
w' = p^w w, \quad h' = p^h h,

where ⌊·⌋ is the round-down operation. Every region in all levels of the region adaptor has its own p^w and p^h, which are set as learnable parameters. M_j is the maximum cosine similarity between the embedding of the j-th neighbor region and all word embeddings.
The purpose of the last term in the first two lines of Equ. (2) is to map the 1D index to a 2D index (e.g., from {0, 1, ..., 8} to {(0, 0), (0, 1), ..., (2, 2)}).
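The offset computation of Equ. (2) can be sketched as follows. Treating N^h and N^w as a single per-level cell size (rather than per-neighbor values) and fixing K = 3 are simplifying assumptions.

```python
import torch

def neighbour_offset(M, N_h, N_w, K=3):
    """M: (K*K,) max cosine similarity of each neighbor region to any word."""
    j = torch.arange(K * K, dtype=torch.float32)
    row = torch.floor(j / K) - K // 2      # 1D index j -> vertical grid offset
    col = j % K - K // 2                   # 1D index j -> horizontal grid offset
    dx = (M * N_h * row).sum()             # weighted sum over neighbor cells
    dy = (M * N_w * col).sum()
    return dx, dy

M = torch.rand(9)                          # K = 3 gives 9 neighbor cells
dx, dy = neighbour_offset(M, N_h=32.0, N_w=32.0)
x_new, y_new = 100.0 + dx.item(), 80.0 + dy.item()   # x' = x + dx, y' = y + dy
```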
For each original region g, let g_i denote its updated version after the i-th level of region regression. We take the average of them as the ground truth and apply the L1 and GIoU regression losses:

\bar{g} = \frac{\sum_{i=2}^{L+1} g_i}{L},        (3)
L_{reg}(g) = L_{L1}(\bar{g}, g) + L_{GIoU}(\bar{g}, g).
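A sketch of this self-supervised regression constraint is shown below. Pulling each level's boxes toward their detached average, with boxes in (x1, y1, x2, y2) format, is one reading of Equ. (3); treat these details as assumptions. It relies on torchvision (>= 0.13) for the GIoU loss.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def progressive_regression_loss(boxes_per_level):
    """boxes_per_level: list of (N, 4) refined boxes g_2, ..., g_{L+1}."""
    g_bar = torch.stack(boxes_per_level).mean(dim=0).detach()  # pseudo ground truth
    loss = boxes_per_level[0].new_zeros(())
    for g in boxes_per_level:
        loss = loss + F.l1_loss(g, g_bar)
        loss = loss + generalized_box_iou_loss(g, g_bar, reduction="mean")
    return loss

# Three levels of 5 boxes each, (x1, y1, x2, y2) with positive width/height.
levels = [torch.rand(5, 2).repeat(1, 2) + torch.tensor([0.0, 0.0, 10.0, 10.0])
          for _ in range(3)]
loss_reg = progressive_regression_loss(levels)
```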
3.3. Pre-training Objectives with SLAN

SLAN is pre-trained on image-text pairs and learns fine-grained region-word alignments with the supervision from three common losses.

Image-Text Matching Loss (ITM) predicts whether a given image-text pair is positive or not, which can be viewed as a binary classification problem. The visual and textual tokens (T_v, T_t) are concatenated and sent to a linear layer f_c. The ITM loss is formalized as follows:

L_{itm}(I, T) = H(f_c(\mathrm{cat}(T_v, T_t)), y_{v,t}),        (4)

where y_{v,t} denotes the matching relation (1 for matched and 0 for unmatched), and H is the cross-entropy loss for classification. We directly select positive pairs from the dataset and build hard negative samples with batch sampling, following ALBEF [24].
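A compact sketch of the ITM head in Equ. (4) is given below; the embedding dimension and batch size are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 2)                  # linear layer f_c

    def forward(self, T_v, T_t, y):
        logits = self.fc(torch.cat([T_v, T_t], dim=-1))  # cat(T_v, T_t)
        return F.cross_entropy(logits, y)                # H(., y_{v,t})

head = ITMHead()
T_v, T_t = torch.randn(8, 256), torch.randn(8, 256)
y = torch.randint(0, 2, (8,))                            # 1 = matched, 0 = unmatched
loss_itm = head(T_v, T_t, y)
```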
Image-Text Contrastive Loss (ITC) ensures that visual and textual embeddings share the same semantic space and that positive (matched) image-text pairs are pulled closer than negative (unmatched) ones. We use two queues I_q, T_q to save the latest visited image and text samples. For each image-text pair (I, T), the softmax-normalized vision-language similarity is computed as:

p_{i2t}(I, T, T_q) = \frac{\exp(\mathrm{sim}(T_v, T_t)/\tau)}{\sum_{T_t' \in T_q} \exp(\mathrm{sim}(T_v, T_t')/\tau)},        (5)
p_{t2i}(T, I, I_q) = \frac{\exp(\mathrm{sim}(T_t, T_v)/\tau)}{\sum_{T_v' \in I_q} \exp(\mathrm{sim}(T_t, T_v')/\tau)},

where τ is a temperature parameter and sim(·) measures vision-language similarity, implemented as the dot product between the image and text embeddings. Following ALBEF [24], we compute the ITC loss as:

L_{itc}(I, T) = -\log(p_{i2t}(I, T, T_q)) - \log(p_{t2i}(T, I, I_q)).        (6)
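The sketch below computes Equ. (5)-(6) for one image-text pair. Including the positive pair in the softmax denominator, L2-normalizing the embeddings, and τ = 0.07 are assumptions in the spirit of ALBEF-style implementations.

```python
import torch
import torch.nn.functional as F

def itc_loss(T_v, T_t, image_queue, text_queue, tau=0.07):
    """T_v, T_t: (D,) embeddings of the current pair; *_queue: (Q, D)."""
    T_v, T_t = F.normalize(T_v, dim=-1), F.normalize(T_t, dim=-1)
    image_queue = F.normalize(image_queue, dim=-1)
    text_queue = F.normalize(text_queue, dim=-1)

    # Image-to-text: the paired text competes with texts in the queue T_q.
    sim_i2t = torch.cat([(T_v @ T_t).view(1), text_queue @ T_v]) / tau
    p_i2t = torch.softmax(sim_i2t, dim=0)[0]

    # Text-to-image: the paired image competes with images in the queue I_q.
    sim_t2i = torch.cat([(T_t @ T_v).view(1), image_queue @ T_t]) / tau
    p_t2i = torch.softmax(sim_t2i, dim=0)[0]

    return -(torch.log(p_i2t) + torch.log(p_t2i))

loss_itc = itc_loss(torch.randn(256), torch.randn(256),
                    torch.randn(1024, 256), torch.randn(1024, 256))
```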
Language Modeling Loss (LM) encourages the model to predict masked words with context information. We randomly mask 15% of the text tokens and apply the masked language modeling loss as follows:

L_{lm}(I, T) = H(p_{mask}(T_v, T_t), y_{mask}),        (7)

where y_{mask} denotes the masked word to predict and p_{mask}(I, T) is its predicted probability.
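A sketch of this masked language modeling loss: cross-entropy over the masked positions only. The vocabulary size and the -100 ignore-index convention follow common BERT-style implementations and are assumptions here.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits, labels, ignore_index=-100):
    """logits: (N, V) predictions for N text tokens; labels: (N,) original token
    ids at the masked positions and ignore_index everywhere else."""
    return F.cross_entropy(logits, labels, ignore_index=ignore_index)

logits = torch.randn(12, 30522)                    # 30522 = BERT vocabulary size
labels = torch.full((12,), -100)                   # unmasked tokens are ignored
labels[3] = 2057                                   # one masked position to predict
loss_lm = masked_lm_loss(logits, labels)
```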
L_{ds} is the downstream loss, computed as the sum of the previous three losses:

L_{ds}(I, T) = L_{itm}(I, T) + L_{itc}(I, T) + L_{lm}(I, T).        (8)

The full pre-training objective combines the downstream loss with our constraint on progressive region regression:

L = L_{ds} + L_{reg},        (9)

where L_{reg} denotes the summation of the regression loss in Equ. (3) over all regions. The model is supervised by L during training.
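Putting Equ. (8)-(9) together, one pre-training step simply sums the three vision-language losses with the regression constraint; the scalar values below are placeholders standing in for the terms defined above.

```python
import torch

loss_itm = torch.tensor(0.7, requires_grad=True)   # placeholder for L_itm
loss_itc = torch.tensor(1.2, requires_grad=True)   # placeholder for L_itc
loss_lm  = torch.tensor(2.3, requires_grad=True)   # placeholder for L_lm
loss_reg = torch.tensor(0.4, requires_grad=True)   # Equ. (3), summed over regions

loss_ds = loss_itm + loss_itc + loss_lm            # downstream loss L_ds, Equ. (8)
loss = loss_ds + loss_reg                          # full objective L, Equ. (9)
loss.backward()
```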
4. Experiments

SLAN is first pre-trained on a combined dataset of 14M image-text pairs from five datasets: COCO [28], Visual Genome [19] (excluding COCO images), Conceptual Captions [5], Conceptual 12M, and SBU Captions [29]. We evaluate SLAN by comparing it to other state-of-the-art cross-modal methods on several downstream tasks. We also conduct extensive ablation studies to investigate how each component of SLAN influences the performance.

4.1. Implementation Details

We choose BERTbase [18] as our text encoder, which is initialized from HuggingFace [39]. For the vision encoder, we explore four design choices: one CNN-based model (i.e., ResNet50) and three transformer-based models (i.e., ViT-Base, ViT-Large, and ViT-Huge), all randomly initialized. For the neighborhood size of each region adaptor, we use a ratio r_i: (N_i^h, N_i^w) = (r_i H_i, r_i W_i), where r_1, r_2, r_3 = 1, 0.5, 0.25, respectively. We pre-train SLAN for 20 epochs. Depending on the vision encoder, the batch size is set to 1280, 960, 640, and 640 for ResNet50, ViT-Base, ViT-Large, and ViT-Huge, respectively. The AdamW optimizer is adopted with an initial learning rate of 3e-4, and the learning rate is linearly decayed to 0. We resize the input images to 224×224.
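The stated optimization recipe (AdamW, initial learning rate 3e-4, linear decay to 0 over 20 epochs) could be set up as below; the model and the number of steps per epoch are placeholders.

```python
import torch

model = torch.nn.Linear(256, 256)                  # placeholder for SLAN
epochs, steps_per_epoch = 20, 1000                 # steps_per_epoch is illustrative
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0,
    total_iters=epochs * steps_per_epoch)          # linear decay of the LR to 0
```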
4.2. Comparison on Downstream Tasks

We compare SLAN with other state-of-the-art methods on five challenging vision-language understanding tasks, including image-text retrieval, image captioning, visual question answering, natural language visual reasoning, and zero-shot video-text retrieval. We also generalize SLAN to two localization tasks: object detection and phrase grounding. The default vision encoder is ViT-Huge, if not specified.
Method       Backbone      Pre-training Data | Zero-shot I→T R@1/5/10 | Zero-shot T→I R@1/5/10 | Fine-tune I→T R@1/5/10 | Fine-tune T→I R@1/5/10
ALIGN [16]   EfficientNet  1.8B | 88.6 / 98.7 / 99.7  | 75.7 / 93.8 / 96.8 | 95.3 / 99.8 / 100.0  | 84.9 / 97.4 / 98.6
FILIP [44]   ViT-Large     300M | 89.8 / 99.2 / 99.8  | 75.0 / 93.4 / 96.3 | 96.6 / 100.0 / 100.0 | 87.1 / 97.7 / 99.1
BLIP [23]    ViT-Large     14M  | 94.8 / 99.7 / 100.0 | 84.9 / 96.7 / 98.3 | 96.6 / 99.8 / 100.0  | 87.2 / 97.5 / 98.8
Beit-3 [37]  ViT-Giant     21M  | 94.9 / 99.9 / 100.0 | 81.5 / 95.6 / 97.8 | 98.0 / 100.0 / 100.0 | 90.3 / 98.7 / 99.5
Ours         ViT-Huge      14M  | 96.0 / 100.0 / 100.0 | 86.1 / 97.0 / 98.5 | 98.1 / 100.0 / 100.0 | 90.2 / 99.0 / 99.6

Table 2. Comparison with state-of-the-art image-text retrieval methods on Flickr30k. We use Recall@k scores as the evaluation metrics under both zero-shot and fine-tuning settings.
Table 3. Comparison on more downstream tasks. For COCO retrieval, I2T and T2I represent the image-to-text and text-to-image retrieval tasks, respectively. For COCO image captioning, we report BLEU@4 (B@4), METEOR (M), CIDEr (C), and SPICE (S) scores on the Karpathy test split. For VQA, we evaluate the VQA score on the VQAv2 test-dev and test-standard (test-std) splits. For NLVR, we report accuracy on the NLVR2 development set (dev) and public test set (test-P).
4.2.2 Image Captioning

Given an input image, the captioning task generates a sentence description that describes the image in detail. We use the COCO Karpathy split to fine-tune and evaluate. SLAN outperforms most existing methods under this efficient setting, as shown in Tab. 3.

4.2.3 Visual Question Answering

Visual Question Answering (VQA) [1] requires the model to predict an answer from an image-question pair. We follow [23] and treat VQA as an open-ended question-generation task. We fuse the image embedding with the question embedding and send them to the question decoder to get the result. As shown in Tab. 3, SLAN achieves higher performance on the VQAv2 test-dev and test-std sets than Beit-3, which adopts a larger vision backbone and requires more pre-training data.

4.2.4 Natural Language Visual Reasoning

Natural Language Visual Reasoning (NLVR2) [33] measures whether a sentence describes a pair of images. We extract the image and text embeddings from the image-text input, which are then fused with a cross-attention layer. We use a binary classification module to predict their relations. SLAN surpasses most existing methods by a large margin and achieves comparable performance with Beit-3, showing the importance of learning fine-grained vision-language alignments.
Method                  Backbone    Pretrain Data (M): Image-Text / Region-Word | Object Detection (COCO): Zero-shot / Fine-tune | Phrase Grounding (Flickr30k): R@1 / R@5 / R@10
DETR [4] ECCV'20        ResNet50    0 / 0    | -    / 42.0 | -    / -    / -
MDETR [17] ICCV'21      ResNet101   0 / 0.2  | -    / -    | 84.3 / 93.9 / 95.8
GLIP [25] CVPR'22       Swin-Large  24 / 3   | 49.8 / 60.8 | 87.1 / 96.9 / 98.1
GLIPv2 [46] NeurIPS'22  Swin-Huge   16 / 3   | -    / 60.2 | 87.7 / 97.3 / 98.5
Beit-3 [37] CVPR'23     ViT-Giant   21 / 0   | -    / 63.7 | -    / -    / -
Ours                    ResNet50    14 / 0   | 46.9 / 59.2 | 86.8 / 96.6 / 97.4
Ours                    ViT-Base    14 / 0   | 47.0 / 59.6 | 87.4 / 96.9 / 98.2
Ours                    ViT-Large   14 / 0   | 48.5 / 60.5 | 89.1 / 98.0 / 98.9
Ours                    ViT-Huge    14 / 0   | 50.1 / 63.5 | 90.6 / 98.6 / 99.3
Table 5. Comparison on two localization tasks: object detection on COCO and phrase grounding on Flickr30k. The pre-training data
includes image-text pairs and word-specific region annotations. We evaluate both the zero-shot and fine-tune settings on object detection.
We use Recall@k scores to evaluate the phrase grounding task.