Image captioning is a cross-modal task combining computer vision [4, 18, 24, 55] and natural language processing [48, 53]. Many methods [26, 44, 50, 54, 57] have been proposed since the emergence of deep learning in recent years. Vinyals et al. [50] proposed to use a deep Convolutional Neural Network (CNN) encoder to extract the visual features of the image and utilized an RNN as the language decoder to generate caption sequences. Subsequently, the spatial attention mechanism [57] was further investigated for its application in image captioning tasks.
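To make the encoder-decoder paradigm above concrete, the following is a minimal sketch of a CNN encoder paired with an RNN decoder that applies soft spatial attention over the feature map at each decoding step. It is illustrative only: the ResNet-50 backbone, GRU cell, additive attention, and all dimensions are assumptions rather than the exact configurations of [50] or [57].

```python
# Sketch of the CNN-encoder / attentive RNN-decoder captioning paradigm.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Extract a grid of spatial features from an image with a pretrained CNN."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc, keep feature map

    def forward(self, images):                      # (B, 3, H, W)
        fmap = self.cnn(images)                     # (B, 2048, h, w)
        return fmap.flatten(2).transpose(1, 2)      # (B, h*w, 2048) grid of visual features


class AttentiveRNNDecoder(nn.Module):
    """GRU decoder with additive (soft) spatial attention over encoder features."""
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_v = nn.Linear(feat_dim, hidden_dim)
        self.att_h = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.gru = nn.GRUCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):             # feats: (B, N, D), captions: (B, T)
        B, T = captions.shape
        h = feats.new_zeros(B, self.gru.hidden_size)
        logits = []
        for t in range(T):
            # additive attention: score each spatial location against the hidden state
            scores = self.att_out(torch.tanh(self.att_v(feats) + self.att_h(h).unsqueeze(1)))
            alpha = scores.softmax(dim=1)            # (B, N, 1) attention weights
            context = (alpha * feats).sum(dim=1)     # (B, D) attended visual feature
            h = self.gru(torch.cat([self.embed(captions[:, t]), context], dim=-1), h)
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)            # (B, T, vocab_size) word scores
```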
More recently, the transformer network [48] with a Multi-Head Attention (MHA) mechanism has been introduced to benefit image caption generation. Huang et al. [26] proposed the attention-on-attention architecture, which utilizes a gated attention mechanism to measure the relevance between the attention results and the queries. Cornia et al. [11] developed a meshed-memory transformer to exploit both low-level and high-level contributions of the visual features for caption generation. Yan et al. [58] integrated a task-adaptive attention module into the transformer-based model, enabling it to identify task-specific clues and reduce misleading information from improper key-value pairs. Li et al. [33] developed a long short-term graph, which effectively captures short-term spatial relationships and long-term transformation dependencies of visual features for image captioning. Guo et al. [22] introduced a normalized attention module, allowing the transformer network to incorporate the geometric structure of the input objects through geometry-aware self-attention. Pan et al. [44] developed a unified X-linear attention block for image captioning. Luo et al. [41] proposed a dual-level collaborative transformer for image captioning to realize the complementary advantages of region and spatial image features.
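As an illustration of the gated attention idea attributed to [26], the sketch below wraps a standard multi-head attention layer and gates its output with the query, so that weakly related attention results can be suppressed. The use of nn.MultiheadAttention as the base attention, the two linear projections, and all dimensions are assumptions for exposition, not the authors' implementation.

```python
# Sketch of attention-on-attention style gating over a standard attention result.
import torch
import torch.nn as nn


class AoABlock(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.info = nn.Linear(2 * dim, dim)   # "information" vector from [query; result]
        self.gate = nn.Linear(2 * dim, dim)   # sigmoid gate over that information

    def forward(self, query, key_value):
        # standard multi-head attention result
        att, _ = self.mha(query, key_value, key_value)        # (B, Lq, dim)
        qv = torch.cat([query, att], dim=-1)                   # (B, Lq, 2*dim)
        # gate the information vector: reflects how relevant the result is to the query
        return torch.sigmoid(self.gate(qv)) * self.info(qv)    # (B, Lq, dim)


# usage: refine attended visual features with their queries
aoa = AoABlock(dim=512)
q = torch.randn(2, 10, 512)      # decoder queries
v = torch.randn(2, 36, 512)      # region features as keys/values
out = aoa(q, v)                  # (2, 10, 512)
```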
Moreover, many existing methods [27, 61] incorporated semantic visual features, allowing the captioning model to generate the most relevant attribute words. Jiang et al. [27] built a guiding network to learn an extra guiding vector for caption sentence generation. Deng et al. [13] proposed a syntax-guided hierarchical attention network that incorporates visual features with semantic and syntactic information to improve the interpretability of the captioning model. Li et al. [32] devised a transformer-based framework that exploits visual and semantic information simultaneously. Yao et al. [60] proposed to utilize a Graph Convolutional Network (GCN) to integrate semantic and spatial relationships between objects. Yang et al. [21] proposed to integrate semantic priors into the model by exploiting a graph-based representation of both images and sentences. Dong et al. [15] introduced a dual-GCN to model the object relationships within a single image as well as the relationships among similar images in the dataset for image captioning.
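For the relationship-modeling line of work (e.g., [60, 15]), a single graph-convolution layer over detected object features conveys the core operation: each region feature is updated by aggregating the features of the objects it is related to, according to a semantic or spatial relationship graph. The self-loops, row normalization, and single-layer design below are simplifying assumptions, not the cited models.

```python
# Sketch of one graph-convolution step over an object relationship graph.
import torch
import torch.nn as nn


class ObjectGCNLayer(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, obj_feats, adj):
        # obj_feats: (B, N, dim) region features; adj: (B, N, N) relationship graph
        adj = adj + torch.eye(adj.size(-1), device=adj.device)   # add self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)       # node degrees
        norm_adj = adj / deg                                      # row-normalize
        # each object aggregates the features of the objects it is related to
        return torch.relu(self.proj(norm_adj @ obj_feats))        # (B, N, dim)


# usage: refine 36 detected regions with a semantic/spatial relationship graph
feats = torch.randn(2, 36, 2048)
adj = (torch.rand(2, 36, 36) > 0.8).float()   # e.g., predicted pairwise relations
refined = ObjectGCNLayer()(feats, adj)        # relationship-aware region features
```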