Human Attribute Recognition—A Comprehensive Survey
Figure 1. Typical pipeline to develop a HAR model.
Figure 2. The proposed taxonomy for the main challenges in HAR.
Figure 3. Number of citations to HAR datasets. The datasets are arranged in increasing order of publication date; the oldest dataset is HAT, published in 2009, while the most recent is RAP v2, published in 2018.
Figure 4. Frequency distribution of the labels describing the 'Age' class in the PETA [152] (left) and CRP [162] (right) databases.
Figure 5. As humans, we not only describe the visible attributes in occluded images, but can also infer the covered attributes based on the relations between attributes.
Figure 6. State-of-the-art mAP results on three well-known PAR datasets.
Abstract
1. Introduction
- Facial Attribute Analysis (FAA). Facial attribute analysis aims at estimating the facial attributes or manipulating the desired attributes. The former is usually carried out by extracting a comprehensive feature representation of the face image, followed by a classifier to predict the face attributes. On the other hand, in manipulation works, face images are modified (e.g., glasses are removed or added) using generative models.
- Full-body Attribute Recognition (FAR). Full-body attribute recognition refers to the task of inferring the soft-biometric labels of the subject, including clothing style, head-region attributes, recurring actions (talking on the phone) and role (cleaning lady, policeman), regardless of the location or body position (eating in a restaurant).
- Pedestrian Attribute Recognition (PAR). As an emerging research sub-field of HAR, it focuses on the full-body human data that have been exclusively collected from video surveillance cameras or panels, where persons are captured while walking, standing, or running.
- Clothing Attribute Analysis (CAA). Another sub-field of human attribute analysis that is exclusively focused on clothing style and type. It comprises several sub-categories such as in-shop retrieval, consumer-to-shop retrieval, fashion landmark detection, fashion analysis, and clothing attribute recognition, each of which requires specific solutions to handle the challenges in the field. Among these sub-categories, clothing attribute recognition is similar to pedestrian and full-body attribute recognition and studies the clothing types (e.g., texture, category, shape, style).
- Capturing raw data, which can be accomplished using mobile cameras (e.g., drones) or stationary cameras (e.g., CCTV). The raw data might also be collected from publicly available images/videos (e.g., YouTube or similar sources).
- In most supervised training approaches, HAR models consider one person at a time (instead of analyzing a full frame with multiple persons). Therefore, detecting the bounding box of each subject is essential and can be done with state-of-the-art object detection solutions (e.g., Mask R-CNN [10], You Only Look Once (YOLO) [11], Single Shot Detector (SSD) [12], etc.); a minimal detection-and-cropping sketch is given after this list.
- If the raw data is in video format, spatio-temporal information should be kept. In such cases, the accurate tracking of each object (subject) in the scene can significantly ease the annotation process.
- Finally, in order to label the data with semantic attributes, all the bounding boxes of each individual are displayed to human annotators. Based on human perception, the desired labels (e.g., 'gender' or 'age') are then associated with each instance of the dataset.
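To make the detection step above concrete, the following sketch crops person regions with an off-the-shelf pretrained detector from torchvision. It is only an illustrative example: the Faster R-CNN model, the 0.8 confidence threshold, and the file name frame.jpg are arbitrary assumptions, not choices made by the surveyed works.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Illustrative sketch: crop person bounding boxes with an off-the-shelf detector.
# The detector choice and the 0.8 score threshold are arbitrary assumptions.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

image = Image.open("frame.jpg").convert("RGB")   # hypothetical raw frame
with torch.no_grad():
    detections = detector([to_tensor(image)])[0]

person_crops = []
for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if label.item() == 1 and score.item() > 0.8:  # COCO class 1 = 'person'
        x1, y1, x2, y2 = box.int().tolist()
        person_crops.append(image.crop((x1, y1, x2, y2)))  # one crop per detected subject
```

Each resulting crop can then be tracked over time (for videos) and presented to the annotators.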
- learn in an end-to-end manner and yield multiple attributes at once (a minimal sketch of such a multi-label model is given after this list);
- extract a discriminative and comprehensive feature representation from the input data;
- leverage the intrinsic correlations between attributes;
- consider the location of each attribute in a weakly supervised manner;
- are robust to primary challenges such as low-resolution data, pose variation, occlusion, illumination variation, and cluttered background;
- handle class imbalance;
- manage the limited-data problem effectively.
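As a minimal illustration of the first two desiderata (end-to-end training that yields all attributes at once from a shared, discriminative representation), the sketch below attaches a multi-label head to a pretrained backbone. The ResNet-50 backbone, the attribute count, and the dummy batch are assumptions for illustration only, not the architecture of any surveyed method.

```python
import torch
import torch.nn as nn
import torchvision

NUM_ATTRIBUTES = 26  # assumption (e.g., the size of the PA-100K label set)

class AttributeNet(nn.Module):
    """Toy end-to-end model: shared backbone, one binary output per attribute."""
    def __init__(self, num_attributes: int = NUM_ATTRIBUTES):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)
        backbone.fc = nn.Identity()                    # keep the 2048-d pooled feature
        self.backbone = backbone
        self.classifier = nn.Linear(2048, num_attributes)

    def forward(self, x):
        return self.classifier(self.backbone(x))       # raw logits, one per attribute

model = AttributeNet()
criterion = nn.BCEWithLogitsLoss()                      # independent binary decision per attribute
images = torch.randn(4, 3, 224, 224)                    # dummy batch
targets = torch.randint(0, 2, (4, NUM_ATTRIBUTES)).float()
loss = criterion(model(images), targets)
loss.backward()
```

The remaining desiderata (attribute localization, correlation modeling, robustness, imbalance, and limited data) are the subject of the challenge-based taxonomy discussed in the following sections.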
- The recent literature on HAR has been mostly focused on addressing particular challenges of this problem (such as class imbalance, attribute localization, etc.) rather than devising a general HAR system. Therefore, instead of providing a methodological categorization of the literature as in Reference [16], our survey proposes a challenge-based taxonomy, discussing the state-of-the-art solutions and the rationale behind them;
- Contrary to Reference [16], we analyze the motivation of each work and the intuitive reason for its superior performance;
- The datasets' main features, statistics, and types of annotation are compared and discussed in detail;
- Besides the motivations, we discuss HAR applications, divided into three main categories: security, commercial, and related research directions.
Motivation and Applications
2. Human Attribute Recognition Preliminaries
2.1. Data Preparation
2.2. HAR Model Development
3. Discussion of Sources
3.1. Localization Methods
3.1.1. Pose Estimation-Based Methods
3.1.2. Poselet-Based Methods
3.1.3. Part-Based Methods
3.1.4. Attention-Based Methods
3.1.5. Attribute-Based Methods
3.2. Limited Data
3.3. Attributes Relationship
- Network-oriented methods that take advantage of various implementations of convolutional layers/blocks to discover the relations between attributes;
- Math-oriented methods that may or may not extract the features using CNNs, but perform mathematical operations on the features to modify them according to the intrinsic correlations among the attributes (a toy sketch of this idea is given after this list).
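As a toy illustration of the math-oriented idea (not a reproduction of any surveyed method), the sketch below estimates pairwise conditional probabilities between binary attributes from an annotation matrix and uses them to smooth independent per-attribute scores. The random data and the 0.5 mixing weight are arbitrary assumptions.

```python
import numpy as np

# Toy sketch of a "math-oriented" use of attribute correlations (not a published method).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(1000, 5)).astype(float)    # dummy binary annotation matrix

counts = labels.T @ labels                                    # counts[i, j] = #samples with attributes i and j
occurrences = np.maximum(np.diag(counts), 1.0)                # #samples in which attribute j is present
cond_prob = counts / occurrences[None, :]                     # cond_prob[i, j] ~ P(attr_i = 1 | attr_j = 1)

# Simple use: smooth independent per-attribute scores with correlated evidence.
scores = rng.random(5)                                        # dummy per-attribute probabilities
propagated = (cond_prob @ scores) / cond_prob.sum(axis=1)     # correlation-weighted average of the scores
refined = 0.5 * scores + 0.5 * propagated                     # arbitrary 0.5/0.5 mixing weight
```

In practice, the surveyed math-oriented works learn or refine such relation matrices jointly with the features, rather than fixing them from label statistics as in this toy example.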
3.3.1. Network-Oriented Attribute Correlation Consideration
3.3.2. Math-Oriented Attribute Correlation Consideration
3.4. Occlusion
3.5. Classes Imbalance
3.5.1. Hard Solutions
3.5.2. Soft Solutions
3.5.3. Hybrid Solutions
3.6. Part-Based and Attribute Correlation-Based Methods
4. Datasets
4.1. PAR Datasets
- PETA dataset. The PEdesTrian Attribute (PETA) [152] dataset combines 19,000 pedestrian images gathered from 10 publicly available datasets; therefore, the images present large variations in terms of scene, lighting conditions, and image resolution. The resolution of the images varies from 17 × 39 to 169 × 365 pixels. The dataset provides rich annotations: the images are manually labeled with 61 binary and 4 multi-class attributes. The binary attributes include information about demographics (gender: Male, age: Age16–30, Age31–45, Age46–60, AgeAbove61), appearance (long hair), clothing (T-shirt, Trousers, etc.) and accessories (Sunglasses, Hat, Backpack, etc.). The multi-class attributes are related to (eleven basic) color(s) for the upper-body and lower-body clothing, shoe-wear, and hair of the subject. When gathering the dataset, the authors tried to balance the binary attributes; in their convention, a binary class is considered balanced if the maximal and minimal class ratio is less than 20:1. In the final version of the dataset, more than half of the binary attributes (31 attributes) have a balanced distribution.
- RAP dataset. Currently, there are two versions of the RAP (Richly Annotated Pedestrian) dataset. The first version, RAP v1 [153], was collected from a surveillance camera in shopping malls over a period of three months; next, 17 hours of video footage were manually selected for attribute annotation. In total, the dataset comprises 41,585 annotated human silhouettes. The 72 attributes labeled in this dataset include demographic information (gender and age), accessories (backpack, single shoulder bag, handbag, plastic bag, paper bag, etc.), human appearance (hair style, hair color, body shape) and clothing information (clothes style, clothes color, footwear style, footwear color, etc.). In addition, the dataset provides annotations about occlusions, viewpoints and body-parts information. The second version of the RAP dataset [108] is intended as a unifying benchmark for both person retrieval and person attribute recognition in real-world surveillance scenarios. The dataset was captured indoors, in a shopping mall, and contains 84,928 images (2589 person identities) from 25 different scenes. High-resolution cameras (1280 × 720) were used to gather the dataset, and the resolution of human silhouettes varies from 33 × 81 to 415 × 583 pixels. The annotated attributes are the same as in RAP v1 (72 attributes, plus occlusion, viewpoint, and body-parts information).
- DukeMTMC dataset. The DukeMTMC-reID (Multi-Target, Multi-Camera) dataset [154] was collected on Duke University's campus and contains more than 14 h of video sequences gathered from 8 cameras, positioned such that they capture crowded scenes. The main purpose of this dataset was person re-identification and multi-camera tracking; however, a subset of this dataset was annotated with human attributes. The annotations are provided at the identity level and include 23 attributes, regarding the gender (male, female), accessories: wearing hat (yes, no), carrying a backpack (yes, no), carrying a handbag (yes, no), carrying other types of bags (yes, no), and clothing style: shoe type (boots, other shoes), the color of shoes (dark, bright), length of upper-body clothing (long, short), 8 colors of upper-body clothing (black, white, red, purple, gray, blue, green, brown) and 7 colors of lower-body clothing (black, white, red, gray, blue, green, brown). Due to violations of civil and human rights, as well as privacy issues, Duke University terminated the DukeMTMC dataset page in June 2019.
- PA-100K dataset. PA-100k dataset [88] was developed with the intention to surpass the existing HAR datasets both in quantity and in diversity; the dataset contains more than 100,000 images captured in 598 different scenarios. The dataset was captured by outdoor surveillance cameras; therefore, the images provide large variance in image resolution, lighting conditions, and environment. The dataset is annotated with 26 attributes, including demographic (age, gender), accessories (handbag, phone) and clothing information.
- Market-1501 dataset. The Market-1501 attribute [24,155] dataset is a version of the Market-1501 dataset augmented with the annotation of 27 attributes. Market-1501 was initially intended for cross-camera person re-identification, and it was collected outdoors in front of a supermarket using 6 cameras (5 high-resolution cameras and one low-resolution). The attributes are provided at the identity level, and in total, there are 1501 annotated identities and 32,668 bounding boxes. The attributes annotated in Market-1501 attribute include demographic information (gender and age), information about accessories (wearing hat, carrying backpack, carrying bag, carrying handbag), appearance (hair length) and clothing type and color (sleeve length, length of lower-body clothing, type of lower-body clothing, 8 colors of upper-body clothing, 9 colors of lower-body clothing).
- P-DESTRE Dataset. Over recent years, as their cost has diminished considerably, UAV applications have expanded rapidly in various surveillance scenarios. As a response, several UAV datasets have been collected and made publicly available to the scientific community. Most of them are intended for human detection [156,157], action recognition [158] or re-identification [159]. To the best of our knowledge, the P-DESTRE [160] dataset is the first benchmark that addresses the problem of HAR from aerial images. The P-DESTRE dataset [160] was collected on the campuses of two universities in India and Portugal, using DJI-Phantom 4 drones controlled by human operators. The dataset provides annotations both for person re-identification and for attribute recognition. The identities are consistent across multiple days. The annotations for the attributes include demographic information: gender, ethnicity and age; appearance information: height, body volume, hair color, hairstyle, beard, moustache; accessories information: glasses, head accessories, body accessories; clothing information and action information. In total, the dataset contains over 14 million person bounding boxes, belonging to 261 known identities.
4.2. FAR Datasets
- Parse27k dataset. The Pedestrian Attribute Recognition in Sequences (Parse27k) dataset [161] contains over 27,000 pedestrian images, annotated with 10 attributes. The images were captured by a moving camera across a city environment; every 15th video frame was fed to the Deformable Part Model (DPM) pedestrian detector [78] and the resulting bounding boxes were annotated with the 10 attributes based on binary or multinomial propositions. As opposed to other datasets, the authors also included an N/A state (i.e., the labeler cannot decide on that attribute). The attributes from this dataset include gender information (3 categories—male, female, N/A), accessories (Bag on Left Shoulder, Bag on Right Shoulder, Bag in Left Hand, Bag in Right Hand, Backpack; each with three possible states: yes, no, N/A), orientation (with 4 + N/A or 8 + N/A discretizations) and action attributes: posture (standing, walking, sitting and N/A) and isPushing (yes, no, N/A). As the images were initially processed by a pedestrian detector, they consist of fixed-size bounding regions of interest, and thus are strongly aligned and contain only a subset of possible human poses.
- CRP dataset. The CRP (Caltech Roadside Pedestrians) [162] dataset was captured in real-world conditions, from a moving vehicle. The position (bounding box) of each pedestrian, together with 14 body joints, is annotated in each video frame. CRP comprises 4222 video tracks, with 27,454 pedestrian bounding boxes. The following attributes are annotated for each pedestrian—age (5 categories: child, teen, young adult, middle aged and senior), gender (2 categories—female and male), weight (3 categories: Under, Healthy and Over), and clothing style (4 categories—casual, light athletic, workout and dressy). The original, uncropped videos, together with the annotations, are publicly available.
- Describing People dataset. Describing People dataset [68] comprises 8035 images from the H3D [163] and the PASCAL VOC 2010 [164] datasets. The images from this database are aligned, in the sense that for each person, the image is cropped (by leaving some margin) and then scaled so that the distance between the hips and the shoulders is 200 pixels. The dataset features 9 binary (True/False) attributes, as follows: gender (is male), appearance (long hair), accessories (glasses) and several clothing attributes (has hat, has t-shirt, has shorts, has jeans, long sleeves, long pants). The dataset was annotated on Amazon Mechanical Turk by five independent labelers; the authors considered a valid label if at least four of the five annotators agreed on its value.
- HAT dataset. Human ATtributes (HAT) [66,78] contains 9344 images gathered from Flickr; for this purpose, the authors used more than 320 manually specified queries to retrieve images related to people and then, employed an off-the-shelf person detector to crop the humans in the images. The false positives were manually removed. Next, the images were labeled with 27 binary attributes; these attributes incorporate information about the gender (Female), age (Small baby, Small kid, Teen aged, Young (college), Middle Aged, Elderly), clothing (Wearing tank top, Wearing tee shirt, Wearing casual jacket, Formal men suit, Female long skirt, Female short skirt, Wearing short shorts, Low cut top, Female in swim suit, Female wedding dress, Bermuda/beach shorts), pose (Frontal pose, Side pose, Turned Back), Action (Standing Straight, Sitting, Running/Walking, Crouching/bent, Arms bent/crossed) and occlusions (Upper body). The images have high variations both in image size and in the subject’s position.
- WIDER dataset. The WIDER Attribute dataset [75] comprises a subset of 13,789 images selected from the WIDER database [165], by discarding the images full of non-human objects and the images in which the human attributes are indistinguishable; the human bounding boxes from these images are annotated with 14 attributes. The images contain multiple humans under different and complex variations. For each image, the authors selected a maximum of 20 bounding boxes (based on their resolution), so in total, there are more than 57,524 annotated individuals. The attributes follow a ternary taxonomy: positive, negative and unspecified, and include information about gender (Male), clothing (Tshirt, longSleeve, Formal, Shorts, Jeans, Long Pants, Skirt), accessories (Sunglasses, Hat, Face Mask, Logo) and appearance (Long Hair). In addition, each image is annotated with one of 30 event classes (meeting, picnic, parade, etc.), thus allowing researchers to correlate the human attributes with the context in which they were perceived.
- CAD dataset. The Clothing Attributes Dataset [123] uses images gathered from the Sartorialist website (https://www.thesartorialist.com/) and Flickr. The authors downloaded several images, mostly of pedestrians, and applied an upper-body detector to detect humans; they ended up with 1856 images. Next, the ground truth was established by labelers from Amazon Mechanical Turk. Each image was annotated by 6 independent individuals, and a label was accepted as ground truth if it had at least 5 agreements. The dataset is annotated with the gender of the wearer, information about the accessories (Wearing scarf, Collar presence, Placket presence) and with several attributes regarding the clothing appearance (clothing pattern, major color, clothing category, neckline shape, etc.).
- APiS dataset. The Attributed Pedestrians in Surveillance dataset [166] gathers images from four different sources: KITTI database [167], CBCL Street Scenes [168] (http://cbcl.mit.edu/software-datasets/streetscenes/), INRIA database [48] and some video sequences collected by the authors at a train station; in total APiS comprises 3661 images. The human bounding boxes are detected using an off-the-shelf pedestrian detector, and the results are manually processed by the authors: the false positives and the low-resolution images (smaller than 90 pixels in height and 35 pixels in width) are discarded. Finally, all the images of the dataset are normalized in the sense that the cropped pedestrian images are scaled to 128 × 48 pixels. These cropped images are annotated with 11 ternary attributes (positive, negative, and ambiguous) and 2 multi-class attributes. These annotations include demographic (gender) and appearance attributes (long hair), as well as information about accessories (back bag, S-S (Single Shoulder) bag, hand carrying) and clothing (shirt, T-shirt, long pants, M-S (Medium and Short) pants, long jeans, skirt, upper-body clothing color, lower-body clothing color). The multi-class attributes are the two attributes related to the clothing color. The annotation process is performed manually and divided into two stages: annotation stage (the independent labeling of each attribute) and validation stage (which exploits the relationship between the attributes to check the annotation; also, in this stage, the controversial attributes are marked as ambiguous).
4.3. Fashion Datasets
- DeepFashion Dataset. The DeepFashion dataset [91] was gathered from shopping websites, as well as image search engines (blogs, forums, user-generated content). In the first stage, the authors downloaded 1,320,078 images from shopping websites and 1,273,150 images from Google images. After a data cleaning process, in which duplicate, out-of-scope, and low-quality images were removed, 800,000 clothing images were finally selected to be included in the DeepFashion dataset. The images are annotated solely with clothing information; these annotations are divided into categories (50 labels: dress, blouse, etc.) and attributes (1000 labels: adjectives describing the categories). The categories were annotated by expert labelers, while for the attributes, due to their huge number, the authors resorted to meta-data annotation (provided by the Google search engine or by the shopping website). In addition, a set of clothing landmarks, as well as their visibility, are provided for each image. DeepFashion is split into several benchmarks for different purposes: category and attribute prediction (classification of the categories and the attributes), in-shop clothes retrieval (determine if two images belong to the same clothing item), consumer-to-shop clothes retrieval (matching consumer images to their shop counterparts) and fashion landmark detection.
4.4. Synthetic Datasets
- DeepFashion—Fashion Image Synthesis. The authors of DeepFashion [91] introduced FashionGAN, an adversarial network for generating clothing images on a wearer [171]. FashionGAN is organized into two stages: in the first stage, the network generates a semantic segmentation map modeling the wearer's pose; in the second stage, a generative model renders an image with precise regions and textures, conditioned on this map. In this context, the DeepFashion dataset was extended with 78,979 images (taken from the In-shop Clothes Benchmark), associated with several caption sentences and a segmentation map.
- CTD Dataset. The Clothing Tightness Dataset (CTD) [169] comprises 880 3D human models, under various poses, both static and dynamic, "dressed" with 228 different outfits. The garments in the dataset are grouped under various categories, such as "T/long shirt, short/long/down coat, hooded jacket, pants, and skirt/dress, ranging from ultra-tight to puffy". CTD was gathered in the context of a deep learning method that maps a 3D human scan into a hybrid geometry image. This synthetic dataset has important implications for virtual try-on systems, soft biometrics, and body pose evaluation. The main drawbacks of this dataset are that it cannot capture exaggerated human postures or handle low-quality 3D human scans.
- CLOTH3D Dataset. CLOTH3D [170] comprises thousands of 3D sequences of animated human silhouettes, "dressed" with different garments. The dataset features a large variation in garment shape, fabric, size, and tightness, as well as human pose. The main applications of this dataset listed by the authors include "human pose and action recognition in-depth images, garment motion analysis, filling missing vertices of scanned bodies with additional metadata (e.g., garment segments), support designers and animators tasks, or estimating 3D garment from RGB images".
5. Evaluation Metrics
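The metrics most commonly reported for PAR are the label-based mean accuracy (mA) and the example-based accuracy, precision, recall, and F1. The sketch below is a straightforward NumPy implementation of these standard definitions, assuming binary prediction and ground-truth matrices; the random data at the end is only for demonstration.

```python
import numpy as np

def par_metrics(preds, labels):
    """Standard PAR metrics: label-based mA and example-based Acc/Prec/Rec/F1.

    preds, labels: (num_samples, num_attributes) binary arrays.
    """
    preds, labels = preds.astype(bool), labels.astype(bool)
    eps = 1e-12

    # Label-based mean accuracy: per attribute, average of the positive and negative recalls.
    pos_recall = (preds & labels).sum(0) / (labels.sum(0) + eps)
    neg_recall = (~preds & ~labels).sum(0) / ((~labels).sum(0) + eps)
    mA = ((pos_recall + neg_recall) / 2).mean()

    # Example-based metrics, averaged over samples.
    inter = (preds & labels).sum(1)
    union = (preds | labels).sum(1)
    acc = (inter / (union + eps)).mean()
    prec = (inter / (preds.sum(1) + eps)).mean()
    rec = (inter / (labels.sum(1) + eps)).mean()
    f1 = 2 * prec * rec / (prec + rec + eps)
    return mA, acc, prec, rec, f1

# Dummy usage with random predictions and labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, (100, 10))
preds = rng.integers(0, 2, (100, 10))
print(par_metrics(preds, labels))
```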
6. Discussion
6.1. Discussion Over HAR Datasets
- Attributes definition. The first issues that should be addressed when developing a new dataset for HAR are: (1) which attributes should be annotated? and (2) how many and which classes are required to describe an attribute properly? Obviously, both questions depend on the application domain of the HAR system. Generally, the ultimate goal of a HAR system, regardless of the application domain, is to accurately describe an image in terms of human-understandable semantic labels, for example, "a five-year-old boy, dressed in blue jeans, with a yellow T-shirt, carrying a striped backpack". As for the second question, the answer is straightforward for some attributes, such as gender, but it becomes more complex and subjective for other attributes, such as age or clothing information. Take, for example, the age label; different datasets provide different classes for this information: PETA distinguishes between AgeLess15, Age16-30, Age31-45, Age46-60 and AgeAbove61, while the CRP dataset adopted a different age classification scheme: child, teen, young adult, middle aged and senior. Now, if a HAR analyzer is integrated into a surveillance system in a crowded environment, such as Disneyland, and this system should be used to locate a missing child, the age labels from the PETA dataset are not detailed enough, as the "lowest" age class is AgeLess15. Secondly, these differences between taxonomies make it difficult to assess the performance of a newly developed algorithm across different datasets.
- Unbalanced data. An important issue in any dataset is related to unbalanced data. Although some datasets were developed by explicitly striving for balanced classes, some classes are not that frequent (especially those related to clothing information), and building fully balanced datasets is not a trivial problem. The problem of imbalance also affects the demographic attributes: in all HAR datasets, the class of young children is poorly represented. To illustrate the problem of unbalanced classes, we selected two of the most prominent HAR-related datasets that are labeled with age information: CRP and PETA. In Figure 4, for each of these two datasets, we plot a pie chart showing the age distribution of the labeled images (a loss-weighting sketch that addresses such imbalance at training time is given after this list). Furthermore, as datasets are usually gathered in a single region (city, country, continent), the data tends to be unbalanced in terms of ethnicity. This is an important issue, as some studies [172] have proven the existence of the other-race effect (the tendency to more easily recognize faces of the same ethnicity) for machine learning classifiers.
- Data context. Strongly linked to the problem of data unbalance is the context or environment in which the frames were captured. The environment has a great influence on the distribution of the clothing and demographic (age, gender) attributes. In [75], the authors noticed "strong correlations between image event and the frequent human attributes in it". This is quite logical, as one would expect to encounter more casual outfits at a picnic or sporting event, while at ceremonies (weddings, graduation proms), people tend to be more elegant and dressed up. The same is valid for the demographic attributes: if the frames are captured in the backyard of a kindergarten, one would expect most of the subjects to be children. Ideally, a HAR dataset should provide images captured from multiple and varied scenes. Some datasets explicitly annotate the context in which the data was captured [75], while others address this issue by merging images from various datasets [152]. From another point of view, this leads our discussion to how the images from the datasets are presented. Generally speaking, a dataset provides the images either aligned (all the images have the same size and are cropped around the human silhouette with a predefined margin; for example, [68]), or makes the full video frame/image available and specifies the bounding box of each human in the image. We consider that the latter approach is preferable, as it also incorporates context information and allows researchers to decide how to handle the input data.
- Binary attributes. Another question in database annotation is what happens when the attribute to annotate is indistinguishable due to low-resolution and degraded images, occlusions, or other ambiguities. The majority of datasets tend to ignore this problem and simply classify the presence of an attribute or provide a multi-class attribute scheme. However, in a real-world setup, we cannot afford this luxury, as indistinguishable attributes might occur quite frequently. Therefore, some datasets [161,166] formulate the attribute classification task with N + 1 classes (+1 for the N/A label). This approach is preferable, as it allows taking both views over the data: depending on the application context, one could simply ignore the N/A attributes or, to make the classification problem more interesting, integrate the N/A value into the classification framework (the loss sketch after this list shows how N/A entries can be masked out during training).
- Camera configuration. Another aspect that should be taken into account when discussing HAR datasets is the camera setup used to capture the images or video sequences. We can distinguish between fixed-camera and moving-camera setups; obviously, this choice again depends on the application domain into which the HAR system will be integrated. For automotive applications or robotics, one should opt for a moving camera, as the camera movement might influence the visual properties of the human silhouettes. An example of a moving-camera dataset is the Parse27k dataset [161]. For surveillance applications, a static camera setup will suffice. We can also distinguish between indoor and outdoor camera setups; for example, the RAP dataset [153] uses an indoor camera, while the Parse27k dataset [161] comprises outdoor video sequences. Indoor-captured datasets, such as [153], although captured in real-world scenarios, do not pose as many challenges as outdoor-captured datasets, where the weather and lighting conditions are more volatile. Finally, the last aspect regarding the camera setup is related to the presence of a photographer. If the images are captured by a (professional) photographer, some bias is introduced, as a human decides how and when to capture the images, such that the appearance of the subject is enhanced. Some databases, such as CAD [123] or HAT [66,78], use images downloaded from public websites. However, in these images, the persons are aware of being photographed and perhaps even prepared for it (posing for the image, dressed up nicely for a photo session, etc.). Therefore, even if some datasets contain in-the-wild images gathered for a different purpose, they might still differ in important ways from real-world images in which the subject is unaware of being photographed, the image is captured automatically, without any human intervention, and the subjects are dressed normally and performing natural, dynamic movements.
- Pose and occlusion labeling. Another nice-to-have feature for a HAR dataset is the annotation of pose and occlusions. Some databases already provide this information [66,78,108,153]. Amongst other things, these extra labels prove useful in the evaluation of HAR systems, as they allow researchers to diagnose the errors of HAR and examine the influence of various factors.
- Data partitioning strategies. When dealing with HAR, the dataset partitioning scheme (into train, validation, and test splits) should be carefully engineered. A common pitfall is to split the frames into the train and validation splits randomly, regardless of the person's identity. This can lead to the same subject appearing in both splits, inducing bias in the evaluation process. This is even more important as the current state-of-the-art methods generally rely on deep neural network architectures, which are black boxes by nature, making it far from straightforward to determine which image features lead to the final classification result. Solutions to this problem include extracting each individual (along with its tracklets) from the video sequence or providing the annotations at the identity level. Then, each person can be randomly assigned to one of the dataset splits (an identity-level split sketch is given after this list).
- Synthetic data. Recently, significant advances have been made in the field of computer graphics and synthetic data generation. For example, in the field of drone surveillance, generated data [173] has proven its efficiency in training accurate machine vision systems. In this section, we have presented some computer-generated datasets which contain human attribute annotations. We consider that synthetically generated data is worth taking into consideration, as, theoretically, it can be regarded as an inexhaustible source of data, able to generate subjects with various attributes, under different poses, in diverse scenarios. However, state-of-the-art generative models rely on deep learning, which is known to be "hungry" for data, so a large amount of data is needed to build a realistic generative model in the first place. Therefore, this solution might prove to be just a vicious circle.
- Privacy issues. In the past, as traditional video surveillance systems were simple and involved only human monitoring, privacy was not a major concern; however, these days, the pervasiveness of systems equipped with cutting-edge technologies in public places (e.g., shopping malls, private and public buildings, bus and train stations) has raised new privacy and security concerns. For instance, the Privacy Commissioner of Canada (OPC) is an organization that helps people report their privacy concerns and obliges enterprises to manage people's personal data in their business activities according to restrictive standards (https://www.priv.gc.ca/en/report-a-concern/). When gathering a dataset with real-world images, we deal with privacy and human rights issues. Ideally, HAR datasets should contain images captured by real-world surveillance cameras, with the subjects unaware of being filmed, such that their behavior is as natural as possible. From an ethical perspective, humans should consent before their images are annotated and publicly distributed. However, this is not feasible in all scenarios. For example, the BRAINWASH [174] dataset was gathered inside a private cafe for the purpose of head detection, and comprised 11,917 images. Although this benchmark is not very popular, it appears in lists of popular datasets for commercial and military applications, as it captured the regular customers without their awareness. The DUKE MTMC [152] dataset targets the task of multi-person re-identification from full-body images taken by several cameras. This dataset was collected on a university campus in an outdoor environment and contains over 2 million frames of 2000 students captured by 8 cameras at 1080p. MS-CELEB-1M [175] is another large dataset, of 10 million faces, collected from the Internet. However, despite the success of these datasets (if we evaluate success by the number of citations and dataset downloads), the authors decided to shut down the datasets due to human rights and privacy violation issues. According to a Pew Research Center Privacy Panel Survey conducted from 27 January to 16 February 2015, among 461 adults, more than 90 percent agreed that two factors are critical for surveillance systems: (1) who can access their information? and (2) what information is collected about them? Moreover, it is notable that respondents consent to sharing confidential information with someone they trust (93%); however, it is important to them not to be monitored without permission (88%). As people's faces contain sensitive information that could be captured in the wild, authorities have published standards (https://gdpr-info.eu/) to oblige enterprises to respect the privacy of their customers.
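Regarding the unbalanced-data and N/A-label issues discussed above, both can be handled directly in the training objective. The sketch below is a generic weighted and masked multi-label loss (an illustrative formulation, not the loss of any surveyed paper), in which per-attribute positive weights compensate for rare labels and entries marked N/A (encoded here as -1, an arbitrary convention) are excluded from the loss.

```python
import torch
import torch.nn as nn

# Generic weighted + masked multi-label loss (illustrative, not a specific paper's loss).
# Target convention (an assumption): 1 = positive, 0 = negative, -1 = N/A (indistinguishable).
def masked_weighted_bce(logits, targets, pos_weight):
    mask = (targets >= 0).float()                              # drop N/A entries from the objective
    clean_targets = targets.clamp(min=0).float()               # -1 becomes 0, but is masked out anyway
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, clean_targets, pos_weight=pos_weight, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)

# Positive weights estimated from (dummy) training-set frequencies: rare attributes weigh more.
pos_ratio = torch.tensor([0.50, 0.05, 0.20])                   # fraction of positive samples per attribute
pos_weight = (1 - pos_ratio) / pos_ratio

logits = torch.randn(4, 3)
targets = torch.tensor([[1., 0., -1.], [0., 1., 0.], [1., -1., 1.], [0., 0., 0.]])
loss = masked_weighted_bce(logits, targets, pos_weight)
```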
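For the identity-aware partitioning discussed above, one convenient generic option is a group-wise split in which the person identity acts as the group. The sketch below uses scikit-learn's GroupShuffleSplit on dummy data; the identity counts and split ratio are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Dummy example: 1000 bounding boxes belonging to 100 person identities.
rng = np.random.default_rng(0)
identities = rng.integers(0, 100, size=1000)       # one identity label per bounding box
samples = np.arange(1000)                          # indices of the cropped images

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(samples, groups=identities))

# No identity appears on both sides of the split.
assert not set(identities[train_idx]) & set(identities[test_idx])
```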
6.2. Critical Discussion and Performance Comparison
- the expressiveness of the data is lost (e.g., when processing a jacket only as several separate parts, some global features that encode the jacket's shape and structure are ignored).
- as the person detector cannot always provide aligned and accurate bounding boxes, rigid partitioning methods are prone to errors in capturing body parts, mainly when the input data includes a wide background. Therefore, methods based on stride/grid patching of the image are not robust to misalignment errors in the person bounding boxes, leading to degradation in prediction performance.
- different from gender and age, most human attributes (such as glasses, hat, scarf, shoes, etc.) belong to small regions of the image; therefore, analyzing other parts of the image may add irrelevant features to the final feature representation of the image.
- some attributes are view-dependent and highly changeable due to human body pose, and ignoring this fact reduces model performance; for example, recognizing glasses in side-view images is more laborious than in front-view images, while it may be impossible in back-view images. Therefore, in some localization methods (e.g., poselet-based techniques) that disregard this fact, features of different parts may be aggregated to perform a prediction on an unavailable attribute.
- some localization methods rely on body-parsing techniques [176] or body-part detection methods [177] to extract local features. Not only does training such part detectors require rich data annotations, but errors in body parsing and body-part detection also directly affect the performance of the HAR model.
- in contrast to the breakthroughs in face generative models [181], full-body generative models are still in their early stages and their performance is still unsatisfactory;
- the generated data is unlabeled, while HAR is still far from the stage where it can be trained on unlabeled data. It is worth mentioning that automatic annotation is an active research area in object detection [182];
- learning high-quality generative models of the full human body not only takes a long time, but also requires a large amount of high-resolution training data, which is not yet available.
- There should be some relationship between the data of task A and task B. For example, applying weights pre-trained on the ImageNet dataset [50] to a HAR task is beneficial, as both domains deal with RGB images of objects, including humans, while transferring knowledge from medical imagery (e.g., CT/MRI) is not useful and may only burden the model with unnecessary parameters.
- The data available for task A should be much larger than the data for task B, as transferring knowledge from other small datasets cannot guarantee performance improvements.
- we can freeze the backbone model and add several classification layers on top of it for fine-tuning;
- we can add the proper classification layers on top of the model and train all the layers in several steps: (1) freeze the backbone and fine-tune the new layers; (2) unfreeze the high-level feature-extractor layers and fine-tune them with a lower learning rate; (3) in further steps, unfreeze the mid- and low-level layers and train them with an even lower learning rate, as these features are normally shared across most tasks on the same data type (a minimal sketch of this staged schedule follows this list).
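The staged schedule described in the second strategy can be sketched in a few lines. The example below follows torchvision's ResNet-50 layer naming and uses arbitrary learning rates and an assumed attribute count, so it is an illustration of the progressive-unfreezing idea rather than a prescribed recipe.

```python
import torch
import torch.nn as nn
import torchvision

NUM_ATTRIBUTES = 26                                            # assumption

model = torchvision.models.resnet50(pretrained=True)          # backbone trained on task A (ImageNet)
model.fc = nn.Linear(model.fc.in_features, NUM_ATTRIBUTES)     # new classification layer for task B (HAR)

# Step 1: freeze the backbone and fine-tune only the new head.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# Step 2: unfreeze the high-level feature extractor (layer4) with a lower learning rate.
for param in model.layer4.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam([
    {"params": model.fc.parameters(), "lr": 1e-3},
    {"params": model.layer4.parameters(), "lr": 1e-4},
])

# Step 3: unfreeze the remaining mid-/low-level layers and train the whole model with a
# small learning rate, since these features transfer well across RGB-image tasks.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```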
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
Acc | Accuracy |
CAA | Clothing Attribute Analysis |
CAMs | Class Activation Maps |
CCTV | Closed-Circuit TeleVision |
CNN | Convolutional Neural Network |
CSD | Color Structure Descriptor |
CRF | Conditional Random Field |
DPM | Deformable Part Model |
FAA | Facial Attribute Analysis |
FAR | Full-body Attribute Recognition |
FP | False Positives |
GAN | Generative Adversarial Network |
GCN | Graph Convolutional Network |
HAR | Human Attribute Recognition |
HOG | Histogram of Oriented Gradients |
re-id | re-identification |
LMLE | Large Margin Local Embedding |
LSTM | Long Short Term Memory |
mA | mean Accuracy |
MAP | Maximum A Posteriori |
MAResNet | Multi-Attribute Residual Network |
MCSH | Major Colour Spectrum Histogram |
MSCR | Maximally Stable Colour Regions |
Prec | Precision |
Rec | Recall |
ResNet | Residual Networks |
RHSP | Recurrent Highly-Structured Patches |
RoI | Regions of Interest |
RNN | Recurrent Neural Networks |
SIFT | Scale Invariant Feature Transform |
SE-Net | Squeeze-and-Excitation Networks |
SMOTE | Synthetic Minority Over-sampling TEchnique |
SPR | Spatial Pyramid Representation |
SSD | Single Shot Detector |
STN | Spatial Transformer Network |
SVM | Support Vector Machine |
TN | True Negative |
TP | True Positive |
UAV | Unmanned Aerial Vehicle |
VAE | Variational Auto-Encoders |
YOLO | You Only Look Once |
References
- Tripathi, G.; Singh, K.; Vishwakarma, D.K. Convolutional neural networks for crowd behaviour analysis: A survey. Vis. Comput. 2019, 35, 753–776. [Google Scholar] [CrossRef]
- Yan, Y.; Zhang, Q.; Ni, B.; Zhang, W.; Xu, M.; Yang, X. Learning context graph for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2158–2167. [Google Scholar]
- Munjal, B.; Amin, S.; Tombari, F.; Galasso, F. Query-guided end-to-end person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 811–820. [Google Scholar]
- Priscilla, C.V.; Sheila, S.A. Pedestrian Detection-A Survey. In Proceedings of the International Conference on Information, Communication and Computing Technology, Istanbul, Turkey, 30–31 October 2019; pp. 349–358. [Google Scholar]
- Narayan, N.; Sankaran, N.; Setlur, S.; Govindaraju, V. Learning deep features for online person tracking using non-overlapping cameras: A survey. Image Vis. Comput. 2019, 89, 222–235. [Google Scholar] [CrossRef]
- Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.C. Deep learning for person re-identification: A survey and outlook. arXiv 2020, arXiv:2001.04193. [Google Scholar]
- Xiang, J.; Dong, T.; Pan, R.; Gao, W. Clothing Attribute Recognition Based on RCNN Framework Using L-Softmax Loss. IEEE Access 2020, 8, 48299–48313. [Google Scholar] [CrossRef]
- Guo, B.H.; Nixon, M.S.; Carter, J.N. A joint density based rank-score fusion for soft biometric recognition at a distance. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 3457–3462. [Google Scholar]
- Thom, N.; Hand, E.M. Facial Attribute Recognition: A Survey. In Computer Vision: A Reference Guide; Springer: Cham, Switzerland, 2020; pp. 1–13. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2012; pp. 1097–1105. [Google Scholar]
- Bekele, E.; Lawson, W. The deeper, the better: Analysis of person attributes recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–8. [Google Scholar]
- Zheng, X.; Guo, Y.; Huang, H.; Li, Y.; He, R. A Survey of Deep Facial Attribute Analysis. Int. J. Comput. Vis. 2020, 1–33. [Google Scholar] [CrossRef] [Green Version]
- Wang, X.; Zheng, S.; Yang, R.; Luo, B.; Tang, J. Pedestrian attribute recognition: A survey. arXiv 2019, arXiv:1901.07474. [Google Scholar]
- Masi, I.; Wu, Y.; Hassner, T.; Natarajan, P. Deep face recognition: A survey. In Proceedings of the 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Paraná, Brazil, 29 October–1 November 2018; pp. 471–478. [Google Scholar]
- Huang, G.B.; Lee, H.; Learned-Miller, E. Learning hierarchical representations for face verification with convolutional deep belief networks. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2518–2525. [Google Scholar]
- Sun, Y.; Liang, D.; Wang, X.; Tang, X. Deepid3: Face recognition with very deep neural networks. arXiv 2015, arXiv:1502.00873. [Google Scholar]
- De Marsico, M.; Petrosino, A.; Ricciardi, S. Iris recognition through machine learning techniques: A survey. Pattern Recognit. Lett. 2016, 82, 106–115. [Google Scholar] [CrossRef]
- Battistone, F.; Petrosino, A. TGLSTM: A time based graph deep learning approach to gait recognition. Pattern Recognit. Lett. 2019, 126, 132–138. [Google Scholar] [CrossRef]
- Terrier, P. Gait recognition via deep learning of the center-of-pressure trajectory. Appl. Sci. 2020, 10, 774. [Google Scholar] [CrossRef] [Green Version]
- Layne, R.; Hospedales, T.M.; Gong, S.; Mary, Q. Person re-identification by attributes. Bmvc 2012, 2, 8. [Google Scholar]
- Lin, Y.; Zheng, L.; Zheng, Z.; Wu, Y.; Hu, Z.; Yan, C.; Yang, Y. Improving person re-identification by attribute and identity learning. Pattern Recognit. 2019, 95, 151–161. [Google Scholar] [CrossRef] [Green Version]
- Liu, J.; Kuipers, B.; Savarese, S. Recognizing human actions by attributes. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 21–23 June 2011; pp. 3337–3344. [Google Scholar]
- Shao, J.; Kang, K.; Change Loy, C.; Wang, X. Deeply learned attributes for crowded scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4657–4666. [Google Scholar]
- Tsiamis, N.; Efthymiou, L.; Tsagarakis, K.P. A Comparative Analysis of the Legislation Evolution for Drone Use in OECD Countries. Drones 2019, 3, 75. [Google Scholar] [CrossRef] [Green Version]
- Fukui, H.; Yamashita, T.; Yamauchi, Y.; Fujiyoshi, H.; Murase, H. Robust pedestrian attribute recognition for an unbalanced dataset using mini-batch training with rarity rate. In Proceedings of the 2016 IEEE Intelligent Vehicles Symposium (IV), Gothenburg, Sweden, 19–22 June 2016; pp. 322–327. [Google Scholar]
- Prabhakar, S.; Pankanti, S.; Jain, A.K. Biometric recognition: Security and privacy concerns. IEEE Secur. Priv. 2003, 1, 33–42. [Google Scholar] [CrossRef]
- Xiu, Y.; Li, J.; Wang, H.; Fang, Y.; Lu, C. Pose flow: Efficient online pose tracking. arXiv 2018, arXiv:1802.00977. [Google Scholar]
- Neves, J.; Narducci, F.; Barra, S.; Proença, H. Biometric recognition in surveillance scenarios: A survey. Artif. Intell. Rev. 2016, 46, 515–541. [Google Scholar]
- Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674. [Google Scholar] [CrossRef] [Green Version]
- Kamiński, B.; Jakubczyk, M.; Szufel, P. A framework for sensitivity analysis of decision trees. Cent. Eur. J. Oper. Res. 2018, 26, 135–159. [Google Scholar] [CrossRef]
- McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133. [Google Scholar] [CrossRef]
- Zhang, G.P. Neural networks for classification: A survey. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2000, 30, 451–462. [Google Scholar] [CrossRef] [Green Version]
- Georgiou, T.; Liu, Y.; Chen, W.; Lew, M. A survey of traditional and deep learning-based feature descriptors for high dimensional data in computer vision. Int. J. Multimed. Inf. Retr. 2020, 9, 135–170. [Google Scholar] [CrossRef] [Green Version]
- Satta, R. Appearance descriptors for person re-identification: A comprehensive review. arXiv 2013, arXiv:1307.5748. [Google Scholar]
- Piccardi, M.; Cheng, E.D. Track matching over disjoint camera views based on an incremental major color spectrum histogram. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, Como, Italy, 15–16 September 2005; pp. 147–152. [Google Scholar]
- Chien, S.Y.; Chan, W.K.; Cherng, D.C.; Chang, J.Y. Human object tracking algorithm with human color structure descriptor for video surveillance systems. In Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, Toronto, ON, Canada, 9–12 June 2006; pp. 2097–2100. [Google Scholar]
- Wong, K.M.; Po, L.M.; Cheung, K.W. Dominant color structure descriptor for image retrieval. In Proceedings of the 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 16–19 September 2007; Volume 6, p. VI-365. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Iqbal, J.M.; Lavanya, J.; Arun, S. Abnormal Human Activity Recognition using Scale Invariant Feature Transform. Int. J. Curr. Eng. Technol. 2015, 5, 3748–3751. [Google Scholar]
- Forssén, P.E. Maximally stable colour regions for recognition and matching. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
- Basovnik, S.; Mach, L.; Mikulik, A.; Obdrzalek, D. Detecting scene elements using maximally stable colour regions. In Proceedings of the EUROBOT Conference, Prague, Czech Republic, 15–17 June 2009. [Google Scholar]
- He, N.; Cao, J.; Song, L. Scale space histogram of oriented gradients for human detection. In Proceedings of the 2008 International Symposium on Information Science and Engineering, Shanghai, China, 20–22 December 2008; Volume 2, pp. 167–170. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
- Beiping, H.; Wen, Z. Fast Human Detection Using Motion Detection and Histogram of Oriented Gradients. JCP 2011, 6, 1597–1604. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the CVPR09, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
- Shin, H.C.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 2016, 35, 1285–1298. [Google Scholar] [CrossRef] [Green Version]
- Alirezazadeh, P.; Yaghoubi, E.; Assunção, E.; Neves, J.C.; Proença, H. Pose Switch-based Convolutional Neural Network for Clothing Analysis in Visual Surveillance Environment. In Proceedings of the 2019 International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, 18–20 September 2019; pp. 1–5. [Google Scholar]
- Yaghoubi, E.; Alirezazadeh, P.; Assunção, E.; Neves, J.C.; Proençaã, H. Region-Based CNNs for Pedestrian Gender Recognition in Visual Surveillance Environments. In Proceedings of the 2019 International Conference of the Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, 18–20 September 2019; pp. 1–5. [Google Scholar]
- Zeng, H.; Ai, H.; Zhuang, Z.; Chen, L. Multi-Task Learning via Co-Attentive Sharing for Pedestrian Attribute Recognition. arXiv 2020, arXiv:2004.03164. [Google Scholar]
- Lu, Y.; Kumar, A.; Zhai, S.; Cheng, Y.; Javidi, T.; Feris, R. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5334–5343. [Google Scholar]
- Guo, Y.; Liu, Y.; Oerlemans, A.; Lao, S.; Wu, S.; Lew, M.S. Deep learning for visual understanding: A review. Neurocomputing 2016, 187, 27–48. [Google Scholar] [CrossRef]
- Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 1–62. [Google Scholar] [CrossRef] [Green Version]
- Li, Y.; Xu, H.; Bian, M.; Xiao, J. Attention based CNN-ConvLSTM for pedestrian attribute recognition. Sensors 2020, 20, 811. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wu, J.; Liu, H.; Jiang, J.; Qi, M.; Ren, B.; Li, X.; Wang, Y. Person Attribute Recognition by Sequence Contextual Relation Learning. IEEE Trans. Circuits Syst. Video Technol. 2020. [Google Scholar] [CrossRef]
- Krause, J.; Gebru, T.; Deng, J.; Li, L.J.; Fei-Fei, L. Learning features and parts for fine-grained recognition. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 26–33. [Google Scholar]
- Sarfraz, M.S.; Schumann, A.; Wang, Y.; Stiefelhagen, R. Deep view-sensitive pedestrian attribute inference in an end-to-end model. arXiv 2017, arXiv:1707.06089. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Li, D.; Chen, X.; Zhang, Z.; Huang, K. Pose guided deep model for pedestrian attribute recognition in surveillance scenarios. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
- Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732. [Google Scholar]
- Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Advances in Neural Information Processing Systems 25; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2015; pp. 2017–2025. [Google Scholar]
- Sharma, G.; Jurie, F. Learning discriminative spatial representation for image classification. In BMVC 2011—British Machine Vision Conference; Hoey, J., McKenna, S.J., Trucco, E., Eds.; BMVA Press: Dundee, UK, 2011; pp. 1–11. [Google Scholar] [CrossRef] [Green Version]
- Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 2169–2178. [Google Scholar]
- Bourdev, L.; Maji, S.; Malik, J. Describing people: A poselet-based approach to attribute classification. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1543–1550. [Google Scholar]
- Joo, J.; Wang, S.; Zhu, S.C. Human attribute recognition by rich appearance dictionary. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 721–728. [Google Scholar]
- Sharma, G.; Jurie, F.; Schmid, C. Expanded Parts Model for Human Attribute and Action Recognition in Still Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013. [Google Scholar]
- Zhang, N.; Paluri, M.; Ranzato, M.; Darrell, T.; Bourdev, L. Panda: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 1637–1644. [Google Scholar]
- Zhu, J.; Liao, S.; Yi, D.; Lei, Z.; Li, S.Z. Multi-label cnn based pedestrian attribute learning for soft biometrics. In Proceedings of the 2015 International Conference on Biometrics (ICB), Phuket, Thailand, 19–22 May 2015; pp. 535–540. [Google Scholar]
- Zhu, J.; Liao, S.; Lei, Z.; Li, S.Z. Multi-label convolutional neural network based pedestrian attribute classification. Image Vis. Comput. 2017, 58, 224–229. [Google Scholar] [CrossRef]
- Yu, K.; Leng, B.; Zhang, Z.; Li, D.; Huang, K. Weakly-supervised learning of mid-level features for pedestrian attribute recognition and localization. arXiv 2016, arXiv:1611.05603. [Google Scholar]
- Li, Y.; Huang, C.; Loy, C.C.; Tang, X. Human attribute recognition by deep hierarchical contexts. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 684–700. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448. [Google Scholar]
- Gkioxari, G.; Girshick, R.; Malik, J. Actions and attributes from wholes and parts. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2470–2478. [Google Scholar]
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern ana. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef] [Green Version]
- Zhang, N.; Farrell, R.; Iandola, F.; Darrell, T. Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013. [Google Scholar]
- Yang, L.; Zhu, L.; Wei, Y.; Liang, S.; Tan, P. Attribute Recognition from Adaptive Parts. In Proceedings of the British Machine Vision Conference (BMVC); Richard, C., Wilson, E.R.H., Smith, W.A.P., Eds.; BMVA Press: Dundee, UK, 2016; pp. 81.1–81.11. [Google Scholar] [CrossRef] [Green Version]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 3686–3693. [Google Scholar]
- Zhang, Y.; Gu, X.; Tang, J.; Cheng, K.; Tan, S. Part-based attribute-aware network for person re-identification. IEEE Access 2019, 7, 53585–53595. [Google Scholar] [CrossRef]
- Fan, X.; Zheng, K.; Lin, Y.; Wang, S. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1347–1355. [Google Scholar]
- Oquab, M.; Bottou, L.; Laptev, I.; Sivic, J. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 685–694. [Google Scholar]
- Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; Oliva, A. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems 27; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2014; pp. 487–495. [Google Scholar]
- Guo, H.; Fan, X.; Wang, S. Human attribute recognition by refining attention heat map. Pattern Recognit. Lett. 2017, 94, 38–45. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Liu, X.; Zhao, H.; Tian, M.; Sheng, L.; Shao, J.; Yi, S.; Yan, J.; Wang, X. HydraPlus-Net: Attentive deep features for pedestrian analysis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 350–359. [Google Scholar]
- Wang, W.; Xu, Y.; Shen, J.; Zhu, S.C. Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499. [Google Scholar]
- Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; Tang, X. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1096–1104. [Google Scholar]
- Tan, Z.; Yang, Y.; Wan, J.; Hang, H.; Guo, G.; Li, S.Z. Attention-Based Pedestrian Attribute Analysis. IEEE Trans. Image Process. 2019, 28, 6126–6140. [Google Scholar] [CrossRef]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Wu, M.; Huang, D.; Guo, Y.; Wang, Y. Distraction-Aware Feature Learning for Human Attribute Recognition via Coarse-to-Fine Attention Mechanism. arXiv 2019, arXiv:1911.11351. [Google Scholar] [CrossRef]
- Zhu, F.; Li, H.; Ouyang, W.; Yu, N.; Wang, X. Learning spatial regularization with image-level supervisions for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5513–5522. [Google Scholar]
- Yaghoubi, E.; Borza, D.; Neves, J.; Kumar, A.; Proença, H. An attention-based deep learning model for multiple pedestrian attributes recognition. Image Vis. Comput. 2020, 1–25. [Google Scholar] [CrossRef]
- Liu, P.; Liu, X.; Yan, J.; Shao, J. Localization guided learning for pedestrian attribute recognition. arXiv 2018, arXiv:1808.09102. [Google Scholar]
- Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
- Zitnick, C.L.; Dollár, P. Edge boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 391–405. [Google Scholar]
- Tang, C.; Sheng, L.; Zhang, Z.; Hu, X. Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–3 November 2019; pp. 4997–5006. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Bekele, E.; Lawson, W.E.; Horne, Z.; Khemlani, S. Implementing a robust explanatory bias in a person re-identification network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2165–2172. [Google Scholar]
- Bekele, E.; Narber, C.; Lawson, W. Multi-attribute residual network (MAResNet) for soft-biometrics recognition in surveillance scenarios. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 386–393. [Google Scholar]
- Dong, Q.; Gong, S.; Zhu, X. Multi-task Curriculum Transfer Deep Learning of Clothing Attributes. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 520–529. [Google Scholar]
- Chen, Q.; Huang, J.; Feris, R.; Brown, L.M.; Dong, J.; Yan, S. Deep domain adaptation for describing people based on fine-grained clothing attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5315–5324. [Google Scholar]
- Li, Q.; Zhao, X.; He, R.; Huang, K. Pedestrian attribute recognition by joint visual-semantic reasoning and knowledge distillation. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 833–839. [Google Scholar]
- Li, D.; Zhang, Z.; Chen, X.; Huang, K. A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios. IEEE Trans. Image Process. 2018, 28, 1575–1590. [Google Scholar] [CrossRef]
- Wang, J.; Zhu, X.; Gong, S.; Li, W. Attribute recognition by joint recurrent learning of context and correlation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 531–540. [Google Scholar]
- Li, Q.; Zhao, X.; He, R.; Huang, K. Visual-semantic graph reasoning for pedestrian attribute recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8634–8641. [Google Scholar]
- He, K.; Wang, Z.; Fu, Y.; Feng, R.; Jiang, Y.G.; Xue, X. Adaptively weighted multi-task deep network for person attribute classification. In Proceedings of the 25th ACM International Conference on Multimedia, Silicon Valley, CA, USA, 23–27 October 2017; pp. 1636–1644. [Google Scholar]
- Sarafianos, N.; Giannakopoulos, T.; Nikou, C.; Kakadiaris, I.A. Curriculum learning of visual attribute clusters for multi-task classification. Pattern Recognit. 2018, 80, 94–108. [Google Scholar] [CrossRef] [Green Version]
- Sarafianos, N.; Giannakopoulos, T.; Nikou, C.; Kakadiaris, I.A. Curriculum learning for multi-task classification of visual attributes. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2608–2615. [Google Scholar]
- Martinho-Corbishley, D.; Nixon, M.S.; Carter, J.N. Soft biometric retrieval to describe and identify surveillance images. In Proceedings of the 2016 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA), Sendai, Japan, 29 February–2 March 2016; pp. 1–6. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; So Kweon, I. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Liu, H.; Wu, J.; Jiang, J.; Qi, M.; Ren, B. Sequence-based person attribute recognition with joint CTC-attention model. arXiv 2018, arXiv:1811.08115. [Google Scholar]
- Zhao, X.; Sang, L.; Ding, G.; Guo, Y.; Jin, X. Grouping Attribute Recognition for Pedestrian with Joint Recurrent Learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3177–3183. [Google Scholar]
- Zhao, X.; Sang, L.; Ding, G.; Han, J.; Di, N.; Yan, C. Recurrent attention model for pedestrian attribute recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9275–9282. [Google Scholar]
- Ji, Z.; Zheng, W.; Pang, Y. Deep pedestrian attribute recognition based on LSTM. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 151–155. [Google Scholar]
- Tan, Z.; Yang, Y.; Wan, J.; Guo, G.; Li, S.Z. Relation-Aware Pedestrian Attribute Recognition with Graph Convolutional Networks. In Proceedings of the AAAI, New York, NY, USA, 7–12 February 2020; pp. 12055–12062. [Google Scholar]
- Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems 29; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2016; pp. 4898–4906. [Google Scholar]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Chen, H.; Gallagher, A.; Girod, B. Describing clothing by semantic attributes. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 609–623. [Google Scholar]
- Park, S.; Nie, B.X.; Zhu, S.C. Attribute And-Or Grammar for Joint Parsing of Human Pose, Parts and Attributes. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1555–1569. [Google Scholar] [CrossRef]
- Han, K.; Wang, Y.; Shu, H.; Liu, C.; Xu, C.; Xu, C. Attribute aware pooling for pedestrian attribute recognition. arXiv 2019, arXiv:1907.11837. [Google Scholar]
- Ji, Z.; He, E.; Wang, H.; Yang, A. Image-attribute reciprocally guided attention network for pedestrian attribute recognition. Pattern Recognit. Lett. 2019, 120, 89–95. [Google Scholar] [CrossRef]
- Liang, K.; Chang, H.; Shan, S.; Chen, X. A Unified Multiplicative Framework for Attribute Learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Li, D.; Chen, X.; Huang, K. Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 111–115. [Google Scholar]
- Zhao, Y.; Shen, X.; Jin, Z.; Lu, H.; Hua, X.S. Attribute-driven feature disentangling and temporal aggregation for video person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4913–4922. [Google Scholar]
- Hou, R.; Ma, B.; Chang, H.; Gu, X.; Shan, S.; Chen, X. VRSTC: Occlusion-Free Video Person Re-Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Xu, J.; Yang, H. Identification of pedestrian attributes based on video sequence. In Proceedings of the 2018 IEEE International Conference on Advanced Manufacturing (ICAM), Yunlin, Taiwan, 16–18 November 2018; pp. 467–470. [Google Scholar]
- Fabbri, M.; Calderara, S.; Cucchiara, R. Generative adversarial models for people attribute recognition in surveillance. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
- Chen, Z.; Li, A.; Wang, Y. A temporal attentive approach for video-based pedestrian attribute recognition. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xi’an, China, 8–11 November 2019; pp. 209–220. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008. [Google Scholar]
- Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005. [Google Scholar]
- Wang, Y.; Gan, W.; Yang, J.; Wu, W.; Yan, J. Dynamic curriculum learning for imbalanced data classification. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–3 November 2019; pp. 5017–5026. [Google Scholar]
- Tang, Y.; Zhang, Y.Q.; Chawla, N.V.; Krasser, S. SVMs Modeling for Highly Imbalanced Classification. IEEE Trans. Syst. Man Cybern. Part B 2008, 39, 281–288. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Zhou, Z.H.; Liu, X.Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng. 2005, 18, 63–77. [Google Scholar] [CrossRef]
- Zadrozny, B.; Langford, J.; Abe, N. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the Third IEEE International Conference on Data Mining, Melbourne, FL, USA, 22 November 2003. [Google Scholar]
- Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving Prediction of the Minority Class in Boosting. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Cavtat-Dubrovnik, Croatia, 22–26 September 2003. [Google Scholar]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Jo, T.; Japkowicz, N. Class imbalances versus small disjuncts. ACM Sigkdd Explor. Newsl. 2004, 6, 40–49. [Google Scholar] [CrossRef]
- Haixiang, G.; Yijing, L.; Shang, J.; Mingyun, G.; Yuanyue, H.; Bing, G. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239. [Google Scholar] [CrossRef]
- Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the ICML, Nashville, TN, USA, 8–12 July 1997. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Canada, 14–18 June 2009; pp. 41–48. [Google Scholar]
- Sarafianos, N.; Xu, X.; Kakadiaris, I.A. Deep imbalanced attribute classification using visual attention aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 680–697. [Google Scholar]
- Yamaguchi, K.; Okatani, T.; Sudo, K.; Murasaki, K.; Taniguchi, Y. Mix and Match: Joint Model for Clothing and Attribute Recognition. In Proceedings of the British Machine Vision Conference (BMVC); BMVA Press: Dundee, UK, 2015; Volume 1, p. 4. [Google Scholar]
- Yamaguchi, K.; Berg, T.L.; Ortiz, L.E. Chic or social: Visual popularity analysis in online fashion networks. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 773–776. [Google Scholar]
- Deng, Y.; Luo, P.; Loy, C.C.; Tang, X. Pedestrian attribute recognition at far distance. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 789–792. [Google Scholar]
- Li, D.; Zhang, Z.; Chen, X.; Ling, H.; Huang, K. A richly annotated dataset for pedestrian attribute recognition. arXiv 2016, arXiv:1603.07054. [Google Scholar]
- Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance Measures and a Data Set for Multi-Target, Multi-Camera Tracking. In Proceedings of the European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, Amsterdam, The Netherlands, 8–10 October 2016. [Google Scholar]
- Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision meets drones: A challenge. arXiv 2018, arXiv:1804.07437. [Google Scholar]
- Barekatain, M.; Martí, M.; Shih, H.F.; Murray, S.; Nakayama, K.; Matsuo, Y.; Prendinger, H. Okutama-action: An aerial view video dataset for concurrent human action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 28–35. [Google Scholar]
- Perera, A.G.; Law, Y.W.; Chahl, J. Drone-Action: An Outdoor Recorded Drone Video Dataset for Action Recognition. Drones 2019, 3, 82. [Google Scholar] [CrossRef] [Green Version]
- Zhang, S.; Zhang, Q.; Yang, Y.; Wei, X.; Wang, P.; Jiao, B.; Zhang, Y. Person Re-identification in Aerial imagery. IEEE Trans. Multimed. 2020, 1. [Google Scholar] [CrossRef] [Green Version]
- Aruna Kumar, S.; Yaghoubi, E.; Das, A.; Harish, B.; Proença, H. The P-DESTRE: A Fully Annotated Dataset for Pedestrian Detection, Tracking, Re-Identification and Search from Aerial Devices. arXiv 2020, arXiv:2004.02782. [Google Scholar]
- Sudowe, P.; Spitzer, H.; Leibe, B. Person attribute recognition with a jointly-trained holistic cnn model. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 87–95. [Google Scholar]
- Hall, D.; Perona, P. Fine-grained classification of pedestrians in video: Benchmark and state of the art. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5482–5491. [Google Scholar]
- Bourdev, L.; Malik, J. Poselets: Body part detectors trained using 3d human pose annotations. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 1365–1372. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef] [Green Version]
- Xiong, Y.; Zhu, K.; Lin, D.; Tang, X. Recognize complex events from static images by fusing deep channels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1600–1609. [Google Scholar]
- Zhu, J.; Liao, S.; Lei, Z.; Yi, D.; Li, S. Pedestrian attribute classification in surveillance: Database and evaluation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 331–338. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. (IJRR) 2013. [Google Scholar] [CrossRef] [Green Version]
- Bileschi, S.M.; Wolf, L. CBCL Streetscenes; Technical Report; Center for Biological and Computational Learning (CBCL): Cambridge, MA, USA, 2006. [Google Scholar]
- Chen, X.; Pang, A.; Zhu, Y.; Li, Y.; Luo, X.; Zhang, G.; Wang, P.; Zhang, Y.; Li, S.; Yu, J. Towards 3D Human Shape Recovery Under Clothing. arXiv 2019, arXiv:1904.02601. [Google Scholar]
- Bertiche, H.; Madadi, M.; Escalera, S. CLOTH3D: Clothed 3D Humans. arXiv 2019, arXiv:1912.02792. [Google Scholar]
- Zhu, S.; Fidler, S.; Urtasun, R.; Lin, D.; Loy, C.C. Be Your Own Prada: Fashion Synthesis with Structural Coherence. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Phillips, P.J.; Jiang, F.; Narvekar, A.; Ayyad, J.; O’Toole, A.J. An other-race effect for face recognition algorithms. ACM Trans. Appl. Percept. (TAP) 2011, 8, 1–11. [Google Scholar] [CrossRef]
- Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics; Springer: Cham, Switzerland, 2018; pp. 621–635. [Google Scholar]
- Stewart, R.; Andriluka, M.; Ng, A.Y. End-to-End People Detection in Crowded Scenes. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2325–2333. [Google Scholar]
- Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 87–102. [Google Scholar]
- Wang, T.; Wang, H. Graph-Boosted Attentive Network for Semantic Body Parsing. In Proceedings of the International Conference on Artificial Neural Networks, Munich, Germany, 17–19 September 2019; pp. 267–280. [Google Scholar]
- Li, S.; Yu, H.; Hu, R. Attributes-aided part detection and refinement for person re-identification. Pattern Recognit. 2020, 97, 107016. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems 27; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2014. [Google Scholar]
- Kim, B.; Shin, S.; Jung, H. Variational autoencoder-based multiple image captioning using a caption attention map. Appl. Sci. 2019, 9, 2699. [Google Scholar] [CrossRef] [Green Version]
- Xu, W.; Keshmiri, S.; Wang, G. Adversarially approximated autoencoder for image generation and manipulation. IEEE Trans. Multimed. 2019, 21, 2387–2396. [Google Scholar] [CrossRef] [Green Version]
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
- Jiang, H.; Wang, R.; Li, Y.; Liu, H.; Shan, S.; Chen, X. Attribute annotation on large-scale image database by active knowledge transfer. Image Vis. Comput. 2018, 78, 1–13. [Google Scholar] [CrossRef]
- Tay, C.P.; Roy, S.; Yap, K.H. AANet: Attribute Attention Network for Person Re-Identifications. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7134–7143. [Google Scholar]
- Raza, M.; Zonghai, C.; Rehman, S.; Zhenhua, G.; Jikai, W.; Peng, B. Part-Wise Pedestrian Gender Recognition Via Deep Convolutional Neural Networks. In Proceedings of the 2nd IET International Conference on Biomedical Image and Signal Processing (ICBISP 2017), Wuhan, China, 13–14 May 2017. [Google Scholar] [CrossRef]
- Wang, T.; Shu, K.C.; Chang, C.H.; Chen, Y.F. On the Effect of Data Imbalance for Multi-Label Pedestrian Attribute Recognition. In Proceedings of the 2018 Conference on Technologies and Applications of Artificial Intelligence (TAAI), Taichung, Taiwan, 30 November–3 December 2018; pp. 74–77. [Google Scholar]
- Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. In Proceedings of the AAAI, New York, NY, USA, 7–12 February 2020; pp. 13001–13008. [Google Scholar]
- Yaghoubi, E.; Borza, D.; Alirezazadeh, P.; Kumar, A.; Proença, H. Person Re-identification: Implicitly Defining the Receptive Fields of Deep Learning Classification Frameworks. arXiv 2020, arXiv:2001.11267. [Google Scholar]
- Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A survey on deep transfer learning. In Proceedings of the International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; pp. 270–279. [Google Scholar]
- Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 9. [Google Scholar] [CrossRef] [Green Version]
- Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359. [Google Scholar] [CrossRef]
- Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-Guided Feature Alignment for Occluded Person Re-Identification. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Corbiere, C.; Ben-Younes, H.; Ramé, A.; Ollion, C. Leveraging weakly annotated data for fashion image retrieval and label prediction. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2268–2274. [Google Scholar]
- Gray, D.; Brennan, S.; Tao, H. Evaluating appearance models for recognition, reacquisition, and tracking. In Proceedings of the IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), Rio de Janeiro, Brazil, 14 October 2007; Volume 3, pp. 1–7. [Google Scholar]
- Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; Tian, Q. Mars: A video benchmark for large-scale person re-identification. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 868–884. [Google Scholar]
- Ji, Z.; Hu, Z.; He, E.; Han, J.; Pang, Y. Pedestrian Attribute Recognition Based on Multiple Time Steps Attention. Pattern Recognit. Lett. 2020, 138, 170–176. [Google Scholar] [CrossRef]
- Jia, J.; Huang, H.; Yang, W.; Chen, X.; Huang, K. Rethinking of Pedestrian Attribute Recognition: Realistic Datasets with Efficient Method. arXiv 2020, arXiv:2005.11909. [Google Scholar]
- Bai, X.; Hu, Y.; Zhou, P.; Shang, F.; Shen, S. Data Augmentation Imbalance For Imbalanced Attribute Classification. arXiv 2020, arXiv:2004.13628. [Google Scholar]
- Ke, X.; Liu, T.; Li, Z. Human attribute recognition method based on pose estimation and multiple-feature fusion. Signal Image Video Process. 2020. [Google Scholar] [CrossRef]
- Yamaguchi, K.; Kiapour, M.H.; Ortiz, L.E.; Berg, T.L. Parsing clothing in fashion photographs. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3570–3577. [Google Scholar]
- Yang, J.; Fan, J.; Wang, Y.; Wang, Y.; Gan, W.; Liu, L.; Wu, W. Hierarchical feature embedding for attribute recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13055–13064. [Google Scholar]
| Dataset Type | Dataset | #images | Demographic | Accessories | Appearance | Clothing | Colour | Setup |
|---|---|---|---|---|---|---|---|---|
| Pedestrian | PETA [152] | 19,000 | ✓ | ✓ | ✓ | ✓ | ✓ | 10 databases |
| | RAP v1 [153] | 41,585 | ✓ | ✓ | ✓ | ✓ | ✓ | indoor static camera |
| | RAP v2 [108] | 84,928 | ✓ | ✓ | ✓ | ✓ | ✓ | indoor static camera |
| | DukeMTMC | 34,183 | ✓ | ✓ | ✗ | ✓ | ✓ | outdoor static camera |
| | PA-100K [88] | 100,000 | ✓ | ✓ | ✗ | ✓ | ✗ | outdoor, surveillance cameras |
| | Market-1501 [24] | 1501 | ✓ | ✓ | ✓ | ✓ | ✓ | outdoor |
| | P-DESTRE [160] | 14M | ✓ | ✓ | ✓ | ✓ | ✗ | UAV |
| Full body | Parse27k [161] | 27,000 | ✓ | ✓ | ✗ | ✗ | ✗ | outdoor moving camera |
| | CRP [162] | 27,454 | ✓ | ✓ | ✗ | ✗ | ✗ | moving vehicle |
| | APiS [166] | 3661 | ✓ | ✓ | ✓ | ✓ | ✓ | 3 databases |
| | HAT [66] | 9344 | ✓ | ✓ | ✗ | ✓ | ✗ | Flickr |
| | CAD [123] | 1856 | ✓ | ✓ | ✗ | ✓ | ✓ | website crawling |
| | Describing People [68] | 8035 | ✓ | ✓ | ✗ | ✓ | ✗ | 2 databases |
| | WIDER [75] | 13,789 | ✓ | ✓ | ✓ | ✓ | ✗ | website crawling |
| Synthetic | CTD [169] | 880 | ✗ | ✗ | ✗ | ✓ | ✓ | generated data |
| | CLOTH3D [170] | 2.1M | ✗ | ✗ | ✗ | ✓ | ✓ | generated data |
| Ref., Year, Cat. | Taxonomy | Dataset | mA | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|---|---|---|
| [66], 2011, FAR | Pose-Let | HAT [66] | 53.80 | - | - | - | - |
| [68], 2011, FAR | Pose-Let | [68] | 82.90 | - | - | - | - |
| [123], 2012, FAR and CAA | Attribute relation | [123] | - | 84.90 | - | - | - |
| | | D.Fashion [91] | 35.37 (top-5) | - | - | - | - |
| [79], 2013, FAR | Body-Part | HAT [66] | 69.88 | - | - | - | - |
| [69], 2013, FAR | Pose-Let | HAT [66] | 59.30 | - | - | - | - |
| [70], 2013, FAR | Pose-Let | HAT [66] | 59.70 | - | - | - | - |
| [77], 2015, FAR | Body-Part | DP [68] | 83.60 | - | - | - | - |
| [128], 2015, PAR | Loss function | PETA [152] | 82.6 | - | - | - | - |
| [77], 2015, FAR | Body-Part | DP [68] | 83.60 | - | - | - | - |
| [150], 2015, CAA | Attribute location and relation | Dress [150] | - | 84.30 | 65.20 | 70.80 | 67.80 |
| [75], 2016, FAR | Pose-Let | WIDER [75] | 92.20 | - | - | - | - |
| [74], 2016, PAR | Pose-Let | RAP [108] | 81.25 | 50.30 | 57.17 | 78.39 | 66.12 |
| | | PETA [152] | 85.50 | 76.98 | 84.07 | 85.78 | 84.90 |
| [91], 2016, CAA | Limited data | D.Fashion [91] | 54.61 (top-5) | - | - | - | - |
| [86], 2017, FAR | Attention | WIDER [75] | 82.90 | - | - | - | - |
| | | Berkeley [68] | 92.20 | - | - | - | - |
| [88], 2017, PAR | Attention | RAP [108] | 76.12 | 65.39 | 77.33 | 78.79 | 78.05 |
| | | PETA [152] | 81.77 | 76.13 | 84.92 | 83.24 | 84.07 |
| | | PA-100K [88] | 74.21 | 72.19 | 82.97 | 82.09 | 82.53 |
| [124], 2018, FAR | Grammar | DP [68] | 89.40 | - | - | - | - |
| [61], 2018, PAR and FAR | Pose Estimation | RAP [108] | 77.70 | 67.35 | 79.51 | 79.67 | 79.59 |
| | | PETA [152] | 83.45 | 77.73 | 86.18 | 84.81 | 85.49 |
| | | WIDER [75] | 82.40 | - | - | - | - |
| [86], 2017, PAR | Attention | RAP [108] | 78.68 | 68.00 | 80.36 | 79.82 | 80.09 |
| | | PA-100K [88] | 76.96 | 75.55 | 86.99 | 83.17 | 85.04 |
| [109], 2017, PAR | RNN | RAP [108] | 77.81 | - | 78.11 | 78.98 | 78.58 |
| | | PETA [152] | 85.67 | - | 86.03 | 85.34 | 85.42 |
| [104], 2017, PAR | Loss Function - Augmentation | PETA [152] | - | 75.43 | - | 70.83 | - |
| [132], 2017, PAR | Occlusion | RAP [108] | 79.73 | 83.97 | 76.96 | 78.72 | 77.83 |
| [105], 2017, CAA | Transfer Learning | [105] | 64.35 | - | 64.97 | 75.66 | - |
| [111], 2017, PAR | Multitask | Market [24] | - | 88.49 | - | - | - |
| | | Duke [152] | - | 87.53 | - | - | - |
| [192], 2017, CAA | Multiplication | D.Fashion [91] | 30.40 (top-5) | - | - | - | - |
| [89], 2018, CAA | Attention | D.Fashion [91] | 60.95 (top-5) | - | - | - | - |
| [112], 2018, PAR | Soft-Multitask | SoBiR [114] | 74.20 | - | - | - | - |
| | | VIPeR [193] | 84.00 | - | - | - | - |
| | | PETA [152] | 87.54 | - | - | - | - |
| [149], 2018, PAR and FAR | Soft solution | WIDER [75] | 86.40 | - | - | - | - |
| | | PETA [152] | 84.59 | 78.56 | 86.79 | 86.12 | 86.46 |
| [97], 2018, PAR | Attribute location | RAP [108] | 78.68 | 68.00 | 80.36 | 79.82 | 80.09 |
| | | PA-100K [88] | 76.96 | 75.55 | 86.99 | 83.17 | 85.04 |
| [63], 2018, PAR | Pose Estimation | PETA [152] | 82.97 | 78.08 | 86.86 | 84.68 | 85.76 |
| | | RAP [108] | 74.31 | 64.57 | 78.86 | 75.90 | 77.35 |
| | | PA-100K [88] | 74.95 | 73.08 | 84.36 | 82.24 | 83.29 |
| [117], 2018, PAR | RNN | RAP [108] | - | 77.81 | 78.11 | 78.98 | 78.58 |
| | | PETA [152] | - | 85.67 | 86.03 | 85.34 | 85.42 |
| [126], 2019, PAR | Soft solution | RAP [108] | 77.44 | 65.75 | 79.01 | 77.45 | 78.03 |
| | | PETA [152] | 84.13 | 78.62 | 85.73 | 86.07 | 85.88 |
| [125], 2019, PAR | Multiplication | PETA [152] | 86.97 | 79.95 | 87.58 | 87.73 | 87.65 |
| | | RAP [108] | 81.42 | 68.37 | 81.04 | 80.27 | 80.65 |
| | | PA-100K [88] | 80.65 | 78.30 | 89.49 | 84.36 | 86.85 |
| [118], 2019, PAR | RNN | RAP [108] | - | 77.81 | 78.11 | 78.98 | 78.58 |
| | | PETA [152] | - | 86.67 | 86.03 | 85.34 | 85.42 |
| [133], 2019, PAR | Occlusion | Duke [152] | - | 89.31 | - | - | 73.24 |
| | | MARS [194] | - | 87.01 | - | - | 72.04 |
| [100], 2019, PAR | Attribute Location | RAP [108] | 81.87 | 68.17 | 74.71 | 86.48 | 80.16 |
| | | PETA [152] | 86.30 | 79.52 | 85.65 | 88.09 | 86.85 |
| | | PA-100K [88] | 80.68 | 77.08 | 84.21 | 88.84 | 86.46 |
| [92], 2019, PAR | Attention | PA-100K [88] | 81.61 | 78.89 | 86.83 | 87.73 | 87.27 |
| | | RAP [108] | 81.25 | 67.91 | 78.56 | 81.45 | 79.98 |
| | | PETA [152] | 84.88 | 79.46 | 87.42 | 86.33 | 86.87 |
| | | Market [24] | 87.88 | - | - | - | - |
| | | Duke [152] | 87.88 | - | - | - | - |
| [110], 2019, PAR | GCN | RAP [108] | 77.91 | 70.04 | 82.05 | 80.64 | 81.34 |
| | | PETA [152] | 85.21 | 81.82 | 88.43 | 88.42 | 88.42 |
| | | PA-100K [88] | 79.52 | 80.58 | 89.40 | 87.15 | 88.26 |
| [107], 2019, PAR | GCN | RAP [108] | 78.30 | 69.79 | 82.13 | 80.35 | 81.23 |
| | | PETA [152] | 84.90 | 80.95 | 88.37 | 87.47 | 87.91 |
| | | PA-100K [88] | 77.87 | 78.49 | 88.42 | 86.08 | 87.24 |
| [94], 2019, PAR and FAR | Attention | RAP [108] | 84.28 | 59.84 | 66.50 | 84.13 | 74.28 |
| | | WIDER [75] | 88.00 | - | - | - | - |
| [96], 2020, PAR | Attention | RAP [108] | 92.23 | - | - | - | - |
| | | PETA [152] | 91.70 | - | - | - | - |
| [54], 2020, PAR | Multi-task | PA-100K [88] | 77.20 | 78.09 | 88.46 | 84.86 | 86.62 |
| | | PETA [152] | 83.17 | 78.78 | 87.49 | 85.35 | 86.41 |
| [195], 2020, PAR | RNN | RAP [108] | 77.62 | 67.17 | 79.72 | 78.44 | 79.07 |
| | | PETA [152] | 84.62 | 78.80 | 85.67 | 86.42 | 86.04 |
| [58], 2020, PAR | RNN and attention | RAP [108] | 83.72 | - | 81.85 | 79.96 | 80.89 |
| | | PETA [152] | 88.56 | - | 88.32 | 89.62 | 88.97 |
| [120], 2020, PAR | GCN | RAP [108] | 83.69 | 69.15 | 79.31 | 82.40 | 80.82 |
| | | PETA [152] | 86.96 | 80.38 | 87.81 | 87.09 | 87.45 |
| | | PA-100K [88] | 82.31 | 79.47 | 87.45 | 87.77 | 87.61 |
| [196], 2020, PAR | Baseline | RAP [108] | 78.48 | 67.17 | 82.84 | 76.25 | 78.94 |
| | | PETA [152] | 85.11 | 79.14 | 86.99 | 86.33 | 86.09 |
| | | PA-100K [88] | 79.38 | 78.56 | 89.41 | 84.78 | 86.25 |
| [59], 2020, PAR | RNN and attention | PA-100K [88] | 80.60 | - | 88.70 | 84.90 | 86.80 |
| | | RAP [108] | 81.90 | - | 82.40 | 81.90 | 82.10 |
| | | PETA [152] | 87.40 | - | 89.20 | 87.50 | 88.30 |
| | | Market [24] | 88.50 | - | - | - | - |
| | | Duke [152] | 88.80 | - | - | - | - |
| [197], 2020, PAR | Hard solution | PA-100K [88] | 77.89 | 79.71 | 90.26 | 85.37 | 87.75 |
| | | RAP [108] | 75.09 | 66.90 | 84.27 | 79.16 | 76.46 |
| | | PETA [152] | 88.24 | 79.14 | 88.79 | 84.70 | 86.70 |
| [198], 2020, CAA | — | Fashionista [199] | - | 88.91 | 47.72 | 44.92 | 39.42 |
| [200], 2020, PAR | Math-oriented | Market [24] | 92.90 | 78.01 | 87.41 | 85.65 | 86.52 |
| | | Duke [152] | 91.77 | 76.68 | 86.37 | 84.40 | 85.37 |
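The metric columns in the table above (mA, Acc., Prec., Rec., F1) follow the evaluation protocol that is standard in the PAR literature: mA is the label-based mean of per-attribute balanced accuracies, while the remaining four are example-based scores computed per image and then averaged. The snippet below is a minimal sketch of that protocol, assuming binary ground-truth and prediction matrices of shape (samples × attributes); the function name `par_metrics` and the synthetic example data are illustrative only and are not taken from any of the surveyed toolkits.

```python
import numpy as np

def par_metrics(y_true, y_pred, eps=1e-12):
    """Hypothetical helper: label-based mA and example-based Acc/Prec/Rec/F1
    for binary arrays of shape (num_samples, num_attributes)."""
    y_true = y_true.astype(bool)
    y_pred = y_pred.astype(bool)

    # Label-based mA: per attribute, average the true-positive and
    # true-negative rates, then average over all attributes.
    tpr = (y_true & y_pred).sum(0) / (y_true.sum(0) + eps)
    tnr = (~y_true & ~y_pred).sum(0) / ((~y_true).sum(0) + eps)
    mA = ((tpr + tnr) / 2).mean()

    # Example-based metrics: computed per image, then averaged.
    inter = (y_true & y_pred).sum(1)
    union = (y_true | y_pred).sum(1)
    acc = (inter / (union + eps)).mean()
    prec = (inter / (y_pred.sum(1) + eps)).mean()
    rec = (inter / (y_true.sum(1) + eps)).mean()
    f1 = 2 * prec * rec / (prec + rec + eps)
    return mA, acc, prec, rec, f1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 2, size=(100, 35))    # synthetic labels (e.g., 35 PETA-style attributes)
    pred = gt ^ (rng.random(gt.shape) < 0.1)   # synthetic predictions with ~10% label noise
    print(par_metrics(gt, pred))               # fractions in [0, 1]; multiply by 100 to match the table
```

Note that, following common practice, F1 is derived from the averaged precision and recall rather than computed per instance, and the returned values are fractions rather than the percentages reported in the table.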
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).