Human Attribute Recognition by Deep Hierarchical Contexts

Yining Li, Chen Huang, Chen Change Loy, and Xiaoou Tang

Department of Information Engineering, The Chinese University of Hong Kong

European Conference on Computer Vision (ECCV) 2016


We present an approach for recognizing human attributes in unconstrained settings. We train a Convolutional Neural Network (CNN) to select the most attribute-descriptive human parts from all poselet detections, and combine them with the whole body as a pose-normalized deep representation. We further improve recognition by using deep hierarchical contexts ranging from the human-centric level to the scene level. Human-centric context captures human relations, which we compute from the nearest neighbor parts of other people on a pyramid of CNN feature maps. The matched parts are then average pooled and act as a similarity regularization. To utilize the scene context, we re-score human-centric predictions by the global scene classification score jointly learned in our CNN, yielding final scene-aware predictions. To facilitate our study, we introduce the large-scale WIDER Attribute dataset with human attribute and image event annotations. Our method surpasses competitive baselines on this dataset and other popular ones.
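The human-centric context step (match a part to its nearest neighbor parts of other people, then average pool the matches) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the squared Euclidean distance, the choice of `k`, the fallback for a person with no neighbors, and the toy 2-D feature vectors are all assumptions made for illustration. The paper operates on a pyramid of CNN feature maps rather than on hand-written vectors.

```python
from math import fsum


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return fsum((x - y) ** 2 for x, y in zip(a, b))


def pooled_context(query, candidates, k=2):
    """Average-pool the k candidate part features nearest to `query`.

    `query` is one part feature of the target person; `candidates` are part
    features taken from other people in the same image (hypothetical inputs).
    """
    if not candidates:
        # Assumption: with no other people, fall back to the part itself.
        return list(query)
    # sorted() is stable, so ties keep their original order.
    nearest = sorted(candidates, key=lambda c: squared_distance(query, c))[:k]
    return [fsum(c[i] for c in nearest) / len(nearest) for i in range(len(query))]


# Toy example: two close neighbors and one distant outlier.
query = [1.0, 0.0]
others = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
context = pooled_context(query, others, k=2)
```

In this toy case the outlier at `[5.0, 5.0]` is ignored and the pooled context is the mean of the two closest parts, which is the sense in which the matched parts can act as a similarity regularizer.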

[Full Paper]


Experiment Results

Mean AP (%) on the Berkeley Attributes of People, HAT, and WIDER Attribute datasets:

Method            Berkeley    HAT     WIDER Attribute
PANDA             79.0        -       -
ACNH              80.0        66.2    -
R-CNN             87.8        76.3    80.0
R*CNN             89.2        76.4    80.5
Deep Part         89.5        -       -
Our Baseline      90.8        76.7    80.5
Our Full Model    92.2        78.0    81.3
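For readers unfamiliar with the metric, the sketch below shows one common way to compute mean AP over attributes: rank the predictions for each attribute by score, average the precision at every positive hit, then average across attributes. It is an illustrative assumption, not the evaluation code behind the numbers above; each benchmark may use its own AP variant (for example, a different interpolation), and the function names and toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of precision at each positive in ranked order."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: an attribute with no positives contributes an AP of 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_ap(per_attribute):
    """Mean AP in percent over a list of (scores, labels) pairs, one per attribute."""
    aps = [average_precision(scores, labels) for scores, labels in per_attribute]
    return 100.0 * sum(aps) / len(aps)


# Toy data: two attributes with made-up scores and binary ground truth.
attribute_a = ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])  # AP = (1/1 + 2/3) / 2
attribute_b = ([0.2, 0.7], [1, 0])                  # AP = 1/2
result = mean_ap([attribute_a, attribute_b])
```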


WIDER Attribute Dataset

WIDER Attribute is a large-scale human attribute dataset. It contains 13789 images belonging to 30 scene categories, and 57524 human bounding boxes each annotated with 14 binary attributes.
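A quick back-of-the-envelope calculation gives a feel for the dataset's scale. The sketch below uses only the three counts stated above; the derived averages assume nothing about how images or people are actually distributed across scene categories, and the variable names are illustrative.

```python
# Counts stated in the dataset description.
num_images = 13789
num_scenes = 30
num_boxes = 57524
attributes_per_box = 14

# Derived quantities (simple averages, not per-category statistics).
boxes_per_image = num_boxes / num_images          # roughly 4.2 people per image
images_per_scene = num_images / num_scenes        # roughly 460 images per scene
total_attribute_labels = num_boxes * attributes_per_box

print(f"{boxes_per_image:.2f} boxes per image on average")
print(f"{images_per_scene:.1f} images per scene category on average")
print(f"{total_attribute_labels} binary attribute labels in total")
```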

Download: [Images (Google Drive)], [Annotations]




  author = {Li, Yining and Huang, Chen and Loy, Chen Change and Tang, Xiaoou},
  title = {Human Attribute Recognition by Deep Hierarchical Contexts},
  booktitle = {European Conference on Computer Vision},
  year = {2016}