CUHK-Search-Reranking (CUHKSR) Dataset

1.    Overview
The CUHK-Search-Reranking (CUHKSR) dataset is intended for research on image re-ranking. It consists of three data sets, summarized below.

Data set   Images for re-ranking                          Images of reference classes
           # Keywords   Collecting date   Search engine   Collecting date   Search engine
--------   ----------   ---------------   -------------   ---------------   -------------
I          120          Jul-10            Bing            Jul-10            Bing
II         120          Jul-10            Bing            Jul-10            Google
III        10           Aug-09            Bing            Jul-10            Bing


Note:

1) The images for re-ranking are the same in data sets I and II.

2) The images of reference classes in data set III are the same as those in data set I.

2.    Downloads
The dataset can be downloaded from the following FTP server:
Url: 137.189.35.203
Port: 21
Username: CUHKSRData
Password: fc4lmge
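
As a rough illustration of the download step, the sketch below uses Python's standard ftplib with the connection details listed above. The helper names (connect, download_all) and the destination directory are hypothetical, and it assumes the zip files sit in the directory the FTP session starts in; the server may no longer be reachable.

```python
import os
from ftplib import FTP

# Connection details copied from the Downloads section.
HOST = "137.189.35.203"
PORT = 21
USER = "CUHKSRData"
PASSWORD = "fc4lmge"


def connect():
    """Open an FTP session with the credentials listed above."""
    ftp = FTP()
    ftp.connect(HOST, PORT)
    ftp.login(USER, PASSWORD)
    return ftp


def download_all(ftp, dest_dir):
    """Download every .zip file in the session's current directory.

    `ftp` is a logged-in ftplib.FTP object (or any object offering the
    same nlst/retrbinary methods). Returns the names that were saved.
    Assumption: the archives are in the current remote directory.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for name in ftp.nlst():
        if not name.lower().endswith(".zip"):
            continue
        local_path = os.path.join(dest_dir, os.path.basename(name))
        with open(local_path, "wb") as fh:
            ftp.retrbinary("RETR " + name, fh.write)
        saved.append(name)
    return saved
```

A typical session would call connect(), pass the result to download_all(ftp, "cuhksr"), and then call ftp.quit().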

3.    Reference
Please cite as:
X. Wang, K. Liu and X. Tang, "Query-Specific Visual Semantic Spaces for Web Image Re-ranking", in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

4.    Data Description
Note: Each zip file contains 120 folders, corresponding to 120 query keywords.

Data: Images for re-ranking in data set I and II
File Name: BingReRanking(set I and II).zip
File Size: ~900 MB
Description: Within each query's folder, there are two folders: Data and Images.
The Images folder contains ~1000 images (resized to at most 160*160). The image files are named XXXXimage.jpg, where XXXX is a 4-digit image ID (e.g., 0000image.jpg, 0001image.jpg).
The Data folder contains two files: Metadata.txt and Labels.txt. Their formats are described in Section 5 (Format).

Data: Webpages of images for re-ranking in data set I and II
File Name: BingReRanking_Htmls(set I and II).zip
File Size: ~3.6 GB
Description: The webpages are placed under the Htmls folder and named XXXXtext.html (e.g., 0001text.html).

Data: Images for re-ranking in data set III
File Name: BingReRanking(set III).zip
File Size: ~180 MB
Description: The data is organized in the same way as "Images for re-ranking in data set I and II".

Data: Webpages of images for re-ranking in data set III
File Name: BingReRanking_Htmls(set III).zip
File Size: ~370 MB
Description: The data is organized in the same way as "Webpages of images for re-ranking in data set I and II".

Data: Metadata of images of reference classes in data set I
File Name: BingRef_Metadata(set I).zip
File Size: ~240 MB
Description: Within each query's folder, there are ~30 txt files.
Each txt file is named after a query keyword expansion, and its format is the same as that of the metadata file for the images for re-ranking (Metadata.txt).
The images of reference classes themselves are not provided for download because they take up too much space (>6 GB). They can be downloaded from the Internet via the URLs given in the metadata.

Data: Metadata of images of reference classes in data set II
File Name: GoogRef_Metadata(set II).zip
File Size: ~200 MB
Description: The data is organized in the same way as "Metadata of images of reference classes in data set I".
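
The naming conventions described above can be captured in a few path helpers. This is only a sketch: the function names are hypothetical, and it assumes that each query folder is named after its keyword and that the Htmls folder sits inside the query folder alongside Data and Images.

```python
from pathlib import Path


def image_path(root, query, image_id):
    """Path of one re-ranking image, e.g. <root>/<query>/Images/0001image.jpg."""
    return Path(root) / query / "Images" / f"{image_id:04d}image.jpg"


def html_path(root, query, image_id):
    """Path of the saved webpage for an image, e.g. .../Htmls/0001text.html.

    Assumption: Htmls is a sub-folder of the query folder.
    """
    return Path(root) / query / "Htmls" / f"{image_id:04d}text.html"


def metadata_path(root, query):
    """Path of the query's Metadata.txt inside its Data folder."""
    return Path(root) / query / "Data" / "Metadata.txt"


def labels_path(root, query):
    """Path of the query's Labels.txt inside its Data folder."""
    return Path(root) / query / "Data" / "Labels.txt"
```

For example, image_path("BingReRanking", "apple", 1) ends in apple/Images/0001image.jpg, where "apple" stands in for whatever the query folder is actually called.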

 

5.    Format

a)       Metadata.txt
The metadata of each image takes up three lines, followed by a blank line. The three lines are the image ID, the image URL, and the URL of the page containing the image. An example follows:

0000
http://www.usageorge.com/Wallpapers/Computer/wallpaper/Apple-Macintosh.jpg
http://www.usageorge.com/Wallpapers/Computer/Apple-Macintosh.html
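
A minimal parsing sketch for this layout is given below. The names ImageRecord and parse_metadata are hypothetical, and the code assumes every record has exactly three non-empty lines; a record with a missing URL would raise an error rather than be silently skipped.

```python
from typing import Iterator, List, NamedTuple


class ImageRecord(NamedTuple):
    image_id: str   # 4-digit ID, kept as a string to preserve leading zeros
    image_url: str  # URL of the image itself
    page_url: str   # URL of the page containing the image


def _to_record(block: List[str]) -> ImageRecord:
    if len(block) != 3:
        raise ValueError(f"expected 3 lines per record, got {len(block)}: {block!r}")
    return ImageRecord(*block)


def parse_metadata(text: str) -> Iterator[ImageRecord]:
    """Yield one record per blank-line-separated block of three lines."""
    block: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            yield _to_record(block)
            block = []
    if block:  # last record may lack a trailing blank line
        yield _to_record(block)
```

The function takes the file contents as a string, so the caller chooses how to read the file; the text encoding of Metadata.txt is not stated here, so it may need to be determined by inspection.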

b)      Labels.txt
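Labels.txt contains the ground-truth labels of the images. Each image's entry is its ID followed by one class label per line, and entries are separated by blank lines. An example follows: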

0000
apple wallpaper
apple logo

0001
red apple


which means that image 0000 is categorized into "apple wallpaper" and "apple logo" (an image may be categorized into multiple classes), while image 0001 is categorized into "red apple". Note that the IDs in Labels.txt are not necessarily listed in ascending order.
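
The Labels.txt layout can be read into a dictionary with a short sketch like the following. The function name parse_labels is hypothetical, and the code assumes that entries are always separated by at least one blank line, as in the example above.

```python
from typing import Dict, List, Optional


def parse_labels(text: str) -> Dict[str, List[str]]:
    """Map each image ID to its list of class labels.

    The first non-blank line of an entry is the ID; the lines after it,
    up to the next blank line, are that image's labels.
    """
    labels: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            current = None          # blank line ends the current entry
        elif current is None:
            current = line          # first line of an entry is the ID
            labels.setdefault(current, [])
        else:
            labels[current].append(line)
    return labels
```

Applied to the example above, this returns {"0000": ["apple wallpaper", "apple logo"], "0001": ["red apple"]}. Because the result is keyed by ID, the order of entries in the file does not matter.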