CUHK-Search-Reranking (CUHKSR) Dataset

1. Overview
The CUHK-Search-Reranking (CUHKSR) dataset is intended for research on image re-ranking.

Data set	# Keywords	Collecting date (images for re-ranking)	Search engine (images for re-ranking)	Collecting date (images of reference classes)	Search engine (images of reference classes)
I	120	Jul-10	Bing	Jul-10	Bing
II	120	Jul-10	Bing	Jul-10	Google
III	10	Aug-09	Bing	Jul-10	Bing

Note:

1) The images for re-ranking are the same in data sets I and II.

2) The images of reference classes in data set III are the same as those in data set I.

2. Downloads
The dataset can be downloaded from the following FTP server:
Url: 137.189.35.203
Port: 21
Username: CUHKSRData
Password: fc4lmge
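
For scripted downloads, the connection details above could be used with Python's standard ftplib roughly as follows. This is a minimal sketch, assuming the server is reachable and the zip files sit where the login lands; the function name download_file and the remote file layout are assumptions, not something this page specifies.

```python
from ftplib import FTP

# Connection details copied from the Downloads section.
HOST = "137.189.35.203"
PORT = 21
USERNAME = "CUHKSRData"
PASSWORD = "fc4lmge"


def download_file(remote_name: str, local_path: str) -> None:
    """Fetch one file from the FTP server in binary mode.

    remote_name is assumed to be a path relative to the login directory;
    the actual directory layout on the server is not documented here.
    """
    ftp = FTP()
    ftp.connect(HOST, PORT, timeout=60)
    try:
        ftp.login(USERNAME, PASSWORD)
        with open(local_path, "wb") as out:
            # retrbinary streams the file in blocks to the callback.
            ftp.retrbinary(f"RETR {remote_name}", out.write)
    finally:
        ftp.close()
```

A call such as download_file("BingReRanking(set III).zip", "set3.zip") would then fetch one archive, provided the file name matches what the server actually exposes.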

3. Reference
Please cite as:
X. Wang, K. Liu and X. Tang, “Query-Specific Visual Semantic Spaces for Web Image Re-ranking”, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) 2011. [PDF] [Project Website]

4. Data Description
Note: Each zip file contains 120 folders, corresponding to 120 query keywords.

Data	File Name	File Size	Description
Images for re-ranking in data set I and II	BingReRanking(set I and II).zip	~900 MB	Each query’s folder contains two folders: Data and Images. The Images folder contains the ~1000 images (resized to at most 160×160 pixels). The image files are named XXXXimage.jpg, where XXXX is a 4-digit image ID (e.g., 0000image.jpg, 0001image.jpg). The Data folder contains two files, Metadata.txt and Labels.txt, whose formats are described in Section 5.
Webpages of images for re-ranking in data set I and II	BingReRanking_Htmls(set I and II).zip	~3.6 GB	The webpages are placed in the Htmls folder and named XXXXtext.html (e.g., 0001text.html).
Images for re-ranking in data set III	BingReRanking(set III).zip	~180 MB	The organization of data is the same as “Images for re-ranking in data set I and II”
Webpages of images for re-ranking in data set III	BingReRanking_Htmls(set III).zip	~370 MB	The organization of data is the same as “Webpages of images for re-ranking in data set I and II”
Metadata of Images of reference classes in data set I	BingRef_Metadata(set I).zip	~240 MB	Each query’s folder contains ~30 txt files. Each txt file is named after a query keyword expansion, and its format is the same as that of the metadata file for the images for re-ranking. The images of reference classes are not available for download because they take up too much space (>6 GB). They can be downloaded from the Internet via the URLs given in the metadata.
Metadata of Images of reference classes in data set II	GoogRef_Metadata(set II).zip	~200 MB	The organization of data is the same as “Metadata of Images of reference classes in data set I”
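
The naming scheme in the table above can be captured in a few path helpers. This is a sketch only: the helper names are hypothetical, and it assumes each query folder is named after its keyword and sits directly under the extraction root.

```python
from pathlib import Path


def image_filename(image_id: int) -> str:
    """Return the image file name for an ID, e.g. 0 -> 0000image.jpg."""
    return f"{image_id:04d}image.jpg"


def webpage_filename(image_id: int) -> str:
    """Return the webpage file name for an ID, e.g. 1 -> 0001text.html."""
    return f"{image_id:04d}text.html"


def query_paths(root: Path, query: str) -> dict:
    """Build the expected paths inside one query folder.

    Assumes the layout <root>/<query>/Images and <root>/<query>/Data.
    """
    query_dir = Path(root) / query
    return {
        "images": query_dir / "Images",
        "metadata": query_dir / "Data" / "Metadata.txt",
        "labels": query_dir / "Data" / "Labels.txt",
    }
```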

5. Format

a) Metadata.txt
The metadata of each image takes up three lines, followed by a blank line. The three lines are the image ID, the image URL, and the URL of the page containing the image. The following is an example:

0000
http://www.usageorge.com/Wallpapers/Computer/wallpaper/Apple-Macintosh.jpg
http://www.usageorge.com/Wallpapers/Computer/Apple-Macintosh.html
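
A record layout like this can be read with a small parser. The sketch below assumes the file has already been decoded to text (the encoding is not stated here) and that every record has exactly three non-blank lines; parse_metadata is a hypothetical name.

```python
from typing import Dict, List


def parse_metadata(text: str) -> List[Dict[str, str]]:
    """Parse Metadata.txt content into a list of records."""
    records: List[Dict[str, str]] = []
    block: List[str] = []
    # The appended empty string flushes the last record even if the
    # file does not end with a blank line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line:
            block.append(line)
            continue
        if not block:
            continue  # skip repeated blank lines
        if len(block) != 3:
            raise ValueError(f"expected 3 lines per record, got {len(block)}: {block}")
        image_id, image_url, page_url = block
        records.append({"id": image_id, "image_url": image_url, "page_url": page_url})
        block = []
    return records
```

Raising on a malformed record is a deliberate choice in this sketch; a more tolerant reader could skip such records instead.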

b) Labels.txt
Labels.txt contains the ground-truth labels of the images. It looks like this:

0000
apple wallpaper
apple logo

0001
red apple

This means that image 0000 is categorized into “apple wallpaper” and “apple logo” (an image may be categorized into multiple classes), while image 0001 is categorized into “red apple”. Note that the IDs in Labels.txt may not be in ascending order.
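
Labels.txt can be read in much the same way. The following sketch assumes, based on the example above, that blocks are separated by blank lines and that the first line of each block is the image ID; parse_labels is a hypothetical name. Because it returns a dictionary keyed by ID, the order of IDs in the file does not matter.

```python
from typing import Dict, List


def parse_labels(text: str) -> Dict[str, List[str]]:
    """Parse Labels.txt content into a mapping from image ID to class names."""
    labels: Dict[str, List[str]] = {}
    block: List[str] = []
    # The appended empty string flushes the last block.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            image_id, classes = block[0], block[1:]
            # An image may belong to several classes.
            labels.setdefault(image_id, []).extend(classes)
            block = []
    return labels
```

Iterating over sorted(labels) visits the images in ascending ID order if that is needed.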