  • IJCAI-15 Competition

    Providers : Tmall

    Posted : 2015.03.17

    #Participants : 1698

Data Set Description

Document (available for download after login) | Format
data_format1.zip | .zip (360MB)
data_format2.zip | .zip (353MB)
sample_submission.csv | .csv (3MB)


The data set contains anonymized users' shopping logs from the 6 months up to and including the "Double 11" day, together with label information indicating whether they are repeat buyers. For privacy reasons, the data is sampled in a biased way, so statistics computed on this data set will deviate from the actual figures of Tmall.com. This does not affect the applicability of the solution. In the first stage, the data set is available for download; in the second stage it is not. The training and testing data sets can be found in "data_format2.zip". Details of the data format are given in the table below.

Data Fields | Definition
user_id | A unique id for the shopper.
age_range | User's age range: 1 for <18; 2 for [18,24]; 3 for [25,29]; 4 for [30,34]; 5 for [35,39]; 6 for [40,49]; 7 and 8 for >= 50; 0 and NULL for unknown.
gender | User's gender: 0 for female, 1 for male, 2 and NULL for unknown.
merchant_id | A unique id for the merchant.
label | Value from {0, 1, -1, NULL}. '1' denotes that 'user_id' is a repeat buyer for 'merchant_id', while '0' denotes the opposite. '-1' means that 'user_id' is not a new customer of the given merchant and is therefore outside the prediction scope; however, such records may provide additional information. 'NULL' occurs only in the testing data and indicates a pair to predict.
activity_log | Set of interaction records between {user_id, merchant_id}, where each record is an action represented as 'item_id:category_id:brand_id:time_stamp:action_type'. '#' separates two neighbouring records. Records are not sorted in any particular order.
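
The activity_log layout described above can be unpacked with a few lines of Python. This is a minimal sketch, not an official parser: the `Action` type and `parse_activity_log` function are hypothetical names introduced here, and the sketch assumes every record has exactly five colon-separated parts and that ids are best kept as text (so an empty field stays an empty string).

```python
from typing import List, NamedTuple


class Action(NamedTuple):
    """One interaction record from an activity_log string (hypothetical type)."""
    item_id: str
    category_id: str
    brand_id: str
    time_stamp: str  # kept as text so a leading zero is not lost
    action_type: int


def parse_activity_log(log: str) -> List[Action]:
    """Split an activity_log string on '#' and each record on ':'."""
    actions: List[Action] = []
    if not log:
        return actions
    for record in log.split("#"):
        # Assumption: each record has exactly five fields, in the documented order.
        item_id, category_id, brand_id, time_stamp, action_type = record.split(":")
        actions.append(
            Action(item_id, category_id, brand_id, time_stamp, int(action_type))
        )
    return actions
```

For example, with made-up ids, `parse_activity_log("1001:20:300:1111:2#1002:21:301:1015:0")` returns two `Action` records. Because the records are not sorted, sort the result by `time_stamp` before using it for order-dependent features.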



Your submission should be named "prediction.csv" and have the following format.
 

Data Fields | Definition
user_id | A unique id for the shopper.
merchant_id | A unique id for the merchant.
prob | Predicted probability of the given user becoming a repeat buyer of the given merchant. Value should be between 0 and 1.
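
A submission file with these three fields could be written as in the sketch below. It assumes a comma-separated file with a header row naming the columns; check "sample_submission.csv" for the authoritative layout. The function name `write_submission` is hypothetical.

```python
import csv


def write_submission(predictions, path="prediction.csv"):
    """Write (user_id, merchant_id, prob) tuples in the submission layout.

    Assumption: comma-separated values with a header row.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "merchant_id", "prob"])
        for user_id, merchant_id, prob in predictions:
            # Reject values outside [0, 1] instead of silently writing them.
            if not 0.0 <= prob <= 1.0:
                raise ValueError(
                    f"prob out of range for ({user_id}, {merchant_id}): {prob}"
                )
            writer.writerow([user_id, merchant_id, prob])
```

Calling `write_submission([("1", "10", 0.25)])` with made-up ids writes a header line followed by one prediction row.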



Data in another format
We also provide the same data set in another format, which consists of 4 files and may be more convenient for feature engineering (the files can be found in "data_format1.zip"). Note: these files are not provided in the second stage. The details of the data formats can be found below:

User Behaviour Logs

Data Fields | Definition
user_id | A unique id for the shopper.
item_id | A unique id for the item.
cat_id | A unique id for the category that the item belongs to.
merchant_id | A unique id for the merchant.
brand_id | A unique id for the brand of the item.
time_stamp | Date the action took place (format: mmdd).
action_type | An enumerated type {0, 1, 2, 3}, where 0 is for click, 1 is for add-to-cart, 2 is for purchase and 3 is for add-to-favourite.
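
As an illustration of how the log fields might be turned into features, the sketch below decodes the mmdd date and counts actions per user, merchant and action name. The helper names are hypothetical, and each row is assumed to be a dict keyed by the field names above (as `csv.DictReader` would produce). Padding with `zfill` covers the case where a date was read as an integer and lost its leading zero.

```python
from collections import Counter

# Mapping taken from the action_type definition above.
ACTION_NAMES = {0: "click", 1: "add-to-cart", 2: "purchase", 3: "add-to-favourite"}


def parse_mmdd(time_stamp):
    """Return (month, day) from an mmdd value such as '1111' or 511."""
    padded = str(time_stamp).zfill(4)
    return int(padded[:2]), int(padded[2:])


def count_actions(rows):
    """Count actions per (user_id, merchant_id, action name)."""
    counts = Counter()
    for row in rows:
        name = ACTION_NAMES[int(row["action_type"])]
        counts[(row["user_id"], row["merchant_id"], name)] += 1
    return counts
```

Counts such as the number of clicks or purchases per user and merchant are a common starting point for repeat-buyer features, though the choice of features is left to the participant.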


User Profile

Data Fields | Definition
user_id | A unique id for the shopper.
age_range | User's age range: 1 for <18; 2 for [18,24]; 3 for [25,29]; 4 for [30,34]; 5 for [35,39]; 6 for [40,49]; 7 and 8 for >= 50; 0 and NULL for unknown.
gender | User's gender: 0 for female, 1 for male, 2 and NULL for unknown.
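
Because both profile fields use more than one code for "unknown", it can help to normalise them before use. The following is a sketch under assumptions: the function names are hypothetical, and NULL is assumed to appear as an empty field, the text "NULL", or a missing value, depending on how the file is read.

```python
# Mappings taken from the age_range and gender definitions above.
AGE_RANGES = {
    1: "<18", 2: "18-24", 3: "25-29", 4: "30-34",
    5: "35-39", 6: "40-49", 7: ">=50", 8: ">=50",
}
GENDERS = {0: "female", 1: "male"}


def _to_code(raw):
    """Convert a raw field to an int code, or None if empty, NULL or non-numeric."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text == "" or text.upper() == "NULL":
        return None
    try:
        return int(float(text))  # also accepts values such as '3.0'
    except (ValueError, OverflowError):
        return None


def decode_age_range(raw):
    """Map an age_range code to a label; 0, NULL and unlisted codes give 'unknown'."""
    return AGE_RANGES.get(_to_code(raw), "unknown")


def decode_gender(raw):
    """Map a gender code to a label; 2, NULL and unlisted codes give 'unknown'."""
    return GENDERS.get(_to_code(raw), "unknown")
```

Codes 7 and 8 are merged into one label here because the definition gives both the same meaning.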

Training and Testing Data

Data Fields | Definition
user_id | A unique id for the shopper.
merchant_id | A unique id for the merchant.
label | An enumerated type {0, 1}, where 1 is for repeat buyer and 0 is for non-repeat buyer. This field is empty for test data.