  • Ali_Mobile_Rec

    Providers : Alibaba

    Posted : 2015.07.07

    #Participants : 814

Data Set Description

Document                                 Format
(sample)user_result.csv                  .csv (159B)
tianchi_mobile_recommend_train_item.csv  .csv (8MB)
tianchi_mobile_recommend_train_user.zip  .zip (108MB)

Publications:
Purchase Behavior Prediction in M-Commerce with an Optimized Sampling Method
A NEW MODEL TO MEASURE THE KNOWLEDGE DIFFUSION VIA INFORMATION ENTROPY IN VIRTUAL COMMUNITIES
A Behavior Mining Based Hybrid Recommender System
A Method of Purchase Prediction Based on User Behavior Log

Overview
In Data Lab, we provide the data sets and evaluation system of previous competitions so that people can test the waters of machine learning and data mining.
The following data sets are from Season 1 of the Ali Mobile Recommendation Algorithm Competition. We provide a baseline on the leaderboard (the result of the top team, 小萝卜头) and conduct the evaluation every other day. For walkthroughs and FAQs, please go to the competition Forum.


Introduction
The year 2014 witnessed the rapid development of Alibaba Group's M-Commerce business. For example, the Gross Merchandise Volume (GMV) on mobile terminals in the Nov. 11 great sale of 2014 accounted for 42.6% of total GMV. Compared with the PC era, network access on mobile terminals can take place anytime and anywhere. Mobile terminals also carry richer background data, such as users' location information and the regularity of their access times. This competition is based on real user-commodity behavior data from Alibaba's M-Commerce platforms, and it also provides the location information typical of the mobile era. Participants need to build commodity recommendation models geared to M-Commerce. They are also expected to dig into the meaning behind the data and recommend appropriate commodities to mobile users at the right time and the right place.

Data Description
In many cases we need to develop an individualized recommendation system for a subset of all items. When fulfilling such a task, besides utilizing the user behavior data on that subset of items, we also need to utilize more comprehensive user behavior data. Notations:

U – The set of users
I – The whole set of items
P – The subset of items, P ⊆ I
D – The user behavior data over the set of all items
Our objective is to develop a recommendation model for users in U on the business domain P using the data D.

The data contains two parts. The first part is the data set D, the mobile behavior data of users over the set of all items. It corresponds to the table tianchi_mobile_recommend_train_user, with the following columns:

Column         Description                                           Comment
user_id        Identity of users                                     Sampled & desensitized
item_id        Identity of items                                     Desensitized
behavior_type  The user behavior type                                Click, collect, add-to-cart and payment, encoded as 1, 2, 3 and 4, respectively
user_geohash   User location when the behavior occurs (may be null)  Subject to fuzzing
item_category  The category id of the item                           Desensitized
time           The time of the behavior                              To the nearest hour
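
As an illustration of how these columns might be read, here is a minimal sketch using only the Python standard library. It assumes the CSV file has a header row whose names match the column names above, and that hour-resolution timestamps look like `2014-12-06 02`; neither detail is stated on this page. The sample rows are made up.

```python
import csv
import io
from datetime import datetime

# Codes as listed for the behavior_type column.
BEHAVIOR_NAMES = {1: "click", 2: "collect", 3: "add-to-cart", 4: "payment"}

# Assumption: timestamps are formatted as "YYYY-MM-DD HH" (hour resolution).
TIME_FORMAT = "%Y-%m-%d %H"


def parse_behavior_rows(text_stream):
    """Yield one dict per behavior record, with typed fields."""
    for row in csv.DictReader(text_stream):
        yield {
            "user_id": row["user_id"],
            "item_id": row["item_id"],
            "behavior": BEHAVIOR_NAMES[int(row["behavior_type"])],
            # An empty geohash field is treated as an unknown location.
            "user_geohash": row["user_geohash"] or None,
            "item_category": row["item_category"],
            "time": datetime.strptime(row["time"], TIME_FORMAT),
        }


# Made-up sample rows, for illustration only.
SAMPLE = """user_id,item_id,behavior_type,user_geohash,item_category,time
u1,i1,1,,c1,2014-12-06 02
u1,i2,4,abc123,c2,2014-12-07 21
"""

records = list(parse_behavior_rows(io.StringIO(SAMPLE)))
```

Keeping the identifiers as strings avoids accidental numeric handling of desensitized ids.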


The second part is the item subset P, which corresponds to the table tianchi_mobile_recommend_train_item, with the following columns:

Column         Description                         Comment
item_id        Identity of items                   Sampled & desensitized
item_geohash   Location of the item (may be null)  Generated from longitude and latitude by a privacy-preserving algorithm
item_category  The category id of the item         Desensitized

The training data contains the mobile behavior data of a certain number of sampled users (D). The evaluation data is the purchase data of these same users for the items in P one month later. Participants should develop a model that predicts which items in P these users will purchase on the next day.
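
Because only purchases of items in P are evaluated, a common first step is to separate the behavior on P from the rest of D while still keeping all of D available for building features. The following is a minimal sketch of that step, not the organizers' pipeline; the function name and the tiny data set are hypothetical.

```python
PAYMENT = 4  # behavior_type code for payment


def restrict_to_subset(behaviors, subset_item_ids):
    """Keep the (user_id, item_id, behavior_type) records whose item is in P."""
    return [b for b in behaviors if b[1] in subset_item_ids]


# Made-up behavior data D and item subset P, for illustration only.
D = [
    ("u1", "i1", 1),
    ("u1", "i2", 4),
    ("u2", "i3", 3),
    ("u2", "i1", 4),
]
P = {"i1", "i3"}

D_on_P = restrict_to_subset(D, P)

# Purchases of items in P: the kind of (user, item) pairs the evaluation counts.
purchases_on_P = {(u, i) for (u, i, b) in D_on_P if b == PAYMENT}
```

Here the purchase of `i2` by `u1` is dropped from the label set because `i2` is outside P, although it can still inform features about `u1`.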

Evaluation Data Format
Participants should submit their prediction results in a table named tianchi_mobile_recommendation_predict, in the specified format (not a partitioned table), containing a user_id column and an item_id column (both of string type). Duplicates should be removed.
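
A minimal sketch of producing such a result as CSV text, with both columns written as strings and duplicate pairs removed. Whether a header row is expected is an assumption here, and the function name and sample pairs are hypothetical.

```python
import csv
import io


def write_predictions(pairs, out_stream):
    """Write unique (user_id, item_id) pairs as two string columns.

    Returns the number of data rows written.
    """
    writer = csv.writer(out_stream, lineterminator="\n")
    writer.writerow(["user_id", "item_id"])  # assumption: header row is wanted
    seen = set()
    for user_id, item_id in pairs:
        key = (str(user_id), str(item_id))
        if key in seen:
            continue  # skip duplicate predictions
        seen.add(key)
        writer.writerow(key)
    return len(seen)


# Made-up predictions; (1, 10) and ("1", "10") collapse to one row.
buffer = io.StringIO()
n_rows = write_predictions([(1, 10), ("1", "10"), (2, 20)], buffer)
```

Converting ids to strings before deduplicating ensures that the same pair is not written twice just because it arrived with different types.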

Evaluation Metric
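We use precision, recall and F1 score as the evaluation metrics, which are defined as follows:

Precision = |PredictionSet ∩ ReferenceSet| / |PredictionSet|
Recall = |PredictionSet ∩ ReferenceSet| / |ReferenceSet|
F1 = 2 × Precision × Recall / (Precision + Recall)

Here, PredictionSet contains the submitted purchase data and ReferenceSet contains the real purchase data. We take the F1 score as the only standard of the final evaluation.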
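
For local checking, the metric can be sketched as below. It assumes the standard set-based definitions: precision is the share of predicted (user_id, item_id) pairs that appear in ReferenceSet, recall is the share of reference pairs that were predicted, and F1 is their harmonic mean. Returning 0.0 for empty sets is a choice made here, not something the page specifies, and the sample pairs are made up.

```python
def evaluate(prediction_set, reference_set):
    """Return (precision, recall, f1) for two collections of (user_id, item_id) pairs."""
    prediction_set = set(prediction_set)
    reference_set = set(reference_set)
    hits = len(prediction_set & reference_set)
    # Assumption: empty sets score 0.0 rather than raising an error.
    precision = hits / len(prediction_set) if prediction_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Made-up example: 4 predictions, 3 real purchases, 2 in common.
predicted = {("u1", "i1"), ("u1", "i2"), ("u2", "i3"), ("u3", "i4")}
actual = {("u1", "i1"), ("u2", "i3"), ("u4", "i5")}

precision, recall, f1 = evaluate(predicted, actual)
```

In this example precision is 2/4, recall is 2/3, and F1 is 4/7. Since only F1 decides the ranking, submitting more pairs helps only while the gain in recall outweighs the loss in precision.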