  • Ali_Display_Ad_Click

    Providers : Alimama

    Posted : 2017.06.26

    #Participants : 26

Data Set Description

Document                  Format
ad_feature.csv.tar.gz     .gz (9MB)
behavior_log.csv.tar.gz   .gz (4GB)
raw_sample.csv.tar.gz     .gz (231MB)
user_profile.csv.tar.gz   .gz (5MB)

Introduction
Ali_Display_Ad_Click is a dataset for click-through rate prediction on display ads shown on the Taobao website. The dataset is provided by Alibaba.

Data Sets

Table             Description             Feature
raw_sample        raw training samples    User ID, Ad ID, nonclk, clk, timestamp
ad_feature        Ad’s basic information  Ad ID, campaign ID, Cate ID, Brand
user_profile      user profile            User ID, age, gender, etc.
raw_behavior_log  User behavior log       User ID, btag, cate, brand, timestamp

raw_sample

We randomly sampled 1,140,000 users from the Taobao website and took 8 days of their ad display / click logs (26 million records) to form the original sample skeleton. The fields are described as follows:
(1) user: User ID (int);
(2) time_stamp: time stamp (Bigint; 1494032110 stands for 2017-05-06 08:55:10 in UTC+8);
(3) adgroup_id: adgroup ID (int);
(4) pid: scenario (ad placement);
(5) noclk: 1 for no click, 0 for click;
(6) clk: 1 for click, 0 for no click.

We used the first 7 days of samples (20170506-20170512) as training samples and the last day's samples (20170513) as test samples.
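
The day-based split can be sketched as follows. This is only an illustration: it assumes that time_stamp is in Unix seconds and that days are counted in UTC+8, which is the offset that reproduces the 1494032110 = 2017-05-06 08:55:10 example in the field list. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: dates are read in UTC+8, the offset that matches the
# documented example 1494032110 -> 2017-05-06 08:55:10.
CST = timezone(timedelta(hours=8))
TEST_DAY = "20170513"  # last day of the 8-day window


def sample_day(time_stamp: int) -> str:
    """Return the calendar day (YYYYMMDD, UTC+8) of a raw_sample time_stamp."""
    return datetime.fromtimestamp(time_stamp, tz=CST).strftime("%Y%m%d")


def split_name(time_stamp: int) -> str:
    """Assign a record to 'test' (20170513) or 'train' (any earlier day)."""
    return "test" if sample_day(time_stamp) >= TEST_DAY else "train"


print(sample_day(1494032110), split_name(1494032110))
```

Comparing YYYYMMDD strings works here because that format sorts in date order.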

ad_feature
This data set covers the basic information of all ads in raw_sample. Field description is as follows:
(1) adgroup_id: Ad ID (int);
(2) cate_id: category ID;
(3) campaign_id: campaign ID;
(4) brand: brand ID;
(5) customer_id: Advertiser ID;
(6) price: the price of the item.

Each ad ID corresponds to one item; each item belongs to one category and one brand.
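
Because each ad ID maps to a single item, ad_feature can be attached to raw_sample with a simple lookup on adgroup_id. The sketch below uses small made-up excerpts in place of the real files; the raw_sample excerpt is trimmed to three columns, and the function names are hypothetical.

```python
import csv
import io

# Hypothetical excerpts for illustration only; the real files are far larger.
# raw_sample is trimmed to the columns this sketch needs.
RAW_SAMPLE = """user,adgroup_id,clk
1001,1,0
1002,2,1
1003,3,0
"""
AD_FEATURE = """adgroup_id,cate_id,campaign_id,customer_id,brand,price
1,100,10,7,55,9.9
2,200,20,8,66,120.0
"""


def load_ads(text):
    """Index ad_feature rows by adgroup_id (one item per ad ID)."""
    return {row["adgroup_id"]: row for row in csv.DictReader(io.StringIO(text))}


def attach_ad_features(samples_text, ads):
    """Left-join each sample to its ad's category, brand, and price."""
    joined = []
    for row in csv.DictReader(io.StringIO(samples_text)):
        ad = ads.get(row["adgroup_id"], {})  # empty dict if the ad is missing
        joined.append({**row,
                       "cate_id": ad.get("cate_id"),
                       "brand": ad.get("brand"),
                       "price": ad.get("price")})
    return joined


rows = attach_ad_features(RAW_SAMPLE, load_ads(AD_FEATURE))
```

The description says ad_feature covers every ad in raw_sample, so the fallback for a missing ad is purely defensive.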

user_profile
This data set covers the basic information of 1,060,000 users in raw_sample. The fields are described as follows:
(1) userid: user ID;
(2) cms_segid: Micro group ID;
(3) cms_group_id: cms_group_id;
(4) final_gender_code: gender, 1: male, 2: female;
(5) age_level: age level;
(6) pvalue_level: consumption grade, 1: low, 2: mid, 3: high;
(7) shopping_level: shopping depth, 1: shallow user, 2: moderate user, 3: deep user;
(8) occupation: whether the user is a college student, 1: yes, 0: no;
(9) new_user_class_level: City level
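
The coded fields can be turned into readable labels with plain lookup tables. This is a minimal sketch, assuming the values arrive as strings (as they would from a CSV reader) and that a missing value appears as an empty string; neither detail is stated in the description, and decode_profile is a hypothetical helper.

```python
# Code tables follow the field descriptions above.
GENDER = {"1": "male", "2": "female"}
PVALUE_LEVEL = {"1": "low", "2": "mid", "3": "high"}
SHOPPING_LEVEL = {"1": "shallow user", "2": "moderate user", "3": "deep user"}
IS_STUDENT = {"1": True, "0": False}


def decode_profile(row: dict) -> dict:
    """Translate coded user_profile fields; unknown or empty codes become None."""
    return {
        "userid": row["userid"],
        "gender": GENDER.get(row.get("final_gender_code")),
        "pvalue_level": PVALUE_LEVEL.get(row.get("pvalue_level")),
        "shopping_level": SHOPPING_LEVEL.get(row.get("shopping_level")),
        "is_student": IS_STUDENT.get(row.get("occupation")),
    }


# Hypothetical row, with pvalue_level left empty.
example = decode_profile({
    "userid": "1001", "final_gender_code": "2", "pvalue_level": "",
    "shopping_level": "3", "occupation": "0",
})
```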

behavior_log

This data set covers 22 days of shopping behavior for all users in raw_sample (700 million records in total). The fields are described as follows:
(1) nick: User ID (int);
(2) time_stamp: time stamp (Bigint; 1494032110 stands for 2017-05-06 08:55:10);
(3) btag: behavior type, one of the following four:

type  explanation
ipv   browse
cart  add to the shopping cart
fav   favorite
buy   buy

(4) cate: category ID(int);
(5) brand: brand ID(int);

If user ID and timestamp are used as the primary key, many duplicate records will be found. This is because the different behavior types are recorded by different departments, and small deviations arise when the data are packaged together (i.e. two identical timestamps may actually be two different times with a relatively small difference).
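
One way to inspect this is to count how many rows share each (user, time_stamp) pair and to remove only rows that repeat in every field. The rows below are made up, the helper names are hypothetical, and treating fully identical rows as redundant is an assumption rather than something the description states.

```python
from collections import Counter

# Hypothetical rows in the order (user, time_stamp, btag, cate, brand).
LOG = [
    (1, 1494032110, "ipv", 10, 100),
    (1, 1494032110, "cart", 10, 100),  # same key, different behavior
    (1, 1494032110, "ipv", 10, 100),   # exact repeat of the first row
    (2, 1494032200, "buy", 20, 200),
]


def shared_keys(rows):
    """Return the (user, time_stamp) pairs that appear on more than one row."""
    counts = Counter((user, ts) for user, ts, *_ in rows)
    return {key: n for key, n in counts.items() if n > 1}


def drop_exact_repeats(rows):
    """Remove rows identical in every field, keeping first-seen order."""
    return list(dict.fromkeys(rows))


cleaned = drop_exact_repeats(LOG)
```

Deduplicating on (user, time_stamp) alone would discard the cart event in this example, which is why the sketch compares whole rows.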

Typical research topics
Predict the probability that a user clicks on an ad when it is displayed, based on the user’s historical shopping behavior.

Baseline
AUC: 0.622
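
For readers who want to score their own predictions against this figure, AUC can be computed from ranks as sketched below. This is the standard rank-sum formulation with tied scores given their average rank; the description does not say how the baseline value was computed, so treat the function as an illustration rather than the provider's evaluation script.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formula.

    labels: iterable of 0/1 click labels (the clk field)
    scores: predicted click probabilities, same length as labels
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(label for _, label in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one click and one non-click")

    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1  # extend over a run of tied scores
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```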

Reference and Related Publications
1. Gai K, Zhu X, Li H, et al. Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction[J]. arXiv preprint arXiv:1704.05194, 2017.
2. Zhou G, Song C, Zhu X, et al. Deep Interest Network for Click-Through Rate Prediction. https://arxiv.org/abs/1706.06978.

--------------------------------------------------------Translation of the Chinese description--------------------------------------------------------

Ali_Display_Ad_Click is a dataset for click-through rate prediction on Taobao display ads, provided by Alibaba.

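Data Sets

Name              Description               Attributes
raw_sample        original sample skeleton  User ID, ad ID, time, ad placement, clicked or not
ad_feature        basic ad information      Ad ID, campaign ID, category ID, brand ID
user_profile      basic user information    User ID, age level, gender, etc.
raw_behavior_log  user behavior log         User ID, behavior type, time, item category ID, brand ID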

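Original sample skeleton: raw_sample
We randomly sampled 1.14 million users from the Taobao website and took 8 days of their ad display / click logs (26 million records) to form the original sample skeleton.
The fields are described as follows:
(1) user_id: anonymized user ID;
(2) adgroup_id: anonymized ad unit ID;
(3) time_stamp: time stamp;
(4) pid: ad placement;
(5) noclk: 1 for no click, 0 for click;
(6) clk: 0 for no click, 1 for click.
We used the first 7 days as training samples (20170506-20170512) and the 8th day as test samples (20170513).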

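Basic ad information table: ad_feature
This data set covers the basic information of all ads in raw_sample. The fields are described as follows:
(1) adgroup_id: anonymized ad ID;
(2) cate_id: anonymized item category ID;
(3) campaign_id: anonymized campaign ID;
(4) customer_id: anonymized advertiser ID;
(5) brand: anonymized brand ID;
(6) price: the price of the item.
Each ad ID corresponds to one item; each item belongs to one category and one brand.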

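Basic user information table: user_profile
This data set covers the basic information of all users in raw_sample. The fields are described as follows:
(1) userid: anonymized user ID;
(2) cms_segid: micro group ID;
(3) cms_group_id: cms_group_id;
(4) final_gender_code: gender, 1: male, 2: female;
(5) age_level: age level;
(6) pvalue_level: consumption grade, 1: low, 2: mid, 3: high;
(7) shopping_level: shopping depth, 1: shallow user, 2: moderate user, 3: deep user;
(8) occupation: whether the user is a college student, 1: yes, 0: no;
(9) new_user_class_level: city level.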

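User behavior log: behavior_log
This data set covers 22 days of shopping behavior for all users in raw_sample (700 million records in total). The fields are described as follows:
(1) user: anonymized user ID;
(2) time_stamp: time stamp;
(3) btag: behavior type, one of the following four:

type  explanation
ipv   browse
cart  add to the shopping cart
fav   favorite
buy   buy

(4) cate: anonymized item category;
(5) brand: anonymized brand term.
If user + time_stamp is used as the key, there will be many duplicate records. This is because the different behavior types are recorded by different departments, and small deviations arise when the data are packaged together (i.e. two identical time_stamp values are actually two times with a relatively small difference).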

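Typical research scenario
Predict the probability that a user clicks on an ad when exposed to it, based on the user's historical shopping behavior.

Baseline
AUC: 0.622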

Research results
1. Gai K, Zhu X, Li H, et al. Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction[J]. arXiv preprint arXiv:1704.05194, 2017.
2. Zhou G, Song C, Zhu X, et al. Deep Interest Network for Click-Through Rate Prediction. https://arxiv.org/abs/1706.06978.