Tianchi Datasets

AliExpress Searching System Dataset

Description

AliExpress Searching System Dataset

Data list

  • File name | Upload date | Size
  • md5.txt | 2020-08-17 | 1023.00 Bytes
  • aliexpress_images_datasets.zip | 2020-10-12 | 366.12 KB
  • aliexpress_NL_datasets.zip | 2020-10-12 | 1.49 GB
  • aliexpress_FR_datasets.zip | 2020-10-12 | 2.25 GB
  • aliexpress_ES_datasets.zip | 2020-10-12 | 2.60 GB
  • aliexpress_US_datasets.zip | 2020-10-12 | 2.25 GB
  • aliexpress_RU_datasets.zip | 2020-10-12 | 11.34 GB
  • aliexpress_datasets.txt | 2020-10-12 | 2.11 KB

Documentation

1.Introduction

This dataset was gathered from real-world traffic logs of the AliExpress search system. As one of the largest global e-commerce platforms, AliExpress provides item search services to more than 200 countries. Figure 1 shows a search session on the platform: a user first clicks a product on the search result page and then decides whether to purchase it.
[Figure 1: A search session on the AliExpress platform.]

The dataset is provided to facilitate research on Learning to Rank (LTR). Previous LTR datasets were collected from a single scenario, whereas this dataset was collected from 5 countries (Russia, Spain, France, the Netherlands, and the United States), which can be seen as 5 scenarios. To the best of our knowledge, this is the first large-scale real-world dataset for Multi-Scenario Learning to Rank.

2.Data Organization

The dataset contains 20 compressed files, each holding a single CSV file. Figure 2 shows the data organization.
[Figure 2: Data organization of the dataset.]
As shown in Figure 2, the dataset is divided into 5 groups, one per country: Russia, Spain, France, the Netherlands, and the United States. Each group consists of 4 files: 2 for training and 2 for testing. The training and test sets are split chronologically.
An instance used by an LTR model consists of 3 parts:

  • user and query features

  • item(product) features

  • label of the user's feedback, such as browsing, click, and purchase

When a user submits a query, the search system returns a search result page with a ranked product list. The user and query features are the same for all items on a search result page. To reduce storage cost, we split each instance across two files. For example, the user and query features of Russia's training data are in ru_user_train.zip; the item features and labels of Russia's training data are in ru_item_train.zip.

Therefore, the 4 files in the Russia group contain:

  • ru_user_train.zip: user and query features of Russia's training data; each row represents a search result page;

  • ru_item_train.zip: item features and labels of Russia's training data; each row represents a user's feedback;

  • ru_user_test.zip: user and query features of Russia's test data, in the same format as ru_user_train.zip;

  • ru_item_test.zip: item features and labels of Russia's test data, in the same format as ru_item_train.zip.

Each search result page has a unique id assigned by the search system, which we call the pv-id. The two training files can be joined on the pv-id to obtain complete instances. Here is the pseudocode for data completion.

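    -- Attach page-level user/query features to every item row via pv-id.
    -- "pv-id" is quoted because a bare hyphen is not valid in an SQL identifier.
    select a."pv-id"
        ,b.user_query_features   -- placeholder for the user/query feature columns
        ,a.item_features         -- placeholder for the item feature columns
        ,a.label
    from  ru_item_train as a
    join  ru_user_train as b
    on    a."pv-id" = b."pv-id"
    ;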
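
The same join can be sketched in plain Python. This is a minimal illustration, not the authors' tooling: it assumes the extracted CSV files are comma-separated, have no header row, and keep the pv-id in the first column and the label in the last item column. The function name `join_on_pv_id` and the toy values are hypothetical.

```python
import csv
import io

def join_on_pv_id(user_rows, item_rows):
    """Yield (pv_id, user_query_features, item_features, label) tuples.

    Assumes both inputs are iterables of CSV rows (lists of strings) with the
    pv-id in the first column and the label in the last item column.
    """
    # One user row per search result page, so a dict keyed by pv-id suffices.
    user_by_pv = {row[0]: row[1:] for row in user_rows}
    for row in item_rows:
        pv_id, item_features, label = row[0], row[1:-1], row[-1]
        user_features = user_by_pv.get(pv_id)
        if user_features is None:
            continue  # inner join: skip items whose page is missing
        yield pv_id, user_features, item_features, int(label)

# Toy data with made-up values and far fewer columns than the real files.
user_csv = io.StringIO("pv1,3,0\npv2,7,1\n")
item_csv = io.StringIO("pv1,0.5,0.1,0\npv1,0.9,0.2,2\npv2,0.3,0.4,1\n")
joined = list(join_on_pv_id(csv.reader(user_csv), csv.reader(item_csv)))
```

The item file is streamed row by row while the user file is held in memory. For the largest group (Russia, about 6.4 million user rows) this may need several gigabytes of RAM, so a database or a chunked join may be preferable there.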
    

Here is the number of records of each file:

| File name | Number of records |
| -- | -- |
| ru_user_train.zip | 6367671 |
| ru_item_train.zip | 95355689 |
| ru_user_test.zip | 2303369 |
| ru_item_test.zip | 34564064 |
| es_user_train.zip | 1437867 |
| es_item_train.zip | 22326719 |
| es_user_test.zip | 600040 |
| es_item_test.zip | 9342708 |
| fr_user_train.zip | 1183392 |
| fr_item_train.zip | 18212800 |
| fr_user_test.zip | 570288 |
| fr_item_test.zip | 8822801 |
| nl_user_train.zip | 811871 |
| nl_item_train.zip | 12157894 |
| nl_user_test.zip | 368444 |
| nl_item_test.zip | 5559301 |
| us_user_train.zip | 1331232 |
| us_item_train.zip | 19932049 |
| us_user_test.zip | 497153 |
| us_item_test.zip | 7460564 |

The file md5.txt records the md5 checksum of each compressed file.
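
A download can be checked against md5.txt with the Python standard library. The sketch below is illustrative only. The layout of md5.txt is not documented here, so `parse_md5_listing` assumes one `<hex digest> <file name>` pair per line, as the `md5sum` tool prints; both function names are hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def parse_md5_listing(text):
    """Map file name -> expected digest.

    Assumed line format: '<hex digest> <file name>' (not confirmed by the
    dataset documentation).
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # md5sum prefixes the name with '*' in binary mode; drop it.
            expected[parts[-1].lstrip("*")] = parts[0].lower()
    return expected
```

Comparing `md5_of_file(path)` with the matching entry from `parse_md5_listing` flags a truncated or corrupted download before the multi-gigabyte archive is unpacked.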

3.Data description

For data security reasons, the meanings of the features are omitted and only the feature values are provided.

Taking Russia's training data as an example: in the file ru_user_train.zip, the 1st column is the anonymized pv-id and the remaining columns are user and query features. The features in ru_user_train.zip are described below:

| Column number | Type | Range | Description |
| -- | -- | -- | -- |
| 2 | categorical | 11 categories, 0-10 | |
| 3 | categorical | 3 categories, 0-2 | |
| 4 | categorical | 6 categories, 0-5 | |
| 5 | categorical | 2 categories, 0-1 | |
| 6 | categorical | 33 categories, 0-32 | |
| 7 | categorical | 7 categories, 0-6 | |
| 8 | categorical | 50 categories, 0-49 | |
| 9 | numerical | non-negative number | 1st dimension of multi-value feature mu1 |
| 10 | numerical | non-negative number | 2nd dimension of multi-value feature mu1 |
| 11 | numerical | non-negative number | 3rd dimension of multi-value feature mu1 |
| 12 | numerical | non-negative number | 1st dimension of multi-value feature mu2 |
| 13 | numerical | non-negative number | 2nd dimension of multi-value feature mu2 |
| 14 | numerical | non-negative number | 3rd dimension of multi-value feature mu2 |
| 15 | numerical | non-negative number | 4th dimension of multi-value feature mu2 |
| 16 | numerical | non-negative number | 5th dimension of multi-value feature mu2 |
| 17 | numerical | non-negative number | 1st dimension of multi-value feature mu3 |
| 18 | numerical | non-negative number | 2nd dimension of multi-value feature mu3 |
| 19 | numerical | non-negative number | 3rd dimension of multi-value feature mu3 |
| 20 | numerical | non-negative number | 4th dimension of multi-value feature mu3 |
| 21 | numerical | non-negative number | 5th dimension of multi-value feature mu3 |
| 22 | numerical | non-negative number | 6th dimension of multi-value feature mu3 |
| 23 | numerical | non-negative number | 7th dimension of multi-value feature mu3 |
| 24 | numerical | non-negative number | 8th dimension of multi-value feature mu3 |
| 25 | numerical | non-negative number | 9th dimension of multi-value feature mu3 |
| 26 | numerical | non-negative number | 10th dimension of multi-value feature mu3 |
| 27 | numerical | non-negative number | 1st dimension of multi-value feature mu4 |
| 28 | numerical | non-negative number | 2nd dimension of multi-value feature mu4 |
| 29 | numerical | non-negative number | 3rd dimension of multi-value feature mu4 |
| 30 | numerical | non-negative number | |
| 31 | numerical | non-negative number | |
| 32 | categorical | 8 categories, 0-7 | |
| 33 | categorical | 8 categories, 0-7 | |
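
The column layout above can be turned into a small row splitter. This is a sketch under assumptions: rows are already parsed into 33 strings, column numbers are 1-based as in the table, and categorical values are stored as integers (or integer-valued decimals). The names `USER_COLUMN_GROUPS` and `split_user_row` are hypothetical.

```python
# Column numbers are 1-based, as in the table; column 1 is the pv-id.
USER_COLUMN_GROUPS = {
    "categorical": list(range(2, 9)) + [32, 33],
    "mu1": list(range(9, 12)),    # 3 dimensions
    "mu2": list(range(12, 17)),   # 5 dimensions
    "mu3": list(range(17, 27)),   # 10 dimensions
    "mu4": list(range(27, 30)),   # 3 dimensions
    "numerical": [30, 31],
}

def split_user_row(row):
    """Split one 33-column user row into the pv-id and named feature groups."""
    if len(row) != 33:
        raise ValueError(f"expected 33 columns, got {len(row)}")
    groups = {}
    for name, columns in USER_COLUMN_GROUPS.items():
        values = [float(row[c - 1]) for c in columns]
        if name == "categorical":
            # Assumption: category ids are integer-valued in the file.
            values = [int(v) for v in values]
        groups[name] = values
    return row[0], groups
```

Keeping each multi-value feature (mu1 to mu4) as its own list makes it easy to feed them to a model as separate vectors rather than as 21 unrelated scalars.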

The 1st column of the file ru_item_train.zip is also the anonymized pv-id. The features and label in ru_item_train.zip are described below:

| Column number | Type | Range | Description |
| -- | -- | -- | -- |
| 2 | numerical | [0, 1] | |
| 3 | numerical | [0, 1] | |
| 4 | numerical | [0, 1] | |
| 5 | numerical | [0, 1] | |
| 6 | numerical | [0, 1] | |
| 7 | numerical | [0, 1] | |
| 8 | numerical | [0, 1] | |
| 9 | numerical | [0, 1] | |
| 10 | numerical | [0, 1] | |
| 11 | numerical | [0, 1] | |
| 12 | numerical | [0, 1] | |
| 13 | numerical | [0, 1] | |
| 14 | numerical | [0, 1] | |
| 15 | numerical | [0, 1] | |
| 16 | numerical | [0, 1] | |
| 17 | numerical | [0, 1] | |
| 18 | numerical | [0, 1] | |
| 19 | integer | [0, 1] | |
| 20 | numerical | [0, 1] | |
| 21 | numerical | [0, 1] | |
| 22 | numerical | [0, 1] | |
| 23 | numerical | [0, 1] | |
| 24 | numerical | [0, 1] | |
| 25 | numerical | [0, 1] | |
| 26 | numerical | [0, 1] | |
| 27 | numerical | [0, 1] | |
| 28 | numerical | [0, 1] | |
| 29 | numerical | [0, 1] | |
| 30 | numerical | [0, 1] | |
| 31 | numerical | [0, 1] | |
| 32 | numerical | [0, 1] | |
| 33 | numerical | [0, 1] | |
| 34 | integer | [0, 1] | |
| 35 | integer | [0, 1] | |
| 36 | integer | [0, 1] | |
| 37 | numerical | [0, 1] | |
| 38 | numerical | [0, 1] | |
| 39 | numerical | [0, 1] | |
| 40 | integer | [0, 1] | |
| 41 | integer | [0, 1] | |
| 42 | integer | [0, 1] | |
| 43 | numerical | [0, 1] | |
| 44 | numerical | [0, 1] | |
| 45 | numerical | [0, 1] | |
| 46 | numerical | [0, 1] | |
| 47 | numerical | [0, 1] | |
| 48 | numerical | [0, 1] | |
| 49 | integer | {0, 1, 2} | label of user's feedback, 0: impression, 1: click, 2: purchase |
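
The three-level label in column 49 can be mapped to binary click and purchase targets. This sketch rests on an assumption drawn from the introduction, where a user first clicks a product and then decides whether to purchase it: a purchase (2) is counted as a click as well. The function names are hypothetical.

```python
from collections import Counter

LABEL_NAMES = {0: "impression", 1: "click", 2: "purchase"}

def feedback_targets(label):
    """Return (clicked, purchased) as 0/1 flags for one item row."""
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected label: {label}")
    clicked = int(label >= 1)    # assumption: a purchase implies a click
    purchased = int(label == 2)
    return clicked, purchased

def summarize_labels(labels):
    """Count impressions, clicks and purchases over a list of labels."""
    counts = Counter(LABEL_NAMES[label] for label in labels)
    return {
        "impressions": len(labels),  # every row was shown to the user
        "clicks": counts["click"] + counts["purchase"],
        "purchases": counts["purchase"],
    }
```

For example, `summarize_labels([0, 0, 1, 2])` counts 4 impressions, 2 clicks, and 1 purchase under the assumption stated above.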

4.Citation

To acknowledge use of the dataset in publications, please cite the following paper:

@inproceedings{peng2020improving,
    author={Pengcheng Li and 
            Runze Li and
            Qing Da and 
            An-Xiang Zeng and 
            Lijun Zhang},
    title={Improving Multi-Scenario Learning to Rank in E-commerce by Exploiting Task Relationships in the Label Space},
    booktitle={Proceedings of the 29th {ACM} International Conference on Information and Knowledge Management, {CIKM} 2020, Virtual Event, Ireland, October 19-23, 2020},
    publisher={{ACM}},
    address={New York,NY,USA},
    year={2020}
}

If you have published a paper using our dataset, please send the publication URL to tianchi_open_dataset@alibabacloud.com. We keep citation statistics and will contact you to send a Tianchi gift.

5.License

The dataset is distributed under the CC BY-NC-SA 4.0 license.
