AliExpress Searching System Dataset
This dataset was gathered from real-world traffic logs of the AliExpress search system. As one of the largest global e-commerce platforms, AliExpress provides item search for more than 200 countries. Figure 1 shows a search session on our e-commerce platform: a user first clicks a product on the search result page and then decides whether to purchase it.
The dataset is provided to facilitate research on Learning to Rank (LTR). Previous LTR datasets were collected from a single scenario, whereas this dataset is collected from 5 countries: Russia, Spain, France, the Netherlands, and America, which can be seen as 5 scenarios. To the best of our knowledge, this is the first large-scale real-world dataset for Multi-Scenario Learning to Rank.
The dataset contains 20 compressed files, each of which holds one CSV file. Figure 2 shows the data organization.
As shown in Figure 2, the dataset is divided into 5 groups, one per country: Russia, Spain, France, the Netherlands, and America. Each group consists of 4 files: 2 for training and 2 for testing. The training and test sets are split chronologically.
An instance used by an LTR model consists of 3 parts:
user and query features
item (product) features
label of the user's feedback, such as browsing, click, or purchase
When a user searches a query, the search system returns a search result page with a ranked list of products. The user and query features are the same for all items on a search result page. To reduce storage cost, we split each instance across two files. For example, the user and query features of Russia's training data are in ru_user_train.zip; the item features and labels of Russia's training data are in ru_item_train.zip.
Therefore, in the Russia group, the 4 files contain:
ru_user_train.zip: user and query features of Russia's training data; each row represents a search result page;
ru_item_train.zip: item features and labels of Russia's training data; each row represents a user's feedback on one item;
ru_user_test.zip: user and query features of Russia's test data, in the same format as ru_user_train.zip;
ru_item_test.zip: item features and labels of Russia's test data, in the same format as ru_item_train.zip.
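The naming pattern above appears to carry over to the other groups, whose files use the prefixes es, fr, nl, and us in the record-count table. The following is a minimal sketch that enumerates all 20 archive names under that assumption; the helper name `archive_names` is hypothetical and not part of the dataset.

```python
from itertools import product

# Country prefixes as they appear in the file names of this dataset.
COUNTRIES = ["ru", "es", "fr", "nl", "us"]
PARTS = ["user", "item"]    # user/query features vs. item features + labels
SPLITS = ["train", "test"]

def archive_names():
    """Return the 20 archive names, e.g. 'ru_user_train.zip' (hypothetical helper)."""
    return [f"{c}_{p}_{s}.zip" for c, p, s in product(COUNTRIES, PARTS, SPLITS)]

names = archive_names()
```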
Each search result page has a unique id by which the search system identifies it; we call it the pv-id. The two training files can be joined on the pv-id to obtain complete instances. Here is the pseudocode for this data completion step.
select a."pv-id"
     , b.user_query_features
     , a.item_features
     , a.label
from ru_item_train as a
join ru_user_train as b
  on a."pv-id" = b."pv-id"
;
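The same join can be sketched in Python with only the standard library. This is an illustration under stated assumptions rather than an official loader: it assumes the extracted CSV files are comma-delimited with no header row, that the pv-id is the first field, and that the label is the last field of each item row. The function name `join_instances` and the toy values are made up.

```python
import csv
import io

def join_instances(user_rows, item_rows):
    """Yield (pv_id, user_query_features, item_features, label) tuples.

    Both inputs are iterables of rows (lists of strings) whose first
    element is the pv-id; the label is assumed to be the last item column.
    """
    # Index the page-level features by pv-id (one row per search result page).
    user_by_pv = {row[0]: row[1:] for row in user_rows}
    for row in item_rows:
        pv_id, item_features, label = row[0], row[1:-1], row[-1]
        user_features = user_by_pv.get(pv_id)
        if user_features is None:
            continue  # inner-join semantics: skip items without a matching page
        yield pv_id, user_features, item_features, label

# Toy CSV text standing in for the extracted files (values are made up).
users_csv = io.StringIO("p1,3,0\np2,7,1\n")
items_csv = io.StringIO("p1,0.2,0\np1,0.9,2\np2,0.5,1\np3,0.1,0\n")
joined = list(join_instances(csv.reader(users_csv), csv.reader(items_csv)))
```

Holding the user file in a dictionary keeps the item file streaming, which matters because the item files are roughly 15 times larger than the user files. For the largest groups, a dataframe library or a database may be a better fit.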
Here is the number of records of each file:
File name | Number of records |
---|---|
ru_user_train.zip | 6367671 |
ru_item_train.zip | 95355689 |
ru_user_test.zip | 2303369 |
ru_item_test.zip | 34564064 |
es_user_train.zip | 1437867 |
es_item_train.zip | 22326719 |
es_user_test.zip | 600040 |
es_item_test.zip | 9342708 |
fr_user_train.zip | 1183392 |
fr_item_train.zip | 18212800 |
fr_user_test.zip | 570288 |
fr_item_test.zip | 8822801 |
nl_user_train.zip | 811871 |
nl_item_train.zip | 12157894 |
nl_user_test.zip | 368444 |
nl_item_test.zip | 5559301 |
us_user_train.zip | 1331232 |
us_item_train.zip | 19932049 |
us_user_test.zip | 497153 |
us_item_test.zip | 7460564 |
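
Because each user row is one search result page and each item row is one item shown on a page, dividing the two counts gives the average number of items per page. The sketch below does this for the training files; it is a quick sanity check one might run after a join, not a statistic reported by the dataset authors.

```python
# Record counts copied from the table above: (user rows, item rows) per training file.
TRAIN_COUNTS = {
    "ru": (6367671, 95355689),
    "es": (1437867, 22326719),
    "fr": (1183392, 18212800),
    "nl": (811871, 12157894),
    "us": (1331232, 19932049),
}

def items_per_page(counts):
    """Average number of item rows per search result page, per country."""
    return {country: items / pages for country, (pages, items) in counts.items()}

ratios = items_per_page(TRAIN_COUNTS)
```

For every country the ratio comes out at roughly 15 items per page.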
The file md5.txt records the MD5 checksum of each compressed file.
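A downloaded archive can be checked against its recorded checksum with Python's `hashlib`. The layout of md5.txt is not described here, so the parser below assumes the common `md5sum` layout of one `<digest>  <file name>` pair per line. Both helper names are hypothetical.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def parse_md5_listing(text):
    """Parse lines of the assumed form '<hex digest>  <file name>' into a dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading '*' on the file name.
            expected[name.lstrip("*")] = digest.lower()
    return expected
```

An archive passes the check when `md5_of_file(path)` equals the digest that `parse_md5_listing` returns for its file name.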
For data security reasons, we omit the meaning of the features and provide only the feature values.
Taking the training data of Russia as an example: in the file ru_user_train.zip, the 1st column is the anonymized pv-id and the other columns are user and query features. Here is the feature description of ru_user_train.zip:
Column number | Type | Range | Description |
---|---|---|---|
2 | categorical | 11 categories, 0-10 | |
3 | categorical | 3 categories, 0-2 | |
4 | categorical | 6 categories, 0-5 | |
5 | categorical | 2 categories, 0-1 | |
6 | categorical | 33 categories, 0-32 | |
7 | categorical | 7 categories, 0-6 | |
8 | categorical | 50 categories, 0-49 | |
9 | numerical | non-negative number | 1st dimension of multi-value feature mu1 |
10 | numerical | non-negative number | 2nd dimension of multi-value feature mu1 |
11 | numerical | non-negative number | 3rd dimension of multi-value feature mu1 |
12 | numerical | non-negative number | 1st dimension of multi-value feature mu2 |
13 | numerical | non-negative number | 2nd dimension of multi-value feature mu2 |
14 | numerical | non-negative number | 3rd dimension of multi-value feature mu2 |
15 | numerical | non-negative number | 4th dimension of multi-value feature mu2 |
16 | numerical | non-negative number | 5th dimension of multi-value feature mu2 |
17 | numerical | non-negative number | 1st dimension of multi-value feature mu3 |
18 | numerical | non-negative number | 2nd dimension of multi-value feature mu3 |
19 | numerical | non-negative number | 3rd dimension of multi-value feature mu3 |
20 | numerical | non-negative number | 4th dimension of multi-value feature mu3 |
21 | numerical | non-negative number | 5th dimension of multi-value feature mu3 |
22 | numerical | non-negative number | 6th dimension of multi-value feature mu3 |
23 | numerical | non-negative number | 7th dimension of multi-value feature mu3 |
24 | numerical | non-negative number | 8th dimension of multi-value feature mu3 |
25 | numerical | non-negative number | 9th dimension of multi-value feature mu3 |
26 | numerical | non-negative number | 10th dimension of multi-value feature mu3 |
27 | numerical | non-negative number | 1st dimension of multi-value feature mu4 |
28 | numerical | non-negative number | 2nd dimension of multi-value feature mu4 |
29 | numerical | non-negative number | 3rd dimension of multi-value feature mu4 |
30 | numerical | non-negative number | |
31 | numerical | non-negative number | |
32 | categorical | 8 categories, 0-7 | |
33 | categorical | 8 categories, 0-7 | |
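
The column layout in the table above can be captured as a small lookup that splits a parsed user row into its categorical fields and the multi-value features mu1 to mu4. This is a sketch: it assumes each row has already been parsed into a list with the pv-id first, and that categorical values are stored as plain integers. The names `split_user_row`, `categorical`, and `other_numerical` are hypothetical.

```python
# 1-based, inclusive column ranges taken from the feature table above.
MULTI_VALUE_COLUMNS = {
    "mu1": (9, 11),
    "mu2": (12, 16),
    "mu3": (17, 26),
    "mu4": (27, 29),
}
CATEGORICAL_COLUMNS = [2, 3, 4, 5, 6, 7, 8, 32, 33]
OTHER_NUMERICAL_COLUMNS = [30, 31]

def split_user_row(row):
    """Split one parsed row (a list, pv-id first) into named groups.

    Column k of the table is row[k - 1] because Python lists are 0-based.
    """
    groups = {"pv_id": row[0]}
    groups["categorical"] = [int(row[c - 1]) for c in CATEGORICAL_COLUMNS]
    for name, (first, last) in MULTI_VALUE_COLUMNS.items():
        groups[name] = [float(v) for v in row[first - 1:last]]
    groups["other_numerical"] = [float(row[c - 1]) for c in OTHER_NUMERICAL_COLUMNS]
    return groups
```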
The 1st column of the file ru_item_train.zip is also the anonymous pv-id. Here is the feature and label description of ru_item_train.zip:
Column number | Type | Range | Description |
---|---|---|---|
2 | numerical | [0, 1] | |
3 | numerical | [0, 1] | |
4 | numerical | [0, 1] | |
5 | numerical | [0, 1] | |
6 | numerical | [0, 1] | |
7 | numerical | [0, 1] | |
8 | numerical | [0, 1] | |
9 | numerical | [0, 1] | |
10 | numerical | [0, 1] | |
11 | numerical | [0, 1] | |
12 | numerical | [0, 1] | |
13 | numerical | [0, 1] | |
14 | numerical | [0, 1] | |
15 | numerical | [0, 1] | |
16 | numerical | [0, 1] | |
17 | numerical | [0, 1] | |
18 | numerical | [0, 1] | |
19 | integer | [0, 1] | |
20 | numerical | [0, 1] | |
21 | numerical | [0, 1] | |
22 | numerical | [0, 1] | |
23 | numerical | [0, 1] | |
24 | numerical | [0, 1] | |
25 | numerical | [0, 1] | |
26 | numerical | [0, 1] | |
27 | numerical | [0, 1] | |
28 | numerical | [0, 1] | |
29 | numerical | [0, 1] | |
30 | numerical | [0, 1] | |
31 | numerical | [0, 1] | |
32 | numerical | [0, 1] | |
33 | numerical | [0, 1] | |
34 | integer | [0, 1] | |
35 | integer | [0, 1] | |
36 | integer | [0, 1] | |
37 | numerical | [0, 1] | |
38 | numerical | [0, 1] | |
39 | numerical | [0, 1] | |
40 | integer | [0, 1] | |
41 | integer | [0, 1] | |
42 | integer | [0, 1] | |
43 | numerical | [0, 1] | |
44 | numerical | [0, 1] | |
45 | numerical | [0, 1] | |
46 | numerical | [0, 1] | |
47 | numerical | [0, 1] | |
48 | numerical | [0, 1] | |
49 | integer | {0, 1, 2} | label of user's feedback, 0: impression, 1: click, 2: purchase |
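
Many LTR setups train separate click and purchase targets, so the 3-level label in column 49 often needs to be converted into binary flags. The sketch below shows one possible mapping, not the one used by the dataset authors. It treats a purchase as implying a click, following the session description in which a user clicks a product before deciding to purchase. It expects the label as an integer, so values read from CSV need `int(...)` first. The function name `to_binary_targets` is hypothetical.

```python
IMPRESSION, CLICK, PURCHASE = 0, 1, 2  # label values from column 49

def to_binary_targets(label):
    """Map the 3-level feedback label to (clicked, purchased) flags.

    A purchase is counted as a click as well, which is an assumption
    based on the click-then-purchase session flow.
    """
    if label not in (IMPRESSION, CLICK, PURCHASE):
        raise ValueError(f"unexpected label: {label!r}")
    return int(label >= CLICK), int(label == PURCHASE)
```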
To acknowledge use of the dataset in publications, please cite the following paper:
@inproceedings{peng2020improving,
  author={Pengcheng Li and
          Runze Li and
          Qing Da and
          An-Xiang Zeng and
          Lijun Zhang},
  title={Improving Multi-Scenario Learning to Rank in E-commerce by Exploiting Task Relationships in the Label Space},
  booktitle={Proceedings of the 29th {ACM} International Conference on Information and Knowledge Management, {CIKM} 2020, Virtual Event, Ireland, October 19-23, 2020},
  publisher={{ACM}},
  address={New York, NY, USA},
  year={2020}
}
If you have published a paper using our dataset, please send the publication URL to tianchi_open_dataset@alibabacloud.com. We will keep statistics on citations and contact you to send a Tianchi gift.
The dataset is distributed under the CC BY-NC-SA 4.0 license.