  • OneID

    Providers : Deep Algorithm Alibaba

    Posted : 2017.06.29

    #Participants : 0

Data Set Description

Documents (download requires login): TrainingSet1, TrainingSet2, TestingSet1, TestingSet2, Truth1, Truth2

Data Set Generation

We sample users by their home address in city A and divide the sample into test and training sets by district. One week of wireless and PC browser logs for those users is provided. For all device_ids and cookies that belong to the same user, we generate the matching pairs {(device_id_0, cookieid), (device_id_1, cookieid), (device_id_2, cookieid)} and the ground truth.
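The pair generation described above can be sketched roughly as follows. This is only an illustration, not the organizers' script: the helper name truth_pairs, the dict-per-row input, the "cookieid" key and the treatment of an empty string as a missing device slot are all assumptions.

```python
from collections import defaultdict
from itertools import product

DEVICE_SLOTS = ("device_id_0", "device_id_1", "device_id_2")

def truth_pairs(device_rows, cookie_rows):
    """Return {(slot, device_id, cookieid)} for ids that share a user_id."""
    devices = defaultdict(set)  # user_id -> {(slot, device_id)}
    cookies = defaultdict(set)  # user_id -> {cookieid}
    for row in device_rows:
        for slot in DEVICE_SLOTS:
            if row.get(slot):  # assumption: empty string means "no id"
                devices[row["user_id"]].add((slot, row[slot]))
    for row in cookie_rows:
        cookies[row["user_id"]].add(row["cookieid"])
    pairs = set()
    for user, devs in devices.items():
        # Every device id of a user is paired with every cookie of that user.
        for (slot, dev), cookie in product(devs, cookies.get(user, ())):
            pairs.add((slot, dev, cookie))
    return pairs
```

Keeping the slot name in each tuple preserves the three relation types listed above.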

Evaluation Metrics
Find all correct (device id, cookieid) pairs, covering the three types of relation {(device_id_0, cookieid), (device_id_1, cookieid), (device_id_2, cookieid)} that belong to the same person. Submissions are evaluated using the F1 measure.
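A minimal sketch of the F1 measure over predicted pairs. The page does not say how the three relation types are combined, so this sketch assumes all pairs are pooled into a single set; the function name pair_f1 is illustrative.

```python
def pair_f1(predicted, truth):
    """Precision, recall and F1 over sets of (device_id, cookieid) pairs."""
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: 2 of 3 predictions are correct, 2 of 4 true pairs are found.
truth = {("d1", "c1"), ("d2", "c1"), ("d3", "c2"), ("d4", "c3")}
predicted = {("d1", "c1"), ("d3", "c2"), ("d9", "c9")}
precision, recall, f1 = pair_f1(predicted, truth)
```

In the toy example precision is 2/3, recall is 1/2 and F1 is 4/7.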

Data Set
Training Set:
Table 1: ijcai_device_encode_training.csv

FieldName | Value Type | Description
user_id | ID String | User ID
device_id_0 | ID String | Device ID
device_id_1 | ID String | Device ID
device_id_2 | ID String | Device ID
Ip | categorical | IP, encrypted per segment; for example, "1.2.3.4" is encrypted as "encrypt(1).encrypt(2).encrypt(3).encrypt(4)"
search_keyword | categorical | User search keyword; for example, "nike shoe" is encrypted as "encrypt(nike) encrypt(shoe)"
auction_id | categorical | Auction ID
shop_id | categorical | Shop ID
Geohash6 | categorical | User page-view location: encrypted GeoHash string of length 6; for GeoHash see https://en.wikipedia.org/wiki/Geohash
Geohash7 | categorical | User page-view location: encrypted GeoHash string of length 7; for GeoHash see https://en.wikipedia.org/wiki/Geohash
Geohash8 | categorical | User page-view location: encrypted GeoHash string of length 8; for GeoHash see https://en.wikipedia.org/wiki/Geohash
reach_time | Time | Page-view time
Os | categorical | Operating system, such as "Android"
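For readers unfamiliar with GeoHash, the following sketch shows what the lengths 6, 7 and 8 mean before encryption: each extra character refines the same cell, so a shorter GeoHash is a prefix of a longer one for the same point. This is the standard public algorithm, not code from the data set. Whether the encrypted Geohash6/7/8 values keep this prefix relation is not stated on this page.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, length):
    """Encode a latitude/longitude as a GeoHash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value, lon_lo = (value << 1) | 1, mid
            else:
                value, lon_hi = value << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value, lat_lo = (value << 1) | 1, mid
            else:
                value, lat_hi = value << 1, mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)

g6 = geohash_encode(57.64911, 10.40744, 6)
g8 = geohash_encode(57.64911, 10.40744, 8)
```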

Table 2: ijcai_cookie_encode_training.csv

FieldName | Value Type | Description
user_id | ID String | User ID
data_time | Time | Page-view time
Cookieid | ID String | Cookie user ID
Url | categorical | Encrypted URL; for example, "www.taobao.com?search_keyword=shoe&auction_id=999" is encrypted as "encrypt(www).encrypt(taobao).encrypt(com)?encrypt(search_keyword)=encrypt(shoe)&encrypt(auction_id)=encrypt(999)"
url_domain1 | categorical | URL domain, such as "taobao.com", encrypted as "encrypt(taobao.com)"
url_domain2 | categorical | URL subdomain, such as "m.taobao.com", encrypted as "encrypt(m.taobao.com)"
search_keyword | categorical | User search keyword; for example, "nike shoe" is encrypted as "encrypt(nike) encrypt(shoe)"
auction_id | categorical | Auction ID
shop_id | categorical | Shop ID
ip | categorical | IP, encrypted per segment; for example, "1.2.3.4" is encrypted as "encrypt(1).encrypt(2).encrypt(3).encrypt(4)"
title | categorical | Page title; for example, "NBA Basketball" is encrypted as "encrypt(NBA) encrypt(Basketball)"
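The token-wise encryption described in the tables can be illustrated as below. The real encrypt() is not specified on this page; the truncated SHA-256 used here is only a stand-in, and the helper names are hypothetical. The point is that delimiters survive, so two IPs that share leading segments still share an encrypted prefix.

```python
import hashlib
import re

def encrypt(token):
    """Stand-in for the unspecified encrypt(): a deterministic token hash."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:8]

def encrypt_ip(ip):
    # "1.2.3.4" -> "encrypt(1).encrypt(2).encrypt(3).encrypt(4)"
    return ".".join(encrypt(segment) for segment in ip.split("."))

def encrypt_keywords(text):
    # "nike shoe" -> "encrypt(nike) encrypt(shoe)"
    return " ".join(encrypt(word) for word in text.split())

def encrypt_url(url):
    # Encrypt every run of characters between the delimiters . ? = &
    # (only the delimiters that appear in the page's example are handled).
    return re.sub(r"[^.?=&]+", lambda m: encrypt(m.group(0)), url)

same_subnet = (
    encrypt_ip("1.2.3.4").rsplit(".", 1)[0]
    == encrypt_ip("1.2.3.9").rsplit(".", 1)[0]
)
```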

Testing Sets
Table 3: ijcai_device_encode_test_sample.csv (the structure is similar to Table 1)
Table 4: ijcai_cookie_encode_test_sample.csv (the structure is similar to Table 2)

Ground Truth
Table 5: ijcai_device_encode_test.csv
Table 6: ijcai_cookie_encode_test.csv

Baseline Results
Step 1: Generate device_id pairs from each record: (device_id_0, device_id_1), (device_id_1, device_id_2), (device_id_0, device_id_2).

Step 2: Build a graph G(V, E), where V is the set of device_ids and E is the set of device_id pairs from Step 1. Find all connected components of G(V, E); each component represents a device.
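Steps 1 and 2 might be sketched with a union-find structure as follows. This is one possible reading, not the baseline's actual code: it assumes vertices are device ids, edges are the pairs from Step 1, empty strings mark missing ids, and ids are keyed by slot so that equal strings in different slots do not collide. The function name device_components is illustrative.

```python
from itertools import combinations

def device_components(records):
    """records: iterable of (device_id_0, device_id_1, device_id_2) tuples."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for record in records:
        # Vertices are (slot, id) tuples; empty slots are skipped.
        nodes = [(slot, d) for slot, d in enumerate(record) if d]
        for node in nodes:
            find(node)  # register records that carry a single id
        for a, b in combinations(nodes, 2):  # Step 1: pairs within a record
            union(a, b)

    # Step 2: group vertices by their root to get connected components.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

components = device_components([("a", "b", ""), ("", "b", "c"), ("x", "", "")])
```

In the toy input, the first two records are linked through the shared id "b" and form one component, while "x" stays on its own.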

Step 3: Find all (device_id, cookieid) pairs that share the same IP address as potential matching pairs.
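The candidate generation in Step 3 amounts to a join on the encrypted IP. A rough sketch, assuming each log has already been reduced to (id, ip) tuples; the function name candidate_pairs is hypothetical.

```python
from collections import defaultdict

def candidate_pairs(device_ips, cookie_ips):
    """Return {(device_id, cookieid)} for ids seen on at least one common IP.

    device_ips: iterable of (device_id, ip); cookie_ips: iterable of (cookieid, ip).
    """
    devices_by_ip = defaultdict(set)
    for device, ip in device_ips:
        devices_by_ip[ip].add(device)
    pairs = set()
    for cookie, ip in cookie_ips:
        for device in devices_by_ip.get(ip, ()):
            pairs.add((device, cookie))
    return pairs

pairs = candidate_pairs(
    [("d1", "ipA"), ("d2", "ipA"), ("d3", "ipB")],
    [("c1", "ipA"), ("c2", "ipC")],
)
```

Indexing one side by IP avoids comparing every device with every cookie.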

Features for device & cookie:
A: Sparse Features:
1: ip feature: ip1:day_cnt, ip2:day_cnt
2: search_keyword_feature: search_keyword1:pv, search_keyword2:pv …
3: auction_id_feature: auction_id1:pv, auction_id2:pv …
4: shop_id_feature: shop_id1:pv, shop_id2:pv …
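One way to build the sparse features for a single device or cookie is sketched below. The page does not define day_cnt or pv precisely; this sketch assumes day_cnt is the number of distinct days on which an IP was seen and pv is the number of log rows containing the value. The "day" key (derived from reach_time or data_time) and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def sparse_features(logs):
    """logs: rows (dicts) of one device or cookie with keys
    day, ip, search_keyword, auction_id, shop_id."""
    ip_days = defaultdict(set)
    pv = {"search_keyword": Counter(), "auction_id": Counter(), "shop_id": Counter()}
    for row in logs:
        if row.get("ip"):
            ip_days[row["ip"]].add(row["day"])
        # Keywords are space-separated encrypted tokens.
        for keyword in (row.get("search_keyword") or "").split():
            pv["search_keyword"][keyword] += 1
        for field in ("auction_id", "shop_id"):
            if row.get(field):
                pv[field][row[field]] += 1
    features = {"ip": {ip: len(days) for ip, days in ip_days.items()}}
    features.update({name: dict(counts) for name, counts in pv.items()})
    return features

logs = [
    {"day": "d1", "ip": "A", "search_keyword": "x y", "auction_id": "a1", "shop_id": "s1"},
    {"day": "d1", "ip": "A", "search_keyword": "x", "auction_id": "", "shop_id": "s1"},
    {"day": "d2", "ip": "A", "search_keyword": "", "auction_id": "a1", "shop_id": ""},
    {"day": "d2", "ip": "B", "search_keyword": "", "auction_id": "", "shop_id": ""},
]
features = sparse_features(logs)
```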

B: Sparse Features Hash:
1: ip feature hash (dimension 10000): feature hash of ip feature
2: search_keyword feature hash (dimension 10000): feature hash of search_keyword feature
3: auction id feature hash (dimension 10000): feature hash of auction id feature
4: shop id feature hash (dimension 10000): feature hash of shop id feature
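Feature hashing folds an open-ended sparse feature into a fixed number of buckets (10000 here). The sketch below is a plain unsigned hashing trick; the baseline's actual hash function is not given, so SHA-256 is an assumption, as is the function name hash_features.

```python
import hashlib

def hash_features(sparse, dim=10000):
    """Map a {feature_name: value} dict to a sparse {bucket_index: value} dict."""
    hashed = {}
    for name, value in sparse.items():
        digest = hashlib.sha256(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % dim
        # Features that collide in the same bucket are summed.
        hashed[index] = hashed.get(index, 0.0) + value
    return hashed

ip_hashed = hash_features({"ip1": 3, "ip2": 1})
```

The output dimension is fixed regardless of how many distinct ips or keywords occur, at the cost of occasional collisions.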

C: Similarity Features:
1: Cosine similarity of the sparse feature vectors for device and cookie;
2: Count of shared ip, search_keyword, auction_id and shop_id values between device and cookie.
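Both similarity features can be computed directly on the sparse dicts. A minimal sketch, assuming each side is a {value: weight} dict for one feature type (for example ip to day_cnt); the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def shared_count(u, v):
    """Number of values (ips, keywords, ...) present on both sides."""
    return len(u.keys() & v.keys())

device_ip = {"ip1": 2, "ip2": 1}
cookie_ip = {"ip1": 2, "ip3": 1}
similarity = cosine(device_ip, cookie_ip)
shared = shared_count(device_ip, cookie_ip)
```

For the toy vectors the dot product is 4 and both norms are sqrt(5), so the cosine similarity is 0.8 with one shared ip.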

Model & Results:

Model | Features | Precision | Recall | F1
LR | Similarity Features | 0.759 | 0.2 | 0.32
GBDT | Similarity Features | 0.7416 | 0.224 | 0.344
LR+L1 | Sparse Features Hash + Similarity Features | 0.775 | 0.294 | 0.426