3. Agenda
¨ Avito Demand Prediction Overview
¨ Competition Pipeline
¨ Best Single Model (LightGBM)
¨ Diverse Models
¨ Ynktk’s Best NN
¨ Kohei’s Ensemble
¨ China Competitions/Kagglers
¨ Q & A
6. Description
「Prediction」 Predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted), and historical demand for similar ads in similar contexts.
Avito: Russia's largest classified advertisements website.
8. Data Description
¨ ID
item_id,user_id
¨ Numeric
price
¨ Category
region, city, parent_category_name, user_type, category_name, param_1, param_2, param_3, image_top_1
¨ Text
title,description
¨ Image
image
¨ Sequence
item_seq_number
¨ Date
activation_date,date_from,date_to
Train/Test: user_id, ……, item_id, target
Active: user_id, ……, item_id
Periods: item_id, date_from, date_to
(Active/Periods are supplemental data with train's schema, minus deal_probability, image, and image_top_1)
10. Pipeline
Timeline: one week (Description / Kernel / Discussion) → one week (Baseline Design) → one month (Feature Engineering onwards)
[Baseline]
1. Table Data Model
2. Text Data Model (kept separate to reduce wait time)
[Validation]
KFold: 5
Feature validation: one feature at a time
Validation score: 5-fold
[Feature Engineering]
LightGBM (Table + Text + Image)
Feature save: 1 feature per pickle file
Also reused: teammates' features and diverse models' OOF predictions
[Validation]
KFold: 5
Feature validation: one feature at a time, or by group
Validation score: 1 fold
[Parameter Tuning]
Manually
15. Supplemental Data Feature
¨ Calculate each item's up days
all_periods['days_up'] = all_periods['date_to'].dt.dayofyear - all_periods['date_from'].dt.dayofyear
¨ Count and Sum of item's up days
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
¨ Merge to main table
df_all = df_all.merge(all_periods, on='item_id', how='left')
※ Digging deep into the business-relevant parts of the supplemental data is important.
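A minimal sketch of how those {groupby, target, agg} specs could be applied (the loop and naming convention are assumptions, chosen to match the 'item_id_count_days_up' feature name that appears on the next slide):

agg_specs = [
    {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
    {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
]
for spec in agg_specs:
    # e.g. 'item_id_count_days_up'
    name = '{}_{}_{}'.format('_'.join(spec['groupby']), spec['agg'], spec['target'])
    gp = (all_periods.groupby(spec['groupby'])[spec['target']]
                     .agg(spec['agg'])
                     .rename(name)
                     .reset_index())
    all_periods = all_periods.merge(gp, on=spec['groupby'], how='left')

# Deduplicate to one row per item_id before the merge into df_all above
all_periods = all_periods.drop_duplicates(subset='item_id')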
16. Impute Null Values
¨ Fillna with 0
df_all['price'] = df_all['price'].fillna(0)
¨ Fillna with median (per category)
enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
enc.columns = ['category_name', 'count_days_up_impute']
df_all = pd.merge(df_all, enc, how='left', on='category_name')
df_all['item_id_count_days_up_impute'].fillna(df_all['count_days_up_impute'], inplace=True)
¨ Fillna with model prediction value
RNN(text) -> image_top_1 (renamed image_top_2)
※ The magic feature we never found: df['price'] - df[RNN(text) -> price]
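A sketch of the model-prediction imputation pattern: train on rows where image_top_1 is known, predict it where missing, and keep the result under a new name. A linear classifier on TF-IDF text stands in here for the RNN the team actually used; df_all and its 'description' column are assumed:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

text = df_all['description'].fillna('')
known = df_all['image_top_1'].notnull()

vec = TfidfVectorizer(max_features=100000)
X = vec.fit_transform(text)

# Stand-in model; the slide's approach used an RNN over the text instead
clf = SGDClassifier()
clf.fit(X[known.values], df_all.loc[known, 'image_top_1'].astype(int))

df_all['image_top_2'] = df_all['image_top_1']
df_all.loc[~known, 'image_top_2'] = clf.predict(X[(~known).values])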
18. Text Feature
¨ SVD for Title
# df_all / train / test title columns are assumed; TfidfVectorizer and
# TruncatedSVD come from sklearn.feature_extraction.text / sklearn.decomposition
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
full_title_tfidf = tfidf_vec.fit_transform(df_all['title'].fillna(''))
train_title_tfidf = tfidf_vec.transform(train['title'].fillna(''))
test_title_tfidf = tfidf_vec.transform(test['title'].fillna(''))
svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')
svd_title_obj.fit(full_title_tfidf)
train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
19. Text Feature
¨ Count Unique Feature
import re
# Helper: count occurrences of a regex pattern in a string
def count_regexp_occ(regexp, text):
    return len(re.findall(regexp, str(text)))

# punct, emoji, stopwords: assumed pre-built character/word sets
for cols in ['text', 'title', 'param_123']:
    df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
    df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
    df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
    df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
    df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
    df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
    df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
    df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
    df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
    df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
    df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))  # newline count, not '/n'
    df_all[cols + '_num_chars'] = df_all[cols].apply(len)  # number of characters
    df_all[cols + '_num_words'] = df_all[cols].apply(lambda x: len(x.split()))
    df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda x: len(set(x.split())))
    df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
    df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
    df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
    df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
20. Text Feature
¨ Ynktk's Word Embeddings
u Self-trained FastText
model = FastText(PathLineSentences(train+test+train_active+test_active),
    size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
u Self-trained Word2Vec
model = Word2Vec(PathLineSentences(train+test+train_active+test_active),
    size=300, window=5, min_count=5, seed=seed, workers=32)
※ Embeddings trained on the competition's own text were more effective than embeddings pre-trained on Wikipedia and the like, probably because proper nouns such as product names were predictive of the target.
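One plausible way to turn those self-trained vectors into features (a sketch assuming the gensim model above; mean-pooling of word vectors is a common choice, not necessarily what the team did):

import numpy as np

def text_to_vec(model, text, dim=300):
    # Average the vectors of in-vocabulary tokens; zeros if none match
    words = [w for w in str(text).split() if w in model.wv]
    if not words:
        return np.zeros(dim)
    return np.mean([model.wv[w] for w in words], axis=0)

title_vecs = np.vstack([text_to_vec(model, t) for t in df_all['title'].fillna('')])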
21. Image Feature
¨ Meta Feature
u Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurrness, Whiteness, Dullness
u Dullness - Whiteness (interaction feature)
¨ Pre-trained Prediction Feature
u VGG16 prediction value
u ResNet50 prediction value
¨ Ynktk's Feature
u Top finishers extracted image features with VGG and the like, but hand-crafted features were effective too
u NIMA [1]
u Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
[1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment.
[2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising.
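A sketch of a few of the hand-crafted meta features (assuming Pillow, NumPy, and OpenCV; the formulas are simple stand-ins, e.g. blurriness as variance of the Laplacian and whiteness/dullness as shares of near-white/near-black pixels, not necessarily the exact definitions used):

import numpy as np
import cv2
from PIL import Image

def image_meta_features(path):
    img = Image.open(path).convert('RGB')
    arr = np.asarray(img, dtype=np.float32)
    gray = np.asarray(img.convert('L'), dtype=np.uint8)
    return {
        'width': img.width,
        'height': img.height,
        'average_red': float(arr[..., 0].mean()),
        'average_green': float(arr[..., 1].mean()),
        'average_blue': float(arr[..., 2].mean()),
        # Variance of the Laplacian: low values suggest a blurry image
        'blurrness': float(cv2.Laplacian(gray, cv2.CV_64F).var()),
        # Shares of near-white / near-black pixels
        'whiteness': float((gray > 245).mean()),
        'dullness': float((gray < 10).mean()),
    }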
22. Parameter Tuning
q Manual tuning, using multiple servers
params = {
    'boosting_type': 'gbdt',
    'objective': 'xentropy',  # target behaves like a binary-classification probability
    'metric': 'rmse',
    'learning_rate': 0.02,
    'num_leaves': 600,
    'max_depth': -1,
    'max_bin': 256,
    'bagging_fraction': 1,
    'feature_fraction': 0.1,  # low because of the sparse text vectors
    'verbose': 1
}
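For context, a minimal training call with these params (a sketch; the Dataset inputs and num_boost_round are assumptions):

import lightgbm as lgb

dtrain = lgb.Dataset(X_train, label=y_train)  # label = deal_probability in [0, 1]
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

model = lgb.train(params, dtrain, num_boost_round=10000,
                  valid_sets=[dvalid],
                  callbacks=[lgb.early_stopping(200)])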
24. Best LightGBM Summary
¨ Table Feature Number
~250
¨ Text Feature Number
1,500,000+
¨ Image Feature Number
50+
¨ Total Feature Number
1,503,424
¨ Public LB
better than 0.2174
¨ Private LB
better than 0.2210
25. Diversity
(Diversity axes: loss, data set, feature set, parameters, NN structure)
¨ LightGBM — Loss: xentropy, regression, huber, fair, auc; Data set: with/without active data; Feature set: Table / Table + Text / Table + Text + Image / Table + Text + Image + Ridge_meta; Parameters: learning_rate, num_leaves
¨ XGBoost — Loss: reg:linear, binary:logistic; Data set: with/without active data; Feature set: Table + Text / Table + Text + Image
¨ CatBoost — Loss: binary cross entropy; Data set: with active data; Feature set: Table + Image
¨ Random Forest — Loss: regression; Data set: with active data; Feature set: Table + Text + Image
¨ Ridge Regression — Loss: regression; Data set: without active data; Feature set: Text / Table + Text + Image; Parameter: TF-IDF max_features
¨ Neural network — Loss: regression, binary cross entropy; Data set: with active data; Feature set: Table + Text + Image + word embeddings; Parameters: layer size, dropout, BatchNorm, pooling; NN structure: rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
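The Ridge_meta and OOF entries refer to stacking; a minimal sketch of producing out-of-fold predictions that other models can consume as a feature (names are assumptions; X is a TF-IDF matrix and y is deal_probability as a NumPy array):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

oof = np.zeros(X.shape[0])
for tr_idx, va_idx in KFold(5, shuffle=True, random_state=42).split(X):
    ridge = Ridge(alpha=5.0)
    ridge.fit(X[tr_idx], y[tr_idx])
    oof[va_idx] = ridge.predict(X[va_idx])

df_train['ridge_meta'] = np.clip(oof, 0, 1)  # use as a feature in LightGBM etc.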
26. Ynktk's Best NN (architecture diagram, flattened)
¨ Inputs: Numerical, Categorical, Image, Text
u Categorical → Embedding; Numerical / Image → Dense
u Text → Embedding → SpatialDropout → LSTM → GRU → Conv1D → LeakyReLU → GAP / GMP → Concat
¨ Head: Concat → BatchNorm → LeakyReLU → Dropout → Dense → LeakyReLU → BatchNorm → Dense
¨ Callbacks: EarlyStopping, ReduceLROnPlateau
¨ Optimizer: Adam with clipvalue 0.5, learning rate 2e-03
¨ Loss: binary cross entropy
Priv LB: 0.2225 / Pub LB: 0.2181
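A minimal Keras sketch of one plausible reading of that diagram (layer sizes, vocabulary sizes, and the exact wiring are assumptions; the image branch is fed pre-extracted feature vectors):

from tensorflow import keras
from tensorflow.keras import layers

def build_model(num_dim=50, img_dim=100, cat_vocab=500, txt_vocab=100000, seq_len=150):
    num_in = keras.Input(shape=(num_dim,), name='numerical')
    img_in = keras.Input(shape=(img_dim,), name='image')
    cat_in = keras.Input(shape=(1,), name='categorical')
    txt_in = keras.Input(shape=(seq_len,), name='text')

    # Categorical branch: embedding
    cat = layers.Flatten()(layers.Embedding(cat_vocab, 16)(cat_in))

    # Text branch: Embedding -> SpatialDropout -> LSTM -> GRU -> Conv1D -> GAP/GMP
    txt = layers.Embedding(txt_vocab, 300)(txt_in)
    txt = layers.SpatialDropout1D(0.2)(txt)
    txt = layers.LSTM(128, return_sequences=True)(txt)
    txt = layers.GRU(128, return_sequences=True)(txt)
    txt = layers.LeakyReLU()(layers.Conv1D(64, 3)(txt))
    txt = layers.concatenate([layers.GlobalAveragePooling1D()(txt),
                              layers.GlobalMaxPooling1D()(txt)])

    # Head: Concat -> BatchNorm -> LeakyReLU -> Dropout -> Dense -> ... -> Dense
    x = layers.concatenate([num_in, img_in, cat, txt])
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(256)(x)
    x = layers.LeakyReLU()(x)
    x = layers.BatchNormalization()(x)
    out = layers.Dense(1, activation='sigmoid')(x)

    model = keras.Model([num_in, cat_in, img_in, txt_in], out)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=2e-3, clipvalue=0.5),
                  loss='binary_crossentropy')
    return model

callbacks = [keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
             keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=1)]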
29. China Competitions & Platform
¨ Platform — Kaggle: Kaggle; China: Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain… [1]
¨ Rounds / Public-Private LB — Kaggle: Round 1: 2~3 months with Public/Private LB; China: Round 1: 1.5 months Public + 3 days Private; Round 2: 2 weeks Public + 3 days Private; Round 3: presentation
¨ Submissions per day — Kaggle: 5; China: Public 3, Private 1
¨ Prize — Kaggle: Top 3; China: Top 5/10/50
[1] https://github.com/iphysresearch/DataSciComp