3. Agenda
¨ Avito Demand Prediction Overview
¨ Competition Pipeline
¨ Best Single Model (LightGBM)
¨ Diverse Models
¨ Ynktk’s Best NN
¨ Kohei’s Ensemble
¨ China Competitions/Kagglers
¨ Q & A
6. Description
「Prediction」 Predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted), and historical demand for similar ads in similar contexts.
Avito: Russia's largest classified advertisements website.
8. Data Description
¨ ID
item_id,user_id
¨ Numeric
price
¨ Category
region, city, parent_category_name, user_type, category_name, param_1, param_2, param_3, image_top_1
¨ Text
title,description
¨ Image
image
¨ Sequence
item_seq_number
¨ Date
activation_date,date_from,date_to
Train/Test: user_id, ……, item_id, target
Active: user_id, ……, item_id
Periods: item_id, date_from, date_to
(Active/Periods are supplemental data with train's schema, minus deal_probability, image, and image_top_1)
10. Pipeline
Timeline: one week (Description / Kernel / Discussion) → one week (Baseline Design) → one month (Feature Engineering onwards)
[Baseline]
1. Table Data Model
2. Text Data Model (kept separate to reduce wait time)
[Validation]
KFold: 5
Feature validation: one feature at a time
Validation score: 5-fold
[Feature Engineering]
LightGBM (Table + Text + Image)
Feature save: 1 feature per pickle file
Also reused: teammates' features and diverse models' OOF predictions
[Validation]
KFold: 5
Feature validation: one feature at a time, or by group
Validation score: 1 fold
[Parameter Tuning]
Manually
15. Supplemental Data Feature
¨ Calculate each item's up days
all_periods['days_up'] = all_periods['date_to'].dt.dayofyear - all_periods['date_from'].dt.dayofyear
¨ Count and Sum of item's up days
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
¨ Merge to main table
df_all = df_all.merge(all_periods, on='item_id', how='left')
※ Digging deep into the business-relevant parts of the supplemental data is important.
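A minimal sketch of how those {groupby, target, agg} specs could be applied (the loop and naming convention are assumptions, chosen to match the 'item_id_count_days_up' feature name that appears on the next slide):

agg_specs = [
    {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
    {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
]
for spec in agg_specs:
    # e.g. 'item_id_count_days_up'
    name = '{}_{}_{}'.format('_'.join(spec['groupby']), spec['agg'], spec['target'])
    gp = (all_periods.groupby(spec['groupby'])[spec['target']]
                     .agg(spec['agg'])
                     .rename(name)
                     .reset_index())
    all_periods = all_periods.merge(gp, on=spec['groupby'], how='left')

# Deduplicate to one row per item_id before the merge into df_all above
all_periods = all_periods.drop_duplicates(subset='item_id')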
16. Impute Null Values
¨ Fillna with 0
df_all['price'] = df_all['price'].fillna(0)
¨ Fillna with median (per category)
enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
enc.columns = ['category_name', 'count_days_up_impute']
df_all = pd.merge(df_all, enc, how='left', on='category_name')
df_all['item_id_count_days_up_impute'].fillna(df_all['count_days_up_impute'], inplace=True)
¨ Fillna with model prediction value
RNN(text) -> image_top_1 (renamed image_top_2)
※ The magic feature we never found: df['price'] - df[RNN(text) -> price]
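A sketch of the model-prediction imputation pattern: train on rows where image_top_1 is known, predict it where missing, and keep the result under a new name. A linear classifier on TF-IDF text stands in here for the RNN the team actually used; df_all and its 'description' column are assumed:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

text = df_all['description'].fillna('')
known = df_all['image_top_1'].notnull()

vec = TfidfVectorizer(max_features=100000)
X = vec.fit_transform(text)

# Stand-in model; the slide's approach used an RNN over the text instead
clf = SGDClassifier()
clf.fit(X[known.values], df_all.loc[known, 'image_top_1'].astype(int))

df_all['image_top_2'] = df_all['image_top_1']
df_all.loc[~known, 'image_top_2'] = clf.predict(X[(~known).values])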
18. Text Feature
¨ SVD for Title
# df_all / train / test title columns are assumed; TfidfVectorizer and
# TruncatedSVD come from sklearn.feature_extraction.text / sklearn.decomposition
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
full_title_tfidf = tfidf_vec.fit_transform(df_all['title'].fillna(''))
train_title_tfidf = tfidf_vec.transform(train['title'].fillna(''))
test_title_tfidf = tfidf_vec.transform(test['title'].fillna(''))
svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')
svd_title_obj.fit(full_title_tfidf)
train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
19. Text Feature
¨ Count Unique Feature
import re
# Helper: count occurrences of a regex pattern in a string
def count_regexp_occ(regexp, text):
    return len(re.findall(regexp, str(text)))

# punct, emoji, stopwords: assumed pre-built character/word sets
for cols in ['text', 'title', 'param_123']:
    df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
    df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
    df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
    df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
    df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
    df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
    df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
    df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
    df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
    df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
    df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))  # newline count, not '/n'
    df_all[cols + '_num_chars'] = df_all[cols].apply(len)  # number of characters
    df_all[cols + '_num_words'] = df_all[cols].apply(lambda x: len(x.split()))
    df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda x: len(set(x.split())))
    df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
    df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
    df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
    df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
20. Text Feature
¨ Ynktk's Word Embeddings
u Self-trained FastText
model = FastText(PathLineSentences(train+test+train_active+test_active),
    size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
u Self-trained Word2Vec
model = Word2Vec(PathLineSentences(train+test+train_active+test_active),
    size=300, window=5, min_count=5, seed=seed, workers=32)
※ Embeddings trained on the competition's own text were more effective than embeddings pre-trained on Wikipedia and the like, probably because proper nouns such as product names were predictive of the target.
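One plausible way to turn those self-trained vectors into features (a sketch assuming the gensim model above; mean-pooling of word vectors is a common choice, not necessarily what the team did):

import numpy as np

def text_to_vec(model, text, dim=300):
    # Average the vectors of in-vocabulary tokens; zeros if none match
    words = [w for w in str(text).split() if w in model.wv]
    if not words:
        return np.zeros(dim)
    return np.mean([model.wv[w] for w in words], axis=0)

title_vecs = np.vstack([text_to_vec(model, t) for t in df_all['title'].fillna('')])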
21. Image Feature
¨ Meta Feature
u Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurrness, Whiteness, Dullness
u Dullness - Whiteness (interaction feature)
¨ Pre-trained Prediction Feature
u VGG16 prediction value
u ResNet50 prediction value
¨ Ynktk's Feature
u Top finishers extracted image features with VGG and the like, but hand-crafted features were effective too
u NIMA [1]
u Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
[1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment.
[2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising.
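A sketch of a few of the hand-crafted meta features (assuming Pillow, NumPy, and OpenCV; the formulas are simple stand-ins, e.g. blurriness as variance of the Laplacian and whiteness/dullness as shares of near-white/near-black pixels, not necessarily the exact definitions used):

import numpy as np
import cv2
from PIL import Image

def image_meta_features(path):
    img = Image.open(path).convert('RGB')
    arr = np.asarray(img, dtype=np.float32)
    gray = np.asarray(img.convert('L'), dtype=np.uint8)
    return {
        'width': img.width,
        'height': img.height,
        'average_red': float(arr[..., 0].mean()),
        'average_green': float(arr[..., 1].mean()),
        'average_blue': float(arr[..., 2].mean()),
        # Variance of the Laplacian: low values suggest a blurry image
        'blurrness': float(cv2.Laplacian(gray, cv2.CV_64F).var()),
        # Shares of near-white / near-black pixels
        'whiteness': float((gray > 245).mean()),
        'dullness': float((gray < 10).mean()),
    }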
22. Parameter Tuning
q Manual tuning, using multiple servers
params = {
    'boosting_type': 'gbdt',
    'objective': 'xentropy',  # target behaves like a binary-classification probability
    'metric': 'rmse',
    'learning_rate': 0.02,
    'num_leaves': 600,
    'max_depth': -1,
    'max_bin': 256,
    'bagging_fraction': 1,
    'feature_fraction': 0.1,  # low because of the sparse text vectors
    'verbose': 1
}
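For context, a minimal training call with these params (a sketch; the Dataset inputs and num_boost_round are assumptions):

import lightgbm as lgb

dtrain = lgb.Dataset(X_train, label=y_train)  # label = deal_probability in [0, 1]
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

model = lgb.train(params, dtrain, num_boost_round=10000,
                  valid_sets=[dvalid],
                  callbacks=[lgb.early_stopping(200)])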
24. Best LightGBM Summary
¨ Table Feature Number
~250
¨ Text Feature Number
1,500,000+
¨ Image Feature Number
50+
¨ Total Feature Number
1,503,424
¨ Public LB
better than 0.2174
¨ Private LB
better than 0.2210
25. Diversity
(Diversity axes: loss, data set, feature set, parameters, NN structure)
¨ LightGBM — Loss: xentropy, regression, huber, fair, auc; Data set: with/without active data; Feature set: Table / Table + Text / Table + Text + Image / Table + Text + Image + Ridge_meta; Parameters: learning_rate, num_leaves
¨ XGBoost — Loss: reg:linear, binary:logistic; Data set: with/without active data; Feature set: Table + Text / Table + Text + Image
¨ CatBoost — Loss: binary cross entropy; Data set: with active data; Feature set: Table + Image
¨ Random Forest — Loss: regression; Data set: with active data; Feature set: Table + Text + Image
¨ Ridge Regression — Loss: regression; Data set: without active data; Feature set: Text / Table + Text + Image; Parameter: TF-IDF max_features
¨ Neural network — Loss: regression, binary cross entropy; Data set: with active data; Feature set: Table + Text + Image + word embeddings; Parameters: layer size, dropout, BatchNorm, pooling; NN structure: rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
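The Ridge_meta and OOF entries refer to stacking; a minimal sketch of producing out-of-fold predictions that other models can consume as a feature (names are assumptions; X is a TF-IDF matrix and y is deal_probability as a NumPy array):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

oof = np.zeros(X.shape[0])
for tr_idx, va_idx in KFold(5, shuffle=True, random_state=42).split(X):
    ridge = Ridge(alpha=5.0)
    ridge.fit(X[tr_idx], y[tr_idx])
    oof[va_idx] = ridge.predict(X[va_idx])

df_train['ridge_meta'] = np.clip(oof, 0, 1)  # use as a feature in LightGBM etc.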
26. Ynktk's Best NN (architecture diagram, flattened)
¨ Inputs: Numerical, Categorical, Image, Text
u Categorical → Embedding; Numerical / Image → Dense
u Text → Embedding → SpatialDropout → LSTM → GRU → Conv1D → LeakyReLU → GAP / GMP → Concat
¨ Head: Concat → BatchNorm → LeakyReLU → Dropout → Dense → LeakyReLU → BatchNorm → Dense
¨ Callbacks: EarlyStopping, ReduceLROnPlateau
¨ Optimizer: Adam with clipvalue 0.5, learning rate 2e-03
¨ Loss: binary cross entropy
Priv LB: 0.2225 / Pub LB: 0.2181
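A minimal Keras sketch of one plausible reading of that diagram (layer sizes, vocabulary sizes, and the exact wiring are assumptions; the image branch is fed pre-extracted feature vectors):

from tensorflow import keras
from tensorflow.keras import layers

def build_model(num_dim=50, img_dim=100, cat_vocab=500, txt_vocab=100000, seq_len=150):
    num_in = keras.Input(shape=(num_dim,), name='numerical')
    img_in = keras.Input(shape=(img_dim,), name='image')
    cat_in = keras.Input(shape=(1,), name='categorical')
    txt_in = keras.Input(shape=(seq_len,), name='text')

    # Categorical branch: embedding
    cat = layers.Flatten()(layers.Embedding(cat_vocab, 16)(cat_in))

    # Text branch: Embedding -> SpatialDropout -> LSTM -> GRU -> Conv1D -> GAP/GMP
    txt = layers.Embedding(txt_vocab, 300)(txt_in)
    txt = layers.SpatialDropout1D(0.2)(txt)
    txt = layers.LSTM(128, return_sequences=True)(txt)
    txt = layers.GRU(128, return_sequences=True)(txt)
    txt = layers.LeakyReLU()(layers.Conv1D(64, 3)(txt))
    txt = layers.concatenate([layers.GlobalAveragePooling1D()(txt),
                              layers.GlobalMaxPooling1D()(txt)])

    # Head: Concat -> BatchNorm -> LeakyReLU -> Dropout -> Dense -> ... -> Dense
    x = layers.concatenate([num_in, img_in, cat, txt])
    x = layers.BatchNormalization()(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(256)(x)
    x = layers.LeakyReLU()(x)
    x = layers.BatchNormalization()(x)
    out = layers.Dense(1, activation='sigmoid')(x)

    model = keras.Model([num_in, cat_in, img_in, txt_in], out)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=2e-3, clipvalue=0.5),
                  loss='binary_crossentropy')
    return model

callbacks = [keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True),
             keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=1)]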
29. China Competitions & Platform
¨ Platform — Kaggle: Kaggle; China: Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain… [1]
¨ Rounds / Public-Private LB — Kaggle: Round 1: 2~3 months with Public/Private LB; China: Round 1: 1.5 months Public + 3 days Private; Round 2: 2 weeks Public + 3 days Private; Round 3: presentation
¨ Submissions per day — Kaggle: 5; China: Public 3, Private 1
¨ Prize — Kaggle: Top 3; China: Top 5/10/50
[1] https://github.com/iphysresearch/DataSciComp