KAGGLE AVITO DEMAND PREDICTION CHALLENGE 9TH PLACE SOLUTION
Kaggle Meetup Tokyo 5th – 2018.12.01
senkin13
About Me
¨  詹金 (せんきん)
¨  Kaggle ID: senkin13
¨  Infrastructure & DB Engineer
[Perfect World] [Square Enix]
¨  Big Data Engineer
[Square Enix] [OPT] [LINE] [Fast Retailing]
¨  Machine Learning Engineer
[Fast Retailing]
Background
KaggleName
Agenda
¨  Avito-demand-prediction Overview
¨  Competition Pipeline
¨  Best Single Model (Lightgbm)
¨  Diverse Models
¨  Ynktk’s Best NN
¨  Kohei’s Ensemble
¨  China Competitions/Kagglers
¨  Q & A
Our Team
Public LB:8th Private LB:9th
Description
[Prediction]
Predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts.
Avito: Russia’s largest classified advertisements website
Evaluation
Item_id        Deal_probability
b912c3c6a6ad   0.12789
2dac0150717d   0.00000
ba83aefab5dc   0.43177
02996f1dd2eas  0.80323
[Target]
Deal_probability is the likelihood that an ad actually sold something.
Range: 0 to 1
Data Description
¨  ID
item_id, user_id
¨  Numeric
price
¨  Category
region, city, parent_category_name, category_name, user_type, param_1, param_2, param_3, image_top_1
¨  Text
title, description
¨  Image
image
¨  Sequence
item_seq_number
¨  Date
activation_date, date_from, date_to
Train/Test: User_id, ……, Item_id, Target
Active: User_id, ……, Item_id (supplemental data with the columns of train minus deal_probability, image, and image_top_1)
Periods: Item_id, Date_from, Date_to
Train Test Period
Train: 2017-03-15 ~ 2017-04-05
Test: 2017-04-12 ~ 2017-04-20
Pipeline
Timeline: One Week, One Week, One Month
Description / Kernel / Discussion
Baseline Design
[Baseline]
1. Table Data Model
2. Text Data Model
(reduce wait time)
[Validation]
Kfold: 5
Feature Validation: one by one
Validate Score: 5 folds
[Feature Engineering]
LightGBM (Table + Text + Image)
Feature Save: 1 feature, 1 pickle file
[Validation]
Kfold: 5
Feature Validation: one by one or by group
Validate Score: 1 fold
[Parameter Tuning]
Manually
Teammates’ feature reuse
Diverse models’ OOF predictions
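The "1 feature, 1 pickle file" convention could look like the minimal sketch below. The helper names (`save_feature`, `load_features`) and the directory layout are assumptions for illustration, not the team's actual code, and plain lists stand in for DataFrame columns.

```python
import pickle
import tempfile
from pathlib import Path


def save_feature(feature_dir, name, values):
    """Write one feature column to its own pickle file (hypothetical helper)."""
    path = Path(feature_dir) / f"{name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(values, f)
    return path


def load_features(feature_dir, names):
    """Load only the requested features, keyed by feature name."""
    loaded = {}
    for name in names:
        with open(Path(feature_dir) / f"{name}.pkl", "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded


# Toy data: each feature is saved once and can be reused by any model or teammate.
feature_dir = tempfile.mkdtemp()
save_feature(feature_dir, "price_log", [0.0, 4.6, 6.9])
save_feature(feature_dir, "wday", [2, 5, 0])
subset = load_features(feature_dir, ["wday"])
print(subset)
```

Keeping features in separate files makes it cheap to validate them one by one or by group, since each experiment loads only the columns it needs.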
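Preprocessing
¨  Tabular data
df_all['price'] = np.log1p(df_all['price'])
df_all['city'] = df_all['city'] + '_' + df_all['region']  # city key combined with its region
¨  Text data
import re

def clean_text(s):
    # replace "м²", fractions like 1/2, and tokens like "2-к" / "2к" with a space
    s = re.sub(r'м²|\d+/\d|\d+-к|\d+к', ' ', s.lower())
    s = re.sub(r'\s+', ' ', s)  # collapse runs of whitespace
    s = s.strip()
    return s
¨  Image data
Delete 4 empty images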
Feature Engineering
¨  Date Feature
df_all['wday'] = df_all['activation_date'].dt.weekday
※ Use date-derived features whose values occur in both train and test
¨  Extended Text Feature
df_all['param_123'] = (df_all['param_1'].fillna('') + ' ' +
                       df_all['param_2'].fillna('') + ' ' +
                       df_all['param_3'].fillna('')).astype(str)

df_all['text'] = (df_all['description'].fillna('').astype(str) + ' ' +
                  df_all['title'].fillna('').astype(str) + ' ' +
                  df_all['param_123'].fillna('').astype(str))
※ This increases the number of words available for training
Aggregation Feature
¨  Unique
{'groupby': ['category_name'], 'target': 'image_top_1', 'agg': 'nunique'},
¨  Count
{'groupby': ['user_id'], 'target': 'item_id', 'agg': 'count'},
¨  Sum
{'groupby': ['parent_category_name'], 'target': 'price', 'agg': 'sum'},
¨  Mean
{'groupby': ['user_id'], 'target': 'price', 'agg': 'mean'},
¨  Median
{'groupby': ['image_top_1'], 'target': 'price', 'agg': 'median'},
¨  Max
{'groupby': ['image_top_1', 'user_id'], 'target': 'price', 'agg': 'max'},
¨  Min
{'groupby': ['user_id'], 'target': 'price', 'agg': 'min'},

※ Designing features from a business perspective is the most efficient approach
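One way the spec dictionaries above might be applied is sketched here with pandas. The function name `add_agg_features` is hypothetical, and the output naming `<groupby>_<agg>_<target>` is an assumption inferred from column names such as `user_id_mean_price` that appear on later slides; the data is made up.

```python
import pandas as pd

# Toy frame: real column names from the slides, made-up values.
df_all = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2"],
    "item_id": ["a", "b", "c", "d", "e"],
    "price": [10.0, 30.0, 5.0, 5.0, 20.0],
})

aggs = [
    {"groupby": ["user_id"], "target": "item_id", "agg": "count"},
    {"groupby": ["user_id"], "target": "price", "agg": "mean"},
    {"groupby": ["user_id"], "target": "price", "agg": "max"},
]


def add_agg_features(df, specs):
    """Append one column per spec, named <groupby>_<agg>_<target> (assumed convention)."""
    df = df.copy()
    for spec in specs:
        name = "_".join(spec["groupby"] + [spec["agg"], spec["target"]])
        # transform() broadcasts the group statistic back to every row of the group
        df[name] = df.groupby(spec["groupby"])[spec["target"]].transform(spec["agg"])
    return df


df_feat = add_agg_features(df_all, aggs)
print(df_feat[["user_id", "user_id_count_item_id", "user_id_mean_price"]])
```

Using `transform` keeps the row count unchanged, so no merge back onto the main table is needed.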
Interaction Feature
¨  Difference between two features
df_all['image_top_1_diff_price'] = df_all['price'] - df_all['image_top_1_mean_price']
df_all['category_name_diff_price'] = df_all['price'] - df_all['category_name_mean_price']
df_all['param_1_diff_price'] = df_all['price'] - df_all['param_1_mean_price']
df_all['param_2_diff_price'] = df_all['price'] - df_all['param_2_mean_price']
df_all['user_id_diff_price'] = df_all['price'] - df_all['user_id_mean_price']
df_all['region_diff_price'] = df_all['price'] - df_all['region_mean_price']
df_all['city_diff_price'] = df_all['price'] - df_all['city_mean_price']

※ Arithmetic features (sums, differences, products, ratios) that make business sense are strong
Supplemental Data Feature
¨  Calculate each item’s up days
all_periods['days_up'] = (all_periods['date_to'].dt.dayofyear -
                          all_periods['date_from'].dt.dayofyear)
¨  Count and Sum of item’s up days
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
{'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
¨  Merge to main table
df_all = df_all.merge(all_periods, on='item_id', how='left')

※ Digging deep into the business-related parts of the supplemental data is important
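A minimal sketch of the same idea on toy data follows. It differs from the slide in two ways that are assumptions of this sketch, not the author's code: it uses timedelta arithmetic instead of `dayofyear`, and it collapses the periods table to one row per item before merging. The column name `item_id_sum_days_up` is also assumed, by analogy with `item_id_count_days_up`.

```python
import pandas as pd

# Toy periods table: column names follow the slides, values are made up.
all_periods = pd.DataFrame({
    "item_id": ["a", "a", "b"],
    "date_from": pd.to_datetime(["2017-03-15", "2017-03-25", "2017-03-20"]),
    "date_to": pd.to_datetime(["2017-03-20", "2017-03-28", "2017-03-21"]),
})
df_all = pd.DataFrame({"item_id": ["a", "b", "c"]})

# Timedelta arithmetic also stays correct if a period crosses a year boundary.
all_periods["days_up"] = (all_periods["date_to"] - all_periods["date_from"]).dt.days

# Collapse to one row per item so the merge cannot duplicate rows of df_all.
per_item = (
    all_periods.groupby("item_id")["days_up"]
    .agg(["count", "sum"])
    .rename(columns={"count": "item_id_count_days_up", "sum": "item_id_sum_days_up"})
    .reset_index()
)

df_all = df_all.merge(per_item, on="item_id", how="left")
print(df_all)
```

Items that never appear in the periods table (item `c` here) end up with missing values, which is what the imputation step on the next slide addresses.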
Impute Null Values
¨  Fillna with 0
df_all['price'] = df_all['price'].fillna(0)
¨  Fillna with median
enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
enc.columns = ['category_name', 'count_days_up_impute']

df_all = pd.merge(df_all, enc, how='left', on='category_name')
# fill missing counts with the median of the same category
df_all['item_id_count_days_up_impute'] = df_all['item_id_count_days_up'].fillna(df_all['count_days_up_impute'])
¨  Fillna with model prediction value
RNN(text) -> image_top_1 (renamed image_top_2)

※ Magic feature we did not find: df['price'] - df[RNN(text) -> price] (the actual price minus the price predicted from text by an RNN)
Text Feature
¨  TF-IDF for text, title, param_123

tfidf_para = {
    "stop_words": russian_stop,
    "analyzer": 'word',
    "token_pattern": r'\w{1,}',
    "lowercase": True,
    "sublinear_tf": True,
    "dtype": np.float32,
    "norm": 'l2',
    "smooth_idf": False
}
# The per-column selection for each vectorizer is not shown on the slide.
vectorizer = FeatureUnion([
    ('text', TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=200000,
        **tfidf_para)),
    ('title', TfidfVectorizer(
        ngram_range=(1, 2),
        stop_words=russian_stop)),
    ('param_123', TfidfVectorizer(
        ngram_range=(1, 2),
        stop_words=russian_stop))
])
Text Feature
¨  SVD for Title
tfidf_vec = TfidfVectorizer(ngram_range=(1,1))	
svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')	
	
svd_title_obj.fit(full_title_tfidf)	
	
train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))	
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
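The slide does not show how `full_title_tfidf`, `train_title_tfidf`, and `test_title_tfidf` are produced. A plausible end-to-end sketch with scikit-learn is given below; the toy titles and the choice to fit the vocabulary on train plus test together are assumptions, and `n_components` is reduced from 40 to 3 only because the toy vocabulary is tiny.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy titles standing in for the train and test 'title' columns.
train_titles = ["продам велосипед", "детская коляска", "велосипед горный"]
test_titles = ["коляска зимняя", "горный велосипед новый"]

# Fit the vocabulary on train + test titles, mirroring the SVD fit on the full matrix.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
full_title_tfidf = tfidf_vec.fit_transform(train_titles + test_titles)
train_title_tfidf = tfidf_vec.transform(train_titles)
test_title_tfidf = tfidf_vec.transform(test_titles)

# With algorithm='arpack', n_components must be smaller than the matrix dimensions.
svd_title_obj = TruncatedSVD(n_components=3, algorithm="arpack")
svd_title_obj.fit(full_title_tfidf)

train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
print(train_title_svd.shape, test_title_svd.shape)
```

The dense SVD columns give tree models a compact view of the title text alongside the sparse TF-IDF matrix.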
Text Feature
¨  Count Unique Feature
for cols in ['text', 'title', 'param_123']:
    df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
    df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
    df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
    df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
    df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
    df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
    df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
    df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
    df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
    df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
    df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))  # number of line breaks
    df_all[cols + '_num_chars'] = df_all[cols].apply(len)  # number of characters
    df_all[cols + '_num_words'] = df_all[cols].apply(lambda comment: len(comment.split()))
    df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda comment: len(set(comment.split())))
    # +1 in the denominator avoids division by zero for empty strings
    df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
    df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
    df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
    df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
    df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
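The loop above relies on a helper `count_regexp_occ` and on a `punct` collection that the slide does not define. The definitions below are assumptions that make a few of the counts reproducible on a single string; they may differ from what the team actually used.

```python
import re
import string


def count_regexp_occ(regexp, text):
    """Number of non-overlapping matches of regexp in text (assumed definition)."""
    return len(re.findall(regexp, text))


# Assumed stand-in for the `punct` set used on the slide.
punct = set(string.punctuation)

sample = "Продам iPhone X, 64 GB!"
features = {
    "num_cap": count_regexp_occ("[А-ЯA-Z]", sample),   # Cyrillic or Latin capitals
    "num_rus_cap": count_regexp_occ("[А-Я]", sample),  # Cyrillic capitals only
    "num_dig": count_regexp_occ("[0-9]", sample),
    "num_pun": sum(c in punct for c in sample),
    "num_words": len(sample.split()),
}
print(features)
```

Note that the range `[А-Я]` does not include the letter `Ё`, so titles containing it would be slightly undercounted.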
Text Feature
¨  Ynktk’s WordEmbedding
•  Self-trained FastText
# gensim 3.x argument names (gensim 4 renamed size to vector_size)
model = FastText(PathLineSentences(train+test+train_active+test_active),
                 size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
•  Self-trained Word2Vec
model = Word2Vec(PathLineSentences(train+test+train_active+test_active),
                 size=300, window=5, min_count=5, seed=seed, workers=32)
※ Embeddings trained on the competition text were more effective than embeddings pre-trained on Wikipedia and similar corpora, probably because proper nouns such as product names were predictive of the target
Image Feature
¨  Meta Feature
•  Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurrness, Whiteness, Dullness
•  Dullness - Whiteness (Interaction feature)
¨  Pre-trained Prediction Feature
•  Vgg16 Prediction Value
•  Resnet50 Prediction Value
¨  Ynktk’s Feature
•  Top finishers extracted image features with networks such as VGG, but hand-crafted features were also effective
•  NIMA [1]
•  Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
[1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment
[2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising
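A few of the meta features listed above can be sketched with NumPy on a raw pixel array. The slides do not define dullness or whiteness, so the threshold-based definitions here (share of near-black and near-white pixels) and the function name `image_meta_features` are assumptions for illustration only.

```python
import numpy as np


def image_meta_features(img, dark_thresh=20, light_thresh=240):
    """Simple statistics for an RGB image array of shape (height, width, 3).

    The threshold-based dullness/whiteness definitions are assumptions;
    the slides do not specify how these features were computed.
    """
    height, width = img.shape[:2]
    avg_red, avg_green, avg_blue = img.reshape(-1, 3).mean(axis=0)
    dullness = float((img.max(axis=2) <= dark_thresh).mean())    # share of near-black pixels
    whiteness = float((img.min(axis=2) >= light_thresh).mean())  # share of near-white pixels
    return {
        "height": height,
        "width": width,
        "average_red": float(avg_red),
        "average_green": float(avg_green),
        "average_blue": float(avg_blue),
        "dullness": dullness,
        "whiteness": whiteness,
        "dullness_minus_whiteness": dullness - whiteness,  # interaction feature
    }


# Toy 2x2 image: one black, one white, and two pure-red pixels.
img = np.array([[[0, 0, 0], [255, 255, 255]],
                [[255, 0, 0], [255, 0, 0]]], dtype=np.uint8)
meta = image_meta_features(img)
print(meta)
```

In practice the array would come from an image loader, and the features would be computed once per image and stored alongside the tabular data.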
Parameter Tuning
¨  Tuned manually, with runs spread across multiple servers
params = {
    'boosting_type': 'gbdt',
    'objective': 'xentropy',  # target value is like a binary classification probability
    'metric': 'rmse',
    'learning_rate': 0.02,
    'num_leaves': 600,
    'max_depth': -1,
    'max_bin': 256,
    'bagging_fraction': 1,
    'feature_fraction': 0.1,  # sparse text vector
    'verbose': 1
}
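The pairing of a cross-entropy objective with an RMSE metric can look surprising for a regression-style target. The NumPy sketch below, written for illustration and not taken from the slides, shows that for a soft label in [0, 1] the cross-entropy loss is smallest when the prediction equals the label, so optimising it still drives predictions toward the target value that RMSE scores.

```python
import numpy as np


def cross_entropy(y, p, eps=1e-12):
    """Cross-entropy between a soft label y in [0, 1] and a prediction p."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def rmse(y_true, y_pred):
    """Root mean squared error, the validation metric set in the params above."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


# For a soft target such as deal_probability = 0.3, scan candidate predictions
# and pick the one with the lowest loss.
y = 0.3
grid = np.linspace(0.01, 0.99, 99)
best_p = float(grid[np.argmin(cross_entropy(y, grid))])
print(best_p)

print(rmse([0.0, 0.5, 1.0], [0.0, 0.5, 0.0]))
```

A side benefit of this objective is that predictions pass through a sigmoid and therefore stay inside the valid 0 to 1 range without clipping.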
Submission Analysis
Single LightGBM Sub File / Stacking Sub File
1.  Bug Check
2.  Diverse Model Comparison
3.  Prediction Value Trend
Best Lightgbm Summary
¨  Table Feature Number
~250
¨  Text Feature Number
1,500,000+
¨  Image Feature Number
50+
¨  Total Feature Number
1,503,424
¨  Public LB
better than 0.2174
¨  Private LB
better than 0.2210
Diversity
Type | Loss | Data Set | Feature Set | Parameter | NN Structure
Lightgbm | xentropy, regression, huber, fair, auc | With/Without Active data | Table; Table + Text; Table + Text + Image; Table + Text + Image + Ridge_meta | Learning_rate, Num_leaves |
Xgboost | reg:linear, binary:logistic | With/Without Active data | Table + Text; Table + Text + Image | |
Catboost | binary_crossentropy | With Active data | Table + Image | |
Random Forest | regression | With Active data | Table + Text + Image | |
Ridge Regression | regression | Without Active data | Text; Table + Text + Image | Tfidf max_features |
Neural network | regression, binary_crossentropy | With Active data | Table + Text + Image + wordembedding | Layer size, Dropout, BatchNorm, Pooling | rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
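Meta features such as `Ridge_meta` and the diverse models' OOF predictions are typically built from out-of-fold predictions. The sketch below shows that general technique with a scikit-learn `Ridge` model on random toy data; it is an assumed illustration, not the team's stacking code, which the slides do not show.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Random toy data standing in for real features and deal_probability.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = np.clip(X_train[:, 0] * 0.2 + 0.3 + rng.normal(scale=0.05, size=100), 0, 1)
X_test = rng.normal(size=(20, 5))

n_splits = 5
oof_pred = np.zeros(len(X_train))   # one held-out prediction per training row
test_pred = np.zeros(len(X_test))   # test prediction averaged over the folds

kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
for fit_idx, val_idx in kf.split(X_train):
    model = Ridge(alpha=1.0)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    # Each row is predicted by a model that never saw its target.
    oof_pred[val_idx] = model.predict(X_train[val_idx])
    test_pred += model.predict(X_test) / n_splits

# oof_pred and test_pred can be appended as a meta feature for a second-level model.
oof_rmse = float(np.sqrt(np.mean((oof_pred - y_train) ** 2)))
print(round(oof_rmse, 4))
```

Because every training row gets a prediction from a model that did not train on it, the meta feature does not leak the target into the next level.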
Ynktk’s Best NN
[Architecture diagram]
Inputs: Numerical, Categorical, Image, Text
Layers shown: Embedding, Dense, SpatialDropout, LSTM, GRU, Conv1D, LeakyReLU, GAP, GMP, Concat, BatchNorm, Dropout
*callbacks
•  EarlyStopping
•  ReduceLROnPlateau
*optimizer
•  Adam with clipvalue 0.5
•  Learning rate 2e-03
*loss
•  Binary cross entropy
Priv LB: 0.2225
Pub LB: 0.2181
China Competitions & Platform
Aspect | Kaggle | China Comp
Platform | Kaggle | Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain… … [1]
Round | Round1: 2~3 months (Public/Private LB) | Round1: 1.5 months Public, 3 days Private; Round2: 2 weeks Public, 3 days Private; Round3: Presentation
Sub/day | 5 | Public: 3, Private: 1
Prize | Top 3 | Top 5/10/50
[1] https://github.com/iphysresearch/DataSciComp
Knowledge Sharing
https://github.com/Smilexuhc/Data-Competition-TopSolution/blob/master/README.md
Learn From GrandMasters
1.  EDA by Excel
2.  Join every competition
3.  Reuse pipeline & features
4.  Strict time management
5.  Use knowledge from different areas
6.  Family Support
Thank You !
Q & A

 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 

Kaggle Avito Demand Prediction Challenge 9th Place Solution

  • 1. KAGGLE AVITO DEMAND PREDICTION CHALLENGE 9TH SOLUTION. Kaggle Meetup Tokyo 5th, 2018.12.01, senkin13
  • 2. About Me
      - 詹金 (せんきん)
      - Kaggle ID: senkin13
      - Background:
          Infrastructure & DB Engineer [Prefect World] [Square Enix]
          Big Data Engineer [Square Enix] [OPT] [Line] [FastRetailing]
          Machine Learning Engineer [FastRetailing]
  • 3. Agenda
      - Avito-demand-prediction Overview
      - Competition Pipeline
      - Best Single Model (LightGBM)
      - Diverse Models
      - Ynktk’s Best NN
      - Kohei’s Ensemble
      - China Competitions/Kagglers
      - Q & A
  • 6. Description
      - Avito: Russia’s largest classified advertisements website
      - Prediction: predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts
  • 7. Evaluation
      item_id          deal_probability
      b912c3c6a6ad     0.12789
      2dac0150717d     0.00000
      ba83aefab5dc     0.43177
      02996f1dd2eas    0.80323
      [Target] deal_probability is the likelihood that an ad actually sold something. Range: 0 to 1.
  • 8. Data Description
      - ID: item_id, user_id
      - Numeric: price
      - Category: region, city, parent_category_name, user_type, category_name, param_1, param_2, param_3, image_top_1
      - Text: title, description
      - Image: image
      - Sequence: item_seq_number
      - Date: activation_date, date_from, date_to
      Tables:
      - Train/Test: user_id, ……, item_id, target
      - Active: user_id, ……, item_id (supplemental data of train minus deal_probability, image, and image_top_1)
      - Periods: item_id, date_from, date_to
  • 9. Train/Test Period
      - Train: 2017-03-15 ~ 2017-04-05
      - Test: 2017-04-12 ~ 2017-04-20
  • 10. Pipeline (timeline labels on the slide: one week, one week, one month)
      - Description, Kernel, Discussion
      - Baseline design
          [Baseline] 1. Table data model 2. Text data model (reduce wait time)
          [Validation] KFold: 5; feature validation: one by one; validation score: 5 folds
      - Feature engineering
          [Feature Engineering] LightGBM (Table + Text + Image); feature save: 1 feature per pickle file
          [Validation] KFold: 5; feature validation: one by one or by group; validation score: 1 fold
      - Parameter tuning
          [Parameter Tuning] manual
      - Teammates’ feature reuse
      - Diverse models’ OOF (out-of-fold) predictions
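The 5-fold validation and the out-of-fold (OOF) predictions mentioned on this slide can be sketched as follows. This is a minimal illustration, not the team's code: the Ridge model, the synthetic data, and the names `oof_predictions`, `X_demo`, and `y_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def oof_predictions(X, y, n_splits=5, seed=0):
    """Return out-of-fold predictions and the per-fold RMSE for a Ridge model."""
    oof = np.zeros(len(y))
    fold_rmse = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(X):
        model = Ridge(alpha=1.0)  # stand-in model, chosen only for brevity
        model.fit(X[train_idx], y[train_idx])
        # deal_probability lies in [0, 1], so clip the raw regression output.
        pred = np.clip(model.predict(X[valid_idx]), 0.0, 1.0)
        oof[valid_idx] = pred
        fold_rmse.append(float(np.sqrt(np.mean((pred - y[valid_idx]) ** 2))))
    return oof, fold_rmse


# Synthetic data standing in for real features and targets.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.clip(0.3 + 0.1 * X_demo[:, 0] + rng.normal(scale=0.05, size=200), 0, 1)
oof_demo, rmse_demo = oof_predictions(X_demo, y_demo)
```

Each row is predicted by a model that never saw it, so `oof_demo` can be fed to a second-level model without leaking the target.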
  • 11. Preprocessing
      - Tabular data
          df_all['price'] = np.log1p(df_all['price'])
          df_all['city'] = df_all['city'] + '_' + df_all['region']
      - Text data
          def clean_text(s):
              # Drop area units, fractions such as "2/5", and room-count tokens such as "3-к".
              s = re.sub(r'м²|\d+/\d|\d+-к|\d+к', ' ', s.lower())
              # Collapse runs of whitespace into a single space.
              s = re.sub(r'\s+', ' ', s)
              s = s.strip()
              return s
      - Image data
          Delete 4 empty images
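A small runnable version of the preprocessing steps, applied to made-up rows. The frame `df_demo` and its contents are invented for illustration; the regular expression assumes the slide's pattern originally used `\d` and `\s` escapes.

```python
import re

import numpy as np
import pandas as pd

# Area units, fractions such as "2/5", and room-count tokens such as "3-к" or "2к".
NOISE_PATTERN = re.compile(r"м²|\d+/\d|\d+-к|\d+к")


def clean_text(s: str) -> str:
    s = NOISE_PATTERN.sub(" ", s.lower())
    s = re.sub(r"\s+", " ", s)  # collapse whitespace runs
    return s.strip()


# Hypothetical rows; the column names follow the competition schema.
df_demo = pd.DataFrame({
    "price": [0.0, 999.0],
    "city": ["Самара", "Казань"],
    "region": ["Самарская область", "Татарстан"],
    "title": ["Продам  3-к квартиру 54 м²", "Этаж 2/5"],
})
df_demo["price"] = np.log1p(df_demo["price"])                # compress the heavy right tail
df_demo["city"] = df_demo["city"] + "_" + df_demo["region"]  # disambiguate same-named cities
df_demo["title"] = df_demo["title"].map(clean_text)
```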
  • 12. Feature Engineering
      - Date feature
          df_all['wday'] = df_all['activation_date'].dt.weekday
          ※ Use only date-type columns that exist in both train and test.
      - Extended text feature
          df_all['param_123'] = (df_all['param_1'].fillna('') + ' ' +
                                 df_all['param_2'].fillna('') + ' ' +
                                 df_all['param_3'].fillna('')).astype(str)
          df_all['text'] = (df_all['description'].fillna('').astype(str) + ' ' +
                            df_all['title'].fillna('').astype(str) + ' ' +
                            df_all['param_123'].fillna('').astype(str))
          ※ This increases the number of words available for training.
  • 13. Aggregation Feature
      - Unique
          {'groupby': ['category_name'], 'target': 'image_top_1', 'agg': 'nunique'},
      - Count
          {'groupby': ['user_id'], 'target': 'item_id', 'agg': 'count'},
      - Sum
          {'groupby': ['parent_category_name'], 'target': 'price', 'agg': 'sum'},
      - Mean
          {'groupby': ['user_id'], 'target': 'price', 'agg': 'mean'},
      - Median
          {'groupby': ['image_top_1'], 'target': 'price', 'agg': 'median'},
      - Max
          {'groupby': ['image_top_1', 'user_id'], 'target': 'price', 'agg': 'max'},
      - Min
          {'groupby': ['user_id'], 'target': 'price', 'agg': 'min'},
      ※ Designing these from a business perspective is the most efficient approach.
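The slide lists the aggregation specs but not the code that consumes them. One way such dictionaries could be applied is sketched below; the helper name `add_agg_features` and the column naming `<groupby>_<agg>_<target>` are assumptions (the naming is chosen to be consistent with columns such as `user_id_mean_price` used elsewhere in the deck).

```python
import pandas as pd


def add_agg_features(df, specs):
    """Add one column per spec, named <groupby keys>_<agg>_<target>."""
    out = df.copy()
    for spec in specs:
        keys, target, agg = spec["groupby"], spec["target"], spec["agg"]
        name = "_".join(keys) + f"_{agg}_{target}"
        # transform() broadcasts the group statistic back onto every row.
        out[name] = out.groupby(keys)[target].transform(agg)
    return out


# Hypothetical toy rows.
df_demo = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "item_id": ["a", "b", "c"],
    "price": [10.0, 30.0, 5.0],
})
specs = [
    {"groupby": ["user_id"], "target": "item_id", "agg": "count"},
    {"groupby": ["user_id"], "target": "price", "agg": "mean"},
]
df_feat = add_agg_features(df_demo, specs)
```

Keeping the specs as data makes it easy to add or drop a feature group without touching the loop.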
  • 14. Interaction Feature
      - Difference between two features
          df_all['image_top_1_diff_price'] = df_all['price'] - df_all['image_top_1_mean_price']
          df_all['category_name_diff_price'] = df_all['price'] - df_all['category_name_mean_price']
          df_all['param_1_diff_price'] = df_all['price'] - df_all['param_1_mean_price']
          df_all['param_2_diff_price'] = df_all['price'] - df_all['param_2_mean_price']
          df_all['user_id_diff_price'] = df_all['price'] - df_all['user_id_mean_price']
          df_all['region_diff_price'] = df_all['price'] - df_all['region_mean_price']
          df_all['city_diff_price'] = df_all['price'] - df_all['city_mean_price']
      ※ Arithmetic features (add, subtract, multiply, divide) that make business sense are strong.
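A compact sketch of the difference-from-group-mean pattern on toy data. The helper `add_price_diff` and the rows are hypothetical; the sketch assumes `price` has already been `log1p`-transformed as in the preprocessing slide.

```python
import numpy as np
import pandas as pd


def add_price_diff(df, col):
    """Add <col>_mean_price and <col>_diff_price for one grouping column."""
    out = df.copy()
    mean_col = f"{col}_mean_price"
    out[mean_col] = out.groupby(col)["price"].transform("mean")
    out[f"{col}_diff_price"] = out["price"] - out[mean_col]
    return out


# Hypothetical rows with log1p-transformed prices.
df_demo = pd.DataFrame({
    "category_name": ["phones", "phones", "cars"],
    "price": np.log1p([100.0, 300.0, 5000.0]),
})
df_demo = add_price_diff(df_demo, "category_name")
```

Because the price is on a log scale, the difference equals the log of the ratio between `1 + price` and the group's geometric mean of `1 + price`, so it measures how cheap or expensive an ad is relative to its peers.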
  • 15. Supplemental Data Feature
      - Calculate how many days each item was up
          all_periods['days_up'] = all_periods['date_to'].dt.dayofyear - all_periods['date_from'].dt.dayofyear
      - Count and sum of each item’s up days
          {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'count'},
          {'groupby': ['item_id'], 'target': 'days_up', 'agg': 'sum'},
      - Merge into the main table
          df_all = df_all.merge(all_periods, on='item_id', how='left')
      ※ Digging deep into the business-relevant parts of the supplemental data is important.
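A runnable sketch of the up-days features on invented rows. Two choices here are assumptions rather than the author's code: the duration is computed by subtracting timestamps (which also stays correct across a year boundary), and the periods are aggregated to one row per `item_id` before merging so that the left join cannot duplicate ads.

```python
import pandas as pd

# Hypothetical periods table: item "a" was listed twice, item "b" spans New Year.
periods = pd.DataFrame({
    "item_id": ["a", "a", "b"],
    "date_from": pd.to_datetime(["2017-03-15", "2017-03-20", "2017-12-30"]),
    "date_to": pd.to_datetime(["2017-03-18", "2017-03-21", "2018-01-02"]),
})
periods["days_up"] = (periods["date_to"] - periods["date_from"]).dt.days

# One row per item: how many listing periods, and total days listed.
agg = periods.groupby("item_id")["days_up"].agg(["count", "sum"]).reset_index()
agg.columns = ["item_id", "item_id_count_days_up", "item_id_sum_days_up"]

# Hypothetical main table; item "c" has no period history and gets NaN.
ads = pd.DataFrame({"item_id": ["a", "b", "c"]})
ads = ads.merge(agg, on="item_id", how="left")
```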
  • 16. Impute Null Values
      - Fill NaN with 0
          df_all['price'] = df_all['price'].fillna(0)
      - Fill NaN with the group median
          enc = df_all.groupby('category_name')['item_id_count_days_up'].agg('median').reset_index()
          enc.columns = ['category_name', 'count_days_up_impute']
          df_all = pd.merge(df_all, enc, how='left', on='category_name')
          df_all['item_id_count_days_up_impute'] = df_all['item_id_count_days_up'].fillna(df_all['count_days_up_impute'])
      - Fill NaN with a model prediction
          Rnn(text) -> image_top_1 (renamed image_top_2)
      ※ Magic feature that we did not find: df['price'] - df[Rnn(text) -> price]
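The group-median imputation can also be written without the intermediate merge. This is an alternative sketch on made-up rows, using `groupby(...).transform("median")`; it is not taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing value in "phones", and "cars" has no observed value at all.
df_demo = pd.DataFrame({
    "category_name": ["phones", "phones", "phones", "cars"],
    "item_id_count_days_up": [1.0, 3.0, np.nan, np.nan],
})

# The median skips NaN and is broadcast back to every row of its group.
group_median = df_demo.groupby("category_name")["item_id_count_days_up"].transform("median")
df_demo["item_id_count_days_up_impute"] = df_demo["item_id_count_days_up"].fillna(group_median)
```

A group with no observed values keeps its NaN, so a global fallback (or LightGBM's native missing-value handling) is still needed for those rows.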
  • 17. Text Feature
      - TF-IDF for text, title, param_123
          tfidf_para = {
              "stop_words": russian_stop,
              "analyzer": 'word',
              "token_pattern": r'\w{1,}',
              "lowercase": True,
              "sublinear_tf": True,
              "dtype": np.float32,
              "norm": 'l2',
              "smooth_idf": False
          }
          vectorizer = FeatureUnion([
              ('text', TfidfVectorizer(ngram_range=(1, 2), max_features=200000, **tfidf_para)),
              ('title', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop)),
              ('param_123', TfidfVectorizer(ngram_range=(1, 2), stop_words=russian_stop))
          ])
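The slide does not show how each vectorizer is pointed at its own column. One way to do that in scikit-learn is `ColumnTransformer`, used below in place of the slide's `FeatureUnion`; this is a swapped-in technique for illustration, with invented rows and without the stop-word list.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical rows.
df_demo = pd.DataFrame({
    "text": ["продам телефон", "новый телефон samsung", "продам авто"],
    "title": ["телефон", "samsung", "авто"],
})

# A plain string selector hands each vectorizer a 1-D column of documents.
vectorizer = ColumnTransformer(
    [
        ("text", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True), "text"),
        ("title", TfidfVectorizer(ngram_range=(1, 2)), "title"),
    ],
    sparse_threshold=1.0,  # keep the stacked output sparse where possible
)
X_text = vectorizer.fit_transform(df_demo)

n_text = len(vectorizer.named_transformers_["text"].vocabulary_)
n_title = len(vectorizer.named_transformers_["title"].vocabulary_)
```

The per-column blocks are stacked side by side, so the output has `n_text + n_title` columns.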
  • 18. Text Feature
      - SVD for title
          tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
          svd_title_obj = TruncatedSVD(n_components=40, algorithm='arpack')
          svd_title_obj.fit(full_title_tfidf)
          train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf))
          test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf))
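The slide omits the step that produces `full_title_tfidf`, `train_title_tfidf`, and `test_title_tfidf`. A minimal end-to-end sketch follows, assuming the vectorizer is fitted on train and test titles together; the titles are invented and `n_components` is reduced to 2 to fit the toy data.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical titles.
train_titles = ["продам телефон", "новый телефон", "продам авто", "детская коляска"]
test_titles = ["телефон samsung", "авто бу"]

# Fit the vocabulary on train + test so both share one feature space.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 1))
full_title_tfidf = tfidf_vec.fit_transform(train_titles + test_titles)
train_title_tfidf = tfidf_vec.transform(train_titles)
test_title_tfidf = tfidf_vec.transform(test_titles)

# ARPACK needs n_components to be smaller than both matrix dimensions.
svd_title_obj = TruncatedSVD(n_components=2, algorithm="arpack", random_state=0)
svd_title_obj.fit(full_title_tfidf)
train_title_svd = pd.DataFrame(svd_title_obj.transform(train_title_tfidf)).add_prefix("svd_title_")
test_title_svd = pd.DataFrame(svd_title_obj.transform(test_title_tfidf)).add_prefix("svd_title_")
```

The dense SVD columns give tree models a compact view of the title that complements the very wide sparse TF-IDF block.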
  • 19. Text Feature
      - Count and unique-count features
          for cols in ['text', 'title', 'param_123']:
              df_all[cols + '_num_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-ЯA-Z]', x))
              df_all[cols + '_num_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-яa-z]', x))
              df_all[cols + '_num_rus_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[А-Я]', x))
              df_all[cols + '_num_eng_cap'] = df_all[cols].apply(lambda x: count_regexp_occ('[A-Z]', x))
              df_all[cols + '_num_rus_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[а-я]', x))
              df_all[cols + '_num_eng_low'] = df_all[cols].apply(lambda x: count_regexp_occ('[a-z]', x))
              df_all[cols + '_num_dig'] = df_all[cols].apply(lambda x: count_regexp_occ('[0-9]', x))
              df_all[cols + '_num_pun'] = df_all[cols].apply(lambda x: sum(c in punct for c in x))
              df_all[cols + '_num_space'] = df_all[cols].apply(lambda x: sum(c.isspace() for c in x))
              df_all[cols + '_num_emo'] = df_all[cols].apply(lambda x: sum(c in emoji for c in x))
              df_all[cols + '_num_row'] = df_all[cols].apply(lambda x: x.count('\n'))
              df_all[cols + '_num_chars'] = df_all[cols].apply(len)  # number of characters
              df_all[cols + '_num_words'] = df_all[cols].apply(lambda comment: len(comment.split()))
              df_all[cols + '_num_unique_words'] = df_all[cols].apply(lambda comment: len(set(w for w in comment.split())))
              df_all[cols + '_ratio_unique_words'] = df_all[cols + '_num_unique_words'] / (df_all[cols + '_num_words'] + 1)
              df_all[cols + '_num_stopwords'] = df_all[cols].apply(lambda x: len([w for w in x.split() if w in stopwords]))
              df_all[cols + '_num_words_upper'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
              df_all[cols + '_num_words_lower'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.islower()]))
              df_all[cols + '_num_words_title'] = df_all[cols].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
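The helper `count_regexp_occ` and the `punct` set are used on the slide but not defined there. A plausible minimal definition, given here as an assumption, together with a few of the counts on one invented string:

```python
import re
import string


def count_regexp_occ(regexp: str, text: str) -> int:
    """Count non-overlapping matches of regexp in text (assumed definition)."""
    return len(re.findall(regexp, text))


punct = set(string.punctuation)  # ASCII punctuation only; an assumption

sample = "Продам iPhone 7, НОВЫЙ!\nТорг"
features = {
    "num_cap": count_regexp_occ("[А-ЯA-Z]", sample),
    "num_dig": count_regexp_occ("[0-9]", sample),
    "num_pun": sum(c in punct for c in sample),
    "num_row": sample.count("\n"),
    "num_words": len(sample.split()),
}
```

Note that the ranges `[А-Я]` and `[а-я]` do not include Ё and ё, which sit outside that contiguous Unicode block, so those letters are not counted by these features.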
  • 20. Text Feature
      - Ynktk’s word embeddings
          - Self-trained FastText
              # Argument names as in gensim before 4.0 (size was later renamed vector_size).
              model = FastText(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, word_ngrams=1, seed=seed, workers=32)
          - Self-trained Word2Vec
              model = Word2Vec(PathLineSentences(train+test+train_active+test_active), size=300, window=5, min_count=5, seed=seed, workers=32)
      ※ Embeddings trained on the competition text were more effective than embeddings pre-trained on Wikipedia and similar corpora, probably because proper nouns such as product names influenced the target variable.
  • 21. Image Feature
      - Meta features
          - Image_size, Height, Width, Average_pixel_width, Average_blue, Average_red, Average_green, Blurriness, Whiteness, Dullness
          - Dullness - Whiteness (interaction feature)
      - Pre-trained prediction features
          - VGG16 prediction value
          - ResNet50 prediction value
      - Ynktk’s features
          - Top finishers extracted image features with VGG and similar networks, but hand-crafted features were also effective
          - NIMA [1]
          - Brightness, Saturation, Contrast, Colorfulness, Dullness, Blurriness, Interest Points, Saliency Map, Human Faces, etc. [2]
      [1] Talebi, H., & Milanfar, P. (2018). NIMA: Neural Image Assessment
      [2] Cheng, H. et al. (2012). Multimedia Features for Click Prediction of New Ads in Display Advertising
  • 22. Parameter Tuning
      - Chosen manually, using multiple servers
          params = {
              'boosting_type': 'gbdt',
              'objective': 'xentropy',    # the target behaves like a binary classification probability
              'metric': 'rmse',
              'learning_rate': 0.02,
              'num_leaves': 600,
              'max_depth': -1,
              'max_bin': 256,
              'bagging_fraction': 1,
              'feature_fraction': 0.1,    # small because of the sparse text vectors
              'verbose': 1
          }
  • 23. Submission Analysis (panels: single LightGBM submission file, stacking submission file)
      1. Bug check
      2. Diverse model comparison
      3. Prediction value trend
  • 24. Best LightGBM Summary
      - Table features: ~250
      - Text features: 1,500,000+
      - Image features: 50+
      - Total features: 1,503,424
      - Public LB: better than 0.2174
      - Private LB: better than 0.2210
  • 25. Diversity (by model type: loss; data set; feature set; parameter; NN structure)
      - LightGBM
          Loss: xentropy, regression, huber, fair, auc
          Data set: with/without Active data
          Feature set: Table; Table + Text; Table + Text + Image; Table + Text + Image + Ridge_meta
          Parameter: learning_rate, num_leaves
      - XGBoost
          Loss: reg:linear, binary:logistic
          Data set: with/without Active data
          Feature set: Table + Text; Table + Text + Image
      - CatBoost
          Loss: binary_crossentropy
          Data set: with Active data
          Feature set: Table + Image
      - Random Forest
          Loss: regression
          Data set: with Active data
          Feature set: Table + Text + Image
      - Ridge Regression
          Loss: regression
          Data set: without Active data
          Feature set: Text; Table + Text + Image
          Parameter: TF-IDF max_features
      - Neural network
          Loss: regression, binary_crossentropy
          Data set: with Active data
          Feature set: Table + Text + Image + word embedding
          Parameter: layer size, Dropout, BatchNorm, Pooling
          NN structure: rnn-dnn, rnn-cnn-dnn, rnn-attention-dnn
  • 26. Ynktk’s Best NN (architecture diagram)
      - Inputs: Numerical, Categorical, Image, Text
      - Layers shown: Embedding, Dense, SpatialDropout, LSTM, GRU, Conv1D, LeakyReLU, GAP, GMP, Concat, BatchNorm, LeakyReLU, Dropout, Dense, LeakyReLU, BatchNorm, Dense, Concat, Embedding
      - Callbacks: EarlyStopping, ReduceLROnPlateau
      - Optimizer: Adam with clipvalue 0.5, learning rate 2e-03
      - Loss: binary cross entropy
      - Private LB: 0.2225, Public LB: 0.2181
  • 29. China Competitions & Platform
      - Platform
          Kaggle: Kaggle
          China competitions: Tianchi, Tencent, Jdata, Kesci, DataCastle, Biendata, DataFountain… [1]
      - Rounds
          Kaggle: Round 1: 2~3 months, Public/Private LB
          China competitions: Round 1: 1.5 months Public, 3 days Private; Round 2: 2 weeks Public, 3 days Private; Round 3: Presentation
      - Submissions per day
          Kaggle: 5
          China competitions: Public: 3, Private: 1
      - Prize
          Kaggle: Top 3
          China competitions: Top 5/10/50
      [1] https://github.com/iphysresearch/DataSciComp
  • 30. Knowledge Sharing
      https://github.com/Smilexuhc/Data-Competition-TopSolution/blob/master/README.md
      Learn From GrandMasters
      1. EDA with Excel
      2. Join every competition
      3. Reuse pipelines and features
      4. Strict time management
      5. Use knowledge from different areas
      6. Family support