SlideShare una empresa de Scribd logo
1 de 129
Descargar para leer sin conexión
Joaquin Vanschoren
Eindhoven University of Technology
j.vanschoren@tue.nl @joavanschoren http://bit.ly/openml
Advanced Course on Data Science & Machine Learning
Siena, 2019
Automatic Machine Learning & Meta-Learning
slides!
part 1
1
Slides / Book
Open access book
PDF (free): www.automl.org/book
www.amazon.de/dp/3030053172
More slides:
www.automl.org/events ->
AutoML Tutorial NeurIPS 2018
Video:
www.youtube.com/watch?v=0eBR8a4MQ30
http://bit.ly/openml
2
Learning takes experimentation
!3
It can take a lot of trial and error
Task
Models
performance
Learning
ModelsModelsModelsModels
Learning

episodes
Learning to learn takes experimentation
!4
Building machine learning systems requires a lot of expertise and trials
Neural networks
- design architecture (operators/layers)

- many sensitive hyperparameters

- optimization algorithm, learning rate,…

- regularization, dropout,…

- batch size, epochs, early stopping, …

- data augmentation, …
Task
Models
performance
Learning
ModelsModelsModelsModels
!5
Hyperparameters
Every design decision made by the user (architecture, operators, tuning,…)
!6
Hyperparameters
Can be very sensitive
Learning to learn takes experimentation
!7
Building machine learning systems requires a lot of expertise and trials
Task
Models
performance
Learning
ModelsModelsModelsModels
Machine learning pipelines
- design pipeline architecture (operators)

- clean, preprocess data (!)

- select and/or engineer features

- select and tune models

- adjust to evolving input data (concept drift),…
!8
Task Models
performance
Human expert ModelsModels
chooses architecture 

and hyperparameters
Task Models
performance
Meta-level learning and
optimization ModelsModels
chooses architecture 

and hyperparameters
Replace manual model building by automation
Automated machine learning
Current practice
ModelsModelsModels
!9
AUTOML!
THE DATA SCIENTIST’S #1 EXCUSE FOR
LEGITIMATELY SLACKING OFF:
“THE AUTOML TOOL IS OPTIMIZING MY MODELS!”
!10
Task Models
inner loop
Meta-level learning and
optimization ModelsModels
chooses architecture 

and hyperparameters
Human-in-the-loop AutoML (semi-AutoML)
Domain knowledge and human expertise are very valuable
e.g. unknown unknowns, preference learning,…
ModelsModelsModels
Human expert
outer loop
control

adapt Models +
explanations
Sutton 2019
ask
!11
Human-in-the-loop AutoML (semi-AutoML)
Explain model in human language: automatic statistician
Steinruecken 2019
search over grammar
of models (kernels)
data
input
model component to
English translation
report
generation
!12
Task ModelsOptimization ModelsModels
optimize architecture

and hyperparameters
Overview
AutoML introduction, promising optimization techniques
Meta-learning
part 1
part 2
part 3 Neural architecture search, meta-learning on neural nets
Tasks ModelsMeta-learning ModelsModels
optimize architecture

and hyperparameters
TasksTasks
Tasks ModelsNAS, meta-learning ModelsModels
optimize neural architecture,

model/hyperparameters
TasksTasks
Task ModelsOptimization ModelsModels
optimize architecture

and hyperparameters
Overview
AutoML introduction, optimization techniquespart 1
1. Problem definition
2. (Pipeline) architecture search
3. Optimization techniques
4. Performance improvements
5. Benchmarks
!13
!14
AutoML Definition
Let:
{A(1)
, A(2)
, . . . , A(n)
} be a set of algorithms (operators)
Λ(i)
= λ1 × λ2 × . . . × λm be the hyperparameter space for A(i)
𝒜 be the space of possible architectures of one or more algorithms
Find the configuration that minimizes the expected loss on a dataset :
λ ∈Λ𝒜
Λ𝒜 = Λ(1)
× Λ(2)
× . . . × Λ(m) its combined configuration space
a specific configuration (of architecture and hyperparameters)
ℒ(λ, Dtrain, Dvalid) the loss of the model created by , trained on
data , and validated on data
λ
Dtrain Dvalid
λ* = argmin
λ∈Λ𝒜
𝔼(Dtrain,Dvalid)∼𝒟ℒ(λ, Dtrain, Dtest)
𝒟
!15
Types of hyperparameters
• Continuous (e.g. learning rate, SVM_C,…)
• Integer (e.g. number of hidden units, number of boosting iterations,…)
• Categorical
• e.g. choice of algorithm (SVM, RandomForest, Neural Net,…)
• e.g. choose of operator (Convolution, MaxPooling, DropOut,…)
• e.g. activation function (ReLU, Leaky ReLU, tanh,…)
• Conditional
• e.g. SVM kernel if SVM is selected, kernel width if RBF kernel is selected
• e.g. Convolution kernel size if Convolution layer is selected
• We can identify two subproblems:
• Architecture search: search space of all possible architectures
• Pipelines: Fixed predefined pipeline, grammars, genetic
programming, planning, Monte-Carlo Tree Search
• Neural architectures: See part 3
• Hyperparameter optimization: optimize remaining hyperparameters
• Optimization: grid/random search, Bayesian optimization, evolution,
multi-armed bandits, gradient descent (only NNs)
• Meta-learning: See part 2
• Can be solved consecutively, simultaneously or interleaved
• Compositionality: breaking down the learning process into smaller reusable
tasks makes it easier, more transferable, more robust
Architecture vs hyperparameters
!16
• Parameterize best-practice (linear) pipelines
• Introduce conditional hyperparameters
• Combined Algorithm Selection and Hyperparameter optimization (CASH):
Pipeline search: fixed pipelines
λr ∈{A(1)
, . . . , A(k)
, ∅}
Λ𝒜 = Λ(1)
× Λ(2)
× . . . × Λ(m)
× λr
Could such a pipeline be learned?
Feurer et al. 2015, Thornton et al. 2013
!17
Pipeline search: genetic programming
• Tree-based pipeline optimization
• Start with random pipelines,
best of every generation will
cross-over or mutate
• GP primitives include data copy
and feature joins: trees
• Multi-objective optimization:
accurate but short
• Easy to parallelize
• Asynchronous evolution (GAMA)
Olson, Moore 2016, 2019
copy
change

insert

shrink
Gijsbers, Vanschoren 2019
(TPOT)
!18
!
Pipeline search: TPOT demo
Pipeline search: genetic programming
• Grammar-based genetic programming (RECIPE) [de Sa et al. 2017]
• Encodes background knowledge, avoids invalid pipelines
• Cross-over and mutation respect the grammar
production rule non-terminaloptional
terminal
!20
Pipeline search: Monte Carlo Tree Search
• Use MCTS to search for optimal pipelines
• Optimize the structure and hyperparameters simultaneously by
building a surrogate model to predict configuration performance
• Bayesian surrogate model: MOSAIC [Rakotoarison et al. 2019]
• Neural network: AlphaD3M [Drori et al. 2018]
[Chaslot et al., 2008]!21
Pipeline search: planning
• Hierarchical planners: ML-Plan [Mohr et al. 2018], P4ML [Gil et al. 2018]
• Graph search problem, can be solved with best first search
• Use random path completion (as in MCTS) to evaluate each node
• Hyperparameter optimization interleaved as possible actions
score
!22 tune
Hyperparameter optimization (HPO)
!23
Black box optimization: expensive for machine learning.
Sample efficiency is important!
λ* = argmin
λ∈Λ𝒜
𝔼(Dtrain,Dvalid)∼𝒟ℒ(λ, Dtrain, Dtest)
HPO: grid and random search
!24
Bergstra & Bengio, JMLR 2012 

• Random	search	handles	unimportant	dimensions	be4er	
• Easily	parallelizable,	but	uninformed	(no	learning)
• Start with a few (random) hyperparameter configurations
• Build a surrogate model to predict how well other configurations will
work: mean and standard deviation (blue band)
• Any probabilistic regression model: e.g. Gaussian processes
• To avoid a greedy search, use an acquisition function to trade off
exploration and exploitation, e.g. Expected Improvement (EI)
• Sample for the best configuration under that function
μ σ
HPO: Bayesian Optimization
Mockus [1974]
μ
μ + σ
μ −σ
• Repeat
• Stopping criterion:
• Fixed budget (time,
evaluations)
• Min. distance
between configs
• Threshold for
acquisition function
• Still overfits easily
• Convergence results
• Good for non-convex,
noisy objectives
• Used in AlphaGo
!
HPO: Bayesian Optimization
Srinivas et al. 2010, Freitas et al.
2012, Kawaguchi et al. 2016
Grid Random
Bayesian
!27
Gaussian processes
+ uncertainty, extrapolation
- Scalability (cubic)
Sparse GPs[Lawrence et al. 2003]
Random embeddings [Wang et al. 2013]
- Robustness? Meta-BayesOpt to optimize kernel [Malkomes et al. 2016]
Neural networks + Bayesian LR [Snoek et al. 2015]
+ Scalable, Learns basis expansion
Bayesian neural networks [Springenberg et al. 2016]
+ Bayesian - Scalability not studied yet
Random Forests [Hutter et al. 2011, Feurer et al. 2015]
Frequentist uncertainty estimate (variance)
+ scalable, complex HPs - uncertainty, extrapolation
Φz
P
BLR
surrogate
φz(λ)i
(λi,Pi,j) φz(λ)
!
Bayesian Optimization: surrogate models
!28
Random Forests (SMAC) example:
(minimizing)!29
interleaved with
random search
!
Bayesian Optimization: surrogate models
!30
Random Forest Surrogate
animation by Jeroen van Hoof
Jeroen van Hoof
(under review)
Gradient Boosting Surrogate
Estimate uncertainty with
quantile regression
!31
!32
HPO: Tree of Parzen Estimators
1.Test some hyperparameters

2. Separate into good and
bad hyperparameters (with
some quantile)

3. Fit non-parametric KDE

for and


4. For a few samples,
evaluate 

Shown to be equivalent to EI!

Efficient, parallelizable,
robust, but less sample
efficient than GPs

Used in HyperOpt-sklearn

[Komer et al. 2019]
p(λ = good)
p(λ = bad)
p(λ = good)
p(λ = bad)
HPO: population-based methods
!33
• Less	sample	efficient,	but	easy	to	parallelize,	and	adapts	to	changes	
• Gene@c	programming	
• Muta@ons:	add,	mutate/tune,	remove	HP	
• Par@cle	swarm	op@miza@on		
• Covariance	matrix	adapta@on	evolu@on	(CMA-ES)	
• Purely	con@nuous,	expensive	
• Very	compe@@ve	to	op@mize	deep	neural	nets	
Olson, Moore 2016, 2019
Mantovani et al 2015
mutate
Hansen 2015
											[Loshilov,	Hu-er	2016]
!34
AUTOML IS
STILL RUNNING
AutoML performance improvements
Making it practically useful (inspired by humans)
OH, COME ON!
!35
Multi-fidelity approaches
General techniques for cheap approximations of black-box
λ* = argmin
λ∈Λ𝒜
𝔼(Dtrain,Dvalid)∼𝒟ℒ(λ, Dtrain, Dtest)
data subsets
fewer epochs, shorter MCMC chains, 

fewer RL trials,…
downsampled images
λmf
!36
Multi-fidelity approaches
Cheap low-fidelity surrogates: train on small subsets, infer which regions
may be interesting to evaluate in more depth
• Evaluate random samples on smallest set
• Update a Bayesian surrogate model
• Select fewer points based on surrogate, evaluate on more data, repeat
Swersky	et	al.	2013,	2014;	Klein	et	al	2017;	Kandasamy	et	al	2017
!37
Multi-fidelity approaches
Successive halving:
• Randomly sample candidates and evaluate on a small data sample
• retrain the best half candidates on twice the data
Jamieson & Talwalkar 2016
1/16 1/8 1/4 1/2
!38
Multi-fidelity approaches
Successive halving:
• Randomly sample candidates and evaluate on a small data sample
• retrain the best half candidates on twice the data
Jamieson & Talwalkar 2016
!39
Multi-fidelity approaches
Hyperband: Repeated, decreasingly aggressive successive halving
• Minimizes (doesn’t eliminate) chance that candidate was pruned to early
• Strong anytime performance, easy to implement, scalable, parallelizable
Li et al. 2017
!40
Multi-fidelity approaches
Combined Bayesian Optimization and Hyperband (BO-HB)
• BayesOpt: choose which configurations to evaluate
• Hyperband: allocate budgets more efficiently
• Strong anytime and final performance
Falkner et al. 2018
!41
Early stopping
Learning curve prediction
• Model learning curves as parametric functions, fit
• Train a Bayesian Neural Net to predict learning curves
[Domhan et al 2015]
[Klein et al 2017]
Data allocation using upper bounds (DAUB)
• Fit linear function through latest estimates
• Compute upper performance bound
Sabharwal et al. 2015
CNN learning curves
Hyperparameter gradient descent
• Bilevel program
• Outer obj.: optimize
• Inner obj.: optimize weights given
• Derive through the entire SGD
• Get hypergradients wrt. validation loss
• Expensive! But useful if you have many
hyper parameters and high parallelism
• Interleave optimization steps
• Alternate SGD steps for and
λ
λ
ω λ
[Franceschi et al 2018]
Optimize neural network hyperparameters and weights simultaneously
[MacLaurin et al 2015]
[Luketina et al 2016]
!42
!43
Meta-learning
Meta-learning can drastically speed up the architecture and hyperparameter
search, in combination with the optimization techniques we just discussed
• Learn which hyperparameters are really important
• Learn which hyperparameter values should be tried first
• Learn which architectures will most likely work
• Learn how to clean and annotate data
• Learn which feature representations to use
• …
Thank you! To be continued…
Part 2: meta-learning
Image: Pesah et al. 2018!44
Joaquin Vanschoren
Eindhoven University of Technology
j.vanschoren@tue.nl @joavanschoren http://bit.ly/openml
Advanced Course on Data Science & Machine Learning
Siena, 2019
Automatic Machine Learning & Meta-Learning
slides!
part 2
!2
Task ModelsOptimization ModelsModels
optimize architecture

and hyperparameters
Overview
AutoML introduction, promising optimization techniques
Meta-learning
part 1
part 2
part 3 Neural architecture search, meta-learning on neural nets
Tasks ModelsMeta-learning ModelsModels
optimize architecture

and hyperparameters
TasksTasks
Tasks ModelsNAS, meta-learning ModelsModels
optimize neural architecture,

model/hyperparameters
TasksTasks
Learning is a never-ending process
!3
Humans don’t learn from scratch
Learning is a never-ending process
!4
Learning humans also seek/create related tasks
E.g. find many similar puzzles, solve them in different ways,…
Learning is a never-ending process
Task 1 Task 2 Task 3
Models ModelsModelsModels
Model
performance performance performance
Learning
ModelsModelsModelsModels
Learning Learning
Learning

episodes
meta-

learning
meta-

learning
meta-

learning
!5
Humans learn across tasks
Why? Requires less trial-and-error, less data
Task 1 Task 2 Task 3
ModelsModelsModels
performance performance performance
LearningLearning Learning
ModelsModelsModels
ModelsModelsModels
inductive bias
!6
Inductive bias: assumptions added to the training data to learn effectively
If prior tasks are similar, we can transfer prior knowledge to new tasks
Learning to learn
New Task
performance
ModelsModelsModels
Learning
prior beliefs
constraints
model parameters
representations
training data
Task 1 Task 2 Task 3
ModelsModelsModels
performance performance performance
LearningLearningLearners
LearningLearningLearners
LearningLearningLearners
ModelsModelsModels
ModelsModelsModels
}meta-data
!7
Meta-learning
New Task
performance
ModelsModelsModels
meta-learner
base-learner
additional 

experiments
Meta-learner learns a (base-)learning algorithm, based on meta-data
Meta-data
New Task
meta-learner
ModelsModelsModels
Task 1 Task j
ModelsModelsModels
performance performance
LearningLearningLearningLearning
ModelsModelsModels
…
performance
!8
Learners Learners
Pi,j
λi
mj
!i,j,k
Pi,j
λi
mj
!i,j,k
Description of task j (meta-features)
Configuration i (architecture + hyperparameters)
Model parameters (e.g. weights)
Performance estimate of model built with λi on task j
How?
New Task
meta-learner
ModelsModelsModels
Task 1 Task j
ModelsModelsModels
performance performance
LearningLearningLearningLearning
ModelsModelsModels
…
performance
1. Learn from observing learning algorithms (for any task)
2. Learn what may likely work (for somewhat similar tasks)
3. Learn from previous models (for very similar tasks)
!9
Learners Learners
Pi,j
λi
mj
!i,j,k
Pi,jλi
mj
!i,j,k
Pi,jλi
Pi,jλi(mj)
1. Learn from observing learning algorithms
Define, store and use meta-data:
- configurations: settings that uniquely define the model
- performance (e.g. accuracy) on specific tasks
New Task
meta-learner
ModelsModelsModels
configurations
performances
Task 1 Task j
ModelsModelsModels
performance performance
LearningLearningLearningLearning
ModelsModelsModels
…
performance
λi
Pi,j
!10
Learners Learners
(hyperparameters, pipeline,
neural architecture,…)
1. Learn from observing learning algorithms
Define, store and use meta-data:
- configurations: settings that uniquely define the model
- performance (e.g. accuracy) on specific tasks
New Task
meta-learner
ModelsModelsModels
configurations
performances
Task 1 Task j
ModelsModelsModels
performance performance
LearningLearningLearningLearning
ModelsModelsModels
…
performance
λi
Pi,j
!10
Learners Learners
(hyperparameters, pipeline,
neural architecture,…)
task model config λ score P
0 alg=SVM, C=0.1 0,876
0 alg=NN, lr=0.9 0,921
0 alg=RF, mtry=0.1 0,936
1 alg=SVM, C=0.2 0,674
1 alg=NN, lr=0.5 0,777
1 alg=RF, mtry=0.4 0,791
Rankings
• Build a global (multi-objective) ranking, recommend the top-K
• Can be used as a warm start for optimization techniques
• E.g. Bayesian optimization, evolutionary techniques,…
Tasks
ModelsModelsModels
performance
LearningLearningLearning
1. λa
2. λb
3. λc
4. λd
5. λe
6. …
New Task
meta-learner
ModelsModelsModels
performance
Global ranking

(task agnostic)
λa..k
warm
start
Leite et al. 2012
Abdulrahman et al. 2018
}
(discrete)
(multi-objective)
Pi,j
!11
λi
• What if prior configurations are not optimal?
• Per task, fit a differentiable plugin estimator on all evaluated configurations
• Do gradient descent to find optimized configurations, recommend those
Tasks
ModelsModelsModels
performance
LearningLearningLearning
New Task
meta-learner
ModelsModelsModels
performance
Warm-starting with plugin estimators
λ*i
per task:
task 1: λ*1
task 2: λ*2
task 3: λ*3
…
Wistuba et al. 2015
λi
}Pi,j
!12
• Functional ANOVA 1
Select hyperparameters that cause variance in the evaluations.
Tasks
ModelsModelsModels
performance
LearningLearningLearning
}
New Task
meta-learner
ModelsModelsModels
performance
Hyperparameter behavior
hyperparameter
importance
1 van Rijn & Hutter 2018
λ1
λ2
λ3
λ4 constraints
priors
Pi,j
!13
λi
1 van Rijn & Hutter 2018
SVM (RBF)
Hyperparameter behavior
• Functional ANOVA 1
Select hyperparameters that cause variance in the evaluations.
AdaBoost
1 van Rijn & Hutter 2018
ResNets for image classification (under review)
• Functional ANOVA 1
Select hyperparameters that cause variance in the evaluations.
Hyperparameter behavior
1 van Rijn & Hutter 2018
Priors (KDE) for the most important hyperparameters
• Functional ANOVA 1
Select hyperparameters that cause variance in the evaluations.
Learn priors for hyperparameter tuning (e.g. Bayesian optimization)
Hyperparameter behavior
• Functional ANOVA 1
Select hyperparameters that cause variance in the evaluations.
• Tunability 2,3,4
Learn good defaults, measure improvement from tuning over defaults
1 van Rijn & Hutter 2018
2 Probst et al. 2018
!17
3 Weerts et al. 2018Hyperparameter behavior
4 van Rijn et al. 2018
• Search space pruning
Exclude regions yielding bad performance on (similar) tasks
Tasks
ModelsModelsModels
performance
LearningLearningLearning
}
New Task
meta-learner
ModelsModelsModels
performance
constraints
Pi,j
Wistuba et al. 2015
P
λ1
!18
λi
λ2
Hyperparameter behavior
• Experiments on the new task can tell us how it is similar to previous tasks
• Task are similar if observed performance of configurations is similar
• Use this to recommend new configurations, run, repeat
Tasks
ModelsModelsModels
performance
LearningLearningLearning
}
New Task
meta-learner
ModelsModelsModels
performance
Learning task similarity
λc
Pi,j
!19
λi
Pc,new ≅ Pc,j
• Learn task similarity while tuning configurations
• Tournament-style selection, warm-start with overall best config λbest
• Next candidate λc : the one that beats current λbest on similar tasks
Tasks
ModelsModelsModels
performance
LearningLearningLearning
}
New Task
meta-learner
ModelsModelsModels
performance
Active testing
Leite et al. 2012
λc
Select λc > λbest
on similar tasks
(discrete)
Pi,j
!20
λi
Sim(tj,tnew) = Corr([RLa,b,j],[RLa,b,new])
Pc,j
• If task j is similar to the new task, its surrogate model Sj will likely transfer well
• Sum up all Sj predictions, weighted by task similarity (as in active testing)1
• Build combined Gaussian process, weighted by current performance on new task2
Tasks
ModelsModelsModels
performance
LearningLearningLearning
New Task
meta-learner
ModelsModelsModels
performance
per task tj:
Pi,j
}
Surrogate model transfer
1 Wistuba et al. 2018
λi
P
Sj
2 Feurer et al. 2018
S= ∑ wj Sj
+
+
S1
S2
S3
!21
λi
• Multi-task Gaussian processes: train surrogate model on t tasks simultaneously1
• If tasks are similar: transfers useful info
• Not very scalable
• Bayesian Neural Networks as surrogate model2
• Multi-task, more scalable
• Stacking Gaussian Process regressors (Google Vizier)3
• Continual learning (sequential similar tasks)
• Transfers a prior based on residuals of previous GP
Multi-task Bayesian optimization
1 Swersky et al. 2013
Independent GP predictions Multi-task GP predictions
2 Springenberg et al. 2016
3 Golovin et al. 2017
!22
• Bayesian linear regression (BLR) surrogate model on every task
• Use neural net to learn a suitable basis expansion ϕz(λ) for all tasks
• Scales linearly in # observations, transfers info on configuration space
Tasks
ModelsModelsModels
performance
LearningLearningLearning New Task
meta-learner
ModelsModelsModels
performance
}
More scalable variants
Perrone et al. 2018
P
BLR
surrogate
(λi,Pi,j)
φz(λ)i
warm-start (pre-train)
λi
Bayesian optimization
φz(λ)
BLR hyperparameters
Pi,j
!23
λi
2. Learn what may likely work (for partially similar tasks)
Meta-features: measurable properties of the tasks
(number of instances and features, class imbalance, feature skewness,…)
configurations
performances
similar mj ?Task 1 Task N
ModelsModelsModels
performance performance
LearningLearningLearning
LearningLearningLearning
ModelsModelsModels
… meta-features New Task
meta-learner
ModelsModelsModels
performance
mj
Pi,j
!24
λi
2. Learn what may likely work (for partially similar tasks)
Meta-features: measurable properties of the tasks
(number of instances and features, class imbalance, feature skewness,…)
configurations
performances
similar mj ?Task 1 Task N
ModelsModelsModels
performance performance
LearningLearningLearning
LearningLearningLearning
ModelsModelsModels
… meta-features New Task
meta-learner
ModelsModelsModels
performance
mj
Pi,j
!24
λi
meta-
features
model config
λ
score
P
60000,5,0.2 alg=SVM, C=0.1 0,876
60000,5,0.2 alg=NN, lr=0.9 0,921
60000,5,0.2 alg=RF, mtry=0.1 0,936
11233,4,0.1 alg=SVM, C=0.2 0,674
11233,4,0.1 alg=NN, lr=0.5 0,777
11233,4,0.1 alg=RF, mtry=0.4 0,791
• Hand-crafted (interpretable) meta-features1
• Number of instances, features, classes, missing values, outliers,…
• Statistical: skewness, kurtosis, correlation, covariance, sparsity, variance,…
• Information-theoretic: class entropy, mutual information, noise-signal
ratio,…
• Model-based: properties of simple models trained on the task
• Landmarkers: performance of fast algorithms trained on the task
• Domain specific task properties
• Optimized (recommended) representations exist 2
Meta-features
1 Vanschoren 2018
!25
2 Rivolli et al 2019
• Learning a joint task representation
• Deep metric learning: learn a representation hmf using ground truth distance
• Siamese Network: similar task, similar representation 1
• Feed through pretrained CNN, extract vector from weights 2
• Recommend neural architectures, or warm-start neural architecture search
Meta-features
1 Kim et al. 2017
!26
2 Achille et al. 2019
• Find k most similar tasks, warm-start search with best λi
• Auto-sklearn: Bayesian optimization (SMAC)
• Meta-learning yield better models, faster
• Winner of AutoML Challenges
Tasks
ModelsModelsModels
performance
LearningLearningLearning
New Task
meta-learner
ModelsModelsModels
performance
Pi,j
}
Warm-starting from similar tasks
λ1..k
mj
best λi on
similar tasks
Feurer et al. 2015
!27
λi
Bayesian optimization
λ
P
λ1
λ3
λ2
λ4
Warm-starting from similar tasks
Fusi et al. 2017
Pi,j
λi
TL
λL
tj
tnew warm-started
with λ1..k
. . .. . . .. . . . .
P
λiλLi
P
p(P|λLi)
latent representation
• Learn latent representation
for tasks and configurations
• Use meta-features to
warm-start on new task
• Returns probabilistic
predictions for BayesOpt
• Collaborative filtering: configurations λi are `rated’by tasks tj
New Task
meta-learner
ModelsModelsModels
performance
λ1..k
Pi,j
}
mj
(discrete)
λi
• Train a task-independent surrogate model with meta-features in inputs
• SCOT: Predict ranking of λi with surrogate ranking model + mj. 1
• Predict Pi,j with multilayer Perceptron surrogates + mj. 2
• Build joint GP surrogate model on most similar ( ) tasks. 3
• Scalability is often an issue
Tasks
ModelsModelsModels
performance
LearningLearningLearning
New Task
meta-learner
ModelsModelsModels
performance
Pi,j
}
Global surrogate models
mj
1 Bardenet et al. 2013
2 Schilling et al. 2015
mj ✖ λi
P
3 Yogatama et al. 2014
2
!29
λi
task1 task2
joint
• Learn direct mapping between meta-features and Pi,j
• Zero-shot meta-models: predict best λi given meta-features 1
• Ranking models: return ranking λ1..k 2
• Predict which algorithms / configurations to consider / tune3
• Predict performance / runtime for given !i and task4
• Can be integrated in larger AutoML systems: warm start, guide search,…
meta-learner
Meta-models
λbest
1 Brazdil et al. 2009, Lemke et al. 2015
2 Sun and Pfahringer 2013, Pinto et al. 2017
meta-learner λ1..k
mj
mj
meta-learner
Pijmj, λi
3 Sanders and C. Giraud-Carrier 2017
meta-learner
Λmj
4 Yang et al. 2018
!30
Learning to learn through self-play
!31
• Build pipelines by inserting, deleting, replacing components (actions)
• Neural network (LSTM) receives task meta-features, pipelines and evaluations
• Predicts pipeline performance and action probabilities
• Monte Carlo Tree Search builds pipelines, runs simulations against LSTM
Drori et al 2017
New Task
meta-learner
ModelsModelsModels
performance
self-play
mj
λi
3. Learn from previous models (for very similar tasks)
configurations
performances
Task 1 Task N
ModelsModelsModels
performance performance
LearningLearningLearning
LearningLearningLearning
ModelsModelsModels
… New Task
meta-learner
ModelsModelsModels
performance
model parameters
Models trained on intrinsically similar tasks
(model parameters, features,…)
intrinsically (very) similar

(e.g. shared representation)
!k
Pi,j
!32
λi
3. Learn from previous models (for very similar tasks)
configurations
performances
Task 1 Task N
ModelsModelsModels
performance performance
LearningLearningLearning
LearningLearningLearning
ModelsModelsModels
… New Task
meta-learner
ModelsModelsModels
performance
model parameters
Models trained on intrinsically similar tasks
(model parameters, features,…)
intrinsically (very) similar

(e.g. shared representation)
!k
Pi,j
!32
λi
model
param !
model config
λ
score
P
0.34,0.43,… alg=NN, lr=0.9 0,876
0.04,0.03,… alg=NN, lr=0.9 0,921
0.94,0.03,… alg=NN, lr=0.9 0,936
0.34,0.23,… alg=NN, lr=0.5 0,674
0.04,0.23,… alg=NN, lr=0.5 0,777
0.94,0.23,… alg=NN, lr=0.5 0,791
Transfer Learning
!33
Source
tasks
ModelsModelsModels
performance
LearningLearningLearning
Target
task
meta-learner
performancePi,j
!33
ModelsModelsModels
• Select source tasks, transfer trained models to similar target task 1
• Use as starting point for tuning, or freeze certain aspects (e.g. structure)
• Bayesian networks: start structure search from prior model 2
• Reinforcement learning: start policy search from prior policy 3
1 Thrun and Pratt 1998
Bayesian Network transfer
Reinforcement learning: 2D to 3D mountain car
3 Taylor and Stone 2009
2 Niculescu-Mizil and Caruana 2005
!k
λi
Learning to learn Neural networks
!35!35
• End-to-end differentiable models afford many meta-learning
opportunities
• Transfer learning
• Learning gradient descent procedures (weight update rules)
• Few shot learning
• Model-agnostic meta-learning (learning weight initializations)
• Learning to reinforcement learn
• …
See part 3
Joaquin Vanschoren
Eindhoven University of Technology
j.vanschoren@tue.nl @joavanschoren http://bit.ly/openml
Advanced Course on Data Science & Machine Learning
Siena, 2019
Automatic Machine Learning & Meta-Learning
slides!
part 3
!2
Task ModelsOptimization ModelsModels
optimize architecture

and hyperparameters
Overview
AutoML introduction, promising optimization techniques
Meta-learning
part 1
part 2
part 3 Neural architecture search, meta-learning on neural nets
Tasks ModelsMeta-learning ModelsModels
optimize architecture

and hyperparameters
TasksTasks
Tasks ModelsNAS, meta-learning ModelsModels
optimize neural architecture,

model/hyperparameters
TasksTasks
!3
Overview
part 3 Neural architecture search, meta-learning on neural nets
Tasks ModelsNAS, meta-learning ModelsModels
optimize neural architecture,

model/hyperparameters
TasksTasks
1. Neural Architecture Search space
2. Neural Architecture optimization
3. (Neural) Meta-Learning
!4
Neural Architecture Search space
Sequential (chain) Arbitrary graph
Choose:

• number of layers

• type of layers

• dense

• convolutional

• max-pooling

• …

• hyperparameters of
layers

+ easier to search 

- sometimes too
simple
Choose:

• branching

• joins

• skip connections

• types of layers

• hyperparameters
of layers

+ more flexible 

- much harder
to search
Elsken et al. 2019
!5
Neural Architecture Search space
Observation: successful deep networks have repeated motifs (cells)

e.g. Inception v4:
!6
Cell search space
• parameterize building
blocks (cells)
• e.g. regular cell +
resolution reduction cell
• stack cells together in
macro-architecture

• usually a chain

• can be manual or
learned

+ smaller search space 

+ cells can be learned on a
small dataset & transferred
to a larger dataset

- you can’t learn entirely
new architectures
Zoph et al 2018
!7
Within-cell search space
• Different strategies to construct cell, leading to different parameterizations

• Can be parameterized as set of categorical hyperparameters:

• Which existing layer (hidden state, e.g. cell input) to build on

• Which operation (e.g. 3x3conv) to perform on 

• How to combine into new hidden state

• Iterate over B phases (blocks)

Hi
Hi
Zoph et al 2018
!8
Neural architecture optimization
Standard hyperparameter optimization techniques
1 Bergstra et al. 2013
• Kernels to optimize neural nets with GP-based Bayesian optimization

• ArcNet2: DNNs, 23 hyperparameters, 6 kernels

• NASBOT3: DNNs and CNNs
• Image classification pipelines1

• 238 hyperparameters, tuned with TPE
2 Swersky et al 2013
3 Kandasamy et al. 2018
!9
Neural architecture optimization
• AutoNet 1 : DNNs, 63 hyperparameters, tuned with SMAC

• Joint NAS + HPO 2: ResNets, tuned with BO-HB

• PNAS (Progressive NAS) 3

• Cell search space, optimized with SMAC, HPO afterwards

• SotA on ImageNet, CIFAR
1 Mendoza et al. 2016
3 Liu et al. 2018
Standard hyperparameter optimization techniques
2 Zela et al. 2018
!10
1. Observe state

2. Take action u based on policy ( are weights in an RNN)

3. New state is formed

4. Take further actions based on new state

5. After trajectory of motions, adjust policy based on total reward R( )
πθ θ
τ τ
Deep RL recap
policy gradient estimate (H steps, m samples)
HuiLevine
R
!11
NAS with Reinforcement learning
πθ
Zoph & Le 2017
RNN policy network:

• generate architecture step by step

• evaluate to get reward

• update with policy gradient
2-layer LSTM (REINFORCE), chains of convolutional layers

• State of the art on CIFAR-10, Penn Treebank

• 800 GPUs for 3-4 weeks, 12800 architectures
!12
NAS with Reinforcement learning
πθ
RNN policy network:

• generate architecture step by step

• evaluate to get reward

• update with policy gradient
1-layer LSTM (PPO), cell space search

• State of the art on ImageNet

• 450 GPUs, 20000 architectures
Zoph et al 2018
!13
Neuroevolution
Angeline et al 1994, Stanley and Miikkulainen 2002
Earlier solutions tried to:

• Optimize both the configuration and the weights simultaneously
with genetic algorithms

• Optimize individual nodes and connections

• Doesn’t scale to deep networks
!14
Neuroevolution
Real et al 2017; Miikkulainen et al 2017
Better: 

• learn the architecture and configuration with evolutionary techniques

• train the network with SGD

• mutation steps: add, change, remove layer
alive, killed
!15
Neuroevolution
AmoebaNet: State of the art on ImageNet, CIFAR-10

• Cell search space, aging evolution (kill oldest networks)

Real et al 2019
!16
Neuroevolution
AmoebaNet: State of the art on ImageNet, CIFAR-10

• Cell search space, aging evolution (kill oldest networks)

• More efficient than reinforcement learning
Real et al 2019
!17
Network morphisms
Chen et al. 2016, Wei et al. 2016, Cai et al. 2017
Cai et al. 2018, Elsken et al. 2017, Cortes et al. 2017, Cai et al. 2018
• Change architecture, but not modelled function

• Non-black box, allows much more efficient architecture search
init. identity
!18
Weight sharing
1 Saxena & Verbeek, 2016
• One-shot models (convolutional neural fabric) 1

• Search space: ConvNet = path though fabric with shared weights

• Train ensemble of all of them For 7-layer CNNs:
• Path dropout 2
• ensure that individual paths do well
2 Bender et al. 2018
3 Pham et al. 2018
4 Brock et al. 2017
• ENAS 3

• use RL to sample paths from one-shot model, train weights

• other paths inherit these weights
• SMASH 4

• Shared hypernetwork that predicts weights given architecture
DARTS: Differentiable NAS
Liu et al. 2018
convolution
max pooling
zero
• Fixed (one-shot) structure, learn which operators to use

• Give all operators a weight 

• Optimize and model weights using bilevel programming
αi
αi ωj
One-shot model operator weights αi
interleaved optimization
of and with SGDαi ωj!19
argmax αi
DARTS: Differentiable NAS
• Lots of further refinements

• SNAS

• Use Gumbel softmax (differentiable) on 

• Proxyless NAS

• Memory-efficient DARTS

• trains sparse architectures only

• …
αi
!20
[Xie et al 2019]
[Cai et al 2019]
Learning to learn from previous models
configurations
performances
Task 1 Task N
ModelsModelsModels
performance performance
LearningLearningLearning
LearningLearningLearning
ModelsModelsModels
… New Task
meta-learner
ModelsModelsModels
performance
model parameters
For models trained on intrinsically similar tasks
(model parameters, features,…)
intrinsically (very) similar

(e.g. shared representation)
!k
Pi,j
!21
λi
!22
Learning to learn
Learn a better gradient update rule
for a group of tasks
Learn a better initialization (and representation)
for a group of tasks
Learning to learn by gradient descent
• Our brains probably don’t do backprop, replace it with:
• Simple parametric (bio-inspired) rule to update weights 1
• Single-layer neural network to learn weight updates 2
• Learn parameters across tasks, by gradient descent (meta-gradient)
1 Bengio et al. 1995
2 Runarsson and Jonsson 2000
learning rate
presynaptic activity
reinforcing signal
Tasks
meta-learner
performance
ModelsModelsModels
Δ !i = " ( )
meta-gradient
weights λi!23
learning rate
learn λi
gradient
descent
λi
λinit
learner
Bengio et al.
Runarsson and Jonsson
Δ !i
Learning to learn gradient descent
2 Andrychowicz et al. 2016
1 Hochreiter 2001
• Replace backprop with a recurrent neural net (LSTM)1, not so scalable
• Use a coordinatewise LSTM [m] for scalability/flexibility (cfr. ADAM, RMSprop) 2
• Optimizee: receives weight update gt from optimizer
• Optimizer: receives gradient estimate ∇t from optimizee
• Learns how to do gradient descent across tasks
hidden state
optimisee
weights
New task
Model
meta-
model
by gradient descent
!24LSTM parameters shared for all !
Single 

network!
Learning to learn gradient descent
2 Andrychowicz et al. 2016
1 Hochreiter 2001
• Left: optimizer and optimizee trained to do style transfer
• Right: optimizer solves similar tasks (different style, content and
resolution) without any more training data
by gradient descent
!25
Application: Few-shot learning
• Learn how to learn from few examples (given similar tasks)
• Meta-learner must learn how to train a base-learner based on prior experience
• Parameterize base-learner model and learn the parameters !i
Ttrain
Image: Hugo Larochelle
meta-model
Model M
!i+1
Tj
Ttest
Ttest
!i
Pi,j
λk
!26
Pi+1,test
X_test
y_test
y_test
X_train y_train
Co st(θi) =
1
|Ttest | ∑
t∈Ttest
lo ss(θi, t) 1-shot, 5-class:
new classes!
meta
meta
Few-shot learning: approaches
!27
• Existing algorithm as meta-learner:
• LSTM + gradient descent
• Learn !init +gradient descent
• kNN-like: Memory + similarity
• Learn embedding + classifier
• …
• Black-box meta-learner
• Neural Turing machine (with memory)
• Neural attentive learner
• …
Co st(θi) =
1
|Ttest | ∑
t∈Ttest
lo ss(θi, t)
Santoro et al. 2016
Mishra et al. 2018
meta-model
Model M
!i+1
Tj Ttest
!i
Pi,j
λk
Pi+1,test
Ravi and Larochelle 2017
Finn et al. 2017
Vinyals et al. 2016
Snell et al. 2017
LSTM meta-learner + gradient descent
Ravi and Larochelle 2017
!28
Train Test
Co st(θT) =
1
|Ttest | ∑
t∈Ttest
lo ss(θT, t)
LSTM LSTM LSTM LSTM
M M M M M
• Gradient descent update !t is similar to LSTM cell state update ct
• Hence, training a meta-learner LSTM yields an update rule for training M
• Start from initial !0, train model on first batch, get gradient and loss update
• Predict !t+1 , continue to t=T, get cost, backpropagate to learn LSTM weights, optimal !0
forget
input
!29
Meta-dataset
Triantafillou et al. 2019
!30
Learning to learn
Learn a better gradient update rule
for a group of tasks
Learn a better initialization (and representation)
for a group of tasks
Transfer learning
• Pre-train weights from:
• Large image datasets (e.g. ImageNet) 1
• Large text corpora (e.g. Wikipedia) 2
• Use these weights as initialization for new task
• Fails if tasks are not similar enough 3
frozen new
pre-trained new
frozen
Source
tasks
ModelsModels
performance
LearningLearningLearning Feature extraction: 

remove last layers, use output as features

if task is quite different, remove more layers
End-to-end tuning: 

train from initialized weights
Fine-tuning: 

unfreeze last layers, tune on new task
sm
all target task
large

similar
large

different
filters
1 Razavian et al. 2014
3 Yosinski et al. 2014
2 Mikolov et al. 2013
new
pre-trained
convnet
!31
• Solve new tasks faster by learning a model initialization from
similar tasks
• Current initialization !
• On K examples/task, evaluate
• Update weights for
• Update ! to minimize
sum of per-task losses
• Repeat
Model-agnostic meta-learning
1 Finn et al. 2017
!32
∇θ LTi
( fθ)
θ1, θ2, θ3
• Solve new tasks faster by learning a model initialization from
similar tasks
• Current initialization !
• On K examples/task, evaluate
• Update weights for
• Update ! to minimize
sum of per-task losses
• Repeat
Model-agnostic meta-learning
1 Finn et al. 2017
!32
∇θ LTi
( fθ)
θ1, θ2, θ3
• Solve new tasks faster by learning a model initialization from
similar tasks
• Current initialization !
• On K examples/task, evaluate
• Update weights for
• Update ! to minimize
sum of per-task losses
• Repeat
Model-agnostic meta-learning
1 Finn et al. 2017
!32
∇θ LTi
( fθ)
θ1, θ2, θ3
Model-agnostic meta-learning
• More resilient to overfitting
• Generalizes better than LSTM approaches for few shot learning
• Universality 1: no theoretical downsides in terms of expressivity when
compared to alternative meta-learning models.
• Many variants and applications:
• REPTILE: do SGD for k steps in one task, only then update init. weights 2
• PLATIPUS: probabilistic MAML
• Bayesian MAML (Bayesian ensemble)
• Online MAML
• …
!33
1 Finn et al. 2018
2 Nichol et al. 2018
Finn & Levine 2019
!34
Model-agnostic meta-learning
• For reinforcement learning:
!35
Neural Architecture Meta-Learning
Wong et al. 2018• Warm-start a deep RL controller based on prior tasks

• Much faster than single-task equivalent
• Another idea: warm-start DARTS with task-specific one-shot model
Learning to reinforcement learn
!36
1 Duan et al. 2017
2 Wang et al. 2017
3 Duan et al. 2017
Environments
meta-RL
algorithm
performance
policy #θ
fast RL
agent
policy #θ
Similar env.
performance
• Humans often learn to play new games much faster than RL techniques do
• Reinforcement learning is very suited for learning-to-learn:
• Build a learner, then use performance as that learner as a reward
impl
• Learning to reinforcement learn 1,2
• Use RNN-based deep RL to train a
recurrent network on many tasks
• Learns to implement a‘fast’RL agent,
encoded in its weights
Learning to reinforcement learn
!37
• Also works for few-shot learning 3
• Condition on observation + upcoming demonstration
• You don’t know what someone is trying to teach you, but you
prepare for the lesson
1 Duan et al. 2017
2 Wang et al. 2017
3 Duan et al. 2017
Learning to learn more tasks
!38
• Active learning
• Deep network (learns representation) + policy network
• Receives state and reward, says which points to query next
• Density estimation
• Learn distribution over small set of images, can generate new ones
• Uses a MAML-based few-shot learner
• Matrix factorization
• Deep learning architecture that makes recommendations
• Meta-learner learns how to adjust biases for each user (task)
• Replace hand-crafted algorithms by learned ones.
• Look at problems through a meta-learning lens!
Pang et al. 2018
Reed et al. 2017
Vartak et al. 2017
Meta-data sharing
!39
import	openml	as	oml	
from	sklearn	import	tree	
task	=	oml.tasks.get_task(14951)	
clf	=	tree.ExtraTreeClassifier()	
flow	=	oml.flows.sklearn_to_flow(clf)	
run	=	oml.runs.run_flow_on_task(task,	flow)	
myrun	=	run.publish()
run locally, share globally
Vanschoren et al. 2014
• What if we have a shared memory of all machine learning experiments?
• OpenML.org
• Thousands of uniform datasets, 100+ meta-features
• Millions of evaluated runs
• Same splits, 30+ metrics
• Traces, models (opt)
• APIs in Python, R, Java,… Sklearn, Keras, PyTorch, MXNet,…
• Publish your own runs
• Never ending learning
• Benchmarks
building a shared memory
Meta-data sharing
!40
Vanschoren et al. 2014
models built
by humans
models built
by AutoML bots
building a shared memory
• Train and run AutoML bots on any OpenML dataset
• Never-ending learning (from robots and humans)
!41
AutoML tools and benchmarks
Open source benchmark on https://openml.github.io/automlbenchmark/
• On OpenML datasets, will regularly be updated, anyone can add
Gijsbers et al. 2019
!42
AutoML open source tools
Architecture
search
Hyperparameter
search
Improvements Meta-learning
Auto-WEKA
Parameterized
pipeline
Bayesian
Optimization (RF)
auto-sklearn
Parameterized
pipeline
Bayesian
Optimization (RF)
Ensembling warm-start
BO-HB
Parameterized
pipeline
Tree of Parzen
Estimators
Ensembling,
Hyperband
hyperopt-sklearn
Parameterized
pipeline
Tree of Parzen
Estimators
TPOT
Evolving pipelines
(trees)
Genetic
programming
GAMA
Evolving pipelines
(trees)
Asynchronous
evolution
Ensembling,
Hyperband
H2O AutoML Single algorithms random search Stacking
ML-Plan
Planning-based

pipeline
Hierarchical
planner
OBOE Single algorithms
Low rank
approximation
Ensembling Runtime predictor
!43
AutoML tools and benchmarks
Observations so far (for pipeline search):
• No AutoML system outperforms all others (Auto-WEKA is worse)
• Auto-sklearn, H2O, TPOT can outperform each other on given datasets
• Tuned RandomForests are a strong baseline
• High numbers of classes, unbalanced classes, are a problem for most tools
• Little support for anytime prediction …
We need good benchmarks for NAS as well…
!44
AutoML Gym
Environment to train general-purpose AutoML algorithms
• Open benchmark, meta-learning, continual learning, efficient NAS
Join
Us!
Open PostDoc position!

http://bit.ly/automl-gym
Towards human-like learning to learn
!45
• Learning-to-learn gives humans a significant advantage
• Learning how to learn any task empowers us far beyond knowing
how to learn specific tasks.
• It is a universal aspect of life, and how it evolves
• Very exciting field with many unexplored possibilities
• Many aspects not understood (e.g. task similarity), need more
experiments.
• Challenge:
• Build learners that never stop learning, that learn from each other
• Fast learning by slow meta-learning
• Build a global memory for learning systems to learn from
• Let them explore / experiment by themselves
Thank you!
special thanks to
Pavel Brazdil, Matthias Feurer, Frank Hutter, Erin Grant,
Hugo Larochelle, Raghu Rajan, Jan van Rijn, Jane Wang
more to learn
http://www.automl.org/book/
Chapter 2: Meta-Learning
!46
Never stop learning

Más contenido relacionado

La actualidad más candente

And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language ModelsLeon Dohmen
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023HyunJoon Jung
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introductionAnas Jamil
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLHimadri Mishra
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine LearningUpekha Vandebona
 
An Introduction to XAI! Towards Trusting Your ML Models!
An Introduction to XAI! Towards Trusting Your ML Models!An Introduction to XAI! Towards Trusting Your ML Models!
An Introduction to XAI! Towards Trusting Your ML Models!Mansour Saffar
 
Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101QuantUniversity
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Hayim Makabee
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Sri Ambati
 
Machine learning life cycle
Machine learning life cycleMachine learning life cycle
Machine learning life cycleRamjee Ganti
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language modelJiWenKim
 
META-LEARNING.pptx
META-LEARNING.pptxMETA-LEARNING.pptx
META-LEARNING.pptxAyanaRukasar
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineMichael Gerke
 

La actualidad más candente (20)

And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023
 
Machine learning introduction
Machine learning introductionMachine learning introduction
Machine learning introduction
 
Automatic Machine Learning, AutoML
Automatic Machine Learning, AutoMLAutomatic Machine Learning, AutoML
Automatic Machine Learning, AutoML
 
Feature Selection in Machine Learning
Feature Selection in Machine LearningFeature Selection in Machine Learning
Feature Selection in Machine Learning
 
An Introduction to XAI! Towards Trusting Your ML Models!
An Introduction to XAI! Towards Trusting Your ML Models!An Introduction to XAI! Towards Trusting Your ML Models!
An Introduction to XAI! Towards Trusting Your ML Models!
 
Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101Automatic machine learning (AutoML) 101
Automatic machine learning (AutoML) 101
 
Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)Explainable Machine Learning (Explainable ML)
Explainable Machine Learning (Explainable ML)
 
Transfer Learning
Transfer LearningTransfer Learning
Transfer Learning
 
Bert
BertBert
Bert
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Generative AI
Generative AIGenerative AI
Generative AI
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
 
Meta-Learning Presentation
Meta-Learning PresentationMeta-Learning Presentation
Meta-Learning Presentation
 
Journey of Generative AI
Journey of Generative AIJourney of Generative AI
Journey of Generative AI
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
Machine learning life cycle
Machine learning life cycleMachine learning life cycle
Machine learning life cycle
 
Pre trained language model
Pre trained language modelPre trained language model
Pre trained language model
 
META-LEARNING.pptx
META-LEARNING.pptxMETA-LEARNING.pptx
META-LEARNING.pptx
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 

Similar a AutoML lectures (ACDL 2019)

The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkIvo Andreev
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...PyData
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTuri, Inc.
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoMLArpitha Gurumurthy
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017Manish Pandey
 
MEME – An Integrated Tool For Advanced Computational Experiments
MEME – An Integrated Tool For Advanced Computational ExperimentsMEME – An Integrated Tool For Advanced Computational Experiments
MEME – An Integrated Tool For Advanced Computational ExperimentsGIScRG
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From ScratchSunghoon Joo
 
How to automate Machine Learning pipeline ?
How to automate Machine Learning pipeline ?How to automate Machine Learning pipeline ?
How to automate Machine Learning pipeline ?Axel de Romblay
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine LearningJoaquin Vanschoren
 
Combinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learningCombinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learning민재 정
 
Accelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer ModelsAccelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer ModelsPhilippe Laborie
 
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...MLconf
 
Matrix_Profile_Tutorial_Part1.pdf
Matrix_Profile_Tutorial_Part1.pdfMatrix_Profile_Tutorial_Part1.pdf
Matrix_Profile_Tutorial_Part1.pdfAndrea496281
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 r-kor
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptxShree Shree
 
Artificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software TestingArtificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software TestingLionel Briand
 

Similar a AutoML lectures (ACDL 2019) (20)

The Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it WorkThe Power of Auto ML and How Does it Work
The Power of Auto ML and How Does it Work
 
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
Driving Moore's Law with Python-Powered Machine Learning: An Insider's Perspe...
 
Towards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning BenchmarkTowards a Comprehensive Machine Learning Benchmark
Towards a Comprehensive Machine Learning Benchmark
 
Everything you need to know about AutoML
Everything you need to know about AutoMLEverything you need to know about AutoML
Everything you need to know about AutoML
 
Keynote at IWLS 2017
Keynote at IWLS 2017Keynote at IWLS 2017
Keynote at IWLS 2017
 
MEME – An Integrated Tool For Advanced Computational Experiments
MEME – An Integrated Tool For Advanced Computational ExperimentsMEME – An Integrated Tool For Advanced Computational Experiments
MEME – An Integrated Tool For Advanced Computational Experiments
 
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From ScratchPR-232:  AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
PR-232: AutoML-Zero:Evolving Machine Learning Algorithms From Scratch
 
AI and Deep Learning
AI and Deep Learning AI and Deep Learning
AI and Deep Learning
 
How to automate Machine Learning pipeline ?
How to automate Machine Learning pipeline ?How to automate Machine Learning pipeline ?
How to automate Machine Learning pipeline ?
 
Open and Automated Machine Learning
Open and Automated Machine LearningOpen and Automated Machine Learning
Open and Automated Machine Learning
 
Combinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learningCombinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learning
 
Accelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer ModelsAccelerating the Development of Efficient CP Optimizer Models
Accelerating the Development of Efficient CP Optimizer Models
 
C3 w1
C3 w1C3 w1
C3 w1
 
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal...
 
Matrix_Profile_Tutorial_Part1.pdf
Matrix_Profile_Tutorial_Part1.pdfMatrix_Profile_Tutorial_Part1.pdf
Matrix_Profile_Tutorial_Part1.pdf
 
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로 모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
모듈형 패키지를 활용한 나만의 기계학습 모형 만들기 - 회귀나무모형을 중심으로
 
04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx04-Data-Analysis-Overview.pptx
04-Data-Analysis-Overview.pptx
 
Artificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software TestingArtificial Intelligence for Automated Software Testing
Artificial Intelligence for Automated Software Testing
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
 

Más de Joaquin Vanschoren (18)

Meta learning tutorial
Meta learning tutorialMeta learning tutorial
Meta learning tutorial
 
OpenML 2019
OpenML 2019OpenML 2019
OpenML 2019
 
Exposé Ontology
Exposé OntologyExposé Ontology
Exposé Ontology
 
Designed Serendipity
Designed SerendipityDesigned Serendipity
Designed Serendipity
 
Learning how to learn
Learning how to learnLearning how to learn
Learning how to learn
 
OpenML NeurIPS2018
OpenML NeurIPS2018OpenML NeurIPS2018
OpenML NeurIPS2018
 
OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017OpenML Reproducibility in Machine Learning ICML2017
OpenML Reproducibility in Machine Learning ICML2017
 
OpenML DALI
OpenML DALIOpenML DALI
OpenML DALI
 
OpenML data@Sheffield
OpenML data@SheffieldOpenML data@Sheffield
OpenML data@Sheffield
 
OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015OpenML Tutorial ECMLPKDD 2015
OpenML Tutorial ECMLPKDD 2015
 
OpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine LearningOpenML Tutorial: Networked Science in Machine Learning
OpenML Tutorial: Networked Science in Machine Learning
 
Data science
Data scienceData science
Data science
 
OpenML 2014
OpenML 2014OpenML 2014
OpenML 2014
 
Open Machine Learning
Open Machine LearningOpen Machine Learning
Open Machine Learning
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop sensordata part2
Hadoop sensordata part2Hadoop sensordata part2
Hadoop sensordata part2
 
Hadoop sensordata part1
Hadoop sensordata part1Hadoop sensordata part1
Hadoop sensordata part1
 
Hadoop sensordata part3
Hadoop sensordata part3Hadoop sensordata part3
Hadoop sensordata part3
 

Último

Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPirithiRaju
 
PSP3 employability assessment form .docx
PSP3 employability assessment form .docxPSP3 employability assessment form .docx
PSP3 employability assessment form .docxmarwaahmad357
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestAkashDTejwani
 
MARSILEA notes in detail for II year Botany.ppt
MARSILEA  notes in detail for II year Botany.pptMARSILEA  notes in detail for II year Botany.ppt
MARSILEA notes in detail for II year Botany.pptaigil2
 
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WayShiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WaySérgio Sacani
 
Pests of Redgram_Identification, Binomics_Dr.UPR
Pests of Redgram_Identification, Binomics_Dr.UPRPests of Redgram_Identification, Binomics_Dr.UPR
Pests of Redgram_Identification, Binomics_Dr.UPRPirithiRaju
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsHassan Jolany
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabkiyorndlab
 
KeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceKeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceLayne Sadler
 
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Sérgio Sacani
 
Basic Concepts in Pharmacology in molecular .pptx
Basic Concepts in Pharmacology in molecular  .pptxBasic Concepts in Pharmacology in molecular  .pptx
Basic Concepts in Pharmacology in molecular .pptxVijayaKumarR28
 
Gene transfer in plants agrobacterium.pdf
Gene transfer in plants agrobacterium.pdfGene transfer in plants agrobacterium.pdf
Gene transfer in plants agrobacterium.pdfNetHelix
 
Human brain.. It's parts and function.
Human brain.. It's parts and function. Human brain.. It's parts and function.
Human brain.. It's parts and function. MUKTA MANJARI SAHOO
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba caohsadfeeling
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptxVaishnaviAware
 
Application of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxApplication of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxRahulVishwakarma71547
 
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...PirithiRaju
 
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfSUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfsantiagojoderickdoma
 
Bureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxBureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxkastureyashashree
 
Role of Herbs in Cosmetics in Cosmetic Science.
Role of Herbs in Cosmetics in Cosmetic Science.Role of Herbs in Cosmetics in Cosmetic Science.
Role of Herbs in Cosmetics in Cosmetic Science.ShwetaHattimare
 

Último (20)

Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdfPests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
Pests of wheat_Identification, Bionomics, Damage symptoms, IPM_Dr.UPR.pdf
 
PSP3 employability assessment form .docx
PSP3 employability assessment form .docxPSP3 employability assessment form .docx
PSP3 employability assessment form .docx
 
Substances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening TestSubstances in Common Use for Shahu College Screening Test
Substances in Common Use for Shahu College Screening Test
 
MARSILEA notes in detail for II year Botany.ppt
MARSILEA  notes in detail for II year Botany.pptMARSILEA  notes in detail for II year Botany.ppt
MARSILEA notes in detail for II year Botany.ppt
 
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky WayShiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
Shiva and Shakti: Presumed Proto-Galactic Fragments in the Inner Milky Way
 
Pests of Redgram_Identification, Binomics_Dr.UPR
Pests of Redgram_Identification, Binomics_Dr.UPRPests of Redgram_Identification, Binomics_Dr.UPR
Pests of Redgram_Identification, Binomics_Dr.UPR
 
geometric quantization on coadjoint orbits
geometric quantization on coadjoint orbitsgeometric quantization on coadjoint orbits
geometric quantization on coadjoint orbits
 
World Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlabWorld Water Day 22 March 2024 - kiyorndlab
World Water Day 22 March 2024 - kiyorndlab
 
KeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data scienceKeyBio pipeline for bioinformatics and data science
KeyBio pipeline for bioinformatics and data science
 
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...Identification of Superclusters and Their Properties in the Sloan Digital Sky...
Identification of Superclusters and Their Properties in the Sloan Digital Sky...
 
Basic Concepts in Pharmacology in molecular .pptx
Basic Concepts in Pharmacology in molecular  .pptxBasic Concepts in Pharmacology in molecular  .pptx
Basic Concepts in Pharmacology in molecular .pptx
 
Gene transfer in plants agrobacterium.pdf
Gene transfer in plants agrobacterium.pdfGene transfer in plants agrobacterium.pdf
Gene transfer in plants agrobacterium.pdf
 
Human brain.. It's parts and function.
Human brain.. It's parts and function. Human brain.. It's parts and function.
Human brain.. It's parts and function.
 
soft skills question paper set for bba ca
soft skills question paper set for bba casoft skills question paper set for bba ca
soft skills question paper set for bba ca
 
Role of herbs in hair care Amla and heena.pptx
Role of herbs in hair care  Amla and  heena.pptxRole of herbs in hair care  Amla and  heena.pptx
Role of herbs in hair care Amla and heena.pptx
 
Application of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptxApplication of Foraminiferal Ecology- Rahul.pptx
Application of Foraminiferal Ecology- Rahul.pptx
 
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
3.2 Pests of Sorghum_Identification, Symptoms and nature of damage, Binomics,...
 
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdfSUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
SUKDANAN DIAGNOSTIC TEST IN PHYSICAL SCIENCE ANSWER KEYY.pdf
 
Bureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptxBureau of Indian Standards Specification of Shampoo.pptx
Bureau of Indian Standards Specification of Shampoo.pptx
 
Role of Herbs in Cosmetics in Cosmetic Science.
Role of Herbs in Cosmetics in Cosmetic Science.Role of Herbs in Cosmetics in Cosmetic Science.
Role of Herbs in Cosmetics in Cosmetic Science.
 

AutoML lectures (ACDL 2019)

  • 1. Joaquin Vanschoren Eindhoven University of Technology j.vanschoren@tue.nl @joavanschoren http://bit.ly/openml Advanced Course on Data Science & Machine Learning Siena, 2019 Automatic Machine Learning & Meta-Learning slides! part 1 1
  • 2. Slides / Book Open access book PDF (free): www.automl.org/book www.amazon.de/dp/3030053172 More slides: www.automl.org/events -> AutoML Tutorial NeurIPS 2018 Video: www.youtube.com/watch?v=0eBR8a4MQ30 http://bit.ly/openml 2
  • 3. Learning takes experimentation !3 It can take a lot of trial and error Task Models performance Learning ModelsModelsModelsModels Learning episodes
  • 4. Learning to learn takes experimentation !4 Building machine learning systems requires a lot of expertise and trials Neural networks - design architecture (operators/layers) - many sensitive hyperparameters - optimization algorithm, learning rate,… - regularization, dropout,… - batch size, epochs, early stopping, … - data augmentation, … Task Models performance Learning ModelsModelsModelsModels
  • 5. !5 Hyperparameters Every design decision made by the user (architecture, operators, tuning,…)
  • 7. Learning to learn takes experimentation !7 Building machine learning systems requires a lot of expertise and trials Task Models performance Learning ModelsModelsModelsModels Machine learning pipelines - design pipeline architecture (operators) - clean, preprocess data (!) - select and/or engineer features - select and tune models - adjust to evolving input data (concept drift),…
  • 8. !8 Task Models performance Human expert ModelsModels chooses architecture and hyperparameters Task Models performance Meta-level learning and optimization ModelsModels chooses architecture and hyperparameters Replace manual model building by automation Automated machine learning Current practice ModelsModelsModels
  • 9. !9 AUTOML! THE DATA SCIENTIST’S #1 EXCUSE FOR LEGITIMATELY SLACKING OFF: “THE AUTOML TOOL IS OPTIMIZING MY MODELS!”
  • 10. !10 Task Models inner loop Meta-level learning and optimization ModelsModels chooses architecture and hyperparameters Human-in-the-loop AutoML (semi-AutoML) Domain knowledge and human expertise are very valuable e.g. unknown unknowns, preference learning,… ModelsModelsModels Human expert outer loop control adapt Models + explanations Sutton 2019 ask
  • 11. !11 Human-in-the-loop AutoML (semi-AutoML) Explain model in human language: automatic statistician Steinruecken 2019 search over grammar of models (kernels) data input model component to English translation report generation
  • 12. !12 Task ModelsOptimization ModelsModels optimize architecture and hyperparameters Overview AutoML introduction, promising optimization techniques Meta-learning part 1 part 2 part 3 Neural architecture search, meta-learning on neural nets Tasks ModelsMeta-learning ModelsModels optimize architecture and hyperparameters TasksTasks Tasks ModelsNAS, meta-learning ModelsModels optimize neural architecture, model/hyperparameters TasksTasks
  • 13. Task ModelsOptimization ModelsModels optimize architecture and hyperparameters Overview AutoML introduction, optimization techniquespart 1 1. Problem definition 2. (Pipeline) architecture search 3. Optimization techniques 4. Performance improvements 5. Benchmarks !13
  • 14. !14 AutoML Definition Let: {A(1) , A(2) , . . . , A(n) } be a set of algorithms (operators) Λ(i) = λ1 × λ2 × . . . × λm be the hyperparameter space for A(i) 𝒜 be the space of possible architectures of one or more algorithms Find the configuration that minimizes the expected loss on a dataset : λ ∈Λ𝒜 Λ𝒜 = Λ(1) × Λ(2) × . . . × Λ(m) its combined configuration space a specific configuration (of architecture and hyperparameters) ℒ(λ, Dtrain, Dvalid) the loss of the model created by , trained on data , and validated on data λ Dtrain Dvalid λ* = argmin λ∈Λ𝒜 𝔼(Dtrain,Dvalid)∼𝒟ℒ(λ, Dtrain, Dtest) 𝒟
  • 15. !15 Types of hyperparameters • Continuous (e.g. learning rate, SVM_C,…) • Integer (e.g. number of hidden units, number of boosting iterations,…) • Categorical • e.g. choice of algorithm (SVM, RandomForest, Neural Net,…) • e.g. choose of operator (Convolution, MaxPooling, DropOut,…) • e.g. activation function (ReLU, Leaky ReLU, tanh,…) • Conditional • e.g. SVM kernel if SVM is selected, kernel width if RBF kernel is selected • e.g. Convolution kernel size if Convolution layer is selected
  • 16. • We can identify two subproblems: • Architecture search: search space of all possible architectures • Pipelines: Fixed predefined pipeline, grammars, genetic programming, planning, Monte-Carlo Tree Search • Neural architectures: See part 3 • Hyperparameter optimization: optimize remaining hyperparameters • Optimization: grid/random search, Bayesian optimization, evolution, multi-armed bandits, gradient descent (only NNs) • Meta-learning: See part 2 • Can be solved consecutively, simultaneously or interleaved • Compositionality: breaking down the learning process into smaller reusable tasks makes it easier, more transferable, more robust Architecture vs hyperparameters !16
  • 17. • Parameterize best-practice (linear) pipelines • Introduce conditional hyperparameters • Combined Algorithm Selection and Hyperparameter optimization (CASH): Pipeline search: fixed pipelines λr ∈{A(1) , . . . , A(k) , ∅} Λ𝒜 = Λ(1) × Λ(2) × . . . × Λ(m) × λr Could such a pipeline be learned? Feurer et al. 2015, Thornton et al. 2013 !17
  • 18. Pipeline search: genetic programming • Tree-based pipeline optimization • Start with random pipelines, best of every generation will cross-over or mutate • GP primitives include data copy and feature joins: trees • Multi-objective optimization: accurate but short • Easy to parallelize • Asynchronous evolution (GAMA) Olson, Moore 2016, 2019 copy change insert shrink Gijsbers, Vanschoren 2019 (TPOT) !18
  • 20. Pipeline search: genetic programming • Grammar-based genetic programming (RECIPE) [de Sa et al. 2017] • Encodes background knowledge, avoids invalid pipelines • Cross-over and mutation respect the grammar production rule non-terminaloptional terminal !20
  • 21. Pipeline search: Monte Carlo Tree Search • Use MCTS to search for optimal pipelines • Optimize the structure and hyperparameters simultaneously by building a surrogate model to predict configuration performance • Bayesian surrogate model: MOSAIC [Rakotoarison et al. 2019] • Neural network: AlphaD3M [Drori et al. 2018] [Chaslot et al., 2008]!21
  • 22. Pipeline search: planning • Hierarchical planners: ML-Plan [Mohr et al. 2018], P4ML [Gil et al. 2018] • Graph search problem, can be solved with best first search • Use random path completion (as in MCTS) to evaluate each node • Hyperparameter optimization interleaved as possible actions score !22 tune
  • 23. Hyperparameter optimization (HPO) !23 Black box optimization: expensive for machine learning. Sample efficiency is important! λ* = argmin λ∈Λ𝒜 𝔼(Dtrain,Dvalid)∼𝒟ℒ(λ, Dtrain, Dtest)
  • 24. HPO: grid and random search !24 Bergstra & Bengio, JMLR 2012 • Random search handles unimportant dimensions be4er • Easily parallelizable, but uninformed (no learning)
  • 25. • Start with a few (random) hyperparameter configurations • Build a surrogate model to predict how well other configurations will work: mean and standard deviation (blue band) • Any probabilistic regression model: e.g. Gaussian processes • To avoid a greedy search, use an acquisition function to trade off exploration and exploitation, e.g. Expected Improvement (EI) • Sample for the best configuration under that function μ σ HPO: Bayesian Optimization Mockus [1974] μ μ + σ μ −σ
  • 26. • Repeat • Stopping criterion: • Fixed budget (time, evaluations) • Min. distance between configs • Threshold for acquisition function • Still overfits easily • Convergence results • Good for non-convex, noisy objectives • Used in AlphaGo ! HPO: Bayesian Optimization Srinivas et al. 2010, Freitas et al. 2012, Kawaguchi et al. 2016
  • 28. Gaussian processes + uncertainty, extrapolation - Scalability (cubic) Sparse GPs[Lawrence et al. 2003] Random embeddings [Wang et al. 2013] - Robustness? Meta-BayesOpt to optimize kernel [Malkomes et al. 2016] Neural networks + Bayesian LR [Snoek et al. 2015] + Scalable, Learns basis expansion Bayesian neural networks [Springenberg et al. 2016] + Bayesian - Scalability not studied yet Random Forests [Hutter et al. 2011, Feurer et al. 2015] Frequentist uncertainty estimate (variance) + scalable, complex HPs - uncertainty, extrapolation Φz P BLR surrogate φz(λ)i (λi,Pi,j) φz(λ) ! Bayesian Optimization: surrogate models !28
  • 29. Random Forests (SMAC) example: (minimizing)!29 interleaved with random search ! Bayesian Optimization: surrogate models
  • 31. Jeroen van Hoof (under review) Gradient Boosting Surrogate Estimate uncertainty with quantile regression !31
  • 32. !32 HPO: Tree of Parzen Estimators 1.Test some hyperparameters 2. Separate into good and bad hyperparameters (with some quantile) 3. Fit non-parametric KDE for and 4. For a few samples, evaluate Shown to be equivalent to EI! Efficient, parallelizable, robust, but less sample efficient than GPs Used in HyperOpt-sklearn [Komer et al. 2019] p(λ = good) p(λ = bad) p(λ = good) p(λ = bad)
  • 33. HPO: population-based methods !33 • Less sample efficient, but easy to parallelize, and adapts to changes • Gene@c programming • Muta@ons: add, mutate/tune, remove HP • Par@cle swarm op@miza@on • Covariance matrix adapta@on evolu@on (CMA-ES) • Purely con@nuous, expensive • Very compe@@ve to op@mize deep neural nets Olson, Moore 2016, 2019 Mantovani et al 2015 mutate Hansen 2015 [Loshilov, Hu-er 2016]
  • 34. !34 AUTOML IS STILL RUNNING AutoML performance improvements Making it practically useful (inspired by humans) OH, COME ON!
  • 35. !35 Multi-fidelity approaches General techniques for cheap approximations of black-box λ* = argmin λ∈Λ𝒜 𝔼(Dtrain,Dvalid)∼𝒟ℒ(λ, Dtrain, Dtest) data subsets fewer epochs, shorter MCMC chains, fewer RL trials,… downsampled images λmf
  • 36. !36 Multi-fidelity approaches Cheap low-fidelity surrogates: train on small subsets, infer which regions may be interesting to evaluate in more depth • Evaluate random samples on smallest set • Update a Bayesian surrogate model • Select fewer points based on surrogate, evaluate on more data, repeat Swersky et al. 2013, 2014; Klein et al 2017; Kandasamy et al 2017
  • 37. !37 Multi-fidelity approaches Successive halving: • Randomly sample candidates and evaluate on a small data sample • retrain the best half candidates on twice the data Jamieson & Talwalkar 2016 1/16 1/8 1/4 1/2
  • 38. !38 Multi-fidelity approaches Successive halving: • Randomly sample candidates and evaluate on a small data sample • retrain the best half candidates on twice the data Jamieson & Talwalkar 2016
  • 39. !39 Multi-fidelity approaches Hyperband: Repeated, decreasingly aggressive successive halving • Minimizes (doesn’t eliminate) chance that candidate was pruned to early • Strong anytime performance, easy to implement, scalable, parallelizable Li et al. 2017
  • 40. !40 Multi-fidelity approaches Combined Bayesian Optimization and Hyperband (BO-HB) • BayesOpt: choose which configurations to evaluate • Hyperband: allocate budgets more efficiently • Strong anytime and final performance Falkner et al. 2018
  • 41. !41 Early stopping Learning curve prediction • Model learning curves as parametric functions, fit • Train a Bayesian Neural Net to predict learning curves [Domhan et al 2015] [Klein et al 2017] Data allocation using upper bounds (DAUB) • Fit linear function through latest estimates • Compute upper performance bound Sabharwal et al. 2015 CNN learning curves
  • 42. Hyperparameter gradient descent • Bilevel program • Outer obj.: optimize • Inner obj.: optimize weights given • Derive through the entire SGD • Get hypergradients wrt. validation loss • Expensive! But useful if you have many hyper parameters and high parallelism • Interleave optimization steps • Alternate SGD steps for and λ λ ω λ [Franceschi et al 2018] Optimize neural network hyperparameters and weights simultaneously [MacLaurin et al 2015] [Luketina et al 2016] !42
  • 43. !43 Meta-learning Meta-learning can drastically speed up the architecture and hyperparameter search, in combination with the optimization techniques we just discussed • Learn which hyperparameters are really important • Learn which hyperparameter values should be tried first • Learn which architectures will most likely work • Learn how to clean and annotate data • Learn which feature representations to use • …
  • 44. Thank you! To be continued… Part 2: meta-learning Image: Pesah et al. 2018!44
  • 45. Joaquin Vanschoren Eindhoven University of Technology j.vanschoren@tue.nl @joavanschoren http://bit.ly/openml Advanced Course on Data Science & Machine Learning Siena, 2019 Automatic Machine Learning & Meta-Learning slides! part 2
  • 46. !2 Task ModelsOptimization ModelsModels optimize architecture and hyperparameters Overview AutoML introduction, promising optimization techniques Meta-learning part 1 part 2 part 3 Neural architecture search, meta-learning on neural nets Tasks ModelsMeta-learning ModelsModels optimize architecture and hyperparameters TasksTasks Tasks ModelsNAS, meta-learning ModelsModels optimize neural architecture, model/hyperparameters TasksTasks
  • 47. Learning is a never-ending process !3 Humans don’t learn from scratch
  • 48. Learning is a never-ending process !4 Learning humans also seek/create related tasks E.g. find many similar puzzles, solve them in different ways,…
  • 49. Learning is a never-ending process Task 1 Task 2 Task 3 Models ModelsModelsModels Model performance performance performance Learning ModelsModelsModelsModels Learning Learning Learning episodes meta- learning meta- learning meta- learning !5 Humans learn across tasks Why? Requires less trial-and-error, less data
  • 50. Task 1 Task 2 Task 3 ModelsModelsModels performance performance performance LearningLearning Learning ModelsModelsModels ModelsModelsModels inductive bias !6 Inductive bias: assumptions added to the training data to learn effectively If prior tasks are similar, we can transfer prior knowledge to new tasks Learning to learn New Task performance ModelsModelsModels Learning prior beliefs constraints model parameters representations training data
  • 51. Task 1 Task 2 Task 3 ModelsModelsModels performance performance performance LearningLearningLearners LearningLearningLearners LearningLearningLearners ModelsModelsModels ModelsModelsModels }meta-data !7 Meta-learning New Task performance ModelsModelsModels meta-learner base-learner additional experiments Meta-learner learns a (base-)learning algorithm, based on meta-data
  • 52. Meta-data New Task meta-learner ModelsModelsModels Task 1 Task j ModelsModelsModels performance performance LearningLearningLearningLearning ModelsModelsModels … performance !8 Learners Learners Pi,j λi mj !i,j,k Pi,j λi mj !i,j,k Description of task j (meta-features) Configuration i (architecture + hyperparameters) Model parameters (e.g. weights) Performance estimate of model built with λi on task j
  • 53. How? New Task meta-learner ModelsModelsModels Task 1 Task j ModelsModelsModels performance performance LearningLearningLearningLearning ModelsModelsModels … performance 1. Learn from observing learning algorithms (for any task) 2. Learn what may likely work (for somewhat similar tasks) 3. Learn from previous models (for very similar tasks) !9 Learners Learners Pi,j λi mj !i,j,k Pi,jλi mj !i,j,k Pi,jλi Pi,jλi(mj)
  • 54. 1. Learn from observing learning algorithms Define, store and use meta-data: - configurations: settings that uniquely define the model - performance (e.g. accuracy) on specific tasks New Task meta-learner ModelsModelsModels configurations performances Task 1 Task j ModelsModelsModels performance performance LearningLearningLearningLearning ModelsModelsModels … performance λi Pi,j !10 Learners Learners (hyperparameters, pipeline, neural architecture,…)
  • 55. 1. Learn from observing learning algorithms Define, store and use meta-data: - configurations: settings that uniquely define the model - performance (e.g. accuracy) on specific tasks New Task meta-learner ModelsModelsModels configurations performances Task 1 Task j ModelsModelsModels performance performance LearningLearningLearningLearning ModelsModelsModels … performance λi Pi,j !10 Learners Learners (hyperparameters, pipeline, neural architecture,…) task model config λ score P 0 alg=SVM, C=0.1 0,876 0 alg=NN, lr=0.9 0,921 0 alg=RF, mtry=0.1 0,936 1 alg=SVM, C=0.2 0,674 1 alg=NN, lr=0.5 0,777 1 alg=RF, mtry=0.4 0,791
  • 56. Rankings • Build a global (multi-objective) ranking, recommend the top-K • Can be used as a warm start for optimization techniques • E.g. Bayesian optimization, evolutionary techniques,… Tasks ModelsModelsModels performance LearningLearningLearning 1. λa 2. λb 3. λc 4. λd 5. λe 6. … New Task meta-learner ModelsModelsModels performance Global ranking (task agnostic) λa..k warm start Leite et al. 2012 Abdulrahman et al. 2018 } (discrete) (multi-objective) Pi,j !11 λi
  • 57. • What if prior configurations are not optimal? • Per task, fit a differentiable plugin estimator on all evaluated configurations • Do gradient descent to find optimized configurations, recommend those Tasks ModelsModelsModels performance LearningLearningLearning New Task meta-learner ModelsModelsModels performance Warm-starting with plugin estimators λ*i per task: task 1: λ*1 task 2: λ*2 task 3: λ*3 … Wistuba et al. 2015 λi }Pi,j !12
  • 58. • Functional ANOVA 1 Select hyperparameters that cause variance in the evaluations. Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance Hyperparameter behavior hyperparameter importance 1 van Rijn & Hutter 2018 λ1 λ2 λ3 λ4 constraints priors Pi,j !13 λi
  • 59. 1 van Rijn & Hutter 2018 SVM (RBF) Hyperparameter behavior • Functional ANOVA 1 Select hyperparameters that cause variance in the evaluations. AdaBoost
  • 60. 1 van Rijn & Hutter 2018 ResNets for image classification (under review) • Functional ANOVA 1 Select hyperparameters that cause variance in the evaluations. Hyperparameter behavior
  • 61. 1 van Rijn & Hutter 2018 Priors (KDE) for the most important hyperparameters • Functional ANOVA 1 Select hyperparameters that cause variance in the evaluations. Learn priors for hyperparameter tuning (e.g. Bayesian optimization) Hyperparameter behavior
  • 62. • Functional ANOVA 1 Select hyperparameters that cause variance in the evaluations. • Tunability 2,3,4 Learn good defaults, measure improvement from tuning over defaults 1 van Rijn & Hutter 2018 2 Probst et al. 2018 !17 3 Weerts et al. 2018Hyperparameter behavior 4 van Rijn et al. 2018
  • 63. • Search space pruning Exclude regions yielding bad performance on (similar) tasks Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance constraints Pi,j Wistuba et al. 2015 P λ1 !18 λi λ2 Hyperparameter behavior
  • 64. • Experiments on the new task can tell us how it is similar to previous tasks • Task are similar if observed performance of configurations is similar • Use this to recommend new configurations, run, repeat Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance Learning task similarity λc Pi,j !19 λi Pc,new ≅ Pc,j
  • 65. • Learn task similarity while tuning configurations • Tournament-style selection, warm-start with overall best config λbest • Next candidate λc : the one that beats current λbest on similar tasks Tasks ModelsModelsModels performance LearningLearningLearning } New Task meta-learner ModelsModelsModels performance Active testing Leite et al. 2012 λc Select λc > λbest on similar tasks (discrete) Pi,j !20 λi Sim(tj,tnew) = Corr([RLa,b,j],[RLa,b,new]) Pc,j
  • 66. • If task j is similar to the new task, its surrogate model Sj will likely transfer well • Sum up all Sj predictions, weighted by task similarity (as in active testing)1 • Build combined Gaussian process, weighted by current performance on new task2 Tasks ModelsModelsModels performance LearningLearningLearning New Task meta-learner ModelsModelsModels performance per task tj: Pi,j } Surrogate model transfer 1 Wistuba et al. 2018 λi P Sj 2 Feurer et al. 2018 S= ∑ wj Sj + + S1 S2 S3 !21 λi
  • 67. • Multi-task Gaussian processes: train surrogate model on t tasks simultaneously1 • If tasks are similar: transfers useful info • Not very scalable • Bayesian Neural Networks as surrogate model2 • Multi-task, more scalable • Stacking Gaussian Process regressors (Google Vizier)3 • Continual learning (sequential similar tasks) • Transfers a prior based on residuals of previous GP Multi-task Bayesian optimization 1 Swersky et al. 2013 Independent GP predictions Multi-task GP predictions 2 Springenberg et al. 2016 3 Golovin et al. 2017 !22
  • 68. • Bayesian linear regression (BLR) surrogate model on every task • Use neural net to learn a suitable basis expansion ϕz(λ) for all tasks • Scales linearly in # observations, transfers info on configuration space Tasks ModelsModelsModels performance LearningLearningLearning New Task meta-learner ModelsModelsModels performance } More scalable variants Perrone et al. 2018 P BLR surrogate (λi,Pi,j) φz(λ)i warm-start (pre-train) λi Bayesian optimization φz(λ) BLR hyperparameters Pi,j !23 λi
  • 69. 2. Learn what may likely work (for partially similar tasks) Meta-features: measurable properties of the tasks (number of instances and features, class imbalance, feature skewness,…) configurations performances similar mj ?Task 1 Task N ModelsModelsModels performance performance LearningLearningLearning LearningLearningLearning ModelsModelsModels … meta-features New Task meta-learner ModelsModelsModels performance mj Pi,j !24 λi
  • 70. 2. Learn what may likely work (for partially similar tasks) Meta-features: measurable properties of the tasks (number of instances and features, class imbalance, feature skewness,…) configurations performances similar mj ?Task 1 Task N ModelsModelsModels performance performance LearningLearningLearning LearningLearningLearning ModelsModelsModels … meta-features New Task meta-learner ModelsModelsModels performance mj Pi,j !24 λi meta- features model config λ score P 60000,5,0.2 alg=SVM, C=0.1 0,876 60000,5,0.2 alg=NN, lr=0.9 0,921 60000,5,0.2 alg=RF, mtry=0.1 0,936 11233,4,0.1 alg=SVM, C=0.2 0,674 11233,4,0.1 alg=NN, lr=0.5 0,777 11233,4,0.1 alg=RF, mtry=0.4 0,791
  • 71. • Hand-crafted (interpretable) meta-features1 • Number of instances, features, classes, missing values, outliers,… • Statistical: skewness, kurtosis, correlation, covariance, sparsity, variance,… • Information-theoretic: class entropy, mutual information, noise-signal ratio,… • Model-based: properties of simple models trained on the task • Landmarkers: performance of fast algorithms trained on the task • Domain specific task properties • Optimized (recommended) representations exist 2 Meta-features 1 Vanschoren 2018 !25 2 Rivolli et al 2019
  • 72. • Learning a joint task representation • Deep metric learning: learn a representation hmf using ground truth distance • Siamese Network: similar task, similar representation 1 • Feed through pretrained CNN, extract vector from weights 2 • Recommend neural architectures, or warm-start neural architecture search Meta-features 1 Kim et al. 2017 !26 2 Achille et al. 2019
  • 73. • Find k most similar tasks, warm-start search with best λi • Auto-sklearn: Bayesian optimization (SMAC) • Meta-learning yield better models, faster • Winner of AutoML Challenges Tasks ModelsModelsModels performance LearningLearningLearning New Task meta-learner ModelsModelsModels performance Pi,j } Warm-starting from similar tasks λ1..k mj best λi on similar tasks Feurer et al. 2015 !27 λi Bayesian optimization λ P λ1 λ3 λ2 λ4
  • 74. Warm-starting from similar tasks Fusi et al. 2017 Pi,j λi TL λL tj tnew warm-started with λ1..k . . .. . . .. . . . . P λiλLi P p(P|λLi) latent representation • Learn latent representation for tasks and configurations • Use meta-features to warm-start on new task • Returns probabilistic predictions for BayesOpt • Collaborative filtering: configurations λi are `rated’by tasks tj New Task meta-learner ModelsModelsModels performance λ1..k Pi,j } mj (discrete) λi
  • 75. • Train a task-independent surrogate model with meta-features in inputs • SCOT: Predict ranking of λi with surrogate ranking model + mj. 1 • Predict Pi,j with multilayer Perceptron surrogates + mj. 2 • Build joint GP surrogate model on most similar ( ) tasks. 3 • Scalability is often an issue Tasks ModelsModelsModels performance LearningLearningLearning New Task meta-learner ModelsModelsModels performance Pi,j } Global surrogate models mj 1 Bardenet et al. 2013 2 Schilling et al. 2015 mj ✖ λi P 3 Yogatama et al. 2014 2 !29 λi task1 task2 joint
  • 76. • Learn direct mapping between meta-features and Pi,j • Zero-shot meta-models: predict best λi given meta-features 1 • Ranking models: return ranking λ1..k 2 • Predict which algorithms / configurations to consider / tune3 • Predict performance / runtime for given !i and task4 • Can be integrated in larger AutoML systems: warm start, guide search,… meta-learner Meta-models λbest 1 Brazdil et al. 2009, Lemke et al. 2015 2 Sun and Pfahringer 2013, Pinto et al. 2017 meta-learner λ1..k mj mj meta-learner Pijmj, λi 3 Sanders and C. Giraud-Carrier 2017 meta-learner Λmj 4 Yang et al. 2018 !30
  • 77. Learning to learn through self-play !31 • Build pipelines by inserting, deleting, replacing components (actions) • Neural network (LSTM) receives task meta-features, pipelines and evaluations • Predicts pipeline performance and action probabilities • Monte Carlo Tree Search builds pipelines, runs simulations against LSTM Drori et al 2017 New Task meta-learner ModelsModelsModels performance self-play mj λi
  • 78. 3. Learn from previous models (for very similar tasks) configurations performances Task 1 Task N ModelsModelsModels performance performance LearningLearningLearning LearningLearningLearning ModelsModelsModels … New Task meta-learner ModelsModelsModels performance model parameters Models trained on intrinsically similar tasks (model parameters, features,…) intrinsically (very) similar (e.g. shared representation) !k Pi,j !32 λi
  • 79. 3. Learn from previous models (for very similar tasks) configurations performances Task 1 Task N ModelsModelsModels performance performance LearningLearningLearning LearningLearningLearning ModelsModelsModels … New Task meta-learner ModelsModelsModels performance model parameters Models trained on intrinsically similar tasks (model parameters, features,…) intrinsically (very) similar (e.g. shared representation) !k Pi,j !32 λi model param ! model config λ score P 0.34,0.43,… alg=NN, lr=0.9 0,876 0.04,0.03,… alg=NN, lr=0.9 0,921 0.94,0.03,… alg=NN, lr=0.9 0,936 0.34,0.23,… alg=NN, lr=0.5 0,674 0.04,0.23,… alg=NN, lr=0.5 0,777 0.94,0.23,… alg=NN, lr=0.5 0,791
  • 80. Transfer Learning !33 Source tasks ModelsModelsModels performance LearningLearningLearning Target task meta-learner performancePi,j !33 ModelsModelsModels • Select source tasks, transfer trained models to similar target task 1 • Use as starting point for tuning, or freeze certain aspects (e.g. structure) • Bayesian networks: start structure search from prior model 2 • Reinforcement learning: start policy search from prior policy 3 1 Thrun and Pratt 1998 Bayesian Network transfer Reinforcement learning: 2D to 3D mountain car 3 Taylor and Stone 2009 2 Niculescu-Mizil and Caruana 2005 !k λi
  • 81. Learning to learn Neural networks !35!35 • End-to-end differentiable models afford many meta-learning opportunities • Transfer learning • Learning gradient descent procedures (weight update rules) • Few shot learning • Model-agnostic meta-learning (learning weight initializations) • Learning to reinforcement learn • … See part 3
  • 82. Joaquin Vanschoren Eindhoven University of Technology j.vanschoren@tue.nl @joavanschoren http://bit.ly/openml Advanced Course on Data Science & Machine Learning Siena, 2019 Automatic Machine Learning & Meta-Learning slides! part 3
  • 83. !2 Task ModelsOptimization ModelsModels optimize architecture and hyperparameters Overview AutoML introduction, promising optimization techniques Meta-learning part 1 part 2 part 3 Neural architecture search, meta-learning on neural nets Tasks ModelsMeta-learning ModelsModels optimize architecture and hyperparameters TasksTasks Tasks ModelsNAS, meta-learning ModelsModels optimize neural architecture, model/hyperparameters TasksTasks
  • 84. !3 Overview part 3 Neural architecture search, meta-learning on neural nets Tasks ModelsNAS, meta-learning ModelsModels optimize neural architecture, model/hyperparameters TasksTasks 1. Neural Architecture Search space 2. Neural Architecture optimization 3. (Neural) Meta-Learning
  • 85. !4 Neural Architecture Search space Sequential (chain) Arbitrary graph Choose: • number of layers • type of layers • dense • convolutional • max-pooling • … • hyperparameters of layers + easier to search - sometimes too simple Choose: • branching • joins • skip connections • types of layers • hyperparameters of layers + more flexible - much harder to search Elsken et al. 2019
  • 86. !5 Neural Architecture Search space Observation: successful deep networks have repeated motifs (cells) e.g. Inception v4:
  • 87. !6 Cell search space • parameterize building blocks (cells) • e.g. regular cell + resolution reduction cell • stack cells together in macro-architecture • usually a chain • can be manual or learned + smaller search space + cells can be learned on a small dataset & transferred to a larger dataset - you can’t learn entirely new architectures Zoph et al 2018
  • 88. !7 Within-cell search space • Different strategies to construct cell, leading to different parameterizations • Can be parameterized as set of categorical hyperparameters: • Which existing layer (hidden state, e.g. cell input) to build on • Which operation (e.g. 3x3conv) to perform on • How to combine into new hidden state • Iterate over B phases (blocks) Hi Hi Zoph et al 2018
  • 89. !8 Neural architecture optimization Standard hyperparameter optimization techniques 1 Bergstra et al. 2013 • Kernels to optimize neural nets with GP-based Bayesian optimization • ArcNet2: DNNs, 23 hyperparameters, 6 kernels • NASBOT3: DNNs and CNNs • Image classification pipelines1 • 238 hyperparameters, tuned with TPE 2 Swersky et al 2013 3 Kandasamy et al. 2018
  • 90. !9 Neural architecture optimization • AutoNet 1 : DNNs, 63 hyperparameters, tuned with SMAC • Joint NAS + HPO 2: ResNets, tuned with BO-HB • PNAS (Progressive NAS) 3 • Cell search space, optimized with SMAC, HPO afterwards • SotA on ImageNet, CIFAR 1 Mendoza et al. 2016 3 Liu et al. 2018 Standard hyperparameter optimization techniques 2 Zela et al. 2018
  • 91. !10 1. Observe state 2. Take action u based on policy ( are weights in an RNN) 3. New state is formed 4. Take further actions based on new state 5. After trajectory of motions, adjust policy based on total reward R( ) πθ θ τ τ Deep RL recap policy gradient estimate (H steps, m samples) HuiLevine R
  • 92. !11 NAS with Reinforcement learning πθ Zoph & Le 2017 RNN policy network: • generate architecture step by step • evaluate to get reward • update with policy gradient 2-layer LSTM (REINFORCE), chains of convolutional layers • State of the art on CIFAR-10, Penn Treebank • 800 GPUs for 3-4 weeks, 12800 architectures
  • 93. !12 NAS with Reinforcement learning πθ RNN policy network: • generate architecture step by step • evaluate to get reward • update with policy gradient 1-layer LSTM (PPO), cell space search • State of the art on ImageNet • 450 GPUs, 20000 architectures Zoph et al 2018
  • 94. !13 Neuroevolution Angeline et al 1994, Stanley and Miikkulainen 2002 Earlier solutions tried to: • Optimize both the configuration and the weights simultaneously with genetic algorithms • Optimize individual nodes and connections • Doesn’t scale to deep networks
  • 95. !14 Neuroevolution Real et al 2017; Miikkulainen et al 2017 Better: • learn the architecture and configuration with evolutionary techniques • train the network with SGD • mutation steps: add, change, remove layer alive, killed
  • 96. !15 Neuroevolution AmoebaNet: State of the art on ImageNet, CIFAR-10 • Cell search space, aging evolution (kill oldest networks) Real et al 2019
  • 97. !16 Neuroevolution AmoebaNet: State of the art on ImageNet, CIFAR-10 • Cell search space, aging evolution (kill oldest networks) • More efficient than reinforcement learning Real et al 2019
  • 98. !17 Network morphisms Chen et al. 2016, Wei et al. 2016, Cai et al. 2017 Cai et al. 2018, Elsken et al. 2017, Cortes et al. 2017, Cai et al. 2018 • Change architecture, but not modelled function • Non-black box, allows much more efficient architecture search init. identity
  • 99. !18 Weight sharing 1 Saxena & Verbeek, 2016 • One-shot models (convolutional neural fabric) 1 • Search space: ConvNet = path though fabric with shared weights • Train ensemble of all of them For 7-layer CNNs: • Path dropout 2 • ensure that individual paths do well 2 Bender et al. 2018 3 Pham et al. 2018 4 Brock et al. 2017 • ENAS 3 • use RL to sample paths from one-shot model, train weights • other paths inherit these weights • SMASH 4 • Shared hypernetwork that predicts weights given architecture
  • 100. DARTS: Differentiable NAS Liu et al. 2018 convolution max pooling zero • Fixed (one-shot) structure, learn which operators to use • Give all operators a weight • Optimize and model weights using bilevel programming αi αi ωj One-shot model operator weights αi interleaved optimization of and with SGDαi ωj!19 argmax αi
  • 101. DARTS: Differentiable NAS • Lots of further refinements • SNAS • Use Gumbel softmax (differentiable) on • Proxyless NAS • Memory-efficient DARTS • trains sparse architectures only • … αi !20 [Xie et al 2019] [Cai et al 2019]
  • 102. Learning to learn from previous models configurations performances Task 1 Task N ModelsModelsModels performance performance LearningLearningLearning LearningLearningLearning ModelsModelsModels … New Task meta-learner ModelsModelsModels performance model parameters For models trained on intrinsically similar tasks (model parameters, features,…) intrinsically (very) similar (e.g. shared representation) !k Pi,j !21 λi
  • 103. !22 Learning to learn Learn a better gradient update rule for a group of tasks Learn a better initialization (and representation) for a group of tasks
  • 104. Learning to learn by gradient descent • Our brains probably don’t do backprop, replace it with: • Simple parametric (bio-inspired) rule to update weights 1 • Single-layer neural network to learn weight updates 2 • Learn parameters across tasks, by gradient descent (meta-gradient) 1 Bengio et al. 1995 2 Runarsson and Jonsson 2000 learning rate presynaptic activity reinforcing signal Tasks meta-learner performance ModelsModelsModels Δ !i = " ( ) meta-gradient weights λi!23 learning rate learn λi gradient descent λi λinit learner Bengio et al. Runarsson and Jonsson Δ !i
  • 105. Learning to learn gradient descent 2 Andrychowicz et al. 2016 1 Hochreiter 2001 • Replace backprop with a recurrent neural net (LSTM)1, not so scalable • Use a coordinatewise LSTM [m] for scalability/flexibility (cfr. ADAM, RMSprop) 2 • Optimizee: receives weight update gt from optimizer • Optimizer: receives gradient estimate ∇t from optimizee • Learns how to do gradient descent across tasks hidden state optimisee weights New task Model meta- model by gradient descent !24LSTM parameters shared for all ! Single network!
  • 106. Learning to learn gradient descent 2 Andrychowicz et al. 2016 1 Hochreiter 2001 • Left: optimizer and optimizee trained to do style transfer • Right: optimizer solves similar tasks (different style, content and resolution) without any more training data by gradient descent !25
  • 107. Application: Few-shot learning • Learn how to learn from few examples (given similar tasks) • Meta-learner must learn how to train a base-learner based on prior experience • Parameterize base-learner model and learn the parameters !i Ttrain Image: Hugo Larochelle meta-model Model M !i+1 Tj Ttest Ttest !i Pi,j λk !26 Pi+1,test X_test y_test y_test X_train y_train Co st(θi) = 1 |Ttest | ∑ t∈Ttest lo ss(θi, t) 1-shot, 5-class: new classes! meta meta
  • 108. Few-shot learning: approaches !27 • Existing algorithm as meta-learner: • LSTM + gradient descent • Learn !init +gradient descent • kNN-like: Memory + similarity • Learn embedding + classifier • … • Black-box meta-learner • Neural Turing machine (with memory) • Neural attentive learner • … Co st(θi) = 1 |Ttest | ∑ t∈Ttest lo ss(θi, t) Santoro et al. 2016 Mishra et al. 2018 meta-model Model M !i+1 Tj Ttest !i Pi,j λk Pi+1,test Ravi and Larochelle 2017 Finn et al. 2017 Vinyals et al. 2016 Snell et al. 2017
  • 109. LSTM meta-learner + gradient descent Ravi and Larochelle 2017 !28 Train Test Co st(θT) = 1 |Ttest | ∑ t∈Ttest lo ss(θT, t) LSTM LSTM LSTM LSTM M M M M M • Gradient descent update !t is similar to LSTM cell state update ct • Hence, training a meta-learner LSTM yields an update rule for training M • Start from initial !0, train model on first batch, get gradient and loss update • Predict !t+1 , continue to t=T, get cost, backpropagate to learn LSTM weights, optimal !0 forget input
  • 111. !30 Learning to learn Learn a better gradient update rule for a group of tasks Learn a better initialization (and representation) for a group of tasks
  • 112. Transfer learning • Pre-train weights from: • Large image datasets (e.g. ImageNet) 1 • Large text corpora (e.g. Wikipedia) 2 • Use these weights as initialization for new task • Fails if tasks are not similar enough 3 frozen new pre-trained new frozen Source tasks ModelsModels performance LearningLearningLearning Feature extraction: remove last layers, use output as features if task is quite different, remove more layers End-to-end tuning: train from initialized weights Fine-tuning: unfreeze last layers, tune on new task sm all target task large similar large different filters 1 Razavian et al. 2014 3 Yosinski et al. 2014 2 Mikolov et al. 2013 new pre-trained convnet !31
  • 113. • Solve new tasks faster by learning a model initialization from similar tasks • Current initialization ! • On K examples/task, evaluate • Update weights for • Update ! to minimize sum of per-task losses • Repeat Model-agnostic meta-learning 1 Finn et al. 2017 !32 ∇θ LTi ( fθ) θ1, θ2, θ3
  • 114. • Solve new tasks faster by learning a model initialization from similar tasks • Current initialization ! • On K examples/task, evaluate • Update weights for • Update ! to minimize sum of per-task losses • Repeat Model-agnostic meta-learning 1 Finn et al. 2017 !32 ∇θ LTi ( fθ) θ1, θ2, θ3
  • 115. • Solve new tasks faster by learning a model initialization from similar tasks • Current initialization ! • On K examples/task, evaluate • Update weights for • Update ! to minimize sum of per-task losses • Repeat Model-agnostic meta-learning 1 Finn et al. 2017 !32 ∇θ LTi ( fθ) θ1, θ2, θ3
  • 116. Model-agnostic meta-learning • More resilient to overfitting • Generalizes better than LSTM approaches for few shot learning • Universality 1: no theoretical downsides in terms of expressivity when compared to alternative meta-learning models. • Many variants and applications: • REPTILE: do SGD for k steps in one task, only then update init. weights 2 • PLATIPUS: probabilistic MAML • Bayesian MAML (Bayesian ensemble) • Online MAML • … !33 1 Finn et al. 2018 2 Nichol et al. 2018 Finn & Levine 2019
  • 117. !34 Model-agnostic meta-learning • For reinforcement learning:
  • 118. !35 Neural Architecture Meta-Learning Wong et al. 2018• Warm-start a deep RL controller based on prior tasks • Much faster than single-task equivalent • Another idea: warm-start DARTS with task-specific one-shot model
  • 119. Learning to reinforcement learn !36 1 Duan et al. 2017 2 Wang et al. 2017 3 Duan et al. 2017 Environments meta-RL algorithm performance policy #θ fast RL agent policy #θ Similar env. performance • Humans often learn to play new games much faster than RL techniques do • Reinforcement learning is very suited for learning-to-learn: • Build a learner, then use performance as that learner as a reward impl • Learning to reinforcement learn 1,2 • Use RNN-based deep RL to train a recurrent network on many tasks • Learns to implement a‘fast’RL agent, encoded in its weights
  • 120. Learning to reinforcement learn !37 • Also works for few-shot learning 3 • Condition on observation + upcoming demonstration • You don’t know what someone is trying to teach you, but you prepare for the lesson 1 Duan et al. 2017 2 Wang et al. 2017 3 Duan et al. 2017
  • 121. Learning to learn more tasks !38 • Active learning • Deep network (learns representation) + policy network • Receives state and reward, says which points to query next • Density estimation • Learn distribution over small set of images, can generate new ones • Uses a MAML-based few-shot learner • Matrix factorization • Deep learning architecture that makes recommendations • Meta-learner learns how to adjust biases for each user (task) • Replace hand-crafted algorithms by learned ones. • Look at problems through a meta-learning lens! Pang et al. 2018 Reed et al. 2017 Vartak et al. 2017
  • 122. Meta-data sharing !39 import openml as oml from sklearn import tree task = oml.tasks.get_task(14951) clf = tree.ExtraTreeClassifier() flow = oml.flows.sklearn_to_flow(clf) run = oml.runs.run_flow_on_task(task, flow) myrun = run.publish() run locally, share globally Vanschoren et al. 2014 • What if we have a shared memory of all machine learning experiments? • OpenML.org • Thousands of uniform datasets, 100+ meta-features • Millions of evaluated runs • Same splits, 30+ metrics • Traces, models (opt) • APIs in Python, R, Java,… Sklearn, Keras, PyTorch, MXNet,… • Publish your own runs • Never ending learning • Benchmarks building a shared memory
  • 123. Meta-data sharing !40 Vanschoren et al. 2014 models built by humans models built by AutoML bots building a shared memory • Train and run AutoML bots on any OpenML dataset • Never-ending learning (from robots and humans)
  • 124. !41 AutoML tools and benchmarks Open source benchmark on https://openml.github.io/automlbenchmark/ • On OpenML datasets, will regularly be updated, anyone can add Gijsbers et al. 2019
  • 125. !42 AutoML open source tools Architecture search Hyperparameter search Improvements Meta-learning Auto-WEKA Parameterized pipeline Bayesian Optimization (RF) auto-sklearn Parameterized pipeline Bayesian Optimization (RF) Ensembling warm-start BO-HB Parameterized pipeline Tree of Parzen Estimators Ensembling, Hyperband hyperopt-sklearn Parameterized pipeline Tree of Parzen Estimators TPOT Evolving pipelines (trees) Genetic programming GAMA Evolving pipelines (trees) Asynchronous evolution Ensembling, Hyperband H2O AutoML Single algorithms random search Stacking ML-Plan Planning-based pipeline Hierarchical planner OBOE Single algorithms Low rank approximation Ensembling Runtime predictor
  • 126. !43 AutoML tools and benchmarks Observations so far (for pipeline search): • No AutoML system outperforms all others (Auto-WEKA is worse) • Auto-sklearn, H2O, TPOT can outperform each other on given datasets • Tuned RandomForests are a strong baseline • High numbers of classes, unbalanced classes, are a problem for most tools • Little support for anytime prediction … We need good benchmarks for NAS as well…
  • 127. !44 AutoML Gym Environment to train general-purpose AutoML algorithms • Open benchmark, meta-learning, continual learning, efficient NAS Join Us! Open PostDoc position! http://bit.ly/automl-gym
  • 128. Towards human-like learning to learn !45 • Learning-to-learn gives humans a significant advantage • Learning how to learn any task empowers us far beyond knowing how to learn specific tasks. • It is a universal aspect of life, and how it evolves • Very exciting field with many unexplored possibilities • Many aspects not understood (e.g. task similarity), need more experiments. • Challenge: • Build learners that never stop learning, that learn from each other • Fast learning by slow meta-learning • Build a global memory for learning systems to learn from • Let them explore / experiment by themselves
  • 129. Thank you! special thanks to Pavel Brazdil, Matthias Feurer, Frank Hutter, Erin Grant, Hugo Larochelle, Raghu Rajan, Jan van Rijn, Jane Wang more to learn http://www.automl.org/book/ Chapter 2: Meta-Learning !46 Never stop learning