Understanding Variational Autoencoder from Various Perspectives

Copyright ⓒ 2017 Haezoom INC. All Rights Reserved.
Understanding Variational Autoencoder
from Various Perspectives
Sangwoong Yoon, Haezoom INC.
sw.yoon@haezoom.com
http://www.haezoom.com (official)
http://story.haezoom.com (blog)
BCTJV @SNU
2017.06.09
Includes Korean subtitles and supplementary explanations

Motivation
This talk aims to share my intuition on Variational Autoencoder,
connecting VAE to other machine learning algorithms.
For a more general, introductory (yet in-depth) tutorial, you may consult
• (Doersch, 2016) “Tutorial on Variational Autoencoders”
• Or other great materials on the web.
These slides were made to share my personal interpretation of,
and intuition about, the Variational Autoencoder.
For a more general and basic tutorial,
please refer to other materials on the web.
Unverified claims and intuitions are mixed in,
so some parts may be wrong or open to debate.
If you email the author, the slides can be corrected or we can discuss them together.

Overview
1. Learning Probability Distributions
2. Variational Autoencoders
3. Autoencoders and Latent Variable Models
(Originally included was the demonstration of TensorFlow implementation
of Variational Autoencoder)
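1. Explains VAE from the viewpoint of learning a probability distribution from data.
2. Explains the Evidence Lower Bound, the objective function of VAE, as intuitively as possible.
3. Examines the relationship between autoencoders and latent variable models, and considers how VAE differs
from conventional autoencoders.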

1. Learning Probability Distributions

The ultimate goal of statistical learning
Learning an underlying distribution from finite data
The ultimate goal of statistical learning is to recover, from a finite number of data points,
the probability distribution that generated the data.

Why?
Because if you recover the true probability distribution P(𝑋),
you know everything about 𝑋
(Besides computational issues…)
Because knowing the probability distribution of a variable means
knowing everything about that variable.
(Setting computational issues aside.)

What can we do with $p(X)$?
1. Inference
• Arbitrary (or general) inference
• Conditioning and marginalization
• Given $P(X, Y, Z)$, $P(X \mid Y) = ?$
• Regression and classification are also inference
2. Sampling
• Generate $X \sim P(X)$
What can we do once we know the probability distribution of a random variable (or vector)?
Broadly, two things: inference and sampling.
Inference means finding the distribution of the variable we care about under given conditions. For example, with random
variables X, Y, Z, it means finding the distribution of X when Z is marginalized out and Y is conditioned on.
Supervised learning can also be seen as an instance of this.

Sampling means generating new samples (data) with frequencies that follow the distribution.
The two look simple, but together they cover almost everything we want to do in statistical learning.
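Inference and sampling can be made concrete on a small discrete example. The sketch below assumes a hypothetical joint table over three binary variables (the table values are random, not from the talk), computes $P(X \mid Y)$ by marginalizing Z and conditioning on Y, and then draws samples that follow the joint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint table P(X, Y, Z) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()                      # normalize so the table sums to 1

# Inference: marginalize Z out, then condition on Y.
p_xy = joint.sum(axis=2)                  # P(X, Y), indexed [x, y]
p_y = p_xy.sum(axis=0)                    # P(Y)
p_x_given_y = p_xy / p_y                  # column y holds P(X | Y = y)

# Sampling: draw (x, y, z) triples with frequencies that follow the joint.
flat_idx = rng.choice(joint.size, size=20000, p=joint.ravel())
samples = np.stack(np.unravel_index(flat_idx, joint.shape), axis=1)
empirical = np.bincount(flat_idx, minlength=joint.size).reshape(joint.shape) / len(flat_idx)
```

With enough samples, `empirical` approaches `joint`, which is the sense in which sampling "follows" the distribution.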

Then,
How can we learn a distribution from data?
So, given data, how can we learn a probability distribution?

Method 1: Parametric Distribution Model
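• Set a parametric distribution model $P_\theta(X)$, then find $\theta$
• Possibly with maximum likelihood (ML) or maximum a posteriori (MAP)
Example: Gaussian distribution model with ML
Given $\{X_1, \dots, X_N\}$, $X_i \in \mathbb{R}^D$
$$p(X; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu)^T \Sigma^{-1} (X - \mu)\right)$$
Then find $\mu, \Sigma$ such that
$$\mu^* = \arg\max_{\mu} \sum_i \log p(X_i; \mu, \Sigma)$$
$$\Sigma^* = \arg\max_{\Sigma} \sum_i \log p(X_i; \mu, \Sigma)$$
The most basic approach is to set up a parametric model that is convenient to compute with, and fit its
parameters to the data.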
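For the Gaussian model, the maximum-likelihood solution has a well-known closed form: the sample mean and the (1/N) sample covariance. The sketch below uses synthetic data with made-up true parameters, purely for illustration, and checks that the fitted parameters give a log-likelihood at least as high as the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])                    # illustrative values only
true_Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=5000)

# Closed-form maximum-likelihood estimates.
mu_hat = X.mean(axis=0)
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)         # ML uses 1/N, not 1/(N-1)

def gaussian_log_density(X, mu, Sigma):
    """Row-wise log p(X; mu, Sigma) for a multivariate Gaussian."""
    D = X.shape[1]
    diff = X - mu
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(Sigma), diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

ll_hat = gaussian_log_density(X, mu_hat, Sigma_hat).sum()
ll_true = gaussian_log_density(X, true_mu, true_Sigma).sum()
```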

Method 1: Parametric Distribution Model
Strength
• Simple and straightforward
• Faster convergence, given a correct model
• Which is rarely the case in machine learning
Weakness
• Very restricted expressive power
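Computation is simple and fast. If the true distribution lies inside the chosen parametric family,
this approach has the fastest convergence rate. Of course, that is rarely the case in machine learning.
Gaussian and other exponential-family distributions are widely used, but their forms are
too restricted to express the diverse and complex distributions of the real world.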

Method 2: Non-Parametric Distribution Model
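• Estimate $P(X)$ from local neighbors
Example: Kernel density estimation (aka Parzen windows)
Given $\{X_1, \dots, X_N\}$, $X_i \in \mathbb{R}^D$
$$k_\sigma(X_i, X_j) = \frac{1}{(2\pi)^{D/2} \sigma^D} \exp\left(-\frac{\|X_i - X_j\|^2}{2\sigma^2}\right)$$
Our estimate: $\hat{p}(X) = \frac{1}{N} \sum_i k_\sigma(X, X_i)$
There are also methods, such as kernel density estimation (KDE) and k-NN density estimation, that use
neighboring points to estimate the distribution from local information.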
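A minimal sketch of the Gaussian kernel density estimate, applied to synthetic 1-D data. The bandwidth `sigma=0.3` and the evaluation grid are arbitrary choices for illustration.

```python
import numpy as np

def kde(query, data, sigma):
    """Gaussian KDE: average of k_sigma(query, X_i) over all data points X_i."""
    D = data.shape[1]
    sq_dist = ((query[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    norm = (2 * np.pi) ** (D / 2) * sigma ** D
    return np.exp(-sq_dist / (2 * sigma ** 2)).mean(axis=1) / norm

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 1))          # samples from N(0, 1)
grid = np.linspace(-6, 6, 601)[:, None]        # evaluation points
density = kde(grid, data, sigma=0.3)

# A valid density estimate should integrate to about 1 (Riemann sum on the grid).
area = density.sum() * (grid[1, 0] - grid[0, 0])
```

The estimate depends on the distance inside the kernel and on the bandwidth, which is why the choice of distance measure matters for this family of methods.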

Method 2: Non-Parametric Distribution Model
Strength
• Flexible
Weakness
• Slower convergence
• Dependent on distance measure
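Because no parametric form is assumed, these methods are very flexible. In exchange, convergence is slower.
Deciding which points count as neighbors requires a distance measure, so choosing or learning that
measure becomes important.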

Method 3: Factored Distribution Model (i.e., PGM)
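• Probabilistic Graphical Models (PGMs) represent a distribution as a product of
smaller, factored distributions
Examples
• Bayesian Networks: $P(X) = \prod_i P(X_i \mid Pa(X_i))$, where $Pa(\cdot)$ means parents
• Markov Random Fields: $P(X) \propto \prod_k \phi(C_k)$, where $C_k$ means the k-th clique
Probabilistic graphical models (PGMs) are another long-studied way of representing probability distributions.
A PGM expresses the full distribution as a product of smaller distributions or potential functions.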
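As a small illustration of the Bayesian-network factorization, the sketch below assumes a hypothetical three-node chain A → B → C with made-up conditional probability tables, builds the joint as the product of the factors, and answers two inference queries from it.

```python
import numpy as np

# Hypothetical chain A -> B -> C over binary variables (tables are made up).
p_a = np.array([0.7, 0.3])                     # P(A)
p_b_given_a = np.array([[0.9, 0.1],            # P(B | A=0)
                        [0.2, 0.8]])           # P(B | A=1)
p_c_given_b = np.array([[0.6, 0.4],            # P(C | B=0)
                        [0.1, 0.9]])           # P(C | B=1)

# joint[a, b, c] = P(a) * P(b | a) * P(c | b)
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]

# Arbitrary inference from the joint: a marginal and a posterior.
p_c = joint.sum(axis=(0, 1))                   # P(C)
p_a_given_c1 = joint[:, :, 1].sum(axis=1) / p_c[1]   # P(A | C=1)
```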

Method 3: Factored Distribution Model (i.e., PGM)
Strength
• Able to perform arbitrary inference
• Able to embed domain knowledge into the graph structure
• Able to treat latent variables → more complex distributions
Weakness
• The relationships between nodes are still simple
• Learning with (many) latent variables is very difficult
• E.g. Deep Boltzmann Machine
• Do we really need the capability to perform arbitrary inference?
PGMs are mainly aimed at performing inference, so arbitrary inference between variables is possible. However,
the relationships between nodes are still expressed in a fairly simple way.

Mini wrap-up
1. Expressivity is an issue
2. Seems like we can’t have everything at the same time
• Accurate estimate of $P(X)$
• Traditional density estimation methods
• Ability to perform arbitrary inference
• PGMs
• Ability to generate samples
• Some PGMs
“the ability to generate samples”
seems to be the most important point in the deep learning era
Existing methods for learning probability distributions lack the expressive power to learn, for example, the distribution of images.
Each model also has clear strengths and weaknesses, and no single model seems to do both inference and sampling well.
Unlike the earlier focus on inference, sampling is an important task in the deep learning era,
because we want AI to generate images, language, sound, and so on.

In the meantime,
In the supervised learning world…
Increasing model complexity:
• Linear functions: linear regression, logistic regression, linear SVM, …
• Linear functions with basis functions: Radial Basis Function Network, SVM with kernels, …
• Linear functions with adaptive basis functions: neural networks, gradient boosting, random forests, …
Meanwhile, in supervised learning, a great deal of research has gone into very high-complexity models such as neural networks.

Neural Networks
• Very expressive
• Universal function approximator
• ConvNets → effective model for images
• Recurrent NNs → effective model for sequences
• Very scalable
• Trained in an online manner (SGD)
Works nicely for supervised function approximation
Neural networks are highly expressive. They are known to approximate arbitrary functions to sufficient accuracy,
and there are ConvNets, which encode image data well, and RNNs, which encode sequence data well. Because training
does not read all the data at once, they scale with the size of the data.
These properties are behind their strong results in supervised learning.

Then,
Can’t we apply a neural net
to learn probability distribution?
Can't we use a neural network to learn a probability distribution?

Naïve approach – supervised learning for density estimation?
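(Figure: a network that takes $X$ as input and outputs $p(X)$)
We don’t know the value of $p(X)$.
Therefore this is not possible.
If we knew the probability density values for the training data, we could apply supervised learning directly.
But no matter how much data we have, we cannot know the exact density values, so we cannot learn
the density by direct regression.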

Neural network as a distribution
1. Let’s focus on sampling (generation)
2. Neural nets are awesome at mapping
(Figure: $Z \to X$, with $Z \sim N(0, \mathbf{I})$ and $X = f(Z)$)
Let's focus on two things.
1. Build a model that is good at sampling rather than inference. (Pick our battles?)
2. Neural networks are very good at transforming one vector into another.
Key idea
• First, sample from a well-known distribution (e.g. Gaussian)
• Then, map via a neural network!
The key idea is this.
First, draw a sample from a distribution we know very well (e.g. a normal distribution).
Then map it into data space through a neural network.
The same approach is used in Generative Adversarial Networks.
Personally, this is the part I find most moving...
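The key idea can be sketched in a few lines. The network below is an untrained two-layer MLP with randomly initialized weights, used only as a stand-in for a learned $f$; the layer sizes and the `tanh` activation are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, hidden_dim, data_dim = 2, 16, 5    # arbitrary sizes

# Untrained, randomly initialized weights (placeholders for learned parameters).
W1 = rng.normal(scale=0.5, size=(hidden_dim, latent_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.5, size=(data_dim, hidden_dim))
b2 = np.zeros(data_dim)

def f(Z):
    """Map latent vectors Z (one per row) into data space."""
    H = np.tanh(Z @ W1.T + b1)
    return H @ W2.T + b2

Z = rng.standard_normal((1000, latent_dim))    # step 1: Z ~ N(0, I)
X = f(Z)                                       # step 2: X = f(Z)
```

Sampling costs one draw from a Gaussian plus one forward pass; all the modeling effort goes into learning the weights of `f`.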

Neural network as a distribution
Strictly speaking, the NN is modeling $p(X \mid Z)$
(Figure: $Z \to X$)
$Z \sim N(0, \mathbf{I})$
$p(X \mid Z) = f(Z)$
$X = f(Z)$
Note: this is not the reparametrization trick
Discussion: is it outputting a probability distribution? Or a value?
In fact, strictly speaking, the neural network ends up modeling a conditional probability.

2. Variational Autoencoders
This section explains the Evidence Lower Bound, the training objective of VAE, from the viewpoint of
modeling probability distributions with neural networks.

Spoiler 1
Latent Variable Model
• We aim to model $X$
• Assumption: there exists a latent structure $Z$
A model that assumes latent variables hidden behind the observed phenomenon is called a
latent variable model.
For handwritten digit data, for example, the angle, the stroke thickness, and the digit itself could be latent variables.
VAE is also a latent variable model, and the sampled variable $Z$ mentioned earlier is its latent variable.
(Figure from Kingma and Welling, 2014)

Spoiler 2
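(Figure: VAE architecture. Encoder (recognition model, variational distribution): $X \to N(\mu(X), \sigma(X)^2 \mathbf{I})$. Reparametrization trick: noise $\epsilon$ is combined with $\mu$ and $\sigma$ to give $Z$. Decoder: $Z \to \hat{X}$. Objective $L$: Evidence Lower Bound (ELBO).)
Before the detailed explanation, let's look at the overall structure of VAE. It consists of an encoder network and a decoder network.
The encoder takes $X$ as input and outputs a distribution from which the latent variable $Z$ can be sampled.
The sampled $Z$ passes through the decoder to produce $\hat{X}$, a reconstruction of $X$.
The objective is a function called the Evidence Lower Bound, and to train the whole model by gradient descent,
the sampling step is replaced with the reparametrization trick.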

Returning to the main point
I want to justify (for myself) why VAE should look the way it does.
(Especially without getting into classical variational inference)
The questions are…
Why do we need an encoder?
Why don’t we just apply maximum likelihood?
How can we intuitively motivate the derivation of the ELBO?
Many answers have been given by other articles such as (Doersch, 2016).
However, I reorganized them in my own way.
What I want to do here is justify (to myself, and without borrowing the vocabulary of variational
inference) why we need a structure like VAE. The questions I want to answer are: why we need an
encoder, why we cannot simply use maximum likelihood estimation, and how the derivation of the Evidence Lower Bound (ELBO)
can be motivated qualitatively.
These points have been discussed in many places before, but I have reorganized them in my own way.

Okay, let’s go maximum likelihood!
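(Figure: $Z \to X$)
$Z \sim N(0, \mathbf{I})$
$p_\theta(X \mid Z) = f_\theta(Z)$
$$\theta^* = \arg\max_\theta \prod_i p_\theta(X_i) = \arg\max_\theta \prod_i \int p_\theta(X_i \mid Z)\, p(Z)\, dZ$$
Let's consider learning the model parameters by maximum likelihood. We said the network models a conditional
distribution and that $Z$ is sampled from a standard normal, so writing down the likelihood is not hard.
The problem is the integral that appears here, which is often hard to solve even when the model is not a neural network.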

This problem looks familiar
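(Figure: plate diagram $Z \to X$, repeated over $N$ data points)
$$\theta^* = \arg\max_\theta \prod_i p(X_i) = \arg\max_\theta \prod_i \int p(X_i \mid Z)\, p(Z)\, dZ$$
• Maximum likelihood when a latent variable $Z$ is present
• Treat $Z$ as a missing variable
• $p(X)$: “marginal likelihood”
Typically, Expectation-Maximization (EM) can be a solution.
But we are considering the case where EM is not applicable. (Kingma and Welling, 2014)
The EM algorithm was designed for exactly this situation. For a detailed explanation of EM, see (Bishop, 2006).
Normally we would just apply EM, but here we consider the case where EM cannot be applied.
The details are in the original VAE paper (Kingma and Welling, 2014), but we assume the following two conditions.
1. The posterior $p(Z \mid X)$ cannot be computed analytically (because the model is a neural network)
2. There is a very large amount of data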

Preliminary
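Monte Carlo approximation
$$E_Z[f(Z)] = \int f(Z)\, p(Z)\, dZ \approx \frac{1}{N} \sum_{i=1}^{N} f(Z_i), \quad Z_i \sim p(Z)$$
Jensen’s inequality
$$E_X[f(X)] \le f(E_X[X]), \quad f \text{ is concave}$$
To think through a naive maximum likelihood scenario that does not yet have the form of VAE, here is some brief background.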
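Both tools can be checked numerically. The sketch below uses a standard normal $Z$ and two illustrative test functions chosen here (not taken from the slides): $f(Z) = Z^2$ for the Monte Carlo estimate, whose exact expectation is 1, and the concave function $\log$ applied to the positive variable $Y = e^Z$ for Jensen's inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal(200000)          # Z_i ~ p(Z) = N(0, 1)

# Monte Carlo approximation of E[Z^2]; the exact value is 1.
mc_estimate = np.mean(Z ** 2)

# Jensen's inequality with the concave function log and Y = exp(Z) > 0.
Y = np.exp(Z)
lhs = np.mean(np.log(Y))                 # E[log Y] = E[Z], exactly 0 in expectation
rhs = np.log(np.mean(Y))                 # log E[Y], exactly 0.5 in expectation
```

Here `lhs <= rhs` holds for any sample, since Jensen's inequality also applies to the empirical distribution.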

Naïve attempt to apply gradient descent
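(Figure: $Z \to X$)
$$\begin{aligned}
L(X_i) &= \log p(X_i) \\
&= \log \int p(X_i \mid Z)\, p(Z)\, dZ \\
&= \log E_{p(Z)}\left[p(X_i \mid Z)\right] \\
&\ge E_{p(Z)}\left[\log p(X_i \mid Z)\right] \\
&\approx \frac{1}{M} \sum_{j=1}^{M} \log p(X_i \mid Z_j), \quad Z_j \sim p(Z)
\end{aligned}$$
This works out how a network that maps $Z$ to $X$ could be trained by maximum (marginal) likelihood.
The integral over $Z$ makes direct computation hard, but Jensen's inequality and the Monte Carlo approximation give
a lower bound on the marginal likelihood.
Reading the resulting expression qualitatively gives the following algorithm.
1. Sample an arbitrary $Z_j$.
2. Take a gradient step that pulls the network output for $Z_j$ closer to the actual data point $X_i$.
3. Keep repeating this over i and j.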
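The three steps can be tried on a toy problem. The sketch below makes several simplifying assumptions that are not in the slides: a linear decoder $f(Z) = WZ + b$, a unit-variance Gaussian likelihood (so maximizing $\log p(X_i \mid Z_j)$ amounts to minimizing squared error), and synthetic two-cluster data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated clusters in 2-D (stand-ins for two "digits").
X = np.concatenate([rng.normal(-2.0, 0.1, size=(500, 2)),
                    rng.normal(+2.0, 0.1, size=(500, 2))])

W = rng.normal(size=(2, 2))              # linear decoder f(Z) = W Z + b
b = np.zeros(2)
lr, batch = 0.05, 256

for step in range(2000):
    Xi = X[rng.integers(0, len(X), size=batch)]   # data points X_i
    Zj = rng.standard_normal((batch, 2))          # step 1: Z_j ~ p(Z), independent of X_i
    resid = Zj @ W.T + b - Xi                     # f(Z_j) - X_i
    # Step 2: gradient step on the mean of 0.5 * ||f(Z_j) - X_i||^2.
    W -= lr * (resid.T @ Zj / batch)
    b -= lr * resid.mean(axis=0)
```

In this toy setting `W` shrinks toward zero and `b` settles near the overall data mean: since $Z_j$ carries no information about which $X_i$ it is paired with, the decoder learns to ignore $Z$ and outputs a point between the two clusters.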

Unfortunately,
This won’t work.

Why? – a qualitative explanation
Suppose that we’re training on MNIST.
An ideal, hypothetical 𝑍 space might look like
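(Figure: the space of $Z$, with one region of $Z$ mapped to digit ‘1’ and another region of $Z$ mapped to digit ‘7’)
Regions in $Z$ “differentiate” to represent characteristics of $X$.
If training finished ideally, each region of $Z$ should have differentiated to represent a particular type of $X$.
As a hypothetical example, regions of $Z$ could be learned so that they correspond to the digits of MNIST.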

If we implement the naïve approach,
We keep sampling $Z$ from a fixed distribution regardless of $X_i$
(Figure: the space of $Z$; for a sampled $Z_j$, $f(Z_j)$ is trained to match $X_1, X_2, X_3, \cdots$)
Regions in $Z$ are very difficult to differentiate because $Z$
is sampled from all over the space
The naive approach examined above can be pictured as in the figure.
We keep sampling $Z$ from the same distribution and train $f(Z)$ to become $X$, so it is hard for regions of $Z$
to differentiate and represent particular properties of particular $X$.
This may be an intuitive explanation of the “high variance”
mentioned in (Kingma and Welling, 2014)

Sample 𝑍𝑗 from something meaningful
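Let’s use $E_{p(Z_j \mid X_i)}$ instead of $E_{p(Z_j)}$
($p(Z_j \mid X_i)$ is the posterior)
We can now get “differentiating” samples
Actually, this is EM.
However, $p(Z_j \mid X_i)$ is intractable.
(Posterior of a neural net!)
The problem is that we keep sampling $Z$ from the same distribution. So what if we changed the distribution that $Z$ is sampled from
as the model learns? For example, what if we drew from something like the posterior $p(Z \mid X_i)$? That is the EM algorithm.
With a neural network, however, sampling from the posterior is very difficult.
So we use variational inference, which approximates the posterior.
Therefore, we go variational.
(We approximate the posterior)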

Approximation of the posterior – variational distribution
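Since we will never know the posterior $p(Z \mid X_i)$,
we approximate it with a variational distribution $q_\phi(Z \mid X_i)$
$$L \cong E_{p(Z)}\left[\log p(X_i \mid Z)\right]$$
is replaced by
$$L \cong E_{q_\phi(Z \mid X_i)}\left[\log p(X_i \mid Z)\right]$$
With a sufficiently good $q_\phi$, we will (probably) get better gradients
The distribution $q_\phi(Z \mid X)$ used to approximate the posterior is called the variational distribution.
Using $Z$ sampled from the variational distribution should give better gradients for training the model.
(Probably. Qualitatively speaking.)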

A Neural Network as a variational distribution
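We let $q_\phi(Z \mid X_i)$ be a Gaussian distribution $N(\mu(X_i), \sigma(X_i)^2 \mathbf{I})$
$\mu(X_i) = \mathrm{MLP}$, $\sigma^2(X_i) = \mathrm{MLP}$
(Figure: a network that takes $X_i$ as input and outputs $\mu$ and $\sigma$)
Why Gaussian?
1. It’s a variational distribution, so we are the boss
2. Computational advantage
   1. Easy to sample
   2. Analytic expression for the KL divergence between Gaussians
We build the variational distribution with a neural network as well.
If the network outputs the mean and standard deviation of a Gaussian, it can represent a distribution conditioned on the input $X_i$.
Why a Gaussian? Because it is convenient to compute with. It is an approximation anyway, so computational convenience comes first.
It also helps that the KL divergence between two Gaussians is easy to compute.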
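A minimal sketch of such an encoder, with untrained random weights standing in for learned ones. Having the network output $\log \sigma^2$ rather than $\sigma$ directly is an assumption made here to keep the variance positive; the layer sizes and `tanh` activation are likewise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
data_dim, hidden_dim, latent_dim = 5, 16, 2    # arbitrary sizes

# Untrained, randomly initialized weights (placeholders for learned phi).
W_h = rng.normal(scale=0.3, size=(hidden_dim, data_dim))
b_h = np.zeros(hidden_dim)
W_mu = rng.normal(scale=0.3, size=(latent_dim, hidden_dim))
b_mu = np.zeros(latent_dim)
W_lv = rng.normal(scale=0.3, size=(latent_dim, hidden_dim))
b_lv = np.zeros(latent_dim)

def encode(X):
    """Return the mean and log-variance of q(Z | X) for each row of X."""
    H = np.tanh(X @ W_h.T + b_h)
    mu = H @ W_mu.T + b_mu
    log_var = H @ W_lv.T + b_lv        # assumption: predict log sigma^2, not sigma
    return mu, log_var

X = rng.standard_normal((4, data_dim))
mu, log_var = encode(X)
sigma = np.exp(0.5 * log_var)
Z = mu + sigma * rng.standard_normal(mu.shape)   # one sample of Z per input X_i
```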

However,
How much does $E_{q_\phi(Z \mid X_i)}\left[\log p(X_i \mid Z)\right]$ deviate
from the marginal likelihood $p(X_i)$?
Using the variational distribution, we have built a scheme suited to differentiating $Z$. But if we train this way,
how far do we end up from our original goal, $p(X_i)$?

Deriving the evidence lower bound (ELBO)
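Our original goal: $\log p_\theta(X_i)$
What we have now: $E_{q_\phi(Z \mid X_i)}\left[\log p_\theta(X_i \mid Z)\right]$
How far is the approximate approach we built from our ultimate goal, the marginal likelihood?
Manipulating the expressions to get from one side to the other leads to the evidence lower bound
used in variational inference.
What puzzled me when I first met the evidence lower bound was that I could not follow the motivation behind the derivation at all.
I could see that each step was mathematically true, but not the purpose and intuition driving each step.
What follows is the rationalization I arrived at after thinking it over myself.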

Resulting objective - 1
(Figure: equations captured from (Kingma and Welling, 2014), annotated with “Our original goal”, “What we have now”, “Deviation between the two” (dotted region), “Evidence Lower Bound (ELBO)”, “KL(1)”, and “KL(2)”)
The difference between the ultimate goal and our approximate expression is the dotted part. It could in fact be written without $q_\phi$,
but $q_\phi$ is introduced to make the expression interpretable. The difference can be read as two KL divergences: KL(1) cannot
be computed, and KL(2) can.
If a term can be computed, there is no need to leave it as a deviation, so we include it in the objective function.
Another benefit is that, since a KL divergence is always non-negative, KL(1) is always non-negative and
the objective function becomes a lower bound on the log marginal likelihood.
Variational inference, and VAE, train the whole model by maximizing this lower bound.
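Since the captured equations do not survive in this text version, the standard identities can be written out as follows. Reading KL(1) as the divergence to the true posterior and KL(2) as the divergence to the prior is an inference from the notes (KL(1) is the one that cannot be computed), not something the text states explicitly.

```latex
\begin{aligned}
\log p_\theta(X_i)
  &= \underbrace{\mathrm{KL}\big(q_\phi(Z \mid X_i)\,\|\,p_\theta(Z \mid X_i)\big)}_{\text{KL(1)}}
     + \mathcal{L}(\theta, \phi; X_i), \\
\mathcal{L}(\theta, \phi; X_i)
  &= -\underbrace{\mathrm{KL}\big(q_\phi(Z \mid X_i)\,\|\,p(Z)\big)}_{\text{KL(2)}}
     + E_{q_\phi(Z \mid X_i)}\big[\log p_\theta(X_i \mid Z)\big], \\
\log p_\theta(X_i) - E_{q_\phi(Z \mid X_i)}\big[\log p_\theta(X_i \mid Z)\big]
  &= \text{KL(1)} - \text{KL(2)}.
\end{aligned}
```

The last line is the deviation between the original goal and the approximate expression. Moving the computable KL(2) into the objective gives $\mathcal{L}$, and because KL(1) $\ge 0$, $\mathcal{L} \le \log p_\theta(X_i)$.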

Resulting objective - 2
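(Figure: the ELBO, with the KL divergence term labeled “regularization” and the expected log-likelihood term labeled “reconstruction”)
What remains after this is covered in many other tutorials and explanations.
The KL divergence in the ELBO acts as a kind of regularizer, and the likelihood computed by sampling from the variational distribution
acts as a reconstruction error.
So that all parameters $\theta, \phi$ in the ELBO can be trained by gradient descent alone, the KL divergence is solved analytically
using properties of the Gaussian, and the sampling step is replaced using the reparametrization trick.
(The reparametrization trick is also a lot of fun, but it is well explained elsewhere, so I'll skip it.)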
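A single-sample ELBO estimate can be sketched as below. It assumes a unit-variance Gaussian decoder and a standard normal prior, and the linear decoder `A @ z` is a hypothetical stand-in chosen because its marginal likelihood is available in closed form, which allows the lower-bound property to be checked.

```python
import numpy as np

def elbo_estimate(x, mu, log_var, decode, rng):
    """Single-sample ELBO for one data point x (unit-variance Gaussian decoder assumed)."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps                  # reparametrized sample from q(Z|x)
    x_hat = decode(z)
    # Reconstruction term: log N(x; x_hat, I).
    log_lik = -0.5 * np.sum((x - x_hat) ** 2) - 0.5 * x.size * np.log(2 * np.pi)
    # Regularization term: analytic KL(q(Z|x) || N(0, I)).
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)
    return log_lik - kl, log_lik, kl

# Hypothetical linear decoder, so that log p(x) = log N(x; 0, A A^T + I) is computable.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decode = lambda z: A @ z
x = np.array([1.0, -1.0, 0.5])
mu, log_var = np.zeros(2), np.zeros(2)                    # a deliberately crude q = N(0, I)

rng = np.random.default_rng(0)
estimates = [elbo_estimate(x, mu, log_var, decode, rng)[0] for _ in range(2000)]
mean_elbo = float(np.mean(estimates))

cov = A @ A.T + np.eye(3)
_, logdet = np.linalg.slogdet(cov)
log_px = -0.5 * (x @ np.linalg.solve(cov, x) + logdet + 3 * np.log(2 * np.pi))
```

The gap `log_px - mean_elbo` estimates the KL divergence from `q` to the true posterior; a better `q` would shrink it.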

KL divergence of two Gaussians
$$KL(p \,\|\, q) = \int p \log \frac{p}{q}$$
$$p = \frac{1}{(2\pi)^{1/2} \sigma} \exp\left(-\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}\right)$$
Try it yourself =D
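For readers who want to check their answer, the well-known closed form for two univariate Gaussians can be compared against numerical integration of the definition. The parameter values and the integration grid below are arbitrary choices for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def kl_closed_form(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) for univariate Gaussians."""
    return np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

# Integrate p * log(p / q) on a grid wide enough to cover p = N(1, 0.5^2).
x = np.linspace(-5.0, 7.0, 120001)
dx = x[1] - x[0]
p = normal_pdf(x, 1.0, 0.5)
q = normal_pdf(x, 0.0, 1.0)
kl_numeric = np.sum(p * np.log(p / q)) * dx
kl_exact = kl_closed_form(1.0, 0.5, 0.0, 1.0)
```

With $q = N(0, 1)$ the closed form reduces to $\frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$, the form that appears per dimension in the VAE objective.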

Reparametrization trick
(Figure: the reparametrization trick; a really great figure from (Doersch, 2016))
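In place of the figure, a small numerical sketch of the idea: write $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$, so the randomness sits in $\epsilon$ and gradients can flow to $\mu$ and $\sigma$. The objective $E[z^2] = \mu^2 + \sigma^2$ is chosen here only because its gradients are known exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8                      # illustrative parameter values
eps = rng.standard_normal(200000)
z = mu + sigma * eps                      # z ~ N(mu, sigma^2), randomness isolated in eps

# Chain rule through z = mu + sigma * eps for the objective E[z^2].
dz_dmu, dz_dsigma = 1.0, eps
grad_mu = np.mean(2 * z * dz_dmu)         # analytic value: 2 * mu
grad_sigma = np.mean(2 * z * dz_dsigma)   # analytic value: 2 * sigma
```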

Latent space
(Kingma and Welling, 2014)

3. Autoencoders and Latent Variable Models
This section considers the relationship between autoencoders and latent variable models,
and how VAE differs from other autoencoders.

Linear autoencoder = Principal Component Analysis
• There has long been a connection between latent variable models and
autoencoders.
• Linear activation & weight sharing
• Hidden neurons as latent variables
Bourlard and Kamp (1988)
The connection between AEs and LVMs began with the discovery that a linear AE is equivalent to PCA.
From then on, the view that hidden neuron = latent variable developed.
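The equivalence can be illustrated numerically. The sketch below builds a tied-weight linear autoencoder whose weights are the top-k principal directions from an SVD, and compares its reconstruction error with that of a random orthonormal projection on synthetic data. It shows that the PCA subspace attains the minimal error for this data; it does not show that gradient training reaches it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 5))   # synthetic correlated data
X = X - X.mean(axis=0)                                            # center the data
k = 2                                                             # number of hidden units

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
W_pca = Vt[:k]                                                    # shape (k, 5)

def reconstruction_error(W, X):
    H = X @ W.T            # linear encoder
    X_hat = H @ W          # linear decoder sharing the same weights
    return np.sum((X - X_hat) ** 2)

err_pca = reconstruction_error(W_pca, X)

# Baseline: a random k-dimensional orthonormal projection.
Q, _ = np.linalg.qr(rng.standard_normal((5, k)))
err_random = reconstruction_error(Q.T, X)
```

`err_pca` equals the sum of the discarded squared singular values, `np.sum(S[k:] ** 2)`.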

Nonlinear, regularized autoencoders
$$h = \sigma(W_1 x + b_1)$$
$$\hat{x} = W_2 h + b_2$$
$$L = \|x - \hat{x}\|^2 + \lambda L_{reg}$$
Typical design choices
• Regularization ($L_{reg}$)
• L2
• Sparsity
• Denoising (Vincent et al., 2010)
• Contractive (Rifai et al., 2011)
• Number of layers
• Weight sharing
• Convolutional / Recurrent
In the good old days, they were used to initialize the weights of deep neural nets.
Autoencoders kept developing after that and produced many variants. They also contributed to finding ways
to train deep neural nets. (Though they are no longer used for that these days.)
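The forward pass and loss of such an autoencoder can be sketched as follows, taking L2 weight decay as the choice of $L_{reg}$ and a sigmoid as the nonlinearity. The weights are untrained random placeholders and the sizes are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_loss(x, W1, b1, W2, b2, lam):
    """Squared reconstruction error plus lam times an L2 penalty on the weights."""
    h = sigmoid(W1 @ x + b1)                       # encoder
    x_hat = W2 @ h + b2                            # decoder
    l_reg = np.sum(W1 ** 2) + np.sum(W2 ** 2)      # L2 choice for L_reg
    return np.sum((x - x_hat) ** 2) + lam * l_reg, h, x_hat

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
W1, b1 = rng.normal(scale=0.3, size=(3, 6)), np.zeros(3)   # untrained placeholders
W2, b2 = rng.normal(scale=0.3, size=(6, 3)), np.zeros(6)

loss, h, x_hat = autoencoder_loss(x, W1, b1, W2, b2, lam=0.01)
loss_unregularized, _, _ = autoencoder_loss(x, W1, b1, W2, b2, lam=0.0)
```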

Restricted Boltzmann Machines (RBM)
Undirected graphical model version of autoencoder
• Maths are almost the same
• Hidden neuron == latent variable
The Restricted Boltzmann Machine was also
used for pretraining deep autoencoders
(Figure: RBM with hidden units $h_1, h_2, h_3$ and visible units $v_1, v_2, v_3, v_4$)
In the RBM, which expresses a (non-linear) autoencoder as an undirected graphical model, the relation hidden neuron = latent variable
shows up strongly once again.

In VAE: Hidden neurons != Latent Variables
We have mentioned that in most PGMs, the relationship between nodes
is relatively simple.
• In VAE, hidden neurons are not latent variables
• They encode the non-linear relationship between nodes
(Figure: RBM, with hidden units $h_1, h_2, h_3$ and visible units $v_1, v_2, v_3, v_4$: almost linear relationship! VAE, with $Z \to X$: very non-linear relationship (a neural network))
In a variational autoencoder, hidden neurons are used to express the non-linear relationship between the visible variable and the latent variable.
I think this is the decisive difference in viewpoint between conventional autoencoders and VAE.

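Conventional AEs vs Variational AEs
Conventional AEs don’t aim to learn a probability distribution
According to (Alain and Bengio, 2014), a paper analyzing the most widely used nonlinear, regularized autoencoders,
learning a probability distribution is not their goal. I will skip the details, but a denoising autoencoder, for instance,
has the property of learning the derivative of the log of the probability density. Its goal can be said to differ from that of VAE,
which aims to learn the distribution itself and sample from it.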

Thank You for Your Attention
