SlideShare a Scribd company logo
1 of 34
Download to read offline
Driving in TORCS with
Deep Deterministic Policy Gradient
Final Report
Naoto Yoshida
About Me
● Ph.D. student from Tohoku University
● My Hobby:
○ Reading Books
○ TBA
● NEWS:
○ My conference paper on the reward function was
accepted!
■ SCIS&ISIS2016 @ Hokkaido
Outline
● TORCS and Deep Reinforcement Learning
● DDPG: An Overview
● In Toy Domains
● In TORCS Domain
● Conclusion / Impressions
TORCS and
Deep Reinforcement Learning
TORCS: The Open source Racing Car Simulator
● Open source
● Realistic (?) dynamics simulation of the car environment
Deep Reinforcement Learning
● Reinforcement Learning + Deep Learning
○ From Pixel to Action
■ General game play in ATARI domain
■ Car Driver
■ (Go Expert)
● Deep Reinforcement Learning in Continuous Action Domain: DDPG
○ Lillicrap, Timothy P., et al.
"Continuous control with deep reinforcement learning." , ICLR 2016
Vision-based Car agent in TORCS
Steering + Accel/Blake
= 2 dim continuous actions
DDPG: An overview
GOAL: Maximization of in expectation
Reinforcement Learning
Agent
Environment
Action : a
State : s
Reward : r
GOAL: Maximization of in expectation
Reinforcement Learning
Agent
Environment
Action : a
State : s
Reward : r
Interface
Raw output: u
Raw input: x
Deterministic Policy Gradient
● Formal Objective Function: Maximization of True Action Value
● Policy Evaluation: Approximation of the objective function
● Policy Improvement: Improvement of the objective function
where
Bellman equation
wrt Deterministic Policy
Loss for
Critic
Update direction
of Actor
Silver, David, et al. "Deterministic policy gradient algorithms."
ICML. 2014.
Deep Deterministic Policy Gradient
Initialization
Update of Critic
+ minibatch
Update of Actor
+ minibatch
Update of Target
Sampling / Interaction
RL agent
(DDPG)
s, ra
TORCS
Lillicrap, Timothy P., et al. "Continuous control
with deep reinforcement learning.", ICLR 2016
Deep Architecture of DDPG
Three-step observation
Simultaneous training of two deep convolutional networks
Exploration: Ornstein–Uhlenbeck process
● Gaussian noise with moments
○ θ,σ:parameters
○ dt:time difference
○ μ:mean (= 0.)
● Stochastic Differential Equation:
● Exact Solution for the discrete time step:
Wiener Process
Gaussian
OU process: Example
GaussianOU Process
θ = 0.15, σ = 0.2,μ = 0
DDPG in Toy Domains
https://gym.openai.com/
Toy Problem: Pendulum Swingup
● Classical RL benchmark task
○ Nonlinear control:
○ Action: Torque
○ State:
○ Reward:
From “Reinforcement Learning In Continuous Time and
Space”, Kenji Doya, 2000
Results
# of Episode
Results: SVG(0)
# of Episode
Toy Problem 2: Cart-pole Balancing
● Another classical benchmark task
○ Action: Horizontal Force
○ State:
○ Reward: (other definition is possible)
■ +1 (angle is in the area)
■ 0 (Episode Terminal)
Angle Area
Results: non-convergent behavior :(
RLLAB implementation worked well
Successful
Score
Total Steps
https://rllab.readthedocs.io
DDPG in TORCS Domain
Note:
Red line : Parameters with Author Confirmation / DDPG paper
Blue line : Estimated/Hand-tuned Parameters
Track: Michigan Speedway
● Used in DDPG paper
● This track actually exists!
www.mispeedway.com
TORCS: Low-dimensional observation
● TORCS supports low-dimensional sensor outputs for AI agent
○ “Track” sensor
○ “Opponent” sensor
○ Speed sensor
○ Fuel, damage, wheel spin speed etc.
● Track + speed sensor as observation
● Network: Shallow network
● Action: Steering (1 dim)
● Reward:
○ If car clashed/course out , car gets penalty -1
○ otherwise
“Track” Laser Sensor
Track
Axis
Car Axis
Result: Reasonable behavior
TORCS: Vision inputs
● Two deep convolutional neural networks
○ Convolution:
■ 1st layer: 32 filters, 5x5 kernel, stride 2, paddling 2
■ 2nd, 3rd layer: 32 filters, 3x3 kernel, stride 2, paddling 1
■ Full-connection: 200 hidden nodes
VTORCS-RL-color
● Visual TORCS
○ TORCS for Vision-based AI agent
■ Original TORCS does not have vision API!
■ vtorcs:
● Koutník et al., "Evolving deep unsupervised convolutional networks for vision-based reinforcement learning, ACM,
2014.
○ Monochrome image from TORCS server
■ Modification for the color vision → vtorcs-RL-color
○ Restart bug
■ Solved with help of mentors’ substantial suggestions!
Result: Still not a good result...
What was the cause of the failure?
● DDPG implementation?
○ Worked correctly, at least in toy domains.
■ The approximation of value functions → ok
● However, policy improvement failed in the end.
■ Default exploration strategy is problematic in TORCS environment
● This setting may be for general tasks
● Higher order exploration in POMDP is required
● TROCS environment?
○ Still several unknown environment parameters
■ Reward → ok (DDPG author check)
■ Episode terminal condition → still various possibilities
(from DDPG paper)
gym-torcs
・TORCS environment with openAI-Gym like interface
Impressions
● On DDPG
○ Learning of the continuous control is a tough problem :(
■ Difficulty of policy update in DDPG
■ “Twice” recommendation of Async method by DDPG author (^ ^;)
● Throughout this PFN internship:
○ Weakness: Coding
■ Thank you! Fujita-san, Kusumoto-san
■ I knew many weakness of my coding style
○ Advantage: Reinforcement Learning Theory
■ and its branched algorithms, topics and relationships between RL and Inference
■ For DEEP RL, Fujita-san is an auth. in Japan :)
Update after the PFI seminar
Cart-pole Balancing
● DDPG could learn the successful policy
○ Still unstable after the several successful trial
Success in Half-Cheetah Experiment
● We could run successful experiment with identical hyper parameters in cart-
pole.
300-step total reward
Episode
Keys in DDPG / deep RL
● Normalization of the environment
○ Preprocess is known to be very important for deep learning.
■ This is also true in deep RL.
■ Scaling of inputs (possibly, and actions, rewards) will help the agent to learn.
● Possible normalization:
○ Simple normalization helps: x_norm = (x - mean_x)/std_x
○ Mean and Standard deviation are obtained during the initial exploration.
○ Other normalization like ZCA/PCA whitening may also help.
● Epsilon parameter in Adam/RMS prop can be large value
○ 0.1, 0.01, 0.001… We still need a hand-tuning / grid search...

More Related Content

What's hot

Deep Learning in Finance
Deep Learning in FinanceDeep Learning in Finance
Deep Learning in FinanceAltoros
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowIllia Polosukhin
 
Ultrasound Nerve Segmentation
Ultrasound Nerve Segmentation Ultrasound Nerve Segmentation
Ultrasound Nerve Segmentation Sneha Ravikumar
 
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016MLconf
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷Eiji Sekiya
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowEmanuel Di Nardo
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitShiladitya Sen
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learningmooopan
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Universitat Politècnica de Catalunya
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016MLconf
 
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15MLconf
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Antti Haapala
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlowBarbara Fusinska
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in PythonImry Kissos
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017MLconf
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbitjakehofman
 
Terascale Learning
Terascale LearningTerascale Learning
Terascale Learningpauldix
 
Intro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular LabsIntro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular LabsKendall
 
Asynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement LearningAsynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement LearningSsu-Rui Lee
 

What's hot (20)

Deep Learning in Finance
Deep Learning in FinanceDeep Learning in Finance
Deep Learning in Finance
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
 
Ultrasound Nerve Segmentation
Ultrasound Nerve Segmentation Ultrasound Nerve Segmentation
Ultrasound Nerve Segmentation
 
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016
 
強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷強化学習の分散アーキテクチャ変遷
強化学習の分散アーキテクチャ変遷
 
Distributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflowDistributed implementation of a lstm on spark and tensorflow
Distributed implementation of a lstm on spark and tensorflow
 
Exploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal WabbitExploring Optimization in Vowpal Wabbit
Exploring Optimization in Vowpal Wabbit
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning
 
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
Software Frameworks for Deep Learning (D1L7 2017 UPC Deep Learning for Comput...
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
 
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
Melanie Warrick, Deep Learning Engineer, Skymind.io at MLconf SF - 11/13/15
 
Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit Wapid and wobust active online machine leawning with Vowpal Wabbit
Wapid and wobust active online machine leawning with Vowpal Wabbit
 
Deep learning with TensorFlow
Deep learning with TensorFlowDeep learning with TensorFlow
Deep learning with TensorFlow
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
 
nn network
nn networknn network
nn network
 
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
Corinna Cortes, Head of Research, Google, at MLconf NYC 2017
 
Technical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal WabbitTechnical Tricks of Vowpal Wabbit
Technical Tricks of Vowpal Wabbit
 
Terascale Learning
Terascale LearningTerascale Learning
Terascale Learning
 
Intro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular LabsIntro to TensorFlow and PyTorch Workshop at Tubular Labs
Intro to TensorFlow and PyTorch Workshop at Tubular Labs
 
Asynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement LearningAsynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement Learning
 

Viewers also liked

Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Seiya Tokui
 
Chainerを使って細胞を数えてみた
Chainerを使って細胞を数えてみたChainerを使って細胞を数えてみた
Chainerを使って細胞を数えてみたsamacoba1983
 
NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1
NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1
NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1NVIDIA Japan
 
深層学習ライブラリの環境問題Chainer Meetup2016 07-02
深層学習ライブラリの環境問題Chainer Meetup2016 07-02深層学習ライブラリの環境問題Chainer Meetup2016 07-02
深層学習ライブラリの環境問題Chainer Meetup2016 07-02Yuta Kashino
 
On the benchmark of Chainer
On the benchmark of ChainerOn the benchmark of Chainer
On the benchmark of ChainerKenta Oono
 
俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstation
俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstation俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstation
俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstationYusuke HIDESHIMA
 
ヤフー音声認識サービスでのディープラーニングとGPU利用事例
ヤフー音声認識サービスでのディープラーニングとGPU利用事例ヤフー音声認識サービスでのディープラーニングとGPU利用事例
ヤフー音声認識サービスでのディープラーニングとGPU利用事例Yahoo!デベロッパーネットワーク
 
Chainer, Cupy入門
Chainer, Cupy入門Chainer, Cupy入門
Chainer, Cupy入門Yuya Unno
 
マシンパーセプション研究におけるChainer活用事例
マシンパーセプション研究におけるChainer活用事例マシンパーセプション研究におけるChainer活用事例
マシンパーセプション研究におけるChainer活用事例nlab_utokyo
 
Chainer Development Plan 2015/12
Chainer Development Plan 2015/12Chainer Development Plan 2015/12
Chainer Development Plan 2015/12Seiya Tokui
 
Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例
Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例
Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例Jun-ya Norimatsu
 
深層学習ライブラリのプログラミングモデル
深層学習ライブラリのプログラミングモデル深層学習ライブラリのプログラミングモデル
深層学習ライブラリのプログラミングモデルYuta Kashino
 
Chainer Contribution Guide
Chainer Contribution GuideChainer Contribution Guide
Chainer Contribution GuideKenta Oono
 
ディープラーニングにおける学習の高速化の重要性とその手法
ディープラーニングにおける学習の高速化の重要性とその手法ディープラーニングにおける学習の高速化の重要性とその手法
ディープラーニングにおける学習の高速化の重要性とその手法Yuko Fujiyama
 
Lighting talk chainer hands on
Lighting talk chainer hands onLighting talk chainer hands on
Lighting talk chainer hands onOgushi Masaya
 
ボケるRNNを学習したい (Chainer meetup 01)
ボケるRNNを学習したい (Chainer meetup 01)ボケるRNNを学習したい (Chainer meetup 01)
ボケるRNNを学習したい (Chainer meetup 01)Motoki Sato
 
Chainer入門と最近の機能
Chainer入門と最近の機能Chainer入門と最近の機能
Chainer入門と最近の機能Yuya Unno
 
Chainer meetup lt
Chainer meetup ltChainer meetup lt
Chainer meetup ltAce12358
 
Introduction to DEEPstation the GUI Deep learning environment for chainer
Introduction to DEEPstation the GUI Deep learning environment for chainerIntroduction to DEEPstation the GUI Deep learning environment for chainer
Introduction to DEEPstation the GUI Deep learning environment for chainerRyo Shimizu
 

Viewers also liked (20)

Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+Chainer Update v1.8.0 -> v1.10.0+
Chainer Update v1.8.0 -> v1.10.0+
 
Chainerを使って細胞を数えてみた
Chainerを使って細胞を数えてみたChainerを使って細胞を数えてみた
Chainerを使って細胞を数えてみた
 
NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1
NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1
NVIDIA 更新情報: Tesla P100 PCIe/cuDNN 5.1
 
深層学習ライブラリの環境問題Chainer Meetup2016 07-02
深層学習ライブラリの環境問題Chainer Meetup2016 07-02深層学習ライブラリの環境問題Chainer Meetup2016 07-02
深層学習ライブラリの環境問題Chainer Meetup2016 07-02
 
On the benchmark of Chainer
On the benchmark of ChainerOn the benchmark of Chainer
On the benchmark of Chainer
 
俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstation
俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstation俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstation
俺のtensorが全然flowしないのでみんなchainer使おう by DEEPstation
 
ヤフー音声認識サービスでのディープラーニングとGPU利用事例
ヤフー音声認識サービスでのディープラーニングとGPU利用事例ヤフー音声認識サービスでのディープラーニングとGPU利用事例
ヤフー音声認識サービスでのディープラーニングとGPU利用事例
 
Chainer, Cupy入門
Chainer, Cupy入門Chainer, Cupy入門
Chainer, Cupy入門
 
マシンパーセプション研究におけるChainer活用事例
マシンパーセプション研究におけるChainer活用事例マシンパーセプション研究におけるChainer活用事例
マシンパーセプション研究におけるChainer活用事例
 
Chainer Development Plan 2015/12
Chainer Development Plan 2015/12Chainer Development Plan 2015/12
Chainer Development Plan 2015/12
 
Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例
Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例
Capitalicoでのchainer 1.1 → 1.5 バージョンアップ事例
 
深層学習ライブラリのプログラミングモデル
深層学習ライブラリのプログラミングモデル深層学習ライブラリのプログラミングモデル
深層学習ライブラリのプログラミングモデル
 
Chainer Contribution Guide
Chainer Contribution GuideChainer Contribution Guide
Chainer Contribution Guide
 
ディープラーニングにおける学習の高速化の重要性とその手法
ディープラーニングにおける学習の高速化の重要性とその手法ディープラーニングにおける学習の高速化の重要性とその手法
ディープラーニングにおける学習の高速化の重要性とその手法
 
Lighting talk chainer hands on
Lighting talk chainer hands onLighting talk chainer hands on
Lighting talk chainer hands on
 
ボケるRNNを学習したい (Chainer meetup 01)
ボケるRNNを学習したい (Chainer meetup 01)ボケるRNNを学習したい (Chainer meetup 01)
ボケるRNNを学習したい (Chainer meetup 01)
 
Chainer入門と最近の機能
Chainer入門と最近の機能Chainer入門と最近の機能
Chainer入門と最近の機能
 
Chainer meetup lt
Chainer meetup ltChainer meetup lt
Chainer meetup lt
 
Introduction to DEEPstation the GUI Deep learning environment for chainer
Introduction to DEEPstation the GUI Deep learning environment for chainerIntroduction to DEEPstation the GUI Deep learning environment for chainer
Introduction to DEEPstation the GUI Deep learning environment for chainer
 
CuPy解説
CuPy解説CuPy解説
CuPy解説
 

Similar to PFN Spring Internship Final Report: Autonomous Drive by Deep RL

Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learningguruprasad110
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learningBig Data Colombia
 
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place SolutionKaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place SolutionPreferred Networks
 
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...GeeksLab Odessa
 
Meetup 18/10/2018 - Artificiële intelligentie en mobiliteit
Meetup 18/10/2018 - Artificiële intelligentie en mobiliteitMeetup 18/10/2018 - Artificiële intelligentie en mobiliteit
Meetup 18/10/2018 - Artificiële intelligentie en mobiliteitDigipolis Antwerpen
 
Reinforcement learning in a nutshell
Reinforcement learning in a nutshellReinforcement learning in a nutshell
Reinforcement learning in a nutshellNing Zhou
 
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular MethodsReinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular MethodsSeung Jae Lee
 
Optimized Multi-agent Box-pushing - 2017-10-24
Optimized Multi-agent Box-pushing - 2017-10-24Optimized Multi-agent Box-pushing - 2017-10-24
Optimized Multi-agent Box-pushing - 2017-10-24Aritra Sarkar
 
SOLID refactoring - racing car katas
SOLID refactoring - racing car katasSOLID refactoring - racing car katas
SOLID refactoring - racing car katasGeorg Berky
 
Dynamic Search and Beyond
Dynamic Search and BeyondDynamic Search and Beyond
Dynamic Search and BeyondGrace Hui Yang
 
The Volcano/Cascades Optimizer
The Volcano/Cascades OptimizerThe Volcano/Cascades Optimizer
The Volcano/Cascades Optimizer宇 傅
 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...Dongmin Lee
 
Evan Estola – Data Scientist, Meetup.com at MLconf ATL
Evan Estola – Data Scientist, Meetup.com at MLconf ATLEvan Estola – Data Scientist, Meetup.com at MLconf ATL
Evan Estola – Data Scientist, Meetup.com at MLconf ATLMLconf
 
How we won the 2nd place In KaggleDays Championship.pptx
How we won the 2nd place In KaggleDays Championship.pptxHow we won the 2nd place In KaggleDays Championship.pptx
How we won the 2nd place In KaggleDays Championship.pptxjficwwc
 
Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5
Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5
Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5TigerGraph
 
Using Graph Algorithms for Advanced Analytics - Part 5 Classification
Using Graph Algorithms for Advanced Analytics - Part 5 ClassificationUsing Graph Algorithms for Advanced Analytics - Part 5 Classification
Using Graph Algorithms for Advanced Analytics - Part 5 ClassificationTigerGraph
 
CIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingCIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingAlexey Grigorev
 
Imitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCSImitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCSPreferred Networks
 
Learning visual representation without human label
Learning visual representation without human labelLearning visual representation without human label
Learning visual representation without human labelKai-Wen Zhao
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdfdsfajkh
 

Similar to PFN Spring Internship Final Report: Autonomous Drive by Deep RL (20)

Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learning
 
An introduction to deep reinforcement learning
An introduction to deep reinforcement learningAn introduction to deep reinforcement learning
An introduction to deep reinforcement learning
 
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place SolutionKaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
Kaggle Lyft Motion Prediction for Autonomous Vehicles 4th Place Solution
 
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
 
Meetup 18/10/2018 - Artificiële intelligentie en mobiliteit
Meetup 18/10/2018 - Artificiële intelligentie en mobiliteitMeetup 18/10/2018 - Artificiële intelligentie en mobiliteit
Meetup 18/10/2018 - Artificiële intelligentie en mobiliteit
 
Reinforcement learning in a nutshell
Reinforcement learning in a nutshellReinforcement learning in a nutshell
Reinforcement learning in a nutshell
 
Reinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular MethodsReinforcement Learning 8: Planning and Learning with Tabular Methods
Reinforcement Learning 8: Planning and Learning with Tabular Methods
 
Optimized Multi-agent Box-pushing - 2017-10-24
Optimized Multi-agent Box-pushing - 2017-10-24Optimized Multi-agent Box-pushing - 2017-10-24
Optimized Multi-agent Box-pushing - 2017-10-24
 
SOLID refactoring - racing car katas
SOLID refactoring - racing car katasSOLID refactoring - racing car katas
SOLID refactoring - racing car katas
 
Dynamic Search and Beyond
Dynamic Search and BeyondDynamic Search and Beyond
Dynamic Search and Beyond
 
The Volcano/Cascades Optimizer
The Volcano/Cascades OptimizerThe Volcano/Cascades Optimizer
The Volcano/Cascades Optimizer
 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
 
Evan Estola – Data Scientist, Meetup.com at MLconf ATL
Evan Estola – Data Scientist, Meetup.com at MLconf ATLEvan Estola – Data Scientist, Meetup.com at MLconf ATL
Evan Estola – Data Scientist, Meetup.com at MLconf ATL
 
How we won the 2nd place In KaggleDays Championship.pptx
How we won the 2nd place In KaggleDays Championship.pptxHow we won the 2nd place In KaggleDays Championship.pptx
How we won the 2nd place In KaggleDays Championship.pptx
 
Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5
Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5
Graph Gurus Episode 32: Using Graph Algorithms for Advanced Analytics Part 5
 
Using Graph Algorithms for Advanced Analytics - Part 5 Classification
Using Graph Algorithms for Advanced Analytics - Part 5 ClassificationUsing Graph Algorithms for Advanced Analytics - Part 5 Classification
Using Graph Algorithms for Advanced Analytics - Part 5 Classification
 
CIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device LinkingCIKM Cup 2016: Cross-Device Linking
CIKM Cup 2016: Cross-Device Linking
 
Imitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCSImitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCS
 
Learning visual representation without human label
Learning visual representation without human labelLearning visual representation without human label
Learning visual representation without human label
 
DALL-E.pdf
DALL-E.pdfDALL-E.pdf
DALL-E.pdf
 

Recently uploaded

Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmDeepika Walanjkar
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingBootNeck1
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONjhunlian
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptJohnWilliam111370
 
Industrial Applications of Centrifugal Compressors
Industrial Applications of Centrifugal CompressorsIndustrial Applications of Centrifugal Compressors
Industrial Applications of Centrifugal CompressorsAlirezaBagherian3
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming languageSmritiSharma901052
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfNainaShrivastava14
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.elesangwon
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHSneha Padhiar
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxStephen Sitton
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfisabel213075
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionSneha Padhiar
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Erbil Polytechnic University
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 

Recently uploaded (20)

Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithmComputer Graphics Introduction, Open GL, Line and Circle drawing algorithm
Computer Graphics Introduction, Open GL, Line and Circle drawing algorithm
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
System Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event SchedulingSystem Simulation and Modelling with types and Event Scheduling
System Simulation and Modelling with types and Event Scheduling
 
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTIONTHE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
THE SENDAI FRAMEWORK FOR DISASTER RISK REDUCTION
 
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.pptROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
ROBOETHICS-CCS345 ETHICS AND ARTIFICIAL INTELLIGENCE.ppt
 
Industrial Applications of Centrifugal Compressors
Industrial Applications of Centrifugal CompressorsIndustrial Applications of Centrifugal Compressors
Industrial Applications of Centrifugal Compressors
 
OOP concepts -in-Python programming language
OOP concepts -in-Python programming languageOOP concepts -in-Python programming language
OOP concepts -in-Python programming language
 
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdfPaper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
Paper Tube : Shigeru Ban projects and Case Study of Cardboard Cathedral .pdf
 
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
2022 AWS DNA Hackathon 장애 대응 솔루션 jarvis.
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACHTEST CASE GENERATION GENERATION BLOCK BOX APPROACH
TEST CASE GENERATION GENERATION BLOCK BOX APPROACH
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
Turn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptxTurn leadership mistakes into a better future.pptx
Turn leadership mistakes into a better future.pptx
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
Stork Webinar | APM Transformational planning, Tool Selection & Performance T...
 
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
Comparative study of High-rise Building Using ETABS,SAP200 and SAFE., SAFE an...
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 

PFN Spring Internship Final Report: Autonomous Drive by Deep RL

  • 1. Driving in TORCS with Deep Deterministic Policy Gradient Final Report Naoto Yoshida
  • 2. About Me ● Ph.D. student from Tohoku University ● My Hobby: ○ Reading Books ○ TBA ● NEWS: ○ My conference paper on the reward function was accepted! ■ SCIS&ISIS2016 @ Hokkaido
  • 3. Outline ● TORCS and Deep Reinforcement Learning ● DDPG: An Overview ● In Toy Domains ● In TORCS Domain ● Conclusion / Impressions
  • 5. TORCS: The Open source Racing Car Simulator ● Open source ● Realistic (?) dynamics simulation of the car environment
  • 6. Deep Reinforcement Learning ● Reinforcement Learning + Deep Learning ○ From Pixel to Action ■ General game play in ATARI domain ■ Car Driver ■ (Go Expert) ● Deep Reinforcement Learning in Continuous Action Domain: DDPG ○ Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." , ICLR 2016 Vision-based Car agent in TORCS Steering + Accel/Blake = 2 dim continuous actions
  • 8. GOAL: Maximization of in expectation Reinforcement Learning Agent Environment Action : a State : s Reward : r
  • 9. GOAL: Maximization of in expectation Reinforcement Learning Agent Environment Action : a State : s Reward : r Interface Raw output: u Raw input: x
  • 10. Deterministic Policy Gradient ● Formal Objective Function: Maximization of True Action Value ● Policy Evaluation: Approximation of the objective function ● Policy Improvement: Improvement of the objective function where Bellman equation wrt Deterministic Policy Loss for Critic Update direction of Actor Silver, David, et al. "Deterministic policy gradient algorithms." ICML. 2014.
  • 11. Deep Deterministic Policy Gradient Initialization Update of Critic + minibatch Update of Actor + minibatch Update of Target Sampling / Interaction RL agent (DDPG) s, ra TORCS Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning.", ICLR 2016
  • 12. Deep Architecture of DDPG Three-step observation Simultaneous training of two deep convolutional networks
  • 13. Exploration: Ornstein–Uhlenbeck process ● Gaussian noise with moments ○ θ,σ:parameters ○ dt:time difference ○ μ:mean (= 0.) ● Stochastic Differential Equation: ● Exact Solution for the discrete time step: Wiener Process Gaussian
  • 14. OU process: Example GaussianOU Process θ = 0.15, σ = 0.2,μ = 0
  • 15. DDPG in Toy Domains https://gym.openai.com/
  • 16. Toy Problem: Pendulum Swingup ● Classical RL benchmark task ○ Nonlinear control: ○ Action: Torque ○ State: ○ Reward: From “Reinforcement Learning In Continuous Time and Space”, Kenji Doya, 2000
  • 19. Toy Problem 2: Cart-pole Balancing ● Another classical benchmark task ○ Action: Horizontal Force ○ State: ○ Reward: (other definition is possible) ■ +1 (angle is in the area) ■ 0 (Episode Terminal) Angle Area
  • 20. Results: non-convergent behavior :( RLLAB implementation worked well Successful Score Total Steps https://rllab.readthedocs.io
  • 21. DDPG in TORCS Domain Note: Red line : Parameters with Author Confirmation / DDPG paper Blue line : Estimated/Hand-tuned Parameters
  • 22. Track: Michigan Speedway ● Used in DDPG paper ● This track actually exists! www.mispeedway.com
  • 23. TORCS: Low-dimensional observation ● TORCS supports low-dimensional sensor outputs for AI agent ○ “Track” sensor ○ “Opponent” sensor ○ Speed sensor ○ Fuel, damage, wheel spin speed etc. ● Track + speed sensor as observation ● Network: Shallow network ● Action: Steering (1 dim) ● Reward: ○ If car clashed/course out , car gets penalty -1 ○ otherwise “Track” Laser Sensor Track Axis Car Axis
  • 25. TORCS: Vision inputs ● Two deep convolutional neural networks ○ Convolution: ■ 1st layer: 32 filters, 5x5 kernel, stride 2, paddling 2 ■ 2nd, 3rd layer: 32 filters, 3x3 kernel, stride 2, paddling 1 ■ Full-connection: 200 hidden nodes
  • 26. VTORCS-RL-color ● Visual TORCS ○ TORCS for Vision-based AI agent ■ Original TORCS does not have vision API! ■ vtorcs: ● Koutník et al., "Evolving deep unsupervised convolutional networks for vision-based reinforcement learning, ACM, 2014. ○ Monochrome image from TORCS server ■ Modification for the color vision → vtorcs-RL-color ○ Restart bug ■ Solved with help of mentors’ substantial suggestions!
  • 27. Result: Still not a good result...
  • 28. What was the cause of the failure? ● DDPG implementation? ○ Worked correctly, at least in toy domains. ■ The approximation of value functions → ok ● However, policy improvement failed in the end. ■ Default exploration strategy is problematic in TORCS environment ● This setting may be for general tasks ● Higher order exploration in POMDP is required ● TROCS environment? ○ Still several unknown environment parameters ■ Reward → ok (DDPG author check) ■ Episode terminal condition → still various possibilities (from DDPG paper)
  • 29. gym-torcs ・TORCS environment with openAI-Gym like interface
  • 30. Impressions ● On DDPG ○ Learning of the continuous control is a tough problem :( ■ Difficulty of policy update in DDPG ■ “Twice” recommendation of Async method by DDPG author (^ ^;) ● Throughout this PFN internship: ○ Weakness: Coding ■ Thank you! Fujita-san, Kusumoto-san ■ I knew many weakness of my coding style ○ Advantage: Reinforcement Learning Theory ■ and its branched algorithms, topics and relationships between RL and Inference ■ For DEEP RL, Fujita-san is an auth. in Japan :)
  • 31. Update after the PFI seminar
  • 32. Cart-pole Balancing ● DDPG could learn the successful policy ○ Still unstable after the several successful trial
  • 33. Success in Half-Cheetah Experiment ● We could run successful experiment with identical hyper parameters in cart- pole. 300-step total reward Episode
  • 34. Keys in DDPG / deep RL ● Normalization of the environment ○ Preprocess is known to be very important for deep learning. ■ This is also true in deep RL. ■ Scaling of inputs (possibly, and actions, rewards) will help the agent to learn. ● Possible normalization: ○ Simple normalization helps: x_norm = (x - mean_x)/std_x ○ Mean and Standard deviation are obtained during the initial exploration. ○ Other normalization like ZCA/PCA whitening may also help. ● Epsilon parameter in Adam/RMS prop can be large value ○ 0.1, 0.01, 0.001… We still need a hand-tuning / grid search...