CHAPTERS
1. Welcome back to ICML 2019 presentations. This session on Deep Reinforcement Learning... (00:00)
2. Paper: An Investigation of Model-Free Planning (07:45)
3. RL agents in complex domains (08:00)
4. Planning network architecture (10:07)
5. Sokoban (10:40)
6. What is planning? (12:35)
7. Planning: Behavioral Characteristics (13:11)
8. Can these methods plan? (14:10)
9. Our investigation (14:21)
10. RL Methods (14:57)
11. Performance on Sokoban (15:24)
12. Stacked ConvLSTMs (17:52)
13. Repeat within time-step (18:38)
14. Deep Recurrent ConvLSTM (DRC) architecture (19:06)
15. One common view of RNN in RL (19:31)
16. What we want (20:02)
17. Sokoban (Early training) (20:20)
18. Planning: Behavioural Characteristics (20:52)
19. Generalization / Data efficiency (20:58)
20. Time Scalability (22:15)
21. Other planning domains (23:09)
22. Takeaways (23:37)
23.Q&A24:31
24.Paper: CURIOUS: Intrinsically Motivated Modular Multi-Goal RL28:12
25.Problem: Intrinsically Motivated Modular Multi-Goal RL28:36
26.The Curious Algorithm29:25
27.Modular goal encoding vs Multi-Goal Module Experts29:52
28.Automatic Curriculum with Absolute Learning Progress30:07
29.Resilience to Distracting Goals30:58
30.Resilience to Forgetting and Sensory Failures31:29
31.Curious: Intrinsically Motivated Modular Multi-Goal RL31:57
32.Paper: Task-Agnostic Dynamics Priors for Deep Reinforcement Learning32:40
33.Key Questions32:51
34.Dynamics Model in RL33:26
35.Overall Approach33:58
36.Spatial34:10
37.Spatial Memory34:29
38.Experimental Setup35:03
39.Model Predictions35:36
40.Predicting Physical Parameters35:58
41.Policy Learning: PhysShooter36:26
42.Policy Learning: Atari36:43
43.Transfer Learning36:52
44.Conclusion37:03
45.Paper: Diagnosing Bottlenecks in Deep Q-learning Algorithms37:19
46.Motivation37:33
47.How does function approximation affect convergence?38:25
48.Does overfitting occur?39:11
49.Can early stopping help?39:53
50.How to choose the sampling distribution?40:04
51.Adversarial Feature Matching (AFM)40:47
52.Paper: Collaborative Evolutionary Reinforcement Learning41:45
53.A simple actor-critic policy gradient setup41:55
54.Learner42:09
55.What do we optimize exactly?42:16
56.Portfolio of Learners (varying discount rates)42:44
57.Why varying discount rates?42:59
58.Adding a Resource Manager44:33
59.Adding Neuroevolution45:02
60.Experiment: Humanoid45:34
61.Paper: EMI: Exploration with Mutual Information47:11
62.Outline47:26
63.Sparse-reward environments47:31
64.Intuition47:52
65.Mutual information maximizing embedding representations48:31
66.Lower bound of MI49:44
67.Lower bound of MI using Jensen-Shannon divergence50:14
68.Maximizing lower bound of MI for embedding learning52:18
69.Architecture for MI estimation52:56
70.Linear dynamics model with error model53:50
71.Final learning objective55:28
72.Intrinsic reward augmentation56:09
73.Pseudocode56:35
74.Baselines57:42
75.Environments for evaluation 59:02
76.Experimental results59:43
77.Learned EMI embeddings1:02:26
78.Conclusion1:03:44
79.Q&A1:04:40
80.Paper: Imitation Learning from Imperfect Demonstration1:06:07
81.Introduction1:06:18
82.Motivation1:06:53
83.Generative Adversarial Imitation Learning [1]1:07:25
84.Problem Setting1:07:45
85.Proposed Method 1: Two-Step Importance Weighting Imitation Learning1:08:16
86.Proposed Method 2: GAIL with Imperfect Demonstration and Confidence1:08:43
87.Setup1:09:11
88.Results: Higher Average Return of the Proposed Methods1:09:18
89.Results: Unlabeled Data Helps1:09:31
90.Conclusion1:09:53
91.Paper: Curiosity-Bottleneck: Exploration by Distilling Task-Specific Novelty1:10:17
92.Motivation: Exploration under Distraction1:10:48
93.Approach: Curiosity-Bottleneck1:11:36
94.Experiments: Static Environment1:13:25
95.Experiments: Treasure-Hunt1:13:57
96.Experiments: Atari Hard-Exploration Games1:14:37
97.Paper: Dynamic Weights in Multi-Objective Deep Reinforcement Learning1:15:03
98.Problem1:15:17
99.Conditioned Network (CN)1:16:14
100.Updating the Conditioned Network1:16:34
101.Diverse Experience Replay (DER)1:17:12
102.Our CN algorithm converges to near-optimality1:18:08
103.Diversity is crucial for large but sparse weight changes1:18:49
104.Paper: Fingerprint Policy Optimisation for Robust Reinforcement Learning1:19:36
105.Motivation1:19:48
106.Naive application of policy gradients1:20:39
107.Fingerprint Policy Optimisation (FPO)1:21:20
108.Policy fingerprints1:22:40
109.Results1:23:17