CHAPTERS
1. Welcome back to ICML 2019 presentations. This session on Deep Reinforcement Learning... (00:00)
2. Paper: An Investigation of Model-Free Planning (07:45)
3. RL agents in complex domains (08:00)
4. Planning network architecture (10:07)
5. Sokoban (10:40)
6. What is planning? (12:35)
7. Planning: Behavioral Characteristics (13:11)
8. Can these methods plan? (14:10)
9. Our investigation (14:21)
10. RL Methods (14:57)
11. Performance on Sokoban (15:24)
12. Stacked ConvLSTMs (17:52)
13. Repeat within time-step (18:38)
14. Deep Recurrent ConvLSTM (DRC) architecture (19:06)
15. One common view of RNN in RL (19:31)
16. What we want (20:02)
17. Sokoban (Early training) (20:20)
18. Planning: Behavioural Characteristics (20:52)
19. Generalization / Data efficiency (20:58)
20. Time Scalability (22:15)
21. Other planning domains (23:09)
22. Takeaways (23:37)
23.Q&A24:31
24.Paper: CURIOUS: Intrinsically Motivated Modular Multi-Goal RL28:12
25.Problem: Intrinsically Motivated Modular Multi-Goal RL28:36
26.The Curious Algorithm29:25
27.Modular goal encoding vs Multi-Goal Module Experts29:52
28.Automatic Curriculum with Absolute Learning Progress30:07
29.Resilience to Distracting Goals30:58
30.Resilience to Forgetting and Sensory Failures31:29
31.Curious: Intrinsically Motivated Modular Multi-Goal RL31:57
32.Paper: Task-Agnostic Dynamics Priors for Deep Reinforcement Learning32:40
33.Key Questions32:51
34.Dynamics Model in RL33:26
35.Overall Approach33:58
36.Spatial34:10
37.Spatial Memory34:29
38.Experimental Setup35:03
39.Model Predictions35:36
40.Predicting Physical Parameters35:58
41.Policy Learning: PhysShooter36:26
42.Policy Learning: Atari36:43
43.Transfer Learning36:52
44.Conclusion37:03
45.Paper: Diagnosing Bottlenecks in Deep Q-learning Algorithms37:19
46.Motivation37:33
47.How does function approximation affect convergence?38:25
48.Does overfitting occur?39:11
49.Can early stopping help?39:53
50.How to choose the sampling distribution?40:04
51.Adversarial Feature Matching (AFM)40:47
52.Paper: Collaborative Evolutionary Reinforcement Learning41:45
53.A simple actor-critic policy gradient setup41:55
54.Learner42:09
55.What do we optimize exactly?42:16
56.Portfolio of Learners (varying discount rates)42:44
57.Why varying discount rates?42:59
58.Adding a Resource Manager44:33
59.Adding Neuroevolution45:02
60.Experiment: Humanoid45:34
61.Paper: EMI: Exploration with Mutual Information47:11
62.Outline47:26
63.Sparse-reward environments47:31
64.Intuition47:52
65.Mutual information maximizing embedding representations48:31
66.Lower bound of MI49:44
67.Lower bound of MI using Jensen-Shannon divergence50:14
68.Maximizing lower bound of MI for embedding learning52:18
69.Architecture for MI estimation52:56
70.Linear dynamics model with error model53:50
71.Final learning objective55:28
72.Intrinsic reward augmentation56:09
73.Pseudocode56:35
74.Baselines57:42
75.Environments for evaluation 59:02
76.Experimental results59:43
77.Learned EMI embeddings1:02:26
78.Conclusion1:03:44
79.Q&A1:04:40
80.Paper: Imitation Learning from Imperfect Demonstration1:06:07
81.Introduction1:06:18
82.Motivation1:06:53
83.Generative Adversarial Imitation Learning [1]1:07:25
84.Problem Setting1:07:45
85.Proposed Method 1: Two-Step Importance Weighting Imitation Learning1:08:16
86.Proposed Method 2: GAIL with Imperfect Demonstration and Confidence1:08:43
87.Setup1:09:11
88.Results: Higher Average Return of the Proposed Methods1:09:18
89.Results: Unlabeled Data Helps1:09:31
90.Conclusion1:09:53
91.Paper: Curiosity-Bottleneck: Exploration by Distilling Task-Specific Novelty1:10:17
92.Motivation: Exploration under Distraction1:10:48
93.Approach: Curiosity-Bottleneck1:11:36
94.Experiments: Static Environment1:13:25
95.Experiments: Treasure-Hunt1:13:57
96.Experiments: Atari Hard-Exploration Games1:14:37
97.Paper: Dynamic Weights in Multi-Objective Deep Reinforcement Learning1:15:03
98.Problem1:15:17
99.Conditioned Network (CN)1:16:14
100.Updating the Conditioned Network1:16:34
101.Diverse Experience Replay (DER)1:17:12
102.Our CN algorithm converges to near-optimality1:18:08
103.Diversity is crucial for large but sparse weight changes1:18:49
104.Paper: Fingerprint Policy Optimisation for Robust Reinforcement Learning1:19:36
105.Motivation1:19:48
106.Naive application of policy gradients1:20:39
107.Fingerprint Policy Optimisation (FPO)1:21:20
108.Policy fingerprints1:22:40
109.Results1:23:17