
Life Long Learning


Presentation Transcript


  1. Life Long Learning Hung-yi Lee 李宏毅 https://www.pearsonlearned.com/lifelong-learning-will-help-workers-navigate-future-work/

  2. You use the same brain to learn every lecture of this machine learning course, but for each homework you train a different neural network. Can we train the same neural network for every homework? https://www.forbes.com/sites/kpmg/2018/04/23/the-changing-nature-of-work-why-lifelong-learning-matters-more-than-ever/#4e04e90e1e95

  3. Life Long Learning (LLL), also known as Continuous Learning, Never Ending Learning, or Incremental Learning. Learning Task 1: "I can solve task 1." Learning Task 2: "I can solve tasks 1 & 2." Learning Task 3: "I can solve tasks 1 & 2 & 3." …

  4. Life-long Learning

  5. Example–Image. The network has 3 layers with 50 neurons each. Task 1 and Task 2 are both digit recognition (the example image in each task is labelled "This is '0'"). After training on Task 1: 90% on Task 1 and 96% on Task 2. After then training on Task 2: 97% on Task 2, but Task 1 drops to 80%. Forget!!!

  6. Sequential training: after Learning Task 1, Task 1 = 90% and Task 2 = 96%; after then Learning Task 2, Task 2 = 97% but Task 1 = 80%. Forget!!! Jointly training Task 1 and Task 2 together instead gives Task 1 = 89% and Task 2 = 98%. The network clearly has the capacity to learn both tasks well, so why does sequential training end up like this!?

  7. https://arxiv.org/pdf/1502.05698.pdf Example–Question Answering • Given a document, answer the question based on the document. • There are 20 QA tasks in the bAbI corpus. • Train a QA model on the 20 tasks.

  8. Example–Question Answering. Sequentially training the 20 tasks vs. jointly training the 20 tasks: the model fails sequentially not because it cannot learn the tasks, but because it does not (是不為也,非不能也). Thanks to classmate 何振豪 for providing the experimental results.

  9. Catastrophic Forgetting

  10. Wait a minute …… • Multi-task training can solve the problem! • Multi-task training can be considered as the upper bound of LLL. But it means using all the data for training: we must always keep the training data for Task 1, …, Task 999, Task 1000, … (a storage issue) and retrain on all of it whenever a new task arrives (a computation issue). A sketch of the contrast follows.
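
A minimal sketch (not code from the lecture) contrasting the sequential lifelong setting with the multi-task "upper bound" in PyTorch. The toy tasks, network size, and training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_task(n=256, dim=20):
    # hypothetical toy task: random inputs, random binary labels
    return torch.randn(n, dim), torch.randint(0, 2, (n,))

def train(model, datasets, epochs=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in datasets:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
task1, task2 = make_task(), make_task()

# Sequential (lifelong) setting: task 2 arrives after task 1; task 1's data is gone.
train(model, [task1])
train(model, [task2])          # risks catastrophic forgetting of task 1

# Multi-task upper bound: keep and pool all data. Storage and computation cost
# grow with the number of tasks, which is exactly the issue on this slide.
joint_model = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 2))
train(joint_model, [task1, task2])
```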

  11. Elastic Weight Consolidation (EWC) Basic Idea: Some parameters in the model are important to the previous tasks. Only change the unimportant parameters. $\theta^b$ is the model learned from the previous tasks, and each parameter $\theta_i^b$ has a "guard" $b_i$ indicating how important that parameter is. The loss to be optimized is $L'(\theta) = L(\theta) + \lambda \sum_i b_i (\theta_i - \theta_i^b)^2$, where $L(\theta)$ is the loss for the current task, $\theta$ are the parameters to be learned, and $\theta^b$ are the parameters learned from the previous task.

  12. Elastic Weight Consolidation (EWC) Basic Idea: Some parameters in the model are important to the previous tasks. Only change the unimportant parameters. $\theta^b$ is the model learned from the previous tasks, and each parameter has a "guard" $b_i$. The term $\lambda \sum_i b_i (\theta_i - \theta_i^b)^2$ is one kind of regularization: $\theta_i$ should be close to $\theta_i^b$ in certain directions. If $b_i = 0$, there is no constraint on $\theta_i$. If $b_i = \infty$, $\theta_i$ would always be equal to $\theta_i^b$.
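
A minimal sketch of this regularized objective in PyTorch, assuming the slide's notation: `old_params` holds a copy of $\theta^b$, `guards` holds the $b_i$, and `lam` is $\lambda$. The function name and interface are mine, not the lecture's.

```python
import torch

def ewc_loss(current_loss, model, old_params, guards, lam=1.0):
    # L'(theta) = L(theta) + lambda * sum_i b_i * (theta_i - theta_i^b)^2
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (guards[name] * (p - old_params[name]) ** 2).sum()
    return current_loss + lam * penalty

# usage sketch: loss = ewc_loss(task_loss, model, old_params, guards); loss.backward()
```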

  13. Elastic Weight Consolidation (EWC) [Figure] The error surfaces of tasks 1 & 2 (darker = smaller loss): training on Task 2 with plain gradient descent moves the parameters away from the region that is good for Task 1. Forget!

  14. Elastic Weight Consolidation (EWC) On Task 1's error surface, look at the second derivative of the loss with respect to each parameter around $\theta^b$. Where the second derivative is small (a flat direction), $b_i$ is small: it is fine to move this parameter. Where the second derivative is large (a sharp direction), $b_i$ is large: moving this parameter causes trouble. Each parameter gets a "guard" $b_i$ set this way.
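
One common way to set the guards (used in the EWC paper) is to approximate the second derivative by the diagonal Fisher information, i.e. the expected squared gradient of the log-likelihood on the previous task's data. The sketch below is my illustration, not the lecture's code; squaring batch gradients is only a coarse estimate of the per-example Fisher diagonal.

```python
import torch
import torch.nn.functional as F

def estimate_guards(model, dataloader):
    # b_i ~ average squared gradient of the log-likelihood w.r.t. parameter i
    guards = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in dataloader:
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=-1)
        F.nll_loss(log_probs, y).backward()        # negative log-likelihood of the labels
        for n, p in model.named_parameters():
            if p.grad is not None:
                guards[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: g / max(n_batches, 1) for n, g in guards.items()}
```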

  15. Elastic Weight Consolidation (EWC) In this example, $b_1$ is small while $b_2$ is large: $\theta_1$ may be moved, but try not to move $\theta_2$. Following the guarded update, the model learns Task 2 while staying close to the Task 1 solution along the sensitive direction. Do not forget!

  16. Elastic Weight Consolidation (EWC) [Figure] Results on the MNIST permutation benchmark, from the original EWC paper. • A potential downside is intransigence: if the constraint is too strong, the model struggles to learn the new tasks.

  17. Related regularization-based approaches • Elastic Weight Consolidation (EWC) http://www.citeulike.org/group/15400/article/14311063 • Synaptic Intelligence (SI) https://arxiv.org/abs/1703.04200 • Memory Aware Synapses (MAS) https://arxiv.org/abs/1711.09601 (special part: does not need labelled data). (Vocabulary: "synaptic" is the adjective form of "synapse".)

  18. Generating Data https://arxiv.org/abs/1705.08690 https://arxiv.org/abs/1711.10563 • Conduct multi-task learning by generating pseudo-data with a generative model. After learning Task 1, train a generator for task 1 data. When the training data for Task 2 arrives, mix the generated task 1 data with it and do multi-task learning, so the model can solve both task 1 and task 2; then train a generator for task 1 & 2 data, and so on.
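
A minimal sketch of the training loop this slide describes, under heavy assumptions: `Generator`, `Solver`, the two training routines, and `label_with` are hypothetical placeholders standing in for whatever generative model and task model one uses; the papers linked above differ in the details.

```python
def lifelong_with_generative_replay(task_streams, Generator, Solver,
                                    train_solver, train_generator, label_with):
    """task_streams: an iterable of training sets, one per task, arriving in order."""
    solver, generator = Solver(), None
    for real_data in task_streams:
        if generator is None:
            mixed = list(real_data)                    # first task: real data only
        else:
            pseudo_x = generator.sample(len(real_data))
            pseudo_y = label_with(solver, pseudo_x)    # old solver labels the pseudo-data
            mixed = list(real_data) + list(zip(pseudo_x, pseudo_y))
        train_solver(solver, mixed)                    # (approximate) multi-task learning
        generator = Generator()
        train_generator(generator, mixed)              # generator now covers all tasks seen so far
    return solver
```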

  19. Adding New Classes • Learning without forgetting (LwF) • https://arxiv.org/abs/1606.09282 • iCaRL: Incremental Classifier and Representation Learning • https://arxiv.org/abs/1611.07725

  20. Life-long Learning

  21. Wait a minute …… • Why not train a separate model for each task (Learning Task 1, Learning Task 2, Learning Task 3, …)? • Knowledge cannot transfer across different tasks. • Eventually we cannot store all the models …

  22. Life-Long v.s. Transfer. Transfer Learning: learn Task 1, then fine-tune on Task 2. "I can do task 2 because I have learned task 1." (We don't care whether the machine can still do task 1.) Life-long Learning: "Even though I have learned task 2, I do not forget task 1."

  23. Evaluation. $R_{i,j}$: after training on task $i$, the performance on task $j$. If $i > j$: after training task $i$, has task $j$ been forgotten? If $i < j$: can we transfer the skill of task $i$ to task $j$? Accuracy $= \frac{1}{T}\sum_{i=1}^{T} R_{T,i}$. Backward Transfer $= \frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i} - R_{i,i}\right)$ (it is usually negative).

  24. Evaluation. $R_{i,j}$: after training on task $i$, the performance on task $j$. Accuracy $= \frac{1}{T}\sum_{i=1}^{T} R_{T,i}$. Backward Transfer $= \frac{1}{T-1}\sum_{i=1}^{T-1}\left(R_{T,i} - R_{i,i}\right)$. Forward Transfer $= \frac{1}{T-1}\sum_{i=2}^{T}\left(R_{i-1,i} - R_{0,i}\right)$, where $R_{0,i}$ is the performance on task $i$ before any training (random initialization).
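
A small sketch of these three numbers, assuming the accuracies are collected in a matrix R with T+1 rows: R[i][j] is the accuracy on task j+1 after training on task i, and row 0 is the model before any training (the $R_{0,i}$ used by Forward Transfer).

```python
import numpy as np

def lll_metrics(R):
    R = np.asarray(R, dtype=float)   # shape (T+1, T); row 0 = untrained model
    T = R.shape[1]
    accuracy = R[T].mean()                                            # (1/T) sum_i R_{T,i}
    backward = (R[T, :T-1] - R[np.arange(1, T), np.arange(T-1)]).mean()   # R_{T,i} - R_{i,i}
    forward  = (R[np.arange(1, T), np.arange(1, T)] - R[0, 1:]).mean()    # R_{i-1,i} - R_{0,i}
    return accuracy, backward, forward
```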

  25. GEM: https://arxiv.org/abs/1706.08840 A-GEM: https://arxiv.org/abs/1812.00420 Gradient Episodic Memory (GEM) • Constrain the update direction so that it stays as close as possible to the current task's gradient while not hurting the previous tasks. $g$: negative gradient of the current task; $g^1, g^2, \dots$: negative gradients of the previous tasks (this needs some stored data from the previous tasks); $g'$: the actual update direction, chosen such that $g' \cdot g^b \ge 0$ for every previous task $b$.
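
GEM enforces these inequality constraints by solving a small quadratic program over the previous-task gradients; A-GEM simplifies this to a single averaged reference gradient. The sketch below shows that simpler single-constraint projection, in my own notation with flattened gradients assumed.

```python
import torch

def project_gradient(g, g_ref, eps=1e-12):
    """g: flattened negative gradient of the current task.
    g_ref: flattened negative gradient on data stored from previous tasks."""
    dot = torch.dot(g, g_ref)
    if dot >= 0:
        return g                                  # no conflict: keep the original direction
    # otherwise drop the conflicting component so that <g', g_ref> = 0
    return g - (dot / (torch.dot(g_ref, g_ref) + eps)) * g_ref
```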

  26. Life-long Learning

  27. https://arxiv.org/abs/1606.04671 Progressive Neural Networks [Figure] One column of network per task (Task 1, Task 2, Task 3), each taking the input; earlier columns are frozen, and each new column also receives the hidden activations of the previous columns through lateral connections.
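
A minimal two-layer toy version of the progressive-network idea in the figure (a simplification of the paper's architecture, with my own sizes and class names): each task gets its own column, earlier columns are frozen, and the new column receives the frozen columns' hidden activations through lateral connections.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    def __init__(self, in_dim, hidden, out_dim, n_prev_columns=0):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden)
        self.l2 = nn.Linear(hidden, out_dim)
        # lateral connections from each previous column's hidden layer
        self.laterals = nn.ModuleList(
            [nn.Linear(hidden, out_dim, bias=False) for _ in range(n_prev_columns)])

    def forward(self, x, prev_hiddens=()):
        h = torch.relu(self.l1(x))
        out = self.l2(h)
        for lateral, ph in zip(self.laterals, prev_hiddens):
            out = out + lateral(ph)
        return out, h

# Usage sketch: task 1 trains column_1; for task 2 a new column is added,
# column_1 is frozen, and its hidden activations feed the new column laterally.
column_1 = Column(20, 50, 2)
column_2 = Column(20, 50, 2, n_prev_columns=1)
for p in column_1.parameters():
    p.requires_grad = False        # previous column is frozen
x = torch.randn(8, 20)
_, h1 = column_1(x)
y2, _ = column_2(x, prev_hiddens=(h1,))
```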

  28. Expert Gate https://arxiv.org/abs/1611.06194

  29. https://arxiv.org/abs/1511.05641 Net2Net When widening the network, initialize the new units by duplicating existing ones and add some small noise so the copies can diverge during later training. Expand the network only when the training accuracy of the current task is not good enough. https://arxiv.org/abs/1811.07017
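
A minimal sketch of the Net2WiderNet trick this slide alludes to: duplicate some existing hidden units, split their outgoing weights so the wider network computes (almost) the same function, and add a little noise so the copies can later diverge. Layer types, sizes, and the noise scale are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

def widen(fc_in: nn.Linear, fc_out: nn.Linear, new_units: int, noise=1e-4):
    """Widen the hidden layer between fc_in and fc_out by `new_units` (<= current width)."""
    in_w, in_b = fc_in.weight.data, fc_in.bias.data            # shapes (h, d), (h,)
    out_w = fc_out.weight.data.clone()                          # shape (o, h)
    idx = torch.randperm(in_w.size(0))[:new_units]              # distinct units to duplicate
    # incoming weights of the copies: same as the originals, plus small noise
    new_in_w = torch.cat([in_w, in_w[idx] + noise * torch.randn(new_units, in_w.size(1))])
    new_in_b = torch.cat([in_b, in_b[idx]])
    # outgoing weights: split between each original unit and its copy
    out_w[:, idx] /= 2.0
    new_out_w = torch.cat([out_w, out_w[:, idx]], dim=1)
    wide_in = nn.Linear(in_w.size(1), in_w.size(0) + new_units)
    wide_out = nn.Linear(in_w.size(0) + new_units, out_w.size(0))
    wide_in.weight.data, wide_in.bias.data = new_in_w, new_in_b
    wide_out.weight.data, wide_out.bias.data = new_out_w, fc_out.bias.data.clone()
    return wide_in, wide_out
```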

  30. Concluding Remarks

  31. Curriculum Learning: what is the proper learning order? Order Task 1 then Task 2: after Task 1, Task 1 = 90% and Task 2 = 96%; after then learning Task 2, Task 2 = 97% but Task 1 drops to 80%. Forget!!! Order Task 2 then Task 1: after Task 2, Task 2 = 97% and Task 1 = 62%; after then learning Task 1, Task 1 = 97% and Task 2 stays at 90%.

  32. taskonomy = task + taxonomy http://taskonomy.stanford.edu/#abstract
