Lifelong Learning via
Progressive Distillation and Retrospection

Saihui Hou1*    Xinyu Pan2*    Chen Change Loy3    Zilei Wang1    Dahua Lin2   

1 University of Science and Technology of China     2 The Chinese Univeristy of Hong Kong    
3 Nanyang Technological University     [* indicates joint first authorship]

Fig 1. Illustration of Distillation and Retrospection.

Abstract

Lifelong learning aims at adapting a learned model to new tasks while retaining the knowledge gained earlier. A key challenge for lifelong learning is how to strike a balance between the preservation on old tasks and the adaptation to a new one within a given model. Approaches that combine both objectives in training have been explored in previous works. Yet the performance still suffers from considerable degradation in a long sequence of tasks. In this work, we propose a novel approach to lifelong learning, which tries to seek a better balance between preservation and adaptation via two techniques: Distillation and Retrospection. Specifically, the target model adapts to the new task by knowledge distillation from an intermediate expert, while the previous knowledge is more effectively preserved by caching a small subset of data for old tasks. The combination of Distillation and Retrospection leads to a more gentle learning curve for the target model, and extensive experiments demonstrate that our approach can bring consistent improvements on both old and new tasks.

Architcture

Fig 2. Illustration of network structures for Distillation+Retrospection.

Performance

Table 1. Classification accuracy (%) for two-task scenario. Feature Extraction provides the reference performance for the first task while Finetuning provides the reference for the second one. D for Distillation, and R for Retrospection.

Table 2. Classification accuracy (%) for five-task scenario. The results are reported at the end of the last training stage. LwF is treated as the baseline here.

(a) Imagenet→Scenes→Birds→Flowers→Aircrafts. (b) Imagenet→Birds→Flowers→Aircrafts→Scenes.

Fig 3. Accuracy degradation on ImageNet in five-task scenario. D for Distillation, and R for Retrospection.