Continual Learning and Neural Networks’ Scaling Limit(s)
In this project, we aim to study the effect of a network's architecture on continual learning, with a specific focus on scaling the network to large width and depth, and on the interplay of these scaling limits with other architectural components such as residual connections.
Continual learning is a machine learning paradigm that focuses on learning a set of tasks in a sequential fashion. The ideal objective is to learn new tasks flexibly (plasticity) without forgetting what has already been learned (stability). It is well established that neural networks suffer from catastrophic forgetting: they fail to retain knowledge even across very few tasks. Although several methods have been developed to counteract it, finding a good trade-off between stability and plasticity remains a largely open problem in continual learning [Wang et al., 2023].
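To make the phenomenon concrete, here is a minimal sketch of catastrophic forgetting on synthetic data: a small network is trained on task A, then on task B without revisiting task A's data, and the loss on task A is measured before and after. The two-task regression setup, network sizes, and hyperparameters are illustrative assumptions, not the project's actual benchmark.

    import torch

    torch.manual_seed(0)
    d, N, steps, lr = 16, 256, 300, 0.1

    def make_task():
        # A toy regression task: random inputs with a random linear teacher.
        X = torch.randn(128, d)
        w = torch.randn(d, 1)
        return X, X @ w / d ** 0.5

    net = torch.nn.Sequential(torch.nn.Linear(d, N), torch.nn.ReLU(), torch.nn.Linear(N, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    mse = torch.nn.MSELoss()

    def train(X, y):
        for _ in range(steps):
            opt.zero_grad()
            mse(net(X), y).backward()
            opt.step()

    (XA, yA), (XB, yB) = make_task(), make_task()

    train(XA, yA)
    print(f"task A loss after learning A: {mse(net(XA), yA).item():.4f}")

    train(XB, yB)  # task B is learned with no further access to task A's data
    print(f"task A loss after learning B: {mse(net(XA), yA).item():.4f}  <- forgetting")
    print(f"task B loss after learning B: {mse(net(XB), yB).item():.4f}")

Running this, the task A loss is near zero after learning A and jumps back up after learning B, which is precisely the stability failure that continual learning methods try to mitigate.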
Why Scaling Limits? It turns out that there are several ways to scale up the architecture, resulting in different limits with different properties, especially when it comes to feature learning. For instance, when the network's width N is taken to infinity, the network enters the so-called kernel regime, where the predictions can be described in closed form with the machinery of kernels and Gaussian processes [Neal, 1996, Lee et al., 2017, Jacot et al., 2018]. Crucially for this project, there is no feature learning in this limit! Without feature learning, the network is expected to forget less catastrophically (at the cost of reduced plasticity), as shown in Mirzadeh et al. [2022].
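The approach to the kernel regime is easy to observe empirically. Below is a minimal sketch, assuming a two-layer ReLU network in the NTK parametrization (O(1) weight entries, 1/sqrt(fan-in) factors in the forward pass) trained on toy data: the relative movement of the hidden-layer features after training shrinks as the width N grows, which is the finite-width signature of the "no feature learning" limit.

    import torch

    torch.manual_seed(0)
    d, n, steps, lr = 16, 64, 200, 0.5
    X, y = torch.randn(n, d), torch.randn(n, 1)

    for N in [64, 256, 1024, 4096]:
        # NTK parametrization: standard-normal weights, scaling in the forward pass.
        W = torch.randn(d, N, requires_grad=True)
        a = torch.randn(N, 1, requires_grad=True)

        def phi(W):
            return torch.relu(X @ W / d ** 0.5)  # hidden-layer features

        h0 = phi(W).detach()  # features at initialization
        opt = torch.optim.SGD([W, a], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((phi(W) @ a / N ** 0.5 - y) ** 2).mean()
            loss.backward()
            opt.step()

        rel = ((phi(W).detach() - h0).norm() / h0.norm()).item()
        print(f"N={N:5d}  relative feature movement: {rel:.4f}")

The printed feature movement decays roughly like 1/sqrt(N): at infinite width the features are frozen and only the kernel-regime readout changes.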
Alternatively, if the architecture and learning rate are parametrized slightly differently, the network learns features at every layer as N → ∞ [Yang and Hu, 2020, Bordelon and Pehlevan, 2022]. Recently, these limits have been extended to infinite depth L → ∞ as well [Bordelon et al., 2023]! In practice, the rate of feature learning can be easily controlled through a hyperparameter (as you will learn during the project).
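As a sketch of how a single hyperparameter can control the rate of feature learning, the snippet below uses the output-rescaling ("lazy training") convention: the centered network output is multiplied by a scale gamma and the learning rate divided by gamma^2, so larger gamma pushes training toward the lazy, kernel-like regime and smaller gamma toward the rich, feature-learning regime. The exact parametrization and the name of the knob differ across the papers cited above (e.g., the richness parameter of Bordelon and Pehlevan [2022]); this is one illustrative convention at fixed width, with toy data and hyperparameters chosen for the demonstration.

    import torch

    torch.manual_seed(0)
    d, N, n, steps = 16, 512, 64, 300
    X, y = torch.randn(n, d), torch.randn(n, 1)

    # gamma is the laziness/richness scale: output multiplied by gamma,
    # learning rate divided by gamma**2.
    for gamma in [0.5, 2.0, 8.0]:
        W = torch.randn(d, N, requires_grad=True)
        a = torch.randn(N, 1, requires_grad=True)

        def raw(W, a):
            return torch.relu(X @ W / d ** 0.5) @ a / N ** 0.5

        f0 = raw(W, a).detach()  # centering so the model output starts at zero
        h0 = torch.relu(X @ W / d ** 0.5).detach()  # features at initialization

        opt = torch.optim.SGD([W, a], lr=0.2 / gamma ** 2)
        for _ in range(steps):
            opt.zero_grad()
            loss = ((gamma * (raw(W, a) - f0) - y) ** 2).mean()
            loss.backward()
            opt.step()

        h = torch.relu(X @ W / d ** 0.5).detach()
        rel = ((h - h0).norm() / h0.norm()).item()
        print(f"gamma={gamma:4.1f}  relative feature movement: {rel:.4f}")

All three runs fit the training targets, but the amount of feature movement decreases monotonically with gamma: the same architecture can be tuned anywhere between the lazy and rich regimes.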
The existence of different limits raises several questions: Which limit is best for continual learning? What is the role of feature learning? Can we obtain consistent improvements as we scale up the architecture, while making the most efficient use of the parameters?