Datacast Episode 61: Meta Reinforcement Learning with Louis Kirsch

The 61st episode of Datacast is my chat with Louis Kirsch, a Ph.D. student at the Swiss AI Lab IDSIA, advised by Prof. Jürgen Schmidhuber.

We had a wide-ranging conversation covering his interest in programming growing up, his foray into AI research, the intersection of meta-learning and reinforcement learning, contemporary challenges in AI, working with Prof. Schmidhuber, and much more.

Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) TuneIn, (5) RadioPublic, (6) Stitcher, and (7) iHeartRadio

Key Takeaways

Here are highlights from my conversation with Louis:

On Being A Self-Taught Programmer

I started programming when I was about 10. I was super fascinated by the idea of telling a computer what to do instead of having to do it myself. This idea of magnifying my own effort as much as possible has been a recurring theme in my life.

When I was 13, I wrote a game on my small computer with the XNA framework. It took me an entire year to finish because I was learning everything on the fly. On my 14th birthday, I burned CDs containing my game for my friends. That was an awesome moment.

A central theme in my life is being a self-taught programmer. Whenever I want to achieve something, I just take it into my own hands. That mindset manifests itself in the dream of creating something big in the future. When I turned 18, I started my own small company with a friend. We started with freelancing. That was a dream come true for me: I had the opportunity to write real-world software, deal with clients, and learn about finances and taxes.

The engineering mindset from that time has stuck with me in some sense: striking the right balance between making things work in practice and achieving the bigger vision.

On Getting a Bachelor's at Hasso Plattner Institute

HPI is a private institute financed by Hasso Plattner, the co-founder of the giant German software company SAP. It's a small institute with just 500 people, including faculty and students. My greatest friendships were formed during that time.

The Theoretical Computer Science courses were the ones I enjoyed the most. I already had a bit of a software engineering background when I started, but there were so many topics in theoretical CS that I was not yet familiar with. Additionally, the introductory Machine Learning course ultimately inspired me to follow that direction.

My Bachelor thesis was the first AI research I engaged with. The more I learned about Machine Learning, the more I was bothered by the fact that the learning part of ML is, in some sense, limited. For example, an image classifier can learn from data, but its architecture needs to be designed manually. This inspired me to perform architecture search for a convolutional classifier. I came up with a differentiable variant of architecture search that grows and shrinks the number of neurons and layers in the ConvNet.
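The thesis code is not reproduced here, but a minimal JAX sketch of the general idea, with illustrative names of my own, could look like the following: each hidden unit is scaled by a learnable gate, and a penalty on the gates lets unneeded units shrink away during ordinary gradient training (in this toy, "growing" amounts to starting with more units than needed).

```python
import jax
import jax.numpy as jnp

def forward(params, x):
    h = jnp.tanh(x @ params["w1"])
    h = h * jax.nn.sigmoid(params["alpha"])   # differentiable per-unit gate in (0, 1)
    return h @ params["w2"]

def loss(params, x, y, lam=1e-2):
    mse = jnp.mean((forward(params, x) - y) ** 2)
    width_penalty = jnp.sum(jax.nn.sigmoid(params["alpha"]))  # pushes gates toward 0
    return mse + lam * width_penalty

key = jax.random.PRNGKey(0)
params = {"w1": jax.random.normal(key, (8, 32)) * 0.1,
          "alpha": jnp.zeros(32),             # gates start half-open
          "w2": jax.random.normal(key, (32, 1)) * 0.1}
x = jax.random.normal(key, (16, 8))
y = jnp.sum(x, axis=1, keepdims=True)

# One backprop step updates the weights and the (soft) architecture jointly.
grads = jax.grad(loss)(params, x, y)
```

Units whose gates reach zero can then be pruned, which is what "shrinking" the architecture amounts to in this toy version.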

While taking a deep learning class, I worked on a big project implementing speech recognition under a constrained GPU memory budget and with limited training data. My ACL paper investigates how one can train a speech recognizer on a large English corpus and apply transfer learning: taking the resulting weights as an initialization and continuing training on a smaller German corpus. Ultimately, that led to fairly large savings in GPU memory and required training data.
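As a rough, self-contained sketch of that transfer recipe (a toy stand-in, not the paper's speech model; all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_labels):
    # toy "acoustic model": a fixed feature layer plus a label-specific output layer
    return {"hidden": rng.normal(0, 0.1, (40, 128)),
            "output": rng.normal(0, 0.1, (128, n_labels))}

def train(params, features, labels, steps, lr=0.01):
    # stand-in loop: softmax classification, updating only the output layer
    # (a real ASR system would train a deep network with a CTC loss)
    for _ in range(steps):
        h = np.tanh(features @ params["hidden"])
        logits = h @ params["output"]
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        probs[np.arange(len(labels)), labels] -= 1.0  # softmax cross-entropy gradient
        params["output"] -= lr * h.T @ probs / len(labels)
    return params

# 1) Pretrain on the (large) English corpus.
english_x, english_y = rng.normal(size=(512, 40)), rng.integers(0, 30, 512)
params = train(init_params(n_labels=30), english_x, english_y, steps=200)

# 2) Transfer: keep the pretrained weights as the initialization, re-initialize
#    only the output layer for the German label set, and continue training on
#    the (much smaller) German corpus.
params["output"] = rng.normal(0, 0.1, (128, 32))
german_x, german_y = rng.normal(size=(64, 40)), rng.integers(0, 32, 64)
params = train(params, german_x, german_y, steps=50)
```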

On Getting a Master's at University College London

Gatsby Computational Neuroscience Unit (Source: https://www.ucl.ac.uk/gatsby/)

Going to UCL was one of the best decisions I ever made. It threw me into this ML environment, and I have never learned so much in a single year. I transitioned from a software engineer to an ML researcher.

The best UCL course I took was “Probabilistic and Unsupervised Learning” at the Gatsby Unit, which was insanely densely packed. It allowed me to learn a lot from the professors and my peers and to get up to speed quickly.

On Modular Networks

If you look at the history of deep learning, it's not necessarily a fancy new learning algorithm that ultimately improves a model's performance. Instead, a driving factor is how you scale the dataset size and the number of model parameters. The human brain has an estimated 150 trillion synapses; as a rough approximation, that corresponds to at least 150 trillion floating-point parameters. Existing deep learning models have maybe up to a few hundred billion parameters. There are still a few orders of magnitude in between.

The main issue is that we need to evaluate all the parameters for every input we feed into the model, which means our compute budget must scale linearly with model size. However, not all neurons in the brain fire all the time, and the energy cost is roughly proportional to the number of firing neurons. That was the inspiration for my modular networks paper.

The modular layer consists of a pool of modules and a controller that chooses the modules to execute based on the input (Source: http://louiskirsch.com/modular-networks)

  • We have a small pool of small neural networks called modules. We train both the parameters of these modules and how to compose them into a large neural network. Only some of the modules have to be executed at a time, not all of them.

  • Additionally, the module selection is deterministic. It was very important that we never have to evaluate multiple candidate modules for a single input (as previously done with mixture-of-experts approaches).

  • We also examined the module collapse phenomenon. When you start training, you always use the same modules in the beginning. Those modules then get better and better over time, and when you decide which modules to pick next, you pick the ones that are already better. As a result, some of the modules never get trained. (A minimal sketch of such a modular layer follows this list.)
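Here is a minimal numpy sketch of the modular layer described above (names and sizes are illustrative, not the paper's code): a pool of M modules plus a controller that deterministically picks one module per input, so only 1/M of the layer's parameters are ever evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d_in, d_out = 4, 16, 16

modules = [rng.normal(0, 0.3, (d_in, d_out)) for _ in range(M)]  # the module pool
controller = rng.normal(0, 0.3, (d_in, M))

def modular_layer(x):
    scores = x @ controller            # the controller scores every module...
    k = int(np.argmax(scores))         # ...but deterministically picks just one
    return np.tanh(x @ modules[k]), k  # only module k is executed

x = rng.normal(size=d_in)
y, chosen = modular_layer(x)
print(f"executed module {chosen} of {M}")
```

A greedy rule like this argmax is also where module collapse can come from: whichever modules get picked early are the ones that get trained, and therefore keep being picked, so the training procedure has to counteract that.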

Our follow-up report rethought this approach: maybe it would be better to simply turn certain neurons in the network on and off. The main insight is that perhaps we can use sparsity, either in the weights or in the activations of the network. Sparsity in the activations is conditional on the input, while sparsity in the weights is unconditional. Whenever a weight or an activation is zero, we can skip the associated computation; in the case of a zero activation, we can skip an entire row of the weight matrix. But there is an unfortunate problem here: our GPUs are absolutely awful at skipping those zero entries. Ultimately, the long-term goal is to develop hardware capable of leveraging such sparsity. The near-term goal would be something equivalent to modular networks, as suggested above.
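The row-skipping argument is easy to verify numerically. A small numpy illustration (mine, not from the report): whenever an activation is exactly zero, the entire corresponding row of the next weight matrix can be skipped without changing the result.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.maximum(rng.normal(size=256), 0.0)   # ReLU activations: roughly half are zero
W = rng.normal(size=(256, 128))

dense = h @ W                               # touches every row of W

active = np.nonzero(h)[0]                   # indices of the non-zero activations
sparse = h[active] @ W[active]              # skips the zero rows entirely

assert np.allclose(dense, sparse)
print(f"computed {len(active)}/{len(h)} rows")
```

The catch, as noted above, is that this kind of data-dependent gather is exactly what current GPUs handle poorly.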

On Characteristics of Machine Learning Research with Impact

I think any researcher interested in making an impact should try to learn as much as possible about why other people are successful. Answering the question of what constitutes impact is quite difficult. In academia, we like to use citations because they are one of the few things we can directly measure and compare. In this report, I analyzed what kinds of papers are highly cited and when they get cited.

While the LSTM (or its most successful version) was published in 1997, it took several years and success in real-world applications for it to be widely recognized and used in other research (Source: http://louiskirsch.com/ml-research-with-impact)

  • The majority of citations of successful papers do not come right after publication but can arrive decades later.

  • Many highly-cited papers have large-scale applications. They do not just propose an idea that might perhaps be useful in the future; they include large-scale experiments that demonstrate to others that the proposed ideas are good.

  • Highly-cited researchers have only a few publications that account for the majority of their citations. There's a lot of trial-and-error, even if you are a skilled researcher.

On Pursuing a Ph.D. at the Swiss AI Lab IDSIA

When looking for a Ph.D. opportunity, I was already interested in meta-learning and figuring out the best way to pursue AGI. Jürgen Schmidhuber is a very interesting advisor to work with. His research interests align with mine. He also started the field of meta-learning in 1987. It’s incredible how many of Jürgen’s ideas find practical applications today or are reinvented later.

In terms of the research environment, we have a lot of freedom in the group to pursue promising projects. That's something I appreciate, and it has been working quite well for me so far. The lab is located in Lugano, in the southern part of Switzerland. I love hiking in the nearby mountains and going for walks by the lake while reflecting on life.

On Meta Reinforcement Learning

There is a lot of trial-and-error involved in doing research on learning algorithms. The research community keeps inventing new reinforcement learning algorithms to solve the problems we have not yet solved. Some people call this “graduate student descent”, an analogy to gradient descent, because we need a lot of human capital to improve our learning algorithms. With meta-learning, the burden of designing a good learning algorithm no longer rests on the human researcher but shifts to learning it from data automatically.

A Map of Reinforcement Learning (Source: http://louiskirsch.com/maps/reinforcement-learning)

To deepen my understanding of the problems meta-learning would have to solve, I created a big mind map of all the challenges in reinforcement learning I was thinking about at the time and categorized them by whether they are solvable by meta-learning.

On The Path to AGI

Even if we had a perfect meta-learner, we would still need an environment to train it in and some tasks for it to perform. Ultimately, building these environments and tasks by hand would be infeasible. We need some principle for generating new problems apart from the manually designed tasks that we want the AI to achieve. These two pillars, the environments we have to generate and the meta-learner to be trained on them, turned out to be quite related to Jürgen's PowerPlay framework. A few months later, Jeff Clune published his idea of AI-Generating Algorithms, which describes a similar angle of using meta-learning as the path to AGI.

On MetaGenRL

When starting this project, I was bothered by the state of meta-learning research at the time. I have always wanted meta-learning to be the change that reduces the burden on human researchers to invent new learning algorithms. However, it seemed that existing meta-learning approaches did not solve the same problems that a human researcher would. Most SOTA meta-learning approaches could only adapt to extremely similar tasks/environments, rather than working across a wide range of environments (as a human-engineered algorithm can).

Schematic of MetaGenRL (Source: http://louiskirsch.com/metagenrl)

MetaGenRL (Meta-Learning General Reinforcement-Learner) is our first step in the direction of meta-learning general-purpose algorithms.

  • To meta-learn a single learning algorithm applicable across a wide range of RL agents and environments, we need to have a big population of agents that act in different environments.

  • We start with a random learning algorithm. Then we have a population of agents that all use this learning algorithm to learn. At the same time, we use the experiences collected from these agents to improve the learning algorithm itself.

  • In our case, we represent the learning algorithm as a small neural network acting as an objective function. This function takes inputs (the actions of the RL agent, the rewards for those actions) and produces a scalar loss. We then differentiate this scalar loss with respect to the agent's parameters. That way, we can update the agent's parameters using the objective function.

  • Meta-learning corresponds to calculating a second-order gradient on the objective function: we differentiate through the objective function and the agent in order to update the parameters of the objective function itself. (A toy sketch of this double differentiation follows.)
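A toy JAX sketch of that double differentiation (my own illustration, not MetaGenRL itself; all shapes and names are made up): a small network phi serves as the learned objective, the agent theta takes one gradient step on it, and phi is then updated by differentiating through that step against an outer objective standing in for the environment's return.

```python
import jax
import jax.numpy as jnp

def learned_objective(phi, actions, rewards):
    # a tiny neural network mapping (actions, rewards) to a scalar loss
    inp = jnp.concatenate([actions, rewards])
    return jnp.tanh(inp @ phi["w"]) @ phi["v"]

def agent_actions(theta, obs):
    return jnp.tanh(obs @ theta)

def inner_update(phi, theta, obs, rewards, lr=0.1):
    # first-order step: differentiate the learned objective w.r.t. the agent
    inner_loss = lambda t: learned_objective(phi, agent_actions(t, obs), rewards)
    return theta - lr * jax.grad(inner_loss)(theta)

def outer_loss(phi, theta, obs, rewards):
    new_theta = inner_update(phi, theta, obs, rewards)
    # stand-in for the true return of the *updated* agent in the environment
    return -jnp.sum(agent_actions(new_theta, obs) * rewards)

key = jax.random.PRNGKey(0)
obs = jax.random.normal(key, (4,))
theta = jax.random.normal(key, (4, 3)) * 0.1
rewards = jnp.ones(3)
phi = {"w": jax.random.normal(key, (6, 8)) * 0.1,
       "v": jax.random.normal(key, (8,)) * 0.1}

# second-order gradient: flows through the jax.grad call inside inner_update
meta_grads = jax.grad(outer_loss)(phi, theta, obs, rewards)
```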

We were able to show that this meta-learned objective can be trained in one environment and later applied to an entirely different environment, where it still drives learning successfully. In other words, it generalizes to significantly different environments. For me, this was a breakthrough, and I was delighted with the results.

On Variable Shared Meta-Learning

Variable Shared Meta-Learning (VSML) is the next step after MetaGenRL. With MetaGenRL, we were able to learn somewhat general-purpose learning algorithms. However, we still hard-coded many inductive biases (like back-propagation and objective functions) into the system, which means we could have made sub-optimal choices there. VSML discards these human-made design choices.

My intuition is that learning algorithms (such as back-propagation) are simple principles that apply across all neural networks. Only a few bits describe the learning algorithm itself, while lots of bits are the result of what is being learned. Usually, a learning algorithm extracts information from the environment and uses it to update the weights of the neural network.

The simplest meta-learner is a recurrent neural network (RNN) that receives feedback from the environment. In the RL context, the RNN outputs some actions, receives rewards indicating how good those actions were, and decides what actions to take next based on that feedback. In some sense, we can encode a learning algorithm in the weights of the RNN, and we can store information about what could be a better strategy in the future in the RNN's activations. However, there are quadratically many weights compared to activations (an RNN with 1,000 hidden units has roughly a million recurrent weights but only a thousand activations), meaning we have a hugely over-parametrized learning algorithm with far too many variables and very little memory.

Different perspectives on the VS-ML RNN (Source: https://arxiv.org/pdf/2012.14905.pdf)

For a general-purpose learning algorithm, we want the opposite of that. My simple solution in VSML is to introduce weight-sharing into the RNN's weight matrix, such that a few weights are replicated many times. The cool experiments we did in this paper were two-fold (a toy sketch in the same spirit follows the list):

  1. We showed that an RNN using this variable-sharing approach can implement back-propagation. We did something called learning algorithm cloning, where we trained the RNN to implement back-prop. When we then ran this RNN forward, it became better at predicting labels.

  2. We also attempted to meta-learn from scratch: here are the data and the labels; figure out how to predict these labels better just by running the RNN forward. Our trained RNN performed well on unseen datasets.
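As a loose numpy toy in the spirit of VSML (much simplified relative to the paper, with names of my own choosing): every "weight" of a layer is replaced by a small per-connection state vector, and one tiny set of shared parameters updates all of those states. The learning algorithm then lives in the few shared parameters, while what has been learned lives in the many states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, k = 4, 3, 8                       # k = size of each connection's state

shared = rng.normal(0, 0.3, (k + 3, k))        # the only "learning algorithm" bits
state = rng.normal(0, 0.1, (n_in, n_out, k))   # replicated across all connections

def forward(x):
    # the first state component acts as the connection's effective weight
    return np.tanh(x @ state[:, :, 0])

def update(x, y, err):
    # the same shared rule updates every connection's state, using only
    # locally available signals (input, output, feedback)
    for i in range(n_in):
        for j in range(n_out):
            inp = np.concatenate([state[i, j], [x[i], y[j], err[j]]])
            state[i, j] = np.tanh(inp @ shared)

x = rng.normal(size=n_in)
y = forward(x)
update(x, y, err=np.ones(n_out) - y)   # feedback enters like any other input
```

Meta-training, i.e., finding shared parameters under which repeated forward/update steps act like a good learning algorithm, is omitted here; that is what the two experiments above accomplish.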

On Making a Dent in AI Research

You should start by identifying your goals. The top priority in the beginning should be accumulating knowledge and skills: implementing models, doing experimentation, reading research papers, etc. Writing blog posts helps you develop a deep understanding and your own ideas. When things get hard (your papers get rejected, your experiments do not work for months, etc.), it's good to have a goal in front of you, so you know why you're doing these things.

Also, keep learning. I constantly try to figure out what people with better success are doing differently. Deep learning research is a lot about running, designing, and evaluating experiments.

Finally, networking and self-promotion are crucial. You have to learn how to sell yourself, get to know important people in your field, and perhaps collaborate with them if possible.

Show Notes

  • (2:05) Louis went over his childhood as a self-taught programmer and his early days as a freelance developer.

  • (4:22) Louis described his overall undergraduate experience getting a Bachelor’s degree in IT Systems Engineering from Hasso Plattner Institute, a highly-ranked computer science university in Germany.

  • (6:10) Louis dissected his Bachelor thesis at HPI called “Differentiable Convolutional Neural Network Architectures for Time Series Classification,” which addresses the problem of automatically designing architectures for time series classification efficiently, using a regularization technique for ConvNets that enables joint training of network weights and architecture through back-propagation.

  • (7:40) Louis provided a brief overview of his publication “Transfer Learning for Speech Recognition on a Budget,” which explores Automatic Speech Recognition training by model adaptation under constrained GPU memory, throughput, and training data.

  • (10:31) Louis described his one-year Master of Research degree in Computational Statistics and Machine Learning at University College London, supervised by David Barber.

  • (12:13) Louis unpacked his paper “Modular Networks: Learning to Decompose Neural Computation,” published at NeurIPS 2018 — which proposes a training algorithm that flexibly chooses neural modules based on the processed data.

  • (15:13) Louis briefly reviewed his technical report, “Scaling Neural Networks Through Sparsity,” which discusses near-term and long-term solutions to handle sparsity between neural layers.

  • (18:30) Louis mentioned his report, “Characteristics of Machine Learning Research with Impact,” which explores questions such as how to measure research impact and what questions the machine learning community should focus on to maximize impact.

  • (21:16) Louis explained his report, “Contemporary Challenges in Artificial Intelligence,” which covers lifelong learning, scalability, generalization, self-referential algorithms, and benchmarks.

  • (23:16) Louis talked about his motivation to start a blog and discussed his two-part blog series on intelligence theories (part 1 on universal AI and part 2 on active inference).

  • (27:46) Louis described his decision to pursue a Ph.D. at the Swiss AI Lab IDSIA in Lugano, Switzerland, where he has been working on Meta Reinforcement Learning agents with Jürgen Schmidhuber.

  • (30:06) Louis created a very extensive map of reinforcement learning in 2019 that outlines the goal, methods, and challenges associated with the RL domain.

  • (33:50) Louis unpacked his blog post reflecting on his experience at NeurIPS 2018 and providing updates on the AGI roadmap regarding topics such as scalability, continual learning, meta-learning, and benchmarks.

  • (37:04) Louis dissected his ICLR 2020 paper “Improving Generalization in Meta Reinforcement Learning using Learned Objectives,” which introduces a novel algorithm called MetaGenRL, inspired by biological evolution.

  • (44:03) Louis elaborated on his publication “Meta-Learning Backpropagation And Improving It,” which introduces the Variable Shared Meta-Learning framework that unifies existing meta-learning approaches and demonstrates that simple weight-sharing and sparsity in a network are sufficient to express powerful learning algorithms.

  • (51:14) Louis expanded on his idea to bootstrap AI, which entails how the task, the general meta-learner, and the unsupervised objective should interact (proposed at the end of his invited talk at NeurIPS 2020).

  • (54:14) Louis shared his advice for individuals who want to make a dent in AI research.

  • (56:05) Louis shared his three most useful productivity tips.

  • (58:36) Closing segment.

Louis’s Contact Info

Mentioned Content

Papers and Reports

Blog Posts

People

Book

Note: Louis is an organizer at the upcoming ICLR 2021 workshop on a roadmap to never-ending RL in May. Get involved here: https://sites.google.com/view/neverendingrl

About the show

Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journey and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths, from scientists and analysts to founders and investors, to analyze the case for using data in the real world and extract their mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.