I started following the Deep Learning Curriculum written by Jacob Hilton, and here is what I learned from the exercise in Topic 8 - Interpretability. My solution is in the Colab notebook T8-Interpretability-solution.ipynb

It took me around 8 hours to finish the exercise, and most of that time was spent on the ARENA course material on this topic. It helped me a lot in understanding the subject, and it provides many helper functions for this exercise. Here is what I learned:

  1. Induction head: a head that implements induction behavior. It attends to the token immediately after an earlier copy of the current token, and then predicts that the attended-to token will come next.
  2. How to reverse engineer the model’s behavior.
  3. How to find induction heads by inspecting the model's attention patterns and doing direct logit attribution.
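The induction behavior in item 1 has a recognizable signature in the attention pattern: on a sequence whose second half repeats the first (period `rep_len`), an induction head at position `t` attends mostly to position `t - (rep_len - 1)`, the token right after the earlier copy of the current token. A minimal sketch of scoring that diagonal, in plain NumPy rather than the ARENA helpers (the function name and toy pattern are illustrative, not from the exercise):

```python
import numpy as np

def induction_score(attn: np.ndarray, rep_len: int) -> float:
    """Mean attention mass on the 'induction offset' diagonal.

    attn: [seq, seq] attention pattern for one head, run on a sequence
    whose second half repeats the first (period rep_len). A perfect
    induction head puts all attention at offset -(rep_len - 1).
    """
    diag = np.diagonal(attn, offset=-(rep_len - 1))
    return float(diag.mean())

# Toy check: an idealized induction attention pattern.
seq, rep_len = 10, 5
attn = np.zeros((seq, seq))
for t in range(rep_len - 1, seq):
    attn[t, t - (rep_len - 1)] = 1.0  # attend to token after earlier copy
print(induction_score(attn, rep_len))  # -> 1.0 for this perfect pattern
```

In practice one averages this score over a batch of random repeated sequences and flags heads whose score is well above the rest.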
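Direct logit attribution (item 3) relies on the fact that, up to the final LayerNorm, the logits are a linear function of the residual stream, which is itself a sum of per-component outputs (embedding plus each head and MLP). Each component's contribution to a given token's logit is therefore just its output dotted with that token's unembedding direction. A hedged sketch with made-up component vectors (the names and shapes are illustrative; the real exercise pulls these from a cached model run, with LayerNorm folded in):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 20

# Hypothetical per-component residual-stream outputs at the final position.
components = {
    "embed": rng.normal(size=d_model),
    "head_0.0": rng.normal(size=d_model),
    "head_1.0": rng.normal(size=d_model),
}
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix
correct_token = 3

# Because logits = (sum of components) @ W_U, the correct-token logit
# decomposes linearly into per-component attributions.
logit_dir = W_U[:, correct_token]
attribution = {name: float(vec @ logit_dir) for name, vec in components.items()}

# Sanity check: attributions sum to the full correct-token logit.
full_logit = sum(components.values()) @ W_U
assert np.isclose(sum(attribution.values()), full_logit[correct_token])
```

A head with a large attribution on the correct next token (e.g. the repeated token in an induction prompt) is a good candidate for the behavior you are trying to locate.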