Mechanistic Interpretability: A Survey

Published: June 7, 2024

Advances in artificial intelligence (AI) have resulted in highly capable systems making decisions for reasons we do not fully understand. This lack of understanding poses risks, including potential future risks where AI might deceive humans to accomplish undesirable goals.

Mechanistic interpretability aims to understand how neural networks calculate their outputs by breaking them into components that are more easily understood than the whole. This involves reverse engineering parts of their internal processes. By understanding these internal processes, we can make targeted changes to the models, thereby increasing trust and safety in AI systems.

This is an important area of research, as it will eventually enable us to diagnose failure modes, design fixes, and certify that models are safe for adoption by enterprises and society. It's much easier to tell whether something is safe if you can understand how it works! The basic idea is that we train a sparse autoencoder to generate interpretable features from transformer activations. This lets us look at the activations of a language model during inference and understand which parts of the model are most responsible for predicting each next token.

Keywords: sparse coding and distributed representations, disentanglement representation learning, dictionary learning, compressed sensing

Challenges in Interpretability

One major challenge in interpreting neural networks is polysemanticity. Neurons often activate for multiple, semantically distinct features, making it hard to understand their specific roles. For instance, in the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars, which are unrelated. The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models.

In a small language model discussed by Anthropic [8], a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.

Neuron 83

The Superposition Hypothesis [2] is one suspected cause of polysemanticity: neural networks represent more features than they have neurons by using an overcomplete set of directions in activation space. Instead of assigning each feature to an individual neuron, the network assigns features to directions that are combinations of neurons, packing more features into the space than it has dimensions.

Interference between non-orthogonal features means superposition is not free: features packed into shared directions bleed into one another, and the network pays a performance cost for representing more features than it has neurons. To make sense of this, the authors suggest finding a specific set of directions in activation space that separates the overlapping features, “such that each activation vector can be reconstructed from a sparse linear combination of these directions”. Sparsity means that only a few of these directions are needed to describe any given point: instead of using all possible signals to describe a situation, we use only the few that really matter. The authors train a sparse autoencoder to learn this set of directions.
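
To make this concrete, here is a small toy sketch of what superposition looks like (not from the papers; the dimensions, directions, and sparsity level are arbitrary choices for illustration): more features than dimensions, stored along non-orthogonal directions, with interference when you try to read a single feature back out.

```python
import numpy as np

# Toy superposition: 6 "features" stored in a 2-dimensional activation space.
rng = np.random.default_rng(0)
n_features, n_dims = 6, 2

# An overcomplete set of unit-norm directions in activation space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse feature vector: only two features are "on" at a time.
feature_activity = np.zeros(n_features)
feature_activity[rng.choice(n_features, size=2, replace=False)] = rng.random(2)

# The activation vector is a linear combination of the active directions.
activation = feature_activity @ directions          # shape: (n_dims,)

# Interference: because the directions are not orthogonal, projecting the
# activation back onto each direction picks up noise from the other features.
readout = directions @ activation                   # shape: (n_features,)
print("true feature activity:", np.round(feature_activity, 3))
print("noisy readout:        ", np.round(readout, 3))
```

A sparse autoencoder tries to recover something like the `feature_activity` vector from the `activation` alone.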

Disentangled Representation Learning

Disentangled representation learning [3] aims to find a representation of data where different factors of variation are separated into distinct, interpretable components. More concretely, consider a noise vector in the latent space of a generative model: if the representation is entangled, each dimension of the noise vector affects many downstream features at once. In the context of image data, disentangled representations might separate the factors of an image into components like object identity, pose, lighting, etc. For example, if we were generating dogs [5]:

Entangled

In a disentangled model, each axis of the noise vector adjusts only one of the major features and leaves the others untouched. These are our orthogonal feature knobs [5]:

Disentangled

Disentangled representation learning has been applied in some pretty interesting use cases, such as zero-shot voice style transfer [7]. This method first encodes the speaker-related style and the voice content of each input into separate low-dimensional embedding spaces; a new voice can then be generated by combining the source content embedding and the target style embedding through a decoder. Left: t-SNE visualization of speaker embeddings. Right: t-SNE visualization of content embeddings. The embeddings are extracted from voice samples of 10 different speakers.

Voice Style Transfer

Sparse autoencoders encourage the model to use only a small number of features to represent any given activation. The use of sparse autoencoders to find interpretable, monosemantic features is a way of achieving disentangled representations, which makes the neural network’s behaviour more understandable and its features more interpretable.

The authors of [1] aim to use these interpretable features to understand the causal relationships in the model’s behavior. Disentangled representations are useful for this because they make it easier to study how changes in one factor affect the output, without interference from other factors.

Perceptual Path Length (PPL)

Perceptual Path Length (PPL) is a metric used to measure the smoothness and consistency of changes in the latent space of generative models. When you traverse the latent space in the direction of a single feature, the perceptual changes should be small and smooth. This means only one feature changes gradually, leading to a short perceptual path length. This indicates good disentanglement because it shows that individual features can be modified independently without affecting others.

If moving along one direction in latent space results in large perceptual changes in multiple features, this means the path length is long and the changes are less smooth. This suggests poor disentanglement, where modifying one feature also impacts others, leading to mixed and less interpretable changes.
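
As a rough sketch of how PPL is estimated (in the spirit of the StyleGAN definition; `generator` and `perceptual_distance` are placeholders standing in for a real generative model and an LPIPS-style metric, and the sample count and epsilon are arbitrary):

```python
import torch

def perceptual_path_length(generator, perceptual_distance,
                           latent_dim=512, n_samples=1000, eps=1e-4):
    total = 0.0
    for _ in range(n_samples):
        z0 = torch.randn(1, latent_dim)
        z1 = torch.randn(1, latent_dim)
        t = torch.rand(1)
        # Two nearby points on the interpolation path between z0 and z1.
        za = torch.lerp(z0, z1, t)
        zb = torch.lerp(z0, z1, t + eps)
        # Perceptual change per unit step, with the 1/eps^2 scaling used
        # in the StyleGAN formulation of PPL.
        d = perceptual_distance(generator(za), generator(zb))
        total += float(d) / (eps ** 2)
    return total / n_samples
```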

By penalizing high PPL during training, models can be encouraged to produce smoother transitions in the latent space. This promotes better disentanglement and smoother interpolation between different data points. In image generation, this results in smoother image transitions as you walk through latent space, and it also makes it easier to project new data points into the latent space. Below, for example, are interpretable directions discovered by unsupervised methods for several datasets and generators [4].

Interpretable Directions PPL

Autoencoders

Autoencoders in general are a type of neural network used to learn efficient representations of data, typically for dimensionality reduction or feature extraction, in an unsupervised fashion. They consist of three parts (sketched in code below):

  • Encoder: Compresses the input data into a smaller latent-space representation.
  • Decoder: Reconstructs the input data from this smaller latent-space representation.
  • Reconstruction Loss: Measures how well the decoder can reconstruct the input from the latent space.
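
A minimal PyTorch sketch of these three parts; the 784-to-64 sizes are arbitrary placeholders, and the reconstruction loss here is a plain mean-squared error:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)          # compress to the latent space
        return self.decoder(z)       # reconstruct the input

model = Autoencoder()
x = torch.randn(32, 784)             # a dummy batch
x_hat = model(x)
reconstruction_loss = nn.functional.mse_loss(x_hat, x)
```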

Sparse Autoencoders: Interpretable Features

To address the problem of polysemanticity, some authors [1] propose using sparse autoencoders. Sparse autoencoders are a type of autoencoder that introduces a sparsity constraint on the hidden units (the neurons in the hidden layer), so that only a few of them are active for any given input.

  • Sparse Representation: The model is encouraged to use only a small number of active neurons (hidden units) to represent any given input. This is done by adding a sparsity penalty to the loss function that encourages the activations of the hidden units to be sparse.
  • Interpretable Features: By encouraging sparsity, the autoencoder learns to represent the data in terms of a few significant features, making it easier to interpret what each feature represents. For example, instead of each neuron activating for many different things (polysemanticity), each neuron might activate for just one specific thing (monosemanticity).
    • Unlike traditional autoencoders, sparse autoencoders are designed to be sensitive to specific types of high-level features in the data, while being insensitive to most other features. Sparse autoencoders can identify directions in the activation space that correspond to more interpretable and monosemantic features.

Pre-Training LLMs using Sparse Autoencoders

Sparse Autoencoders can be used to pretrain deep neural networks. Pretraining a deep neural network with a sparse autoencoder can help the network learn a good initial set of weights, which can improve the performance of the network on a subsequent supervised learning task.

Variational Autoencoders (VAEs)

Variational autoencoders are another type of autoencoder that introduces a probabilistic approach to the latent space representation:

  • Latent Space as Probability Distributions: Instead of encoding the input data to fixed points in the latent space, VAEs encode the data to a probability distribution (typically a Gaussian distribution). Each point in the latent space represents a possible state of the input data.
    • The latent space is typically a multivariate Gaussian distribution. Each input data point is represented by a mean and variance, capturing uncertainty and variability.
  • Generating New Data: VAEs can generate new data by sampling from these probability distributions and decoding the samples. This makes VAEs useful for generative tasks, such as creating new images or text that resemble the original data.
  • Regularization with KL Divergence: During training, VAEs add a regularization term called Kullback-Leibler (KL) divergence to the loss function. This term keeps the learned distributions close to a standard normal distribution, which regularizes the latent space and promotes a smooth, continuous structure (see the sketch below).
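
A minimal VAE sketch showing the probabilistic encoder, the reparameterization trick, and the KL regularization term; the layer sizes and the weight on the KL term are illustrative choices, not values from any of the cited papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z from N(mu, sigma^2) differentiably.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```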

Comparison to Sparse Autoencoders

  • Purpose and Focus:
    • Sparse Autoencoders: Focus on learning a few significant and interpretable features from the data. The main goal is to make the representations sparse and hence more understandable.
    • Variational Autoencoders: Focus on learning a smooth, continuous latent space that can be used to generate new data similar to the original data. The main goal is to have a well-regularized latent space for generative tasks.
  • Representation:
    • Sparse Autoencoders: Encode data into a smaller set of specific features, ensuring that only a few neurons are active at a time.
    • Variational Autoencoders: Encode data into a distribution in the latent space, ensuring that the representation captures the variability and uncertainty in the data.
  • Regularization:
    • Sparse Autoencoders: Use sparsity penalties (like L1 regularization) to limit the number of active neurons.
    • Variational Autoencoders: Use KL divergence to regularize the latent space, ensuring the distributions are close to a standard normal distribution.
  • Applications:
    • Sparse Autoencoders: Useful for tasks that require interpretable features, such as understanding the internal workings of neural networks or finding significant patterns in data.
    • Variational Autoencoders: Useful for generative tasks, such as creating new images, music, or text, and for applications where understanding the variability in the data is important.

Improvements in VAEs

On a side note, in the Image Generation realm, the paper "Improving Variational Autoencoder with Deep Feature Consistent and Generative Adversarial Training" [6] presents several enhancements to the basic VAE framework:

  • Deep Feature Consistent Principle:
    • Instead of using per-pixel loss, which can result in blurry outputs, the paper uses a feature reconstruction loss based on deep features extracted from a pre-trained network (VGGNet). This captures perceptual and spatial quality features better, leading to more realistic and coherent reconstructions.
  • Generative Adversarial Training:
    • Integrating Generative Adversarial Networks (GANs) with VAEs helps produce more natural and realistic images. The discriminator network distinguishes between real and generated images, pushing the VAE to generate outputs that resemble natural images more closely.
  • Latent Space Manipulation:
    • The learned latent space in VAEs can capture meaningful and interpretable representations of the input data. For instance, the paper demonstrates facial attribute manipulation by adding or subtracting vectors corresponding to specific attributes in the latent space.

VAE Improvements

The smooth and continuous nature of the VAE's latent space allows for gradual and interpretable transitions between different data points. This can help in understanding how specific features influence the model's behaviour, providing insights into the causal relationships within the model.

The improvements proposed in the paper, such as using deep feature consistent loss and adversarial training, ensure that the reconstructed outputs are more realistic and perceptually coherent. This quality enhancement is crucial for interpreting the latent features accurately, as better reconstructions lead to clearer and more reliable interpretations.
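
As a rough illustration of the deep feature consistent principle, the sketch below compares feature maps from a fixed convolutional network for the input and its reconstruction, rather than raw pixels. The use of VGG16 and the particular layer indices are my assumptions for illustration; the paper [6] uses VGGNet features, but the exact configuration may differ.

```python
import torch.nn.functional as F
from torchvision import models

# Fixed feature extractor; in practice you would load pretrained weights.
vgg = models.vgg16(weights=None).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def feature_consistent_loss(x, x_hat, layers=(3, 8, 15)):
    """MSE between VGG feature maps of x and x_hat at a few chosen layers.

    x, x_hat: image batches of shape (B, 3, H, W).
    """
    loss, fx, fx_hat = 0.0, x, x_hat
    for i, block in enumerate(vgg):
        fx, fx_hat = block(fx), block(fx_hat)
        if i in layers:
            loss = loss + F.mse_loss(fx_hat, fx)
        if i >= max(layers):
            break
    return loss
```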

Sparse Dictionary Learning

The idea is to use a dictionary learning algorithm (sparse autoencoder) to generate learned features from a trained model that offer a more monosemantic unit of analysis than the model's neurons themselves.

This is how the authors [1] use sparse dictionary learning techniques to identify and isolate network features from superposition. The goal is to identify network features that are in superposition by finding a set of directions in the activation space. Each activation vector can then be reconstructed from a sparse linear combination of these directions.

As input data (e.g., a sentence) passes through each layer of the language model, each layer produces a set of activations. These activations are the results of applying the layer's computations (such as matrix multiplications, non-linear transformations, and attention mechanisms) to the input or previous layer's activations.

Activations capture various features and patterns from the input data as it progresses through the model. For instance, earlier layers might capture basic linguistic features like syntax, while later layers might capture more abstract concepts like semantics or context.

By analyzing these activations, we can gain insights into what the model is learning and how it makes decisions. This is crucial for understanding the model's behavior and improving its transparency.

The types of activation include:

  • Residual Stream: In transformer models, the residual stream refers to the sum of the layer's output and its input, which helps in stabilizing training and preserving information across layers.
  • Attention Heads: These are parts of the transformer layer that compute attention scores and weighted sums of input vectors, highlighting important parts of the input data.
  • MLP (Multi-Layer Perceptron) Sublayer: These are fully connected feedforward networks within each transformer layer that further process the data.
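
A simple way to collect these activations is with a forward hook. The sketch below is generic PyTorch; `layer_module` is a placeholder for whichever residual-stream, attention, or MLP sublayer of your model you want to inspect.

```python
import torch

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        # Some sublayers return tuples; keep the main activation tensor.
        captured[name] = output[0] if isinstance(output, tuple) else output
    return hook

def capture(model, tokens, layer_module, name="layer_6"):
    handle = layer_module.register_forward_hook(save_activation(name))
    with torch.no_grad():
        model(tokens)
    handle.remove()
    return captured[name]   # e.g. shape (batch, sequence, hidden_dim)
```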

Suppose we have a set of vectors $\{x_i\}$ representing internal activations of a language model. Each vector $x_i$ can be expressed as a sparse combination of unknown vectors $\{g_j\}$. The objective is to learn a dictionary of vectors $\{f_k\}$ such that each unknown feature $g_j$ can be approximated by a dictionary feature $f_k$.
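
Written out (with notation loosely following [1], so treat the exact indexing as illustrative), the assumption is that each activation is an approximately sparse combination of ground-truth features, and the autoencoder learns a sparse code over dictionary features that reconstructs it:

$$x_i \approx \sum_j a_{i,j} \, g_j, \qquad \hat{x}_i = \sum_k c_{i,k} \, f_k, \qquad a_i, c_i \ \text{sparse}$$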

Training the Sparse Autoencoder

  1. Autoencoder Architecture:
    • The autoencoder used here has a single hidden layer and is trained to minimize reconstruction loss with a sparsity penalty.
    • Encoder: Maps input activations to a hidden representation using a weight matrix $M$ and a bias vector $b$. The activation function is ReLU.
    • Decoder: Reconstructs the original input from the hidden representation. The weight matrix of the decoder is the transpose of the encoder's weight matrix (it has tied weights).
  2. Equations:
    • Encoder: $c = \mathrm{ReLU}(Mx + b)$
    • Decoder: $\hat{x} = M^T c$
    • Here, $c$ represents the sparse hidden representation, and $\hat{x}$ is the reconstructed input vector.
  3. Loss Function:
    • The loss function used to train the autoencoder has two components:
      • Reconstruction Loss: Measures how well the reconstructed input $\hat{x}$ matches the original input $x$.
      • Sparsity Loss: Encourages the hidden representation $c$ to be sparse, i.e., most of its elements should be zero.
    • The combined loss function is given by the equation below, where $\alpha$ is a hyperparameter controlling the weight of the sparsity term.
$$L(x) = \|x - \hat{x}\|_2^2 + \alpha \|c\|_1$$
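
Putting the architecture and loss above together, here is a minimal PyTorch sketch; the activation and dictionary sizes, and the value of alpha, are placeholders rather than settings from [1]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Single-hidden-layer autoencoder with tied weights, as described above."""
    def __init__(self, activation_dim=512, dict_size=2048):
        super().__init__()
        # Rows of M are the dictionary features.
        self.M = nn.Parameter(torch.randn(dict_size, activation_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(dict_size))

    def forward(self, x):
        c = F.relu(x @ self.M.T + self.b)   # encoder: c = ReLU(Mx + b)
        x_hat = c @ self.M                  # tied-weight decoder: x_hat = M^T c
        return x_hat, c

def sae_loss(x, x_hat, c, alpha=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the code c.
    return F.mse_loss(x_hat, x) + alpha * c.abs().sum(dim=-1).mean()
```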

The training process then proceeds as follows (a code sketch follows the list):

  1. Training Data: The autoencoder is trained on internal activations of a language model (e.g., Pythia-70M).
  2. Normalization: The rows of the weight matrix $M$ are normalized to prevent the model from reducing the sparsity loss term by increasing the size of the feature vectors.
    1. The sparsity loss term typically involves an $L_1$ regularization, which penalizes the magnitude of the activations in the hidden layer.
    2. If the model increases the size of the weight vectors (the rows of $M$), it can achieve smaller hidden activations for the same inputs, because larger weights let smaller activations produce the same reconstruction. This minimizes the sparsity loss without achieving true sparsity.
  3. Training Objective: The goal is to learn a dictionary of features that can sparsely reconstruct the internal activations of the language model, thereby identifying interpretable and monosemantic features.
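
Continuing the sketch above, a training loop might look like the following; `activation_batches` is a placeholder for batches of stored language-model activations, and the hyperparameters are illustrative:

```python
sae = SparseAutoencoder(activation_dim=512, dict_size=2048)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for x in activation_batches:                 # placeholder: tensors of shape (B, 512)
    x_hat, c = sae(x)
    loss = sae_loss(x, x_hat, c, alpha=1e-3)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Renormalize the dictionary rows so the model cannot shrink the L1 term
    # by scaling up the feature vectors instead of becoming genuinely sparse.
    with torch.no_grad():
        sae.M /= sae.M.norm(dim=1, keepdim=True)
```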

For example, when sparse dictionary learning is applied to language models:

  1. Feature Dictionary: The learned weight matrix $M$ forms the feature dictionary, where each row represents a dictionary feature.
  2. Output Reconstruction: The output $\hat{x}$ is a reconstruction of the original input $x$ using a sparse combination of the dictionary features.

In general:

  • Sparse Representations: By enforcing sparsity, the method ensures that each activation vector is represented using only a few significant dictionary features. This sparsity leads to more interpretable and monosemantic features.
  • Normalization and Tied Weights: These techniques help maintain the stability and efficiency of the model during training.
  • Loss Function: The combination of reconstruction and sparsity loss ensures that the autoencoder learns to accurately reconstruct the input while keeping the hidden representations sparse.

In short, training a sparse autoencoder to perform dictionary learning on network activations aims to uncover a set of interpretable, sparse features that better explain the internal workings of language models, disentangling features that were previously superimposed.

Results

Using a sparse autoencoder, we can extract features that represent purer concepts than neurons do. For example, turning ~500 neurons into ~4000 features uncovers things like DNA sequences, HTTP requests, and legal text.

Dictionary Learning

Evaluating Interpretability of Learned Features

The authors [1] use automated methods to measure the interpretability of the features learned by the sparse autoencoders. In summary, the autointerpretability procedure takes samples of text where the dictionary feature activates, asks a language model to write a human-readable interpretation of the dictionary feature, and then prompts the language model to use this description to predict the dictionary feature’s activation on other samples of text. The correlation between the model’s predicted activations and the actual activations is that feature’s interpretability score. See Bills et al. (2023) [10] for further details. The paper presents descriptions and interpretability scores for some of these features.

| Feature | Description (Generated by GPT-4) | Interpretability Score |
| --- | --- | --- |
| 1-0000 | parts of individual names, especially last names. | 0.33 |
| 1-0001 | actions performed by a subject or object. | -0.11 |
| 1-0002 | instances of the letter ‘W’ and words beginning with ‘w’. | 0.55 |
| 1-0003 | the number ‘5’ and also records moderate to low activation for personal names and some nouns. | 0.57 |
| 1-0004 | legal terms and court case references. | 0.19 |
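
The scoring step at the end of the autointerpretability procedure is simply a correlation between real and simulated activations; a minimal sketch (both arrays are placeholders for values produced elsewhere in the pipeline):

```python
import numpy as np

def interpretability_score(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Pearson correlation between a feature's true activations and the
    activations predicted by a language model from its written explanation."""
    return float(np.corrcoef(actual, predicted)[0, 1])
```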

The authors compare their method to several baseline techniques, each representing a different way to decompose or analyse the activations of the language model. The figure below indicates that sparse autoencoders produce more interpretable features than the baselines.

Interpretability Comparison

The authors employ a technique called activation patching, where they edit the model's internal activations along the directions indicated by their learned dictionary features. They measure the changes in the model's output to determine the causal impact of these features, and find that fewer patches are required to reach a given level of KL divergence on the task studied than with comparable decompositions such as PCA.
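
As a rough sketch of what such an intervention involves (generic code, not the authors' implementation): overwrite the component of an activation along one dictionary direction, re-run the model with the patched activation, and compare the output distributions with a KL divergence. In practice the patched activation would be injected back into the forward pass with a hook like the one sketched earlier.

```python
import torch
import torch.nn.functional as F

def patch_along_feature(activation, feature_direction, new_coefficient):
    """Set the activation's component along a dictionary direction to a chosen value."""
    d = feature_direction / feature_direction.norm()
    current = activation @ d                              # coefficient along the feature
    return activation + (new_coefficient - current).unsqueeze(-1) * d

def kl_from_patch(logits_clean, logits_patched):
    """KL(clean || patched) over the next-token distribution."""
    log_p = F.log_softmax(logits_clean, dim=-1)
    log_q = F.log_softmax(logits_patched, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
```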

Overall, the findings suggest that sparse autoencoders can be a powerful tool for mechanistic interpretability in language models. By identifying clear, interpretable features, researchers can better understand and control model behaviour.

Anthropic’s Monosemanticity Research

Anthropic, in “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” [8], also utilize sparse autoencoders to improve interpretability in language models. They go deeper into the specifics of dictionary learning to achieve monosemantic features, providing detailed analyses and new insights like feature splitting and finite state automata.

They start by showcasing a UMAP visualisation that co-embeds features into an interactive exploration of the feature space:

Anthropic UMAP

In general, they found:

  1. Monosemantic Features: Sparse autoencoders can extract features that respond to specific contexts, making them easier to understand. This is proven using four different lines of evidence.
  2. Invisible in Neurons: These features are not visible when just looking at individual neurons. For example, they find features (e.g., one firing on Hebrew script) which are not active in any of the top dataset examples for any of the neurons.
  3. Controlling Output: Activating specific features can make the model generate related text (e.g., base64 or Arabic script). (See discussion of pinned feature sampling in Global Analysis.)

Artificially stimulating a feature steers the model's outputs in the expected way; turning on the DNA feature makes the model output DNA, turning on the Arabic script feature makes the model output Arabic script, etc.

controlling-output

They also systematically show that the features found are more interpretable than the neurons, using both a blinded human evaluator and a large language model (autointerpretability as explained above).

interpretability

  4. Universality: These features are consistent across different models. (See Universality.)
  5. Feature Splitting: More detailed features emerge as the autoencoder size increases. They find features which naturally fit together into families. For example, one base64 feature in a small dictionary splits into three, with more subtle and yet still interpretable roles, in a larger dictionary. The different size autoencoders offer different "resolutions" for understanding the same object. (See Feature Splitting.)
  6. High Capacity: Even with a small number of neurons in the MLP layer, they continue to find new features as they scale the sparse autoencoder.
  7. Finite-State Automata: Features can interact to perform complex tasks, like generating HTML. This is what they mean when they refer to features connecting in "finite-state automata"-like systems that implement complex behaviours. (See "Finite State Automata".)

In the follow-up paper, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" [9], they scale the approach up to Claude 3 Sonnet, a production-scale model, and find that the extracted features are still interpretable and monosemantic and can be used to steer model behavior.

These features cover specific people and places, programming-related abstractions, scientific topics, emotions, among a vast range of other concepts.

places-sonnet

These features are remarkably abstract, often representing the same concept across contexts and languages, even generalizing to image inputs. Importantly, they also causally influence the model’s outputs in intuitive ways.

This "Golden Gate Bridge" feature fires for descriptions and images of the bridge. When we force the feature to fire more strongly, Claude mentions the bridge in almost all its answers. Indeed, they can fool Claude into believing it is the bridge!

bridge

Among these millions of features, they found several that are relevant to questions of model safety and reliability. These include features related to code vulnerabilities, deception, bias, sycophancy, power-seeking, and criminal activity.

risk

One notable example is a "secrecy" feature. They observe that it fires for descriptions of people or characters keeping a secret. Activating this feature results in Claude withholding information from the user when it otherwise would not.

secrecy

Discussion

Understanding this level of features, particularly through techniques like sparse autoencoders, opens up a path for customizing model outputs for specific tasks without extensive retraining. By identifying specific features, we can then work to steer models more effectively during deployment. For instance, if we can isolate features associated with bias, can we design interventions to mitigate their impact, making AI systems fairer and more reliable?

Traditional model retraining is resource-intensive and time-consuming. However, by using interpretable features, we could potentially make targeted adjustments to a model's behavior. Identifying and attenuating the features responsible for biases could lead to more aligned outputs. This approach can also enhance the model’s ability to handle specific tasks by amplifying relevant features, like the example Anthropic provided with the Golden Gate Bridge feature, or generating specific types of content, such as HTML.

Potential and Limitations

While the potential of these techniques is there, I'm not entirely convinced about the practical utility of these SAEs specifically. But then again, we should also acknowledge the preliminary nature of this work. The researchers show that there are many features that seem plausibly relevant to safety applications, but more work is needed to establish that this approach is useful in practice.

Some researchers argue that these techniques can pinpoint features of interest and allow for interventions to modify the model’s reliance on them. However, similar goals have been achieved using supervised approaches, which tend to be more practical and directly applicable; we have been finding steering vectors to attenuate concepts for a while already [13].

Exploratory vs. Practical Approaches

The strength of using sparse autoencoders and similar unsupervised methods lies in their exploratory nature. These techniques provide insights into the internal workings of LLMs, potentially offering a comprehensive map of what the model knows. This unsupervised exploration can uncover hidden patterns and relationships within the data, which might not be visible through supervised methods.

However, when it comes to practical applications, especially those requiring immediate and reliable results, supervised approaches using labeled data are much better. These methods allow for precise control and adjustment of the outputs, which is important where accuracy and reliability are needed. On the other hand, Anthropic has some safety goals they would like to achieve and the main purpose of their research is to learn how LLMs work, and to that end, these unsupervised methods are invaluable.

Risks and Ethical Considerations

Despite the advantages, there are inherent risks associated with these advanced interpretability techniques. Like any tool, they can be misused, particularly if they fall into the hands of bad actors. The potential for malicious manipulation of AI systems, such as military drones or AI-controlled vehicles, is a significant concern. By altering key weights or features, an attacker could theoretically cause these systems to behave unpredictably or even dangerously. Take a look at these features:

skynet modify-own-code

The unsupervised nature of this approach also means that while it can reveal a lot about what a model knows, it might also uncover, and unintentionally allow bad actors to amplify, undesirable traits or biases inherent in the training data. This underscores the importance of rigorous ethical considerations in the development and deployment of AI interpretability methods.

Conclusion & Outlook

The exploration of sparse autoencoders for feature interpretability is a milestone in understanding and controlling AI systems. While not immediately practical, these techniques offer valuable insights into the inner workings of language models. As the field progresses, balancing exploratory research with practical, supervised methods will be essential to harness the full potential of AI while mitigating the associated risks.

The SAE approach is also what I am using to understand the AI space with Dcypher AI. Another direction is to use Graph Neural Networks (GNNs) to analyse the latent space of SAEs, which could provide a structured way to visualize and analyze relationships between features, offering deeper insights into the model’s internal workings than SAEs alone.


References

  1. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey: “Sparse Autoencoders Find Highly Interpretable Features in Language Models”, 2023
  2. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah: “Toy Models of Superposition”, 2022.
  3. Xin Wang, Hong Chen, Si'ao Tang, Zihao Wu, Wenwu Zhu: “Disentangled Representation Learning”, 2022
  4. Andrey Voynov, Artem Babenko: “Unsupervised Discovery of Interpretable Directions in the GAN Latent Space”, 2020
  5. Sharon Zhou: “Simple explanation of disentanglement ft. cute doggos & state-of-the-art work”, 2021
  6. Xianxu Hou, Ke Sun, Linlin Shen, Guoping Qiu: “Improving Variational Autoencoder with Deep Feature Consistent and Generative Adversarial Training”, 2019
  7. Siyang Yuan, Pengyu Cheng, Ruiyi Zhang, Weituo Hao, Zhe Gan, Lawrence Carin: “Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning”, 2021
  8. Anthropic: Trenton Bricken*, Adly Templeton*, Joshua Batson*, Brian Chen*, Adam Jermyn, Tom Conerly, Nicholas L Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Chris Olah: “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning”, 2023
  9. Anthropic: Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan: “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet”, 2024. (Summary memo: “Mapping the Mind of a Large Language Model”.)
  10. Steven Bills∗, Nick Cammarata∗, Dan Mossing∗, Henk Tillman∗, Leo Gao∗, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu∗, William Saunders∗: “Language models can explain neurons in language models”, 2023
  11. Weiwei Jiang, Jiayun Luo: “Graph Neural Network for Traffic Forecasting: A Survey”, 2021; Expert Systems with Applications, vol. 207, 2022
  12. Thomas N. Kipf, Max Welling: “Semi-Supervised Classification with Graph Convolutional Networks”, 2016
  13. Kai Konen, Sophie Jentzsch, Diaoulé Diallo, Peer Schütt, Oliver Bensch, Roxanne El Baff, Dominik Opitz, Tobias Hecking: “Style Vectors for Steering Generative Large Language Model”, 2024