
Contrastive Steering with Clustered Sparse Autoencoder Features in GPT2-Small


Our group (Jason, Sinem, Kanishk, and Daniel) implemented a clustering algorithm based on SAELens and Alice's clustering algorithm. First, we compute the activation deltas between positive and negative prompts, identify important token positions, and for each prompt find the set of SAE features that activate sparsely when the prompt is fed to the model.
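
For concreteness, here is a minimal sketch of this first step (not our exact code). The hook name, layer, SAE release/id, and the "last token = important position" shortcut are illustrative assumptions rather than what the notebook necessarily does, and the SAELens loading call may differ across library versions.

```python
# Minimal sketch: residual-stream activations for a contrastive prompt pair, encoded
# through an SAE, keeping the features whose activations differ most between the pair.
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2")
HOOK = "blocks.8.hook_resid_pre"  # assumed layer/hook, adjust as needed
sae, _, _ = SAE.from_pretrained(release="gpt2-small-res-jb", sae_id=HOOK)

def last_token_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the final token position (a stand-in for the
    'important token positions' heuristic described above)."""
    _, cache = model.run_with_cache(prompt)
    return cache[HOOK][0, -1, :].detach()  # [d_model]

# One contrastive (positive, negative) prompt pair -- placeholders, not our dataset.
pos_act = last_token_resid("I love helping people and always cooperate.")
neg_act = last_token_resid("I don't care about anyone but myself.")

delta = pos_act - neg_act            # activation delta for this pair
pos_feats = sae.encode(pos_act)      # sparse SAE feature activations
neg_feats = sae.encode(neg_act)
feat_delta = pos_feats - neg_feats

# Keep the k features whose activations move the most between the two prompts.
k = 20
top_features = torch.topk(feat_delta.abs(), k).indices
print(top_features.tolist())
```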

Then, using a simple linear regression, we find the linear combination of decoded feature activations (i.e. decode(encode(activation))) that minimizes the least-squares error against the activation delta for each prompt. We then try Contrastive-Activation-Addition-style steering, using this set of features and their weights to steer the model.
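
The regression step can be sketched as below for a single pair, reusing `delta`, `sae`, and `top_features` from the snippet above; this is a simplified version, not the notebook's exact fitting procedure.

```python
# Solve min_w || W_dec[top_features]^T w - delta ||^2 so that a weighted sum of decoded
# feature directions approximates the activation delta.
import torch

dirs = sae.W_dec[top_features].detach()  # [k, d_model] decoder directions

# Ordinary least squares over the selected directions.
w = torch.linalg.lstsq(dirs.T, delta.unsqueeze(-1)).solution.squeeze(-1)  # [k]

# The candidate steering vector is the weighted sum of feature directions.
steering_vector = (w.unsqueeze(-1) * dirs).sum(dim=0)  # [d_model]
```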


Results: 

Failure:

We have not found strong evidence of steerability of the persona using the linear-regression part of the pipeline.


Most likely sources of error include:

  • the features are not localized properly
  • using just one layer is insufficient for high-level steering (but this is unlikely, since CAA works reliably with a one-layer intervention)
  • the activation delta is not an appropriate proxy for the behavior we want to steer, and we should find a different target for the linear regression. We could also try logistic regression.
  • GPT-2 might not have interesting enough high-level directions, so we should use billion-parameter models.
  • we only looked at agreeableness (10 prompt pairs), and the dataset is underspecified for the persona we are trying to steer toward.


To mitigate these potential issues, we plan to:

  • generate a larger, better set of contrastive pairs based on model-written evals
    • there are clearly better datasets with cleaner feature separation
  • try other methods of fitting the decoded feature activations to the activation delta (see the sketch after this list)
  • run on Gemma 2B
  • run on Gemma 2B
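
As one sketch of an alternative weight-finding method, we could fit a logistic regression that separates positive from negative prompts using their SAE feature activations and read candidate per-feature weights off the coefficients. This reuses `sae`, `last_token_resid`, and `top_features` from the earlier snippets; the prompt lists are placeholders (they could be loaded from agreeableness.json or a model-written-evals dataset).

```python
# Logistic regression over SAE feature activations as an alternative regression target.
import numpy as np
from sklearn.linear_model import LogisticRegression

pos_prompts = ["I love helping people and always cooperate."]   # placeholder data
neg_prompts = ["I don't care about anyone but myself."]

def feature_row(prompt: str) -> np.ndarray:
    feats = sae.encode(last_token_resid(prompt))
    return feats[top_features].detach().cpu().numpy()

X = np.stack([feature_row(p) for p in pos_prompts + neg_prompts])
y = np.array([1] * len(pos_prompts) + [0] * len(neg_prompts))

clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
weights = clf.coef_[0]  # candidate per-feature steering weights
```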

Success:

However, clustering the relevant features works: when we compile the complete list of collected features and look at their Neuronpedia labels and top activations in the UI, they subjectively check out (i.e. it makes sense why they were selected). In addition, we empirically observed that steering works better when we simply multiply the clustered features by a large scalar (roughly 50 to 300), which produces a noticeable change in the completion. If we multiply these features by a negative scalar, the completions display a marked "opposite" effect; the "negative" steering can get quite dark at times.
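
A rough sketch of that scalar steering, reusing `model`, `sae`, `top_features`, and `HOOK` from the earlier snippets (the prompt and scale are placeholders; a negative scale flips the steering direction):

```python
# Add a scaled sum of the clustered features' decoder directions to the residual stream
# at every position during generation, CAA-style.
import torch

scale = 150.0  # we saw noticeable effects roughly in the 50 to 300 range
direction = sae.W_dec[top_features].sum(dim=0).detach().to(model.cfg.device)

def steer_hook(resid, hook):
    # resid: [batch, seq, d_model]; add the scaled feature direction at every position.
    return resid + scale * direction

with model.hooks(fwd_hooks=[(HOOK, steer_hook)]):
    out = model.generate("I think that other people are", max_new_tokens=30)
print(out)
```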


Overall, we think that the clustering part works and is able to find some interesting features given a set of prompts. Much work remains to be done to refine the weight-finding process and find a set of weights that can (if such a set exists) steer model completions toward a persona direction. We are willing to recruit 1-3 extra people to join the team. We are looking for people who:

  • have experience with automated prompt generation 
  • have experience with evaluating language models on completion tasks, not just A/B testing

And good-to-haves (but definitely not required) are people who:

  • are well-versed in the mech interp literature and techniques
  • understand the "why" of interp (helpful for getting unstuck when something doesn't work)
  • have read about / know about sparse autoencoders in LLMs
  • are excited about model steering for practical applications of interpretability

Please reach out to Sinem, Kanishk, Jason (me), or Daniel if you think you qualify and want to join! 



Feedback on the methodology would be much appreciated!


Colab notebook (read-only) containing the work and explanations:

https://colab.research.google.com/drive/1qUdsegUIZILv5DDLVStH6GTIz-ZhFmKH?usp=sh...

Download: agreeableness.json (2.4 kB)
