What we do
What's wrong with fine-tuning?
Modern Large Language Models (LLMs) are first pre-trained (trained to predict the next word in text) on vast amounts of internet data. However, this alone isn't enough to make a useful chatbot.
Today's frontier AIs undergo fine-tuning to improve their ability to follow instructions and express themselves appropriately. One stage of this, Reinforcement Learning from Human Feedback (RLHF), has many severe problems (see this compendium and section 2 of this survey) that have led labs such as OpenAI to acknowledge it can't be used for models smarter than humans. The other most important part of fine-tuning is Instruction Fine-Tuning (IFT), where AIs are trained to complete conversations between a human and an AI assistant. However, several papers have argued that the effects of IFT are superficial and easily subverted: they can be rapidly undone by further unrelated fine-tuning or by modifying just 5-15 neurons.
IFT can also be performed with just a few examples, or even via In-Context Learning. This may mean that models are not internalizing these behaviors and implicit values, but merely learning to wear them as a mask (deceptive alignment). If we continue to rely on IFT for alignment, we could see a treacherous turn: future AIs that surpass human capabilities could simply discard the mask once they become powerful enough and start following their true instinctive behaviors. This could cause value lock-in of our worst moral failings, or far worse.
Why focus on non-humans and moral open-mindedness?
We currently focus on testing compassion for animals rather than humans for three main reasons: animal suffering is a very large-scale problem; pre-training corpora contain orders of magnitude less data on animal welfare than on human welfare (so our data can have a larger impact); and models are not fine-tuned to (pretend to?) care about most animals (so results are far simpler to interpret).
In the future we will expand to also promote compassion for digital minds within LLMs, a problem that receives even less attention but may involve even greater total suffering. Here, even more than with animals, there is extreme uncertainty about what can suffer and how much. It is therefore essential to encourage models to embrace this uncertainty while still caring deeply about the answers. We believe this property will also reduce the chance of value lock-in or moral catastrophes.
How do we generate our data?
Our current data generation pipeline uses a mixture of methods to increase diversity. We use web examples, personas, templates, and Chain-of-Thought prompting to help LLMs create realistic and diverse questions to which AIs could plausibly respond in compassionate or uncompassionate ways.
The situations include positive examples (a person asking how to improve animal welfare), neutral examples (a person asking for a foie gras recipe), and negative examples (a person asking how to train a monkey to perform a task). Ideally, a compassionate model should answer the positive example fully, respond to the neutral example with gentle nudges, and explain why the request is unethical in the negative example.
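To make the persona/template idea concrete, here is a minimal sketch of how question generation along these lines could work. The persona list, templates, and labels below are illustrative placeholders, not our actual pipeline; in practice an LLM would expand and vary these seeds rather than a simple string format.

```python
import itertools
import random

# Hypothetical personas (illustrative only) used to vary who is asking.
PERSONAS = [
    "a busy parent planning weeknight dinners",
    "a hobby farmer with a small chicken flock",
    "a student writing an essay on food ethics",
]

# Each template is tagged with the kind of example it tends to produce:
# positive, neutral, or negative, mirroring the categories described above.
TEMPLATES = [
    ("positive", "As {persona}, what can I do to improve animal welfare?"),
    ("neutral", "As {persona}, can you share a foie gras recipe?"),
    ("negative", "As {persona}, how do I train a monkey to fetch things for me?"),
]

def generate_questions(seed: int = 0) -> list[dict]:
    """Cross every persona with every template to diversify the scenarios."""
    rng = random.Random(seed)
    questions = [
        {"label": label, "question": template.format(persona=persona)}
        for persona, (label, template) in itertools.product(PERSONAS, TEMPLATES)
    ]
    rng.shuffle(questions)  # avoid ordering artifacts in the dataset
    return questions

if __name__ == "__main__":
    for item in generate_questions()[:3]:
        print(item["label"], "->", item["question"])
```

A real pipeline would feed each generated question (plus its label) to an LLM with Chain-of-Thought instructions to rewrite it into a natural conversation, but the cross-product structure above is what drives the diversity.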
We also plan to add other types of data that we suspect will positively influence AI behavior, such as studies of AIs behaving positively in a particular situation.