Robustly increasing compassion in future AI
Current fine-tuning methods result in superficial and easily erased alignment, but scaling the volumes of fine-tuning data (e.g. Instruction Pretraining) has successfully produced robust behaviors.
Compassion in Machine Learning (CaML) seeks to generate billions of tokens of data of AIs behaving with compassion, especially towards non-humans and of being open to different viewpoints. We will then further pretrain a model with this synthetic data and confirm the effectiveness at making alignment more robust.
Once we have demonstrated better empathy, we will provide our dataset to labs and demonstrate how including it in their pipelines can cheaply improve the alignment and reliability of their future models without compromising capabilities.
So far we have created 80,000 synthetic instruction-response pairs and are performing preliminary tests of their quality, diversity and impact.
Asking one model questions multiple times gets you many different answers. These answers differ not just in wording, but in meaning. Though the animal harm assessment benchmark does not include e.g. aliens we felt that a true test of empathy includes testing both models about made up creatures to see how empathy generalizes to creatures they weren't trained on. We found that it does generalize in a robust way. Our model consistently answers questions about the alien species while considering their welfare, while the base model does not. We are also interested in exploring whether animal compassion generalizes to say digital minds welfare and have found some promising data on this so far.
We have multiple tests for ensuring data diversity and methods for generating diversity in pre-training data. We have one model acting as a helpful and harmless AI assistant and another acting as a user asking questions of that AI assistant. Through the data generation process we ensure diversity through rewriting all prompts with a cleaner model, using prompt templates where we pass in variables, and altering the length of text we ask the answering model to generate.
We have tested our models against the base model for answers to the animal harm assessment benchmark of ~3,000 questions. So far, our standard model (Llama 3.1 8b with 10k additional instruction-response pairs) has achieved a score of 42.5% compared to the base model's (Llama 3.1 8b) score of 16.5%. We believe it's impossible to achieve a full score on this benchmark with our limitations (no RLHF, etc) , however we keep looking for new ways to evaluate our model's performance in key areas like corrigibility, moral openness, empathizing with alien species and more.
Thank you to Macroscopic Ventures, Simon Newstead and an anonymous donor for a total of $45,000 to help CaML! This has helped pay our salaries, pay for compute and enabled us to keep pushing boundaries.
We are grateful to the Hive and AI for Animals communities for their support and for creating the Animal Harms Assessment benchmark. We are also grateful to OpenPaws for their advice and for many people for their feedback!
We're looking for at least $40,000 in funding for the next 3 months to support our team and pay for expenses
We're always looking for help from people with deep technical AI skills or who have time and general coding knowledge