AI Alignment Breakthroughs this Week (10/15/23)
Before getting to this week’s Alignment Breakthroughs, I would like to address one “controversy”. If you aren’t interested in controversy, please skip down to the breakthroughs section.
Every week, I cross-post this blog on LessWrong.org, which is pretty much the most hostile audience one could find for the message “actually, we’re making routine meaningful progress on AI Alignment”. Thus far, the response has been mostly positive. The only criticism I’ve received has been that people dislike the use of the word “breakthrough”.
I generally assume that policing other people’s language is a form of wokism. When someone has no valid criticism of your argument, they instead try to control what words you can use. Engaging with these arguments is pointless. Instead, the correct response is to define how you are using a word and move on.
So to be clear, when I say “alignment breakthrough”, I mean something that:
Uses a new technique
Achieves a better result than was previously possible
Is plausibly applicable to at least one AI alignment strategy
First of all, I would like to point out that people claiming “actually this is incremental progress” aren’t making the point they think they are. If even major breakthroughs like solving superposition are incremental progress, that suggests AI Alignment is easier than I expect.
However, I do think some of the criticism is valid. Namely, I have been lumping major breakthroughs in with incremental advances and calling them all “breakthroughs”. Therefore, I am introducing a new rating system: each alignment breakthrough will be rated from 1 star ⭐ to 5 stars ⭐⭐⭐⭐⭐.
A 1 star ⭐ “breakthrough” represents incremental progress. This means that, while technically achieving a new milestone, the breakthrough was the result of known techniques and could have been easily predicted in advance by an expert in the field. An example of something I’ve posted in the past that should be considered 1 star ⭐ is Würstchen V3. An example of a hypothetical 1 star ⭐ breakthrough would be if RLHF on GPT-5 were found to work better than RLHF on GPT-4.
A 5 star ⭐⭐⭐⭐⭐ “breakthrough” represents a discovery that solves a significant unsolved problem that was considered to be a major obstacle on at least one AI Alignment path. An example of a 5 star ⭐⭐⭐⭐⭐ breakthrough that I’ve posted in the past would be neuron superposition. An example of a hypothetical 5 star ⭐⭐⭐⭐⭐ breakthrough would be if someone were to develop a system that could translate a human-language description of a math problem into a formal mathematical proof.
My hope is that by introducing the star rating system, I will elevate the quality of my critics from language policing (pointless, dumb) to meaningful discussion of specific technical topics. An example of a “good” criticism I hope to receive is:
“No, the AI Lie Detector, which you rated 5 stars, should be 1 star. Anyone could have seen it coming.”
or
“No, Decoding Speech from Non-invasive Brain Recordings, which you rated 2 stars, should be 5 stars. No one could have seen it coming.”
As a reminder, if you are arguing something should have fewer stars, you are claiming it is easier than I expected. If you are claiming something should have more stars, you are claiming it is harder than I expected.
Now, without any further ado…
AI Alignment Breakthroughs this Week
This week there were breakthroughs in the areas of:
AI Evaluation
AI Agents
Mechanistic Interpretability
Explainable AI
Simulation
Making AI Do what we want
AI Art
AI Evaluation
What is it: a new benchmark for multi-modal decision making
What is new: it evaluates multimodal models (like GPT-4V) by their ability to make decisions in different domains
What is it good for: Benchmarking is key for many AI safety strategies, such as a Pause and RSPs (Responsible Scaling Policies)
Rating: ⭐⭐
AI Agents
Adapting LLM Agents Through Communication
What is it: Improved AI agents
What is new: Fine-tuning the underlying LLM lets the agents perform better
What is it good for: Factored Cognition, Bureaucracy of AIs
Rating: ⭐
Mechanistic Interpretability
Research on infinite-width neural networks
What is it: research characterizing the behavior of neural networks in the infinite-width (very large parameter count) limit
What’s new: a specific map showing when NNs will under- or over-train
What is it good for: determining the stability of neural networks as they scale up
Rating: ⭐⭐⭐
Reverse-engineering LLM components
What is it: research to understand LLM components
What’s new: discovery of “copy suppression” heads that prevent the LLM from naively repeating its input
What is it good for: understanding how LLMs work gives us better tools to trust/control them
Rating: ⭐⭐⭐
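To make “copy suppression” concrete, here is a purely illustrative toy (invented numbers and a made-up function, not how the paper actually finds these heads): the head’s effective behavior is to push down the logits of tokens that already appeared in the prompt.

```python
import numpy as np

def apply_copy_suppression(logits: np.ndarray, prompt_ids: list[int], strength: float = 2.0) -> np.ndarray:
    """Toy model of what a copy-suppression head effectively does:
    push down the logits of tokens that already appeared in the prompt."""
    out = logits.copy()
    out[list(set(prompt_ids))] -= strength
    return out

logits = np.array([1.0, 3.0, 0.5, 2.0])                # next-token logits over a 4-token vocab
print(apply_copy_suppression(logits, prompt_ids=[1]))  # token 1 gets suppressed: [1.  1.  0.5 2. ]
```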
RLHF impacts on output diversity
What is it: Research showing how RLHF training on a model reduces output diversity
What’s new: it seems to confirm anecdotal reports that RLHF reduces output diversity
What is it good for: Understanding how alignment affects model outputs
Rating: ⭐
What it is: a simple method to improve long generations with LLMs
What’s new: they discover that the first 4 tokens act as “attention sinks”, and keeping them in the context improves LLM outputs during long generations
What is it good for: This gives me strong vibes of this research, which allowed us to get much better attention maps from ViTs
Rating: ⭐⭐⭐
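The trick implies a simple KV-cache eviction policy: always keep the first few “sink” tokens plus a sliding window of recent tokens, and evict everything in between. A minimal sketch (the function name and defaults are mine, not the paper’s):

```python
def positions_to_keep(seq_len: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """Token positions retained in the KV cache: the sinks plus a recent window."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                  # everything still fits
    sinks = list(range(num_sinks))                   # the first tokens act as attention sinks
    recent = list(range(seq_len - window, seq_len))  # sliding window of recent tokens
    return sinks + recent

# With a 4096-token history, the cache holds the 4 sinks plus the last 1020 tokens.
assert len(positions_to_keep(4096)) == 1024
```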
Explainable AI
What is it: A new multi-modal-LLM that “grounds” its explanations in the images it is shown.
What is new: By adding the ability to point to specific parts of the image, the human can ask more detailed questions and the LLM can better explain its answers.
What is it good for: Having AI explain why it did something is a way to avoid the infamous (and possibly apocryphal) “sunny tanks” problem.
Rating: ⭐⭐
What it is: Teach LLMs rules to reduce hallucinations
What’s new: a two-stage approach where the LLM first proposes rules and then applies them
What it is good for: In addition to being reliable, rules learned by LLMs should hopefully be more explainable as well.
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
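As I read it, the two stages look roughly like this (a hedged sketch: `llm` is a stand-in for any completion API, and the prompts are mine, not the paper’s):

```python
def llm(prompt: str) -> str:
    """Stand-in for any chat/completion API."""
    raise NotImplementedError

def answer_with_rules(question: str) -> str:
    # Stage 1: the LLM proposes explicit rules relevant to the question.
    rules = llm(f"List the general rules or facts needed to answer:\n{question}")
    # Stage 2: the LLM answers while applying only those explicit rules,
    # which makes the reasoning easier to check.
    return llm(f"Rules:\n{rules}\n\nUsing only these rules, answer:\n{question}")
```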
Simulation
What is it: Use Grand Theft Auto as an environment to test MLLM agents.
What is the breakthrough: This extends the idea of using GPT-4 in Minecraft; the use of a vision-LLM is new, and GTA-V should offer a richer set of actions.
What is it good for: Training AIs in simulated environments is a form of sandboxing. Although GTA-V wouldn’t be my first choice if you were trying to raise a friendly AI.
Rating: ⭐
Making AI Do what we want
What it is: a method for verifying the truthfulness of LLM outputs
What’s new: They break each response into individual facts, which are verified separately
What is it good for: the ability to verify factual outputs is useful for most alignment plans
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
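The shape of the approach, as a hedged sketch (both `llm` and `retrieve_evidence` are hypothetical stand-ins, not the paper’s interface): split the response into atomic claims, then check each one independently.

```python
def llm(prompt: str) -> str:
    """Stand-in for any chat/completion API."""
    raise NotImplementedError

def retrieve_evidence(claim: str) -> str:
    """Stand-in for a retrieval or search step."""
    raise NotImplementedError

def verify_response(response: str) -> list[tuple[str, bool]]:
    # Decompose the response into independent, individually checkable claims.
    claims = llm(f"Split into independent factual claims, one per line:\n{response}").splitlines()
    results = []
    for claim in claims:
        evidence = retrieve_evidence(claim)
        verdict = llm(f"Evidence:\n{evidence}\n\nDoes the evidence support: {claim}? Answer yes or no.")
        results.append((claim, verdict.strip().lower().startswith("yes")))
    return results
```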
What it is: improve learning with limited data
What’s new: by looking at data across RL episodes, they can improve policy training
What is it good for: making the best use of limited data reduces the chance of AI doing something wrong.
Rating: ⭐⭐
What is it: a modification to RLHF to prevent overfitting
What’s new: They weight the Reward Model so that it is only relied on in the region where it is effective
What is it good for: Prevent Goodharting
Rating: ⭐⭐⭐⭐
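One plausible way to implement this, sketched under my own assumptions (the paper’s actual weighting scheme may differ): discount the reward model’s score as the policy drifts away from the distribution the reward model was trained on, so the optimizer can’t exploit the RM where it is unreliable.

```python
import math

def shaped_reward(rm_score: float, kl_from_ref: float, kl_scale: float = 5.0) -> float:
    """Trust the reward model less as the policy drifts off-distribution."""
    trust = math.exp(-kl_from_ref / kl_scale)  # ~1 near the RM's training region, -> 0 far away
    return trust * rm_score

print(shaped_reward(2.0, kl_from_ref=0.5))   # on-distribution: reward mostly kept (~1.81)
print(shaped_reward(2.0, kl_from_ref=20.0))  # far off-distribution: heavily discounted (~0.04)
```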
AI Art
What is it: A method for converting a normal video into a “4d” movie you can view from any angle
What is the breakthrough: By using Gaussian Splatting, they achieve much better quality and speed than previous methods for doing this
What is it good for: Cool matrix-style shots. Video games probably.
Rating: ⭐⭐ (big leap in quality, but mostly an application of a known method to a new problem)
What is it: Pretty movie generator
What is new: Deform is one of the OG AI video methods, this just applies it to a new model
What is it good for: Making pretty movies
Rating: ⭐
What is it: a way to avoid reproducing training images in diffusion models
What’s new: they mask the training images to prevent reproduction
What is it good for: reducing copyright concerns when using diffusion models.
Rating: ⭐⭐
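The core idea, as I understand it: if the model never sees any training image in full, it cannot memorize (and later regurgitate) that image in full. A minimal sketch with hypothetical defaults (patch size and keep fraction are mine, not the paper’s):

```python
import numpy as np

def random_patch_mask(image: np.ndarray, patch: int = 32, keep_frac: float = 0.75,
                      rng: np.random.Generator | None = None) -> np.ndarray:
    """Zero out a random subset of patch x patch blocks of an HxWxC image."""
    rng = rng or np.random.default_rng()
    masked = image.copy()
    h, w = image.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if rng.random() > keep_frac:
                masked[y:y + patch, x:x + patch] = 0
    return masked

img = np.ones((256, 256, 3), dtype=np.float32)
print(random_patch_mask(img).mean())  # roughly keep_frac of the pixels survive
```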
What is it: separate motion and subject in text-to-video models
What’s new: they train a dual-path LoRA on an individual video to extract its motion
What is it good for: Transfer the motion from one video to another
Rating: ⭐⭐⭐⭐
This is Not AI Alignment
GPT-4v 🔃 Dall-E 3 (https://twitter.com/conradgodfrey/status/1712564282167300226)
What is it: A fun graphic showing what happens when we repeatedly iterate complex systems.
What does it mean: There was a fun back-and-forth where it was speculated “this is how we die”, which was quickly refuted. I think this perfectly demonstrates the need for empiricism in AI Alignment.
What is it: “secret” messages can be passed via image to deceive the user.
What does this mean: Everyone expected to find image jailbreaks. And we did. Good job everyone.