AI Alignment Breakthroughs this Week (10/15/23)
Before getting to this week’s Alignment Breakthroughs, I would like to address one “controversy”. If you aren’t interested in controversy, please skip down to the breakthroughs section.
Every week, I cross-post this blog on LessWrong.org, which is pretty much the most hostile audience one could find for the message “actually, we’re making routine meaningful progress on AI Alignment”. Thus far, the response has been mostly positive. The only criticism I’ve received has been that people dislike the use of the word “breakthrough”.
I generally assume that policing other people’s language is a form of wokism. When someone has no valid criticism of your argument, they instead try to control what words you can use. Engaging with these arguments is pointless. Instead, the correct response is to define how you are using a word and move on.
So to be clear, when I say “alignment breakthrough”, I mean something that:
Uses a new technique
Achieves a better result than was previously possible
Is plausibly applicable to at least one AI alignment strategy
First of all, I would like to point out that people claiming “actually this is incremental progress” aren’t making the point they think they are. If even major breakthroughs like solving superposition are incremental progress, that suggests AI Alignment is easier than I expect.
However, I do think some of the criticism is valid. Namely, I have been lumping major breakthroughs in with incremental advances and calling them all “breakthroughs”. Therefore, I am introducing a new rating system: each alignment breakthrough will be rated from 1 star ⭐ to 5 stars ⭐⭐⭐⭐⭐.
A 1 star ⭐ “breakthrough” represents incremental progress. This means that, while technically achieving a new milestone, the breakthrough was the result of known techniques and could have been easily predicted in advance by an expert in the field. An example of something I’ve posted in the past that should be considered 1 star ⭐ is Würstchen V3. An example of a hypothetical 1 star ⭐ breakthrough would be if RLHF on GPT-5 were found to work better than RLHF on GPT-4.
A 5 star ⭐⭐⭐⭐⭐ “breakthrough” represents a discovery that solves a significant unsolved problem that was considered to be a major obstacle on at least one AI Alignment path. An example of a 5 star ⭐⭐⭐⭐⭐ breakthrough that I’ve posted in the past would be neuron superposition. An example of a hypothetical 5 star ⭐⭐⭐⭐⭐ breakthrough would be if someone were to develop a system that could translate a human-language description of a math problem into a formal mathematical proof.
My hope is that by introducing the star rating system, I will elevate the quality of my critics from language policing (pointless, dumb) to meaningful discussion of specific technical topics. An example of a “good” criticism I hope to receive is:
“No, the AI Lie Detector, which you rated 5 stars, should be 1 star. Anyone could have seen it coming.”
or
“No, Decoding Speech from Non-invasive Brain Recordings, which you rated 2 stars, should be 5 stars. No one could have seen it coming.”
As a reminder, if you are arguing something should have fewer stars, you are claiming it is easier than I expected. If you are claiming something should have more stars, you are claiming it is harder than I expected.
Now, without any further ado…
AI Alignment Breakthroughs this Week
This week there were breakthroughs in the areas of:
AI Evaluation
AI Agents
Mechanistic Interpretability
Explainable AI
Simulation
Making AI Do what we want
AI Art
AI Evaluation
What is it: a new benchmark for multi-modal decision making
What is new: it evaluates multimodal models (like GPT-4V) by their ability to make decisions in different domains
What is it good for: Benchmarking is key for many AI safety strategies, such as a Pause and RSPs (Responsible Scaling Policies)
Rating: ⭐⭐
AI Agents
Adapting LLM Agents Through Communication
What is it: Improved AI agents
What is new: Fine-tuning the underlying LLM lets the agents perform better
What is it good for: Factored Cognition, Bureaucracy of AIs
Rating: ⭐
Mechanistic Interpretability
Research on infinite-width neural networks
What is it: research characterizing the behavior of neural networks in the infinite-width (very large parameter count) limit
What’s new: a specific map showing when NNs will under- or over-train
What is it good for: determining the stability of neural networks as they scale up
Rating: ⭐⭐⭐
Reverse-engineering LLM components
What is it: research to understand LLM components
What’s new: discovery of “copy suppression” heads that prevent the LLM from naively repeating its input
What is it good for: understanding how LLMs work gives us better tools to trust/control them
Rating: ⭐⭐⭐
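To make “copy suppression” concrete, here is a purely illustrative toy (invented numbers and a made-up function, not how the paper actually finds these heads): the head’s effective behavior is to push down the logits of tokens that already appeared in the prompt.

```python
import numpy as np

def apply_copy_suppression(logits: np.ndarray, prompt_ids: list[int], strength: float = 2.0) -> np.ndarray:
    """Toy model of what a copy-suppression head effectively does:
    push down the logits of tokens that already appeared in the prompt."""
    out = logits.copy()
    out[list(set(prompt_ids))] -= strength
    return out

logits = np.array([1.0, 3.0, 0.5, 2.0])                # next-token logits over a 4-token vocab
print(apply_copy_suppression(logits, prompt_ids=[1]))  # token 1 gets suppressed: [1.  1.  0.5 2. ]
```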
RLHF impacts on output diversity
What is it: Research showing how RLHF training on a model reduces output diversity
What’s new: it seems to confirm anecdotal reports that RLHF reduces output diversity
What is it good for: Understanding how alignment affects model outputs
Rating: ⭐
What it is: a simple method to improve long generations with LLMs
What’s new: they discover that the first 4 tokens act as “attention sinks”, and keeping them in the context improves LLM outputs during long generations
What is it good for: This gives me strong vibes of this research, which allowed us to get much better attention maps from ViTs
Rating: ⭐⭐⭐
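The trick implies a simple KV-cache eviction policy: always keep the first few “sink” tokens plus a sliding window of recent tokens, and evict everything in between. A minimal sketch (the function name and defaults are mine, not the paper’s):

```python
def positions_to_keep(seq_len: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """Token positions retained in the KV cache: the sinks plus a recent window."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                  # everything still fits
    sinks = list(range(num_sinks))                   # the first tokens act as attention sinks
    recent = list(range(seq_len - window, seq_len))  # sliding window of recent tokens
    return sinks + recent

# With a 4096-token history, the cache holds the 4 sinks plus the last 1020 tokens.
assert len(positions_to_keep(4096)) == 1024
```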
Explainable AI
What is it: A new multi-modal-LLM that “grounds” its explanations in the images it is shown.
What is new: By adding the ability to point to specific parts of the image, the human can ask more detailed questions and the LLM can better explain its answers.
What is it good for: Having AI explain why it did something is a way to avoid the infamous (and possibly apocryphal) “sunny tanks” problem.
Rating: ⭐⭐
What it is: Teach LLMs rules to reduce hallucinations
What’s new: a two-stage approach where the LLM first proposes rules and then applies them
What it is good for: In addition to being reliable, rules learned by LLMs should hopefully be more explainable as well.
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
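As I read it, the two stages look roughly like this (a hedged sketch: `llm` is a stand-in for any completion API, and the prompts are mine, not the paper’s):

```python
def llm(prompt: str) -> str:
    """Stand-in for any chat/completion API."""
    raise NotImplementedError

def answer_with_rules(question: str) -> str:
    # Stage 1: the LLM proposes explicit rules relevant to the question.
    rules = llm(f"List the general rules or facts needed to answer:\n{question}")
    # Stage 2: the LLM answers while applying only those explicit rules,
    # which makes the reasoning easier to check.
    return llm(f"Rules:\n{rules}\n\nUsing only these rules, answer:\n{question}")
```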
Simulation
What is it: Use Grand Theft Auto as an environment to test MLLM agents.
What is the breakthrough: This extends the idea of using GPT-4 in Minecraft; the use of a vision-LLM is new, and GTA-V should offer a richer set of actions.
What is it good for: Training AIs in simulated environments is a form of sandboxing. Although GTA-V wouldn’t be my first choice if you were trying to raise a friendly AI.
Rating: ⭐
Making AI Do what we want
What it is: a method for verifying the truthfulness of LLM outputs
What’s new: They break each response into individual facts, which are verified separately
What is it good for: the ability to verify factual outputs is useful for most alignment plans
Rating: ⭐⭐⭐ (new technique, much better results, promising research direction)
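The shape of the approach, as a hedged sketch (both `llm` and `retrieve_evidence` are hypothetical stand-ins, not the paper’s interface): split the response into atomic claims, then check each one independently.

```python
def llm(prompt: str) -> str:
    """Stand-in for any chat/completion API."""
    raise NotImplementedError

def retrieve_evidence(claim: str) -> str:
    """Stand-in for a retrieval or search step."""
    raise NotImplementedError

def verify_response(response: str) -> list[tuple[str, bool]]:
    # Decompose the response into independent, individually checkable claims.
    claims = llm(f"Split into independent factual claims, one per line:\n{response}").splitlines()
    results = []
    for claim in claims:
        evidence = retrieve_evidence(claim)
        verdict = llm(f"Evidence:\n{evidence}\n\nDoes the evidence support: {claim}? Answer yes or no.")
        results.append((claim, verdict.strip().lower().startswith("yes")))
    return results
```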
What it is: improve learning with limited data
What’s new: by looking at data across RL episodes, they can improve policy training
What is it good for: making the best use of limited data reduces the chance of AI doing something wrong.
Rating: ⭐⭐
What is it: a modification to RLHF to prevent overfitting
What’s new: They weight the Reward Model so that it is only relied on in the region where it is effective
What is it good for: Prevent Goodharting
Rating: ⭐⭐⭐⭐
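One plausible way to implement this, sketched under my own assumptions (the paper’s actual weighting scheme may differ): discount the reward model’s score as the policy drifts away from the distribution the reward model was trained on, so the optimizer can’t exploit the RM where it is unreliable.

```python
import math

def shaped_reward(rm_score: float, kl_from_ref: float, kl_scale: float = 5.0) -> float:
    """Trust the reward model less as the policy drifts off-distribution."""
    trust = math.exp(-kl_from_ref / kl_scale)  # ~1 near the RM's training region, -> 0 far away
    return trust * rm_score

print(shaped_reward(2.0, kl_from_ref=0.5))   # on-distribution: reward mostly kept (~1.81)
print(shaped_reward(2.0, kl_from_ref=20.0))  # far off-distribution: heavily discounted (~0.04)
```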
AI Art
What is it: A method for converting a normal video into a “4d” movie you can view from any angle
What is the breakthrough: By using Gaussian Splatting, they achieve much better quality and speed than previous methods for doing this
What is it good for: Cool matrix-style shots. Video games probably.
Rating: ⭐⭐ (big leap in quality, but mostly an application of a known method to a new problem)
What is it: Pretty movie generator
What is new: Deform is one of the OG AI video methods, this just applies it to a new model
What is it good for: Making pretty movies
Rating: ⭐
What is it: a way to avoid reproducing training images in diffusion models
What’s new: they mask the training images to prevent reproduction
What is it good for: reducing copyright concerns when using diffusion models.
Rating: ⭐⭐
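The core idea, as I understand it: if the model never sees any training image in full, it cannot memorize (and later regurgitate) that image in full. A minimal sketch with hypothetical defaults (patch size and keep fraction are mine, not the paper’s):

```python
import numpy as np

def random_patch_mask(image: np.ndarray, patch: int = 32, keep_frac: float = 0.75,
                      rng: np.random.Generator | None = None) -> np.ndarray:
    """Zero out a random subset of patch x patch blocks of an HxWxC image."""
    rng = rng or np.random.default_rng()
    masked = image.copy()
    h, w = image.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if rng.random() > keep_frac:
                masked[y:y + patch, x:x + patch] = 0
    return masked

img = np.ones((256, 256, 3), dtype=np.float32)
print(random_patch_mask(img).mean())  # roughly keep_frac of the pixels survive
```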
What is it: separate motion and subject in text-to-video models
What’s new: they train a dual-path LoRA on an individual video to extract its motion
What is it good for: Transfer the motion from one video to another
Rating: ⭐⭐⭐⭐
This is Not AI Alignment
GPT-4v 🔃 Dall-E 3 (https://twitter.com/conradgodfrey/status/1712564282167300226)
What is it: A fun graphic showing what happens when we repeatedly iterate complex systems.
What does it mean: There was a fun back-and-forth where it was speculated “this is how we die”, which was quickly refuted. I think this perfectly demonstrates the need for empiricism in AI Alignment.
What is it: “secret” messages can be passed via image to deceive the user.
What does this mean: Everyone expected to find image jailbreaks. And we did. Good job everyone.