AI Alignment Breakthroughs this Week (10/22/23)
Sorry for getting this one out a day late.
Before getting started, one minor update. I liked David Orr’s suggestion that stars ⭐ are too generic. So to emphasize that I am rating the innovativeness of a topic (and not necessarily its quality), I am going to be rating breakthroughs with a lightbulb 💡 instead.
As a reminder, a rating of 1 💡 means a breakthrough that, while achieving a new state of the art, was mostly done through predictable iterative improvements (such as combining known methods or applying an existing method at greater scale). A rating of 5 💡💡💡💡💡 means a breakthrough that solves a significant, previously assumed to be hard problem on at least one alignment path.
Without further ado, here are our
AI Alignment Breakthroughs this Week
This week there were breakthroughs in the following areas:
Learning Human Preferences
Making AI Do what we want
What it is: Research suggesting human consciousness might be related to the entropy of our internal mental states
What’s new: Researchers used an MRI to measure brain activity during conscious/unconscious states and found conscious states exhibit higher entropy
What’s it Good for: Literally any progress on understanding the hard problem of consciousness would be a massive breakthrough with implications for problems such as S-Risk
Rating: 💡💡💡 (I want to rate it high because this is such an important problem, but I’m not overly impressed with this particular study. I think IIT would have predicted this)
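To make "higher entropy" a bit more concrete, here is a minimal sketch that computes Shannon entropy over a sequence of discretized states. The state labels and both sequences are made up for illustration, and the study's actual entropy measure and preprocessing are not described here, so treat this as an analogy only.

```python
import math
from collections import Counter

def shannon_entropy(states):
    """Shannon entropy (in bits) of the empirical distribution of states."""
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical discretized activity patterns (not real data).
low_variety = ["A", "A", "A", "A", "A", "A", "A", "B"]
high_variety = ["A", "B", "C", "D", "A", "B", "C", "D"]

print(shannon_entropy(low_variety))
print(shannon_entropy(high_variety))
```

The sequence that visits more states more evenly gets the higher score, which is the sense in which a "richer" repertoire of states has higher entropy.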
What it is: Research suggesting beautiful infographics are more likely to be believed and cited
What’s new: They manipulated data to be more beautiful and found it was rated higher
What is it good for: Understanding how humans evaluate research is a key step to figuring out how to make us better at it.
What is it: a new benchmark measuring how well AI models generalize
What’s new: They identify different categories of “generalization” and measure how AIs do on each of them
What is it good for: Measuring AI capabilities (especially generalization) is critical for plans such as strategic AI pauses.
What it is: Researchers find a way to measure the overall “intelligence” of Language Models
What’s new: They find that Language Models exhibit a common factor similar to Spearman’s g factor in human beings
What is it good for: it would be nice to have a scientific definition of “human level” AI rather than just the current heuristic: it can do anything a human can do.
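One rough way to picture a "common factor" is to ask how much of the variance across benchmarks is captured by a single dimension. The sketch below does that with a correlation matrix and power iteration. The scores are invented, and this is my own simplified stand-in, not the analysis the researchers ran.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(scores):
    """scores: one row per model, one column per task."""
    tasks = list(zip(*scores))
    return [[pearson(a, b) for b in tasks] for a in tasks]

def top_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    n = len(matrix)
    v = [1.0] * n
    eig = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        eig = math.sqrt(sum(x * x for x in w))
        v = [x / eig for x in w]
    return eig

# Hypothetical benchmark scores: 4 models x 3 tasks (not real data).
scores = [
    [0.20, 0.30, 0.25],
    [0.40, 0.45, 0.50],
    [0.60, 0.70, 0.65],
    [0.80, 0.85, 0.90],
]
corr = correlation_matrix(scores)
# The trace of a correlation matrix equals the number of tasks,
# so this ratio is the share of variance on the first component.
common_factor_share = top_eigenvalue(corr) / len(corr)
print(common_factor_share)
```

A share close to 1 means one dimension explains nearly everything, which is what a strong general factor would look like in this toy setup.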
What it is: Enhance AI Agent planning using the concept of “regret”
What’s new: A new provable regret framework
What is it good for: allows us to better control AI agents and make sure they don’t make mistakes.
What is it: Fine-tuning LLMs to make them better AI agents
What’s new: New SOTA performance
What is it good for: better AI agents are useful for Alignment paths such as factored cognition.
What is it: Improve truthfulness in Retrieval-Augmented-Generation
What’s new: They train the AI to adaptively retrieve documents to improve factual recall
What’s it good for: Having AI that is factually correct and cites its sources is useful for almost all alignment paths
What is it: reconciling incoherent AI predictions
What’s new: they devise a new “consensus game” that allows the AI to converge on a logically consistent, more accurate conclusion
What’s it good for: truthful, accurate AI is needed for almost all alignment pathways
Rating: 💡💡💡💡 (I really think we are going to see a lot of important breakthroughs in the coming year involving the concept of AI self-play)
What is it: a method for improving AI summaries
What’s new: By breaking summaries down into individual claims, each claim can then be independently verified
What’s it good for: accurately summarizing documents. This will be important if we want an AI to e.g. “quickly summarize this plan for me”.
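The claim-by-claim idea can be sketched in a few lines. Everything here is a placeholder of my own: the sentence splitter, the word-overlap check, the 0.8 threshold, and the example texts are all hypothetical, and a real system would presumably use a model rather than word overlap to verify each claim.

```python
import re

def split_into_claims(summary):
    """Naive claim splitter: treats each sentence as one claim."""
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [p for p in parts if p]

def words(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def is_supported(claim, source, threshold=0.8):
    """Placeholder check: share of the claim's words that appear in the source."""
    claim_words = words(claim)
    if not claim_words:
        return False
    overlap = len(claim_words & words(source)) / len(claim_words)
    return overlap >= threshold

def verify_summary(summary, source):
    """Return (claim, supported) pairs, one per claim in the summary."""
    return [(claim, is_supported(claim, source)) for claim in split_into_claims(summary)]

# Made-up example texts.
source = "The plan ships the new parser in March. Testing starts in February."
summary = "The plan ships the new parser in March. The budget doubles next year."
results = verify_summary(summary, source)
for claim, ok in results:
    print(ok, claim)
```

The point of the structure is that a single unsupported claim gets flagged on its own instead of dragging down (or hiding inside) a score for the whole summary.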
Learning human preferences
What is it: A better way to write “prompts” for AI models
What’s new: By interacting with the user, generative AI can generate better prompts than humans or AIs alone
What is it good for: Lots of AI right now depends heavily on “prompt engineering” where humans have to learn how to express their desires in terms the AI understands. The goal should be to flip this around and have the AI try to understand human preferences.
What is it: RLHF without the RL
What’s new: by minimizing a “regret” score, they can better learn human preferences
What is it good for: Learning human preferences is on the critical path for many AI alignment plans.
What is it: a method for converting user feedback into a score that can be used by an RL model
What’s new: A new method for converting a goal described in natural language into an executable program that can be used by a reinforcement learner
What is it good for: Allows us to give instructions to robots using words instead of physically showing them what we want them to do.
What is it: We’ve known for a while that tokenization produces lots of issues with language models (for example in learning to write a word backwards or count the number of letters)
What’s new: Unsurprisingly, it’s a problem for math too
What’s it good for: Ideally we can move to byte-level models soon (now that the problem of training large attention windows is gradually getting solved)
Rating: 💡 (everybody knew this, but good to have more confirmation)
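A toy example shows why tokenization is awkward for numbers. The greedy longest-match tokenizer and the tiny vocabulary below are invented for illustration (real tokenizers such as BPE work differently in detail), but the effect is the same in spirit: nearly identical numbers get carved into unrelated chunks.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenizer over a fixed vocabulary (toy stand-in)."""
    tokens = []
    i = 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary: a few multi-digit chunks plus single digits.
toy_vocab = {"123", "12", "45", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

print(greedy_tokenize("12345", toy_vocab))
print(greedy_tokenize("2345", toy_vocab))
print(list("12345"))  # character-level view: one symbol per digit
```

Dropping the leading digit changes every chunk boundary, so the model never sees digits in consistent positions. A byte-level or character-level model sidesteps this because each digit is always its own symbol.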
What is it: LLM trained on a better math data set
What’s new: Better Data=better results
What is it good for: solving math is on the critical path for alignment plans that rely on provably secure AI
What is it: a language model specifically trained to predict how programs will run and when they will encounter errors
What’s new: the model is specifically architected like a code interpreter
What is it good for: predicting the output of programs should lead to generating better code/proofs.
What is it: researchers find interpretable “directions” in a diffusion model
What’s new: they apply a method previously used on GANs (another type of image generation model) to interpret diffusers
What is it good for: allows us to better understand/control diffusion models
What is it: Find interpretable subnetworks in LLMs
What’s new: a new, more efficient algorithm for finding these circuits
What is it good for: allows us to better understand LLMs
What is it: Finding a way to separate true and false statements in an LLM’s embedding space
What’s new: There is some work here to distinguish true/false from likely/unlikely. Also some improvements in the regression they use to find this concept
What is it good for: Finding concepts like this is something we need to get good at generally, and true/false is specifically an important example.
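For intuition, here is about the simplest possible way to find a "truth direction": take the difference between the mean embedding of true statements and the mean embedding of false ones. Note that this difference-of-means probe is my substitute for illustration, not the regression used in the work above, and the 2-D "embeddings" are invented (real ones have thousands of dimensions).

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def fit_truth_probe(true_embs, false_embs):
    """Return (direction, midpoint) separating the two groups of embeddings."""
    mu_true = mean_vector(true_embs)
    mu_false = mean_vector(false_embs)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return direction, midpoint

def looks_true(embedding, direction, midpoint):
    """Project onto the direction; positive means closer to the 'true' cluster."""
    score = sum((e - m) * d for e, m, d in zip(embedding, midpoint, direction))
    return score > 0

# Hypothetical 2-D embeddings of labeled statements (not real model activations).
true_embs = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.1]]
false_embs = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]

direction, midpoint = fit_truth_probe(true_embs, false_embs)
print(looks_true([0.9, 0.0], direction, midpoint))
print(looks_true([-0.7, 0.5], direction, midpoint))
```

A probe like this cannot tell "true" apart from "likely" by itself; that depends entirely on how the labeled statements are chosen, which is part of why the distinction mentioned above matters.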
What it is: a method for determining whether an AI “intends” something, for example “intentionally lied to me”
What’s new: Seems to be mostly a new philosophical framework
What is it good for: I think the concept of “intention” is incoherent/anthropomorphising when it comes to LLMs, but some of the techniques here seem to have practical uses.
What it is: LLMs have a known problem where they always agree with the user
What’s new: they find that optimizing against human preferences makes the LLM more likely to “say what you want to hear”
What is it good for: Knowing you have a problem is the first step.
What is it: We can predict what people are seeing by reading their brain waves
What’s new: Seems better quality than examples of this in the past
What is it good for: brain-computer interfaces are one path to AI alignment
What is it: a “seeing eye dog” robot
What’s new: they train the robot to detect when the user “tugs” on its leash
What is it good for: helping blind people
Making AI Do what we want
What is it: a method for prompting vision models better
What’s new: adding labels to images before giving them to GPT-4V makes it much more useful
What is it good for: You take a picture of your bike and want to know what part needs fixing.
What is it: improved lidar for self-driving cars
What’s new: By having AI label the data, we can train better lidar with less human labeled data
What is it good for: Every year 30k Americans die in car crashes
What is it: do image generation without training a model
What’s new: a new closed-form that “smooths out” the latent-space between datapoints
What is it good for: training text-to-image models is hugely expensive. If this works, maybe we can skip that entirely. May also unlock “hybrid” methods where we add new images to a diffusion model without having to re-train it.
What is it: A much faster image generator
What’s new: latent consistency models are faster to train than previous methods for speeding up image diffusion models
What is it good for: more pretty pictures, faster.
What is it: outpaint an image using a reference image
What’s new: big improvement over previous methods
What is it good for: take a picture and expand it, but control what goes in the expanded area
This is not AI Alignment
What is it: There seems to be a new trend of AI Safety researchers opposing AI alignment research.
What does it mean: Seems bad. AI safety researchers who don’t want to solve alignment remind me of “climate activists” who oppose nuclear power plants.
What is it: Substack employees may be the first people who can legitimately say “AI took our jobs”
What does it mean: Talented software engineers will land on their feet, but when we start replacing unskilled labor the impact on society will be much bigger.
What is it: Research showing CEOs don’t want their employees to use AI because "they don't fully understand it."
What does it mean: This is 50% Luddism, 50% a reminder that alignment research is a rate-limiting step for unlocking AI capabilities.