AI Alignment Breakthroughs this Week (11/05/2023)
The biggest news of this week is probably the two new AI Alignment Initiatives.
Otherwise, this was a relatively quiet week for AI Alignment breakthroughs.
So here are our
AI Alignment Breakthroughs this Week
This week there were breakthroughs in the areas of:
Mechanistic Interpretability
Avoiding Adversarial Attacks
Human Augmentation
Making AI Do what we want
AI Art
Mechanistic Interpretability
Research on the limits of Generalization in LLMs
What is it: Shows that LLMs are unable to generalize outside of their training data
What’s new: A systematic study of the types of generalization that LLMs can do (in-domain learning) and the types they can’t (out-of-domain learning)
What does it mean: Understanding the limits of LLMs should help us better understand where they can be safely deployed
Rating: 💡💡💡
Does appealing to AI “emotions” make them perform better?
What is it: Researchers find you can make LLMs answer questions better by using emotion
What’s new: new prompting strategies such as “this is important for my career”
What is it good for: In addition to the pure performance gains, the real question is why this works. Is there some “try harder” vector in the latent space that we can identify and exploit?
Rating: 💡💡
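To make this concrete, here is a minimal sketch of the emotion-prompting idea. The `query_llm` helper and the exact stimulus phrases are placeholders of my own, not the paper’s code:

```python
# Minimal sketch of emotion-augmented prompting.
# `query_llm` is a hypothetical stand-in for whatever chat API you use;
# the stimulus phrases are illustrative, not the paper's exact prompts.

BASE_TASK = "Summarize the attached report in three bullet points."

EMOTIONAL_STIMULI = [
    "This is very important to my career.",
    "You'd better be sure before you answer.",
]

def query_llm(prompt: str) -> str:
    """Placeholder: call your LLM of choice here."""
    raise NotImplementedError

def compare_emotion_prompting(task: str) -> dict[str, str]:
    """Run the same task plain and with each emotional stimulus appended."""
    results = {"plain": query_llm(task)}
    for stimulus in EMOTIONAL_STIMULI:
        results[stimulus] = query_llm(f"{task} {stimulus}")
    return results
```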
LLMs Generalized Truthfulness Across Agents
What is it: a way to tell true statements from false ones using LLMs
What’s new: they find evidence that LLMs have different “personas” that they use to judge whether something is true or not
What is it good for: Perhaps we can extract these personas and use them for alignment.
Rating: 💡💡💡💡
Avoiding Adversarial Attacks
What is it: Can a model be pretrained so that it cannot later be fine-tuned for harmful purposes?
What’s new: They find meta-parameters such that fine-tuning a model on one task reduces its usefulness for other tasks
What is it good for: Hypothetically you could open-source a model and not have all of the RLHF safety features immediately removed by fine-tuning
Rating: 💡💡 (important topic, but I’m not convinced this technique works. More optimistic about techniques like Concept Erasure)
What is it: an example of an attack that uses prompt injection to exfiltrate data from Google Bard
What’s new: they use Bard’s Google Doc extension to exfiltrate data
What is it good for: red-teaming these kinds of attacks is the first step to fixing them
Rating: 💡
Will releasing the weights of large language models grant widespread access to pandemic agents?
What is it: exactly what it says on the tin
What’s new: they simulate someone trying to build a bioweapon with and without an LLM and find that the LLM helps
What is it good for: There’s been pretty strong pushback against this from e/acc, with people noting that this is also true of Google Search, or a hypothetical drug that increased everyone in the world’s IQ by 1 point. The larger point remains that we should identify dangerous capabilities and act to mitigate them. “Regulate uses not research” is the rallying cry of Midwit Alignment.
Rating: 💡
Human Augmentation
What is it: a 3d sign language dataset
What’s new: first dataset of its kind
What is it good for: Train a model to automatically convert speech into sign language
Rating: 💡💡💡
Making AI Do what we want
What is it: answer moral questions using visual data
What’s new: They use an LM to generate a multimodal benchmark of various moral quandaries
What is it good for: If we want robots or self-driving cars to behave ethically, they will have to rely on visual input to do so
Rating: 💡💡
Agent Instructs Large Language Models to be General Zero-Shot Reasoners
What is it: use an agent to generate instructions that guide an LLM’s zero-shot reasoning
What’s new: they first have the agent access web resources before generating step-by-step instructions
What is it good for: Accurately doing tasks such as classification
Rating: 💡💡💡
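As I understand the setup, it is a two-stage pipeline: an agent drafts task-specific instructions (optionally grounded in web results), and the target LLM then answers zero-shot while following them. A rough sketch; `web_search` and `query_llm` are hypothetical helpers, not the paper’s actual pipeline:

```python
# Rough sketch of the two-stage idea: an agent drafts task-specific instructions
# (optionally grounded in web results), then the target LLM answers zero-shot
# while following those instructions. `web_search` and `query_llm` are
# hypothetical helpers, not the paper's pipeline.

def web_search(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the top-k text snippets for a query."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Placeholder: call the LLM being evaluated."""
    raise NotImplementedError

def agent_instruct_answer(task_description: str, example_input: str) -> str:
    # Stage 1: the agent writes instructions for the task, using web context.
    snippets = web_search(task_description)
    instructions = query_llm(
        f"Task: {task_description}\n"
        f"Background: {' '.join(snippets)}\n"
        "Write concise step-by-step instructions for solving instances of this task."
    )
    # Stage 2: the target model reasons zero-shot, guided by those instructions.
    return query_llm(f"{instructions}\n\nInput: {example_input}\nAnswer:")
```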
AI Art
What is it: a new program for converting a 360 degree photo into a 3d scene
What’s new: they train a 3d-aware diffusion model to allow them to calculate novel views
What is it good for: convert any 360 degree photo (and soon I expect video) into a full 3d scene you can walk around in
Rating: 💡💡💡
What is it: A faster version of the speech-to-text model Whisper
What’s new: they distill the model down to one that’s 49% smaller
What is it good for: this unlocks speech-to-text in places like mobile or web where it would previously have been too slow.
Rating: 💡💡
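If you want to try it, a distilled Whisper checkpoint runs through the standard Hugging Face transformers speech-recognition pipeline. A minimal sketch, assuming the distil-whisper/distil-large-v2 checkpoint (check the release for the exact model names):

```python
# Minimal sketch: speech-to-text with a distilled Whisper checkpoint via the
# Hugging Face transformers pipeline. The model id and chunking settings are
# assumptions; check the release for the exact checkpoint names.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # distilled, roughly half the size of whisper-large
    chunk_length_s=30,                       # split long audio into 30-second chunks
)

result = asr("meeting_recording.wav")  # path to any local audio file
print(result["text"])
```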
What is it: an AI model trained to generate motion for 3d models
What’s new: Not open source, but seems better than past versions of this I’ve seen
What is it good for: animating all of those sweet models that you’re generating with text-to-3d.
Rating: 💡💡💡
AI Alignment Initiatives
The Biden Executive Order on AI
What is it: The most notable provision is that models trained with more than 10**26 flops and data centers with a capacity of more than 10**20 flops/sec must register with the government. This limit appears to have been chosen not for any scientific reason, but because it is slightly larger than the largest current models. There’s also some ominous language about open-source models, but no actual regulations so far.
What does it mean: Doomers will say this doesn’t go far enough. e/acc is already preparing to flee for friendlier waters (this particular project is a joke). The most important fact isn’t the order itself (which is not a law and thus has limited enforcement mechanisms) but that it sets the tone for future regulation. Expect more of the same in the future: technical limits set for political reasons rather than scientific ones, emphasis on whatever the current administration’s pet projects are (Biden likes unions, hates discrimination, and wants to cure cancer), and more words than action (since we still have to compete with China, after all).
Overall Rating: 🧓🏻🧓🏻🧓🏻 (3 Joe Bidens, could have been worse)
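For a sense of scale, the usual back-of-the-envelope rule is that training a dense transformer costs roughly 6 × parameters × tokens flops. A quick sketch of checking a hypothetical training run against the 10**26 threshold (the model sizes below are illustrative, not claims about any real system):

```python
# Back-of-the-envelope check against the 10**26-FLOP reporting threshold, using
# the common "training FLOPs ~= 6 * parameters * tokens" approximation for
# dense transformers. The runs below are hypothetical examples.

THRESHOLD_FLOPS = 1e26

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute for a dense transformer."""
    return 6 * n_params * n_tokens

runs = [
    ("70B params, 2T tokens", 70e9, 2e12),       # ~8.4e23 FLOPs: far below the threshold
    ("1.8T params, 13T tokens", 1.8e12, 13e12),  # ~1.4e26 FLOPs: would have to register
]

for name, params, tokens in runs:
    flops = training_flops(params, tokens)
    print(f"{name}: {flops:.1e} FLOPs, must register: {flops > THRESHOLD_FLOPS}")
```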
What is it: a joint statement by a number of countries agreeing to mitigate AI risk
What does it mean: The most important signature on this list is undoubtedly China. As with global warming, China is probably the single biggest contributor to AI risk. They have the computing heft to build powerful systems, but are more likely to cut safety standards due to the perceived need to “race” to catch up with the US. And an AI broadly trained with the values of the CCP is less likely to be benevolent. I wouldn’t expect miracles, but hopefully this indicates a willingness by the Chinese to at least follow the common consensus on AI safety standards.
Rating: 🇨🇳🇨🇳🇨🇳🇨🇳🇨🇳
This is not AI Alignment
An amusing short story (long tweet?) by EY
What it is: A reminder that EY is a very good fiction writer
What does it mean: It is funny and worth reading.
Overall Rating: 📃📃📃📃(4 letters of self-reflection)
Small LLMs
What it is: There are an increasing number of small LLMs in the “better than GPT 3.5” category, including OpenChat, Grok, and Yi-34B. Importantly, anyone with a modestly good PC can run OpenChat or Yi-34B locally.
What does it mean: A little over a year and a half after the release of GPT-3.5, the technology has become so widespread that there are multiple independent replications. If technology really is increasing at a hyper-exponential rate, we should expect smaller and smaller time gaps between the leading edge and widespread availability (and, paradoxically, larger and larger capability gaps). One key thing to track is how long it takes before we have widely available models better than GPT-4.
Rating: 🤖🤖🤖+🤖/2 (3.5 large language models)
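If you want to try one of these locally, the standard Hugging Face transformers workflow is enough. A minimal sketch, assuming the openchat/openchat_3.5 checkpoint (Yi-34B works the same way but needs far more memory or aggressive quantization); check the model card for the exact name and chat template:

```python
# Minimal sketch of running one of these small open models locally with
# Hugging Face transformers. The model id is an assumption; check the model
# card for the exact checkpoint name, chat template, and hardware requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openchat/openchat_3.5"  # Yi-34B-Chat works the same way but needs much more memory

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # spread across GPU/CPU as available (requires accelerate)
    torch_dtype=torch.float16,  # half precision to roughly halve memory vs fp32
)

# In practice, wrap the prompt in the model's own chat template for best results.
prompt = "Explain RLHF in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```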