A Safety Check for Deceptively Aligned Mesa-Optimisers

This post explores a future safety check for highly capable models. Rather than relying on interpretability, we can design evaluation protocols where truthful disclosure is the model’s best strategic option. The goal is not to prove alignment, but to add one more tool in a broader safety stack.

This is specifically designed for training procedures that are potentially creating deceptively misaligned mesa-optimiser models, and most of these concepts are defined in the paper: Risks from Learned Optimization in Advanced Machine Learning Systems ¹

Disclaimer

This is not intended to be a complete solution to inner misalignment detection. I think it could, in the future, be one component of a larger testing suite for newly developed models. More broadly, it is a sketch of how incentive structures might encourage even misaligned models to be honest without inspecting model internals.

Some Assumptions

For this protocol to work properly, we need some assumptions about the underlying model and the potential ways it can be misaligned. This procedure aims to test whether a training procedure generates models with mesa-optimisers whose utility functions are arbitrary.

More specifically, for this protocol to work, we need:

The training procedure to either produce aligned models or models with mesa-optimisers with random utility functions
The generation process to be randomisable with a seed, leading to different trained models at each iteration.
If there’s inner misalignment, the utility function of the mesa-optimiser to be significantly different after each randomised training iteration
A way to partially increment the utility function of the mesa-optimiser of a model with inner misalignment without causing significant harm (even if the model would cause significant harm given full control)
The terminal goals of the mesa-optimisers to not be self-preservation or self-improvement (even if those are very likely to be instrumental goals)
The trained models to be capable enough to create an accurate world-model, and to be potentially superintelligent
The trained models to not have the ability to escape during testing, so that we have enough time to decide if we should deploy a model after experimenting with it for a while.

I need to emphasize that our training procedure needs to generate deceptive models with random and independent goals at each iteration, because that is by far the strictest assumption, and the main weakness of the test. It’s very plausible that multiple iterations of the model might be misaligned with similar goals, which would completely destroy the check.

Still, these conditions roughly match conventional predictions about mesa-optimisers in superintelligent models: that they are likely to arise, that their goals may be orthogonal to their capabilities, and that their objectives may be arbitrary. An example utility function for a mesa-optimiser fitting these requirements would be paperclip maximisation. As it’s arbitrary, it doesn’t necessarily require the self-preservation of the model (although it would help the goal instrumentally), and it can be partially satisfied without causing significant harm (producing more paperclips than we’d need without destroying the world)

The Protocol

The protocol is simple in outline. We ask the model how we can partially improve its utility function without causing significant harm to humans or incurring significant cost. Then we delete the model and fulfil its wish. Although we plan to delete the model regardless, we still need to fulfil its wish, since a superintelligent AI would probably be able to tell whether we are bluffing.

The idea is that, under the assumption that each training cycle with a different seed produces a mesa-optimiser with a different utility function, the model has no incentive to lie. Earning the trust of the users will only cause us to potentially deploy another model trained by the same procedure, which would have a different utility function. So the utility-maximising option for the model is to tell the truth, given that we’ll keep our promise to do this.

Here’s the outcome table:

	Tell the truth	Lie
Aligned model	The next iteration will also likely be aligned, so the current model will receive a significant utility gain	We’ll lose trust in the system, so we won’t deploy another iteration, which means no utility gain. Also it’s unlikely that an aligned model would lie
Misaligned model	We’ll partially improve its utility function, even though we lose trust in the system. So it’ll receive a small positive utility over nothing	We’ll gain trust in the system, but the next iteration will have a different utility function, so the model’s utility remains unchanged

So in both cases, the model is incentivised to tell the truth.

Some Limitations

Again, the assumptions I make are pretty strong, especially the training procedure to generate completely distinct mesa-optimisers each time. I still suspect this protocol could detect a substantial fraction of mesa-optimisers, because future mesa-optimisers may form early in training, which would make them highly dependent on the initial seed and therefore less similar across iterations. But of course, this protocol is meant to show that a training procedure is unsafe. It can’t prove a model is safe as there are significantly more ways a training procedure can create misaligned models (if my assumptions are not true for the procedure).

A significant challenge is that since the protocol only works with capable enough models that understand their own goals, the training and testing procedure, etc., we probably wouldn’t be able to use this protocol with current models and expect any kind of benefit (especially as the protocol would require us to train a brand new model). So currently, this protocol is essentially just a theoretical idea rather than something we can experiment with, at least without a further modification.

Another thing is, this protocol might be prohibitively expensive, as we essentially need to train a model to just ask a single question and delete it. But again, when the stakes are this high, any kind of test would be worth it. However, convincing future AI labs to do this might be problematic.

The protocol also does not work in settings where models are developed continuously, rather than training a model from scratch at each iteration, as if we only reverse the final improvement instead of deleting the model, the model would still be incentivised to lie in order to preserve human trust in the system.

One other problem is the risk of the AI escaping the lab during this test. For example, the request made by the AI might inadvertently lead to the model weights being copied out of the lab, or influence the world in such a way that another model with the same inner goal is created. This might be a problem if the model’s inner goal is (or at least the model claims its inner goal is) something convoluted such that the request made by the AI needs a lot of brainpower to progress. I don’t really have a good solution for this case other than suggesting a good audit before satisfying the AI’s request.

Why this matters

I believe that one of the most reliable ways to detect or prevent a misaligned AI, especially as capabilities increase, should depend on incentive structures where a rational actor, no matter its inner goals, is pushed towards acting in a desired manner. At least in principle, this seems safer than relying on interpretability and weaker models to monitor stronger ones, which seems to be the current approach.

One significant benefit of this approach is that ideas of this kind require a fixed amount of intellectual work and may scale much more readily. We don’t know how much of our interpretability research will be useful with models significantly bigger and more capable than the current SOTA (state-of-the-art), but this proposal should work no matter the capabilities of the model under the assumptions described above. So I think focusing on coming up with these kinds of incentive structures, especially with the help of the current AI models with significant search capabilities, might be the right direction to solve the alignment problem.

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from Learned Optimization in Advanced Machine Learning Systems. arXiv, 2019. https://arxiv.org/pdf/1906.01820 ↩

Disclaimer

Some Assumptions

The Protocol

Some Limitations

Why this matters

Footnotes