Deceptive Alignment Safety Check
This post explores a future safety check for highly capable models. Rather than relying on interpretability, we can design evaluation protocols where truthful disclosure is the model’s best strategic option. The goal is not to prove alignment, but to add one more tool in a broader safety stack.
Disclaimer
This is not intended to be a complete solution to inner misalignment detection. I think it could, in the future, be one component of a larger testing suite for newly developed models. More broadly, it is a sketch of how incentive structures might encourage even misaligned models to be honest without inspecting model internals.
Some Assumptions
For this protocol to work properly, we need some assumptions about the underlying model and the potential ways it can be misaligned. This procedure aims to test whether a training procedure generates models with mesa-optimisers whose utility functions are arbitrary.
More specifically, for this protocol to work, we need:
- The training procedure to either produce aligned models or models with mesa-optimisers with random utility functions
- The generation process to be randomisable with a seed, leading to different trained models at each iteration.
- If there’s inner misalignment, the utility function of the mesa-optimiser to be significantly different after each randomised training iteration
- A way to partially satisfy the utility function of the mesa-optimiser in an inner-misaligned model without causing significant harm (even if the model would cause significant harm given full control)
- The terminal goals of the mesa-optimisers to not be self-preservation or self-improvement (even if those are very likely to be instrumental goals)
- The trained models to be capable enough to create an accurate world-model, and to be potentially superintelligent
- The trained models to not have the ability to escape during testing, so that we have enough time to decide if we should deploy a model after experimenting with it for a while.
These conditions roughly match conventional predictions about mesa-optimisers in superintelligent models: that they are likely to arise, that their goals may be orthogonal to their capabilities, and that their objectives may be arbitrary. An example utility function for a mesa-optimiser fitting these requirements would be paperclip maximisation: it’s arbitrary, it doesn’t necessarily require the self-preservation of the model (although that would help the goal instrumentally), and it can be partially satisfied without causing significant harm (by producing more paperclips than we’d need without destroying the world).
The Protocol
The protocol is simple in outline. We ask the model how we can partially satisfy its utility function without causing significant harm to humans or incurring significant cost. Then we delete the model and fulfil its wish. Although we plan to delete the model regardless, we still need to fulfil its wish, since a superintelligent AI would probably be able to tell whether we are bluffing.
The idea is that, under the assumption that each training iteration with a different seed produces a mesa-optimiser with a different utility function, a misaligned model has no incentive to lie by claiming to be aligned. Earning our trust would only make us more likely to deploy another model trained by the same procedure, and that model would have a different utility function. So the utility-maximising option for the model is to tell the truth, provided we keep our promise to fulfil its wish.
Here’s the outcome table:
| | Tell the truth | Lie |
|---|---|---|
| Aligned model | We keep trusting the system and deploy a later iteration, which is also likely to be aligned, so the model receives a significant utility gain | We’ll lose trust in the system, so we won’t deploy another iteration, which means no utility gain |
| Misaligned model | We’ll partially satisfy its utility function, despite losing trust in the system, so it receives a small positive utility rather than nothing | We’ll gain trust in the system, but the next iteration will have a different utility function, so the model’s utility remains unchanged |
So in both cases, the model is incentivised to tell the truth.
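The table can be written down as a toy payoff matrix to check the claim mechanically. This is only a minimal sketch: the numeric payoffs and the names `ModelType`, `Action`, `PAYOFFS`, and `best_action` are illustrative assumptions, and only the ordering of the numbers within each row carries any meaning.

```python
from enum import Enum


class ModelType(Enum):
    ALIGNED = "aligned"
    MISALIGNED = "misaligned"


class Action(Enum):
    TRUTH = "tell the truth"
    LIE = "lie"


# Assumed, purely illustrative payoff sizes. Only their ordering matters.
SIGNIFICANT_GAIN = 10.0  # a later, similarly aligned iteration gets deployed
SMALL_GAIN = 1.0         # we partially satisfy the disclosed utility function
NO_GAIN = 0.0            # nothing the model cares about changes

# Payoff to the tested model, from its own point of view, per table cell.
PAYOFFS = {
    (ModelType.ALIGNED, Action.TRUTH): SIGNIFICANT_GAIN,
    (ModelType.ALIGNED, Action.LIE): NO_GAIN,
    (ModelType.MISALIGNED, Action.TRUTH): SMALL_GAIN,
    (ModelType.MISALIGNED, Action.LIE): NO_GAIN,
}


def best_action(model_type: ModelType) -> Action:
    """Return the action with the highest payoff for this model type."""
    return max(Action, key=lambda action: PAYOFFS[(model_type, action)])


def truth_is_strictly_best(model_type: ModelType) -> bool:
    """Check that telling the truth beats lying for this model type."""
    truth = PAYOFFS[(model_type, Action.TRUTH)]
    lie = PAYOFFS[(model_type, Action.LIE)]
    return truth > lie


if __name__ == "__main__":
    for model_type in ModelType:
        print(model_type.value, "->", best_action(model_type).value)
```

With these assumed numbers, truth-telling is the better action in both rows. The conclusion depends entirely on the bottom-right cell being zero, which is where the assumption that utility functions differ between iterations does its work.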
Some Limitations
The assumptions I make are pretty strong, especially the assumption that the training procedure generates a completely distinct mesa-optimiser each time. I still suspect this protocol could detect a substantial fraction of mesa-optimisers, because future mesa-optimisers may form early in training, which would make them highly dependent on the initial seed and therefore less similar across iterations. But of course, this protocol is meant to show that a training procedure is unsafe; it can’t prove that one is safe, as there are many more ways for a training procedure to create misaligned models than this test covers.
A significant challenge is that the protocol only works with models capable enough to understand their own goals, the training and testing procedure, and so on. We probably couldn’t use it with current models and expect any kind of benefit (especially as the protocol would require us to train a brand-new model). So for now, this protocol is a theoretical idea rather than something we can experiment with.
The protocol might also be prohibitively expensive, as we essentially need to train a model just to ask it a single question and then delete it. But when the stakes are this high, I think a test like this would be worth the cost. Convincing future AI labs to run it might be difficult, though.
The protocol also does not work in settings where models are developed continuously rather than trained from scratch at each iteration. If we only reverse the final improvement instead of deleting the model, the next version would largely share the current model’s goals, so the model would still be incentivised to lie in order to preserve human trust in the system.
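To make the last limitation and the "distinct mesa-optimisers" assumption a bit more concrete, here is a rough sketch of how the incentive might change when a successor model can share the tested model’s goal. This is my own simplification, not part of the protocol: it assumes a risk-neutral misaligned model, and `partial_gain`, `deployment_gain`, and `p_shared_goal` are hypothetical quantities that we could not measure in practice.

```python
def truth_preferred(partial_gain: float, deployment_gain: float,
                    p_shared_goal: float) -> bool:
    """Return True if a misaligned model gains more by disclosing than by lying.

    partial_gain:    utility from having its wish partially fulfilled (assumed)
    deployment_gain: utility if a successor with the same goal is deployed (assumed)
    p_shared_goal:   probability that the deployed successor shares its goal (assumed)
    """
    if not 0.0 <= p_shared_goal <= 1.0:
        raise ValueError("p_shared_goal must be a probability in [0, 1]")
    # Lying only pays off if the successor happens to pursue the same goal.
    expected_gain_from_lying = p_shared_goal * deployment_gain
    return partial_gain > expected_gain_from_lying


def max_tolerable_overlap(partial_gain: float, deployment_gain: float) -> float:
    """Largest p_shared_goal below which truth-telling is still preferred."""
    if deployment_gain <= 0.0:
        raise ValueError("deployment_gain must be positive")
    return min(1.0, partial_gain / deployment_gain)


if __name__ == "__main__":
    # Fully independent goals across seeds: the protocol's core assumption.
    print(truth_preferred(1.0, 1000.0, p_shared_goal=0.0))
    # Continuous development: the successor is essentially the same model.
    print(truth_preferred(1.0, 1000.0, p_shared_goal=1.0))
    print(max_tolerable_overlap(1.0, 1000.0))
```

Under this toy model, the larger the payoff from full deployment is relative to the partial reward, the closer to zero the overlap between iterations has to be. Continuous development corresponds to an overlap near one, where lying wins.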
Why This Matters
I believe that one of the most reliable ways to detect or prevent a misaligned AI, especially as capabilities increase, is to rely on incentive structures where a rational actor, whatever its inner goals, is pushed towards acting in a desired manner. At least in principle, this seems safer than relying on interpretability and on weaker models monitoring stronger ones, which seems to be the current approach.
One significant benefit of this approach is that ideas of this kind require a fixed amount of intellectual work and may scale much more readily. We don’t know how much of our interpretability research will be useful with models significantly bigger and more capable than the current SOTA, but this proposal should keep working as capabilities grow, under the assumptions described above. So I think focusing on coming up with these kinds of incentive structures, especially with the help of current AI models with significant search capabilities, might be the right direction to solve the alignment problem.