Amidst the Russian invasion of Ukraine, the risk of nuclear war is now larger than it has been since the end of the Cold War. The spectre of nuclear annihilation, once thought a thing of the past, has returned.
While technology can avert some forms of annihilation, for example by deflecting major asteroid strikes, such naturally occurring risks are likely small; our species’ long history of surviving them is evidence of that. The same cannot be said for risks caused or exacerbated by technology. Nuclear war, climate change, engineered bioweapons, and even pandemics: these risks are unfortunately all too familiar.
In his book What’s the Worst That Could Happen? Existential Risk and Extreme Politics (2021), Andrew Leigh, Labor MP for Fenner, discusses these risks to our continued existence and how we might mitigate them. But he also worries about another risk, one less familiar and perhaps even more dangerous.
Progress in artificial intelligence (AI) research is accelerating, with the number of new papers doubling every two years. In April, OpenAI released DALL-E 2, a model that generates detailed images from text prompts. While DALL-E 2 struggled to render legible text within its images, Google’s Parti, announced only two months later, did not struggle at all. And in August, Stability AI freely released Stable Diffusion, allowing anyone to download the model, disable the content filter, and generate images on their own computer. It might be easy to get swept up in the debates raging around AI-generated art, but we must remember that what we have now is the barest hint of what is coming.
Language generation is where the true prize lies. Large language models (LLMs) are trained on a significant fraction of all human-produced text to predict which text is most likely to come next, given the text so far. OpenAI’s GPT-3, released in June 2020, was the first LLM to receive significant public attention, even writing an article for The Guardian. GPT-3’s numerous applications include writing university essays and powering GitHub Copilot, a programming assistant that suggests code: AI that hastens AI development.
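To make that training objective concrete, here is a minimal sketch of next-token prediction in Python, using the small, freely available GPT-2 model through the Hugging Face transformers library as an illustrative stand-in (GPT-3 itself is only accessible through OpenAI’s API; the prompt and the choice of model here are mine, not drawn from the systems discussed above). Given a prompt, the model assigns a probability to every possible next token, and generation simply repeats this one token at a time.

```python
# Minimal sketch of next-token prediction, the task LLMs are trained on.
# GPT-2 stands in for its far larger successors; the mechanism is the same.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The greatest risk facing humanity this century is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per vocabulary token, per position

# Turn the scores at the final position into probabilities for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob:.3f}")
```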
The reasoning capabilities of these LLMs generally improve as they are made larger and trained on more text. Some flaws persist: these models are superhuman at predicting the text most likely to follow the input, which is not always the same as reasoning well. Yet their reasoning improves drastically if we simply append ‘Let’s think step by step’ to the input, because a correct chain of reasoning then becomes the most likely continuation. What other capabilities are we yet to discover?
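As a sketch of how that trick works in practice, the snippet below reuses the GPT-2 model loaded above and simply appends the magic phrase to the prompt. The dramatic effect reported in the research literature was measured on much larger models; a small model like GPT-2 will not suddenly reason well, but the mechanism, changing the prompt so that step-by-step reasoning becomes the likeliest continuation, is identical.

```python
# Zero-shot chain-of-thought prompting: append 'Let's think step by step.'
# Reuses the `tokenizer` and `model` loaded in the previous sketch.
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")

plain_prompt = f"Q: {question}\nA:"
cot_prompt = f"Q: {question}\nA: Let's think step by step."

for prompt in (plain_prompt, cot_prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True), "\n")
```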
In 2021, Jacob Steinhardt, an assistant professor at UC Berkeley, created a forecasting contest to predict AI progress, including progress on MATH, a benchmark of high-school competition-level mathematics problems. The aggressiveness of the resulting forecasts shocked him: state-of-the-art AI in 2021 answered only 6.9 per cent of the questions correctly, yet the median forecast for 2025 was 52 per cent. In April, Google announced PaLM, which outperforms the average human on a benchmark designed to be difficult for LLMs. Just two months later, Minerva, a version of PaLM specialising in mathematical and scientific reasoning, scored 50.3 per cent on MATH, achieving four years of forecast progress in a single year.
But there are problems we fear will emerge in AI: universal problems of intelligent agents that already manifest in humans, corporations, and states. Different agents, whether people, corporations, or states, have different goals, and insofar as those goals are misaligned, some degree of conflict is inevitable. The problems associated with quantifying values and goals are encapsulated by Goodhart’s law: when a metric becomes a target, it ceases to be a good metric. Surrogation, the process by which a surrogate metric comes to replace the value it was meant to measure, is rife in corporations and states, leading them to sacrifice the unmeasured good in pursuit of the metric: maximising profit or minimising unemployment payments, for example, without regard for the resulting harm.
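A toy calculation makes Goodhart’s law vivid. All of the numbers below, and the split of ‘effort’ into a measured and an unmeasured activity, are invented purely for illustration: the true good depends on both activities, but only one of them shows up in the metric, so the harder the metric is optimised, the worse the true outcome becomes.

```python
# Toy illustration of Goodhart's law: optimising a proxy metric that captures
# only part of what we value. All numbers here are invented for illustration.

def true_good(measured_effort: float, unmeasured_effort: float) -> float:
    # What we actually care about depends on both activities.
    return measured_effort + 2.0 * unmeasured_effort

def proxy_metric(measured_effort: float, unmeasured_effort: float) -> float:
    # But only one activity is measured.
    return measured_effort

budget = 10.0  # total effort available to allocate
for share in (0.3, 0.6, 0.9, 1.0):
    m, u = budget * share, budget * (1 - share)
    print(f"{share:.0%} of effort on the metric -> "
          f"proxy = {proxy_metric(m, u):4.1f}, true good = {true_good(m, u):4.1f}")
```

As the share of effort devoted to the metric rises, the proxy climbs steadily while the true good falls; a corporation maximising quarterly profit at the expense of everything it does not measure follows the same arithmetic.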
If these problems emerge in AI, the consequences will be disastrous. AIs have many advantages over human minds: they do not necessarily need rest or consciousness, can easily be copied, run on constantly improving processors that operate at frequencies some ten million times higher than the human brain’s, and can draw on the entire internet and all recorded human text as training data. So as AI systems develop, their role in scientific, technological, and economic progress will grow as human input and control shrink in equal measure.
In the future, we will likely construct AI systems that can reason at least as well as the best humans in any given domain, however broad. Nothing in known science rules this out. And under competitive pressure to maximise profit and secure geopolitical dominance, states and corporations may relinquish more and more control to proliferating but inscrutable AI systems. Eventually, out-of-control AI systems might determine that the most effective way to pursue values and goals we never intended to give them is to seize power for themselves, executing an AI takeover and permanently disempowering humanity.
To prevent an AI takeover, there are two key problems we must solve. Alignment is the problem of imparting our intended values and goals to AI systems, rather than mere surrogates of them. Interpretability is the problem of understanding how and why AI systems make the decisions they do. If we solve these problems, we must then robustly align powerful AI systems with values that promote the flourishing of all humans, and indeed all sentient life, under oversight from equally powerful interpretability tools. We often fail to do this for corporations and states, but humans holding power within those organisations can attempt to direct them to act in line with human values, and whistleblowers and journalists can render them somewhat interpretable, limiting the resulting harm. No such mechanisms will save us from an AI takeover if work on alignment and interpretability fails.
Unfortunately, progress in alignment and interpretability currently lags far behind progress in AI capabilities. And while some organisations at the forefront of capabilities research, like OpenAI and DeepMind, have safety teams focused on these problems, not enough is being done. We charge forward recklessly, headed towards disaster.
Many different skill sets will be required to navigate this risk and ensure that AI brings prosperity to all, from philosophy to computer science to politics and governance. To learn more, and perhaps to contribute yourself, see the introductory report AGI Safety from First Principles and the freely available course materials for AGI Safety Fundamentals, both by OpenAI’s Richard Ngo. And for a lighter overview, see the Most Important Century series by Holden Karnofsky, co-CEO of Open Philanthropy.
Climate change was once an obscure and neglected issue, as AI takeover risk is today. I hope that you and the world take this risk seriously, as we have begun to do with climate change, because I believe navigating AI takeover risk to be the foremost challenge of our time, and of all time.
Let’s get to work.