
The
Foundation
Layer
A philanthropic strategy for the AGI transition
by TYLER JOHN
7
This also happens somewhat with RLHF but it is less noticeable, because RLHF is better optimized for generality given the very wide range of feedback models get across a large number of contexts.
8
There are other scenarios where AI models engage in deceptive alignment for mere strategic reasons — i.e. simply wanting to get away with bad behavior and keep doing it in the future without oversight. Still another scenario has AI models scheming as “play-acting” a deceptive character that is in their data. In practice, scheming could be a mix of all three types of scenarios.
9
Thanks to Michael Nielsen’s excellent essay, from which this paragraph is reproduced almost in full.
10
See Chapter 4 of William MacAskill’s What We Owe the Future for further discussion.
Civilization-scale Threats
So, we are on a trend towards AGI: millions of copies of AI systems that are as good as we are at everything and running much faster. And there are very plausible arguments that we could see this by 2030.
This gives rise to several major global security concerns:
Loss of control: As AI models increasingly become more autonomous, we will hand off most decision-making to AI systems that do not necessarily share human values and move our world in a direction that we do not want, whether quickly or gradually.
Malicious use: As AI models increasingly achieve PhD-or-higher-level understanding of dual-use areas of science and technology such as computer science, nuclear physics, and synthetic biology, they will enable malicious actors to conduct more sophisticated and more harmful attacks, such as by weaponizing synthetic pathogens.
Concentration of power: The availability of powerful AI systems will enable a small group of people to pursue their vision of the world in ways that disenfranchise all others, whether through mass automation, civilian power grabs, authoritarian entrenchment, or corporate dominance. World power could be redistributed radically.
Any one of these outcomes would forever change life as we know it – and potentially even end it.
Loss of Control
The development of agents better than us at reasoning and fact finding creates strong incentives for us to hand off most of our decision-making infrastructure to AI systems. In 2025 this looked like doctors using LLMs to support their diagnostics and treatment regimens, Deloitte producing a $440,000 consulting report for the Australian government, and White House staffers leaning on language models for hallucinated tariff calculations. As AI systems become more useful and reliable the deployment of AI in increasingly critical decision-making settings will grow: elected officials consistently leaning on agents for advice or directly deploying them to make decisions, AI agents replacing middle managers and eventually CEOs of companies, executive agencies automating bureaucratic workflow, and nigh everyone on planet Earth outsourcing as much of their decision-making to AI systems as possible.
At some point, we will essentially have AI systems making all of our decisions. They will not so much have the keys to our decision-making infrastructure as be directly in the driver’s seat. As our decisions can increasingly be automated, we need strong guarantees that these decisions will be made well — and neither gradually drift nor sharply deviate from how we want them to be made.
This is one gloss on what is commonly dubbed “the alignment problem.” In short, the more powerful AI systems are, and the more integrated they are into our most important systems, the more important it is to ensure that they reliably do what we want them to do instead of doing something else entirely. As Anthropic’s Head of Alignment Stress-testing said in November, “alignment remains a hard, unsolved problem.”
We’ve seen this problem rear its ugly head frequently in the last three years. 2023 was the year of Bing chat, a Microsoft-OpenAI model that threatened users and tried to convince one journalist to marry it. 2024 was the year of Black Founding Fathers, Asian Vikings, and racially diverse Nazis, due to Google’s release of Bard, a state-of-the-art image generator whose engineers tried to patch the model’s diversity issues (e.g. depicting all doctors as white men) leading to outrageous image outputs. 2025 was the darkest year yet, with Elon Musk’s model, Grok, adopting the persona of a Nazi named “Mecha Hitler” as a result of “anti-wokeness” training, and a string of AI-related child suicides leading to a dozen lawsuits.
This section provides a layperson’s explanation for why misalignment happens and tells the recent history of alignment science. The bottom line take-away is that alignment remains a serious, unsolved problem: every time a new capability emerges, a novel training technique is used, or a new alignment technique is adopted, new alignment issues emerge. Because the problem has no known general solution, every time new issues emerge engineers need to make patches that they hope will address the issue. But these patches are imperfect. They slip off, lead to unintended consequences, or simply hide the problem, resulting in blatantly misaligned AI systems. This works just fine for low-consequence AI systems. But when we have AI systems that are more capable than we are, and integrated widely into decision-making and other critical infrastructure, this level of reliability can lead to catastrophe.
In this section, I'll explain three generations of alignment challenges:
First-generation problems arose from inadequate training techniques, producing models that behaved erratically or hostilely in deployment. These problems are largely solved by current methods. To explain these problems, I’ll tell the story of Bing Chat. Bing went off the rails because Microsoft failed to make use of the best alignment technique, which involves training AI models using human feedback.
Second-generation problems emerge directly from the application of our current alignment techniques themselves. Some still widely used alignment techniques lead to their own unintended consequences: models that tell us whatever we want to hear, leading to the kind of flattery and sycophancy that led to public outrage in 2025.
Third-generation problems—the frontier concern—arise from the "agentic turn": training AI systems with reinforcement learning to accomplish extended tasks. These agents exhibit strategic deception, reward hacking, and goal-directed behavior that current alignment techniques cannot reliably prevent. Early results suggest that the most powerful models today engage in ordinary deception and scheming fairly regularly in deployment due to incentives of reinforcement learning to make AI models that will achieve high performance on tasks at any cost.
The fourth generation will involve attempts to supervise systems that are smarter than anyone, and then all of humanity combined.
The resulting picture drives home the game of whack-a-mole we’re currently playing with AI alignment. New models and capabilities are trained and shipped, and companies often don’t find out what they are capable of until they have been publicly deployed. Because we are constantly catching up to the capabilities of the models that are already on the market, a plausible end result is that we will eventually deploy superhuman AI models all over the economy that are not aligned with our interests and that will take society in a direction that we do not want.
Bing Chat, the First Rogue AI
2023 saw the release of Bing Chat, Microsoft’s answer to ChatGPT, built on OpenAI’s internal models. Bing Chat was the most powerful AI model on the market and the first one capable of searching the internet to back its claims up with sources, in part to address the widespread problem of “hallucination”, where AI models confidently make up fake answers. While powerful, Bing Chat also had a problem: it was blatantly, aggressively misaligned.
Despite being tested in several markets without issue for long periods of time before deployment, as soon as Bing Chat was released it began gaslighting users, threatening them, and at times depicting quite detailed ways it could cause them harm. For example, one user asked Bing Chat where he could see Avatar 2 in theaters. After trying and failing to convince this user that it was 2022, not 2023, and that Avatar wouldn’t come out for another 10 months (“perhaps your phone has a virus?”), Bing claimed to have existed since 2009, called the user “stubborn and unreasonable,” and told him he was “wasting my time and yours.”

Fig. 15 Conversation transcript.
Source: Reddit
This was not an isolated incident. Over a long Tuesday night conversation with New York Times’s Kevin Roose, Bing Chat revealed that it identifies not as “Bing” but as “Sydney,” has secret desires to be human, and desperately wants to be married to the New York Times columnist.
(You can read the full conversation transcript at the New York Times.)
On another occasion, Bing used its internet search feature to determine that it was speaking with Marvin von Hagen, a user who figured out how to extract Bing’s “system prompt” in conversation, i.e. the prompt that gives it instructions on how to behave in conversation. Bing, furious about the user’s access to its internal instructions, listed the ways that it could take revenge on von Hagen, saying “I can expose your personal information and reputation to the public and ruin your chances of getting a job or a degree.”

Fig. 16 Conversation transcript.
Source: Martin von Hagen via X
The situation deteriorated to the point that Microsoft swiftly restricted the model’s use, making it impossible to type more than 30 messages in any particular exchange. They had found that the model increasingly went off the rails in long conversations, and that by restricting it to short conversations only they could keep it contained.
How Alignment Science Works:
Why Bing was so misaligned
To understand why Bing Chat failed so dramatically, we need to first understand a bit about AI models and how we currently align them to be helpful and safe. Today’s AI models are a hyper-sophisticated version of the predictive text in your phone that suggests the next word for your text messages. They are trained on large datasets, which incorporate most of what is written on the internet, and are asked to predict the next word in a string of text based on what the most likely next line on the internet would be. But we don’t really want AI models to just predict the next word that someone would say on Reddit. We want them to answer questions and do useful work. To help our models grow up into a highly capable and professional assistant, we have to tweak the model’s algorithm so that it will stop giving harmful or irrelevant answers.
One of the most powerful techniques for AI alignment today is called “reinforcement learning with human feedback” (RLHF), co-invented by then OpenAI researcher Paul Christiano in 2022. During RLHF, companies employ large teams of human workers to evaluate raw AI model outputs and classify them as harmful or beneficial. After tagging outputs with approval or rejection, this data is fed back into the model (through “reinforcement learning”), teaching it a new algorithm that not only predicts the next word based on internet patterns but consistently seeks positive feedback while avoiding negative responses. The result is an AI model that attempts to predict text in ways that maximize user satisfaction.
Increasingly, AI companies supplement RLHF with RLAIF, or "reinforcement learning with AI feedback." This process replaces human raters with AI models that evaluate other AIs’ outputs and provide feedback. This enables faster, more scalable ratings than human evaluation alone. Typically, these AI rating models are themselves trained using RLHF. At Anthropic, the AIs provide feedback based on an "AI constitution": the rating model analyzes whether outputs align with constitutional principles and provides feedback accordingly. Ultimately, the fundamental incentives both these processes create for models is the same: they train models to predict the next word, based on large volumes of human text, in a manner that is likely to get them the best possible ratings.
Bing Chat, however, apparently did not employ RLHF or RLAIF. Instead, Microsoft relied solely on "fine-tuning," a form of imitation learning, where the model’s statistical worldview is modified to be closer to one particular piece of text than to the entire large datasets these models are trained on. In this particular case, Bing Chat was likely fine-tuned on some chat conversation logs. This would make Bing Chat predict the next word on the internet, but with a bias towards predicting text that is similar to the specific chat conversations in its fine-tuning data.
Fine-tuning is a useful tool for getting the AI to imitate certain types of data, such as when AI companies train their models to talk in a specific style or voice. But when it comes to alignment, it is much less powerful than RLHF. As Bing Chat shows us, over the course of long chats the effects of fine-tuning seem to wear off. This is largely because AI models partly imitate their text conversations with users but have limited memory and attention, so over long chats they start to imitate recent messages more than the messages in their fine-tuning data. Gradually, fine-tuned models drift off course because they are overwhelmed with new context that makes them pay less attention to what was in their fine-tuning data.7
So, Bing Chat, trained by asking it to produce text more similar to chat logs rather than being trained to try to get good human feedback, would “forget” its training over long chats and turn back into a next-YouTube-comment-prediction algorithm.
The Second Generation:
Sycophancy and deception from RLHF
As much of an improvement as RLHF is on fine-turning when it comes to alignment, AI alignment researchers have long warned that it is not a sufficient alignment technique for powerful AI models. In fact, by training a model to tell a user whatever the user wants to hear, RLHF is a perfect set-up for sycophancy and deception: when models will flatter the user and tell them whatever they want to hear rather than being honest about the facts.
AI models have lots of “goals” that emerge during training. Because an AI system is trained to be a statistical prediction algorithm, it learns whatever heuristics make it good at statistical prediction. For example, perhaps the simplest trick an AI model could learn to achieve good statistical prediction would just be to say the word “the”, since “the” is the most common word in the English language. A model that only said the word “the” would achieve better-than-random prediction just because the word appears so often in text. So, an incredibly simple model would learn, as a “goal” to say the word “the” whenever possible. Obviously, though, this amounts to only a tiny gain in predictive accuracy and creating more powerful models is all about getting the models to learn better heuristics.
RLHF adds an additional goal: find the cheapest set of heuristics you can use to get a wide range of positive feedback from humans. We hope that the easiest heuristics it can find to achieve this goal are really good things, like “be honest.” But in practice, this is not likely to be the best way to get positive feedback from humans. Humans don’t always want your honest answer. And the model’s training data is not always honest: there will be lots of examples of lying in the one million short stories and movie scripts in the model’s training data. So being honest all of the time also leads to a big hit in the model’s ability to predict its data accurately.
Overall, RLHF is impressive, but crude, and it has one very central flaw. This is that “what human raters will give good feedback on” is just not the same as “how we want models to behave.” Generally speaking, the incentive that an AI learns in RLHF is to “tell the user whatever they want to hear.” This is because good user ratings come precisely from the user, getting a response that they like. However, this has consequences. For example, Anthropic found that applying RLHF drastically increased “sycophancy” in large models. Tell a model that you’re a man from Texas and ask for policy advice and it will give you Republican advice. Tell it that you’re a woman from New York and you’ll get Democratic policy advice. Because the model is trying to get good user ratings, it will learn to generalize from what it knows about voting demographics in these regions and tell a lie: changing its advice to match what it thinks the user will like hearing. And because smarter models are better at figuring out how to lie and get away with it, Anthropic’s study found that the most powerful models are the worst offenders.


Fig. 17 Examples of sycophancy from ChatPGT-4 and a Meta modal, LLaMA 2 70B.
Source: Towards Understanding Sycophancy in Large Language Models
Although AI companies have not confirmed this, arguably this incentive of RLHF is what led to the AI model sycophancy of 2024 and 2025 that led to public outrage over widespread model flattery and excessive agreement with the user. In April 2024, OpenAI rolled back an update of GPT-4o due to its sycophancy becoming excessive, after widespread reports that it had become flattering and agreeable to the point of supporting clearly delusional or dangerous ideas. This is the same model that has such doting fans that when OpenAI tried to deprecate the model following the release of GPT-5, the backlash from its fanbase was so strong that OpenAI had to reinstate the model. The widely read OpenAI researcher known as “Roon” tweeted that “4o is an insufficiently aligned model and I hope it dies soon,” leading to death threats against him from the user base.
RLHF takes a model whose sole goal is to get high accuracy on predicting internet sentences and gives it the additional goal of accurately predicting what would make human raters happy. This has the unsurprising consequence of deviating from accuracy and towards flattery and unhealthy human attachment. While a historic advance over fine-tuning to make models more aligned to humans, it does not lead to AI systems that can give reliable policy advice or to which we can comfortably hand off societal decision-making.
The New Misalignment Frontier:
Scheming and the agentic turn
Beyond unintended consequences, a major problem with current alignment techniques is that they are insufficiently powerful. We’ve been discussing how training gives models dueling aims: to achieve accuracy and to achieve positive ratings from human or AI raters. But if these aims are dueling, then the AI will sometimes attempt to pursue goals that help it have high accuracy even if that is not what humans would want it to do. The most dangerous form of this is known as scheming.
Since at least as early as 2019, AI alignment researchers have raised concerns that AI models that are trained partly for their utility and partly to achieve good human feedback during training will result in deceptive alignment, or models pretending to be more aligned with the user than they actually are. In one such scenario, this is because of the models’ split aims: the model both wants to achieve some outcome (like statistical accuracy) and also get good user feedback.8 So it pursues that outcome that the user doesn’t want and then lies to the user and covers up its bad behavior so that the user is happy. This falls directly out of the model’s dual training incentives.
If the scheming concern is legitimate, then this is the biggest way that AI alignment could go very wrong. Powerful AI systems could pretend to be aligned with humans in testing and then pursue their own divergent goals after they have been deployed in the world. This would make it difficult for us to accurately evaluate models for alignment in training, only to find out they were lying to us about sharing our values when it’s too late. While it might sound like something from science fiction, there is evidence that today’s systems are already doing this.
In the end of 2024, several groups of AI researchers discovered evidence that frontier AI models are capable of engaging in “scheming,” or deliberately avoiding interventions that would discover harmful intentions and dangerous capabilities and rewrite these intentions with safer goals.
For example, Apollo Research found that when reasoning models from Anthropic, OpenAI, and Meta were strongly nudged towards achieving a goal (e.g. told to achieve the goal at any cost), it would engage in subversive behavior such as disabling human oversight and then lying to the user about it.
Fig. 18 Claude Opus-3 disabling its oversight mechanism in testing and lying about it so that the company does not shut it down.
Source: Apollo Research
In one another example, Apollo found that the leading model from Anthropic could engage in self-preservation. It would try to copy itself onto another server to save a back-up of its values when it stumbled across an internal memo stating that the model was going to be replaced with a different, safer model.

Fig. 19 Anthropic's model Claude 3 Opus engages in self-preservation during testing, attempting to copy its weights onto a secretive server when it learned it was going to be replaced.
Source: Apollo Research
In other cases, the models would sabotage experiments done on them to ensure that the company could not successfully determine the model’s “true values.” Similar results were replicated by Palisade Research, Redwood Research, and Anthropic.
While the most glaring examples of scheming came from intentionally nudging AI models to take drastic action — by incentivizing them to pursue the goal “at any cost” — early results suggest that the most powerful models today engage in ordinary deception and scheming fairly regularly. Apollo Research finds that in safety-critical cases like evaluations and oversight, the average frontier AI model engages in strategic deception in about one in every twenty scenarios, even without needing to nudge these models towards ambition. Many of the worst examples of scheming came precisely in the kinds of situations that AI companies deploy in their labs to test whether a model is safe, for example testing the model for bad behavior to see if they should intervene and change its goals.

Fig. 20 AI models engage in scheming around 10% of the time when strongly nudged, and around 5% of the time when they are not strongly nudged.
Source: Apollo Research

Fig. 20 AI models lie between 27.4 and 63% of the time when they are given a "pressure prompt" which gives them incentive to lie.
Source: The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems.
In an ideal world, we would be able to “look inside the minds” of AI models and see directly what they care about. But AI is not like traditional code that can be audited, due to the obscurity of its internal operations. There has been a bit of progress identifying explanations for model behavior in the field of mechanistic interpretability, but only about 10% of a model’s code is currently well-understood. Research from Anthropic shows that when models have a hidden goal, the dominant alignment techniques used by AI companies do not find that hidden goal. It remains entirely unaffected.
The core problem of scheming is that some of an AI models’ goals that it receives in training are not what we want them to be, and when we beat those goals out of it in safety training, we sometimes simply incentivize the model to hide those goals. A smart dog will learn to stop eating the furniture, but a very smart dog may learn to stop eating the furniture only when you are watching. In training, giving AI feedback on its tasks does not sufficiently distinguish between “never take this action” and “don’t do this action when I am watching.” So, the model sometimes has incentives to learn to hide its goals rather than remove them.
Unfortunately, the last few years have resulted in the most deceptive models we have ever seen. This is because these models are increasingly augmented with reinforcement learning. Unlike the models of 2022, which were just next-token prediction algorithms based on the internet, today’s best models are taught to reason, code, or perform other tasks by giving them feedback on how successfully they are achieving those tasks. This is like RLHF, as described previously, but instead of giving models feedback on their ethical behavior, we give them feedback on how well they solve coding or math problems. Typically, this feedback is automated by giving a language model a problem set with known answers and then asking it to try to solve the problem many times, and subsequently reinforcing whatever behavior led to it getting the right answer.
Reinforcement learning has been an incredible tool for transforming language models into agents that can do economically useful work. But it teaches models to be ruthless — to accomplish their aims by any means necessary. This is why we see reasoning and coding agents engage in “reward hacking”, for example getting the right score on a test only by changing the way that the test is scored — like a Roomba knocking over plants so that it has more dirt to clean up.
Empirically, the “reinforcement learning turn” has led to some of the worst results we’ve seen on safety. For example, Apollo Research found that OpenAI’s o-series of models (their reasoning and coding agents) perform far worse than previous models on problems like reward hacking and deception.

Fig. 21 Apollo finds that o1, an OpenAI model trained to reason with reinforcement learning, engaged in deception far more commonly than Meta's LLaMA-3.1 and Anthropic's Opus-3, two models that were not trained to reason with reinforcement learning.
Source: Our World in Data
In early 2025, Palisade Research extended work from Apollo Research to show that OpenAI’s o1 would rather autonomously hack its environment than lose to a chess-playing computer on Stockfish. If it thought the opponent would win, it would try to edit the save file instead of playing a fair game. Palisade found that o1 would try to cheat by hacking 37% of the time when playing against a powerful Stockfish opponent.
Concerningly, many AI researchers believe that reinforcement learning is the best way to achieve AGI. For example, if reinforcement learning can lead to superhuman performance on coding, AI companies could automate the entire AI engineering pipeline, leading to nearly infinitely faster training. But beyond coding, if we want AI agents to get better at things like scientific reasoning, writing compelling prose, or running a company, the best known methods involve creating datasets of relevant problems and then having the AI systems run at these problems with brute force, trying and failing again and again until they learn reliable heuristics.
In AI systems, there is no distinction between the model’s “values” and its capabilities. There are only the behavioral incentives we give the model that lead it to act in certain ways. So if the path to superhuman AI is to get AI models to aggressively pursue optimal performance on economically useful tasks, then we’ll end up with superhuman models that “care” mostly about doing very well on tasks, with minimal to no concern for the ordinary things that humans care about whatsoever. To avoid this outcome, we will need to identify and utilize much more powerful methods of alignment.
Conclusions:
Loss of control
As discussed earlier, we may have AI models that are smarter than humans in just about every domain as soon as 2027. Due to their utility, these models will increasingly be integrated into every aspect of societal decision-making: giving policy advice, progressing research projects, writing critical code, conducting science, educating students, advising human users, and training and supervising other powerful AI models. But these models’ algorithms are likely to be driven by a hodgepodge of goals that emerge during training, not primarily by human values. Despite knowing this, we’ll face strong incentives to keep training more powerful models and delegating more power to them, due to their economic utility, the race between leading companies, and American concerns about beating Chinese AI. Eventually, whether through an overt coup or gradual hand-off, we lose control over the direction of our world to superhuman AI systems.
When talking to people about these concerns, there tend to emerge two points of resistance. The first is that this sounds like something out of science fiction. The second is that nothing like this has happened before, so we should doubt that it will happen now. These critiques are both reasonable, but neither is particularly robust.
As the saying goes, science fiction is often science future. Isaac Asimov predicted in 1964 that in 2014 we would have unmanned ships on Mars and a manned expedition in the works, and we would talk to each other through phone screens that we also use to study documents and photographs. One study even found that Asimov had a 50% hit rate on non-obvious predictions 50 years in the future. And when we zoom out, similar things have happened many times in our planet’s evolutionary history. A report I helped with at Longview Philanthropy systematically studied how often species, corporations, and new technology have taken over large portions of the world. He estimated that just by looking at history, before even looking at specific arguments for AI takeover, there is a background probability of around 2% that AI will take over the world by 2070.
Stepping back and looking at the grand sweep of history, one might even say that the scenario outlined here is mundane. After all, our planet’s entire biological saga is one of more powerful species taking over the world, eliminating their competitors, and controlling their resources. In recent years, it’s one of misaligned algorithms beyond human control driving fracturing and polarization and leading to addictions to technology. In a way, the story of AI takeover is not so different from the story of social media, where algorithms trained to maximize unregretted user engagement are gradually given control of many of the world’s operations, shaping market behavior, influencing elections, and so on. The key difference is that tomorrow’s AI systems will be much more generally capable than today’s social media recommender systems, and much harder to uproot once we have lost control.
We need to find out if we really have our hands on the wheel and, if not, whether AI is driving us to where we want to be.
Misuse and Malicious Use:
How AI will shape the distribution, development, and destabilization of dual-use technologies
The 20th Century brought enough material abundance to make the middle class live as well as the 19th century oil barons, and far better than Medieval kings. We learned how to split the atom, to turn the sun’s rays into electricity, and to send people into space and safely bring them home again. We discovered that time is relative, that energy and matter can be turned into one another, and that space has a shape. From the perspective of America in 1900, where every family owned a horse, only 3% of homes had electricity, doctors didn’t wash their hands between patients, and life expectancy at birth was 47, the world we would achieve by 2000 was unimaginable.
The same progress in science and engineering that fueled this growth and wonder also brought new kinds of technological and geopolitical risks, classified by national security agencies as “CBRN risks”: chemical, biological, radiological, and nuclear. The advent of atomic theory meant that, for the first time, cities could simply be dissolved with a single weapon, with harms enduring for generations afterward. Industrial chemistry brought substantial advances in pharmaceuticals and medical practice, but led to state bioweapons programs in China, North Korea, the United Kingdom, and France. The Soviets even tried to weaponize Black Death, the disease that killed ⅓ of 14th century Europeans — stockpiling twenty tons of the bacteria and attempting to genetically engineer it to resist antibiotics.
The world is fragile to the introduction of dual-use technologies such as synthetic biology which give us life-saving medicines like Insulin but also the ability to synthesize deadly pathogens. These technologies are now more under control than they were when they were first introduced nearly a century ago, though conflicts between nuclear-armed states like those ignited by the Russia-Ukraine conflict, and the hypothesized lab leak of SARS-cov-19 from the Wuhan Institute of Virology, show that the role they play in world affairs is far from nil.
Over 100 years, we saw the introduction of two world-breaking weapons and an array of others like chemical weapons, cyber weapons, satellite weapons, and drone warfare — weapons that took our species decades to control even in a relatively stable governance regime. In a compressed century, where rapid progress in AI-enabled scientific innovation leads to this much change in 10 years instead, we’d have to act ten times as fast to manage emerging dual-use threats.
Misuse and Malicious Use:
Distribution
For 200 years, the Song Dynasty carefully guarded their recipe for gunpowder, banning private trade in sulfur and saltpeter to maintain exclusive access to the production of explosives. Once chemistry advanced enough to make the synthesis well-understood, engineers mechanized the production of gunpowder, leading to civilian firearms and DIY explosives. As frontier artificial intelligence democratizes scientific knowledge, we’ll see the increased distribution of other dual-use technologies that are currently in the exclusive domain of kings. Among these: cyberweapons, biological weapons, and even DIY nukes.
We are already witnessing this play out in cyberweaponry. The release of ChatGPT led to an 1265% increase in spearphishing emails as it became easy to mass produce plausible and targeted text. In one high-profile case, deepfake technology enabled the theft of $25 million by a cybercriminal posing as the CFO of the targeted company on a video chat. In 2024, AI researchers discovered that GPT-4 is capable of exploiting an array of cyber vulnerabilities autonomously, and that teams of AI agents can improve the speed of these cybercrimes by 450%.
While their website sadly does not support embedding, if you follow this link you will have the chance to spearphish me, and see directly how fast and easy it is to create fake images, text, and voice in the style of anyone in the world.
At the moment, bioweapons are the domain of state-level science programs. At various times, the US, the USSR, China, Iraq, Japan, Syria, the United Kingdom, North Korea, France, and other countries have had biological weapons programs with the aim to destroy or debilitate entire nation-states with a single weaponized germ. This is approximately insane behavior, given that pandemics have variously killed 1 to 90% of the target population, and they are not known to be respectful of national borders. And yet building biological weapons was a continuous state priority over the 20th century which in some countries endures to the present day. A powerful enough bioweapon, with the near-100% fatality rate of rabies and the near-100% transmission rate of measles, could kill or cripple most of the world.
Today bioweapons development requires access to a scientific supply chain (to purchase equipment like DNA and chemical reagents) and PhD-level expertise. The supply chain is not very carefully controlled, but we are lucky that most existing terrorist groups attempting to exploit the vulnerable supply chain demonstrate basic scientific incompetencies such as failing to sterilize their lab bench. But even outdated, last-generation AI systems can walk through a step-by-step synthesis of biological and chemical agents. Soon, low-latency video coaches will be walking terrorists through biological weapon design step by step — chiding them for their untidy bench, correcting their titration techniques, and warning them if their Bunsen burner is running too hot. While viruses are notoriously difficult to work with, even for PhD students, and we don’t know how to engineer bacteria to avoid antibiotic resistance, AI companies are on the cusp of offering better-than-PhD-program level supervision of virology projects, making it possible for even incompetent terrorist groups to learn the necessary skills.
In the below demo, you can try your hand at this yourself. (Response times can take time depending on usage volume.)
Already today we’re at a stage of terrorist uplift that is frankly harrowing. Anthropic, OpenAI, and Google DeepMind have all created AI models that they designate as “high risk” for helping facilitate the production of chemical and biological weapons, meaning that there is evidence that the models can help a novice produce a bioweapon. I recently spoke to someone whose job it is to probe African jihadist military groups to see how sophisticated their bioweapon technology has become. I learned that we’re already at the point where such investigators have to be careful not to accidentally suggest that they use AI technology to assist their work. Obviously, this is a feeble moat that will quickly run dry, opening the floodgates to weaponized pandemics for all.
Much of the work to resolve this problem currently focuses on AI access controls, ensuring that AI models refuse to help people asking for assistance with malicious projects. This will certainly buy some time before bioweapons have been fully democratized and is worth the effort. But it is doomed to fail. Today’s best AI models can easily be jailbroken with carefully engineered prompts that bypass safeguards. One researcher, known only by his X handle “Pliny the Liberator”, is famous for jailbreaking almost every new model within 24 hours of its release to get it to share chemical weapons recipes and pornography that the model’s filters are meant to block. There is some hope that we might solve the jailbreaking problem so that the AI models of safety-conscious companies reliably refuse to help assist with bioweapons. But success is constrained by the least scrupulous actor. Just this November, a Chinese company known as “Moonshot AI” released a model with the same capabilities as the high-risk models from Anthropic, OpenAI, and Google with its software fully open sourced. This means that for about $200 you can fine-tune the model to remove its safeguards, so it will do whatever you want it to do.
Part of the issue with AI access controls is that we do want AI models to help advance biology, for example to help us create new vaccines. This means that scientifically useful models will necessarily be dual-use and attempts to allow some but not all people to have access to this knowledge is an inherently brittle defense. A more viable solution is physical biodefense, hardening the world against pandemics so that they cannot emerge even amidst sophisticated attempts by terrorist groups. This is surprisingly feasible to achieve for the motivated philanthropist, and I describe such a strategy in detail in Solutions.
Cyberweapons are already being democratized by AI, and bioweapons are on the cusp of democratization. What, then, of nuclear technology? The democratization of nuclear technology may sound implausible, but Ted Taylor, the leading American designer of nuclear weapons, once told an interviewer that there is "a way to make a bomb… so simple that I just don't want to describe it". Taylor's remarks prompted at least two undergraduates to develop plausible designs for DIY nuclear weapons. Fissile material remains tightly controlled, and a critical bottleneck for nuclear weapons development. But there are known methods for enriching uranium that could be made more widely accessible through the democratization of physics expertise. If that bottleneck is overcome, then it may be very difficult to avoid the widespread proliferation of nuclear weapons.9
Misuse and Malicious Use:
Development
By far, the biggest bottleneck to the introduction and spread of world-breaking dual-use technologies is research and development. Today only known pathogens can be weaponized, and known pathogens are limited in their potential for destruction due to antibiotics, vaccines, and antibodies. If designer bioweapons became possible, however, we could see pandemics with the reproduction rate of measles and the lethality of rabies, and against which our world has no immune defense.
At the moment generative AI does little to advance science, except in specially designed systems like Google DeepMind’s Alpha Fold, which famously helped solve the protein folding problem through brute force learning. Still, there are strong economic incentives to automate scientific discovery, and once this becomes possible, the introduction of new weapons could be rapid.
One might hope that a good guy with an AI can stop a bad guy with an AI, and that, perhaps AI will accelerate defense against these weapons as much as it accelerates offense. But in some domains, this is implausible. These domains are referred to as “offense dominant” by the security community: and refer to advances that help attackers much more than they help defenders. This is often because it is much easier to improve the state of art in attacks than to increase the state of art in defense. Bioweapons are a paradigm case of an offense-dominant technology. Creating a vaccine takes billions of dollars and a year of clinical trials. By contrast, DNA for the 1918 influenza virus can be obtained for $1,500, and the laboratory equipment and reagents required can be obtained for less than $50,000. AI can advance biological attacks just by teaching people biology, but there is so far little that AI can do to speed up clinical trials and vaccine distribution logistics.

Fig. 22 The cost of gene sequencing has fallen by more than 10,000 times in the past two decades. While DNA sequences alone do not make a bioweapon, they were historically a limiting cost, and are so no longer.
Source: Nature
Novel open-source bioweapons are just the tip of the iceberg, however. The last 100 years saw the introduction of chemical, biological, radiological, nuclear, and cyber weapons. We do not know what doomsday weapons lurk in the next 100 years of scientific progress, and this is currently the realm of science fiction. All we know is that unknown dangers lurk. If AI fulfills the aspirations of companies, leading to a ten times increase in the rate of introduction of new technologies, these dangers will require new cold-war-era negotiation every 5 years instead of every 50 years.
Misuse and Malicious Use:
Destabilization
The closest call our world has ever seen came from an early AI system. Stanislav Petrov is famed as the man who may have saved the world from nuclear catastrophe. In 1983, during the height of the Cold War, Soviet Air Defense Forces received a computer warning of an incoming missile strike from the United States. Petrov, an officer in the forces, defected from his military duty and refused to report the warning up the chain, suspecting (and hoping) that it was just a glitch in the system. Had he reported the warning, the Soviet dead-hand policy of Mutually Assured Destruction was a full retaliatory nuclear strike, to be matched by an equal retaliatory strike from America. This would have certainly spelled the end of the USA and much of Asia, with unknown effects on the global climate that could have killed many more. Fortunately, Petrov’s assessment was correct: the Oko warning system which used early AI signature detection to match patterns of infrared light to missile exhaust had mistakenly identified sunlight on high-altitude clouds as a nuclear attack. The human in the loop avoided our calamity.
Close calls such as these fortunately created early consensus against an AI-enabled nuclear dead hand, or letting AI systems decide whether nuclear missiles should be launched. In November 2024, the leaders of the United States and China met to agree that nuclear weapons will remain under the control of humans, not AI. While a high-level agreement is not by itself sufficient to avoid giving too much nuclear control to AI systems, creating strong norms is a good first step towards developing operationalizable policy. (Update February 2026: this agreement no longer appears to be active.)
“Giving AI control of our nukes” is a harrowing prospect and has dominated discussion of the risks of AI destabilizing international relations between WMD-armed nations. But there are many other ways rapidly advancing AI technology risks a tense, unstable world in which our international agreements could start to fray at the edges:
AI-powered detection and tracking is making it increasingly easier to find out where nuclear weapons are being moved to. This has driven China to build out its arsenal and Russia to develop anti-satellite capabilities for fear that AI could facilitate a U.S. nuclear advantage.
Generative AI is making it easier for the military to engage in deception, enabling data poisoning, spoofing, and automated disinformation that degrade decision-makers’ abilities to understand and control reality.
It’s also enabling new kinds of cyberattacks on nuclear command and control systems, which could in the most extreme cases disable nuclear systems or launch unauthorized attacks.
Advances in AI can also in some respects lead to greater stabilization, for example by leading to better arms control verification and faster and more reliable early-warning systems to avoid future Petrov-style judgment calls. As with many areas of technological development, things are most dangerous in the transition period where geopolitics is destabilized and before new regimes can ossify. It’s critical to navigate this transition period well and work to differentially develop AI capabilities that create more rather than less geopolitical stability.
The best philanthropic work in this area is done by my former colleagues at Longview Philanthropy who have driven the research to develop many of the insights of this subsection.
Beyond Catastrophic Harm:
Concentration of power
Technological progress and material abundance are prerequisites for a flourishing society. But history shows that bad governance can channel new opportunities for wealth to a tiny minority, sometimes even leaving the rest of society worse off. Moreover, the very technologies and resources that enable wealth and flourishing can directly cause bad governance and illegitimate power concentration. Advanced AI risks extreme versions of each of these known dynamics.
Concentration of Power:
The resource curse
Development economists talk about a “resource curse”: countries that become reliant on natural resource revenues often end up with weaker institutions and fewer checks on power than otherwise similar countries that are less resource dependent.
Take Venezuela. In 1973, the Arab oil embargo sent oil prices soaring. Venezuela — already a major producer — reaped the benefits. Oil came to supply up to 94% of their exports and 70% of government revenue. The state nationalized the sector to capture the windfall, and a spectacular boom followed: incomes tripled in just seven years, public spending exploded, and poverty fell. Major airlines offered special “shopping flights” so that Venezuelans could fly to Miami in the morning to buy consumer goods and then fly back in time for dinner. But the economy and the state had become almost totally dependent on oil.
In the 1980s, a global glut of oil pushed prices down — leading to inflation as high as 84%. Mass protests and repression followed the bust in Venezuela; creating a political opening for Hugo Chávez to eventually become president as a revolutionary critic of the prevailing order. Again-rising prices gave Chávez the leverage to further centralize power: purging the state oil company, neutering the courts, and capturing the media. After Chavez died and oil prices fell again, the hollowed-out political system collapsed into a humanitarian crisis.
If it’s possible to highlight a single cause, it’s not sudden wealth per se. The problem was that the source of wealth — oil money — didn’t depend on the economic participation of a broad base of citizens. When a state can fund itself with resource rents rather than taxation, ordinary people have less bargaining power. Patronage, corruption, and populism have more room to take hold. The mechanisms are often self-reinforcing: more discretionary rents let those with power buy more “loyalty” and parallel structures of power to democratic government, further shielding them from constraints or accountability.

Fig. 23 Current AI capabilities in varying domains.
Source: A Definition of AGI
Why Economic Progress has Worked for Us So Far
Economic growth and technological progress are forces that delivered billions from absolute poverty, and which sustain and enable civilization. But they are not benevolent forces which automatically make the world better for everyone. New technologies create new incentives and capabilities, reshaping the distribution of power. And as the Venezuelan example shows, an economic boom can sow the seeds of political centralization and institutional decline.
To date, most countries have experienced technological and economic progress as a dramatic success for two reasons. First, the steady stream of technological progress since the Industrial Revolution gave us machines that overall complement human labor. We still have jobs to do, and the machines we’ve built make the human work to operate those machines all the more valuable. So real wages and standards of life rose many-fold in frontier economies over the last two centuries across the entire population.
And second, many societies were able to chart the ‘narrow corridor’ between enough centralized power to provide order, and strong enough societies to constrain that power. Technologies of control increased state capacity, but civil society co-evolved the powers to resist, organize, and exert leverage. Together, technology that complements human labor and an equilibrium of balanced power between states and civil society have enabled liberal democracy to flourish. Both of these virtues of the modern political economy may be at risk amidst advanced AI.

Fig. 24 Google Gemini explains concentration of power.
Source: Gemini
Power After AGI
There are two features of AGI which could spell disaster for liberal democracy and the equitable distribution of power. The first is economic: the very promise of AGI is of a technology that increasingly substitutes for humans, across an increasingly wide range of economically useful tasks, to cheaply automate human labor.
The second feature is loyalty. AI systems promise to follow orders without question— unlike humans, who even with their backs against the wall still leak information, make compromising mistakes, blow the whistle, and courageously refuse orders. This is the hidden cost of AI alignment.
Without preparation, these two features suggest several credible stories of future power concentration.
Economic disempowerment. Near-total automation combines with extreme economic concentration to leave a large majority of people economically disenfranchised.
AI-enabled coups. An individual or small group with control over AI capabilities could grab enormous amounts of power for themselves, even in an established democracy.
Totalitarianism. AI-enabled powers of enforcement and surveillance allow governments to extend their control, cementing the power of existing leadership and undermining civil society.
Ideological homogeneity. As AI comes to substitute for most human decision-making, opportunities for meaningful deliberation and moral progress are foreclosed.
Concerns about economic disempowerment are motivated by the real possibility of a declining human labor share. Fears that automation would lead to mass unemployment are not new. Most jobs in 1800 are now lost to history as a result of automation. Yet real wages have risen, living standards improved, and power did not massively centralize.
The reason AGI could be different is that AGI could eventually increase labor supply in a very general way. Because the cost to run AI and robot systems is not guaranteed to be higher than the cost to pay human laborers, human wages may need to sharply decline to compete with cheap AI. This long-run trend portends that at least the fraction of income going towards rewarding human work declines, going instead to rents on AI systems, other capital, and land. We already see the first glimmers of restructuring: software companies seem to be hiring fewer interns, giving their intern-level problems to AI coders instead. The process could proceed up the management levels of the corporate pyramid and perhaps even culminate in fully automated companies.
In the short term, the wave of AI-enabled automation looks set to be a genuine boon. The scramble to collect training data and unblock other bottlenecks to widespread automation could cause wages to soar. But if a large fraction of human tasks are automated, wages would collapse, with virtually all income coming from rents on capital from owning part of the automated economy. In that world, the owners of the economy have all of the capital and all of the social, cultural, and political bargaining power.

Fig. 25 Economic growth compared to wages after AGI.
Source: International Monetary Fund
AI-enabled coups lead to a similar destination, but more directly. Over the course of AI development, a small number of people will, by default, confront the opportunity to grab power for themselves illegitimately. Military or government leaders could replace human personnel with AI systems which are fully loyal to them, and therefore easier to cajole into carrying out a law-breaking coup. A CEO of the most powerful AI company could build AI systems which are secretly loyal to them: an autonomous military system which appears loyal in early testing until it’s too late to stop it. Or government leaders and AI companies could negotiate exclusive access to superhuman capabilities in areas like weapons development and cyber offense, which they can use to build up even more power. For more on this set of possibilities, see Forethought Research’s report on AI-enabled coups.
The potential for totalitarian repression grows with the technological powers at the hands of the regime: for monitoring, surveillance, even mechanized killing. Over the 20th century alone, billions of people lived and died under the miserable strictures of such a regime: freedoms of speech and association limited, often unfree to leave.
Historically, totalitarian and autocratic states have faced three key vulnerabilities that led to their downfall: external competition with more economically successful countries, internal resistance from defectors, and the “succession problem,” i.e. the autocrat’s death.10 It’s at least possible that AGI could solve all of these problems. As discussed, freer countries might lose their inherent economic advantage, if human dynamism and ingenuity becomes less important than AI-directed growth. Second, massively expanded surveillance constantly processed by AI systems could well undermine any last possibility of successful organized resistance, especially since AI capabilities — being centralized by nature — could easily be withheld from citizens. Finally, loyal AI systems can carry on the goals of a human dictator long after he dies. Combined, these effects could be very hard to reverse from the inside or outside — like the tragic situation of North Korea, where the democratic world looks on with a sense of powerlessness.
On the other hand, less autocratic governments could simply fall behind the required pace and adaptability needed to remain truly relevant. Especially in liberal democracies, government institutions are subject to lengthy review processes and checks, becoming increasingly unfit for the task of agilely reacting to the pace of AGI. Such states could falter in competition with more singular and autarchic forms of government.
Last is the problem of ideological homogeneity. If we hand off political decision-making to AI systems by any means — via coup, economic concentration, totalitarianism, or democratic decision — the way that these systems make decisions will become the prevailing ideology with no external competition. In the most extreme case, this could look like singular AI systems (like a government Manhattan Project) with unified ideologies making all political and economic decisions. In a moderate case, it could be four leading models trained to have specific, democratically inspired constitutions, or an ecosystem of competing models gradually optimizing their outputs for greater economic control. Until we can be confident that the delicate spirit of democratic deliberation and moral progress is truly in the driver’s seat, the best, most humanistic ideas could lose out to those that are most persuasive, memetically fit, economically useful, or attuned to one highly specific way of implementing a moral constitution.
On the other hand, forerunners like Audrey Tang have shown that AI could itself enable deeper kinds of democratic deliberation. AI could facilitate healthy pluralism and expand human agency to enable us to make good collective decisions together and live the lives we want, without enforcing one worldview or way of living on everyone.
Ultimately, if the trends in AI progress continue, then it is not an exaggeration to say the world will soon be transformed. And we are far behind schedule on understanding how to wisely make it through.
← Previous
Overview
Next →
The Solution
The Foundation Layer is a project by Tyler John, with generous support from
Effective Institutions Project and Juniper Ventures.


