
The Foundation Layer
A philanthropic strategy for the AGI transition
by TYLER JOHN
1. This is a simplification. Some long words are broken into multiple tokens, and the same word is typically represented by several different tokens. A word with a space before it is treated as a different token from the same word at the beginning of a sentence, since these tokens often function differently in language: one starts a sentence, the other modifies what came before it.
2. We don’t want it to get too good at predicting the dataset! We want it to get pretty good but not perfect, so that it learns how the rules of English and ordinary conversation work in general, rather than memorizing how they happen to work in this one specific dataset (a failure mode known as “overfitting”).
3. One more noteworthy related technique is “supervised learning.” Like reinforcement learning, it uses problems with known answers, but instead of letting the model try answers on its own and rewarding success, the training data shows the model the correct answer directly, like a teacher correcting its work.
How AI Works
Large language models, like ChatGPT, Grok, Gemini, and Claude, are at their root statistical prediction algorithms that try to predict the next word based on their training data. This “base model” is then augmented in various ways, such as with “reinforcement learning.” Here is an accessible introduction to all of the key things that go into an LLM so the reader can understand how they work and form clearer intuitions about what LLMs are like.
Building a Base Model
In the first days of its life, a large language model is just a huge collection of random numbers. These numbers are connected to each other so that what happens to one number affects what happens to other numbers. This is called a “neural net.”

Fig. A1 Image of a neural net.
Source: How Backpropagation Works
The neural net’s job is to take some numbers as input and then specify what numbers it thinks it will see next, assigning probabilities between 0% and 100% to every number it might possibly see.
So, for example, you could feed a neural net a sequence of numbers like [1, 5, 1000], and it could say that it thinks there is a 5% chance the next number is a 1, a 5% chance the next number is a 77, and a 90% chance the next number is a 10. Remember, at first the neural net’s predictions are random, so it is not going to give us a very meaningful or rational answer.
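That operation can be sketched in a few lines of Python. This is a toy stand-in, not a real neural net: the vocabulary and the random scores below are invented for illustration. But the shape of the operation is the same: numbers in, a probability distribution over possible next numbers out.

```python
import math
import random

random.seed(0)

# Invented toy vocabulary: the handful of numbers this "net" can predict.
VOCAB = [1, 5, 10, 77, 1000]

def untrained_net(inputs):
    """A stand-in for an untrained neural net: it ignores its inputs,
    produces a random score for each candidate token, and converts the
    scores into probabilities with the softmax function."""
    scores = [random.uniform(-1, 1) for _ in VOCAB]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(VOCAB, exps)}

probs = untrained_net([1, 5, 1000])
print(probs)  # arbitrary probabilities, one per candidate, summing to 1
```

A real model’s scores come from billions of learned weights applied to the input rather than a random number generator, but its output has exactly this form.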
To teach a model to speak English, we need to give it some data to start predicting. So AI researchers build massive datasets of books, internet comment threads, and anywhere else they can find high-quality samples of English language. They then use a tool called a “tokenizer,” which essentially takes each word and converts it into a number.1 This is critical, since a neural net has no idea what words are or how to use them — all it knows how to do is take numbers in and predict the next number. These numbers are called “tokens.”
For example:
[The] = 285
[ cat] = 12
[ is] = 9987
[ on] = 99
[ the] = 8800
[ mat] = 913
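The table above can be turned into a toy tokenizer. The vocabulary and ids below are the invented ones from the example; real tokenizers have tens of thousands of entries and split unknown words into smaller pieces.

```python
# A toy tokenizer with the invented vocabulary from the example.
# Note that "The" and " the" get different ids, as footnote 1 describes.
VOCAB = {"The": 285, " cat": 12, " is": 9987, " on": 99, " the": 8800, " mat": 913}

def tokenize(text):
    """Greedily match the longest known piece at each position."""
    tokens = []
    while text:
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece):
                tokens.append(VOCAB[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no token matches: {text!r}")
    return tokens

print(tokenize("The cat is on the mat"))  # [285, 12, 9987, 99, 8800, 913]
```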
So now we have a random neural network and a huge dataset of English sentences and paragraphs where every single word has been converted into a number. Now we can start teaching the neural network to predict the next word. To do so, we feed it sentences, in the form of sequences of numbers, and ask it to predict the next word.
So the sentence: “The cat is on the…” is converted into numbers, [285, 12, 9987, 99, 8800], and given as an input to the neural network. Since all a neural network does is take a sequence of numbers and predict the probability of the next output, it will return a probability distribution over the next number, for example:
913 — 1%
252 — 3%
8800 — 25%
…
At first the neural network will be very bad at predicting the next word since it is initialized randomly. In this wholly fictional example it gives a 1% chance that the next word is “mat” and a 25% chance that it is “the.”
We give the neural network dozens to thousands of examples like this and see what it says. Then, so that it can get better at predicting the next word, we calculate a “loss function” — that is, a number that measures how far its probability distribution was from the true answer. Finally, we apply an algorithm called “gradient descent,” which slightly updates the model’s numbers (its “weights”) in whatever direction reduces the loss. The result is that we move the neural network in the direction of being better at predicting the data. If it gets good enough at predicting the data, it will start to give us correct answers, predicting that “The cat is on the” is likely to end with “mat.” This is how the model learns English!
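Here is a minimal sketch of that training loop. To keep it self-contained, the “model” is just one adjustable score per vocabulary token rather than a real neural net, and the token ids are the invented ones from the example; but the loss and the gradient-descent update work the same way.

```python
import math

VOCAB = [285, 12, 9987, 99, 8800, 913]  # invented token ids from the example
logits = [0.0] * len(VOCAB)             # the model's adjustable "weights"

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(true_token, lr=0.5):
    """One step of gradient descent on the cross-entropy loss."""
    probs = softmax(logits)
    target = VOCAB.index(true_token)
    loss = -math.log(probs[target])  # small when the true token gets high probability
    # For softmax plus cross-entropy, the gradient with respect to each logit
    # is simply (predicted probability minus 1 for the true token, else minus 0).
    for i in range(len(logits)):
        grad = probs[i] - (1.0 if i == target else 0.0)
        logits[i] -= lr * grad
    return loss

# "The cat is on the" should be followed by " mat" (token 913).
for _ in range(50):
    train_step(true_token=913)

probs = dict(zip(VOCAB, softmax(logits)))
print(probs[913])  # now close to 1: the model has learned to predict " mat"
```

A real model computes its scores from the input context using billions of weights, and gradient descent updates all of them at once via backpropagation; the sketch above shows only the outermost logic of the loop.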
To have a neural network that is fairly good at predicting the data, we have to run this process an enormous number of times (modern models are trained on trillions of tokens), gradually moving the neural network in the direction of predicting the dataset.2 The result is a model that speaks English by predicting the next token in a sentence. This simple model is known as a “base model.”
Making the model smart and helpful
A base model like the one we’ve just created isn’t very helpful. It knows many of the rules of English and how conversations work. But it doesn’t know that it’s an assistant model. It doesn’t know that a question calls for an answer. And it’s not going to be a polite and appropriate dinner guest. Because the model is only trying to predict what is in its data, its outputs are going to reflect what is in the data. If we trained the model on YouTube comments, it’s going to reply to our prompts like it is finishing an average YouTube comment. If we trained it on a textbook, it’s going to reply to our prompts like it is continuing to write a textbook.
If you ask a modern model “Where can I find the nearest drive-in movie theater?” it will answer like a helpful assistant:
“That is an excellent question! Let me ask you a few more questions to learn your preferences and location so I can help you.”
But if you ask a base model this, it will answer like it is finishing a sentence in its data:
“Where can I find the nearest drive-in movie theater?” James asked a passerby, who turned her head in his direction to assess the situation.
Moreover, base models aren’t trained for safety or politeness. They may scream at you, curse you out, lie, or give you instructions for building a biological weapon.
To make a model useful, we need to apply “post-training” techniques. The two most important post-training techniques are known as “fine-tuning” and “reinforcement learning.”
Fine-tuning involves giving an AI model specific, smaller datasets — such as AI assistant chat logs — and asking it to update in the direction of predicting that new data. Since the model has already learned English, we can now teach it to speak in specific ways, for example by writing like an AI assistant.
Reinforcement learning3 involves giving a model structured tasks with known answers, asking the AI model to throw itself at those tasks over and over again until it gets the right answer, and then updating the model in the direction of giving those answers in the future. Reinforcement learning is a very old technique that has long been used for game-playing agents. For example, you can teach an agent how to play Pong by rewarding it every time it scores a point and penalizing it every time a point is scored against it. By reinforcing whatever random behavior the AI did that happened to perform well, over and over again, the model gradually learns behavior that scores points against the opponent.

Fig. A2 Pong.
Source: FreePong.org
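The try-reward-reinforce loop can be compressed into a tiny runnable sketch. Everything here is invented for illustration: a three-action “game” in which “up” secretly earns the reward, and a REINFORCE-style update that makes rewarded actions more likely.

```python
import math
import random

random.seed(0)

ACTIONS = ["up", "down", "stay"]  # invented toy action set
prefs = [0.0, 0.0, 0.0]           # the agent's learned preference for each action

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def play_round(lr=0.2):
    """Try an action at random, get a reward, and reinforce accordingly."""
    probs = softmax(prefs)
    action = random.choices(range(len(ACTIONS)), weights=probs)[0]
    # The "environment": in this toy game, "up" is secretly the right move.
    reward = 1.0 if ACTIONS[action] == "up" else -1.0
    # REINFORCE-style update: rewarded actions become more likely,
    # penalized actions less likely.
    for i in range(len(prefs)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        prefs[i] += lr * reward * grad

for _ in range(500):
    play_round()

best = max(range(len(ACTIONS)), key=lambda i: prefs[i])
print(ACTIONS[best])  # prints "up": the rewarded behavior has been reinforced
```

A real Pong agent plays whole games of many actions and must figure out which of them earned the points, but the core loop of try, reward, reinforce is the same.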
In large language models, reinforcement learning is commonly used to teach models how to code. You can give a base model that has already learned Python coding problems with automatically checkable answers (for example, problems with test cases), ask it to try things until it succeeds, and then reinforce whatever behavior solved the problem. Do this enough times and the model will be very good at coding.
Reinforcement learning is also used to teach AI models ethics. If you have a set of ethics questions or ethically significant tasks with correct answers provided by humans, you can have the model try to answer the questions correctly over and over again and reward it when it gives the right answer.
Together fine-tuning and reinforcement learning move a model’s next token prediction in a helpful direction. It speaks English (or another language, such as Chinese, Java, C#, Python, or mathematics) in patterns that are consistently helpful to us.
This is why it is sometimes said today that AI models are no longer next-token prediction algorithms. Fundamentally, they are next-token prediction algorithms. But using post-training techniques we move them away from predicting the specific data they were trained on and towards creating outputs that are useful for accomplishing other goals. Because of reinforcement learning, they start to behave less like models that are trying to predict the next token and more like game-playing models that we’ve taught to play Pong — or in this case, that we’ve taught to code, reason, and answer mathematical questions.
The Black Box
At some level we know how AI models work. The base model makes statistical predictions about the most likely next word given what was most likely to appear in its dataset. After post-training, the model tries to make statistical predictions about the best behavior to take based on what succeeded during reinforcement learning. But during training AI models learn all kinds of random tricks that help them to succeed, and we don’t know what these tricks are.
AI model training is a lot like our own evolutionary history. Our environment created various selection pressures and the most fit organisms survived. In order to survive, they learned all kinds of tricks: detecting patterns of light by evolving eyes, differentiating chemicals by evolving smell, and developing hunting capabilities by evolving strong legs and sharp teeth. Platypuses developed electroreception. Cordyceps learned to mind-control ants. Plants learned to turn photons into energy.
The AI models we train develop similar tricks. They learn whatever concepts, techniques, and heuristics best allow them to achieve good results. But we don’t understand the concepts, techniques, and heuristics they develop. We don’t know what AI models know, what they can do, or how they learn things.
One of the most striking examples of our ignorance was discovered by some of my grantees at Harvard University. These researchers discovered that AI systems can develop complete internal world models without us knowing.
These researchers trained an AI model to play the board game Othello. This AI model didn’t know what board games are, what Othello is, or that it was playing a game. It didn’t know English or any other language. All it knew was that it got inputs like “F4, F3, D2” and was trained to produce outputs like “B3” and “H6,” and this was the sum total of the model’s knowledge.
But once the model was trained to correctly predict the next moves in Othello games, researchers found that it had learned the rules of Othello and had even built an internal representation of the Othello board!

Fig. A3 Interpretability studies on Othello-GPT found that it had an internal representation of the board state even though it had only ever seen letters and numbers.
Source: Othello-GPT Has A Linear Emergent World Representation
In the model’s entire life it only saw sequences of moves like “F4, F3, D2,” and yet inside its neural net it formed a complete representation of the board!
Before this study, we didn’t know how this model, “Othello-GPT”, was managing to pick good moves. But after the study, we learned that part of how it was winning was that it had created a “mental map” of the board.
Modern language models are much bigger and more complicated than Othello-GPT and we don’t know very much at all about what they learn to achieve good performance. They could have entire world models inside of them, like Othello-GPT, and we would have no idea.
This is why AI models are considered “black box models.” We know that they map inputs to outputs, and that training adjusts the neural net in the direction of better predictions. But we don’t know how they do what they do. AI models are grown rather than coded. No one writes the code that they use, and no one could write down in any simple way how these models make decisions if they tried to. Typically, we don’t know what the models can do until we deploy them and start asking them questions.
Researchers in the field of mechanistic interpretability, like those who made the Othello-GPT discovery, are working on figuring out how AI models solve problems. For example, researchers then at OpenAI located individual numbers in image-recognition neural nets that act as “detectors” for wheels, windows, or car bodies, and teams at Anthropic have since found similarly interpretable features inside large language models. In another project, OpenAI used GPT-4 to propose explanations for every neuron inside the much smaller GPT-2, though only a small fraction of those explanations held up well. Understanding exactly why and how AI models make the decisions they do is one of the great intellectual frontiers in AI research. Until it is solved (if it can be solved), our knowledge of AI systems’ internal operations will be very limited.

Fig. A4 How a neural net might represent a car in its weights.
The Foundation Layer is a project by Tyler John, with generous support from
Effective Institutions Project and Juniper Ventures.

