Large language models like OpenAI’s GPT-3 are massive neural networks that can generate human-like text, from poetry to programming code. Trained on vast amounts of internet text, these machine learning models take a small bit of input text and then predict the text that is likely to come next.
But that’s not all these models can do. Researchers are investigating a curious phenomenon known as in-context learning, in which a large language model learns to perform a task after seeing only a few examples, even though it was never trained for that task. For example, someone could feed the model several example sentences and their sentiments (positive or negative), then prompt it with a new sentence, and the model will return the correct sentiment.
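The sentiment example above boils down to assembling a prompt of labeled examples followed by an unlabeled query. Here is a minimal sketch of how such a few-shot prompt might be built; the example sentences and the `build_prompt` helper are illustrative inventions, and the resulting string would be sent to any text-completion model.

```python
# Hypothetical few-shot sentiment prompt. The model never sees a weight
# update: all "training signal" lives in the prompt text itself.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I regret wasting two hours on this film.", "negative"),
]

def build_prompt(examples, query):
    """Concatenate labeled examples, then the unlabeled query sentence."""
    lines = [f"Sentence: {s}\nSentiment: {label}" for s, label in examples]
    lines.append(f"Sentence: {query}\nSentiment:")
    return "\n\n".join(lines)

prompt = build_prompt(examples, "An absolute masterpiece of storytelling.")
print(prompt)
```

A completion model given this string would be expected to continue it with "positive", completing the pattern without any parameter updates.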
Typically, a machine learning model such as GPT-3 would need to be retrained with new data to perform a new task. During that training process, the model updates its parameters as it processes new information. But with in-context learning, the model’s parameters are not updated, so it looks as if the model learns a new task without learning anything at all.
Scientists from MIT, Google Research, and Stanford University are working to unravel this mystery. They studied models that are very similar to large language models to see how they can learn without updating their parameters.
The researchers’ theoretical results show that these massive neural network models can contain smaller, simpler linear models buried inside them. The large model can then implement a simple learning algorithm to train this smaller linear model to perform a new task, using only information already contained within the larger model. Its parameters remain fixed throughout.
An important step toward understanding the mechanisms behind in-context learning, this research opens the door to further study of the learning algorithms these large models can implement, says Ekin Akyürek, a computer science graduate student and lead author of a paper exploring the phenomenon. With a better understanding of in-context learning, researchers could enable models to perform new tasks without the need for costly retraining.
“Typically, if you want to fine-tune these models, you have to collect domain-specific data and do some complex engineering. But now we can just feed it an input, five examples, and it accomplishes what we want. In-context learning is an incredibly efficient learning phenomenon that needs to be understood,” Akyürek says.
Akyürek is joined on the paper by Dale Schuurmans, a Google Brain research scientist and professor of computing science at the University of Alberta; as well as senior authors Jacob Andreas, the X Consortium Assistant Professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); Tengyu Ma, an assistant professor of computer science and statistics at Stanford; and Denny Zhou, principal scientist and research director at Google Brain. The research will be presented at the International Conference on Learning Representations.
A model within a model
Many scientists in the machine learning research community have come to believe that large language models can perform in-context learning because of how they are trained, Akyürek says.
For example, GPT-3 has hundreds of billions of parameters and was trained by reading huge swaths of text on the internet, from Wikipedia articles to Reddit posts. So, when someone shows the model examples of a new task, it has likely already seen something very similar, because its training dataset included text from billions of websites. On this view, the model repeats patterns it has seen during training rather than learning to perform new tasks.
Akyürek hypothesized that in-context learners aren’t just matching patterns they have seen before; they are actually learning to perform new tasks. He and others had experimented by prompting these models with synthetic data they could not have seen anywhere before, and found that the models could still learn from just a few examples. Akyürek and his colleagues thought that perhaps these neural network models contain smaller machine learning models inside them that the large models can train to perform a new task.
“It can explain almost all the learning phenomena we’ve seen with these large models,” he says.
To test this hypothesis, the researchers used a neural network model called a transformer, which has the same architecture as GPT-3 but was specifically trained for in-context learning.
By studying the architecture of this transformer, they theoretically proved that it could write a linear model in its hidden states. A neural network is made up of many layers of interconnected nodes that process data. Hidden states are the layers between the input and output layers.
Their mathematical analysis shows that this linear model is written somewhere in the earliest layers of the transformer. The transformer can then update the linear model by implementing simple learning algorithms.
Essentially, the model simulates and trains a smaller version of itself.
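As a concrete illustration of the kind of "simple learning algorithm" described above, here is a toy sketch of gradient descent on a least-squares linear model, the setting the researchers study. The transformer itself is not shown; the paper’s claim is that a transformer can carry out updates like these implicitly in its hidden states, while this standalone version only demonstrates the inner algorithm.

```python
import numpy as np

# A toy linear-regression task, standing in for the in-context examples
# a transformer would receive in its prompt.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])          # the hidden linear task
X = rng.normal(size=(32, 2))            # in-context example inputs
y = X @ w_true                          # in-context example labels (noiseless)

# Plain gradient descent on the least-squares objective: the "smaller,
# simpler linear model" being trained without touching any outer weights.
w = np.zeros(2)
lr = 0.1
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(X)   # gradient of mean squared error
    w -= lr * grad

print(np.round(w, 3))                   # converges toward w_true
```

With noiseless data and enough steps, the recovered weights match the true task weights, which mirrors how the probing results (below in the article) find the linear model’s solution encoded in the network.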
Probing hidden layers
The researchers investigated this hypothesis using probing experiments, in which they searched the transformer’s hidden layers to try to recover a certain quantity.
“In this case, we tried to recover the actual solution to the linear model, and we could show that the parameter is written in the hidden states. This means the linear model is in there somewhere,” he says.
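A probing experiment of this flavor can be sketched in a few lines: fit a simple linear readout from hidden activations to the quantity of interest and check how well it is recovered. The "hidden states" below are synthetic stand-ins, not activations from a real transformer, so this only illustrates the methodology, not the paper’s actual experiment.

```python
import numpy as np

# Synthetic stand-in for transformer hidden states: we plant a scalar
# "parameter" in the activations via a fixed linear encoding, then test
# whether a linear probe can read it back out.
rng = np.random.default_rng(1)
n, d = 200, 16
hidden = rng.normal(size=(n, d))        # fake hidden states, one row per input
encoding = rng.normal(size=d)           # how the parameter is "written" in them
param = hidden @ encoding               # the quantity a probe should recover

# Fit the linear probe by least squares and measure recovery error.
probe, *_ = np.linalg.lstsq(hidden, param, rcond=None)
error = np.abs(hidden @ probe - param).max()
print(error < 1e-6)                     # parameter is linearly recoverable
```

If the quantity were not linearly encoded in the hidden states, the probe’s error would stay large, which is what makes probing a useful diagnostic.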
Building on this theoretical work, the researchers may be able to enable a transformer to perform in-context learning by adding just two layers to the neural network. Many technical details remain to be worked out, Akyürek cautions, but it could help engineers create models that can complete new tasks without the need for retraining with new data.
“The paper sheds light on one of the most remarkable properties of modern large language models: their ability to learn from their input without explicit training. Using the simplified case of linear regression, the authors show theoretically how models can implement standard learning algorithms while reading their input, and empirically which learning algorithms best match their observed behavior,” says Mike Lewis, a research scientist at Facebook AI Research who was not involved with this work. “These results are a stepping stone to understanding how models can learn more complex tasks, and will help researchers design better training methods for language models to further improve their performance.”
Moving forward, Akyürek plans to continue exploring in-context learning with functions that are more complex than the linear models studied in this work. The researchers could also apply these experiments to large language models to see whether their behaviors are also described by simple learning algorithms. In addition, he wants to dig deeper into the types of pretraining data that can enable in-context learning.
“With this work, people can now imagine how these models can learn from examples. So, my hope is that it changes some people’s views about in-context learning,” Akyürek says. “These models are not as stupid as people think. They don’t just memorize these tasks. They can learn new tasks, and we have shown how that can be done.”