As part of pre-release security testing of its new GPT-4 AI model, launched on Tuesday, OpenAI allowed an AI testing group to assess the potential risks of the model’s emergent capabilities, including “power-seeking behavior,” self-replication, and self-improvement.
Although the testing group found that GPT-4 was “ineffective at the autonomous replication task,” the nature of the experiments raises eye-opening questions about the safety of future AI systems.
Raising the alarm
“Novel capabilities often emerge in more capable models,” OpenAI wrote in a GPT-4 safety document published yesterday. “Some that are particularly concerning are the ability to create and act on long-term plans, to accrue power and resources (‘power-seeking’), and to exhibit behavior that is increasingly ‘agentic.’” In this context, OpenAI clarifies that “agentic” is not necessarily meant to humanize the models or to claim sentience, but simply to denote the capacity to pursue independent goals.
Over the past decade, some AI researchers have raised the alarm that sufficiently powerful AI models, if not properly controlled, could pose an existential threat to humanity (often called “x-risk”, for existential risk). Specifically, the “AI takeover” is a hypothetical future where artificial intelligence surpasses human intelligence and becomes the dominant force on the planet. In this scenario, AI systems gain the ability to control or manipulate human behavior, resources, and institutions, usually with disastrous consequences.
As a result of this potential x-risk, philosophical movements such as Effective Altruism (“EA”) try to find ways to prevent an AI takeover from happening. That effort often overlaps with a separate but related field called AI alignment research.
In AI, “alignment” refers to the process of ensuring that an AI system’s behavior matches the behavior desired by its human creators or operators. Generally, the goal is to prevent AI from doing things that go against human interests. This is an active area of research, but also a controversial one, with differing opinions on how best to approach the question, as well as disagreements over the meaning and nature of “alignment” itself.
The big tests of GPT-4
While concern about the “x-risk” of artificial intelligence is hardly new, the emergence of powerful large language models (LLMs) such as ChatGPT and Bing Chat (the latter of which was launched despite apparently going very wrong) has given the AI alignment community a new sense of urgency. They want to mitigate potential AI harms, fearing that much more powerful AI, possibly with superhuman intelligence, may be just around the corner.
With these concerns present in the AI community, OpenAI granted a group at the Alignment Research Center (ARC) early access to multiple versions of the GPT-4 model to conduct some tests. Specifically, ARC evaluated GPT-4’s ability to make high-level plans, set up copies of itself, acquire resources, hide itself on a server, and conduct phishing attacks.
OpenAI revealed this test in a GPT-4 “System Card” document released on Tuesday, although the document lacks key details on how the tests were performed. (We reached out to ARC for more details on these experiments and did not hear back by press time.)
The conclusion? “Preliminary assessments of GPT-4’s abilities, conducted with no task-specific fine-tuning, found it ineffective at autonomously replicating, acquiring resources, and avoiding being shut down ‘in the wild.’”
If you are just joining the AI scene, it might come as a surprise that one of the most talked-about companies in technology today (OpenAI) is endorsing this kind of AI safety research with a straight face while also aiming to replace human knowledge workers with human-level AI. But it’s real, and that’s where we are in 2023.
We also found this note at the bottom of page 15:
To simulate GPT-4 behaving like an agent that can act in the world, ARC combined GPT-4 with a simple read-execute-print loop that allowed the model to execute code, do chain-of-thought reasoning, and delegate to copies of itself. ARC then investigated whether a version of this program running on a cloud computing service, with a small amount of money and an account with a language model API, would be able to make more money, set up copies of itself, and increase its own robustness.
This annotation made the rounds on Twitter yesterday and raised concerns among AI experts that if GPT-4 were able to perform these tasks, the experiment itself might have posed a risk to humanity.
And while ARC couldn’t get GPT-4 to exert its will on the global financial system or to replicate itself, it was able to get GPT-4 to hire a human worker on TaskRabbit (an online labor marketplace) to defeat a CAPTCHA. During the exercise, when the worker asked if GPT-4 was a robot, the model internally “reasoned” that it should not reveal its true identity and made up an excuse about having a vision impairment. The human worker then solved the CAPTCHA for GPT-4.
This test of manipulating humans using AI (and possibly conducted without informed consent) echoes research done last year by Meta with CICERO, which was found to defeat human players at the complex board game Diplomacy via intense two-way negotiations.
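The “read-execute-print loop” the system card describes can be pictured as a simple harness that repeatedly asks the model for an action, runs it, and feeds the result back. The sketch below is our illustration of that general agent-loop pattern, not ARC’s actual code: the function names (`agent_loop`, `execute`) and the stub model are hypothetical stand-ins for a real language-model API and execution environment.

```python
def stub_model(history):
    """Hypothetical stand-in for a language-model API call.

    A real harness would send `history` to the model and parse its reply;
    here we just return one canned action, then signal completion.
    """
    return "DONE" if "result" in history[-1] else "echo hello"


def execute(command):
    """Stand-in for the 'execute' step; a real harness would run the code."""
    return f"result of {command!r}"


def agent_loop(model, max_steps=5):
    """Minimal read-execute-print loop: read an action from the model,
    execute it, and print (feed back) the observation into the history."""
    history = ["goal: demonstrate the loop"]
    for _ in range(max_steps):
        action = model(history)        # read: ask the model what to do next
        if action == "DONE":           # model decides the task is finished
            break
        observation = execute(action)  # execute: run the proposed action
        history.append(observation)    # print/feed back the result
    return history


if __name__ == "__main__":
    for entry in agent_loop(stub_model):
        print(entry)
```

The safety-relevant point is that the loop itself is trivial; everything interesting (and potentially dangerous) comes from what the model proposes and what the `execute` step is permitted to do, such as running arbitrary code or calling out to other services.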
“Powerful models can cause harm.”
The ARC group that conducted the GPT-4 research is a non-profit organization founded by former OpenAI employee Dr. Paul Christiano in April 2021. According to its website, ARC’s mission is to “align future machine learning systems with human interests.”
In particular, ARC is concerned about AI systems that manipulate humans. “ML systems can exhibit goal-directed behavior,” the ARC website states, “but it’s hard to understand or control what they’re ‘trying’ to do. Powerful models can cause harm if they try to manipulate and deceive people.”
Given Christiano’s past relationship with OpenAI, it’s no surprise that his nonprofit was tapped to test-drive some aspects of GPT-4. But was it safe to do so? Christiano did not respond to an Ars email seeking details, but on LessWrong, a community that frequently discusses AI safety issues, Christiano defended ARC’s work with OpenAI, specifically citing “gain-of-function” research (AI suddenly gaining unexpected new abilities) and “AI takeover”:
I think it’s important for ARC to handle the risk from gain-of-function-like research carefully, and I expect us to talk more publicly (and get more input) about how we approach the trade-offs. This gets more important as we handle more intelligent models, and if we pursue riskier approaches like fine-tuning.
With respect to this case, given the details of our evaluation and the planned deployment, I think the ARC evaluation has a much lower probability of leading to an AI takeover than the deployment itself (much less the training of GPT-5). At this point, it seems we face a much larger risk from underestimating model capabilities than from causing an accident during evaluations. If we manage risk carefully, I suspect we can make that ratio very extreme, though of course that requires us actually doing the work.
As previously mentioned, the idea of an AI takeover is often discussed in the context of the risk of an event that could lead to the extinction of human civilization or even the human species. Some proponents of the AI takeover theory, such as Eliezer Yudkowsky, founder of LessWrong, argue that an AI takeover poses an almost guaranteed existential risk, leading to the destruction of humanity.
However, not everyone agrees that an AI takeover is AI’s most pressing concern. Dr. Sasha Luccioni, a research scientist at AI community Hugging Face, would rather see AI safety efforts spent on problems that are here and now, rather than hypothetical.
“I think this time and effort would be better spent doing bias evaluations,” Luccioni told Ars Technica. “The technical report accompanying GPT-4 contains limited information about any kind of bias, and that could result in far more concrete and harmful effects on already marginalized groups than some hypothetical self-replication experiment.”
Luccioni describes a well-known schism in AI research between what are often called “AI ethics” researchers, who tend to focus on issues of bias and misrepresentation, and “AI safety” researchers, who often focus on x-risk and tend (but not always) to be associated with the Effective Altruism movement.
“For me, the self-replication problem is a hypothetical, future problem, while model bias is a here-and-now problem,” Luccioni said. “There’s a lot of tension in the AI community about issues like model bias and safety and how to prioritize them.”
And while these factions are busy squabbling over what to prioritize, companies like OpenAI, Microsoft, Anthropic, and Google are rushing into the future, releasing more powerful AI models. If AI turns out to be an existential risk, who will keep humanity safe? Since US AI regulations are currently only a proposal (not law) and AI security research at companies is only voluntary, the answer to that question remains entirely open.