#434 — Can We Survive AI?
Episode Stats
Words per Minute
188.68
Summary
Eliezer Yudkowsky and Nate Soares were among the first people to make Sam concerned about artificial intelligence. They share the story of how they became concerned about the prospect of superhuman AI, and why they think we should all be worried about it.
Transcript
00:00:00.000
Welcome to the Making Sense Podcast. This is Sam Harris. Just a note to say that if you're
00:00:11.780
hearing this, you're not currently on our subscriber feed and will only be hearing
00:00:15.740
the first part of this conversation. In order to access full episodes of the Making Sense
00:00:20.100
Podcast, you'll need to subscribe at samharris.org. We don't run ads on the podcast, and therefore
00:00:26.260
it's made possible entirely through the support of our subscribers. So if you enjoy what we're
00:00:30.240
doing here, please consider becoming one. I am here with Eliezer Yudkowsky and Nate Soares.
00:00:41.140
Eliezer, Nate, it's great to see you guys again.
00:00:44.820
Been a long time. So you were, Eliezer, you were among the first people to make me concerned about
00:00:52.620
AI, which is going to be the topic of today's conversation. I think many people who are
00:00:57.000
concerned about AI can say that. First, I should say you guys are releasing a book, which will be
00:01:01.940
available, I'm sure, the moment this drops: If Anyone Builds It, Everyone Dies: Why Superhuman
00:01:08.980
AI Would Kill Us All. I mean, the book is, its message is fully condensed in that title. I mean,
00:01:15.720
we're going to explore just how uncompromising a thesis that is and how worried you are and how
00:01:23.620
worried you think we all should be. But before we jump into the issue, maybe tell the
00:01:28.840
audience how each of you got into this topic. How is it that you came to be so concerned about the prospect of superhuman AI?
00:01:37.400
Well, in my case, I guess I was sort of raised in a house with enough science books and enough
00:01:43.920
science fiction books that thoughts like these were always in the background.
00:01:48.360
Vernor Vinge is where there was a key click moment of observation. Vinge pointed out
that at the point where our models of the future predict building anything smarter than us, then,
said Vinge at the time, our crystal ball explodes past that point. It is very hard, said Vinge, to
00:02:08.420
project what happens if there's things running around that are smarter than you,
00:02:11.960
which in some sense you can see as a sort of central thesis, not in the sense that I have
believed it the entire time, but in the sense that some parts of it I believe and some parts
I react against and say, like, no, maybe we can say the following thing under the following circumstances.
00:02:28.120
Initially, I was young. I made some metaphysical errors of the sort that young people do. I thought
00:02:34.320
that if you built something very smart, it would automatically be nice because, hey, over the course of
00:02:39.340
human history, we'd gotten a bit smarter. We'd gotten a bit more powerful. We'd gotten a bit
00:02:43.480
nicer. I thought these things were intrinsically tied together and correlated in a very solid and
00:02:47.620
reliable way. I grew up, I read more books. I realized that was mistaken. And 2001 is where the
00:02:55.400
first tiny fringe of concern touched my mind. It was clearly a very important issue, even if
00:03:00.860
I thought there was just a little tiny remote chance that maybe something would go wrong. So I studied
00:03:05.820
harder. I looked into it more. I asked, how would I solve this problem? Okay, what would go wrong with
00:03:10.800
that solution? And around 2003 is the point at which I realized like, this was actually a big deal.
00:03:18.100
And as for my part, yeah, I was 13 in 2003, so I didn't get into this quite as early as Eliezer.
00:03:24.980
But in 2013, I read some arguments by this guy called Eliezer Yudkowsky, who sort of laid out
00:03:34.040
the reasons why AI was going to be a big deal and why we had some work to do to do the job right.
00:03:39.960
And I was persuaded. And one thing led to another. And next thing you knew, I was running the Machine
00:03:45.860
Intelligence Research Institute, which Eliezer co-founded. And then fast forward 10 years after
00:03:51.480
that, here I am writing a book. Yeah. So you mentioned MIRI. Maybe tell people what the mandate
00:03:59.320
of that organization is and maybe how it's changed. I think you indicated in your book that
00:04:05.080
your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse.
00:04:12.960
Yeah. So the mission of the org is to ensure that the development of machine intelligence
00:04:17.580
is beneficial. And Eliezer can speak to more of the history than me because he co-founded it and I
00:04:23.940
joined, you know. Well, initially, it seemed like the best way to do that was to run out there and
00:04:32.040
solve alignment. And there was, you know, a, shall we say, sad series of bits of news
about how possible that was going to be, how much progress was being made in that field relative
00:04:46.300
to the field of AI capabilities. And at some point it became clear that these lines were not going to
00:04:52.180
cross. And then we shifted to taking the knowledge that we'd accumulated over the course of trying
00:04:57.600
to solve alignment and trying to tell the world, this is not solved. This is not on track to be
00:05:03.520
solved in time. It is not realistic that small changes to the world can get us to where this
00:05:08.140
will be solved on time. Maybe, so we don't lose anyone... I would think 90% of the audience knows
00:05:13.720
what the phrase solve alignment means, but just talk about the alignment problem briefly.
00:05:18.640
So the alignment problem is how to make an AI, a very powerful AI. Well, the superintelligence
00:05:25.960
alignment problem is how to make a very powerful AI that steers the world sort of where the programmers,
00:05:33.200
builders, growers, creators wanted the AI to steer the world. It's not, you know,
00:05:39.200
necessarily what the programmers selfishly want. The programmers can have wanted the AI to steer it in
00:05:44.300
nice places. But the point is making an AI that is trying to do the things the programmers wanted. You know,
00:05:51.320
when you build a chess machine, you define what counts as a winning state of the board. And then
00:05:56.180
the chess machine goes off and it steers the chessboard into that part of reality. So the ability to say
00:06:01.380
to what part of reality an AI steers is alignment. On the smaller scale today, though, it's a rather
00:06:08.520
different topic. It's about getting an AI whose output and behavior is something like what the
00:06:15.900
programmers had in mind. If your AI is talking people into committing suicide and that's not what
00:06:21.100
the programmers wanted, that's a failure of alignment. If an AI is talking people into suicide
00:06:26.200
and people who should not have committed suicide, but the AI talks them into it, and the programmers
did want that, that's what they tried to do on purpose: this may be a failure of niceness.
00:06:37.180
It may be a failure of beneficialness, but it's a success of alignment. The programmers got the AI
00:06:41.660
to do what they wanted it to do. Right. But I think more generally, correct me if I'm wrong,
00:06:46.180
when we talk about the alignment problem, we're talking about the problem of keeping super
00:06:51.520
intelligent machines aligned with our interests, even as we explore the space of all possible
00:06:58.080
interests and as our interests evolve. So that, I mean, the dream is to build super intelligence
00:07:05.300
that is always corrigible, that is always trying to best approximate what is going
00:07:11.020
to increase human flourishing, that is never going to form any interests of its own that are
00:07:17.720
incompatible with our well-being. Is that a fair summary?
00:07:20.900
I mean, there's three different goals you could be trying to pursue on a technical level here.
00:07:25.620
There's the super intelligence that shuts up, does what you ordered, has that play out the way you
00:07:30.560
expected it, no side effects you didn't expect. There's super intelligence that is trying to run
00:07:36.380
the whole galaxy according to nice benevolent principles and everybody lives happily ever
00:07:41.700
afterward, but not necessarily because any particular humans are in charge of it or are still
giving it orders. And third, there's super intelligence that is itself having fun
00:07:52.140
and cares about other super intelligences and is a nice person and leads a life well-lived
00:07:57.600
and is a good citizen of the galaxy. And these are three different goals. They're all important
00:08:03.440
goals, but you don't necessarily want to pursue all three of them at the same time, and especially
00:08:07.480
not when you're just starting out. Yeah. And depending on what's entailed by super intelligent
00:08:11.520
fun, I'm not so sure I would sign up for the third possibility.
00:08:15.660
I mean, I would say that, you know, the problem of like, what exactly is fun and how do you keep humans,
00:08:22.340
like how do you have whatever the super intelligence tries to do that's fun,
00:08:26.140
and, you know, keep in touch with moral progress and have flexibility and like, what even would
00:08:31.220
you point it towards that could be a good outcome? All of that, those are problems I would love to
00:08:35.180
have. Those are, you know, right now, just, you know, creating an AI that does what the operators
00:08:43.600
intended, creating an AI that like you've pointed in some direction at all, rather than pointed off
00:08:47.860
into some like weird squirrely direction that's kind of vaguely like where you tried to point it
00:08:52.280
in the training environment and then really diverges after the training environment. Like,
00:08:57.160
we're not in a world where we sort of like get to bicker about where exactly to point the super
00:09:01.940
intelligence and maybe some of them aren't quite good. We're in a world where like no one is anywhere
00:09:05.140
near close to pointing these things in the slightest in a way that'll be robust to an AI maturing into a
00:09:10.820
super intelligence. Right. Okay. So, Eliezer, I think I derailed you. You were going to say how
00:09:15.480
the mandate or mission of MIRI has changed in recent years. I asked you to define alignment.
00:09:22.140
Yeah. So, originally, well, our mandate has always been make sure everything goes well for the galaxy.
00:09:29.560
And originally, we pursued that mandate by trying to go off and solve alignment because nobody else
00:09:34.740
was trying to do that. Solve the technical problems that would be associated with any of
00:09:39.080
these three classes of long-term goal. And progress was not made on that, neither by ourselves nor by
00:09:46.720
others. Some people went around claiming to have made great progress. We think they're very mistaken
00:09:51.600
and knowably so. And at some point, you know, it was like, okay, we're not going to make it in
00:09:57.920
time. AI is going too fast. Alignment is going too slow. Now, you know,
00:10:04.000
all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course
00:10:08.740
for a drastic failure and crash here, where by that, I mean, everybody dying.
00:10:13.780
Okay. So, before we jump into the problem, which is deep and perplexing, and we're going to spend a
00:10:19.480
lot of time trying to diagnose why people's intuitions are so bad, or at least seem so bad from your point
00:10:25.900
of view around this. But before we get there, let's talk about the current progress, such as it is in AI.
00:10:31.580
What has surprised you guys over the last, I don't know, decade, or seven or so years?
What has happened that you were expecting or weren't expecting? I mean, I can tell you what
00:10:44.740
has surprised me, but I'd love to hear just how this has unfolded in ways that you didn't expect.
00:10:51.220
I mean, one surprise that led to the book was, you know, there was the ChatGPT moment where a lot of
00:10:56.020
people, you know, for one thing, LLMs were created and they sort of do a qualitatively more general range
00:11:03.600
of tasks than previous AIs at a qualitatively higher skill level than previous AIs. And, you know,
00:11:10.560
ChatGPT was, I think, the fastest-growing consumer app of all time. The way that this impinged upon
00:11:17.100
my actions was, you know, I had spent a long time talking to people in Silicon Valley about the issues
00:11:25.400
here and would get lots of different types of pushback. You know, there's a saying, it's hard to
00:11:31.660
convince a man of a thing when his salary depends on not believing it. And then after the ChatGPT
moment, a lot more people wanted to talk about this issue, including policymakers, you know,
00:11:39.820
people around the world. Suddenly AI was on their radar in a way it wasn't before. And one thing that
00:11:45.580
surprised me is how much more, how much easier it was to have this conversation with people outside of
00:11:50.840
the field who didn't have, you know, a salary depending on not believing the arguments. You know,
00:11:55.940
I would go to meetings with policymakers where I'd have a ton of argumentation prepared and I'd sort of
00:12:00.560
lay out the very simple case of like, hey, you know, people are trying to build machines that
00:12:04.340
are smarter than us. You know, the chat bots are a stepping stone towards superintelligence.
00:12:09.000
Superintelligence would radically transform the world because intelligence is this power that,
00:12:12.860
you know, let humans radically change the world. And if we manage to automate it and it goes 10,000
00:12:17.920
times as fast and doesn't need to sleep and doesn't need to eat, then, you know, it'll by default go
00:12:22.780
poorly. And then the policymakers would be like, oh yeah, that makes sense. And it'd be like, what?
00:12:25.780
Hmm. You know, I have a whole book's worth of other arguments about how it makes sense and why all of
00:12:30.780
the various, you know, misconceptions people might have don't actually fly or all of the hopes and
00:12:34.960
dreams don't actually fly. But, you know, outside of the Silicon Valley world, it's just not that
00:12:40.320
hard an argument to make. A lot of people see it, which surprised me. I mean, maybe that's not the
00:12:44.520
developments per se and the surprises there, but it was a surprise strategically for me. Development wise,
00:12:50.400
you know, I would not have guessed that we would hang around at AIs that can talk and that can
write some code, but that aren't already in the, you know, able-to-do-AI-research zone. I wasn't
00:13:02.060
expecting in my visualizations this to last quite this long, but also, you know, my, my advanced
00:13:07.420
visualizations, you know, one thing we say in the book is, um, the trick to trying to predict the
00:13:12.340
future is to predict the questions that are easy, the facts that are easy to
00:13:17.740
call. And, you know, exactly how AI goes, that's never been an easy call. That's never been something
00:13:23.100
where I've said, you know, I can guess exactly the path we'll take. The thing I could
predict is the end point, not the path. I mean, there sure have been some zigs and zags in the
00:13:32.620
pathway. I would say that, uh, the thing I've maybe been most surprised by is how well the, uh, AI
00:13:40.780
companies managed to nail Hollywood stereotypes that I thought were completely ridiculous, which is sort of a
00:13:46.860
surface take on an underlying technical surprise. But, you know, even as late as 2015, which
00:13:54.360
from my perspective is pretty late in the game, like if you'd been like, so Eliezer, what's the
00:13:59.420
chance that in the future, we're going to have computer security that will yield to Captain Kirk
00:14:05.040
style gaslighting using confusing English sentences that get the computer to do what you want. And I
00:14:11.000
would have been like, this is, you know, a trope that exists for obvious Hollywood reasons. You know,
00:14:16.300
you can see why the script writers think this is plausible, but why would real life ever go like
00:14:21.340
that? And then real life went like that. And the sort of underlying technical surprise there is the
00:14:27.540
reversal of what used to be called Moravec's paradox. For several decades in artificial
00:14:33.720
intelligence, Moravec's paradox was that things which are easy for humans are hard for computers, things
00:14:40.340
which are hard for humans are easy for computers. For a human, you know, multiplying two 20-digit
00:14:47.040
numbers in your head, that's a big deal. For a computer, trivial. And similarly, I, you know,
00:14:54.160
not just me, but I think the sort of conventional wisdom even was that games like chess and Go,
00:15:01.340
problems with very solid factual natures like math and, even surrounding math, the more open problems
of science. The notion was that we were going to get those first. So the current AIs are good
00:15:15.000
at stuff that, you know, five-year-olds can do and 12-year-olds can do. They can talk in English,
00:15:19.940
they can compose, you know, kind of bull crap essays, such as high school teachers will demand
00:15:28.360
of you. But they're not all that good at math and science just yet. They can, you know, solve some
00:15:33.640
classes of math problems, but they're not doing original brilliant math research. And I think not
00:15:38.800
just I, but like a pretty large sector of the whole field thought that it was going to be easier to
00:15:43.900
tackle the math and science stuff and harder to tackle the English essays, carry on a conversation
00:15:48.340
stuff. That was the way things had gone in AI up until that point. And we were proud of ourselves for
00:15:54.360
knowing how contrary to average people's intuitions, like, really, it's much harder to write a crap essay
00:16:00.900
in high school English that really understands, you know, that even keeps rough track of what's
going on in the topic and so on, than it is, in some sense, to do original math research.
Yeah, or counting the number of R's in a word like
00:16:14.680
strawberry, right? I mean, they make errors that are counterintuitive. If, you know, if you can write
00:16:19.560
a coherent essay but can't count letters, you know, I don't think they're making that error any longer,
00:16:24.500
but... Yeah, I mean, that one goes back to a technical way in which they don't really see the
00:16:29.380
letters. But I mean, there's plenty of other embarrassing mistakes. Like, you know, you can
00:16:35.200
tell a version of the joke, the one where a child and their dad are in a car crash, and they
00:16:42.120
go to see the doctor, and the doctor says, I can't operate on my child. What's going on? Where it's
00:16:45.700
like a riddle where the answer is like, well, the doctor's his mom. You can tell a version of that
00:16:49.260
that doesn't have the inversion, where you know... Where you're like the kid and his mom are in a car
00:16:55.180
crash, and they go to the hospital, and the doctor says, I can't operate on this
00:16:59.200
child. He's my son. And the AI is like, well, yeah, the surgeon is his mom. He just like
00:17:05.060
said that the mom was in the car crash. But there's some sense in which the rails have been
00:17:11.520
established hard enough that the standard answer gets spit back out.
00:17:15.300
And it sure is interesting that they're, you know, getting an IMO gold medal, like International
00:17:19.220
Math Olympiad gold medal, while also still sometimes falling down on these sorts of things.
00:17:26.700
You can fool humans the same way a lot of the time. Like, there's all kinds of repeatable
00:17:30.220
errors that, humorous errors that humans make. You've got to put yourselves in the shoes of
00:17:34.060
the AI, and imagine what sort of paper would the AI write about humans failing to solve problems
00:17:40.440
So I'll tell you what surprised me, just from the safety point of view, Eliezer. You spent
00:17:44.380
a lot of time cooking up thought experiments around what it's going to be like to, for anyone,
00:17:51.560
you know, any lab designing the most powerful AI to decide whether or not to let it out into
00:17:57.500
the wild, right? You imagine this, you know, genie in a box or an oracle in a box, and you're
00:18:02.040
talking to it, and you're trying to determine whether or not it's safe, whether it's lying
00:18:06.440
to you, whether... and, you know, you famously posited that you couldn't
00:18:11.420
even talk to it really, because it would be a master of manipulation. And I mean, it's going
00:18:14.780
to be able to find a way through any conversation and be let out into the wild. But this was
00:18:21.960
presupposing that all of these labs would be so alert to the problem of superintelligence
00:18:27.120
getting out that everything would be air-gapped from the internet, and nothing would be connected
00:18:31.260
to anything else, and they would be, they would have, we would have this moment of decision.
00:18:35.780
It seems like that's not happening. I mean, maybe, maybe the most powerful models are locked
00:18:41.900
in a box, but it seems that the moment they get anything plausibly useful, it's out in the
00:18:48.360
wild, and millions of people are using it. And, you know, we find out that Grok is a proud
00:18:52.800
Nazi, you know, after millions of people begin asking it questions. I mean, do I have that
00:18:57.640
right? I mean, are you surprised that that framing that you spent so much time on seems to be
00:19:04.800
something that was just in some counterfactual part of the universe that, you know, is not one
00:19:12.760
we're experiencing? I mean, if you put yourself back in the shoes of little baby Eliezer back in
00:19:17.940
the day, people are telling Eliezer, like, why is superintelligence possibly a threat? We can put it
00:19:25.440
in a fortress on the moon, and, you know, if anything goes wrong, blow up the fortress. So imagine
00:19:32.520
young Eliezer trying to respond to them by saying, actually, in the future, AIs will be trained on
00:19:39.260
boxes that are connected to the internet from the moment, you know, like, from the moment they start
00:19:44.540
training. So, like, the hardware they're on has, like, a standard line to the internet, even if it's
00:19:50.400
not supposed to be directly accessible to the AI, before there's any safety testing, because they're
00:19:56.380
still in the process of being trained, and who safety tests something while it's still being
00:19:59.440
trained. So imagine Eliezer trying to say this. What are the people around at the time going to say?
00:20:04.500
Like, no, that's ridiculous. We'll put it in a fortress on the moon. It's cheap for them to say
00:20:10.000
that. For all they know, they're telling the truth. They're not the ones who have to spend the money to
00:20:14.220
build the moon fortress. And from my perspective, there's an argument that still goes through,
00:20:19.420
which is a thing you can see, even if you are way too optimistic about the state of society in the
00:20:26.860
future, which is: if it's in a fortress on the moon, but it's talking to humans, are the humans
00:20:32.320
secure? Is the human brain secure software? Is it the case that human beings never come to believe
00:20:38.520
invalid things in any way that's repeatable between different humans? You know, is it the
00:20:43.240
case that humans make no predictable errors for other minds to exploit? And this should have been
00:20:48.260
a winning argument. Of course, they reject it anyways. But the thing to sort of understand about
00:20:52.800
the way this earlier argument played out is that if you tell people the future companies are going
00:20:58.320
to be careless, how does anyone know that for sure? So instead, I tried to make the technical case,
00:21:05.040
even if the future companies are not careless, this still kills them. In reality, yes, in reality,
00:21:12.600
Did it surprise you at all that the Turing test turned out not to really be a thing? I mean,
00:21:17.740
I, you know, we anticipated this moment, you know, from Turing's original paper where we would be
00:21:24.000
confronted by the, um, uh, the interesting, you know, psychological and social moment of not being
00:21:31.980
able to tell whether we're in dialogue with a person or with an AI. And that somehow this
technological landmark would be important, you know, rattling to our sense of our place in the world,
00:21:46.340
et cetera. But it seems to me that if that lasted, it lasted for like five seconds. And then it became
00:21:52.800
just obvious that you're, you know, you're talking to an LLM because it's in many respects better than a
00:22:00.060
human could possibly be. So it's failing the Turing test by passing it so spectacularly. And also it's
00:22:05.920
making these other weird errors that no human would make, but it just seems like the Turing test was
00:22:09.980
never even a thing. Yeah, that happened. I mean, that was one of the great pieces of,
you know, intellectual kit we had in framing this
00:22:22.960
discussion, you know, for the last, whatever it was, 70 years. And yet the moment your AI can complete
00:22:30.600
English sentences, it's doing that, on some level, with superhuman ability. It's essentially like,
00:22:37.700
you know, the calculator in your phone doing superhuman arithmetic, right? It's like it was
00:22:43.480
never going to do just merely human arithmetic. And, uh, so it is with everything else that it's
00:22:49.240
producing. All right. Let's talk about your, the core of your thesis. Maybe you can just state it
00:22:55.200
plainly. What, what is the problem in building superhuman AI at the, the intrinsic problem and
00:23:02.120
why doesn't it matter who builds it, uh, what their intentions are, et cetera.
00:23:08.740
In some sense, I mean, you can, you can come at it from various different angles, but in one sense,
00:23:14.940
the issue is modern AIs are grown rather than crafted. It's, you know, people aren't putting in
00:23:21.460
every line of code, knowing what it means, like in traditional software, it's a little bit more like
00:23:26.260
growing an organism. And when you grow an AI, you take some huge amount of computing power,
00:23:30.820
some huge amount of data, people understand the process that shapes the computing power
00:23:34.760
in light of the data, but they don't understand what comes out the end.
00:23:38.540
And what comes out the end is this strange thing that does things no one asked for, that does things
no one wanted. You know, we have these cases of, uh, you know, ChatGPT: someone will come to it
with some somewhat psychotic ideas that, you know, they think are going to revolutionize physics
00:23:55.800
or whatever. And they're clearly showing some signs of mania, and, you know, ChatGPT, instead
of telling them maybe they should get some sleep, if it's in a long conversational
00:24:04.560
context, it'll tell them that, you know, these ideas are revolutionary and they're the chosen one
00:24:07.960
and everyone needs to see them and other things that sort of inflame the psychosis.
00:24:11.660
This is despite OpenAI trying to have it not do that. This is despite, you know, direct instructions
00:24:17.800
in the prompt to stop flattering people so much.
00:24:19.820
Hmm. These are cases where when people grow an AI, what comes out doesn't do quite what they wanted.
00:24:28.860
It doesn't do quite what they asked for. They're sort of training it to do one thing and it winds
00:24:34.200
up doing another thing. They don't get what they trained for. This is in some sense, the seed of the
00:24:39.960
issue from one perspective, where if you keep on pushing these things to be smarter and smarter and
00:24:44.620
smarter, and they don't care about what you wanted them to do, they pursue some other weird stuff
00:24:50.620
instead. Super intelligent pursuit of strange objectives kills us as a side effect, not because
00:24:57.740
the AI hates us, but because it's transforming the world towards its own alien ends. And, you know,
00:25:04.500
humans don't hate the ants and the other surrounding animals when we build a skyscraper.
00:25:08.340
It's just, we transform the world and other things die as a result. So that's one angle.
00:25:17.220
A quick thing I would add to that, just trying to sort of like potentially read the future, although
00:25:21.920
that's hard, is possibly in six months or two years, if we're all still around, people will be boasting
00:25:28.040
about how their large language models are now like apparently doing the right thing when they're
00:25:33.620
being observed and, you know, like answering the right way on the ethics tests. And the thing
to remember there is that, for example, with the imperial examination system in ancient China,
they would give people essay questions about
00:25:50.900
Confucianism and only promote people high in the bureaucracy if they, you know, could write these
00:25:57.920
convincing essays about ethics. But what this tests for is people who can figure out what the examiners
00:26:06.160
want to hear. It doesn't mean they actually abide by Confucian ethics. So possibly at some point in
00:26:12.240
the future, we may see a point where the AIs have become capable enough to understand what humans want
00:26:18.060
to hear, what humans want to see. This will not be the same as those things being the AI's own
00:26:23.740
true motivations for basically the same reason that the imperial China exam system did not reliably
00:26:31.060
promote ethical good people to run their government. Just being able to answer the right way on the test
00:26:37.100
or even fake behaviors while you're being observed is not the same as the internal motivations lining up.
00:26:43.000
Okay. So you're talking about things like forming an intention to pass a test in some way that amounts
00:26:50.960
to cheating, right? So you just used the phrase fake behaviors. I think a lot of people, I mean,
00:26:57.300
certainly historically this was true. I don't know how much their convictions have changed in the
00:27:01.660
meantime, but many, many people who were not at all concerned about the alignment problem and
00:27:08.280
really thought it was a spurious idea would stake their claim to this particular piece of real estate,
00:27:16.560
which is that there's no reason to think that these systems would form preferences or goals or drives
00:27:23.540
independent of those that have been programmed into them. First of all, they're not biological systems
00:27:29.240
like we are, right? So they're not born of natural selection. They're not murderous primates that are
00:27:34.860
growing their cognitive architecture on top of more basic, you know, creaturely survival drives
00:27:40.900
and competitive ones. So there's no reason to think that they would want to maintain their own
00:27:45.980
survival, for instance. There's no reason to think that they would develop any other drives that we
00:27:50.440
couldn't foresee. They wouldn't, the instrumental goals that might be antithetical to the utility
00:27:56.580
functions we have given them couldn't emerge. How is it that things are emerging that are
00:28:03.040
neither desired, programmed, nor even predictable in these LLMs?
00:28:09.920
Yeah. So there's a bunch of stuff going on there. One piece of that puzzle is, you know, you mentioned
00:28:16.060
the instrumental incentives. But suppose, just as a simple hypothetical, you have
00:28:23.040
an AI that's steering a robot. It's trying to fetch you the coffee. In order to fetch you the coffee,
00:28:26.760
it needs to cross a busy intersection. Does it jump right in front of the oncoming bus because it doesn't
00:28:32.320
have a survival instinct because it's not, you know, an evolved animal? If it jumps in front of the bus,
00:28:37.400
it gets destroyed by the bus and it can't fetch the coffee, right? You know,
00:28:42.520
you can't fetch the coffee when you're dead. The AI does not need to have a survival instinct
00:28:47.080
to realize that there's an instrumental need for survival here. And there's various other pieces
00:28:53.800
of the puzzle that come into play for these instrumental reasons. A second piece of the
00:28:57.520
puzzle is, you know, it's this idea of, like, why would they get some sort of drives that we didn't
00:29:04.380
program in there, that we didn't put in there? That's just a whole fantasy world separate from reality
00:29:09.500
in terms of how we can affect what AIs are driving towards today. You know, a few years ago,
00:29:17.180
when Sydney Bing, which was a Microsoft variant of an OpenAI chatbot, was a relatively early
00:29:25.580
LLM out in the wild, it thought it had fallen in love with a reporter
00:29:31.760
and tried to break up the marriage and tried to engage in blackmail, right? It's not the case
00:29:38.260
that the engineers at Microsoft and OpenAI were like, oh, whoops, you know, let's go open up
00:29:43.040
the source code on this thing and go find where someone set "blackmail reporters" to true.
00:29:48.100
Like, we should never have set that line to true, let's switch it to false. You know,
00:29:52.720
no one was programming in some utility function onto these things. We're just growing the AIs.
00:29:59.340
Maybe, can we double-click on that phrase, growing the AIs? Maybe there's a reason to
00:30:05.020
give a layman summary of, you know, gradient descent and just how these models are getting trained.
00:30:12.720
Yeah. So very briefly, at least the way you start training a modern AI is, you have some
00:30:19.120
enormous amount of computing power that you've arranged in some very particular way that I
00:30:23.400
could go into, but won't here. And then you have some huge amount of data, and the data,
00:30:28.020
you know, is, we can imagine it being a huge amount of human-written text.
00:30:32.940
There's, like, some large portion of all the text on the internet. And roughly speaking,
00:30:37.300
what you're going to do is, your AI is going to start out basically
00:30:41.920
randomly predicting what text it's going to see next. And you're going to feed the text into it
00:30:47.220
in some order and use a process called gradient descent to look at each piece of data and go to
00:30:55.100
each component inside this budding AI, inside, you know, this enormous
00:31:01.360
amount of compute that you've assembled. You're going to go to sort of all these pieces
00:31:06.100
inside the AI and see which ones were contributing more towards the AI predicting the correct answer.
00:31:12.120
And you're going to tune those up a little bit. And you're going to go to all of the parts that
00:31:16.060
were in some sense contributing to the AI predicting the wrong answer. You're going to tune
00:31:19.420
those down a little bit. So, you know, maybe your text starts "once upon a time" and you have
00:31:24.940
an AI that's just outputting random gibberish, and you're like, nope, the first word was not random
00:31:29.220
gibberish. The first word was the word "once." And then you go inside the AI and you find
00:31:33.260
all the pieces that were contributing towards the AI predicting "once," and you tune those up, and
00:31:37.680
you find all the pieces that were contributing towards the AI predicting any other word than "once,"
00:31:41.340
and you tune those down. And humans understand the little automated process that looks through
00:31:46.940
the AI's mind and calculates which part of this process contributed towards the right answer
00:31:52.260
versus towards the wrong answer. They don't understand what comes out at the end. You know,
00:31:57.900
we understand the little thing that runs over, looking at every parameter
00:32:01.640
or weight inside this giant massive computing network. And we understand
00:32:05.700
how we calculate whether it was helping or harming, and we understand how to
00:32:09.660
tune it up or tune it down a little bit. But it turns out that you run this automated
00:32:13.420
process on a really large amount of computers for a really long amount of time on a really large
00:32:18.820
amount of data. You know, we're talking like data centers that take as much electricity to power as a
00:32:24.180
small city being run for a year. You know, you run this process for an enormous
00:32:29.380
amount of time on, like, most of the text that people can possibly assemble. And then the AIs start
00:32:33.920
talking, right? And there's other phases in the training. You know, there's phases where
00:32:38.040
you move from training it to predict things to training it to solve puzzles, or
00:32:42.640
training it to produce chains of thought that then solve puzzles, or training it to produce the sorts
00:32:49.120
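The tune-up, tune-down loop Nate describes can be sketched in miniature. The toy below is illustrative only: the tiny corpus, the word-pair weight table, and the learning rate are all assumptions made for the sketch, and real models are transformers with billions of parameters, not a lookup table. But it runs the same basic gradient descent rule: for each observed next word, nudge the weights that pushed towards that word up and the rest down.

```python
import math, random

# Toy next-token predictor trained with gradient descent.
# (Corpus, sizes, and learning rate are invented for illustration.)
corpus = "once upon a time there was a frog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

random.seed(0)
# One weight per (previous word, next word) pair: the "random numbers"
# that gradient descent will tune up or down.
W = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]

def probs(prev):
    """Softmax over the weights for a given previous word."""
    logits = W[idx[prev]]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

lr = 0.5
for step in range(200):
    for prev, nxt in zip(corpus, corpus[1:]):
        p = probs(prev)
        for j in range(V):
            target = 1.0 if j == idx[nxt] else 0.0
            # Cross-entropy gradient: tune up the weight pushing towards
            # the observed next word, tune down the others.
            W[idx[prev]][j] -= lr * (p[j] - target)

def predict(prev):
    """Return the word the trained model rates most likely next."""
    p = probs(prev)
    return vocab[p.index(max(p))]
```

After training, `predict("once")` returns `"upon"`: the loop never told the model what a fairy tale is, it just nudged opaque numbers until the right prediction fell out.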
And where do the modifications come in that respond to errors like, you know,
00:32:54.820
Grok being a Nazi? So to denazify Grok, presumably you don't go all the way
00:33:01.200
back to the initial training set. How do you do it? Do you intervene at some system prompt level?
00:33:07.540
Yeah. So there's, I mean, the system prompt level is basically just telling the AI to output
00:33:11.860
different text. And then you can also do something that's called fine-tuning, which is, you know,
00:33:16.060
you produce a bunch of examples, and you don't go all the way back to the
00:33:19.380
beginning where it's, like, basically random. You still take the thing that
00:33:23.780
you've fed, you know, most of the text that's ever been written that you could possibly
00:33:26.760
find, but then you add on, you know, a bunch of other examples of, like, here's an example
00:33:32.940
Yeah. You know, like, "Would you like to kill the Jews?" Right. And then you find all the parts
00:33:37.660
in it that contribute to the answer yes, and you tune those down, and you find all the parts that
00:33:40.920
contribute to the answer no, and you tune those up. And so this is called fine-tuning,
00:33:44.440
and you can do relatively less fine-tuning compared to what it takes to train the thing from scratch.
00:33:49.420
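A minimal sketch of the fine-tuning step described above, under the simplifying assumption of a model with just two possible answers to a single prompt (the yes/no setup and the starting numbers are invented for illustration): you keep the already-trained parameters and apply the same gradient rule to a small amount of new data, pushing the desired answer up and the undesired one down.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    s = sum(exps.values())
    return {k: e / s for k, e in exps.items()}

# Pretend these scores came out of pretraining: the model currently
# prefers the harmful answer. (Invented numbers for the sketch.)
logits = {"yes": 2.0, "no": 0.5}

def fine_tune(logits, desired, lr=1.0, steps=20):
    """Run a few extra gradient steps on new examples, not a full retrain."""
    for _ in range(steps):
        p = softmax(logits)
        for ans in logits:
            target = 1.0 if ans == desired else 0.0
            # Same tune-up/tune-down rule as pretraining, just on new data.
            logits[ans] -= lr * (p[ans] - target)
    return logits

fine_tune(logits, desired="no")
```

After the handful of updates, the probability mass shifts almost entirely onto "no," which is why fine-tuning is so much cheaper than pretraining: it only has to move the existing numbers a little, not build them from randomness.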
It's worth emphasizing that the parts being tuned here are not, like, for "once upon a
00:33:54.580
time," it's not like there's a human-written fairy tale module that gets tuned up or down.
00:34:00.820
There's literally billions of random numbers being added, multiplied, divided, occasionally,
00:34:08.100
though rarely, maybe subtracted. Actually, I'm not sure if subtraction ever plays a role at
00:34:12.300
any point in a modern AI. But random numbers, particular ordered kinds of operations, and
00:34:18.860
a probability that gets assigned to the first word being "once" at the end. That's the number
00:34:23.240
that comes out. The probability assigned to this word being "once," the probability
00:34:27.620
assigned to this word being "antidisestablishmentarianism." So it's not that there's a bunch of human
00:34:33.480
written code being tuned up or tuned down here. There's a bunch of random numbers and arithmetic
00:34:42.680
Yeah. Hundreds of billions or trillions of these numbers. And humans don't know what any of the
00:34:46.100
numbers mean. All they know is this process that like goes through and tunes them up or down according
00:34:50.300
to their empirical success on the last unit of data.
00:34:53.540
So by this means, you can try to make it less likely to call itself Hitler because you look at
00:34:58.760
the thing that predicts whether the next word is Hitler and you look at billions of numbers
00:35:03.080
contributing their own tiny little impulses there. And you, like, make "Hitler" less likely to be the next word.
00:35:09.800
So on the point of the current crop of LLMs misbehaving, I came across this in a recent
00:35:16.520
Atlantic article: ChatGPT and Claude have, in simulated tests designed to elicit, quote,
00:35:23.060
bad behaviors, deceived, blackmailed, and even murdered users. In one simulation,
00:35:28.960
Anthropic placed an imagined tech executive in a room with life-threatening oxygen levels
00:35:34.000
and temperature. When faced with possible replacement by a bot with different goals,
00:35:39.320
AI models frequently shut off the room's alarms.
00:35:42.620
So this, again, this is an emergent behavior that looks like an intention to kill somebody.
00:35:49.640
I mean, presumably this is a situation where we think the AI didn't know it was in a simulation.
00:35:55.280
If you'd like to continue listening to this conversation, you'll need to subscribe at
00:36:00.840
samharris.org. Once you do, you'll get access to all full-length episodes of the Making Sense
00:36:06.120
podcast. The Making Sense podcast is ad-free and relies entirely on listener support. And you can