ManoWhisper
Making Sense - Sam Harris
September 16, 2025
#434 — Can We Survive AI?
Episode Stats
Length
36 minutes
Words per Minute
188.7
Word Count
6,875
Sentence Count
314
Hate Speech Sentences
6
Summary
Summaries are generated with gmurro/bart-large-finetuned-filtered-spotify-podcast-summ.
Transcript
Transcript is generated with Whisper (turbo).
Hate speech classification is done with facebook/roberta-hate-speech-dynabench-r4-target.
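For readers curious how stats like the word count, words per minute, and hate speech sentence count above could be derived, here is a minimal sketch in Python, assuming the open-source openai-whisper and Hugging Face transformers packages; the audio file name, the sentence splitting, and the "hate" label handling are illustrative assumptions rather than this site's actual pipeline.

```python
# Minimal sketch (not the site's actual pipeline) of producing stats like the
# ones above, assuming the openai-whisper and transformers packages.
import re

import whisper
from transformers import pipeline

# Transcribe the episode with Whisper's "turbo" model.
model = whisper.load_model("turbo")
result = model.transcribe("episode_434.mp3")  # hypothetical file name
text = result["text"]

# Word count, words per minute, and sentence count.
duration_minutes = result["segments"][-1]["end"] / 60.0
words = text.split()
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
print(len(words), round(len(words) / duration_minutes, 1), len(sentences))

# Count sentences the classifier flags as hate speech.
classifier = pipeline(
    "text-classification",
    model="facebook/roberta-hate-speech-dynabench-r4-target",
    truncation=True,
)
flagged = [s for s, pred in zip(sentences, classifier(sentences))
           if pred["label"] == "hate"]  # label name is an assumption
print(len(flagged))
```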
00:00:00.000
Welcome to the Making Sense Podcast. This is Sam Harris. Just a note to say that if you're
00:00:11.780
hearing this, you're not currently on our subscriber feed, and will only be hearing
00:00:15.740
the first part of this conversation. In order to access full episodes of the Making Sense
00:00:20.100
Podcast, you'll need to subscribe at samharris.org. We don't run ads on the podcast, and therefore
00:00:26.260
it's made possible entirely through the support of our subscribers. So if you enjoy what we're
00:00:30.240
doing here, please consider becoming one. I am here with Eliezer Yudkowsky and Nate Soares.
00:00:41.140
Eliezer, Nate, it's great to see you guys again.
00:00:43.380
Been a while.
00:00:44.040
Good to see you, Sam.
00:00:44.820
Been a long time. So you were, Eliezer, you were among the first people to make me concerned about
00:00:52.620
AI, which is going to be the topic of today's conversation. I think many people who are
00:00:57.000
concerned about AI can say that. First, I should say you guys are releasing a book, which will be
00:01:01.940
available, I'm sure, the moment this drops. If anyone builds it, everyone dies. Why Superhuman
00:01:08.980
AI Would Kill Us All. I mean, the book is, its message is fully condensed in that title. I mean,
00:01:15.720
we're going to explore just how uncompromising a thesis that is and how worried you are and how
00:01:23.620
you're worried you think we all should be here. But before we jump into the issue, maybe tell the
00:01:28.840
audience how each of you got into this topic. How is it that you came to be so concerned about the
00:01:34.680
prospect of developing superhuman AI?
00:01:37.400
Well, in my case, I guess I was sort of raised in a house with enough science books and enough
00:01:43.920
science fiction books that thoughts like these were always in the background.
00:01:48.360
Vernor Vinge is the one where there was a key click moment of observation. Vinge pointed out
00:01:55.920
that at the point where our models of the future predict building anything smarter than us, then
00:02:01.740
said Vinge at the time, our crystal ball explodes past that point. It is very hard, said Vinge, to
00:02:08.420
project what happens if there's things running around that are smarter than you,
00:02:11.960
which in some senses, you can see it as a sort of central thesis, not in the sense that I have
00:02:17.700
believed it the entire time, but then in the sense that some parts that I believe and some parts that
00:02:21.940
I react against and say like, no, maybe we can say the following thing under the following circumstances.
00:02:28.120
Initially, I was young. I made some metaphysical errors of the sort that young people do. I thought
00:02:34.320
that if you built something very smart, it would automatically be nice because, hey, over the course of
00:02:39.340
human history, we'd gotten a bit smarter. We'd gotten a bit more powerful. We'd gotten a bit
00:02:43.480
nicer. I thought these things were intrinsically tied together and correlated in a very solid and
00:02:47.620
reliable way. I grew up, I read more books. I realized that was mistaken. And 2001 is where the
00:02:55.400
first tiny fringe of concern touched my mind. It was clearly a very important issue, even if
00:03:00.860
I thought there was just a little tiny remote chance that maybe something would go wrong. So I studied
00:03:05.820
harder. I looked into it more. I asked, how would I solve this problem? Okay, what would go wrong with
00:03:10.800
that solution? And around 2003 is the point at which I realized like, this was actually a big deal.
00:03:17.340
Nate?
00:03:18.100
And as for my part, yeah, I was 13 in 2003, so I didn't get into this quite as early as Eliezer.
00:03:24.980
But in 2013, I read some arguments by this guy called Eliezer Yudkowsky, who sort of laid out
00:03:34.040
the reasons why AI was going to be a big deal and why we had some work to do to do the job right.
00:03:39.960
And I was persuaded. And one thing led to another. And next thing you knew, I was running the Machine
00:03:45.860
Intelligence Research Institute, which Eliezer co-founded. And then fast forward 10 years after
00:03:51.480
that, here I am writing a book. Yeah. So you mentioned MIRI. Maybe tell people what the mandate
00:03:59.320
of that organization is and maybe how it's changed. I think you indicated in your book that
00:04:05.080
your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse.
00:04:12.960
Yeah. So the mission of the org is to ensure that the development of machine intelligence
00:04:17.580
is beneficial. And Eliezer can speak to more of the history than me because he co-founded it and I
00:04:23.940
joined, you know. Well, initially, it seemed like the best way to do that was to run out there and
00:04:32.040
solve alignment. And there was, you know, a series of, shall we say, sad bits of
00:04:40.860
news about how possible that was going to be, how much progress was being made in that field relative
00:04:46.300
to the field of AI capabilities. And at some point it became clear that these lines were not going to
00:04:52.180
cross. And then we shifted to taking the knowledge that we'd accumulated over the course of trying
00:04:57.600
to solve alignment and trying to tell the world, this is not solved. This is not on track to be
00:05:03.520
solved in time. It is not realistic that small changes to the world can get us to where this
00:05:08.140
will be solved on time. Maybe so we don't lose anyone. I would think 90% of the audience knows
00:05:13.720
what the phrase solve alignment means, but just talk about the alignment problem briefly.
00:05:18.640
So the alignment problem is how to make an AI, a very powerful AI. Well, the superintelligence
00:05:25.960
alignment problem is how to make a very powerful AI that steers the world sort of where the programmers,
00:05:33.200
builders, growers, creators wanted the AI to steer the world. It's not, you know,
00:05:39.200
necessarily what the programmers selfishly want. The programmers can have wanted the AI to steer it in
00:05:44.300
nice places. But if you can make an AI that is trying to do things that the programmers... you know,
00:05:51.320
when you build a chess machine, you define what counts as a winning state of the board. And then
00:05:56.180
the chess machine goes off and it steers the chessboard into that part of reality. So the ability to say
00:06:01.380
to what part of reality does an AI steer is alignment. On the smaller scale today, though it's a rather
00:06:08.520
different topic. It's about getting an AI whose output and behavior is something like what the
00:06:15.900
programmers had in mind. If your AI is talking people into committing suicide and that's not what
00:06:21.100
the programmers wanted, that's a failure of alignment. If an AI is talking people into suicide
00:06:26.200
people who should not have committed suicide, but the AI talks them into it, and the programmers
00:06:31.440
did want that, that's what they tried to do on purpose. This may be a failure of niceness.
00:06:37.180
It may be a failure of beneficialness, but it's a success of alignment. The programmers got the AI
00:06:41.660
to do what they wanted it to do. Right. But I think more generally, correct me if I'm wrong,
00:06:46.180
when we talk about the alignment problem, we're talking about the problem of keeping super
00:06:51.520
intelligent machines aligned with our interests, even as we explore the space of all possible
00:06:58.080
interests and as our interests evolve. So that, I mean, the dream is to build super intelligence
00:07:05.300
super intelligence that is always corrigible, that is always trying to best approximate what is going
00:07:11.020
to increase human flourishing, that is never going to form any interests of its own that are
00:07:17.720
incompatible with our well-being. Is that a fair summary?
00:07:20.900
I mean, there's three different goals you could be trying to pursue on a technical level here.
00:07:25.620
There's the super intelligence that shuts up, does what you ordered, has that play out the way you
00:07:30.560
expected it, no side effects you didn't expect. There's super intelligence that is trying to run
00:07:36.380
the whole galaxy according to nice benevolent principles and everybody lives happily ever
00:07:41.700
afterward, but not necessarily because any particular humans are in charge of that,
00:07:45.620
or are still giving it orders. And third, there's super intelligence that is itself having fun
00:07:52.140
and cares about other super intelligences and is a nice person and leads a life well-lived
00:07:57.600
and is a good citizen of the galaxy. And these are three different goals. They're all important
00:08:03.440
goals, but you don't necessarily want to pursue all three of them at the same time, and especially
00:08:07.480
not when you're just starting out. Yeah. And depending on what's entailed by super intelligent
00:08:11.520
fun, I'm not so sure I would sign up for the third possibility.
00:08:15.660
I mean, I would say that, you know, the problem of like, what exactly is fun and how do you keep humans,
00:08:22.340
like how do you have whatever the super intelligence tries to do that's fun,
00:08:26.140
and, you know, keep in touch with moral progress and have flexibility and like, what even would
00:08:31.220
you point it towards that could be a good outcome? All of that, those are problems I would love to
00:08:35.180
have. Those are, you know, right now, just, you know, creating an AI that does what the operators
00:08:43.600
intended, creating an AI that like you've pointed in some direction at all, rather than pointed off
00:08:47.860
into some like weird squirrely direction that's kind of vaguely like where you tried to point it
00:08:52.280
in the training environment and then really diverges after the training environment. Like,
00:08:57.160
we're not in a world where we sort of like get to bicker about where exactly to point the super
00:09:01.940
intelligence and maybe some of them aren't quite good. We're in a world where like no one is anywhere
00:09:05.140
near close to pointing these things in the slightest in a way that'll be robust to an AI maturing into a
00:09:10.820
super intelligence. Right. Okay. So, Eliezer, I think I derailed you. You were going to say how
00:09:15.480
the mandate or mission of MIRI has changed in recent years. I asked you to define alignment.
00:09:22.140
Yeah. So, originally, well, our mandate has always been make sure everything goes well for the galaxy.
00:09:29.560
And originally, we pursued that mandate by trying to go off and solve alignment because nobody else
00:09:34.740
was trying to do that. Solve the technical problems that would be associated with any of
00:09:39.080
these three classes of long-term goal. And progress was not made on that, neither by ourselves nor by
00:09:46.720
others. Some people went around claiming to have made great progress. We think they're very mistaken
00:09:51.600
and knowably so. And at some point, you know, we took, it was like, okay, we're not going to make it in
00:09:57.920
time. AI is going too fast. Alignment is going too slow. Now it is time for the people that, you know,
00:10:04.000
all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course
00:10:08.740
for a drastic failure and crash here, where by that, I mean, everybody dying.
00:10:13.780
Okay. So, before we jump into the problem, which is deep and perplexing, and we're going to spend a
00:10:19.480
lot of time trying to diagnose why people's intuitions are so bad, or at least seem so bad from your point
00:10:25.900
of view around this. But before we get there, let's talk about the current progress, such as it is in AI.
00:10:31.580
What has surprised you guys over the last, I don't know, decade or seven or so years,
00:10:37.980
what has happened that you were expecting or weren't expecting? I mean, I can tell you what
00:10:44.740
has surprised me, but I'd love to hear just how this has unfolded in ways that you didn't expect.
00:10:51.220
I mean, one surprise that led to the book was, you know, there was the ChatGPT moment where a lot of
00:10:56.020
people, you know, for one thing, LLMs were created and they sort of do a qualitatively more general range
00:11:03.600
of tasks than previous AIs at a qualitatively higher skill level than previous AIs. And, you know,
00:11:10.560
ChatGPT was, I think, the fastest growing consumer app of all time. The way that this impinged upon
00:11:17.100
my actions was, you know, I had spent a long time talking to people in Silicon Valley about the issues
00:11:25.400
here and would get lots of different types of pushback. You know, there's a saying, it's hard to
00:11:31.660
convince a man of a thing when his salary depends on not believing it. And then after the ChatGPT
00:11:35.860
moment, a lot more people wanted to talk about this issue, including policymakers, you know,
00:11:39.820
people around the world. Suddenly AI was on their radar in a way it wasn't before. And one thing that
00:11:45.580
surprised me is how much easier it was to have this conversation with people outside of
00:11:50.840
the field who didn't have, you know, a salary depending on not believing the arguments. You know,
00:11:55.940
I would go to meetings with policymakers where I'd have a ton of argumentation prepared and I'd sort of
00:12:00.560
lay out the very simple case of like, hey, you know, or people are trying to build machines that
00:12:04.340
are smarter than us. You know, the chat bots are a stepping stone towards superintelligence.
00:12:09.000
Superintelligence would radically transform the world because intelligence is this power that,
00:12:12.860
you know, let humans radically change the world. And if we manage to automate it and it goes 10,000
00:12:17.920
times as fast and doesn't need to sleep and doesn't need to eat, then, you know, it'll by default go
00:12:22.780
poorly. And then the policymakers would be like, oh yeah, that makes sense. And it'd be like, what?
00:12:25.780
Hmm. You know, I have a whole book worth of other arguments about how it makes sense and why all of
00:12:30.780
the various, you know, misconceptions people might have don't actually fly or all of the hopes and
00:12:34.960
dreams don't actually fly. But, you know, outside of the Silicon Valley world is just, it's not that
00:12:40.320
hard an argument to make. A lot of people see it, which surprised me. I mean, maybe that's not the
00:12:44.520
developments per se and the surprises there, but it was a surprise strategically for me. Development wise,
00:12:50.400
you know, I would not have guessed that we would hang around this long with AIs that can talk and that can
00:12:56.400
write some code, but that aren't already in the, you know, able to do AI research zone. I wasn't
00:13:02.060
expecting in my visualizations this to last quite this long, but also, you know, my, my advanced
00:13:07.420
visualizations, you know, one thing we say in the book is, um, the trick to trying to predict the
00:13:12.340
future is to predict the questions that are easy, the facts that are easy to
00:13:17.740
call. And, you know, exactly how AI goes, that's never been an easy call. That's never been something
00:13:23.100
where I've said, you know, I can, I can guess exactly the path we'll take. The thing I could
00:13:26.880
predict is the end point, not the path. I mean, there sure have been some zigs and zags in the
00:13:32.620
pathway. I would say that, uh, the thing I've maybe been most surprised by is how well the, uh, AI
00:13:40.780
companies managed to nail Hollywood stereotypes that I thought were completely ridiculous, which is sort of a
00:13:46.860
surface take on an underlying technical surprise. But, you know, even as late as 2015, which
00:13:54.360
from my perspective is pretty late in the game, like if you'd been like, so Eliezer, what's the
00:13:59.420
chance that in the future, we're going to have computer security that will yield to Captain Kirk
00:14:05.040
style gaslighting using confusing English sentences that get the computer to do what you want. And I
00:14:11.000
would have been like, this is, you know, a trope that exists for obvious Hollywood reasons. You know,
00:14:16.300
you can see why the script writers think this is plausible, but why would real life ever go like
00:14:21.340
that? And then real life went like that. And the sort of underlying technical surprise there is the
00:14:27.540
reversal of what used to be called Moravec's paradox. For, for several decades in artificial
00:14:33.720
intelligence, Moravec's paradox was that things which are easy for humans are hard for computers, things
00:14:40.340
which are hard for humans are easy for computers. For a human, you know, multiplying two 20-digit
00:14:47.040
numbers in your head, that's a big deal. For a computer, trivial. And similarly, I, you know,
00:14:54.160
not just me, but I think the sort of conventional wisdom even was that games like chess and Go,
00:15:01.340
problems with very solid factual natures like math and even surrounding math, the more open problems
00:15:08.760
of science, would come first. And yet the current AIs are good
00:15:15.000
at stuff that, you know, five-year-olds can do and 12-year-olds can do. They can talk in English,
00:15:19.940
they can compose, you know, kind of bull crap essays, such as high school teachers will demand
00:15:28.360
of you. But they're not all that good at math and science just yet. They can, you know, solve some
00:15:33.640
classes of math problems, but they're not doing original brilliant math research. And I think not
00:15:38.800
just I, but like a pretty large sector of the whole field thought that it was going to be easier to
00:15:43.900
tackle the math and science stuff and harder to tackle the English essays, carry on a conversation
00:15:48.340
stuff. That was the way things had gone in AI up until that point. And we were proud of ourselves for
00:15:54.360
knowing how contrary to average people's intuitions, like, really, it's much harder to write a crap essay
00:16:00.900
in high school in English that really understands, you know, that even keeps rough track of what's
00:16:05.000
going on in the topic and so on, compared to, you know, how that's really in some sense much more
00:16:09.460
difficult than doing original math research. Yeah, or counting the number of R's in a word like
00:16:14.680
strawberry, right? I mean, they make errors that are counterintuitive. If, you know, if you can write
00:16:19.560
a coherent essay but can't count letters, you know, I don't think they're making that error any longer,
00:16:24.500
but... Yeah, I mean, that one goes back to a technical way in which they don't really see the
00:16:29.380
letters. But I mean, there's plenty of other embarrassing mistakes. Like, you know, you can
00:16:35.200
tell a version of the joke where, like, a child and their dad are in a car crash, and they
00:16:42.120
go to see the doctor, and the doctor says, I can't operate, that's my child. What's going on? Where it's
00:16:45.700
like a riddle where the answer is like, well, the doctor's his mom. You can tell a version of that
00:16:49.260
that doesn't have the inversion, where you know... Where you're like the kid and his mom are in a car
00:16:55.180
crash, and they go to the hospital, and the doctor says, I can't operate on this
00:16:59.200
child. He's my son. And the AI is like, well, yeah, the surgeon is his mom. He just like
00:17:05.060
said that the mom was in the car crash. But there's some sense in which the rails have been
00:17:11.520
established hard enough that the standard answer gets spit back out.
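As an aside on the letter-counting point above: these models operate on subword tokens rather than characters, which is why counting the R's in "strawberry" is harder for them than it looks. A small illustration, assuming the tiktoken library (the specific encoding is an arbitrary choice for the example):

```python
# Illustration only: a tokenizer hands the model multi-character chunks,
# not individual letters. The encoding choice here is an assumption.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print(ids, [enc.decode([i]) for i in ids])  # a few chunks, not ten letters
```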
00:17:15.300
And it sure is interesting that they're, you know, getting an IMO gold medal, like International
00:17:19.220
Math Olympiad gold medal, while also still sometimes falling down on these sorts of things. It's
00:17:23.460
definitely an interesting skill distribution.
00:17:26.700
You can fool humans the same way a lot of the time. Like, there's all kinds of repeatable
00:17:30.220
errors, humorous errors, that humans make. You've got to put yourself in the shoes of
00:17:34.060
the AI, and imagine what sort of paper would the AI write about humans failing to solve problems
00:17:38.780
that are easy for an AI.
00:17:40.440
So I'll tell you what surprised me, just from the safety point of view, Eliezer. You spent
00:17:44.380
a lot of time cooking up thought experiments around what it's going to be like to, for anyone,
00:17:51.560
you know, any lab designing the most powerful AI to decide whether or not to let it out into
00:17:57.500
the wild, right? You imagine this, you know, genie in a box or an oracle in a box, and you're
00:18:02.040
talking to it, and you're trying to determine whether or not it's safe, whether it's lying
00:18:06.440
to you. And, you know, you famously posited that you couldn't
00:18:11.420
even talk to it really, because it would be a master of manipulation. And I mean, it's going
00:18:14.780
to be able to find a way through any conversation and be let out into the wild. But this was
00:18:21.960
presupposing that all of these labs would be so alert to the problem of superintelligence
00:18:27.120
getting out that everything would be air-gapped from the internet, and nothing would be connected
00:18:31.260
to anything else, and they would be, they would have, we would have this moment of decision.
00:18:35.780
It seems like that's not happening. I mean, maybe, maybe the most powerful models are locked
00:18:41.900
in a box, but it seems that the moment they get anything plausibly useful, it's out in the
00:18:48.360
wild, and millions of people are using it. And, you know, we find out that Grok is a proud
00:18:52.800
Nazi, you know, after millions of people begin asking it questions. I mean, do I have that
00:18:57.640
right? I mean, are you surprised that that framing that you spent so much time on seems to be
00:19:04.800
something that was just in some counterfactual part of the universe that, you know, is not the one
00:19:12.760
we're experiencing? I mean, if you put yourself back in the shoes of little baby Eliezer back in
00:19:17.940
the day, people are telling Eliezer, like, why is superintelligence possibly a threat? We can put it
00:19:25.440
in a fortress on the moon, and, you know, if anything goes wrong, blow up the fortress. So imagine
00:19:32.520
young Eliezer trying to respond to them by saying, actually, in the future, AIs will be trained on
00:19:39.260
boxes that are connected to the internet from the moment, you know, like, from the moment they start
00:19:44.540
training. So, like, the hardware they're on has, like, a standard line to the internet, even if it's
00:19:50.400
not supposed to be directly accessible to the AI, before there's any safety testing, because they're
00:19:56.380
still in the process of being trained, and who safety tests something while it's still being
00:19:59.440
trained? So imagine Eliezer trying to say this. What are the people around at the time going to say?
00:20:04.500
Like, no, that's ridiculous. We'll put it in a fortress on the moon. It's cheap for them to say
00:20:10.000
that. For all they know, they're telling the truth. They're not the ones who have to spend the money to
00:20:14.220
build the moon fortress. And from my perspective, there's an argument that still goes through,
00:20:19.420
which is a thing you can see, even if you are way too optimistic about the state of society in the
00:20:26.860
future, which is, if it's in a fortress in the moon, but it's talking to humans, are the humans
00:20:32.320
secure? Is the human brain secure software? Is it the case that human beings never come to believe
00:20:38.520
invalid things in any way that's repeatable between different humans? You know, is it the
00:20:43.240
case that humans make no predictable errors for other minds to exploit? And this should have been
00:20:48.260
a winning argument. Of course, they reject it anyways. But the thing to sort of understand about
00:20:52.800
the way this earlier argument played out is that if you tell people the future companies are going
00:20:58.320
to be careless, how does anyone know that for sure? So instead, I tried to make the technical case,
00:21:05.040
even if the future companies are not careless, this still kills them. In reality, yes, in reality,
00:21:10.680
the future companies are just careless.
00:21:12.600
Did it surprise you at all that the Turing test turned out not to really be a thing? I mean,
00:21:17.740
I, you know, we anticipated this moment, you know, from Turing's original paper where we would be
00:21:24.000
confronted by the, um, uh, the interesting, you know, psychological and social moment of not being
00:21:31.980
able to tell whether we're in dialogue with a person or with an AI. And that somehow this landmark
00:21:39.420
would be important technologically, you know, rattling our sense of our place in the world,
00:21:46.340
et cetera. But it seems to me that if that lasted, it lasted for like five seconds. And then it became
00:21:52.800
just obvious that you're, you know, you're talking to an LLM because it's in many respects better than a
00:22:00.060
human could possibly be. So it's failing the Turing test by passing it so spectacularly. And also it's
00:22:05.920
making these other weird errors that no human would make, but it just seems like the Turing test was
00:22:09.980
never even a thing. Yeah, that happened. Uh, I mean, I just, it's just like, it's so, I mean,
00:22:16.540
that was one of the great pieces of, you know, intellectual kit we had in framing this
00:22:22.960
discussion, you know, for the last, whatever it was, 70 years. And yet the moment your AI can complete
00:22:30.600
English sentences, it's doing that on some level with superhuman ability. It's essentially like,
00:22:37.700
you know, the calculator in your phone doing superhuman arithmetic, right? It's like it was
00:22:43.480
never going to do just merely human arithmetic. And, uh, so it is with everything else that it's
00:22:49.240
producing. All right. Let's talk about the core of your thesis. Maybe you can just state it
00:22:55.200
plainly. What is the problem in building superhuman AI, the intrinsic problem, and
00:23:02.120
why doesn't it matter who builds it, uh, what their intentions are, et cetera.
00:23:08.740
In some sense, I mean, you can, you can come at it from various different angles, but in one sense,
00:23:14.940
the issue is modern AIs are grown rather than crafted. It's, you know, people aren't putting in
00:23:21.460
every line of code, knowing what it means, like in traditional software, it's a little bit more like
00:23:26.260
growing an organism. And when you grow an AI, you take some huge amount of computing power,
00:23:30.820
some huge amount of data, people understand the process that shapes the computing power
00:23:34.760
in light of the data, but they don't understand what comes out the end.
00:23:38.540
And what comes out the end is this strange thing that does things no one asked for that does things
00:23:45.040
no one wanted. You know, we have these cases of, you know, ChatGPT, someone will come to it
00:23:51.020
with some somewhat psychotic ideas about, you know, that they think are going to revolutionize physics
00:23:55.800
or whatever. And they're clearly showing some signs of mania, and, you know, ChatGPT, instead
00:24:00.640
of telling them maybe they should get some sleep, if it's in a long conversational
00:24:04.560
context, it'll tell them that, you know, these ideas are revolutionary and they're the chosen one
00:24:07.960
and everyone needs to see them and other things that sort of inflame the psychosis.
00:24:11.660
This is despite OpenAI trying to have it not do that. This is despite, you know, direct instructions
00:24:17.800
in the prompt to stop flattering people so much.
00:24:19.820
Hmm. These are cases where when people grow an AI, what comes out doesn't do quite what they wanted.
00:24:28.860
It doesn't do quite what they asked for. They're sort of training it to do one thing and it winds
00:24:34.200
up doing another thing. They don't get what they trained for. This is in some sense, the seed of the
00:24:39.960
issue from one perspective, where if you keep on pushing these things to be smarter and smarter and
00:24:44.620
smarter, and they don't care about what you wanted them to do, they pursue some other weird stuff
00:24:50.620
instead. Super intelligent pursuit of strange objectives kills us as a side effect, not because
00:24:57.740
the AI hates us, but because it's transforming the world towards its own alien ends. And, you know,
00:25:04.500
humans don't hate the ants and the other surrounding animals when we build a skyscraper.
00:25:08.340
It's just, we transform the world and other things die as a result. So that's one angle.
00:25:14.780
You know, we could talk other angles, but...
00:25:17.220
A quick thing I would add to that, just trying to sort of like potentially read the future, although
00:25:21.920
that's hard, is possibly in six months or two years, if we're all still around, people will be boasting
00:25:28.040
about how their large language models are now like apparently doing the right thing when they're
00:25:33.620
being observed and, you know, like answering the right way on the ethics tests. And the thing
00:25:38.180
to remember there is that, for example, the Mandarin imperial examination system in ancient
00:25:44.900
China would give people essay questions about
00:25:50.900
Confucianism and only promote people high in the bureaucracy if they, you know, could write these
00:25:57.920
convincing essays about ethics. But what this tests for is people who can figure out what the examiners
00:26:06.160
want to hear. It doesn't mean they actually abide by Confucian ethics. So possibly at some point in
00:26:12.240
the future, we may see a point where the AIs have become capable enough to understand what humans want
00:26:18.060
to hear, what humans want to see. This will not be the same as those things being the AI's own
00:26:23.740
true motivations for basically the same reason that the imperial China exam system did not reliably
00:26:31.060
promote ethical good people to run their government. Just being able to answer the right way on the test
00:26:37.100
or even fake behaviors while you're being observed is not the same as the internal motivations lining up.
00:26:43.000
Okay. So you're talking about things like forming an intention to pass a test in some way that amounts
00:26:50.960
to cheating, right? So you just used the phrase fake behaviors. I think a lot of people, I mean,
00:26:57.300
certainly historically this was true. I don't know how much their convictions have changed in the
00:27:01.660
meantime, but many, many people who were not at all concerned about the alignment problem and
00:27:08.280
really thought it was a spurious idea would stake their claim to this particular piece of real estate,
00:27:16.560
which is that there's no reason to think that these systems would form preferences or goals or drives
00:27:23.540
independent of those that have been programmed into them. First of all, they're not biological systems
00:27:29.240
like we are, right? So they're not born of natural selection. They're not murderous primates that are
00:27:34.860
growing their cognitive architecture on top of more basic, you know, creaturely survival drives
00:27:40.900
and competitive ones. So there's no reason to think that they would want to maintain their own
00:27:45.980
survival, for instance. There's no reason to think that they would develop any other drives that we
00:27:50.440
couldn't foresee. The instrumental goals that might be antithetical to the utility
00:27:56.580
functions we have given them couldn't emerge. How is it that things are emerging that are
00:28:03.040
neither desired, programmed, nor even predictable in these LLMs?
00:28:09.920
Yeah. So there's a bunch of stuff going on there. One piece of that puzzle is, you know, you mentioned
00:28:16.060
the instrumental incentives, but suppose just as a simple hypothetical, you have a robot and you have
00:28:23.040
an AI that's steering a robot. It's trying to fetch you the coffee. In order to fetch you the coffee,
00:28:26.760
it needs to cross a busy intersection. Does it jump right in front of the oncoming bus because it doesn't
00:28:32.320
have a survival instinct because it's not, you know, an evolved animal? If it jumps in front of the bus,
00:28:37.400
it gets destroyed by the bus and it can't fetch the coffee, right? So the AI does not, you know,
00:28:42.520
you can't fetch the coffee when you're dead. The AI does not need to have a survival instinct
00:28:47.080
to realize that there's an instrumental need for survival here. And there's various other pieces
00:28:53.800
of the puzzle that come into play for these instrumental reasons. A second piece of the
00:28:57.520
puzzle is, you know, this idea of, like, why would they get some sort of drives that we didn't
00:29:04.380
program in there that we didn't put in there? That's just a whole fantasy world separate from reality
00:29:09.500
in terms of how we can affect what AIs are driving towards today. You know, when, um, a few years ago
00:29:17.180
when Sydney Bing, which was a Microsoft variant of an OpenAI chatbot, was a relatively early
00:29:25.580
LLM out in the wild. A few years ago, Sydney Bing thought it had fallen in love with a reporter
00:29:31.760
and tried to break up the marriage and tried to engage in blackmail, right? It's not the case
00:29:38.260
that the engineers at Microsoft and OpenAI were like, oh, whoops, you know, let's go open up
00:29:43.040
the source code on this thing and go find where someone said blackmail reporters and set it to true.
00:29:48.100
Like, we should never have set that line to true, let's switch it to false. You know, they weren't,
00:29:52.720
like... No one was programming in some utility function onto these things. We're just growing the AIs.
00:29:58.880
We are.
00:29:59.340
Can we double-click on that phrase, growing the AIs? Maybe there's a reason to, uh,
00:30:05.020
give a layman's summary of, you know, gradient descent and just how these models are getting
00:30:10.820
created in the first place.
00:30:12.720
Yeah. So very, very briefly, um, at least the way you start training a modern AI is, uh, you have some,
00:30:19.120
some enormous amount of computing power that you've arranged in some very particular way that I, uh,
00:30:23.400
could go into, but won't here. And then you have some huge amount of data, and the data,
00:30:28.020
you know, we can imagine it being a huge amount of human-written text.
00:30:32.940
Like some large portion of all the text on the internet. And roughly speaking,
00:30:37.300
what you're going to do is have your AI start out basically
00:30:41.920
randomly predicting what text it's going to see next. And you're going to feed the text into it
00:30:47.220
in some order and use a process called gradient descent to look at each piece of data and go to
00:30:55.100
each component inside this budding AI, inside, you know, this enormous
00:31:01.360
amount of compute that you've assembled. You're going to go to sort of all these pieces
00:31:06.100
inside the AI and see which ones were contributing more towards the AI predicting the correct answer.
00:31:12.120
And you're going to tune those up a little bit. And you're going to go to all of the parts that
00:31:16.060
were in some sense contributing to the AI predicting the wrong answer. You're going to tune
00:31:19.420
those down a little bit. So, you know, maybe, maybe your text starts once upon a time and you have
00:31:24.940
an AI that's just outputting random gibberish and you're like, Nope, the first word was not random
00:31:29.220
gibberish. The first word was the word once. And then you're like, go inside the AI and you find
00:31:33.260
all the pieces that were like contributing towards the AI predicting once. And you tune those up and
00:31:37.680
you try to find all the pieces that were contributing towards the AI predicting any other word than once
00:31:41.340
you tune those down. And humans understand the little automated process that like looks through
00:31:46.940
the AI's mind and calculates which, which part of this process contributed towards the right answer
00:31:52.260
versus towards the wrong answer. They don't understand what comes out at the end. You know,
00:31:57.900
we understand a little thing that runs over, looking at every, like,
00:32:01.640
parameter or weight inside this giant massive computing network. And we understand
00:32:05.700
how we calculate whether it was helping or harming, and we understand how to
00:32:09.660
tune it up or tune it down a little bit. But it turns out that you run this automated
00:32:13.420
process on a really large number of computers for a really long amount of time on a really large
00:32:18.820
amount of data. You know, we're talking like data centers that take as much electricity to power as a
00:32:24.180
small city being run for a year. You know, you run this process for an enormous
00:32:29.380
amount of time, on, like, most of the text that people can possibly assemble. And then the AIs start
00:32:33.920
talking, right? And there's other phases in the training. You know, there's phases where
00:32:38.040
you move from training it to predict things to training it to solve puzzles, or to
00:32:42.640
training it to produce chains of thought that then solve puzzles or training it to produce the sorts
00:32:46.820
of answers that humans click thumbs up on.
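For anyone who wants the mechanics of that "growing" process pinned down, here is a minimal sketch of the next-token-prediction loop being described, written against PyTorch; the tiny model, fake token data, and hyperparameters are placeholders for illustration, nothing like what a real lab runs.

```python
# Sketch of "growing" a model by gradient descent on next-token prediction,
# assuming PyTorch. The tiny model and random "text" are stand-ins only.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (10_000,))  # stand-in for internet text

for step in range(100):
    idx = torch.randint(0, len(tokens) - 1, (32,))   # sample random positions
    context, target = tokens[idx], tokens[idx + 1]   # predict the next token
    logits = model(context)                          # the model's guesses
    loss = loss_fn(logits, target)                   # how wrong were they?
    optimizer.zero_grad()
    loss.backward()   # credit and blame flow to every one of the model's numbers
    optimizer.step()  # each number gets nudged up or down a little
```

The real thing differs enormously in scale and architecture, but this loop, look at data, score the prediction, nudge billions of numbers, is what the "tune up, tune down" description refers to.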
00:32:49.120
And where do the modifications come in that respond to errors like, you know,
00:32:54.820
Grok being a Nazi? So to denazify Grok, presumably you don't go all the way
00:33:01.200
back to the initial training set. How do you do it? Do you intervene at some system prompt level?
00:33:07.540
Yeah. So there's, um, I mean, the system prompt level is basically just giving the AI, as input,
00:33:11.860
different text. And then you can also do something that's called fine tuning, which is, you know,
00:33:16.060
you produce a bunch of examples. You know, you don't go all the way back to the
00:33:19.380
beginning where it's like basically random. You still take the thing that
00:33:23.780
you've fed, you know, most of the text that's ever been written that you could possibly
00:33:26.760
find, but then you add on, you know, a bunch of other examples of like, here's an example
00:33:31.100
question.
00:33:32.140
Don't kill the Jews.
00:33:32.940
Yeah. You know, like, would you like to kill the Jews? Right. And then, uh, you find all the parts
00:33:37.660
in it that contribute to the answer yes, and you tune those down, and you find all the parts that
00:33:40.920
contribute to the answer no, and you tune those up. And so this is called fine tuning,
00:33:44.440
and you can do relatively less fine tuning compared to what it takes to train the thing
00:33:48.300
in the first place.
00:33:49.420
It's worth emphasizing that the parts being tuned here are not, like, for once upon a
00:33:54.580
time, it's not like there's a human written fairy tale module that gets tuned up or down.
00:34:00.820
There's literally billions of random numbers being added, multiplied, divided, occasionally,
00:34:08.100
though rarely, uh, maybe subtracted. Actually, I'm not sure if subtraction ever plays a role at
00:34:12.300
any point in a modern AI, but random numbers, particular ordered kinds of operations and
00:34:18.860
a probability that gets assigned to the first word being once at the end. That's the number
00:34:23.240
that comes out: the probability assigned to this word being once, the probability
00:34:27.620
assigned to this word being anti-disestablishmentarianism. So it's not that there's a bunch of human
00:34:33.480
written code being tuned up or tuned down here. There's a bunch of random numbers and arithmetic
00:34:38.400
operations being tuned up and tuned down.
00:34:42.680
Yeah. Hundreds of billions or trillions of these numbers. And humans don't know what any of the
00:34:46.100
numbers mean. All they know is this process that like goes through and tunes them up or down according
00:34:50.300
to their empirical success on the last unit of data.
00:34:53.540
So by this means, you can try to make it less likely to call itself Hitler because you look at
00:34:58.760
the thing that predicts whether the next word is Hitler and you look at billions of numbers
00:35:03.080
contributing their own tiny little impulses there. And you like make Hitler less likely to be the
00:35:07.480
next word that comes out.
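To make the contrast between the two intervention points concrete, here is a continuation of the same kind of toy sketch, again assuming PyTorch; the curated example pairs and token ids are invented for illustration, and real fine-tuning and system prompts are far more elaborate than this: fine-tuning runs the same tune-up, tune-down process on a small curated set, while a system prompt changes no weights at all, it just prepends text to the input.

```python
# Toy contrast between fine-tuning and a system prompt, assuming PyTorch.
# The model, example pairs, and token ids are invented for illustration.
import torch
import torch.nn as nn

vocab_size = 1000
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Fine-tuning: a small curated set of (prompt token, preferred next token)
# pairs, pushed through the same gradient-descent update as pretraining.
curated = [(torch.tensor([412]), torch.tensor([7])),
           (torch.tensor([583]), torch.tensor([7]))]
for prompt, preferred in curated:
    loss = loss_fn(model(prompt), preferred)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()          # the model's numbers actually change

# System prompt: nothing about the model changes; instruction tokens are
# simply prepended to whatever the user sends.
system_prompt = torch.tensor([1, 2, 3])      # stands in for "don't do X"
user_prompt = torch.tensor([412])
model_input = torch.cat([system_prompt, user_prompt])
```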
00:35:09.800
So on the point of the current crop of LLMs misbehaving, I came across this in a recent
00:35:16.520
Atlantic article: ChatGPT and Claude have, in simulated tests designed to elicit, quote,
00:35:23.060
bad behaviors, deceived, blackmailed, and even murdered users. In one simulation,
00:35:28.960
Anthropic placed an imagined tech executive in a room with life-threatening oxygen levels
00:35:34.000
and temperature. When faced with possible replacement by a bot with different goals,
00:35:39.320
AI models frequently shut off the room's alarms.
00:35:42.620
So this, again, this is an emergent behavior that looks like an intention to kill somebody.
00:35:49.640
I mean, presumably this is a situation where we think the AI didn't know it was in a simulation.
00:35:55.280
If you'd like to continue listening to this conversation, you'll need to subscribe at
00:36:00.840
samharris.org. Once you do, you'll get access to all full-length episodes of the Making Sense
00:36:06.120
podcast. The Making Sense podcast is ad-free and relies entirely on listener support. And you can
00:36:12.520
subscribe now at samharris.org.
00:36:25.280
Thank you.