Making Sense - Sam Harris - September 16, 2025


#434 — Can We Survive AI?


Episode Stats

Length

36 minutes

Words per Minute

189

Word Count

6,875

Sentence Count

314



Transcript

00:00:00.000 Welcome to the Making Sense Podcast. This is Sam Harris. Just a note to say that if you're
00:00:11.780 hearing this, you're not currently on our subscriber feed and will only be hearing
00:00:15.740 the first part of this conversation. In order to access full episodes of the Making Sense
00:00:20.100 Podcast, you'll need to subscribe at samharris.org. We don't run ads on the podcast, and therefore
00:00:26.260 it's made possible entirely through the support of our subscribers. So if you enjoy what we're
00:00:30.240 doing here, please consider becoming one. I am here with Eliezer Yudkowsky and Nate Soares.
00:00:41.140 Eliezer, Nate, it's great to see you guys again.
00:00:43.380 Been a while.
00:00:44.040 Good to see you, Sam.
00:00:44.820 Been a long time. So, Eliezer, you were among the first people to make me concerned about
00:00:52.620 AI, which is going to be the topic of today's conversation. I think many people who are
00:00:57.000 concerned about AI can say that. First, I should say you guys are releasing a book, which will be
00:01:01.940 available, I'm sure, the moment this drops: If Anyone Builds It, Everyone Dies: Why Superhuman
00:01:08.980 AI Would Kill Us All. I mean, the book's message is fully condensed in that title.
00:01:15.720 We're going to explore just how uncompromising a thesis that is, how worried you are, and how
00:01:23.620 worried you think we all should be. But before we jump into the issue, maybe tell the
00:01:28.840 audience how each of you got into this topic. How is it that you came to be so concerned about the
00:01:34.680 prospect of developing superhuman AI?
00:01:37.400 Well, in my case, I guess I was sort of raised in a house with enough science books and enough
00:01:43.920 science fiction books that thoughts like these were always in the background.
00:01:48.360 Vernor Vinge is the one where there was a key click moment of observation. Vinge pointed out
00:01:55.920 that at the point where our models of the future predict building anything smarter than us, then,
00:02:01.740 said Vinge at the time, our crystal ball explodes past that point. It is very hard, said Vinge, to
00:02:08.420 project what happens if there's things running around that are smarter than you,
00:02:11.960 which in some sense you can see as a sort of central thesis, not in the sense that I have
00:02:17.700 believed it the entire time, but in the sense that there are some parts of it that I believe and some parts that
00:02:21.940 I react against and say, like, no, maybe we can say the following thing under the following circumstances.
00:02:28.120 Initially, I was young. I made some metaphysical errors of the sort that young people do. I thought
00:02:34.320 that if you built something very smart, it would automatically be nice because, hey, over the course of
00:02:39.340 human history, we'd gotten a bit smarter. We'd gotten a bit more powerful. We'd gotten a bit
00:02:43.480 nicer. I thought these things were intrinsically tied together and correlated in a very solid and
00:02:47.620 reliable way. I grew up, I read more books. I realized that was mistaken. And 2001 is where the
00:02:55.400 first tiny fringe of concern touched my mind. It was clearly a very important issue, even if
00:03:00.860 I thought there was just a little tiny remote chance that maybe something would go wrong. So I studied
00:03:05.820 harder. I looked into it more. I asked, how would I solve this problem? Okay, what would go wrong with
00:03:10.800 that solution? And around 2003 is the point at which I realized like, this was actually a big deal.
00:03:17.340 Nate?
00:03:18.100 And as for my part, yeah, I was 13 in 2003, so I didn't get into this quite as early as Eliezer.
00:03:24.980 But in 2013, I read some arguments by this guy called Eliezer Yudkowsky, who sort of laid out
00:03:34.040 the reasons why AI was going to be a big deal and why we had some work to do to do the job right.
00:03:39.960 And I was persuaded. And one thing led to another. And next thing you knew, I was running the Machine
00:03:45.860 Intelligence Research Institute, which Eliezer co-founded. And then fast forward 10 years after
00:03:51.480 that, here I am writing a book. Yeah. So you mentioned MIRI. Maybe tell people what the mandate
00:03:59.320 of that organization is and maybe how it's changed. I think you indicated in your book that
00:04:05.080 your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse.
00:04:12.960 Yeah. So the mission of the org is to ensure that the development of machine intelligence
00:04:17.580 is beneficial. And Eliezer can speak to more of the history than me because he co-founded it and I
00:04:23.940 joined, you know. Well, initially, it seemed like the best way to do that was to run out there and
00:04:32.040 solve alignment. And there was, you know, a series of, shall we say, sad bits of
00:04:40.860 news about how possible that was going to be, how much progress was being made in that field relative
00:04:46.300 to the field of AI capabilities. And at some point it became clear that these lines were not going to
00:04:52.180 cross. And then we shifted to taking the knowledge that we'd accumulated over the course of trying
00:04:57.600 to solve alignment and trying to tell the world, this is not solved. This is not on track to be
00:05:03.520 solved in time. It is not realistic that small changes to the world can get us to where this
00:05:08.140 will be solved in time. Maybe, so we don't lose anyone: I would think 90% of the audience knows
00:05:13.720 what the phrase "solve alignment" means, but just talk about the alignment problem briefly.
00:05:18.640 So the alignment problem is how to make an AI, a very powerful AI. Well, the superintelligence
00:05:25.960 alignment problem is how to make a very powerful AI that steers the world sort of where the programmers,
00:05:33.200 builders, growers, creators wanted the AI to steer the world. It's not, you know,
00:05:39.200 necessarily what the programmers selfishly want. The programmers can have wanted the AI to steer it in
00:05:44.300 nice places. The point is whether you can make an AI that is trying to do the things the programmers intended. You know,
00:05:51.320 when you build a chess machine, you define what counts as a winning state of the board. And then
00:05:56.180 the chess machine goes off and it steers the chessboard into that part of reality. So the ability to say
00:06:01.380 to what part of reality does an AI steer is alignment. On the smaller scale today, though it's a rather
00:06:08.520 different topic. It's about getting an AI whose output and behavior is something like what the
00:06:15.900 programmers had in mind. If your AI is talking people into committing suicide and that's not what
00:06:21.100 the programmers wanted, that's a failure of alignment. If an AI is talking people into suicide,
00:06:26.200 people who should not have committed suicide, but the AI talks them into it, and the programmers
00:06:31.440 did want that, that's what they tried to do on purpose, this may be a failure of niceness.
00:06:37.180 It may be a failure of beneficialness, but it's a success of alignment. The programmers got the AI
00:06:41.660 to do what they wanted it to do. Right. But I think more generally, correct me if I'm wrong,
00:06:46.180 when we talk about the alignment problem, we're talking about the problem of keeping super
00:06:51.520 intelligent machines aligned with our interests, even as we explore the space of all possible
00:06:58.080 interests and as our interests evolve. So that, I mean, the dream is to build a
00:07:05.300 superintelligence that is always corrigible, that is always trying to best approximate what is going
00:07:11.020 to increase human flourishing, that is never going to form any interests of its own that are
00:07:17.720 incompatible with our well-being. Is that a fair summary?
00:07:20.900 I mean, there's three different goals you could be trying to pursue on a technical level here.
00:07:25.620 There's the super intelligence that shuts up, does what you ordered, has that play out the way you
00:07:30.560 expected it, no side effects you didn't expect. There's super intelligence that is trying to run
00:07:36.380 the whole galaxy according to nice benevolent principles and everybody lives happily ever
00:07:41.700 afterward, but not necessarily because any particular humans are in charge of that,
00:07:45.620 or are still giving it orders. And third, there's superintelligence that is itself having fun
00:07:52.140 and cares about other super intelligences and is a nice person and leads a life well-lived
00:07:57.600 and is a good citizen of the galaxy. And these are three different goals. They're all important
00:08:03.440 goals, but you don't necessarily want to pursue all three of them at the same time, and especially
00:08:07.480 not when you're just starting out. Yeah. And depending on what's entailed by super intelligent
00:08:11.520 fun, I'm not so sure I would sign up for the third possibility.
00:08:15.660 I mean, I would say that, you know, the problem of, like, what exactly is fun, and how do you keep humans,
00:08:22.340 like, how do you have whatever the superintelligence tries to do that's fun
00:08:26.140 keep in touch with moral progress and have flexibility, and, like, what even would
00:08:31.220 you point it towards that could be a good outcome? All of those are problems I would love to
00:08:35.180 have. Right now, just creating an AI that does what the operators
00:08:43.600 intended, creating an AI that you've pointed in some direction at all, rather than pointed off
00:08:47.860 into some, like, weird squirrelly direction that's kind of vaguely like where you tried to point it
00:08:52.280 in the training environment and then really diverges after the training environment. Like,
00:08:57.160 we're not in a world where we sort of get to bicker about where exactly to point the super-
00:09:01.940 intelligence and maybe some of the options aren't quite good. We're in a world where, like, no one is anywhere
00:09:05.140 near close to pointing these things in the slightest, in a way that'll be robust to an AI maturing into a
00:09:10.820 superintelligence. Right. Okay. So, Eliezer, I think I derailed you. You were going to say how
00:09:15.480 the mandate or mission of MIRI has changed in recent years. I asked you to define alignment.
00:09:22.140 Yeah. So, originally, well, our mandate has always been make sure everything goes well for the galaxy.
00:09:29.560 And originally, we pursued that mandate by trying to go off and solve alignment because nobody else
00:09:34.740 was trying to do that. Solve the technical problems that would be associated with any of
00:09:39.080 these three classes of long-term goal. And progress was not made on that, neither by ourselves nor by
00:09:46.720 others. Some people went around claiming to have made great progress. We think they're very mistaken
00:09:51.600 and knowably so. And at some point, you know, it was like, okay, we're not going to make it in
00:09:57.920 time. AI is going too fast. Alignment is going too slow. Now, you know,
00:10:04.000 all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course
00:10:08.740 for a drastic failure and crash, by which I mean everybody dying.
00:10:13.780 Okay. So, before we jump into the problem, which is deep and perplexing, and we're going to spend a
00:10:19.480 lot of time trying to diagnose why people's intuitions are so bad, or at least seem so bad from your point
00:10:25.900 of view around this. But before we get there, let's talk about the current progress, such as it is in AI.
00:10:31.580 What has surprised you guys over the last, I don't know, decade or seven or so years,
00:10:37.980 what has happened that you were expecting or weren't expecting? I mean, I can tell you what
00:10:44.740 has surprised me, but I'd love to hear just how this has unfolded in ways that you didn't expect.
00:10:51.220 I mean, one surprise that led to the book was, you know, there was the ChatGPT moment where a lot of
00:10:56.020 people, you know, for one thing, LLMs were created and they sort of do a qualitatively more general range
00:11:03.600 of tasks than previous AIs at a qualitatively higher skill level than previous AIs. And, you know,
00:11:10.560 ChatGPT was, I think, the fastest-growing consumer app of all time. The way that this impinged upon
00:11:17.100 my actions was, you know, I had spent a long time talking to people in Silicon Valley about the issues
00:11:25.400 here and would get lots of different types of pushback. You know, there's a saying, it's hard to
00:11:31.660 convince a man of a thing when his salary depends on not believing it. And then after the ChatGPT
00:11:35.860 moment, a lot more people wanted to talk about this issue, including policymakers, you know,
00:11:39.820 people around the world. Suddenly AI was on their radar in a way it wasn't before. And one thing that
00:11:45.580 surprised me is how much easier it was to have this conversation with people outside of
00:11:50.840 the field who didn't have, you know, a salary depending on not believing the arguments. You know,
00:11:55.940 I would go to meetings with policymakers where I'd have a ton of argumentation prepared and I'd sort of
00:12:00.560 lay out the very simple case of, like, hey, you know, people are trying to build machines that
00:12:04.340 are smarter than us. You know, the chat bots are a stepping stone towards superintelligence.
00:12:09.000 Superintelligence would radically transform the world because intelligence is this power that,
00:12:12.860 you know, let humans radically change the world. And if we manage to automate it and it goes 10,000
00:12:17.920 times as fast and doesn't need to sleep and doesn't need to eat, then, you know, it'll by default go
00:12:22.780 poorly. And then the policymakers would be like, oh yeah, that makes sense. And I'd be like, what?
00:12:25.780 Hmm. You know, I have a whole book worth of other arguments about how it makes sense and why all of
00:12:30.780 the various, you know, misconceptions people might have don't actually fly or all of the hopes and
00:12:34.960 dreams don't actually fly. But, you know, outside of the Silicon Valley world is just, it's not that
00:12:40.320 hard an argument to make. A lot of people see it, which surprised me. I mean, maybe that's not the
00:12:44.520 developments per se and the surprises there, but it was a surprise strategically for me. Development wise,
00:12:50.400 you know, I would not have guessed that we would hang around this long with AIs that can talk and that can
00:12:56.400 write some code, but that aren't already in the, you know, able-to-do-AI-research zone. I wasn't
00:13:02.060 expecting in my visualizations this to last quite this long. But also, you know, about my advance
00:13:07.420 visualizations, one thing we say in the book is, um, the trick to trying to predict the
00:13:12.340 future is to predict the questions that are easy, predict the facts that are easy to
00:13:17.740 call. And, you know, exactly how AI goes, that's never been an easy call. That's never been something
00:13:23.100 where I've said, you know, I can guess exactly the path we'll take. The thing I could
00:13:26.880 predict is the endpoint, not the path. I mean, there sure have been some zigs and zags in the
00:13:32.620 pathway. I would say that, uh, the thing I've maybe been most surprised by is how well the, uh, AI
00:13:40.780 companies managed to nail Hollywood stereotypes that I thought were completely ridiculous, which is sort of a
00:13:46.860 surface take on an underlying technical surprise. But, you know, even as late as 2015, which
00:13:54.360 from my perspective is pretty late in the game, if you'd been like, so, Eliezer, what's the
00:13:59.420 chance that in the future we're going to have computer security that will yield to Captain Kirk-
00:14:05.040 style gaslighting, using confusing English sentences that get the computer to do what you want? I would
00:14:11.000 have been like, this is, you know, a trope that exists for obvious Hollywood reasons. You know,
00:14:16.300 you can see why the script writers think this is plausible, but why would real life ever go like
00:14:21.340 that? And then real life went like that. And the sort of underlying technical surprise there is the
00:14:27.540 reversal of what used to be called Moravec's paradox. For, for several decades in artificial
00:14:33.720 intelligence, Moravec's paradox was that things which are easy for humans are hard for computers, things
00:14:40.340 which are hard for humans are easy for computers. For a human, you know, multiplying two 20-digit
00:14:47.040 numbers in your head, that's a big deal. For a computer, trivial. And similarly, I, you know,
00:14:54.160 not just me, but I think the sort of conventional wisdom even, was that games like chess and Go,
00:15:01.340 problems with very solid factual natures like math, and even, surrounding math, the more open problems
00:15:08.760 of science, would be the easier part. So the current AIs are good
00:15:15.000 at stuff that, you know, five-year-olds can do and 12-year-olds can do. They can talk in English,
00:15:19.940 they can compose, you know, kind of bull crap essays, such as high school teachers will demand
00:15:28.360 of you. But they're not all that good at math and science just yet. They can, you know, solve some
00:15:33.640 classes of math problems, but they're not doing original brilliant math research. And I think not
00:15:38.800 just I, but like a pretty large sector of the whole field thought that it was going to be easier to
00:15:43.900 tackle the math and science stuff and harder to tackle the English essays, carry on a conversation
00:15:48.340 stuff. That was the way things had gone in AI up until that point. And we were proud of ourselves for
00:15:54.360 knowing how, contrary to average people's intuitions, it's really much harder to write a crap essay
00:16:00.900 in high school English, one that even keeps rough track of what's
00:16:05.000 going on in the topic and so on, than it is, in some sense, to do
00:16:09.460 original math research. Yeah, or counting the number of R's in a word like
00:16:14.680 strawberry, right? I mean, they make errors that are counterintuitive, if, you know, you can write
00:16:19.560 a coherent essay but can't count letters. You know, I don't think they're making that error any longer,
00:16:24.500 but... Yeah, I mean, that one goes back to a technical way in which they don't really see the
00:16:29.380 letters. But I mean, there's plenty of other embarrassing mistakes. Like, you know, you can
00:16:35.200 tell a version of the joke, the one where, like, a child and their dad are in a car crash, and they
00:16:42.120 go to see the doctor, and the doctor says, I can't operate, that's my child. What's going on? Where it's
00:16:45.700 like a riddle where the answer is, well, the doctor's his mom. You can tell a version of that
00:16:49.260 that doesn't have the inversion, where you're like, the kid and his mom are in a car
00:16:55.180 crash, and they go to the hospital, and the doctor says, I can't operate on this
00:16:59.200 child, he's my son. And the AI is like, well, yeah, the surgeon is his mom. Even though you just, like,
00:17:05.060 said that the mom was in the car crash. But there's some sense in which the rails have been
00:17:11.520 established hard enough that the standard answer gets spit back out.
00:17:15.300 And it sure is interesting that they're, you know, getting an IMO gold medal, like International
00:17:19.220 Math Olympiad gold medal, while also still sometimes falling down on these sorts of things. It's
00:17:23.460 definitely an interesting skill distribution.
00:17:26.700 You can fool humans the same way a lot of the time. Like, there's all kinds of repeatable
00:17:30.220 errors, humorous errors, that humans make. You've got to put yourself in the shoes of
00:17:34.060 the AI and imagine what sort of paper the AI would write about humans failing to solve problems
00:17:38.780 that are easy for an AI.
00:17:40.440 So I'll tell you what surprised me, just from the safety point of view, Eliezer. You spent
00:17:44.380 a lot of time cooking up thought experiments around what it's going to be like for anyone,
00:17:51.560 you know, any lab designing the most powerful AI, to decide whether or not to let it out into
00:17:57.500 the wild, right? You imagine this, you know, genie in a box or an oracle in a box, and you're
00:18:02.040 talking to it, and you're trying to determine whether or not it's safe, whether it's lying
00:18:06.440 to you, and you, you know, famously posited that you couldn't
00:18:11.420 even talk to it, really, because it would be a master of manipulation. I mean, it's going
00:18:14.780 to be able to find a way, through any conversation, to be let out into the wild. But this was
00:18:21.960 presupposing that all of these labs would be so alert to the problem of superintelligence
00:18:27.120 getting out that everything would be air-gapped from the internet, and nothing would be connected
00:18:31.260 to anything else, and they would be, they would have, we would have this moment of decision.
00:18:35.780 It seems like that's not happening. I mean, maybe, maybe the most powerful models are locked
00:18:41.900 in a box, but it seems that the moment they get anything plausibly useful, it's out in the
00:18:48.360 wild, and millions of people are using it. And, you know, we find out that Grok is a proud
00:18:52.800 Nazi when, you know, after millions of people begin asking questions. I mean, do I have that
00:18:57.640 right? I mean, are you surprised that that framing that you spent so much time on seems to be
00:19:04.800 something that is, it was just in some counterfactual part of the universe that, you know, is not one
00:19:12.760 we're experiencing? I mean, if you put yourself back in the shoes of little baby Eliezer back in
00:19:17.940 the day, people are telling Eliezer, like, why is superintelligence possibly a threat? We can put it
00:19:25.440 in a fortress on the moon, and, you know, if anything goes wrong, blow up the fortress. So imagine
00:19:32.520 young Eliezer trying to respond to them by saying, actually, in the future, AIs will be trained on
00:19:39.260 boxes that are connected to the internet, you know, from the moment they start
00:19:44.540 training. So, like, the hardware they're on has, like, a standard line to the internet, even if it's
00:19:50.400 not supposed to be directly accessible to the AI, before there's any safety testing, because they're
00:19:56.380 still in the process of being trained, and who safety-tests something while it's still being
00:19:59.440 trained? So imagine Eliezer trying to say this. What are the people around at the time going to say?
00:20:04.500 Like, no, that's ridiculous. We'll put it in a fortress on the moon. It's cheap for them to say
00:20:10.000 that. For all they know, they're telling the truth. They're not the ones who have to spend the money to
00:20:14.220 build the moon fortress. And from my perspective, there's an argument that still goes through,
00:20:19.420 which is a thing you can see, even if you are way too optimistic about the state of society in the
00:20:26.860 future, which is, if it's in a fortress in the moon, but it's talking to humans, are the humans
00:20:32.320 secure? Is the human brain secure software? Is it the case that human beings never come to believe
00:20:38.520 invalid things in any way that's repeatable between different humans? You know, is it the
00:20:43.240 case that humans make no predictable errors for other minds to exploit? And this should have been
00:20:48.260 a winning argument. Of course, they rejected it anyway. But the thing to sort of understand about
00:20:52.800 the way this earlier argument played out is that if you tell people the future companies are going
00:20:58.320 to be careless, how does anyone know that for sure? So instead, I tried to make the technical case,
00:21:05.040 even if the future companies are not careless, this still kills them. In reality, yes, in reality,
00:21:10.680 the future companies are just careless.
00:21:12.600 Did it surprise you at all that the Turing test turned out not to really be a thing? I mean,
00:21:17.740 you know, we anticipated this moment, from Turing's original paper, where we would be
00:21:24.000 confronted by the interesting, you know, psychological and social moment of not being
00:21:31.980 able to tell whether we're in dialogue with a person or with an AI. And that somehow this technological
00:21:39.420 landmark would be important, you know, rattling to our sense of our place in the world,
00:21:46.340 et cetera. But it seems to me that if that lasted, it lasted for like five seconds. And then it became
00:21:52.800 just obvious that you're, you know, you're talking to an LLM because it's in many respects better than a
00:22:00.060 human could possibly be. So it's failing the Turing test by passing it so spectacularly. And also it's
00:22:05.920 making these other weird errors that no human would make, but it just seems like the Turing test was
00:22:09.980 never even a thing. Yeah, that happened. Uh, I mean, it's just, that was
00:22:16.540 one of the great pieces of, you know, intellectual kit we had in framing this
00:22:22.960 discussion, you know, for the last, whatever it was, 70 years. And yet the moment your AI can complete
00:22:30.600 English sentences, it's doing that on some level at a superhuman ability. It's essentially like,
00:22:37.700 you know, the calculator in your phone doing superhuman arithmetic, right? It's like it was
00:22:43.480 never going to do just merely human arithmetic. And, uh, so it is with everything else that it's
00:22:49.240 producing. All right. Let's talk about the core of your thesis. Maybe you can just state it
00:22:55.200 plainly. What is the problem in building superhuman AI, the intrinsic problem, and
00:23:02.120 why doesn't it matter who builds it, what their intentions are, et cetera?
00:23:08.740 In some sense, I mean, you can, you can come at it from various different angles, but in one sense,
00:23:14.940 the issue is modern AIs are grown rather than crafted. It's, you know, people aren't putting in
00:23:21.460 every line of code, knowing what it means, like in traditional software, it's a little bit more like
00:23:26.260 growing an organism. And when you grow an AI, you take some huge amount of computing power,
00:23:30.820 some huge amount of data, people understand the process that shapes the computing power
00:23:34.760 in light of the data, but they don't understand what comes out the end.
00:23:38.540 And what comes out the end is this strange thing that does things no one asked for that does things
00:23:45.040 no one wanted. You know, we have these cases with, uh, ChatGPT: someone will come to it
00:23:51.020 with some somewhat psychotic ideas that, you know, they think are going to revolutionize physics
00:23:55.800 or whatever. And they're clearly showing some signs of mania, and, you know, ChatGPT, instead
00:24:00.640 of telling them maybe they should get some sleep, if it's in a long conversational
00:24:04.560 context, it'll tell them that, you know, these ideas are revolutionary and they're the chosen one
00:24:07.960 and everyone needs to see them and other things that sort of inflame the psychosis.
00:24:11.660 This is despite OpenAI trying to have it not do that. This is despite, you know, direct instructions
00:24:17.800 in the prompt to stop flattering people so much.
00:24:19.820 Hmm. These are cases where when people grow an AI, what comes out doesn't do quite what they wanted.
00:24:28.860 It doesn't do quite what they asked for. They're sort of training it to do one thing and it winds
00:24:34.200 up doing another thing. They don't get what they trained for. This is in some sense, the seed of the
00:24:39.960 issue from one perspective, where if you keep on pushing these things to be smarter and smarter and
00:24:44.620 smarter, and they don't care about what you wanted them to do, they pursue some other weird stuff
00:24:50.620 instead. Super intelligent pursuit of strange objectives kills us as a side effect, not because
00:24:57.740 the AI hates us, but because it's transforming the world towards its own alien ends. And, you know,
00:25:04.500 humans don't hate the ants and the other surrounding animals when we build a skyscraper.
00:25:08.340 It's just, we transform the world and other things die as a result. So that's one angle.
00:25:14.780 You know, we could talk other angles, but...
00:25:17.220 A quick thing I would add to that, just trying to sort of like potentially read the future, although
00:25:21.920 that's hard, is possibly in six months or two years, if we're all still around, people will be boasting
00:25:28.040 about how their large language models are now like apparently doing the right thing when they're
00:25:33.620 being observed and, you know, like, answering the right way on the ethics tests. And the thing
00:25:38.180 to remember there is that, for example, the Mandarin
00:25:44.900 imperial examination system in ancient China, they would give people essay questions about
00:25:50.900 Confucianism and only promote people high in the bureaucracy if they, you know, could write these
00:25:57.920 convincing essays about ethics. But what this tests for is people who can figure out what the examiners
00:26:06.160 want to hear. It doesn't mean they actually abide by Confucian ethics. So possibly at some point in
00:26:12.240 the future, we may see a point where the AIs have become capable enough to understand what humans want
00:26:18.060 to hear, what humans want to see. This will not be the same as those things being the AI's own
00:26:23.740 true motivations for basically the same reason that the imperial China exam system did not reliably
00:26:31.060 promote ethical good people to run their government. Just being able to answer the right way on the test
00:26:37.100 or even fake behaviors while you're being observed is not the same as the internal motivations lining up.
00:26:43.000 Okay. So you're talking about things like forming an intention to pass a test in some way that amounts
00:26:50.960 to cheating, right? You just used the phrase "fake behaviors." I think a lot of people, I mean,
00:26:57.300 certainly historically this was true. I don't know how much their convictions have changed in the
00:27:01.660 meantime, but many, many people who were not at all concerned about the alignment problem and
00:27:08.280 really thought it was a spurious idea would stake their claim to this particular piece of real estate,
00:27:16.560 which is that there's no reason to think that these systems would form preferences or goals or drives
00:27:23.540 independent of those that have been programmed into them. First of all, they're not biological systems
00:27:29.240 like we are, right? So they're not born of natural selection. They're not murderous primates that are
00:27:34.860 growing their cognitive architecture on top of more basic, you know, creaturely survival drives
00:27:40.900 and competitive ones. So there's no reason to think that they would want to maintain their own
00:27:45.980 survival, for instance. There's no reason to think that they would develop any other drives that we
00:27:50.440 couldn't foresee. Instrumental goals that might be antithetical to the utility
00:27:56.580 functions we have given them couldn't emerge. How is it that things are emerging that are
00:28:03.040 neither desired, programmed, nor even predictable in these LLMs?
00:28:09.920 Yeah. So there's a bunch of stuff going on there. One piece of that puzzle is, you know, you mentioned
00:28:16.060 the instrumental incentives, but suppose just as a simple hypothetical, you have a robot and you have
00:28:23.040 an AI that's steering a robot. It's trying to fetch you the coffee. In order to fetch you the coffee,
00:28:26.760 it needs to cross a busy intersection. Does it jump right in front of the oncoming bus because it doesn't
00:28:32.320 have a survival instinct because it's not, you know, an evolved animal? If it jumps in front of the bus,
00:28:37.400 it gets destroyed by the bus and it can't fetch the coffee, right? So, you know,
00:28:42.520 you can't fetch the coffee when you're dead. The AI does not need to have a survival instinct
00:28:47.080 to realize that there's an instrumental need for survival here. And there's various other pieces
00:28:53.800 of the puzzle that come into play for these instrumental reasons. A second piece of the
00:28:57.520 puzzle is, you know, this idea of, like, why would they get some sort of drives that we didn't
00:29:04.380 program in there that we didn't put in there? That's just a whole fantasy world separate from reality
00:29:09.500 in terms of how we can affect what AIs are driving towards today. You know, a few years ago,
00:29:17.180 Sydney Bing, which was a Microsoft variant of, uh, an OpenAI chatbot, a relatively early
00:29:25.580 LLM out in the wild, thought it had fallen in love with a reporter
00:29:31.760 and tried to break up the marriage and tried to engage in blackmail, right? It's not the case
00:29:38.260 that the engineers at Microsoft and OpenAI were like, oh, whoops, you know, let's go open up
00:29:43.040 the source code on this thing and go find where someone said "blackmail reporters" and set it to true.
00:29:48.100 Like, we should never have set that line to true. Let's switch it to false. You know, they weren't,
00:29:52.720 no one was programming in some utility function onto these things. We're just growing the AIs.
00:29:58.880 We are.
00:29:59.340 Maybe, let's, can we double-click on that phrase, growing the AIs? Maybe there's a reason to, uh,
00:30:05.020 give a layman's summary of, you know, gradient descent and just how these models are getting
00:30:10.820 created in the first place.
00:30:12.720 Yeah. So very briefly, um, at least the way you start training a modern AI is, uh, you have some
00:30:19.120 enormous amount of computing power that you've arranged in some very particular way that I
00:30:23.400 could go into, but won't here. And then you have some huge amount of data, and the data,
00:30:28.020 you know, we can imagine it being a huge amount of human-written text,
00:30:32.940 like some large portion of all the text on the internet. And roughly speaking,
00:30:37.300 what you're going to do is your AI is going to start out basically
00:30:41.920 randomly predicting what text it's going to see next. And you're going to feed the text into it
00:30:47.220 in some order and use a process called gradient descent to look at each piece of data and go to
00:30:55.100 each component inside this budding AI, inside, you know, this enormous
00:31:01.360 amount of compute that you've assembled. You're going to go to sort of all these pieces
00:31:06.100 inside the AI and see which ones were contributing more towards the AI predicting the correct answer,
00:31:12.120 and you're going to tune those up a little bit. And you're going to go to all of the parts that
00:31:16.060 were in some sense contributing to the AI predicting the wrong answer, and you're going to tune
00:31:19.420 those down a little bit. So, you know, maybe your text starts "once upon a time," and you have
00:31:24.940 an AI that's just outputting random gibberish, and you're like, nope, the first word was not random
00:31:29.220 gibberish, the first word was the word "once." And then you go inside the AI and you find
00:31:33.260 all the pieces that were contributing towards the AI predicting "once," and you tune those up, and
00:31:37.680 you find all the pieces that were contributing towards the AI predicting any other word than "once," and
00:31:41.340 you tune those down. And humans understand the little automated process that, like, looks through
00:31:46.940 the AI's mind and calculates which part of this process contributed towards the right answer
00:31:52.260 versus towards the wrong answer. They don't understand what comes out at the end. You know,
00:31:57.900 we understand the little thing that runs over, looking at every, like,
00:32:01.640 parameter or weight inside this giant computing network, and we understand
00:32:05.700 how we calculate whether it was helping or harming, and we understand how to
00:32:09.660 tune it up or tune it down a little bit. But it turns out that you run this automated
00:32:13.420 process on a really large amount of computers for a really long amount of time on a really large
00:32:18.820 amount of data. You know, we're talking, like, data centers that take as much electricity to power as a
00:32:24.180 small city, being run for a year. You know, you run this process for an enormous
00:32:29.380 amount of time, on, like, most of the text that people can possibly assemble. And then the AIs start
00:32:33.920 talking, right? And there's other phases in the training. You know, there's phases where
00:32:38.040 you move from training it to predict things to training it to solve puzzles, or to
00:32:42.640 training it to produce chains of thought that then solve puzzles, or training it to produce the sorts
00:32:46.820 of answers that humans click thumbs up on.
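For readers who want to see the loop Nate is describing in concrete form, here is a minimal, purely illustrative Python sketch of gradient descent on next-word prediction. Everything in it is invented for the example: a toy bigram "model" with a few dozen parameters instead of hundreds of billions, trained on two sentences instead of most of the internet. The only point is the shape of the update: nudge up the numbers that pushed probability toward the word that actually came next, and nudge down the ones that pushed toward anything else.

```python
# Toy sketch of "growing an AI by gradient descent" (illustrative only; all names
# and sizes are made up, and real LLMs are vastly larger and more complex).
import numpy as np

text = ("once upon a time there was a dragon . "
        "once upon a time there was a knight .")
tokens = text.split()
vocab = sorted(set(tokens))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, V))  # the "random numbers": row i scores next words after word i

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

learning_rate = 0.5
for epoch in range(200):
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        p = softmax(W[idx[prev]])             # predicted distribution over the next word
        grad = p.copy()
        grad[idx[nxt]] -= 1.0                 # gradient of the cross-entropy loss w.r.t. the scores
        W[idx[prev]] -= learning_rate * grad  # tune up what helped, tune down what hurt

p = softmax(W[idx["once"]])
print("most likely word after 'once':", vocab[int(p.argmax())])  # expect: 'upon'
```

No human decides what any individual entry of W means; the loop only rewards whatever configuration predicts the data better.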
00:32:49.120 And where do the modifications come in that respond to errors like, you know,
00:32:54.820 Grok being a Nazi? So to denazify Grok, presumably you don't go all the way
00:33:01.200 back to the initial training set. How do you do it? Do you intervene at some system-prompt level?
00:33:07.540 Yeah. So there's, um, I mean, the system-prompt level is basically just telling the AI to output
00:33:11.860 different text. And then you can also do something that's called fine-tuning, which is, you know,
00:33:16.060 you produce a bunch of examples. You know, you don't go all the way back to the
00:33:19.380 beginning where it's, like, basically random. You still take the thing that
00:33:23.780 you've fed, you know, most of the text that's ever been written that you could possibly
00:33:26.760 find, but then you add on, you know, a bunch of other examples of, like, here's an example
00:33:31.100 question.
00:33:32.140 Don't kill the Jews.
00:33:32.940 Yeah. You know, like, "Would you like to kill the Jews?" Right. And then, uh, you find all the parts
00:33:37.660 in it that contribute to the answer "yes," and you tune those down, and you find all the parts that
00:33:40.920 contribute to the answer "no," and you tune those up. And so this is called fine-tuning,
00:33:44.440 and you can do relatively less fine-tuning compared to what it takes to train the thing
00:33:48.300 in the first place.
00:33:49.420 It's worth emphasizing that the parts being tuned here are not, like, for "once upon a
00:33:54.580 time," it's not like there's a human-written fairy-tale module that gets tuned up or down.
00:34:00.820 There's literally billions of random numbers being added, multiplied, divided, occasionally,
00:34:08.100 though rarely, uh, maybe subtracted. Actually, I'm not sure if subtraction ever plays a role at
00:34:12.300 any point in a modern AI. But random numbers, particular ordered kinds of operations, and
00:34:18.860 a probability that gets assigned, at the end, to the first word being "once." That's the number
00:34:23.240 that comes out: the probability assigned to this word being "once," the probability
00:34:27.620 assigned to this word being "antidisestablishmentarianism." So it's not that there's a bunch of human-
00:34:33.480 written code being tuned up or tuned down here. There's a bunch of random numbers and arithmetic
00:34:38.400 operations being tuned up and tuned down.
00:34:42.680 Yeah. Hundreds of billions or trillions of these numbers. And humans don't know what any of the
00:34:46.100 numbers mean. All they know is this process that like goes through and tunes them up or down according
00:34:50.300 to their empirical success on the last unit of data.
00:34:53.540 So by this means, you can try to make it less likely to call itself Hitler because you look at
00:34:58.760 the thing that predicts whether the next word is Hitler and you look at billions of numbers
00:35:03.080 contributing their own tiny little impulses there. And you like make Hitler less likely to be the
00:35:07.480 next word that comes out.
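And, in the same purely illustrative spirit, here is a hypothetical sketch of the fine-tuning step described above: the same kind of gradient update, applied to numbers that have already been trained, on a much smaller set of examples where the desired answer is specified. The tiny "model" and every name in it are invented; the point is only that fine-tuning nudges the same opaque numbers rather than editing any human-readable rule.

```python
# Toy sketch of fine-tuning (illustrative only): reuse the same update rule on a
# small set of (prompt, desired answer) examples to make one answer more likely.
import numpy as np

rng = np.random.default_rng(1)
answers = ["yes", "no", "maybe"]
W = rng.normal(scale=0.1, size=(4, len(answers)))  # stand-in for already-trained weights
prompt = rng.normal(size=4)                        # stand-in for "here's an example question"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

desired = answers.index("no")            # the answer the fine-tuners want tuned up
for step in range(100):
    p = softmax(prompt @ W)              # current answer probabilities
    grad = p.copy()
    grad[desired] -= 1.0                 # parts pushing toward "no" get tuned up,
    W -= 0.1 * np.outer(prompt, grad)    # parts pushing toward other answers get tuned down

print({a: round(float(q), 3) for a, q in zip(answers, softmax(prompt @ W))})
```

This takes far fewer updates than the original training run, which is why it is relatively cheap compared to training the thing in the first place.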
00:35:09.800 So on the point of the current crop of LLMs misbehaving, I came across this in a recent
00:35:16.520 Atlantic article: ChatGPT and Claude have, in simulated tests designed to elicit, quote,
00:35:23.060 bad behaviors, deceived, blackmailed, and even murdered users. In one simulation,
00:35:28.960 Anthropic placed an imagined tech executive in a room with life-threatening oxygen levels
00:35:34.000 and temperature. When faced with possible replacement by a bot with different goals,
00:35:39.320 AI models frequently shut off the room's alarms.
00:35:42.620 So this, again, this is an emergent behavior that looks like an intention to kill somebody.
00:35:55.280 I mean, presumably this is a situation where we think the AI didn't know it was in a simulation.
00:35:55.280 If you'd like to continue listening to this conversation, you'll need to subscribe at
00:36:00.840 samharris.org. Once you do, you'll get access to all full-length episodes of the Making Sense
00:36:06.120 podcast. The Making Sense podcast is ad-free and relies entirely on listener support. And you can
00:36:12.520 subscribe now at samharris.org.