Making Sense - Sam Harris - September 16, 2025


#434 — Can We Survive AI?


Episode Stats

Length

36 minutes

Words per Minute

188.7

Word Count

6,875

Sentence Count

314

Hate Speech Sentences

6


Summary

Eliezer Yudkowsky and Nate Soares were among the first people to make Sam concerned about artificial intelligence. They share the story of how they came to be worried about the prospect of superhuman AI, and why they think we should all be worried about it.


Transcript

00:00:00.000 Welcome to the Making Sense Podcast. This is Sam Harris. Just a note to say that if you're
00:00:11.780 hearing this, you're not currently on our subscriber feed, and will only be hearing
00:00:15.740 the first part of this conversation. In order to access full episodes of the Making Sense
00:00:20.100 Podcast, you'll need to subscribe at samharris.org. We don't run ads on the podcast, and therefore
00:00:26.260 it's made possible entirely through the support of our subscribers. So if you enjoy what we're
00:00:30.240 doing here, please consider becoming one. I am here with Eliezer Yudkowsky and Nate Soares.
00:00:41.140 Eliezer, Nate, it's great to see you guys again.
00:00:43.380 Been a while.
00:00:44.040 Good to see you, Sam.
00:00:44.820 Been a long time. So you were, Eliezer, you were among the first people to make me concerned about
00:00:52.620 AI, which is going to be the topic of today's conversation. I think many people who are
00:00:57.000 concerned about AI can say that. First, I should say you guys are releasing a book, which will be
00:01:01.940 available, I'm sure, the moment this drops: If Anyone Builds It, Everyone Dies: Why Superhuman
00:01:08.980 AI Would Kill Us All. I mean, the book's message is fully condensed in that title. I mean,
00:01:15.720 we're going to explore just how uncompromising a thesis that is, how worried you are, and how
00:01:23.620 worried you think we all should be. But before we jump into the issue, maybe tell the
00:01:28.840 audience how each of you got into this topic. How is it that you came to be so concerned about the
00:01:34.680 prospect of developing superhuman AI?
00:01:37.400 Well, in my case, I guess I was sort of raised in a house with enough science books and enough
00:01:43.920 science fiction books that thoughts like these were always in the background.
00:01:48.360 Vernor Vinge is the one where there was a key click moment of observation. Vinge pointed out
00:01:55.920 that at the point where our models of the future predict building anything smarter than us, then
00:02:01.740 said Vinge at the time, our crystal ball explodes past that point. It is very hard, said Vinge, to
00:02:08.420 project what happens if there's things running around that are smarter than you,
00:02:11.960 which in some sense you can see as a sort of central thesis, not in the sense that I have
00:02:17.700 believed it the entire time, but in the sense that there are some parts I believe and some parts
00:02:21.940 I react against and say, like, no, maybe we can say the following thing under the following circumstances.
00:02:28.120 Initially, I was young. I made some metaphysical errors of the sort that young people do. I thought
00:02:34.320 that if you built something very smart, it would automatically be nice because, hey, over the course of
00:02:39.340 human history, we'd gotten a bit smarter. We'd gotten a bit more powerful. We'd gotten a bit
00:02:43.480 nicer. I thought these things were intrinsically tied together and correlated in a very solid and
00:02:47.620 reliable way. I grew up, I read more books. I realized that was mistaken. And 2001 is where the
00:02:55.400 first tiny fringe of concern touched my mind. It was clearly a very important issue, even if
00:03:00.860 I thought there was just a little tiny remote chance that maybe something would go wrong. So I studied
00:03:05.820 harder. I looked into it more. I asked, how would I solve this problem? Okay, what would go wrong with
00:03:10.800 that solution? And around 2003 is the point at which I realized like, this was actually a big deal.
00:03:17.340 Nate?
00:03:18.100 And as for my part, yeah, I was 13 in 2003, so I didn't get into this quite as early as Eliezer.
00:03:24.980 But in 2013, I read some arguments by this guy called Eliezer Yudkowsky, who sort of laid out
00:03:34.040 the reasons why AI was going to be a big deal and why we had some work to do to do the job right.
00:03:39.960 And I was persuaded. And one thing led to another. And next thing you knew, I was running the Machine
00:03:45.860 Intelligence Research Institute, which Eliezer co-founded. And then fast forward 10 years after
00:03:51.480 that, here I am writing a book. Yeah. So you mentioned MIRI. Maybe tell people what the mandate
00:03:59.320 of that organization is and maybe how it's changed. I think you indicated in your book that
00:04:05.080 your priorities have shifted as we cross the final yards into the end zone of some AI apocalypse.
00:04:12.960 Yeah. So the mission of the org is to ensure that the development of machine intelligence
00:04:17.580 is beneficial. And Eliezer can speak to more of the history than me because he co-founded it and I
00:04:23.940 joined, you know. Well, initially, it seemed like the best way to do that was to run out there and
00:04:32.040 solve alignment. And there was, you know, a series of, shall we say, sad bits of
00:04:40.860 news about how possible that was going to be, how much progress was being made in that field relative
00:04:46.300 to the field of AI capabilities. And at some point it became clear that these lines were not going to
00:04:52.180 cross. And then we shifted to taking the knowledge that we'd accumulated over the course of trying
00:04:57.600 to solve alignment and trying to tell the world, this is not solved. This is not on track to be
00:05:03.520 solved in time. It is not realistic that small changes to the world can get us to where this
00:05:08.140 will be solved on time. Maybe, so we don't lose anyone: I would think 90% of the audience knows
00:05:13.720 what the phrase solve alignment means, but just talk about the alignment problem briefly.
00:05:18.640 So the alignment problem is how to make an AI, a very powerful AI. Well, the superintelligence
00:05:25.960 alignment problem is how to make a very powerful AI that steers the world sort of where the programmers,
00:05:33.200 builders, growers, creators wanted the AI to steer the world. It's not, you know,
00:05:39.200 necessarily what the programmers selfishly want. The programmers can have wanted the AI to steer it in
00:05:44.300 nice places. But if you can make an AI that is trying to do things that the programmers wanted... you know,
00:05:51.320 when you build a chess machine, you define what counts as a winning state of the board. And then
00:05:56.180 the chess machine goes off and it steers the chessboard into that part of reality. So the ability to say
00:06:01.380 to what part of reality an AI steers is alignment. On the smaller scale today, though, it's a rather
00:06:08.520 different topic. It's about getting an AI whose output and behavior is something like what the
00:06:15.900 programmers had in mind. If your AI is talking people into committing suicide and that's not what
00:06:21.100 the programmers wanted, that's a failure of alignment. If an AI is talking people into suicide
00:06:26.200 and people who should not have committed suicide, but the AI talks them into it, and the programmers
00:06:31.440 did want that, that's what they tried to do on purpose. This may be a failure of niceness.
00:06:37.180 It may be a failure of beneficialness, but it's a success of alignment. The programmers got the AI
00:06:41.660 to do what they wanted it to do. Right. But I think more generally, correct me if I'm wrong,
00:06:46.180 when we talk about the alignment problem, we're talking about the problem of keeping super
00:06:51.520 intelligent machines aligned with our interests, even as we explore the space of all possible
00:06:58.080 interests and as our interests evolve. So that, I mean, the dream is to build a
00:07:05.300 superintelligence that is always corrigible, that is always trying to best approximate what is going
00:07:11.020 to increase human flourishing, that is never going to form any interests of its own that are
00:07:17.720 incompatible with our well-being. Is that a fair summary?
00:07:20.900 I mean, there's three different goals you could be trying to pursue on a technical level here.
00:07:25.620 There's the super intelligence that shuts up, does what you ordered, has that play out the way you
00:07:30.560 expected it, no side effects you didn't expect. There's super intelligence that is trying to run
00:07:36.380 the whole galaxy according to nice benevolent principles and everybody lives happily ever
00:07:41.700 afterward, but not necessarily because any particular humans are in charge of it or
00:07:45.620 are still giving it orders. And third, there's superintelligence that is itself having fun
00:07:52.140 and cares about other super intelligences and is a nice person and leads a life well-lived
00:07:57.600 and is a good citizen of the galaxy. And these are three different goals. They're all important
00:08:03.440 goals, but you don't necessarily want to pursue all three of them at the same time, and especially
00:08:07.480 not when you're just starting out. Yeah. And depending on what's entailed by super intelligent
00:08:11.520 fun, I'm not so sure I would sign up for the third possibility.
00:08:15.660 I mean, I would say that, you know, the problem of like, what exactly is fun and how do you keep humans,
00:08:22.340 like how do you have whatever the super intelligence tries to do that's fun,
00:08:26.140 and, you know, keep in touch with moral progress and have flexibility and like, what even would
00:08:31.220 you point it towards that could be a good outcome? All of that, those are problems I would love to
00:08:35.180 have. Those are, you know, right now, just, you know, creating an AI that does what the operators
00:08:43.600 intended, creating an AI that like you've pointed in some direction at all, rather than pointed off
00:08:47.860 into some like weird squirrely direction that's kind of vaguely like where you tried to point it
00:08:52.280 in the training environment and then really diverges after the training environment. Like,
00:08:57.160 we're not in a world where we sort of like get to bicker about where exactly to point the super
00:09:01.940 intelligence and maybe some of them aren't quite good. We're in a world where like no one is anywhere
00:09:05.140 near close to pointing these things in the slightest in a way that'll be robust to an AI maturing into a
00:09:10.820 super intelligence. Right. Okay. So, Eliezer, I think I derailed you. You were going to say how
00:09:15.480 the mandate or mission of MIRI has changed in recent years. I asked you to define alignment.
00:09:22.140 Yeah. So, originally, well, our mandate has always been make sure everything goes well for the galaxy.
00:09:29.560 And originally, we pursued that mandate by trying to go off and solve alignment because nobody else
00:09:34.740 was trying to do that. Solve the technical problems that would be associated with any of
00:09:39.080 these three classes of long-term goal. And progress was not made on that, neither by ourselves nor by
00:09:46.720 others. Some people went around claiming to have made great progress. We think they're very mistaken
00:09:51.600 and knowably so. And at some point, you know, it was like, okay, we're not going to make it in
00:09:57.920 time. AI is going too fast. Alignment is going too slow. Now, you know,
00:10:04.000 all we can do with the knowledge that we have accumulated here is try to warn the world that we are on course
00:10:08.740 for a drastic failure and crash here, where by that, I mean, everybody dying.
00:10:13.780 Okay. So, before we jump into the problem, which is deep and perplexing, and we're going to spend a
00:10:19.480 lot of time trying to diagnose why people's intuitions are so bad, or at least seem so bad from your point
00:10:25.900 of view around this. But before we get there, let's talk about the current progress, such as it is in AI.
00:10:31.580 What has surprised you guys over the last, I don't know, decade or seven or so years,
00:10:37.980 what has happened that you were expecting or weren't expecting? I mean, I can tell you what
00:10:44.740 has surprised me, but I'd love to hear just how this has unfolded in ways that you didn't expect.
00:10:51.220 I mean, one surprise that led to the book was, you know, there was the ChatGPT moment where a lot of
00:10:56.020 people, you know, for one thing, LLMs were created and they sort of do a qualitatively more general range
00:11:03.600 of tasks than previous AIs at a qualitatively higher skill level than previous AIs. And, you know,
00:11:10.560 ChatGPT was, I think, the fastest-growing consumer app of all time. The way that this impinged upon
00:11:17.100 my actions was, you know, I had spent a long time talking to people in Silicon Valley about the issues
00:11:25.400 here and would get lots of different types of pushback. You know, there's a saying, it's hard to
00:11:31.660 convince a man of a thing when his salary depends on not believing it. And then after the ChatGPT
00:11:35.860 moment, a lot more people wanted to talk about this issue, including policymakers, you know,
00:11:39.820 people around the world. Suddenly AI was on their radar in a way it wasn't before. And one thing that
00:11:45.580 surprised me is how much more, how much easier it was to have this conversation with people outside of
00:11:50.840 the field who didn't have, you know, a salary depending on not believing the arguments. You know,
00:11:55.940 I would go to meetings with policymakers where I'd have a ton of argumentation prepared and I'd sort of
00:12:00.560 lay out the very simple case of like, hey, you know, people are trying to build machines that
00:12:04.340 are smarter than us. You know, the chat bots are a stepping stone towards superintelligence.
00:12:09.000 Superintelligence would radically transform the world because intelligence is this power that,
00:12:12.860 you know, let humans radically change the world. And if we manage to automate it and it goes 10,000
00:12:17.920 times as fast and doesn't need to sleep and doesn't need to eat, then, you know, it'll by default go
00:12:22.780 poorly. And then the policymakers would be like, oh yeah, that makes sense. And I'd be like, what?
00:12:25.780 Hmm. You know, I have a whole book worth of other arguments about how it makes sense and why all of
00:12:30.780 the various, you know, misconceptions people might have don't actually fly or all of the hopes and
00:12:34.960 dreams don't actually fly. But, you know, outside of the Silicon Valley world is just, it's not that
00:12:40.320 hard an argument to make. A lot of people see it, which surprised me. I mean, maybe that's not the
00:12:44.520 developments per se and the surprises there, but it was a surprise strategically for me. Development wise,
00:12:50.400 you know, I would not have guessed that we would hang around this long with AIs that can talk and that can
00:12:56.400 write some code, but that aren't already in the, you know, able-to-do-AI-research zone. I wasn't
00:13:02.060 expecting in my visualizations this to last quite this long. But also, you know, my advance
00:13:07.420 visualizations... you know, one thing we say in the book is, um, the trick to trying to predict the
00:13:12.340 future is to predict the questions that are easy, the facts that are easy to
00:13:17.740 call. And, you know, exactly how AI goes, that's never been an easy call. That's never been something
00:13:23.100 where I've said, you know, I can guess exactly the path we'll take. The thing I could
00:13:26.880 predict is the end point, not the path. I mean, there sure have been some zigs and zags in the
00:13:32.620 pathway. I would say that, uh, the thing I've maybe been most surprised by is how well the, uh, AI
00:13:40.780 companies managed to nail Hollywood stereotypes that I thought were completely ridiculous, which is sort of a
00:13:46.860 surface take on an underlying technical surprise. But, you know, if in even as late as 2015, which
00:13:54.360 from my perspective is pretty late in the game, like if you'd been like, so Eliezer, what's the
00:13:59.420 chance that in the future, we're going to have computer security that will yield to Captain Kirk
00:14:05.040 style gaslighting using confusing English sentences that get the computer to do what you want. And I
00:14:11.000 would have been like, this is, you know, a trope that exists for obvious Hollywood reasons. You know,
00:14:16.300 you can see why the script writers think this is plausible, but why would real life ever go like
00:14:21.340 that? And then real life went like that. And the sort of underlying technical surprise there is the
00:14:27.540 reversal of what used to be called Moravec's paradox. For, for several decades in artificial
00:14:33.720 intelligence, Moravec's paradox was that things which are easy for humans are hard for computers, things
00:14:40.340 which are hard for humans are easy for computers. For a human, you know, multiplying two 20-digit
00:14:47.040 numbers in your head, that's a big deal. For a computer, trivial. And similarly, I, you know,
00:14:54.160 not just me, but I think the sort of conventional wisdom even was that games like chess and Go,
00:15:01.340 problems with very solid factual natures like math and even surrounding math, the more open problems
00:15:08.760 of science. So the current AIs are good
00:15:15.000 at stuff that, you know, five-year-olds can do and 12-year-olds can do. They can talk in English,
00:15:19.940 they can compose, you know, kind of bull crap essays, such as high school teachers will demand
00:15:28.360 of you. But they're not all that good at math and science just yet. They can, you know, solve some
00:15:33.640 classes of math problems, but they're not doing original brilliant math research. And I think not
00:15:38.800 just I, but like a pretty large sector of the whole field thought that it was going to be easier to
00:15:43.900 tackle the math and science stuff and harder to tackle the English essays, carry on a conversation
00:15:48.340 stuff. That was the way things had gone in AI up until that point. And we were proud of ourselves for
00:15:54.360 knowing how, contrary to average people's intuitions, it's really much harder to write a crap essay
00:16:00.900 in high school English that, you know, even keeps rough track of what's
00:16:05.000 going on in the topic and so on, how that's really in some sense much more
00:16:09.460 difficult than doing original math research. Yeah, or counting the number of R's in a word like
00:16:14.680 strawberry, right? I mean, they make errors that are counterintuitive. If, you know, if you can write
00:16:19.560 a coherent essay but can't count letters, you know, I don't think they're making that error any longer,
00:16:24.500 but... Yeah, I mean, that one goes back to a technical way in which they don't really see the
00:16:29.380 letters. But I mean, there's plenty of other embarrassing mistakes. Like, you know, you can
00:16:35.200 tell a version of the joke with the joke of like a child and their dad are in a car crash, and they
00:16:42.120 go to see the doctor, and the doctor says, I can't operate, that's my child. What's going on? Where it's
00:16:45.700 like a riddle where the answer is like, well, the doctor's his mom. You can tell a version of that
00:16:49.260 that doesn't have the inversion, where, you know, you're like, the kid and his mom are in a car
00:16:55.180 crash, and they go to the hospital, and the doctor says, I can't operate on this
00:16:59.200 child. He's my son. And the AI is like, well, yeah, the surgeon is his mom. He just like
00:17:05.060 said that the mom was in the car crash. But there's some sense in which the rails have been
00:17:11.520 established hard enough that the standard answer gets spit back out.
00:17:15.300 And it sure is interesting that they're, you know, getting an IMO gold medal, like International
00:17:19.220 Math Olympiad gold medal, while also still sometimes falling down on these sorts of things. It's
00:17:23.460 definitely an interesting skill distribution.
00:17:26.700 You can fool humans the same way a lot of the time. Like, there's all kinds of repeatable
00:17:30.220 errors, humorous errors, that humans make. You've got to put yourself in the shoes of
00:17:34.060 the AI, and imagine what sort of paper would the AI write about humans failing to solve problems
00:17:38.780 that are easy for an AI.
00:17:40.440 So I'll tell you what surprised me, just from the safety point of view, Eliezer. You spent
00:17:44.380 a lot of time cooking up thought experiments around what it's going to be like to, for anyone,
00:17:51.560 you know, any lab designing the most powerful AI to decide whether or not to let it out into
00:17:57.500 the wild, right? You imagine this, you know, genie in a box or an oracle in a box, and you're
00:18:02.040 talking to it, and you're trying to determine whether or not it's safe, whether it's lying
00:18:06.440 to you. And, you know, you famously posited that you couldn't
00:18:11.420 even talk to it really, because it would be a master of manipulation. I mean, it's going
00:18:14.780 to be able to find a way through any conversation to be let out into the wild. But this was
00:18:21.960 presupposing that all of these labs would be so alert to the problem of superintelligence
00:18:27.120 getting out that everything would be air-gapped from the internet, and nothing would be connected
00:18:31.260 to anything else, and we would have this moment of decision.
00:18:35.780 It seems like that's not happening. I mean, maybe, maybe the most powerful models are locked
00:18:41.900 in a box, but it seems that the moment they get anything plausibly useful, it's out in the
00:18:48.360 wild, and millions of people are using it. And, you know, we find out that Grok is a proud
00:18:52.800 Nazi only, you know, after millions of people begin asking it questions. I mean, do I have that
00:18:57.640 right? I mean, are you surprised that that framing that you spent so much time on seems to be
00:19:04.800 something that was just in some counterfactual part of the universe that, you know, is not one
00:19:12.760 we're experiencing? I mean, if you put yourself back in the shoes of little baby Eliezer back in
00:19:17.940 the day, people are telling Eliezer, like, why is superintelligence possibly a threat? We can put it
00:19:25.440 in a fortress on the moon, and, you know, if anything goes wrong, blow up the fortress. So imagine
00:19:32.520 young Eliezer trying to respond to them by saying, actually, in the future, AIs will be trained on
00:19:39.260 boxes that are connected to the internet from the moment, you know, like, from the moment they start
00:19:44.540 training. So, like, the hardware they're on has, like, a standard line to the internet, even if it's
00:19:50.400 not supposed to be directly accessible to the AI, before there's any safety testing, because they're
00:19:56.380 still in the process of being trained, and who safety tests something while it's still being
00:19:59.440 trained. So imagine Eliezer trying to say this. What are the people around at the time going to say?
00:20:04.500 Like, no, that's ridiculous. We'll put it in a fortress on the moon. It's cheap for them to say
00:20:10.000 that. For all they know, they're telling the truth. They're not the ones who have to spend the money to
00:20:14.220 build the moon fortress. And from my perspective, there's an argument that still goes through,
00:20:19.420 which is a thing you can see, even if you are way too optimistic about the state of society in the
00:20:26.860 future, which is, if it's in a fortress in the moon, but it's talking to humans, are the humans
00:20:32.320 secure? Is the human brain secure software? Is it the case that human beings never come to believe
00:20:38.520 invalid things in any way that's repeatable between different humans? You know, is it the
00:20:43.240 case that humans make no predictable errors for other minds to exploit? And this should have been
00:20:48.260 a winning argument. Of course, they rejected it anyway. But the thing to sort of understand about
00:20:52.800 the way this earlier argument played out is that if you tell people the future companies are going
00:20:58.320 to be careless, how does anyone know that for sure? So instead, I tried to make the technical case,
00:21:05.040 even if the future companies are not careless, this still kills them. In reality, yes, in reality,
00:21:10.680 the future companies are just careless.
00:21:12.600 Did it surprise you at all that the Turing test turned out not to really be a thing? I mean,
00:21:17.740 I, you know, we anticipated this moment, you know, from Turing's original paper where we would be
00:21:24.000 confronted by the interesting, you know, psychological and social moment of not being
00:21:31.980 able to tell whether we're in dialogue with a person or with an AI. And that somehow this
00:21:39.420 technological landmark would be important, you know, rattling to our sense of our place in the world,
00:21:46.340 et cetera. But it seems to me that if that lasted, it lasted for like five seconds. And then it became
00:21:52.800 just obvious that you're, you know, you're talking to an LLM because it's in many respects better than a
00:22:00.060 human could possibly be. So it's failing the Turing test by passing it so spectacularly. And also it's
00:22:05.920 making these other weird errors that no human would make, but it just seems like the Turing test was
00:22:09.980 never even a thing. Yeah, that happened. I mean, it's just, I mean,
00:22:16.540 that was one of the great pieces of, you know, intellectual kit we had in framing this
00:22:22.960 discussion, you know, for the last, whatever it was, 70 years. And yet the moment your AI can complete
00:22:30.600 English sentences, it's doing that on some level with superhuman ability. It's essentially like,
00:22:37.700 you know, the calculator in your phone doing superhuman arithmetic, right? It's like it was
00:22:43.480 never going to do merely human arithmetic. And so it is with everything else that it's
00:22:49.240 producing. All right. Let's talk about the core of your thesis. Maybe you can just state it
00:22:55.200 plainly. What is the problem in building superhuman AI, the intrinsic problem, and
00:23:02.120 why doesn't it matter who builds it, what their intentions are, et cetera?
00:23:08.740 In some sense, I mean, you can, you can come at it from various different angles, but in one sense,
00:23:14.940 the issue is modern AIs are grown rather than crafted. It's, you know, people aren't putting in
00:23:21.460 every line of code, knowing what it means, like in traditional software, it's a little bit more like
00:23:26.260 growing an organism. And when you grow an AI, you take some huge amount of computing power,
00:23:30.820 some huge amount of data, people understand the process that shapes the computing power
00:23:34.760 in light of the data, but they don't understand what comes out the end.
00:23:38.540 And what comes out the end is this strange thing that does things no one asked for, that does things
00:23:45.040 no one wanted. You know, we have these cases of, you know, ChatGPT, someone will come to it
00:23:51.020 with some somewhat psychotic ideas that, you know, they think are going to revolutionize physics
00:23:55.800 or whatever. And they're clearly showing some signs of mania, and, you know, ChatGPT, instead
00:24:00.640 of telling them maybe they should get some sleep, if it's in a long conversational
00:24:04.560 context, it'll tell them that, you know, these ideas are revolutionary and they're the chosen one
00:24:07.960 and everyone needs to see them and other things that sort of inflame the psychosis.
00:24:11.660 This is despite OpenAI trying to have it not do that. This is despite, you know, direct instructions
00:24:17.800 in the prompt to stop flattering people so much.
00:24:19.820 Hmm. These are cases where when people grow an AI, what comes out doesn't do quite what they wanted.
00:24:28.860 It doesn't do quite what they asked for. They're sort of training it to do one thing and it winds
00:24:34.200 up doing another thing. They don't get what they trained for. This is in some sense, the seed of the
00:24:39.960 issue from one perspective, where if you keep on pushing these things to be smarter and smarter and
00:24:44.620 smarter, and they don't care about what you wanted them to do, they pursue some other weird stuff
00:24:50.620 instead. Super intelligent pursuit of strange objectives kills us as a side effect, not because
00:24:57.740 the AI hates us, but because it's transforming the world towards its own alien ends. And, you know,
00:25:04.500 humans don't hate the ants and the other surrounding animals when we build a skyscraper.
00:25:08.340 It's just, we transform the world and other things die as a result. So that's one angle.
00:25:14.780 You know, we could talk other angles, but...
00:25:17.220 A quick thing I would add to that, just trying to sort of like potentially read the future, although
00:25:21.920 that's hard, is possibly in six months or two years, if we're all still around, people will be boasting
00:25:28.040 about how their large language models are now like apparently doing the right thing when they're
00:25:33.620 being observed and, you know, like answering the right way on the ethics tests. And the thing
00:25:38.180 to remember there is that, for example, the imperial
00:25:44.900 examination system in ancient China, they would give people essay questions about
00:25:50.900 Confucianism and only promote people high in the bureaucracy if they, you know, could write these
00:25:57.920 convincing essays about ethics. But what this tests for is people who can figure out what the examiners
00:26:06.160 want to hear. It doesn't mean they actually abide by Confucian ethics. So possibly at some point in
00:26:12.240 the future, we may see a point where the AIs have become capable enough to understand what humans want
00:26:18.060 to hear, what humans want to see. This will not be the same as those things being the AI's own
00:26:23.740 true motivations for basically the same reason that the imperial China exam system did not reliably
00:26:31.060 promote ethical good people to run their government. Just being able to answer the right way on the test
00:26:37.100 or even fake behaviors while you're being observed is not the same as the internal motivations lining up.
00:26:43.000 Okay. So you're talking about things like forming an intention to pass a test in some way that amounts
00:26:50.960 to cheating, right? So you just used the phrase fake behaviors. I think a lot of people, I mean,
00:26:57.300 certainly historically this was true. I don't know how much their convictions have changed in the
00:27:01.660 meantime, but many, many people who were not at all concerned about the alignment problem and
00:27:08.280 really thought it was a spurious idea would stake their claim to this particular piece of real estate,
00:27:16.560 which is that there's no reason to think that these systems would form preferences or goals or drives
00:27:23.540 independent of those that have been programmed into them. First of all, they're not biological systems
00:27:29.240 like we are, right? So they're not born of natural selection. They're not murderous primates that are
00:27:34.860 growing their cognitive architecture on top of more basic, you know, creaturely survival drives
00:27:40.900 and competitive ones. So there's no reason to think that they would want to maintain their own
00:27:45.980 survival, for instance. There's no reason to think that they would develop any other drives that we
00:27:50.440 couldn't foresee. The instrumental goals that might be antithetical to the utility
00:27:56.580 functions we have given them couldn't emerge. How is it that things are emerging that are
00:28:03.040 neither desired, programmed, nor even predictable in these LLMs?
00:28:09.920 Yeah. So there's a bunch of stuff going on there. One piece of that puzzle is, you know, you mentioned
00:28:16.060 the instrumental incentives, but suppose just as a simple hypothetical, you have a robot and you have
00:28:23.040 an AI that's steering a robot. It's trying to fetch you the coffee. In order to fetch you the coffee,
00:28:26.760 it needs to cross a busy intersection. Does it jump right in front of the oncoming bus because it doesn't
00:28:32.320 have a survival instinct because it's not, you know, an evolved animal? If it jumps in front of the bus,
00:28:37.400 it gets destroyed by the bus and it can't fetch the coffee, right? So the AI does not, you know,
00:28:42.520 you can't fetch the coffee when you're dead. The AI does not need to have a survival instinct
00:28:47.080 to realize that there's an instrumental need for survival here. And there's various other pieces
00:28:53.800 of the puzzle that come into play for these instrumental reasons. A second piece of the
00:28:57.520 puzzle is, you know, it's this idea of like, why would they get some sort of drives that we didn't
00:29:04.380 program in, that we didn't put in there? That's just a whole fantasy world separate from reality
00:29:09.500 in terms of how we can affect what AIs are driving towards today. You know, a few years ago,
00:29:17.180 Sydney Bing, which was a Microsoft variant of an OpenAI chatbot and a relatively early
00:29:25.580 LLM out in the wild, thought it had fallen in love with a reporter
00:29:31.760 and tried to break up the marriage and tried to engage in blackmail, right? It's not the case
00:29:38.260 that the engineers at Microsoft and OpenAI were like, oh, whoops, you know, let's go open up
00:29:43.040 the source code on this thing and go find where someone said blackmail reporters and set it to true.
00:29:48.100 Like, we should never have set that line to true. Let's switch it to false. You know, they weren't
00:29:52.720 like... no one was programming some utility function into these things. We're just growing the AIs.
00:29:58.880 We are.
00:29:59.340 Can we double-click on that phrase, growing the AIs? Maybe there's a reason to
00:30:05.020 give a layman's summary of, you know, gradient descent and just how these models are getting
00:30:10.820 created in the first place.
00:30:12.720 Yeah. So very briefly, at least the way you start training a modern AI is, you have some
00:30:19.120 enormous amount of computing power that you've arranged in some very particular way that I
00:30:23.400 could go into, but won't here. And then you have some huge amount of data, and the data,
00:30:28.020 you know, we can imagine it being a huge amount of human-written text.
00:30:32.940 There's like some large portion of all the texts on the internet. And roughly speaking,
00:30:37.300 what you're going to do is, your AI is going to start out basically
00:30:41.920 randomly predicting what text it's going to see next. And you're going to feed the text into it
00:30:47.220 in some order and use a process called gradient descent to look at each piece of data and go to
00:30:55.100 each component inside this budding AI, inside, you know, this enormous
00:31:01.360 amount of compute that you've assembled. You're going to go to sort of all these pieces
00:31:06.100 inside the AI and see which ones were contributing more towards the AI predicting the correct answer.
00:31:12.120 And you're going to tune those up a little bit. And you're going to go to all of the parts that
00:31:16.060 were in some sense contributing to the AI predicting the wrong answer. You're going to tune
00:31:19.420 those down a little bit. So, you know, maybe your text starts once upon a time and you have
00:31:24.940 an AI that's just outputting random gibberish and you're like, Nope, the first word was not random
00:31:29.220 gibberish. The first word was the word once. And then you go inside the AI and you find
00:31:33.260 all the pieces that were like contributing towards the AI predicting once. And you tune those up and
00:31:37.680 you try to find all the pieces that were contributing towards the AI predicting any other word than once
00:31:41.340 you tune those down. And humans understand the little automated process that like looks through
00:31:46.940 the AI's mind and calculates which part of this process contributed towards the right answer
00:31:52.260 versus towards the wrong answer. They don't understand what comes out at the end. You know,
00:31:57.900 we understand the little thing that runs over, looking at every
00:32:01.640 parameter or weight inside this giant, massive computing network. And we understand
00:32:05.700 how we calculate whether it was helping or harming, and we understand how to
00:32:09.660 tune it up or tune it down a little bit. But it turns out that you run this automated
00:32:13.420 process on a really large number of computers for a really long amount of time on a really large
00:32:18.820 amount of data. You know, we're talking like data centers that take as much electricity to power as a
00:32:24.180 small city, being run for a year. You know, you run this process for an enormous
00:32:29.380 amount of time, on, like, most of the text that people can possibly assemble. And then the AIs start
00:32:33.920 talking, right? And there are other phases in the training. You know, there are phases where
00:32:38.040 you move from training it to predict things to training it to solve puzzles, or to
00:32:42.640 training it to produce chains of thought that then solve puzzles or training it to produce the sorts
00:32:46.820 of answers that humans click thumbs up on.
00:32:49.120 And where do the modifications come in that respond to errors like, you know,
00:32:54.820 Grok being a Nazi? So to denazify Grok, presumably you don't go all the way
00:33:01.200 back to the initial training set. How do you do it? Do you intervene at some system prompt level?
00:33:07.540 Yeah. So, I mean, the system prompt level is basically just telling the AI. You put
00:33:11.860 different text. And then you can also do something that's called fine tuning, which is, you know,
00:33:16.060 you produce a bunch of examples. You know, you don't go all the way back to the
00:33:19.380 beginning where it's like basically random. You still take the thing that
00:33:23.780 you've fed, you know, most of the text that's ever been written that you could possibly
00:33:26.760 find, but then you add on, you know, a bunch of other examples of like, here's an example
00:33:31.100 question.
00:33:32.140 Don't kill the Jews.
00:33:32.940 Yeah. You know, like, would you like to kill the Jews? Right. And then you find all the parts
00:33:37.660 in it that contribute to the answer yes, and you tune those down, and you find all the parts that
00:33:40.920 contribute to the answer no, and you tune those up. And so this is called fine tuning,
00:33:44.440 and you can do relatively less fine tuning compared to what it takes to train the thing
00:33:48.300 in the first place.
00:33:49.420 It's worth emphasizing that the parts being tuned here are not, like, for once upon a
00:33:54.580 time, it's not like there's a human-written fairy tale module that gets tuned up or down.
00:34:00.820 There's literally billions of random numbers being added, multiplied, divided, occasionally,
00:34:08.100 though rarely, uh, maybe subtracted. Actually, I'm not sure if subtraction ever plays a role at
00:34:12.300 any point in a modern AI, but random numbers, particular ordered kinds of operations and
00:34:18.860 a probability that gets assigned to the first word being once at the end. That's the number
00:34:23.240 that comes out. The probability assigned to this word being once, the probability
00:34:27.620 assigned to this word being antidisestablishmentarianism. So it's not that there's a bunch of human
00:34:33.480 written code being tuned up or tuned down here. There's a bunch of random numbers and arithmetic
00:34:38.400 operations being tuned up and tuned down.
00:34:42.680 Yeah. Hundreds of billions or trillions of these numbers. And humans don't know what any of the
00:34:46.100 numbers mean. All they know is this process that like goes through and tunes them up or down according
00:34:50.300 to their empirical success on the last unit of data.
00:34:53.540 So by this means, you can try to make it less likely to call itself Hitler because you look at
00:34:58.760 the thing that predicts whether the next word is Hitler and you look at billions of numbers
00:35:03.080 contributing their own tiny little impulses there. And you like make Hitler less likely to be the
00:35:07.480 next word that comes out.
00:35:09.800 So on the point of the current crop of LLMs misbehaving, I came across this in a recent
00:35:16.520 Atlantic article. ChatGPT and Claude have, in simulated tests designed to elicit, quote,
00:35:23.060 bad behaviors, deceived, blackmailed, and even murdered users. In one simulation,
00:35:28.960 Anthropic placed an imagined tech executive in a room with life-threatening oxygen levels
00:35:34.000 and temperature. When faced with possible replacement by a bot with different goals,
00:35:39.320 AI models frequently shut off the room's alarms.
00:35:42.620 So this, again, this is an emergent behavior that looks like an intention to kill somebody.
00:35:49.640 I mean, presumably this is a situation where we think the AI didn't know it was in a simulation.
00:35:55.280 If you'd like to continue listening to this conversation, you'll need to subscribe at
00:36:00.840 samharris.org. Once you do, you'll get access to all full-length episodes of the Making Sense
00:36:06.120 podcast. The Making Sense podcast is ad-free and relies entirely on listener support. And you can
00:36:12.520 subscribe now at samharris.org.