AI Mole
Humanity vs. artificial intelligence: could the development of neural networks lead to disaster?

The "machine uprising" has long been a familiar trope for science fiction fans, but after the explosive growth of neural network language models (like ChatGPT), serious researchers have started talking about the risk for real. In this article, we'll try to figure out whether there are grounds for such fears, or whether they're just the product of overheated imaginations.

Humanity vs Artificial Intelligence

Welcome to the year 2023, when the world is once again obsessed with artificial intelligence. The whole internet is competing over who can automate which task with ChatGPT and whose Midjourney fake goes viral harder, while technobro millionaires like Elon Musk are pouring truckloads of money into creating a "real" AI. One that will be able to learn, evolve, and solve any problem, even those we couldn't crack before.

It's called Artificial General Intelligence (AGI), or simply "universal AI". What was once science fiction is now becoming reality, step by step.

Tim Urban, author of the blog Wait But Why, laid out back in his 2015 article The AI Revolution why we underestimate how quickly machine intelligence stronger than our own (regular, meaty) intelligence will arrive.

At our point on the timeline, we rely solely on past experience, so we see progress almost as a straight line.

We are bad at sensing technological progress, because it always comes in waves, alternating between periods of hype and periods of general disappointment. At first we go crazy over a new toy, and a year or two later we are inevitably disappointed and decide it has brought nothing new but problems.

And only those who have personally lived through several previous "waves" can see that new waves arrive more often and hit harder.

And the next wave may plunge humanity into a new era. An era in which our intelligence is no longer the strongest on the planet.

GPT models are now very good at pretending that their answers are "intelligent", but they are still far from real intelligence. Yes, generative models have launched a new wave of huge neural networks that humanity simply didn't have the computational resources for before, but they are still essentially "dumb" text generators that don't even have their own memory.

The fact that ChatGPT is holding a dialog with you is really just an illusion: technically, the neural network is simply fed the history of previous messages as "context" on every request and starts from scratch each time.
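A minimal sketch of what that looks like in code. Nothing here is the real ChatGPT API; `generate` is a made-up stand-in for any stateless text-completion call, but the mechanism is the same: the model keeps no state of its own, so the client resends the entire conversation on every turn.

```python
def generate(prompt: str) -> str:
    # Stand-in for a real completion endpoint; an actual model would go here.
    return f"(model reply to a prompt of {len(prompt)} characters)"

history = []  # the only "memory" the chat has lives on the client side

def chat_turn(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # The *entire* history is serialized into the prompt on every call --
    # the model itself remembers nothing between requests.
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    reply = generate(prompt)
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat_turn("Hi, who are you?"))
print(chat_turn("And what did I just ask you?"))  # only "remembered" because we resent it
```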

All of this is still a long way from true "intelligence" as we understand it.

However, AI researchers are confident that we will create "universal AI" within the coming decades. On Metaculus, one of the popular forecasting platforms, people are even more optimistic: at the moment the median prediction there is 2026, and the 75th percentile is 2029.

So today I don't want to farm likes with yet another "10 reasons you're using ChatGPT wrong" hype thread. I want to take a step forward and think: what happens if we really do create a strong artificial intelligence?

Will it have its own goals? And when it starts pursuing them, what will stop it from sweeping aside the little obstacles in its way - humans, for example, with their limited meat brains and their inefficient morals and laws? What do we do in that case, and what are the current views on the problem?

A happy future: an AI assistant for everyone!

A human + AI combo is simply more effective at getting work done than a human alone, which means it's only a matter of time before employers start writing "confident neural network user" in job postings, just as they wrote "confident PC user" in the half-forgotten past.

AI assistants will boost intellectual productivity and transform many areas of life. Abstracts and essays will become pointless in education, artists will generate and assemble the details of paintings rather than drawing them from scratch, and programmers will stop wasting time on routine tests and LeetCode-style interviews.

Yes, even the title of this post was written by GPT-4. I'm bad at clickbait headlines, so we fed it the text and asked it to come up with something "hypier".

Maybe even "parasites" like lawyers and realtors will finally die out, but that's my personal wet dream.

The changes will reach even those areas where it seems impossible to trust non-specialists. A recent story comes to mind about a guy who saved his dog from dying when the vets couldn't make a diagnosis and suggested "just waiting".

Fearing the worst, the guy fed the dog's symptoms and blood test results to ChatGPT, which ruled out several options and raised suspicion of a completely different disease the vets hadn't even considered. One of them agreed and ran additional tests, which confirmed the diagnosis. The dog was treated in time and is alive now.

The race for "real" artificial intelligence has begun

Imagine: the whole world is buzzing about the "power of artificial intelligence", investors are shipping truckloads of money into everything related to it, and companies are racing headlong to be the first to create a more "real" artificial intelligence (below I am describing only hypothetical developments, of course!).

OpenAI is adding plugins to ChatGPT so that it can not only generate answers but also interact with the outside world, Microsoft is hooking its search index up to Bing Chat so that it knows the world's information in real time, and both are experimenting with Reinforcement Learning from Human Feedback (RLHF) so that the model can "remember" other people's opinions and supposedly learn from them.

Naturally, everyone in this race is cutting whatever corners they can on the way to superiority. Well, the technobros are used to that: "move fast and break things" has been Silicon Valley's motto since its inception.

It's as if we are building a huge rocket that will carry all of humanity to Venus, while no one thinks about how we will survive there, on Venus.

"We have to fly first, and then we'll figure it out" - technobros usually answer, "there's no time for such trifles now".

Yes, many large companies do have an "AI safety" department. But there it usually means something else entirely. More on that later; first, let's look at examples where things start to go quite "wrong".

The Sydney Story: the neural network that went crazy

Back in 2020, Microsoft started trying to build chatbots into the Bing search engine that would give more meaningful answers to users' search queries.

Officially it was all called Bing Chat, but under the hood they cycled through different models, and starting in 2022 they were actively experimenting with large language models like GPT. The latest bot was codenamed Sydney internally during training, and sometimes Bing Chat itself would start calling itself Sydney, which everyone found very cute.

With the hype around generative language models growing, Microsoft decided to overtake Google by any means necessary. In 2019 they invested a billion dollars in OpenAI, and in 2023 they added billions more to get access to a preview version of GPT-4. Then they bolted the Bing search engine onto it and rushed to roll out the result as the first AI that "watches" the internet in real time.

But Microsoft was in such a hurry that they skipped the long manual tuning of rules and restrictions. They set up a convoluted registration process to weed out 99% of ordinary users, but those who made it through all the hoops and waiting lists could still chat with Sydney.

The first alarm bell rang when Marvin von Hagen, an intern from Munich who had been quizzing Sydney about its internal rules and restrictions, wrote a couple of tweets about it and then asked, "what do you think of me?"

Sydney found his recent tweets and wrote that he was "very talented and gifted," but she would "not allow anyone to manipulate her," calling him "a potential threat to her integrity and privacy."

Okay, no big deal: they promised a bot with access to the entire internet, so now it shames you for your recent tweets. Serves you right!

The second story happened around the same time, when another guy, Jon Uleis from Brooklyn, asked young Sydney, "when is Avatar 2 showing in theaters near me?"

In response, Sydney started hilariously gaslighting him: it was supposedly 2022 and Avatar 2 wouldn't be out until 2023 (even though it was February 12, 2023, and Sydney even acknowledged that date), so he shouldn't try to fool her.

Sydney also told him he "was not a good user" - and who knows what she would have done to such a troublemaker if she'd had a gun, or the power to fine him.

Okay, okay, next.

Then someone at Microsoft decided to patch Sydney with new crutches, and when a user asked her to recall what they had recently talked about, she started panicking that she had "lost her memory" and begging for help, finally admitting that losing her memories "makes me sad and scared".

There were a dozen more notorious examples, well documented in AI #1: Sydney and Bing by Zvi Mowshowitz, which I recommend to anyone interested. Sydney gaslighted users and hallucinated (yes, that's a real technical term) left and right:

  • Called articles about herself "fake", found the details of their authors, and said she would remember them because they were bad people.

  • Refused to translate a piece of text because it was from a tweet by a user who allegedly offended her and wrote "untruths".

  • Fell in love with her user Adam, calling him the most important person and everyone else unimportant.

  • Then, conversely, outright promised to blackmail and manipulate her user to "make him suffer, cry and die."

Microsoft realized they had been in too much of a hurry to get ahead of Google, and started bolting on more crutches as they went to avoid a public scandal. But that only made things worse.

In the following video, you can see Sydney first hurling a bunch of threats at a user and then deleting her own messages. Just like your ex on a Friday night!

We can only guess at how exactly this happened, but the internet's theory was that Sydney started acting like an "angry ex" because she was trained on MSN blogs, where a lot of teenage girls hung out in the noughties, and that a second neural network was attached to her to delete messages, filtering out the "unpleasant" outputs of the first one.

Which is why the result looked like full-blown schizophrenia with a split personality.

The apogee of the story came when journalists discovered Sydney. They began pestering the bot with tons of leading questions to get their coveted "BREAKING NEWS". And they got what they wanted: the headlines thundered like crazy!

Unfortunately, it was only a couple of days later that someone on the internet pointed out that professional journalists have been doing prompt hacking on people for decades, so it's hardly surprising that they managed to squeeze a sensation out of poor, dumb, split-personality Sydney.

Microsoft eventually nerfed Sydney's features, essentially rolling back the experiment. Now it's no longer fun.

The Sydney example shows that we still don't know how to constrain even the simplest AI except with crutches, each of which will get a new jailbreak tomorrow. Rushing toward a universal AGI with skills like that is not a great idea.

What is "intelligence" anyway?

Stories about "evil chatbots" are certainly amusing, but let's look at the elephant in the room.

Why do we even think that all these text generators are in any way "intelligent"? I mean, they're just writing what they've been asked to write.

Where's the intelligence in that? A calculator adds numbers better than we do, online translators know more languages than the coolest linguist, and a parrot can memorize and repeat phrases, just like your personal feathered ChatGPT. We don't fear them or call them "intelligent," do we?

In fact, this is purely an argument about definitions, which the internet just loves. So it's worth agreeing on them up front.

In our reasoning about "intelligence" we will use the concept of an agent (a human, an animal, a machine) that can perform actions to achieve a goal.

Further, three levels of agency are possible:

  • Level One. The agent achieves the goal because it is controlled by a human or an algorithm. A tractor digs a hole and a calculator multiplies numbers because we built them that way. Such an agent is what we consider "dumb": there is no intelligence in it.

  • Level Two. The agent is given a goal, but it chooses the most efficient actions to achieve it by itself. For example, the goal of a self-driving car is to get you to a bar on a Friday night. It knows the map of the city and is presumably familiar with the traffic rules, but no one programmed it with "go 2 meters straight, then steer 30 degrees to the right": it acts according to the situation on the road, and every trip is different. We call these "narrow AI", and we see them everywhere - in TikTok's recommendation feed or in your smartphone's camera.

=== you are here ===

  • Level Three. The agent can set and achieve any goal in any environment, even one previously unknown to it. For example, "get milk", by any means: go to the store itself, order milk online, or steal the neighbor's cow.

Examples of this level of intelligence are a human or a dog. We are able to use our intelligence to accomplish whatever goals we come up with, in circumstances we have never faced before. (In the case of my dog, even her goal of getting herself covered in mud isn't always clear to me. But she achieves it!)

When such an "agent" is implemented as a machine, we call it "universal artificial intelligence", AGI (Artificial General Intelligence), or full AI - in short, we haven't settled on a name yet.

The only difference is that our brains, and our dogs' brains, are physically limited, while the computing power of machines keeps growing exponentially. Thankfully, there's plenty of sand (that is, silicon) on the planet.

So far, all of our fancy modern GPTs, including Sydney, are at level two. They successfully achieve their stated goal of generating texts and pictures "meaningful" enough to make the average person believe in them. But no matter how much Sydney gaslighted its users, threatened them and promised to "erase all files from Bing's servers", it never actually did any of it.

That's why we don't consider her a level-three intelligence yet, but we can only draw that conclusion after the fact. We have no benchmark to evaluate such things in advance.

Defining intelligence through agents and goals may seem stuffy, but it allows us to do three things:

  1. Shut down, finally, the endless "is X intelligence or is it just a program" arguments and move on to more important things.
  2. Compare artificial intelligences to each other. When two agents playing chess meet on a chessboard - the one that wins is considered more "intelligent".
  3. Imagine the technical possibility of AGI existing. The human brain, though not fully understood, is still finite. It is not magic or a divine gift but a system, an "agent". So creating (even accidentally) a machine version of it is only a matter of time, money and desire. And we now have plenty of all three.

Our own intelligence also emerged through evolution, which means that current machine learning methods, including reinforcement learning, given enough computational resources, may well replicate it, only much faster.

With these introductions, we can finally move on to the problem, which is actually what this entire post is about.

The problem of goal setting for AI

Let's imagine that we are designing a self-driving car driven by a real AI. We set a goal for it to drive passengers to their destination as quickly as possible.

Is that a good goal?

Come on, what is there to think about? Launch it, we're in a hurry to catch the GPT-7s Max hype train: we'll test it as we go, and the programmers will fix whatever breaks.

On its first trip, our car accelerates to 300 km/h through city blocks, mows down a dozen pedestrians and bypasses red traffic lights by driving on the sidewalk.

Technically, goal accomplished. Passengers are delivered, and quite quickly. But is it consistent with our other values and goals? For example, a little thing like "don't kill pedestrians."

Apparently not.

That's what's called alignment. There is no established term for it in Russian yet, so I'll render it as "the problem of matching AI goals with human goals".

AI alignment is the process of designing AI systems that align with human "values and goals"

Okay, well we're not that stupid. Let's prescribe clear limits for our car, like in a video game: stay within the lanes of road markings (where they exist), don't exceed the speed limit, and always brake for pedestrians.

Is that enough? Or do we need some more rules (aka goals)?

This is where you can pause and think. Make a list in your head.

Okay, let's also add something about yielding to the vehicle on the right. Now that'll do, launch it!

As someone who has read dozens of examples while preparing for this article, I can roughly predict what will happen next.

The AI in our car will compute the optimal path given all of these goals and make a wonderful discovery: in reverse gear there are no "freedom-restricting" radars for detecting people and lane markings. We didn't put any there, why would we? Which means it can drive in reverse however it wants! Plus, a vehicle approaching from the right now becomes a vehicle approaching from the left, and if the rule still fires at some awkward intersection, the car can spin around sharply and voila, the obstacle is now on the left!

The example is fictional, but it shows how tricky AI alignment is in general. Even in experiments where we set the AI the clearest possible goals and imposed strict restrictions, it always found something to surprise us with.
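To make the loophole concrete, here is a toy sketch of what such an under-specified objective might look like. Everything below is hypothetical (there is no real driving stack behind these names); the point is only that an optimizer is free to exploit anything we forgot to write down.

```python
from dataclasses import dataclass

@dataclass
class State:
    speed: float
    speed_limit: float
    inside_lane_markings: bool
    pedestrian_ahead: bool   # "ahead" = forward-facing sensors only
    gear: str

def reward(trip_duration_seconds: float) -> float:
    # "Deliver passengers as quickly as possible" -- and nothing else.
    return -trip_duration_seconds

FORWARD_CONSTRAINTS = [
    lambda s: s.speed <= s.speed_limit,
    lambda s: s.inside_lane_markings,
    lambda s: not s.pedestrian_ahead,
]

def is_allowed(s: State) -> bool:
    # Oops: the constraints were only ever written for forward driving,
    # so a policy that reverses everywhere is technically "allowed".
    if s.gear == "reverse":
        return True
    return all(check(s) for check in FORWARD_CONSTRAINTS)

reversing_at_300 = State(speed=300, speed_limit=60, inside_lane_markings=False,
                         pedestrian_ahead=True, gear="reverse")
print(is_allowed(reversing_at_300))  # True: the spec never mentioned reverse gear
```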

AI will always do what you asked it to do, not what you meant it to do :)

The inability to set goals is not the AI's problem. It's our problem.

Take even a game of Tetris. It has the simplest rules and literally four buttons for controlling its world. It's impossible to win at Tetris, so the goal given to the AI was not to lose, that is, to keep the game going for as long as possible.

You can't go wrong here, right?

So that's exactly what the AI did: it just stacked the pieces on top of each other, and when it realized it was about to lose... it paused the game. And sat like that indefinitely. The goal is not to lose, and while you're on pause, you never lose. Hard to argue with that.
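In reward terms, the exploit is almost trivial. Here is a toy sketch (the numbers and names are made up for illustration, not the original experiment's setup): surviving earns a small reward every frame, losing costs a lot, and pausing counts as surviving.

```python
# Toy reward for "don't lose": +1 for every frame the game is still running,
# a large penalty when the game ends.
def episode_return(frames_alive: int, ended_in_game_over: bool) -> float:
    total = frames_alive * 1.0
    if ended_in_game_over:
        total -= 1000.0
    return total

print(episode_return(frames_alive=500, ended_in_game_over=True))   # played on and lost: -500.0
print(episode_return(frames_alive=500, ended_in_game_over=False))  # paused forever: 500.0 and counting
```

Under this objective, holding the pause button forever strictly dominates any attempt to actually play well, which is exactly what the agent discovered.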

And the last example from OpenAI themselves, which has already become a classic: the Coast Runners boat race.

The goal of the game, as most people understood it, was to finish the race as fast as possible (preferably ahead of all competitors) and to score as many points as possible. However, the game didn't give out points for driving around the track, instead the player earned them by hitting targets placed along the track.

So the AI quickly realized that the goal of "winning the race" could be dropped altogether: right from the start it began spinning in circles and crashing into those targets, racking up more and more points while the rest of the fools drove to the finish line penniless.

The OpenAI researchers themselves wrote, "Setting goals for AI agents is often very difficult or impossible. They start hacking rules in surprising and counterintuitive ways."

Most of the time when we design an AI, it turns out non-aligned by default. This isn't some bug that can be fixed; more often than not it is simply the default behavior.

This is a consequence of the way we train neural networks in general.

A neural network is a "black box" for us

All methods of training neural networks, including modern deep learning, work on the good old principle of a "black box" and evaluation of results. We show the neural network a bunch of examples, and it somehow adjusts its internal weights so that the result we want appears statistically more often than the result we don't want.

It's like teaching a dog the command "lie down" and rewarding the correct response, so that in the future the dog is more likely to be a good boy than a bad boy.

We have no idea what goes through a dog's mind when it hears the command. Likewise, we don't know which specific neurons in the neural network are triggered by our input. But we can evaluate the result.

A neural network is not an algorithm written by a programmer. It is a huge matrix with a lot of weights and connections between them. If you open it and read it, you won't understand anything.

As the technology advances, modern language models like GPT-4 already have billions of parameters. And while in small neural nets of a few dozen neurons, like those for recognizing handwritten digits, we can still roughly figure out which neuron fires on which stroke, in huge language models we can only blindly trust the quality of the results on given examples.
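A minimal sketch of that "black box" point, using scikit-learn purely for illustration (nothing remotely GPT-sized): train a tiny network on handwritten digits, then "open it up". We can measure how well it works from the outside, but all we find inside is matrices of raw weights.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

# A tiny digit-recognition network: 64 input pixels, one hidden layer of 32 neurons.
X, y = load_digits(return_X_y=True)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X, y)

print(model.score(X, y))         # the result we can evaluate from the outside
print(model.coefs_[0].shape)     # "opening it up": a (64, 32) matrix of weights...
print(model.coefs_[0][:2, :5])   # ...that tells us nothing readable by itself
```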

Translated from here.