When the Machine Learns Too Well

From Engineering to Experimentation: Can We Keep AI Safe?

Sep 14, 2025

In 1818, Mary Shelley published her novel Frankenstein, which she had begun writing at the age of eighteen and which is widely regarded as the first work of science fiction. As most readers know, the story follows Victor Frankenstein, a passionate young scientist who creates a creature with a human-like mind and body, only taller and stronger. Victor regrets his creation the moment it comes to life, but by then the creature is beyond his control, with devastating consequences that he had not anticipated.

More than two centuries later, AI reminds us of the creature in Shelley’s novel, with Geoffrey Hinton, known as “The Godfather of AI”, standing as the living Victor.

In a 60 Minutes interview, when asked, “Does humanity know what it is doing?” Hinton gave a shocking but firm answer: “No.”

Hinton: “We have a very good idea of sort of roughly what it is doing. But as soon as it gets really complicated, we don’t actually know what’s going on any more than we know what’s going on in your brain.”
Host: “What do you mean we don’t know exactly how it works? It was designed by people?”
Hinton: “No, it wasn’t. What we did was we designed the learning algorithm, that’s a bit like designing the principle of evolution. But when this learning algorithm interacts with data, it produces complicated neural networks that are good at doing things. But we don’t really understand exactly how they do those things.”

Moreover, when it comes to the path forward for ensuring safety, Hinton acknowledged that the future is unpredictable.

“I don’t know. I can’t see a path that guarantees safety. We’re entering a period of great uncertainty where we’re dealing with things we’ve never dealt with before. And normally, the first time you deal with something totally novel, you get it wrong. And we can’t afford to get it wrong with these things. Why? Well, because they might take over.”

There have been many debates about AI, ranging from optimistic promises to gloomy predictions. But Hinton has spent 50 years nurturing the growth of artificial neural networks firsthand, so his concerns need to be taken seriously.

A Brief History of Artificial Neural Networks

When Hinton was in high school, he became interested in figuring out how the brain works. Then he realized he’d better “figure out how to build one to understand it.” With this realization, Hinton joined the Ph.D. program in artificial intelligence at the University of Edinburgh in 1972.

At that time, AI was going through its first “winter” after the single-layer neural network known as the Perceptron failed to live up to the promises made by its inventor, the psychologist and computer scientist Frank Rosenblatt. Rosenblatt drew inspiration from the biological brain, specifically Hebb’s rule, “Neurons that fire together wire together,” and built a learning algorithm that adjusts the connection weights of artificial neurons. (Please refer to one of my previous articles for a more detailed explanation.)

To the disappointment of many, what Perceptrons could do turned out to be very limited. Even the school of artificial intelligence from which Hinton earned his doctorate turned away from neural networks, and he was the only exception. In his dissertation, Hinton pioneered a neural network with parallel, interconnected artificial neurons in a graph-like structure to achieve globally consistent visual interpretations. The structure was considerably more advanced than a Perceptron, and its development laid the groundwork for the future deep neural network architecture.

However, Hinton’s neural network could not learn effectively because of a fundamental challenge: how to adjust the connection weights through multiple layers?

After graduating, he came to the US and worked as a postdoctoral researcher under psychologist David Rumelhart at the University of California, San Diego. In October 1986, Hinton finally solved the problem. Rumelhart, Hinton, and Ronald Williams, a mathematician, published their groundbreaking paper in Nature, “Learning representations by back-propagating errors.” As its name suggests, the backpropagation method uses the chain rule to propagate the error signal backward from the output layer through the hidden layers toward the input, so that gradient descent can adjust every connection weight along the way.

The paper addressed a significant challenge that had stalled the development of neural networks for over two decades. Their method, simple and elegant, enables a neural network of any size to adjust the connection weights of any number of internal layers based on the errors measured at the output layer.
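To make the idea concrete, here is a minimal sketch of backpropagation on a tiny network with one hidden layer. The network size, the sigmoid activation, the squared-error loss, and all function names are assumptions chosen for illustration, not details taken from the 1986 paper. The sketch computes the gradients with the chain rule and then checks them against a brute-force finite-difference estimate.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, x):
    """Return the hidden activations and the scalar output for input vector x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def loss(w_hidden, w_out, x, target):
    _, output = forward(w_hidden, w_out, x)
    return 0.5 * (output - target) ** 2

def backprop(w_hidden, w_out, x, target):
    """Gradients of the squared error with respect to all weights, via the chain rule."""
    hidden, output = forward(w_hidden, w_out, x)
    # Error signal at the output unit (derivative of the loss w.r.t. its pre-activation).
    delta_out = (output - target) * output * (1.0 - output)
    grad_out = [delta_out * h for h in hidden]
    # Propagate the error signal backward to each hidden unit.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out

def numerical_grad_hidden(w_hidden, w_out, x, target, i, j, eps=1e-6):
    """Finite-difference estimate of one hidden-layer gradient, used only as a check."""
    plus = [row[:] for row in w_hidden]
    minus = [row[:] for row in w_hidden]
    plus[i][j] += eps
    minus[i][j] -= eps
    return (loss(plus, w_out, x, target) - loss(minus, w_out, x, target)) / (2 * eps)

random.seed(0)  # arbitrary seed, only for repeatability
x, target = [0.5, -1.0, 1.0], 1.0  # the last input acts as a bias term
w_hidden = [[random.uniform(-1, 1) for _ in x] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(2)]

grad_hidden, grad_out = backprop(w_hidden, w_out, x, target)
max_diff = max(
    abs(grad_hidden[i][j] - numerical_grad_hidden(w_hidden, w_out, x, target, i, j))
    for i in range(2) for j in range(3)
)
print(max_diff)  # a tiny number means backprop agrees with the brute-force estimate
```

The point of the comparison is cost: the finite-difference check needs two forward passes per weight, while backpropagation obtains every gradient in a single backward sweep, which is what makes training deep networks practical.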

Notably, in this original paper, Rumelhart, Hinton, and Williams made another keen observation on the learning ability of the neural network — the ability to generalize with “hidden” features. They summarize in the abstract:

“As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.”

Previous machine learning methods, ranging from traditional regression to support vector machines (SVMs), all relied on data scientists to hand-pick features and test them statistically to determine which ones have the most predictive power. For example, to predict who is likely to renew a magazine subscription, the most relevant features are likely to be customers’ age, gender, and household income, among others. Statisticians then need to perform various analyses to confirm or reject each of these factors.

By contrast, the internal layers of a neural network can automatically distill features through numerous rounds of weight adjustments based on trial and error. In other words, deep neural networks have a general learning capability that does not depend on detailed rules and instructions from humans.

Armed with the backpropagation algorithm, the only other variables that computer engineers needed to make leaps in AI were more data and computing power.

In 2009, computer scientist Fei-Fei Li led her team in building a considerably large image dataset, called ImageNet, including over 14 million hand-annotated images across about 22,000 categories. In 2010, her team launched a contest in the AI community, utilizing 1.2 million images from 1,000 categories for training and an additional 100,000 images to evaluate performance.

In parallel, Nvidia released the first Graphics Processing Unit (GPU) in 1999, which was initially designed for rendering graphics faster in video games with parallel data processing. It was soon discovered that the chips were also ideal for linear algebra operations (e.g., matrix multiplication) required by training neural networks. In 2006, Nvidia released CUDA (Compute Unified Device Architecture), a software platform that allowed developers to use GPUs for non-graphical, general-purpose computing tasks.

By 2012, Hinton was a professor at the University of Toronto. His two Ph.D. students, Alex Krizhevsky and Ilya Sutskever, came up with the idea of using GPUs to run a large, computationally intensive neural network. Specifically, they built one of the largest neural networks of that time, called AlexNet, comprising about 650,000 artificial neurons and 60 million weights, and trained it on two Nvidia GPUs. They won that year’s ImageNet contest with an error rate significantly lower than that of the previous two years’ champions.

AlexNet’s win marked the start of an era of incredibly rapid AI growth enabled by deep neural networks. Remarkably, after more than two decades, Hinton was back at the forefront of AI advances, thanks to his unwavering faith and effort in neural networks since his days as a graduate student.

Hinton went on to receive the prestigious Turing Award in 2018, alongside Yoshua Bengio and Yann LeCun. In 2024, he received the Nobel Prize in Physics, along with physicist John Hopfield, for their pioneering discoveries in artificial neural networks.

Today, the scale of neural networks is so immense and the achieved intelligence is so astonishing that it has even alarmed Hinton. Since 2023, he has become a full-time advocate of AI safety. This marks a sharp turn in Hinton’s long career in neural networks.

Why is today’s AI so unsettling to Hinton?

First of all, in just a decade, deep neural networks have scaled at an unprecedented speed. GPT-4 is reported to have about 1.76 trillion parameters and to have been trained on approximately 25,000 NVIDIA A100 GPUs, although OpenAI has not confirmed these figures. This colossal scale was unthinkable when AlexNet was trained on a home computer with only two early-generation GPUs.

Meanwhile, those large, deep neural networks have become so complex that no one, including Hinton himself, fully understands why they are capable of doing what they are doing and what their future holds. This is the most unsettling part for Hinton.

Before the rise of deep neural networks, machine learning (ML) relied on well-established mathematical principles. One of them is the bias-variance tradeoff, which predicts that a model that is too simple tends to underfit the training data and therefore performs poorly (high bias). In contrast, a model that is too complex risks overfitting: it fits the training data exceptionally well, including spurious patterns and noise, but falls short when generalizing to new cases (high variance).

Therefore, the theory recommends finding a sweet spot, or Goldilocks point, between underfitting and overfitting. This point sits at the bottom of a U-shaped curve that plots prediction error against model capacity, and it is reached with methods such as adding regularization and refining the model architecture.
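The tradeoff can be seen in a toy experiment. The sketch below is an illustration under stated assumptions rather than anything from the research discussed here: it uses k-nearest-neighbor regression as a stand-in model, a made-up task (noisy samples of a sine wave), and an arbitrary random seed. In this model, a small k means high capacity and a large k means low capacity.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, data, k):
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def make_data(n, offset, rng):
    """Noisy samples of sin(x) on an evenly spaced grid (a made-up toy task)."""
    xs = [offset + 2 * math.pi * i / n for i in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

rng = random.Random(0)  # arbitrary seed, only for repeatability
train = make_data(40, 0.0, rng)
test = make_data(40, 0.05, rng)  # shifted grid, so no test x equals a training x

# k=1 memorizes the training set, k=40 predicts the global average everywhere.
errors = {
    k: (mean_squared_error(train, train, k), mean_squared_error(train, test, k))
    for k in (1, 5, 40)
}
for k, (train_err, test_err) in errors.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k=1 the training error is exactly zero because every point is its own nearest neighbor, yet the model also reproduces the noise. With k=40 the model ignores the input entirely and underfits. An intermediate k typically gives the lowest test error, which is the sweet spot the classical theory describes.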

Surprisingly, deep neural networks with hundreds of millions or billions of parameters fit their training data almost perfectly and still make accurate predictions on new data. For them, there is no need to search for a sweet spot, because they behave very differently from conventional machine learning models: they can fit the training data, noise included, and generalize well at the same time.

Research from Belkin and colleagues reveals the “double descent” effect: unlike the conventional U-shaped bias-variance tradeoff curve, the error starts to decrease a second time once the model becomes sufficiently large. What causes this second descent is not yet fully understood.

Hinton is right. Understanding large neural networks now requires empirical observation, as they have already entered territory beyond the reach of existing ML theories.

Another surprise comes from how these networks learn. In theory, finding the best fit in a complex model using gradient descent requires reaching the global minimum while avoiding getting stuck in a local minimum that is shallower than the global one. For large neural networks, this problem was expected to worsen as complexity grows. Yet, these networks learn efficiently and find good solutions smoothly without being “stuck”. Some researchers suggest that there are many good “mini” valleys in large networks that are nearly as good as the best one, though the exact reasons remain unclear.

Unlike traditional engineering, where systems such as airplanes are well understood and predictable, large AI models operate in ways that often defy straightforward mathematical explanations. Engineers know precisely how each part of an airplane functions, but AI systems require extensive empirical testing and experimentation before their capabilities are understood, and this challenge grows as the systems become more advanced.

As an inventor who has spent his entire life working with neural networks, Hinton fully understands every component of the deep neural networks behind today’s LLMs. Yet he had not anticipated the revelation he has now: these networks seem to have set off on a trajectory of their own, one that departs further from the human brain and may surpass it. As Hinton said in another interview:

“I thought if we built computer models of how the brain learns, we would understand more about how the brain learns. And as a side effect, we will get better machine learning on computers, and all that was going on very well. Then, very suddenly, I realized recently that maybe the digital intelligences we are building on computers were actually learning better than the brain, and that sort of changed my mind after about 50 years of thinking we would make better digital intelligences by making them more like the brain. I suddenly realized we might have something rather different that was already better.”

Hinton then gave three reasons why he thinks AI is now different from, and superior to, the human brain:

First, it demonstrates an understanding of meaning, such as the ability to explain why a joke is funny without being taught explicitly. Second, ChatGPT has approximately 1–2 trillion weights, while the human brain has around 100 trillion. With only 1–2 percent of the brain’s capacity, it learns and stores far more information than a human does, suggesting that its learning algorithm might be better. Third, models running on supercomputers can exchange information by directly copying weights, which is far faster: “trillions of bits a second,” as opposed to “hundreds of bits a second” for the biological brain.

Hinton recognized many years ago that the brain is unlikely to use a backpropagation-like learning algorithm, because that would require the brain’s internal neuronal layers to preserve all the previous weights before they are adjusted, which is not biologically feasible. Neuroscientists have been actively searching for alternatives in recent decades. Perhaps this mathematics-driven algorithm is what caused AI models to diverge from the brain in the first place.

In addition, computing hardware, especially GPUs, has accelerated AI development to a speed far beyond that of biological evolution.

Can AI understand meaning?

When LLMs first appeared, numerous articles explained that such a model operates simply as a word-by-word prediction machine based on statistics. In other words, the model is “simply generating the most probable completion of the sequence of words in its current prompt,” as Stanford professors Fei-Fei Li and John Etchemendy write in a Time article.

It is likely not as simple as it sounds. Hinton says in the 60 Minutes interview:

“To predict the next word, you have to understand the sentences, so the idea that they’re just predicting the next word, so they’re not intelligent, is crazy — you have to be really intelligent to predict the next word really accurately.”

As mentioned above, when Hinton and his co-authors published the backpropagation algorithm, they also observed that the model gains a general learning capability by creating new features within the internal layers of the neural network. These features are the source from which new concepts, rules, and creativity originate.

For example, current LLMs have learned syntactic rules (e.g., grammar) and semantic rules (e.g., synonyms) simply by reading text, rather than having engineers code the rules explicitly into a computer program. This learning process is similar to how children learn a language. These rules are internal representations enabled by the learned features in the high-dimensional space formed by the large number of word embeddings. (For more details on word vectors and embeddings, you might want to read my previous article on the Transformer architecture.)

As shown in a Microsoft research paper, word pairs that share a syntactic relationship, such as the singular forms “apple”, “car”, and “movie” and their plural forms “apples”, “cars”, and “movies”, exhibit nearly the same vector relationship, in terms of distance and angle, within the word-embedding space. As the weights of a word are adjusted during learning, those of its related words are adjusted as well so that these relationships are maintained. As such, language reasoning is carried out through consistent and coherent vector operations, such as addition, subtraction, and multiplication. As the research shows, after the model learns, a straightforward linear algebra calculation of “King - Man + Woman” points to a vector position close to “Queen,” because these words are already bound together as part of an internal representation.
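The vector arithmetic can be tried out on a toy vocabulary. The sketch below uses hand-made three-dimensional vectors whose axes (royalty, maleness, plural) are assumptions invented for this illustration. Real embeddings are learned from text, have hundreds or thousands of dimensions, and do not come with labeled axes.

```python
import math

# Hand-made 3-d vectors for illustration only: (royalty, maleness, plural).
vectors = {
    "king":   [0.9, 0.9, 0.0],
    "queen":  [0.9, -0.9, 0.0],
    "man":    [0.1, 0.9, 0.0],
    "woman":  [0.1, -0.9, 0.0],
    "apple":  [0.0, 0.0, -0.8],
    "apples": [0.0, 0.0, 0.8],
    "car":    [-0.3, 0.1, -0.8],
    "cars":   [-0.3, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))    # queen
print(analogy("apples", "apple", "car"))  # cars
```

Because the singular-to-plural offset is the same for every pair in this toy vocabulary, subtracting “apple” from “apples” and adding the result to “car” lands on “cars”. In learned embeddings the offsets are only approximately equal, which is why the nearest neighbor is used rather than an exact match.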

More abstractly speaking, the vector of each word comprises hundreds or thousands of weight values, with each mapped to a dimension, as all the words in the same model share the same number of dimensions. The result of the learning process, through iterations of gradient descent and backpropagation, is the gradual shifting of these vectors in the high-dimensional space, along which the relationship between them becomes increasingly consistent, coherent, and predictable, governed by the underlying implicit rules and contexts. At the same time, each dimension gradually pins an internal “feature”, shared by those vectors in the same dimensional frame.

Because it is impossible for humans to view a high-dimensional space, the word embeddings and their internal representation have been a “black box”, and no one knows exactly what those features are. One common method is to project a sample of the vectors into a 2- or 3-dimensional space by focusing on the most relevant features, which can also be very challenging and feasible mostly for much smaller models.

In 2021, OpenAI researchers published a study on the “grokking” behavior of a small transformer model, where the model seems to suddenly “get it” after an extensive period of training. The training set is like a two-variable (x, y) lookup table, and the result (let’s say z) is the sum of x and y modulo 97.

z = (x + y) mod 97 where 0 <= x, y < 97

The model is not given the formula. Instead, it is trained on half of the (x, y, z) triples and tested on the other half. Here are some examples:

z = 7, when x = 0, y = 7

z = 4, when x = 10, y = 91

z = 26, when x = 26, y = 0
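The dataset described above is small enough to build in a few lines. This is a sketch of the setup as the article describes it; the random seed and the exact way of splitting are assumptions, not details taken from the paper.

```python
import random

P = 97  # the modulus used in the article

# Every (x, y, z) row of the lookup table z = (x + y) mod 97.
table = [(x, y, (x + y) % P) for x in range(P) for y in range(P)]

# Random 50/50 split; the seed is arbitrary and chosen only for repeatability.
rng = random.Random(0)
rng.shuffle(table)
half = len(table) // 2
train, test = table[:half], table[half:]

print(len(table), len(train), len(test))  # 9409 4704 4705
```

The whole table has only 97 × 97 = 9,409 rows, so the model sees about 4,700 examples and has to infer the rule that fills in the rest.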

According to the paper, within about 1,000 iterations the model predicts the training examples correctly 100% of the time, but performs poorly (seemingly at random) on new examples that it has not seen before. At this point, the model has merely “memorized” the training data, and its predictions amount to autocompletion.

However, when training continues for close to 100K iterations, something happens. The model seems to figure out the underlying formula, showing dramatically improved performance in “calculating” results for number pairs not present in the training data. After approximately 1 million iterations, the model calculates perfectly, having mastered the mathematical formula on its own.

When the researchers examined the internal features by projecting the high-dimensional space onto a two-dimensional one, they observed that the number vectors were arranged in a circular pattern, with related numbers sitting next to one another in an orderly way.
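One way to see why a circular layout suits this task: if each number is placed on a circle like an hour on a clock face, modular addition becomes a rotation. The sketch below illustrates that geometric fact only. It is not the paper's analysis and makes no claim about what the trained model computes internally, and the function names are made up for this example.

```python
import math

P = 97

def to_point(k):
    """Place residue k on the unit circle at angle 2*pi*k/P."""
    angle = 2 * math.pi * k / P
    return math.cos(angle), math.sin(angle)

def rotate(point, k):
    """Rotate a point by the angle that corresponds to residue k."""
    cx, cy = to_point(k)
    x, y = point
    return x * cx - y * cy, x * cy + y * cx

def to_residue(point):
    """Read back the residue whose circle position is nearest to the point."""
    angle = math.atan2(point[1], point[0]) % (2 * math.pi)
    return round(angle * P / (2 * math.pi)) % P

def add_on_circle(x, y):
    """Compute (x + y) mod P using only positions and rotations on the circle."""
    return to_residue(rotate(to_point(x), y))

print(add_on_circle(10, 91))  # 4
```

Going once around the circle brings you back to the start, which is exactly what “mod 97” means, so a representation with this shape encodes the wraparound for free.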

This signifies a leap in generalization that resembles human inductive reasoning. When a child sees a new bird, they know it is a bird, even though they can’t name it. We are also skilled at identifying cause-and-effect relationships when we repeatedly observe two events, one preceding the other. The research indicates that a deep neural network takes much longer than humans to generalize. But what causes the difference, and what factors could accelerate this process? Notably, the experiment was done on a very small decoder-only transformer with only two layers and four attention heads. Whether the size of the model played a role is another question that remains to be answered.

It has long been known that new emergent phenomena, such as complex, large-scale properties or behaviors of a system, can arise spontaneously from the seemingly simple interactions of smaller, fundamental constituents. However, how the phenomenon occurs cannot be easily explained or predicted by examining those parts in isolation. These emergent phenomena are ubiquitous, manifesting in various aspects of climate, human society, the stock market, human and animal behaviors, among others.

Unlike traditional computer programs, where every line of code is an instruction with a clear purpose, neural networks are black boxes that can produce emergent behaviors that even their designers did not predict.

This is why Hinton feels the urgency to raise awareness that engineers and scientists need to spend more time researching and conducting experiments with LLMs (and other large neural networks) to understand them. Hinton states in the same interview:

“In terms of keeping control of a super intelligence, what you need is the people who are developing it to be doing lots of little experiments with it and seeing what happens as they’re developing it and before it’s out of control. And that’s going to be mainly the researchers in companies.”

Furthermore, Hinton argues that the effort dedicated to this type of research should be close to 50%, equal to the effort allocated to model development. Dismally, 99% of the current effort is spent on the latter.

In Shelley’s classic novel, Victor Frankenstein initially feels passionate about creating a companion for himself, but he immediately abandons the creature when he finds that the outcome is not what he expected. Remarkably, the creature manages to learn everything from humans during his wandering, including language, human relationships, and morals. In his exchanges with Victor, he demonstrates remarkably coherent reasoning and persuasiveness with compassion.

Today, the story is no longer science fiction; it is coming ever closer to reality. Over two hundred years ago, Shelley already called the ambition of science and technology into question: Do scientists even know what they are creating, and have they ever thought about the eventual consequences?

In complete contrast to Victor, Hinton shows a scientist’s integrity through honesty, courage, and a sense of responsibility. He does not call for stopping or pausing the development of superintelligence merely out of fear and uncertainty. He acknowledges that AI is extremely helpful to humanity in many fields, such as reading medical scans, designing drugs, and understanding climate change, to name just a few.

What is truly needed, as Hinton tirelessly advocates, is to invest efforts and resources in researching and understanding large AI models and their emerging behaviors. This is to ensure humankind can develop safety measures to proactively prevent potential damage that AI might cause in the not-too-distant future.

Mosaics
