An Interactive Companion · Qualifying Paper Supplement

The Long Arc
of Machine
Intelligence

From Alan Turing's 1950 thought experiment to the agentic systems of today — seven decades in which the question "can machines think?" quietly became the question "what should we let them do?"

Structured for educators, researchers, and curious readers.  ·  Interactive throughout. Click, drag, type.
Chronological Index
1950 · 1970 · 1990 · 2010 · 2020 · 2026
§ I · 1950 — 2005

Foundations in silence

Before the past twenty years, there were fifty more. The modern deep learning boom did not appear from nothing — it grew out of a long, often discouraging arc of theorems, hardware, and the stubbornness of a small group of researchers who kept building neural networks while most of the field thought they were a dead end. To understand ImageNet, GANs, and Transformers, you first have to understand what had to go right for them to even be possible.

1950
Computing Machinery and Intelligence
Alan Turing · Mind journal
Turing proposed what is now called the Turing Test — replacing the metaphysical question "can machines think?" with an operational one about imitation. The paper framed intelligence as behavior, a decision whose consequences still reverberate.
1956
The Dartmouth Workshop
McCarthy · Minsky · Rochester · Shannon
Two months at Dartmouth College where the term "artificial intelligence" was coined. The proposal claimed the problem could largely be solved in a summer. It could not.
1958
The Perceptron
Frank Rosenblatt · Cornell
The first trainable artificial neuron — a machine that could learn a decision boundary from examples. The New York Times reported Navy expectations that it would soon "walk, talk, see, write, reproduce itself."
1969
Perceptrons (the book)
Marvin Minsky · Seymour Papert
A mathematical critique showing single-layer perceptrons could not learn XOR. Interpreted more broadly than warranted, it contributed to the first "AI winter" — a funding and interest collapse for neural approaches.
1986
Backpropagation, revived
Rumelhart · Hinton · Williams
The backpropagation algorithm for training multi-layer networks was popularized. The math had existed earlier in control theory; this paper made it usable. Multi-layer networks could now learn representations — in principle.
1989–98
Convolutional networks & LeNet
Yann LeCun · Bell Labs
LeCun's LeNet-5 recognized handwritten digits well enough to read U.S. postal ZIP codes and bank checks. Convolutions, weight sharing, pooling — the architectural ideas that would later underwrite AlexNet were already here.
1997
Deep Blue & LSTMs
IBM · Hochreiter & Schmidhuber
Deep Blue beat Kasparov through brute search — a symbolic victory, not a learning one. The same year, Long Short-Term Memory networks were published, quietly solving the vanishing gradient problem that had crippled recurrent networks.
1995–2005
The statistical interregnum
Vapnik · Breiman · Koller et al.
Support Vector Machines, Random Forests, and graphical models dominated industry. Neural networks were widely considered unfashionable; by many accounts, papers about them struggled to get past review at major ML conferences.
Interactive · Foundations
Rosenblatt's Perceptron — why a single layer cannot learn XOR
Drag the line. Pick a pattern.
Sliders: w₁ = 1.00 · w₂ = 1.00 · b = −1.00
A perceptron outputs 1 when w₁·x₁ + w₂·x₂ + b > 0. Classes must be separable by a single line. XOR is not.
Status: drag sliders to adjust the decision boundary. Green dots should end up on one side, red on the other.
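To see the same limitation outside the widget, here is a minimal sketch in Python (learning rate and epoch budget are arbitrary placeholder choices): the classic perceptron learning rule converges on AND, which a line can separate, and never converges on XOR, which no line can.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    """Classic perceptron rule: nudge the weights whenever a point is misclassified."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            if pred != target:
                w += lr * (target - pred) * xi
                b += lr * (target - pred)
                errors += 1
        if errors == 0:              # a separating line has been found
            return w, b, True
    return w, b, False               # no separating line found within the budget

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])       # linearly separable
y_xor = np.array([0, 1, 1, 0])       # not linearly separable

for name, y in [("AND", y_and), ("XOR", y_xor)]:
    w, b, converged = train_perceptron(X, y)
    print(f"{name}: converged={converged}, w={w}, b={b:.2f}")
```

A hidden layer removes the limitation, which is where the multilayer demo in the next section picks up.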
§ II · 2006 — 2011

The awakening

Three conditions converged in the second half of the 2000s: a quiet algorithmic comeback for deep networks, the arrival of commodity GPUs originally built for video games, and — critically — data at a scale no previous generation of researchers had access to. The "deep learning" brand took hold; most of the field did not yet believe it.

2006
Deep Belief Networks
Hinton · Osindero · Teh
Geoffrey Hinton's group showed that layer-by-layer unsupervised pre-training could initialize deep networks that previously would not train. The term "deep learning" began its public career.
2007
CUDA & GPU compute
NVIDIA
NVIDIA released CUDA, making GPUs general-purpose compute devices. The same silicon sold to gamers for rendering video games would soon run matrix multiplications for neural networks at unprecedented speed.
2009
ImageNet is released
Fei-Fei Li · Princeton / Stanford
A labeled dataset of 14+ million images across 22,000 categories, organized on the WordNet hierarchy. Built by crowdsourcing labels through Amazon Mechanical Turk. Most of the community saw it as an engineering curiosity, not a research catalyst.
2011
IBM Watson wins Jeopardy!
IBM Research
A pipeline of hundreds of NLP components beat champions Ken Jennings and Brad Rutter. Not deep learning — but a public event that signaled that "AI" could do something that felt genuinely surprising.
"We decided we wanted to do something that was completely historically unprecedented. We wanted to map out the entire world of objects." — Fei-Fei Li, on ImageNet, 2009
Interactive · Neural Network
A multilayer network recognizes a 4×4 pattern
Click pixels. Watch activations flow.
INPUT · 16 pixels
click to toggle
Output: waiting…
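Behind the animation is nothing more exotic than two matrix multiplications and a nonlinearity. Here is a minimal sketch of the forward pass the widget visualizes, with random placeholder weights and layer sizes rather than the demo's trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.5 * rng.normal(size=(16, 8)), np.zeros(8)   # 16 pixels -> 8 hidden units
W2, b2 = 0.5 * rng.normal(size=(8, 2)), np.zeros(2)    # 8 hidden units -> 2 classes

def forward(pixels):
    """pixels: length-16 vector of 0/1 values read off the 4x4 grid."""
    hidden = np.maximum(0.0, pixels @ W1 + b1)          # ReLU activations
    logits = hidden @ W2 + b2
    probs = np.exp(logits) / np.exp(logits).sum()       # softmax over the two classes
    return hidden, probs

pattern = np.array([1, 0, 0, 1,
                    0, 1, 1, 0,
                    0, 1, 1, 0,
                    1, 0, 0, 1], dtype=float)           # an "X"-shaped pattern
hidden, probs = forward(pattern)
print("hidden activations:", np.round(hidden, 2))
print("class probabilities:", np.round(probs, 2))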
§ III · 2009 — 2012 · Deep Dive

ImageNet & Fei-Fei Li

FEI-FEI LI · 2009
Founder, ImageNet

Fei-Fei Li

In 2007, as an assistant professor at Princeton, Fei-Fei Li made a bet that cut against prevailing wisdom in computer vision. The field was obsessed with algorithms; she argued the bottleneck was data. She proposed mapping the entire WordNet noun hierarchy to images — tens of thousands of categories, millions of examples, all human-labeled.

Grant reviewers told her the project was "not a good idea for a junior professor." She built it anyway, at first with undergraduate labelers, then at scale through Amazon Mechanical Turk — reportedly drawing on labelers from 167 countries at one point. ImageNet launched in 2009 and the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) began in 2010. For two years, error rates barely moved. Then came 2012.

The benchmark that made the field turn

Consider what the ImageNet challenge measured: top-5 classification error on 1,000 object categories drawn from 1.2 million training images. In 2010 the winning system — a careful ensemble of hand-crafted SIFT features and Fisher vectors — got 28% of images wrong. In 2011, a similar approach got 26% wrong. Incremental progress. Respectable work. Then in 2012, a submission from the University of Toronto called AlexNet — a deep convolutional network trained on two consumer GPUs — came in at 15.3% error. The next best entry was at 26.2%. It was not an improvement. It was a discontinuity.

ILSVRC top-5 error rate, winning entry · lower is better · human performance ≈ 5.1%
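For concreteness, the metric the chart tracks can be computed in a few lines. The scores below are toy random values, not ILSVRC submissions; the sketch only shows how top-5 error is counted.

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (n_images, n_classes) model scores; labels: (n_images,) true class ids."""
    top5 = np.argsort(scores, axis=1)[:, -5:]            # five highest-scoring classes
    hit = np.any(top5 == labels[:, None], axis=1)        # true label anywhere in the five?
    return 1.0 - hit.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 1000))                   # 1,000 images, 1,000 classes
labels = rng.integers(0, 1000, size=1000)
print(f"top-5 error of a random guesser: {top5_error(scores, labels):.1%}")   # ~99.5%
```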
Interactive · ImageNet
Explore the WordNet hierarchy ImageNet was built on
Filter by super-category. Hover for labels.
ImageNet's real scale: ~14M images, ~22,000 synsets (WordNet categories). Each cell here is a schematic representation — hover to see how the labels cluster by super-category.
"We were looking for a North Star. Without a benchmark that everyone agreed on, every team's results were noise. ImageNet gave the field a shared language for progress." — paraphrased from Fei-Fei Li's reflections on the challenge's early years
§ IV · 2012 — 2016

A Cambrian explosion

Once AlexNet proved that deep networks could win at scale, the architectural experiments arrived in waves. In four years the field produced image recognition that surpassed human-level accuracy on ImageNet, word embeddings that captured semantic analogies, generative adversarial networks that could hallucinate faces, residual networks more than a hundred layers deep, and a Go-playing system that beat a human world champion. For anyone paying attention, it no longer looked like progress. It looked like a phase change.

2012
AlexNet wins ImageNet
Krizhevsky · Sutskever · Hinton
A deep convolutional network, ReLU activations, dropout, and two GTX 580 GPUs. 15.3% top-5 error when the runner-up was at 26.2%. The paper became one of the most-cited in modern AI. Few dissertation committees would have let a graduate student bet on it.
2013
Word2Vec
Mikolov et al. · Google
Dense vector embeddings learned from text co-occurrence. Famously: king − man + woman ≈ queen. Language, arithmetically. The first hint that semantic structure was latent in raw text at scale. (A sketch of this arithmetic follows the timeline below.)
2014
Generative Adversarial Networks
Ian Goodfellow et al.
Two networks — a generator producing fakes and a discriminator trying to catch them — trained against each other. The idea reportedly came to Goodfellow at a Montreal pub. Within a few years, GAN-generated faces were indistinguishable from photographs.
2014
Seq2Seq & attention
Sutskever et al. · Bahdanau et al.
Encoder-decoder networks enabled end-to-end neural machine translation. Bahdanau's attention mechanism let the decoder look back at relevant parts of the source — the architectural seed that would later flower into Transformers.
2015
ResNet
He · Zhang · Ren · Sun · Microsoft
Residual connections — letting gradients skip layers — made networks of 152 layers trainable. That year, ResNet surpassed human-level accuracy on ImageNet. The depth limit, essentially, dissolved.
2016
AlphaGo beats Lee Sedol
DeepMind
Deep reinforcement learning combined with Monte Carlo tree search won 4–1 against one of the strongest Go players in history. Move 37 of Game 2 was so unexpected that commentators assumed it was a bug. It was strategy no human had seen.
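The embedding arithmetic in the Word2Vec entry above comes down to vector addition followed by a nearest-neighbor search. A minimal sketch with tiny hand-made vectors; real Word2Vec embeddings are learned from co-occurrence over billions of words and are hundreds of dimensions wide.

```python
import numpy as np

# Toy, hand-made "embeddings" on three interpretable axes:
# [royalty, maleness, femaleness]. Purely illustrative, not learned vectors.
vocab = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "prince": np.array([0.7, 0.8, 0.1]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "apple":  np.array([0.0, 0.1, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(query, exclude):
    """Most cosine-similar vocabulary word, excluding the analogy's own terms."""
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cosine(vocab[w], query))

# king - man + woman ≈ ?
analogy = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))   # -> queen
```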
Interactive · GAN
Generator vs Discriminator — the adversarial game
Press play. Watch the fakes improve.

Generator G

fake · step 0

Discriminator D

real · D(fake) = 0.50
G is losing. D catches it 95% of the time.
A simplified GAN game. The real target is a gradient-textured image; the generator starts from noise and updates slightly toward it while the discriminator updates away from it. The two losses seek an uneasy equilibrium.
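To make the tug-of-war concrete, here is the adversarial objective reduced to one dimension: a scalar real sample, a scalar fake, and a logistic discriminator. Starting values and the learning rate are arbitrary; the snippet runs a single round of updates to show which way each player's gradient pushes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

real, g = 1.0, -1.0              # real sample vs. the generator's current fake
w, b = 0.5, 0.0                  # discriminator parameters: D(x) = sigmoid(w*x + b)
lr = 0.1

# Discriminator step: ascend log D(real) + log(1 - D(g))   (catch the fake)
d_real, d_fake = sigmoid(w * real + b), sigmoid(w * g + b)
w += lr * ((1 - d_real) * real - d_fake * g)
b += lr * ((1 - d_real) - d_fake)

# Generator step: ascend log D(g)   (move g toward whatever D scores as real)
d_fake = sigmoid(w * g + b)
g += lr * (1 - d_fake) * w

print(f"after one round: g = {g:.3f} (moved toward {real}), "
      f"D(fake) = {sigmoid(w * g + b):.3f}")
```

A full training run alternates these two steps many times; as the fake approaches the real data the discriminator loses its grip and D(fake) drifts toward 0.5, the uneasy equilibrium the caption describes.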
§ V · 2017 — 2021

Attention is all you need

In June 2017, eight researchers at Google Brain and Google Research published a 15-page paper titled with unusual confidence. It proposed replacing recurrent and convolutional structures in sequence modeling with a single mechanism: scaled dot-product attention, stacked in parallel. The architecture was called the Transformer. Within three years it had eaten machine translation, language modeling, image classification, and protein structure prediction.

2017
Attention Is All You Need
Vaswani · Shazeer · Parmar · Uszkoreit · Jones · Gomez · Kaiser · Polosukhin
The Transformer architecture. Self-attention layers let every token query every other token directly, removing the sequential bottleneck of RNNs. Parallelizable, scalable, and — it turned out — almost embarrassingly general.
2018
BERT & GPT-1
Google · OpenAI
Two divergent bets on transformers for language. BERT (bidirectional, fill-in-the-blank) and GPT-1 (unidirectional, next-token). Pretrain on large corpora, fine-tune for tasks. The era of pretrained foundation models began here.
2020
GPT-3 · Scaling laws
OpenAI · Kaplan et al.
A 175-billion parameter transformer trained on hundreds of billions of tokens. In-context learning — solving new tasks from examples in the prompt — emerged at scale without explicit fine-tuning. The Kaplan scaling laws made the relationship between compute, data, and loss legible. (A sketch of the power-law form follows this timeline.)
2020
AlphaFold 2
DeepMind
A transformer-based system solved protein structure prediction to near-experimental accuracy at CASP14. A 50-year open problem in biology, resolved to the point that scientists debated whether the problem was still open. Deep learning, outside of vision and language, no longer looked exotic.
2021
CLIP & DALL·E
OpenAI
Joint training on 400 million image-text pairs produced a model — CLIP — that could evaluate image-caption alignment. Paired with generative models, it unlocked text-to-image generation. The generative frontier quietly shifted out of the research lab.
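The scaling-law entry above refers to a simple functional form. As a rough sketch, using the approximate constants reported in Kaplan et al. (2020) and ignoring data and compute bottlenecks, loss falls as a power of parameter count:

```python
ALPHA_N = 0.076      # approximate exponent for parameter count (Kaplan et al., 2020)
N_C = 8.8e13         # approximate scale constant, in non-embedding parameters

def predicted_loss(n_params):
    """L(N) ≈ (N_C / N) ** ALPHA_N, valid only when data and compute are ample."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1e9, 1e10, 1e11, 1.75e11]:
    print(f"N = {n:9.2e} params  ->  predicted loss ≈ {predicted_loss(n):.2f} nats/token")
```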
Interactive · Transformer
Self-attention — which words does each word look at?
Click a token. See where attention lands.
CLICK A TOKEN (query)
ATTENTION WEIGHTS (softmax over keys)
Simulated attention weights, derived from simple syntactic and co-occurrence heuristics for illustration. In a real transformer, these weights are learned by gradient descent across billions of tokens. Every layer computes multiple heads in parallel; the model stacks dozens of these.
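The computation behind the widget is compact enough to write out. Below is a minimal sketch of single-head scaled dot-product attention with random placeholder embeddings and projections; in a trained Transformer these matrices are learned, and many heads run in parallel at every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
d_model, d_k = 16, 8

# Placeholder embeddings and projection matrices (random, not learned).
E   = 0.5 * rng.normal(size=(len(tokens), d_model))
W_q = 0.3 * rng.normal(size=(d_model, d_k))
W_k = 0.3 * rng.normal(size=(d_model, d_k))
W_v = 0.3 * rng.normal(size=(d_model, d_k))

Q, K, V = E @ W_q, E @ W_k, E @ W_v
scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
scores -= scores.max(axis=1, keepdims=True)           # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax rows
output = weights @ V                                   # each token: weighted mix of values

query = 2                                              # the token "sat"
for tok, wgt in zip(tokens, weights[query]):
    print(f"{tok:>4}  attention weight: {wgt:.2f}")
```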
§ VI · 2022 — 2024

The generative flood

For most of AI's history, the work happened in labs and conference proceedings. Between November 2022 and the end of 2024, it moved into phones, classrooms, writing workflows, student desks, and district IT tickets. The technical advances were real — diffusion models, RLHF, multimodality — but the deeper shift was cultural. AI stopped being a topic teachers taught about and became a material they negotiated with.

2022
Stable Diffusion · Latent diffusion
Rombach et al. · Stability AI · LMU Munich
Open-weight text-to-image generation runnable on a consumer GPU. Diffusion models — iteratively denoising random noise toward a prompt-conditioned target — displaced GANs as the dominant approach to image synthesis. (A toy sketch of the denoising loop follows this timeline.)
2022
ChatGPT · 30 November
OpenAI
GPT-3.5 with RLHF (reinforcement learning from human feedback) behind a chat interface. 100 million users in two months. Nearly every K-12 district in the country heard the word "AI" from a board member within weeks.
2023
GPT-4, Claude, Llama, Gemini
Multiple labs
Frontier models at multiple organizations. Multimodal input (images, then audio). Open-weight models — Meta's Llama 2 in particular — put serious capability into local deployment. The labor market began to feel it.
2024
Video generation, long context
OpenAI · Google · Runway · Anthropic
Sora, Veo, and others produced minute-long photorealistic video from text. Context windows expanded past a million tokens. The question shifted from "can the model do it?" to "should the classroom, or the district, let it?"
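The diffusion entry above can be caricatured in a dozen lines. The sketch below stands in for a learned denoiser with a function that simply nudges a noisy vector toward a fixed target; an actual latent diffusion model learns that denoising step from data and conditions it on the text prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -0.5, 0.25, 0.0])   # stand-in for a "clean" image (4 pixels)
steps = 50

def fake_denoiser(x):
    """Placeholder for a learned noise predictor: move a fixed fraction of the
    way toward the clean target. A real model learns this mapping from data."""
    return x + 0.1 * (target - x)

x = rng.normal(size=target.shape)            # generation starts from pure noise
for t in range(steps):
    x = fake_denoiser(x)
    # a little extra noise is re-injected, less and less as the steps proceed
    x += 0.05 * (1 - t / steps) * rng.normal(size=x.shape)

print("sample after denoising:", np.round(x, 2))
print("clean target:          ", target)
```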
"The fluency of the output is the problem. The output looks like learning whether or not learning has occurred." — a framing central to pedagogical friction as a construct
§ VII · 2025 — 2026

The agentic turn

Chatbots answer questions. Agents take actions. The distinction is small in architecture and very large in consequence. Between 2024 and 2026, the frontier shifted from models that produce text for a human to read to models that produce plans, call tools, execute code, browse the web, and decide what to do next. Governance frameworks — including the one the CoSN Agentic AI Subcommittee has been drafting — are still catching up.

2024
Reasoning models
OpenAI · DeepSeek · others
Models trained to produce extended internal deliberation before answering. Performance on formal mathematics, competitive coding, and scientific problems jumped sharply. Inference-time compute became a first-class variable alongside parameter count.
2025
Agentic tool use matures
Anthropic · OpenAI · Google
Computer-use APIs, browser agents, and code-execution environments ship behind frontier models. Multi-step tasks — research, email triage, spreadsheet manipulation — move into model scope. MCP (Model Context Protocol) emerges as the integration standard.
2026
The present moment
You, reading this
Districts are drafting their first agentic AI policies. The question is no longer what can the model do but what should a student, teacher, or principal authorize it to do on their behalf — and whether the resulting output reflects learning, compliance, or what might be called unproductive success.

What the long arc does not show

A timeline like this inevitably rewards the architectures and the benchmarks. It does not reward the people who refused to run an experiment, the datasets that were built on labor that was under-compensated or undisclosed, the languages and communities whose texts were scraped without consent, or the knowledge traditions that do not fit cleanly into "classification error." A full history would include all of this. The short version for educators: the technical arc is real, and it is incomplete. The work of deciding what it means for schools is not a technical question.