An Interactive Companion · Qualifying Paper Supplement

The Long Arc
of Machine
Intelligence

From Alan Turing's 1950 thought experiment to the agentic systems of today — seven decades in which the question "can machines think?" quietly became the question "what should we let them do?"

Structured for educators, researchers, and curious readers.  ·  Interactive throughout. Click, drag, type.
Chronological Index
1950 · 1970 · 1990 · 2010 · 2020 · 2026
§ I · 1950 — 2005

Foundations in silence

Before the past twenty years, there were fifty more. The modern deep learning boom did not appear from nothing — it grew out of a long, often discouraging arc of theorems, hardware, and the stubbornness of a small group of researchers who kept building neural networks while most of the field thought they were a dead end. To understand ImageNet, GANs, and Transformers, you first have to understand what had to go right for them to even be possible.

1950
Computing Machinery and Intelligence
Alan Turing · Mind journal
Turing proposed what is now called the Turing Test — replacing the metaphysical question "can machines think?" with an operational one about imitation. The paper framed intelligence as behavior, a decision whose consequences still reverberate.
1956
The Dartmouth Workshop
McCarthy · Minsky · Rochester · Shannon
Two months at Dartmouth College where the term "artificial intelligence" was coined. The proposal claimed the problem could largely be solved in a summer. It could not.
1958
The Perceptron
Frank Rosenblatt · Cornell
The first trainable artificial neuron — a machine that could learn a decision boundary from examples. The New York Times reported Navy expectations that it would soon "walk, talk, see, write, reproduce itself."
1969
Perceptrons (the book)
Marvin Minsky · Seymour Papert
A mathematical critique showing single-layer perceptrons could not learn XOR. Interpreted more broadly than warranted, it contributed to the first "AI winter" — a funding and interest collapse for neural approaches.
1986
Backpropagation, revived
Rumelhart · Hinton · Williams
The backpropagation algorithm for training multi-layer networks was popularized. The math had existed earlier in control theory; this paper made it usable. Multi-layer networks could now learn representations — in principle.
1989–98
Convolutional networks & LeNet
Yann LeCun · Bell Labs
LeCun's LeNet-5 recognized handwritten digits well enough to read U.S. postal ZIP codes and bank checks. Convolutions, weight sharing, pooling — the architectural ideas that would later underwrite AlexNet were already here.
1997
Deep Blue & LSTMs
IBM · Hochreiter & Schmidhuber
Deep Blue beat Kasparov through brute search — a symbolic victory, not a learning one. The same year, Long Short-Term Memory networks were published, quietly solving the vanishing gradient problem that had crippled recurrent networks.
1995–2005
The statistical interregnum
Vapnik · Breiman · Koller et al.
Support Vector Machines, Random Forests, and graphical models dominated industry. Neural networks were widely considered unfashionable; by many accounts, papers about them struggled to get past review at major ML conferences.
Interactive · Foundations
Rosenblatt's Perceptron — why a single layer cannot learn XOR
Drag the line. Pick a pattern.
Sliders: w₁ = 1.00 · w₂ = 1.00 · b = −1.00
A perceptron outputs 1 when w₁·x₁ + w₂·x₂ + b > 0. Classes must be separable by a single line. XOR is not.
Status: drag sliders to adjust the decision boundary. Green dots should end up on one side, red on the other.
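To see the same limitation outside the widget, here is a minimal sketch in Python (learning rate and epoch budget are arbitrary placeholder choices): the classic perceptron learning rule converges on AND, which a line can separate, and never converges on XOR, which no line can.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    """Classic perceptron rule: nudge the weights whenever a point is misclassified."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            if pred != target:
                w += lr * (target - pred) * xi
                b += lr * (target - pred)
                errors += 1
        if errors == 0:              # a separating line has been found
            return w, b, True
    return w, b, False               # no separating line found within the budget

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])       # linearly separable
y_xor = np.array([0, 1, 1, 0])       # not linearly separable

for name, y in [("AND", y_and), ("XOR", y_xor)]:
    w, b, converged = train_perceptron(X, y)
    print(f"{name}: converged={converged}, w={w}, b={b:.2f}")
```

A hidden layer removes the limitation, which is where the multilayer demo in the next section picks up.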
§ II · 2006 — 2011

The awakening

Three conditions converged in the second half of the 2000s: a quiet algorithmic comeback for deep networks, the arrival of commodity GPUs originally built for video games, and — critically — data at a scale no previous generation of researchers had access to. The "deep learning" brand took hold; most of the field did not yet believe it.

2006
Deep Belief Networks
Hinton · Osindero · Teh
Geoffrey Hinton's group showed that layer-by-layer unsupervised pre-training could initialize deep networks that previously would not train. The term "deep learning" began its public career.
2007
CUDA & GPU compute
NVIDIA
NVIDIA released CUDA, making GPUs general-purpose compute devices. The same silicon sold to gamers for rendering video games would soon run matrix multiplications for neural networks at unprecedented speed.
2009
ImageNet is released
Fei-Fei Li · Princeton / Stanford
A labeled dataset of 14+ million images across 22,000 categories, organized on the WordNet hierarchy. Built by crowdsourcing labels through Amazon Mechanical Turk. Most of the community saw it as an engineering curiosity, not a research catalyst.
2011
IBM Watson wins Jeopardy!
IBM Research
A pipeline of hundreds of NLP components beat champions Ken Jennings and Brad Rutter. Not deep learning — but a public event that signaled that "AI" could do something that felt genuinely surprising.
"We decided we wanted to do something that was completely historically unprecedented. We wanted to map out the entire world of objects." — Fei-Fei Li, on ImageNet, 2009
Interactive · Neural Network
A multilayer network recognizes a 4×4 pattern
Click pixels. Watch activations flow.
INPUT · 16 pixels
click to toggle
Output: waiting…
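Behind the animation is nothing more exotic than two matrix multiplications and a nonlinearity. Here is a minimal sketch of the forward pass the widget visualizes, with random placeholder weights and layer sizes rather than the demo's trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = 0.5 * rng.normal(size=(16, 8)), np.zeros(8)   # 16 pixels -> 8 hidden units
W2, b2 = 0.5 * rng.normal(size=(8, 2)), np.zeros(2)    # 8 hidden units -> 2 classes

def forward(pixels):
    """pixels: length-16 vector of 0/1 values read off the 4x4 grid."""
    hidden = np.maximum(0.0, pixels @ W1 + b1)          # ReLU activations
    logits = hidden @ W2 + b2
    probs = np.exp(logits) / np.exp(logits).sum()       # softmax over the two classes
    return hidden, probs

pattern = np.array([1, 0, 0, 1,
                    0, 1, 1, 0,
                    0, 1, 1, 0,
                    1, 0, 0, 1], dtype=float)           # an "X"-shaped pattern
hidden, probs = forward(pattern)
print("hidden activations:", np.round(hidden, 2))
print("class probabilities:", np.round(probs, 2))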
§ III · 2009 — 2012 · Deep Dive

ImageNet & Fei-Fei Li

FEI-FEI LI · 2009
Founder, ImageNet

Fei-Fei Li

In 2007, as an assistant professor at Princeton, Fei-Fei Li made a bet that cut against prevailing wisdom in computer vision. The field was obsessed with algorithms; she argued the bottleneck was data. She proposed mapping the entire WordNet noun hierarchy to images — tens of thousands of categories, millions of examples, all human-labeled.

Grant reviewers told her the project was "not a good idea for a junior professor." She built it anyway, at first with undergraduate labelers, then at scale through Amazon Mechanical Turk — reportedly drawing on labelers from 167 countries at one point. ImageNet launched in 2009 and the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) began in 2010. For two years, error rates barely moved. Then came 2012.

The benchmark that made the field turn

Consider what the ImageNet challenge measured: top-5 classification error on 1,000 object categories drawn from 1.2 million training images. In 2010 the winning system — a careful ensemble of hand-crafted SIFT features and Fisher vectors — got 28% of images wrong. In 2011, a similar approach got 26% wrong. Incremental progress. Respectable work. Then in 2012, a submission from the University of Toronto called AlexNet — a deep convolutional network trained on two consumer GPUs — came in at 15.3% error. The next best entry was at 26.2%. It was not an improvement. It was a discontinuity.

ILSVRC top-5 error rate, winning entry · lower is better · human performance ≈ 5.1%
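For concreteness, the metric the chart tracks can be computed in a few lines. The scores below are toy random values, not ILSVRC submissions; the sketch only shows how top-5 error is counted.

```python
import numpy as np

def top5_error(scores, labels):
    """scores: (n_images, n_classes) model scores; labels: (n_images,) true class ids."""
    top5 = np.argsort(scores, axis=1)[:, -5:]            # five highest-scoring classes
    hit = np.any(top5 == labels[:, None], axis=1)        # true label anywhere in the five?
    return 1.0 - hit.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 1000))                   # 1,000 images, 1,000 classes
labels = rng.integers(0, 1000, size=1000)
print(f"top-5 error of a random guesser: {top5_error(scores, labels):.1%}")   # ~99.5%
```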
Interactive · ImageNet
Explore the WordNet hierarchy ImageNet was built on
Filter by super-category. Hover for labels.
ImageNet's real scale: ~14M images, ~22,000 synsets (WordNet categories). Each cell here is a schematic representation — hover to see how the labels cluster by super-category.
"We were looking for a North Star. Without a benchmark that everyone agreed on, every team's results were noise. ImageNet gave the field a shared language for progress." — paraphrased from Fei-Fei Li's reflections on the challenge's early years
§ IV · 2012 — 2016

A Cambrian explosion

Once AlexNet proved that deep networks could win at scale, the architectural experiments arrived in waves. In four years the field produced image recognition that surpassed human-level accuracy on ImageNet, word embeddings that captured semantic analogies, generative adversarial networks that could hallucinate faces, residual networks more than a hundred layers deep, and a Go-playing system that beat a human world champion. For anyone paying attention, it no longer looked like progress. It looked like a phase change.

2012
AlexNet wins ImageNet
Krizhevsky · Sutskever · Hinton
A deep convolutional network, ReLU activations, dropout, and two GTX 580 GPUs. 15.3% top-5 error when the runner-up was at 26.2%. The paper became one of the most-cited in modern AI. Few dissertation committees would have let a graduate student bet on it.
2013
Word2Vec
Mikolov et al. · Google
Dense vector embeddings learned from text co-occurrence. Famously: king − man + woman ≈ queen. Language, arithmetically. The first hint that semantic structure was latent in raw text at scale. (A sketch of this arithmetic follows the timeline below.)
2014
Generative Adversarial Networks
Ian Goodfellow et al.
Two networks — a generator producing fakes and a discriminator trying to catch them — trained against each other. The idea reportedly came to Goodfellow at a Montreal pub. Within a few years, GAN-generated faces were indistinguishable from photographs.
2014
Seq2Seq & attention
Sutskever et al. · Bahdanau et al.
Encoder-decoder networks enabled end-to-end neural machine translation. Bahdanau's attention mechanism let the decoder look back at relevant parts of the source — the architectural seed that would later flower into Transformers.
2015
ResNet
He · Zhang · Ren · Sun · Microsoft
Residual connections — letting gradients skip layers — made networks of 152 layers trainable. That year, ResNet surpassed human-level accuracy on ImageNet. The depth limit, essentially, dissolved.
2016
AlphaGo beats Lee Sedol
DeepMind
Deep reinforcement learning combined with Monte Carlo tree search won 4–1 against one of the strongest Go players in history. Move 37 of Game 2 was so unexpected that commentators assumed it was a bug. It was strategy no human had seen.
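The embedding arithmetic in the Word2Vec entry above comes down to vector addition followed by a nearest-neighbor search. A minimal sketch with tiny hand-made vectors; real Word2Vec embeddings are learned from co-occurrence over billions of words and are hundreds of dimensions wide.

```python
import numpy as np

# Toy, hand-made "embeddings" on three interpretable axes:
# [royalty, maleness, femaleness]. Purely illustrative, not learned vectors.
vocab = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "prince": np.array([0.7, 0.8, 0.1]),
    "man":    np.array([0.1, 0.9, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9]),
    "apple":  np.array([0.0, 0.1, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(query, exclude):
    """Most cosine-similar vocabulary word, excluding the analogy's own terms."""
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cosine(vocab[w], query))

# king - man + woman ≈ ?
analogy = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))   # -> queen
```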
Interactive · GAN
Generator vs Discriminator — the adversarial game
Press play. Watch the fakes improve.

Generator G

fake · step 0

Discriminator D

real · D(fake) = 0.50
G is losing. D catches it 95% of the time.
A simplified GAN game. The real target is a gradient-textured image; the generator starts from noise and updates slightly toward it while the discriminator updates away from it. The two losses seek an uneasy equilibrium.
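To make the tug-of-war concrete, here is the adversarial objective reduced to one dimension: a scalar real sample, a scalar fake, and a logistic discriminator. Starting values and the learning rate are arbitrary; the snippet runs a single round of updates to show which way each player's gradient pushes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

real, g = 1.0, -1.0              # real sample vs. the generator's current fake
w, b = 0.5, 0.0                  # discriminator parameters: D(x) = sigmoid(w*x + b)
lr = 0.1

# Discriminator step: ascend log D(real) + log(1 - D(g))   (catch the fake)
d_real, d_fake = sigmoid(w * real + b), sigmoid(w * g + b)
w += lr * ((1 - d_real) * real - d_fake * g)
b += lr * ((1 - d_real) - d_fake)

# Generator step: ascend log D(g)   (move g toward whatever D scores as real)
d_fake = sigmoid(w * g + b)
g += lr * (1 - d_fake) * w

print(f"after one round: g = {g:.3f} (moved toward {real}), "
      f"D(fake) = {sigmoid(w * g + b):.3f}")
```

A full training run alternates these two steps many times; as the fake approaches the real data the discriminator loses its grip and D(fake) drifts toward 0.5, the uneasy equilibrium the caption describes.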
§ V · 2017 — 2021

Attention is all you need

In June 2017, eight researchers at Google Brain and Google Research published a 15-page paper titled with unusual confidence. It proposed replacing recurrent and convolutional structures in sequence modeling with a single mechanism: scaled dot-product attention, stacked in parallel. The architecture was called the Transformer. Within three years it had eaten machine translation, language modeling, image classification, and protein structure prediction.

2017
Attention Is All You Need
Vaswani · Shazeer · Parmar · Uszkoreit · Jones · Gomez · Kaiser · Polosukhin
The Transformer architecture. Self-attention layers let every token query every other token directly, removing the sequential bottleneck of RNNs. Parallelizable, scalable, and — it turned out — almost embarrassingly general.
2018
BERT & GPT-1
Google · OpenAI
Two divergent bets on transformers for language. BERT (bidirectional, fill-in-the-blank) and GPT-1 (unidirectional, next-token). Pretrain on large corpora, fine-tune for tasks. The era of pretrained foundation models began here.
2020
GPT-3 · Scaling laws
OpenAI · Kaplan et al.
A 175-billion parameter transformer trained on hundreds of billions of tokens. In-context learning — solving new tasks from examples in the prompt — emerged at scale without explicit fine-tuning. The Kaplan scaling laws made the relationship between compute, data, and loss legible. (A sketch of the power-law form follows this timeline.)
2020
AlphaFold 2
DeepMind
A transformer-based system solved protein structure prediction to near-experimental accuracy at CASP14. A 50-year open problem in biology, resolved to the point that scientists debated whether the problem was still open. Deep learning, outside of vision and language, no longer looked exotic.
2021
CLIP & DALL·E
OpenAI
Joint training on 400 million image-text pairs produced a model — CLIP — that could evaluate image-caption alignment. Paired with generative models, it unlocked text-to-image generation. The generative frontier quietly shifted out of the research lab.
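The scaling-law entry above refers to a simple functional form. As a rough sketch, using the approximate constants reported in Kaplan et al. (2020) and ignoring data and compute bottlenecks, loss falls as a power of parameter count:

```python
ALPHA_N = 0.076      # approximate exponent for parameter count (Kaplan et al., 2020)
N_C = 8.8e13         # approximate scale constant, in non-embedding parameters

def predicted_loss(n_params):
    """L(N) ≈ (N_C / N) ** ALPHA_N, valid only when data and compute are ample."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1e9, 1e10, 1e11, 1.75e11]:
    print(f"N = {n:9.2e} params  ->  predicted loss ≈ {predicted_loss(n):.2f} nats/token")
```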
Interactive · Transformer
Self-attention — which words does each word look at?
Click a token. See where attention lands.
CLICK A TOKEN (query)
ATTENTION WEIGHTS (softmax over keys)
Simulated attention weights, derived from simple syntactic and co-occurrence heuristics for illustration. In a real transformer, these weights are learned by gradient descent across billions of tokens. Every layer computes multiple heads in parallel; the model stacks dozens of these.
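The computation behind the widget is compact enough to write out. Below is a minimal sketch of single-head scaled dot-product attention with random placeholder embeddings and projections; in a trained Transformer these matrices are learned, and many heads run in parallel at every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
d_model, d_k = 16, 8

# Placeholder embeddings and projection matrices (random, not learned).
E   = 0.5 * rng.normal(size=(len(tokens), d_model))
W_q = 0.3 * rng.normal(size=(d_model, d_k))
W_k = 0.3 * rng.normal(size=(d_model, d_k))
W_v = 0.3 * rng.normal(size=(d_model, d_k))

Q, K, V = E @ W_q, E @ W_k, E @ W_v
scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
scores -= scores.max(axis=1, keepdims=True)           # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax rows
output = weights @ V                                   # each token: weighted mix of values

query = 2                                              # the token "sat"
for tok, wgt in zip(tokens, weights[query]):
    print(f"{tok:>4}  attention weight: {wgt:.2f}")
```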
§ VI · 2022 — 2024

The generative flood

For most of AI's history, the work happened in labs and conference proceedings. Between November 2022 and the end of 2024, it moved into phones, classrooms, writing workflows, student desks, and district IT tickets. The technical advances were real — diffusion models, RLHF, multimodality — but the deeper shift was cultural. AI stopped being a topic teachers taught about and became a material they negotiated with.

2022
Stable Diffusion · Latent diffusion
Rombach et al. · Stability AI · LMU Munich
Open-weight text-to-image generation runnable on a consumer GPU. Diffusion models — iteratively denoising random noise toward a prompt-conditioned target — displaced GANs as the dominant approach to image synthesis. (A toy sketch of the denoising loop follows this timeline.)
2022
ChatGPT · 30 November
OpenAI
GPT-3.5 with RLHF (reinforcement learning from human feedback) behind a chat interface. 100 million users in two months. Nearly every K-12 district in the country heard the word "AI" from a board member within weeks.
2023
GPT-4, Claude, Llama, Gemini
Multiple labs
Frontier models at multiple organizations. Multimodal input (images, then audio). Open-weight models — Meta's Llama 2 in particular — put serious capability into local deployment. The labor market began to feel it.
2024
Video generation, long context
OpenAI · Google · Runway · Anthropic
Sora, Veo, and others produced minute-long photorealistic video from text. Context windows expanded past a million tokens. The question shifted from "can the model do it?" to "should the classroom, or the district, let it?"
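The diffusion entry above can be caricatured in a dozen lines. The sketch below stands in for a learned denoiser with a function that simply nudges a noisy vector toward a fixed target; an actual latent diffusion model learns that denoising step from data and conditions it on the text prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -0.5, 0.25, 0.0])   # stand-in for a "clean" image (4 pixels)
steps = 50

def fake_denoiser(x):
    """Placeholder for a learned noise predictor: move a fixed fraction of the
    way toward the clean target. A real model learns this mapping from data."""
    return x + 0.1 * (target - x)

x = rng.normal(size=target.shape)            # generation starts from pure noise
for t in range(steps):
    x = fake_denoiser(x)
    # a little extra noise is re-injected, less and less as the steps proceed
    x += 0.05 * (1 - t / steps) * rng.normal(size=x.shape)

print("sample after denoising:", np.round(x, 2))
print("clean target:          ", target)
```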
"The fluency of the output is the problem. The output looks like learning whether or not learning has occurred." — a framing central to pedagogical friction as a construct
§ VII · 2025 — 2026

The agentic turn

Chatbots answer questions. Agents take actions. The distinction is small in architecture and very large in consequence. Between 2024 and 2026, the frontier shifted from models that produce text for a human to read to models that produce plans, call tools, execute code, browse the web, and decide what to do next. Governance frameworks — including the one the CoSN Agentic AI Subcommittee has been drafting — are still catching up.

2024
Reasoning models
OpenAI · DeepSeek · others
Models trained to produce extended internal deliberation before answering. Performance on formal mathematics, competitive coding, and scientific problems jumped sharply. Inference-time compute became a first-class variable alongside parameter count.
2025
Agentic tool use matures
Anthropic · OpenAI · Google
Computer-use APIs, browser agents, and code-execution environments ship behind frontier models. Multi-step tasks — research, email triage, spreadsheet manipulation — move into model scope. MCP (Model Context Protocol) emerges as the integration standard.
2026
The present moment
You, reading this
Districts are drafting their first agentic AI policies. The question is no longer what can the model do but what should a student, teacher, or principal authorize it to do on their behalf — and whether the resulting output reflects learning, compliance, or what might be called unproductive success.

What the long arc does not show

A timeline like this inevitably rewards the architectures and the benchmarks. It does not reward the people who refused to run an experiment, the datasets that were built on labor that was under-compensated or undisclosed, the languages and communities whose texts were scraped without consent, or the knowledge traditions that do not fit cleanly into "classification error." A full history would include all of this. The short version for educators: the technical arc is real, and it is incomplete. The work of deciding what it means for schools is not a technical question.