From Alan Turing's 1950 thought experiment to the agentic systems of today — seven decades in which the question "can machines think?" quietly became the question "what should we let them do?"
Before the past twenty years, there were fifty more. The modern deep learning boom did not appear from nothing — it grew out of a long, often discouraging arc of theorems, hardware, and the stubbornness of a small group of researchers who kept building neural networks while most of the field thought they were a dead end. To understand ImageNet, GANs, and Transformers, you first have to understand what had to go right for them to even be possible.
Three conditions converged in the second half of the 2000s: a quiet algorithmic comeback for deep networks, the arrival of commodity GPUs originally built for video games, and, critically, data at a scale no previous generation of researchers had access to. The "deep learning" label took hold in these years; most of the field remained unconvinced.
"We decided we wanted to do something that was completely historically unprecedented. We wanted to map out the entire world of objects." — Fei-Fei Li, on ImageNet, 2009
Consider what the ImageNet challenge measured: top-5 classification error on 1,000 object categories drawn from 1.2 million training images, where a prediction counts as correct if the true label appears anywhere among the model's five highest-ranked guesses. In 2010 the winning system, a careful ensemble of hand-crafted SIFT features and Fisher vectors, got 28% of images wrong. In 2011, a similar approach got 26% wrong. Incremental progress. Respectable work. Then in 2012, a submission from the University of Toronto called AlexNet, a deep convolutional network trained on two consumer GPUs, came in at 15.3% error. The next best entry was at 26.2%. It was not an improvement. It was a discontinuity.
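For readers who want the metric made concrete, here is a minimal sketch of how top-5 error can be computed. The function name and the toy data are illustrative, not ImageNet's actual evaluation code; all it assumes is a matrix of per-class scores and a vector of true labels.

```python
import numpy as np

def top5_error(scores, labels):
    """Fraction of examples whose true label is absent from the
    model's five highest-scoring guesses (the ImageNet-style metric)."""
    # Indices of the five largest scores per row (order within the five is irrelevant).
    top5 = np.argsort(scores, axis=1)[:, -5:]
    hits = (top5 == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy check: 4 examples, 10 classes, random scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))
labels = rng.integers(0, 10, size=4)
print(top5_error(scores, labels))
```

A 15.3% score on this metric means that for roughly one image in seven, the correct label did not appear even in the model's top five guesses.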
"We were looking for a North Star. Without a benchmark that everyone agreed on, every team's results were noise. ImageNet gave the field a shared language for progress." — paraphrased from Fei-Fei Li's reflections on the challenge's early years
Once AlexNet proved that deep networks could win at scale, the architectural experiments arrived in waves. In four years the field produced image recognition that beat the estimated human error rate on ImageNet, word embeddings that captured semantic analogies, generative adversarial networks that could hallucinate faces, networks deep enough to be called residual, and a Go-playing system that beat a human world champion. For anyone paying attention, it no longer looked like progress. It looked like a phase change.
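The "semantic analogies" claim is worth making concrete. The celebrated example is that the vector arithmetic king - man + woman lands nearest queen. A minimal sketch, using hand-made toy vectors rather than real learned embeddings:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-d vectors standing in for learned word embeddings.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.1, 0.8, 0.2]),
    "man":   np.array([0.1, 0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
}

# king - man + woman should land nearest queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # "queen" with these toy vectors
```

What startled researchers in 2013 was that real embeddings, trained only to predict neighboring words, exhibited this geometry without ever being told about gender or royalty.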
In June 2017, eight researchers at Google Brain and Google Research published a 15-page paper with an unusually confident title: "Attention Is All You Need." It proposed replacing recurrent and convolutional structures in sequence modeling with a single mechanism: scaled dot-product attention, run in parallel heads and stacked in layers. The architecture was called the Transformer. Within four years it had eaten machine translation, language modeling, image classification, and protein structure prediction.
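The mechanism itself is compact enough to state in a few lines. Here is a sketch of single-head, unmasked scaled dot-product attention in NumPy, following the paper's formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, and omitting the batching, masking, and multi-head projections of the full architecture:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each output row is a weighted
    mixture of value rows, weighted by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V

# Toy example: 3 tokens, 4-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```

Every token attends to every other token in one matrix multiplication, which is why the design parallelized so well on the GPUs that AlexNet had already made standard equipment.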
For most of AI's history, the work happened in labs and conference proceedings. Between November 2022 and the end of 2024, it moved into phones, classrooms, writing workflows, student desks, and district IT tickets. The technical advances were real — diffusion models, RLHF, multimodality — but the deeper shift was cultural. AI stopped being a topic teachers taught about and became a material they negotiated with.
"The fluency of the output is the problem. The output looks like learning whether or not learning has occurred." — a framing central to pedagogical friction as a construct
Chatbots answer questions. Agents take actions. The distinction is small in architecture and very large in consequence. Between 2024 and 2026, the frontier shifted from models that produce text for a human to read to models that produce plans, call tools, execute code, browse the web, and decide what to do next. Governance frameworks — including the one the CoSN Agentic AI Subcommittee has been drafting — are still catching up.
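To see why the distinction matters for governance, consider the shape of an agent in code. This is a deliberately simplified sketch; every name in it (run_agent, TOOLS, toy_model) is hypothetical, and real agent frameworks are far more elaborate. The structural point survives the simplification: the model's output is executed, not just read.

```python
TOOLS = {
    # Stub tool: a real agent might call a search API, a code runner, a browser.
    "search": lambda query: f"(stub) results for {query!r}",
}

def run_agent(goal, call_model, max_steps=5):
    """Loop: the model reads the history, chooses an action, the action
    runs, and the observation is fed back in for the next decision."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action, arg = call_model(history)   # e.g. ("search", "...") or ("finish", answer)
        if action == "finish":
            return arg                      # the agent decides it is done
        observation = TOOLS[action](arg)    # the agent acts on the world
        history.append(f"{action}({arg!r}) -> {observation}")
    return None  # step budget exhausted

# Toy stand-in for a language model: search once, then declare victory.
def toy_model(history):
    return ("search", "agentic AI") if len(history) == 1 else ("finish", "done")

print(run_agent("demo", toy_model))  # prints "done" after one tool call
```

Note where max_steps sits: in an agentic system, limits like that step budget are where policy meets code, which is exactly the layer governance frameworks are now trying to reach.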
A timeline like this inevitably rewards the architectures and the benchmarks. It does not reward the people who refused to run an experiment, the datasets that were built on labor that was under-compensated or undisclosed, the languages and communities whose texts were scraped without consent, or the knowledge traditions that do not fit cleanly into "classification error." A full history would include all of this. The short version for educators: the technical arc is real, and it is incomplete. The work of deciding what it means for schools is not a technical question.