Uncomfortable Lessons Learned from AI Experimentation in Contracts Law

The headline finding of “Law Professors Prefer AI Over Peer Answers” is almost too neat. When law professors were asked to choose between short answers written by other law professors and short answers written by AI, they usually chose the AI. The paper, co-authored by 21 researchers and dated May 27, 2026, reports a blind study in which 16 U.S. contracts professors generated student-style questions, wrote short answers, and then judged anonymized answer pairs. Across 2,918 comparisons, professors preferred LLM responses over human professor responses roughly three-quarters of the time.

That is a genuinely provocative result. Not because it proves that AI “understands law,” or that law professors are obsolete, or even that students should outsource doctrinal learning to chatbots. It does not prove any of those things. What it shows is narrower, but still important. In the highly practical genre of the law professor’s office-hours explanation, current AI systems can produce reasoned answers that expert legal educators often prefer to peer-written alternatives.

The study’s design matters. This was not a multiple-choice benchmark, a bar-exam test, or a hallucination hunt. The professors all taught contracts from the same casebook. They created questions of the kind first-year students ask after class or during office hours. The researchers curated forty questions across four categories: case/code recall, doctrinal recall, hypotheticals, and policy. That mix included both questions with something close to a “right answer” and questions that require legal judgment. Depending on the question, professors and AI had to apply doctrine to new facts, weigh ambiguity, or explain policy trade-offs.

The human side of the experiment was also intentionally realistic. Professors were asked to answer questions they had not authored, to do so briefly, and generally not to conduct additional research. This was intended to simulate a real office-hours interaction in which a student asks a question, and the professor gives a concise explanation on the spot. The AI systems answered the same forty questions.

The two human-evaluated models were Gemini 2.5 Pro and NotebookLM, with NotebookLM grounded in the shared casebook (NotebookLM is built to answer from user-supplied sources such as books and papers). Before judging, the answers were anonymized, lightly standardized, randomly positioned, and presented side-by-side. The professors chose which answer they would rather give to a student and could flag answers as harmful to learning.
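The blinding steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the "human" and "llm" labels, and the sample answer strings are all hypothetical.

```python
import random
from collections import Counter

def make_blind_pair(human_answer, llm_answer, rng):
    """Return two display slots in random order, plus a hidden key mapping slot -> source."""
    items = [("human", human_answer), ("llm", llm_answer)]
    rng.shuffle(items)  # random position so placement cannot reveal the source
    shown = {"A": items[0][1], "B": items[1][1]}
    key = {"A": items[0][0], "B": items[1][0]}
    return shown, key

def win_rates(judgments):
    """judgments: iterable of (key, chosen_slot) pairs; returns the share of wins per source."""
    wins = Counter(key[slot] for key, slot in judgments)
    total = sum(wins.values())
    return {source: wins[source] / total for source in ("human", "llm")}

rng = random.Random(0)  # seeded only so the illustration is repeatable
shown, key = make_blind_pair("Professor's answer text", "Model's answer text", rng)
# A judge sees only `shown`; `key` stays hidden until the votes are tallied.
```

Keeping the slot-to-source key separate from what the judge sees is the design choice that matters here: preferences are recorded against neutral slot labels and only mapped back to "human" or "llm" at tally time.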

The results are striking. Gemini 2.5 Pro won 75.92% of its comparisons against instructors, while NotebookLM won 74.75%. Individual instructor win rates against the two LLMs ranged from 2.96% to 51.15%, with the pooled instructor average at 24.67%. Even more telling, every judge preferred LLM answers over human answers overall. The lowest judge-level LLM preference rate was still 56%.
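As a quick consistency check on these figures, the arithmetic below relates the two model win rates to the pooled instructor average. It is a back-of-the-envelope sketch that rests on two assumptions not stated in this article: every comparison was a forced choice with no ties, and the two models appeared in roughly equal numbers of comparisons.

```python
gemini_win = 75.92      # % of comparisons Gemini 2.5 Pro won against instructors
notebooklm_win = 74.75  # % of comparisons NotebookLM won against instructors

# ASSUMPTION: forced choice with no ties, so the instructor's win rate
# is the complement of the model's win rate.
instructor_vs_gemini = 100 - gemini_win          # about 24.08
instructor_vs_notebooklm = 100 - notebooklm_win  # about 25.25

# ASSUMPTION: both models appeared in about the same number of comparisons,
# so a simple average approximates the pooled instructor rate.
pooled_instructor = (instructor_vs_gemini + instructor_vs_notebooklm) / 2
print(f"pooled instructor win rate: about {pooled_instructor:.2f}%")
```

Under those assumptions the result lands within a hundredth of a percentage point of the reported 24.67%, which suggests the headline numbers are internally consistent.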

The AI advantage did not disappear when the questions got more lawyerly. One easy explanation would be: “Of course the AI did well on recall questions; it is a memorization machine.” But the paper reports that the LLM advantage persisted across all four categories, including hypotheticals and policy questions. Gemini’s win rate ranged from 74.24% on hypotheticals to 77.17% on case/code recall; NotebookLM’s ranged from 72.69% on hypotheticals to 76.80% on case/code recall.

That is probably the paper’s deepest challenge to conventional legal-academic intuitions. Law professors often distinguish “knowing the rule” from “thinking like a lawyer.” This study suggests that, at least in short-form contracts pedagogy, the gap between those two activities may be smaller than we like to imagine, or that frontier LLMs are now good at imitating the second, not just the first.

The harmfulness results add another twist. Professors flagged Gemini answers as harmful 3.41% of the time and NotebookLM answers as harmful 3.64% of the time. Human instructor answers, by contrast, had an average harmfulness rate of 12.06%, with a range from 1.00% to 39.75%. The paper emphasizes that “harmful” here means pedagogically harmful, that is, likely to mislead or hinder learning, and not malpractice-grade danger. The outcomes cut against a common assumption that the human professor is the safe baseline and the AI is the risky deviation. In this setting, the model answers were not only preferred more often; they were also flagged as harmful less often.

The researchers also try to answer the “style over substance” objection. Maybe professors just liked the AI because it wrote smoothly, sounded organized, or had the familiar polished cadence of a chatbot. The researchers engineered textual features such as length, structure, legal anchors, confidence tone, clarity, and pedagogical support. Those features explained only part of the LLM advantage. The observed LLM win rates remained systematically higher than predicted by surface-level features alone. That does not prove deep legal understanding, but it weakens the simplest version of the “mere polish” critique.

Law Still Needs a Human in the Loop

One of the paper’s most interesting moves is its concept of a “shared professional standard.” In many legal questions, there is no single answer key. So how do we know whether an answer is good? The authors argue that expert agreement can itself reveal the discipline’s latent standard. Their analysis found that professors’ judgments converged more than would be expected if they were merely following private tastes plus a generic preference for AI. Agreement was evident across categories, especially for policy questions.

That claim of a latent standard is powerful, but it is also where the paper should make legal educators uneasy. Law’s “shared professional standard” is real. There are better and worse ways to frame ambiguity, apply doctrine, and teach students how to reason from precedent. But law also depends on disagreement.

Some of the best teaching happens when a professor refuses the consensus framing, stresses a neglected institutional value, or shows how a doctrine looks different from another theoretical angle. A model that reliably captures the profession’s middle may be excellent for clarification and dangerous for intellectual edge. The risk is not just hallucination; it is homogenization.

The retrieval result is another fascinating surprise. NotebookLM had access to the casebook, while stock Gemini did not. Yet stock Gemini outperformed NotebookLM in the main expert evaluation, and the paper’s LLM-as-judge extension found a similar pattern for a commercial AI tutor grounded in the casebook. The authors suggest several possible explanations:

Base models may already encode enough first-year contracts structure
Long-document retrieval may add noise
Platform layers designed for tutoring may sometimes blunt the strengths of the underlying model

That point has immediate design implications. “Ground it in the course materials” sounds like the obvious answer to legal-AI risk. The authors suggest that grounding is not magic. Bad retrieval can be worse than no retrieval at all. A strong legal tutor probably needs carefully chosen chunks, citations, scope limits, refusal behavior, and tests that show when retrieval improves answers rather than dilutes them. The authors recommend clear scope limits, refusal policies, deterministic decoding, visible citations to course materials, and escalation paths to instructors.
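Those recommendations translate fairly directly into a deployment policy. The sketch below is one hypothetical way to encode them, not a design taken from the paper: the `TutorPolicy` fields, the `route` function, the 0.7 confidence threshold, and the topic names are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TutorPolicy:
    allowed_topics: frozenset      # scope limit: topics the tutor may answer
    temperature: float = 0.0       # deterministic decoding, as far as the model allows
    require_citation: bool = True  # answers must cite course materials
    min_confidence: float = 0.7    # hypothetical threshold below which a human takes over

def route(policy, topic, confidence, citations):
    """Return "answer", "refuse", or "escalate" for one draft response."""
    if topic not in policy.allowed_topics:
        return "refuse"            # out of scope: decline rather than improvise
    if policy.require_citation and not citations:
        return "escalate"          # no course-material support: hand off to the instructor
    if confidence < policy.min_confidence:
        return "escalate"          # low confidence: hand off to the instructor
    return "answer"

policy = TutorPolicy(allowed_topics=frozenset({"consideration", "parol evidence"}))
print(route(policy, "consideration", 0.9, ["Casebook ch. 2"]))
```

The ordering of the checks is itself a design choice: scope is tested first so that out-of-course questions are refused outright, while in-scope questions that lack support or confidence go to a human instead of being dropped.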

The LLM-as-judge portion should be read with more caution. After validating Llama-4 Maverick against human majority judgments, the authors used it to rank additional systems across 42,652 comparisons. Claude Opus 4.7 ranked highest, followed by ChatGPT 5.4 and Gemini 2.5 Pro, with all evaluated AI models outperforming human instructors on average. But the authors acknowledge known LLM-judge problems, including position bias, verbosity bias, and in-family preference. The broad pattern may be meaningful; the precise leaderboard should not be overread.
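Position bias, at least, is straightforward to probe. One common mitigation, shown here as a general technique and not as the authors' method, is to ask the judge about each pair twice with the order swapped and keep only verdicts that survive the swap. The `Judge` type and both toy judges below are hypothetical stand-ins for a real model call.

```python
from typing import Callable, Optional

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]

def order_robust_verdict(judge: Judge, question: str,
                         answer_x: str, answer_y: str) -> Optional[str]:
    """Ask twice with the answers swapped; return "x", "y", or None if the verdict flips."""
    first_pass = judge(question, answer_x, answer_y)
    second_pass = judge(question, answer_y, answer_x)
    picked_x_first = first_pass == "first"     # x was shown first in pass one
    picked_x_second = second_pass == "second"  # x was shown second in pass two
    if picked_x_first and picked_x_second:
        return "x"
    if not picked_x_first and not picked_x_second:
        return "y"
    return None  # verdict depended on position, so treat it as unreliable

# Toy judges for illustration only (hypothetical, not real model calls).
def always_first(question, a, b):
    return "first"

def prefers_longer(question, a, b):
    return "first" if len(a) >= len(b) else "second"
```

A judge like `always_first` is caught immediately, because its verdict flips when the order does. A judge like `prefers_longer` passes the swap test every time the answers differ in length, which is a reminder that this check addresses position bias only; verbosity bias and in-family preference need separate controls.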

Results Are Far From Definitive

The study’s limitations are serious. Sixteen professors provide a respectable expert panel for a labor-intensive design, but not a census of legal education. The participant pool overrepresented professors at Top 14 schools, underrepresented women, and was more heavily tenured than the full invited group. The authors are explicit that the “shared professional standard” they estimate is the standard of this subset of contracts instructors, not necessarily the entire legal academy.

There is also a genre limitation. This was not a test of mentoring, classroom Socratic dialogue, exam feedback, professional formation, moral judgment, client counseling, or legal practice. It was a test of short written answers to student questions. In real life, a professor might ask a follow-up question, diagnose the student’s confusion, connect the answer to a prior class discussion, or intentionally withhold a clean answer to force better reasoning. The study captures a valuable slice of teaching, but only a slice.

Most importantly, the paper does not show that students learn more from AI. The authors say this directly. Their design evaluates answer quality (specifically, which answer an expert would prefer to deliver under blinding), not learning impact. They call for course-embedded randomized controlled trials and more careful deployment research. That distinction matters. A beautifully clear answer can help a confused student. It can also short-circuit productive struggle.

How Should Institutions Respond?

My takeaway is that law schools should neither panic nor shrug. The right institutional response is not “ban AI” or “replace teaching assistants with chatbots.” It is to treat AI as a potentially excellent first-line clarification layer that is available at all hours, tuned to the course, tested against faculty preferences, grounded where grounding helps, and designed to escalate hard or ambiguous questions to humans.

The paper’s real contribution is not the leaderboard. It is the evaluation model. Instead of asking whether AI can hit a single answer key, the authors ask whether it can satisfy expert judgment in a domain where quality is partly tacit. That is exactly the right question for law, and probably for many fields beyond law.

The unsettling possibility is that a great deal of everyday legal teaching consists of compressible professional judgment, namely recurring patterns, standard distinctions, canonical caveats, and disciplined ways of saying “it depends.” If AI can deliver that layer well, professors become more important, but their comparative advantage moves upward. Less time spent repeating UCC basics. More time spent teaching students when the standard answer is incomplete, contested, or morally unsatisfying.

Despite the usual research caveats, the results of this study are likely to change how we teach and tutor law students, and students in other fields as well. They may also change how lawyers prepare legal arguments.