A professor I serve with on the advisory board of a Midwestern university told me something last semester that I have not been able to stop thinking about.
She said the grades went up. Not because students were learning more. Because nobody was turning in garbage anymore. The floor rose. The ceiling dropped. And the middle, that thick band of B-minus to C-plus work that used to be the majority, became almost the entirety.
Her bell curve did not shift right. It collapsed.
She was not describing a cheating problem. She was describing a quality problem. And the difference between those two diagnoses changes everything about what you do next.
The Distribution Nobody Talks About
When everyone has access to the same tool that produces the same competent-but-unremarkable output, the entire distribution compresses toward the mean. This is not just a theory. The research is catching up to what teachers have been feeling for two years.
Comparative studies of AI versus human essay grading consistently find that AI grading produces narrower score distributions and more uniform scores than human grading. Not because AI is more accurate, but because AI converges on a homogenized standard. The same convergence happens to student writing when AI assists in producing it. Tighter distribution. Lower variance. Lower ceiling.
A study comparing LLM and teacher essay scoring found that AI models excelled at language-related criteria but struggled with the dimensions that distinguish exceptional work from competent work. Those are exactly the dimensions that produced the right tail of the grade distribution, the part where the A papers lived.
The Hechinger Report put it plainly: AI essay grading is already "as good as an overburdened teacher." That is exactly the problem. The new floor is "as good as overburdened." The new ceiling is also "as good as overburdened."
When the floor rises to the average, and the ceiling drops to meet it, you have not raised quality. You have erased the signal that told you who was actually learning.
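To make the compression concrete, here is a toy sketch. It is not from any of the studies above: it simply assumes each AI-assisted grade is a weighted blend of a student's own ability and the tool's uniform output, and every number in it, including the 80/20 blend weight, is invented purely for illustration.

```python
# Toy sketch, not from the research cited here: what "distribution collapse"
# looks like in numbers. Assume each AI-assisted grade blends the student's
# own ability with the tool's uniform output. All figures are hypothetical.
import statistics

true_ability = [55, 62, 68, 71, 74, 77, 80, 84, 89, 95]  # made-up pre-AI grades
tool_quality = 78   # the tool's one-size-fits-all output, graded on its own
blend = 0.8         # hypothetical: 80% of each artifact is the tool's work

assisted = [round(blend * tool_quality + (1 - blend) * a) for a in true_ability]

print("before:", true_ability, "stdev:", round(statistics.stdev(true_ability), 1))
print("after: ", assisted, "stdev:", round(statistics.stdev(assisted), 1))
# before: stdev ~12.2, floor 55, ceiling 95
# after:  stdev ~2.4,  floor ~73, ceiling ~81
# The floor rose, the ceiling dropped, and the spread that separated
# A work from C work is mostly gone.
```

The exact weights do not matter. Any blend that pulls every student toward the same tool output shrinks the variance, and the variance is the grading signal.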
Why Detection Does Not Fix This
Here is what the professor tried first, because it is what everyone tries first. She added detection. Ran the papers through a tool. Caught a few students. Held meetings. Wrote up violations.
The grades did not re-spread. The distribution stayed compressed. Detection does not fix distribution collapse because detection answers a different question. Detection asks, "Did a student use AI?" The professor needed to answer, "Can a student think?"
Those are not the same question, and every hour spent on the first one is an hour not spent on the second.
The Redesign That Actually Worked
So she redesigned the assignment. Not the detection. The thing she was measuring.
She moved the deliverable from "produce a polished artifact" to "defend a specific decision." She graded the thinking trail, not the final document. She used in-class oral defense for a portion of the grade. She required students to cite their AI use and explain what they accepted, what they rejected, and why.
The result: the distribution re-spread. The Cs separated from the Bs again. The As re-emerged. The students who could think with the tool outperformed the students who could only copy from it.
She stopped measuring task completion. She started measuring learning. That one shift changed everything.
The fix was not catching the cheaters. The fix was making the assessment measure something AI could not carry a student through.
The Honest Cost
I want to be direct about something. The redesign works, and it costs an enormous amount of teacher time. Oral defenses do not scale. Grading a thinking trail is harder than grading a finished paper. Requiring AI citations means teaching a literacy skill that most institutions have not yet built a curriculum for.
The professor who redesigned her class spent roughly double the hours on assessment that semester. She told me it was the best teaching she had done in a decade and that she could not sustain it without institutional support.
That is the honest version. The fix exists. But it costs the exact thing the institution thought it was buying when it adopted AI in the first place: time. Leaders who want both the productivity gain and the distribution spread are asking for something that is not yet possible without significant reinvestment in how assessment works.
Same Problem, Different Industry
If you are not in education, you might think this does not apply to you. It does.
Every home services contractor with ServiceTitan, Housecall Pro, or a ChatGPT subscription is now generating polished, competent estimates. The bell curve of estimate quality the customer experiences has collapsed. Three years ago a homeowner could tell the professional from the amateur by reading two proposals side by side. Today the proposals all sound the same. Professional language, clean formatting, reasonable scope of work.
The contractor who used to win on a clearly better proposal can no longer differentiate on document quality. The ones winning now are differentiating on the defense of the estimate. The in-home walkthrough. The diagnostic explanation. The moment where the tech says, "I would run this line behind the dryer because the previous owner did something weird with the breaker box." That oddly specific, true detail, the one AI would never bother to invent, is the new signal.
Customer service replies are the same story. AI-drafted responses to reviews and messages have collapsed the quality distribution from "some shops respond like jerks, some respond like pros" to "everyone sounds like a polite bot." The shops winning attention now are the ones whose owners' voices show through.
When the floor rises to the average, the only competitive move is to make the thing that cannot be averaged the thing you are measured on.
The Question Behind the Question
The professor's story is not really about grading. It is about what happens to any system where the outputs converge, and nobody changes what they are measuring.
If you are an educator, the question is: what does your assessment measure that AI cannot do?
If you are a business owner, the question is: what does your proposal, your estimate, your customer interaction demonstrate that a competent AI template cannot?
If you are a leader of any kind, the question is: are you still measuring the artifact, or are you measuring the thinking?
The professor who noticed her bell curve had collapsed had the courage to change the test instead of blaming the students. That is the move. It is available to anyone in any field. It is harder than buying a detector, and it is the only thing that works.
If your team's outputs are converging to the average and you are not sure what to measure instead, that is the diagnostic I build with clients through bensaibrain.com. Come figure it out.
Sources
AI Versus Human Effectiveness in Essay Evaluation -- Discover Education / Springer
Can AI Grade Your Essays? Multidimensional Essay Scoring -- arXiv
Created with ❤️ by humans + AI assistance 🤖