September 28, 2009

Reading Incomprehension


Many people remember those tests as lots of multiple-choice questions answered by marking bubbles with a No. 2 pencil, but today’s exams nearly always include the sort of “open ended” items where students fill up the blank pages of a test booklet with their own thoughts and words. On many tests today, a good number of points come from such open-ended items, and that’s where the real trouble begins.

Multiple-choice items are scored by machines, but open-ended items are scored by subjective humans who are prone to errors. I know because I was one of them. In 1994, I was a graduate student looking for part-time work. After a five-minute interview I got the job of scoring fourth-grade, statewide reading comprehension tests. The for-profit testing company that hired me paid almost $8 an hour, not bad money for me at the time.

One of the tests I scored had students read a passage about bicycle safety. They were then instructed to draw a poster that illustrated a rule that was indicated in the text. We would award one point for a poster that included a correct rule and zero for a drawing that did not.

The first poster I saw was a drawing of a young cyclist, a helmet tightly attached to his head, flying his bike over a canal filled with flaming oil, his two arms waving wildly in the air. I stared at the response for minutes. Was this a picture of a helmet-wearing child who understood the basic rules of bike safety? Or was it meant to portray a youngster killing himself on two wheels?

I was not the only one who was confused. Soon several of my fellow scorers — pretty much people off the street, like me — were debating my poster, some positing that it clearly showed an understanding of bike safety while others argued that it most certainly did not. I realized then — an epiphany confirmed over a decade and a half of experience in the testing industry — that the score any student would earn mostly depended on which temporary employee viewed his response.

A few years later, still a part-time worker, I had a similar experience. For one project our huge group spent weeks scoring ninth-grade movie reviews, each of us reading approximately 30 essays an hour (yes, one every two minutes), for eight hours a day, five days a week. At one point the woman beside me asked my opinion about the essay she was reading, a review of the X-rated movie “Debbie Does Dallas.” The woman thought it deserved a 3 (on a 6-point scale), but she settled on that only after weighing the student’s strong writing skills against the “inappropriate” subject matter. I argued the essay should be given a 6, as the comprehensive analysis of the movie was artfully written and also made me laugh my head off.

All of the 100 or so scorers in the room soon became embroiled in the debate. Eventually we came to the “consensus” that the essay deserved a 6 (“genius”), or a 4 (well-written but “naughty”), or a zero (“filth”). The essay was ultimately given a zero.

This kind of arbitrary decision is the rule, not the exception. The years I spent assessing open-ended questions convinced me that large-scale assessment was mostly a mad scramble to score tests, meet deadlines and rake in cash.

The cash, though, wasn’t bad. It was largely for this reason that I eventually became a project director for a private testing company. The scoring standards were still bleak. A couple of years ago I supervised a statewide reading assessment test. My colleague and I were relaxing at a pool because we believed we’d already finished scoring all of the tens of thousands of student responses. Then a call from the home office informed us that a couple of dozen unscored tests had been discovered.

Because our company’s deadline for returning the tests was that day, my colleague and I had to score them even though we were already well into happy hour. We spent the evening listening to a squeaky-voiced secretary read student answers to us over a scratchy speakerphone line, while we made decisions that could affect somebody’s future.

These are the kinds of tests, after all, that can help determine government financing for schools. There is already much debate over whether the progress that Secretary Duncan hopes to measure can be determined by standardized testing at all. But in the meantime, we can give more thought to who scores these tests. We could start by requiring that scoring be done only by professionals who have made a commitment to education — rather than by people like me.

Todd Farley is the author of the forthcoming “Making the Grades: My Misadventures in the Standardized Testing Industry.”

Illustration: Tucker Nichols

Copyright 2009 The New York Times Company