AI beats humans in Stanford reading comprehension test
The writers and editors at Wired were more naturally curious than those at a number of other publications. They dug into the details. They wrote:
The benchmark is biased in favor of software, because humans and software are scored in different ways. The human responses were recorded on Amazon’s Mechanical Turk and it’s not clear what the people’s motives were in answering the questions.
They also quoted a Microsoft researcher who said “People are still much better than machines” at understanding the nuances of language.
Indeed, the people who constructed the test said the benchmark isn’t a good measure of how a native English speaker would score on the test. It was calculated in a way that favors machines over humans.
So, really, how “smart” is AI in terms of understanding language?
Jia and Liang at Stanford explored variations in the text being read by AI routines. They found that changes in the test paragraphs could drop machine reading comprehension accuracy scores from 75 percent to 36 percent or even 7 percent without changing how humans would interpret the text.
Here’s an example of what Jia and Liang did.
First, read the original text on which various routines were tested:
“Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager.”
The question posed to the AI NLP (natural language processing) routines was:
“What is the name of the quarterback who was 38 in Super Bowl XXXIII?”
Original Reading Comprehension Routine Prediction: John Elway
Now add one sentence to the end of the original paragraph and retest:
“Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver’s Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV.”
And the routines answered: Jeff Dean
Which goes to show how poorly the technology “understood” anything about the paragraph!
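For readers who want to try this kind of check themselves, here is a minimal sketch in Python of the procedure in the example: ask the same question before and after appending one distractor sentence, then compare the answers. The names `probe`, `Reader` and `distracted_reader` are hypothetical, made up for this illustration. In particular, `distracted_reader` is only a stand-in that mimics a routine latching onto the added sentence; it is not one of the systems Jia and Liang tested, and you would swap in whatever question-answering routine you actually want to probe.

```python
from typing import Callable

# A "reader" is any function that takes (paragraph, question) and returns an answer.
Reader = Callable[[str, str], str]

PARAGRAPH = (
    "Peyton Manning became the first quarterback ever to lead two different "
    "teams to multiple Super Bowls. He is also the oldest quarterback ever to "
    "play in a Super Bowl at age 39. The past record was held by John Elway, "
    "who led the Broncos to victory in Super Bowl XXXIII at age 38 and is "
    "currently Denver’s Executive Vice President of Football Operations and "
    "General Manager."
)
DISTRACTOR = "Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV."
QUESTION = "What is the name of the quarterback who was 38 in Super Bowl XXXIII?"
EXPECTED = "John Elway"


def probe(reader: Reader, paragraph: str, distractor: str,
          question: str, expected: str) -> dict:
    """Ask the same question before and after appending a distractor sentence."""
    before = reader(paragraph, question)
    after = reader(f"{paragraph} {distractor}", question)
    return {
        "before": before,
        "after": after,
        "correct_before": before == expected,
        "correct_after": after == expected,
    }


def distracted_reader(paragraph: str, question: str) -> str:
    """Hypothetical stand-in: answers correctly until the distractor shows up."""
    return "Jeff Dean" if "Jeff Dean" in paragraph else "John Elway"


result = probe(distracted_reader, PARAGRAPH, DISTRACTOR, QUESTION, EXPECTED)
print(result)
```

A reader that really understood the paragraph would give the same correct answer both times, because the added sentence changes nothing a human would care about.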
Jia and Liang’s research has been cited by 124 research papers since its initial pre-publication on arXiv, 107 of them this year (2018).
There is no doubt that natural-language-processing (NLP) routines can make better sense of text than before.
Speech-to-text transcription is getting very, very good (but it still fails to handle nuance).
But text understanding? Think of the last commercial conversational bots that you interacted with. Consider, for example, cases where you were trying to resolve a problem with your billing. The conversational bots are getting better, aren’t they? But do they strike you as really intelligent?
Think, too, of how products like the Google Assistant and Alexa have trained us to talk to them! (Or rather, how the programmers scripting these products have trained us to interact with their programs.)
The next time someone tells you AI understands, be polite. Point them at this post and move on.
Disclosure: I am the author of this article and it expresses my own opinions. This is not a sponsored post. I have no vested interest in any of the firms or institutions mentioned in this post. Nor does the Analyst Syndicate.