Today’s little essay has no goal other than to be mildly amusing and somewhat informative. We’ll talk about Winograd schemas and other challenges related to evaluating a computer’s ability to understand natural language texts.

Let me state first that I belong to a somewhat "lower excitement" camp of AI enthusiasts. I tend to believe there is no other-worldly magic in human intelligence, and eventually we’ll be able to build something resembling our brains and even surpassing them. However, I don’t see this coming any time soon, and I doubt it is going to be achieved with the current technological fads. Still, it is obvious that we can do a lot with the tools we have now, and it is very interesting to explore their limits.

Experienced colleagues taught me to start any research activity by designing an evaluation scheme that will be used to prove some kind of superiority of my approach over the current state of the art. If I can’t measure my method’s performance objectively, I am going to have a hard time defending its worth. This works well for classic tasks like sorting, but how do we compare, for example, two machine translation systems? Which one "translates better"? Furthermore, if my system is only at the proof-of-concept stage, how would I compare it with something established?

Evaluation schemes in natural language processing are fraught with troubles. We don’t even need a grand challenge like machine translation to notice them: let’s consider the modest task of part-of-speech tagging. A tagger has to label each word in the input text with a "POS tag", e.g., to decide whether a given word is a noun or a verb. At a glance, evaluating such a system should be easy: we simply feed the tagger a manually annotated "gold standard" collection of texts and calculate the percentage of correctly chosen tags. Creating a corpus of annotated texts isn’t the most pleasant kind of work, but it’s just a one-time effort.
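
To make this concrete, here is a minimal sketch of such an evaluation; the tiny gold-standard corpus and the tagger output are invented for illustration.

```python
# Compare a tagger's output against a manually annotated gold standard.
gold      = [("I", "PRP"), ("play", "VBP"), ("tennis", "NN"), ("daily", "RB")]
predicted = [("I", "PRP"), ("play", "NN"),  ("tennis", "NN"), ("daily", "RB")]

correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
accuracy = correct / len(gold)
print(f"Tagging accuracy: {accuracy:.0%}")  # 75%: one of the four tags ("play") is wrong
```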

It turns out, however, that plain scores like "90% accuracy" aren’t too revealing. Let’s first note that one must strive for rather high scores here. If a typical sentence contains 10 words, "90% accuracy" means one error per sentence on average. If the incorrectly assigned tags are needed later in the pipeline, they will cause further errors downstream.

High accuracy can also be obtained by means of certain trickery, which will eventually backfire. For example, the Penn Treebank annotation scheme presumes that expressions like "credit card" or "tennis ball" are treated as "noun modifier + noun" rather than "adjective + noun". Thus, a tagger can simply mark "tennis" as a noun all the time and be correct. Naturally, if our next goal is, for example, entity recognition, this annotation won’t help us decide which nouns are proper objects and which ones are merely modifiers.
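
This kind of shortcut is essentially a most-frequent-tag baseline: always emit the tag a word received most often in the training data. A minimal sketch, with an invented toy corpus:

```python
from collections import Counter, defaultdict

# A hypothetical training corpus of (word, tag) pairs.
training = [("tennis", "NN"), ("tennis", "NN"), ("tennis", "JJ"),
            ("ball", "NN"), ("plays", "VBZ")]

# Count how often each word receives each tag...
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

# ...and always emit the most frequent tag for that word.
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

print(most_frequent_tag["tennis"])  # "NN": the two noun observations outvote the single adjective one
```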

On top of that, text corpora are usually biased: for example, the Penn Treebank is mostly based on Wall Street Journal articles and thus contains few interrogative sentences and presumably no sentences with grammatical errors. For this reason, testing a tagger on the Penn corpus won’t show how well it behaves on questions and ungrammatical sentences.

Now, let me repeat that evaluating natural language processing systems is hard, and it becomes harder as the objectives of these systems grow more sophisticated. The most ambitious objective is, perhaps, "natural language understanding", which comprises quite a diverse set of tasks. Nine such tasks, together with evaluation criteria, were relatively recently (2019) combined into GLUE (General Language Understanding Evaluation), a standardized multi-task benchmark for natural language understanding. In other words, if everyone uses GLUE to test their systems, we’ll have objective and comparable performance scores. GLUE tasks challenge an NLP system to judge the grammatical correctness of input sentences, identify the prevailing sentiment, check sentence pairs for semantic equivalence, and decide whether certain sentences are logically connected (one entails another).
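
As a rough illustration of how one of these tasks looks from the outside, here is a sketch that peeks into CoLA (the grammatical acceptability task), assuming the Hugging Face datasets library and its distribution of GLUE:

```python
# A sketch only: requires the "datasets" package and network access.
from datasets import load_dataset

cola = load_dataset("glue", "cola")   # one of the nine GLUE tasks
example = cola["train"][0]
print(example["sentence"], example["label"])  # label 1 = acceptable, 0 = unacceptable
```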

Among all these tasks, I was most intrigued by the Winograd schema challenge and its modified version, WinoGrande, discussed in CACM last year. It struck me as a simple test that nevertheless seems to require very advanced AI techniques to pass.

Wikipedia actually summarizes the main points of the matter quite well. In its basic form, a Winograd schema consists of a short text fragment (or a single sentence) containing a pronoun and a "special word". The task of the computer is to identify the antecedent of the pronoun both in the given fragment and in a fragment where the special word is replaced with an "alternate word". Each schema is specifically crafted so that the right answers differ in these two cases.

For example, consider the classic schema "The city councilmen refused the demonstrators a permit because they [feared | advocated] violence."

Here "they" refers to "councilmen" for the fragment with the original special word "feared". In case of a fragment where the special word is replaced with an alternate word "advocated", the right answer becomes "demonstrators".

So, at a glance, one needs to exhibit some sort of commonsense reasoning to solve such a challenge. However, as it turns out, a Winograd schema has to be chosen very carefully to avoid possible reliance on simple statistical clues. For example, the following schema is easier to solve: "the women stopped taking pills because they were [pregnant | carcinogenic]." The problem is that pills cannot be pregnant, and women cannot be carcinogenic, which is reflected in co-occurrence statistics. In other words, the combination "women + were + carcinogenic" is very rare in comparison with "women + were + pregnant", so there is no reason to choose the first option at all.
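
The statistical shortcut can be caricatured in a few lines: pick the antecedent whose combination with the predicate is more frequent in a large corpus. The counts below are invented for illustration:

```python
# Hypothetical corpus counts for "candidate + were + adjective" combinations.
cooccurrence = {
    ("women", "were", "pregnant"): 12_000,
    ("pills", "were", "pregnant"): 0,
    ("women", "were", "carcinogenic"): 1,
    ("pills", "were", "carcinogenic"): 450,
}

def resolve(candidates, verb, special_word):
    # Choose the candidate that co-occurs most often with the predicate.
    return max(candidates, key=lambda c: cooccurrence.get((c, verb, special_word), 0))

print(resolve(["women", "pills"], "were", "pregnant"))      # women
print(resolve(["women", "pills"], "were", "carcinogenic"))  # pills
```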

It is very interesting that just a few years ago nobody could pass the Winograd schema challenge with a decent score: five to seven years ago, a typical accuracy for a state-of-the-art system was around 60-70%. With the arrival of BERT and GPT-3, accuracy scores jumped above 90%, which is already considered near-human performance.

Naturally, "near-human" results in this area are very intriguing, and trigger further investigation. The CACM paper authors found out that much of the current success can still be explained with statistical inference. As people, we can anticipate a clear statistical connection between the words like "woman" and "pregnancy" or "man" and "king", but sometimes a connection is revealed at the level of word embeddings rather than words, which represent quite a broad context. For example, a modern language model knows that certain professions like an engineer are typically associated with males rather than females.

As a side note, a lot of today’s research on word embeddings is focused on revealing various sorts of "biases" that make a model associate certain groups of people with certain kinds of negative concepts. While there is a lot of modern political agenda here, biases indeed may backfire in completely unrelated scenarios. For example, a Google AI team blog post discusses the following case. You write a system for analyzing the sentiment of movie reviews. A conventional biased language model knows that certain non-European names are associated with negative sentiment. If such a name is mentioned in a movie review, the review is more likely to be deemed emotionally negative, which is hardly the intended behavior.

The CACM paper authors created their own variation of the Winograd schema challenge, WinoGrande. The most important innovation here is the attempt to debias the candidate schemas and thus make them more difficult to solve with purely statistical methods.

For example, they rejected the following schema: "The monkey loved to play with the balls but ignored the blocks because he found them [exciting | dull]." It turns out that a language model associates the words "loved" and "exciting" with a positive sentiment, and the words "ignored" and "dull" with a negative sentiment. The authors don’t explain what happens next, but I presume the system is able to associate the pronoun with a possible antecedent by sentiment similarity: a pronoun next to a positive word ("exciting") is associated with an object related to a positive verb ("loved"). Similarly, the pattern "pronoun + dull" is associated with the pattern "ignored + noun", because both "dull" and "ignored" carry a negative sentiment.
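
Here is a sketch of the polarity-matching heuristic I presume above; the tiny sentiment lexicon is invented, and a real model would derive such scores from embeddings rather than a hand-written table:

```python
# Crude sentiment scores for the words involved in the rejected schema.
polarity = {"loved": +1, "ignored": -1, "exciting": +1, "dull": -1}

def resolve(candidates_with_verbs, adjective):
    # Pick the candidate whose governing verb has the same polarity as the adjective.
    return max(candidates_with_verbs,
               key=lambda cv: polarity[cv[1]] * polarity[adjective])[0]

pairs = [("the balls", "loved"), ("the blocks", "ignored")]
print(resolve(pairs, "exciting"))  # the balls  (positive matches positive)
print(resolve(pairs, "dull"))      # the blocks (negative matches negative)
```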

I think their examples reveal rather straightforward associations, and there is no deep wisdom here. And yet, the authors managed to push the results of state-of-the-art systems on WinoGrande down to an accuracy of 60-80%. That still isn’t bad, but people easily score 94%, according to the same paper. I’d add that people aren’t always very attentive and can make trivial errors simply due to a lack of concentration, so they could probably score even higher.

All in all, this whole line of enquiry demonstrates well the current state of the art in NLP. On one hand, the progress is truly impressive, and the example of the Winograd schema challenge shows it clearly: just a few years ago, a system able to disambiguate pronouns in Winograd schemas on par with humans was almost inconceivable. On the other hand, we see that modern technologies keep refining the same statistical trickery we saw decades earlier. Let me return to my part-of-speech tagger example: if we decide that "tennis" is always a noun, the system can achieve a perfect 100% accuracy. When we introduce cases where "tennis" is labeled as an adjective, the system can figure out that this happens rarely, simply ignore these observations, and still achieve an accuracy of 95%. Modern systems are true masters of cutting corners and using whatever correlations they find in the input datasets. Since these correlations indeed reflect the data, the outcomes can be surprisingly good.

Many people are content with these results, and I won’t criticize them. After all, we need face recognition systems, machine translators, grammar checkers, sentiment analyzers, image classifiers and other practical tools. They are fine as long as we understand their flaws and provide a way for the user to work around them if something goes wrong. A lot of pain is caused by spurious punishments handed down and executed by Facebook and Google algorithms, but these systems merely fail in the same way a door lock may fail to open even when you have the right key. The real issue here is the lack of proper arbitration. However, the high performance of current systems should not make us believe that everything is already solved. From a practical standpoint, we still need systems that won’t fail unexpectedly in seemingly simple cases, systems that can recover when they do fail and give the user better feedback on what is going on. From a theoretical standpoint, we still face the challenge of understanding the enigma of intelligence. Both the successes and the failures of modern AI help us on this road.