The effect of back-formulating questions in question answering evaluation

T. Sakai, Y. Saito, Y. Ichimura, T. Kokubu, M. Koyama. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004). ACM.
Spurred by competition-style workshops such as TREC [6], NTCIR [1] and CLEF [2], question answering (QA) has received a lot of attention in recent years. However, QA evaluation is still in its infancy: very little is known as to what constitutes a reliable QA evaluation experiment. For example, because QA needs to evaluate arbitrary answer strings, as opposed to document IDs in Information Retrieval (IR) evaluation, how to construct a truly reusable QA test collection is an open problem. Moreover, how to devise a reliable QA performance metric is still under debate [3].

This paper studies a specific issue in QA evaluation that has not been discussed widely before, namely, the effect of back-formulating questions in QA test collection construction. Although the question set of a QA test collection should ideally be a representative sample of naturally occurring real-world questions, test collection questions are often "made up" based on passages drawn from a document collection, owing to time and human-resource constraints or the unavailability of real data. For example, the CLEF 2003 QA Track devised questions without access to the document collection for monolingual QA, but back-formulated questions for cross-language QA. Moreover, some of the NTCIR-3 QAC1 [1] questions are rather complex and appear to have been back-formulated, e.g., QAC1-1176: "Who had the thirteenth largest income among the Diet members in 1997?" However, it is not known whether an experiment using back-formulated questions can really simulate a practical QA situation. Although this applies to IR as well, it is probably a more serious question for QA, as a QA question set affects evaluation not only in terms of what query terms it contains …