The GIVE Challenge
Evaluating natural language generation (NLG) systems is a notoriously hard problem. Unlike in natural language interpretation, where annotated corpora may provide a gold standard against which a system can be measured, there are generally multiple equally good outputs that an NLG system might produce. At the same time, human experimental subjects who could judge the quality of a system's output are usually too expensive to recruit for large-scale use. Nevertheless, there has recently been increased interest in shared tasks and new methodologies for evaluating and comparing NLG systems.
The Challenge on Generating Instructions in Virtual Environments (GIVE) is a novel approach to evaluating NLG systems. In the GIVE scenario, a human user performs a "treasure hunt" task in a virtual 3D environment. The NLG system's job is to generate, in real time, a sequence of natural-language instructions that help the user perform this task. Crucially, users connect to the generation systems over the Internet. By logging how well users are able to follow a system's instructions, we can evaluate the quality of these instructions in terms of task completion rates and times, subjective measures such as helpfulness and friendliness, and runtime performance. Because the user and the system do not need to be physically in the same place, recruiting experimental subjects over the Internet becomes easy. GIVE-1 has been shown to provide results that are consistent with, but more detailed than, those obtained from a traditional lab-based evaluation.
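As a rough illustration of the objective measures mentioned above, the following sketch computes a per-system task completion rate and the mean completion time from a list of logged game runs. The `GameRun` record, its field names, and the `summarize` function are assumptions made for this example only; they do not describe the actual GIVE log format or evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class GameRun:
    # Hypothetical log record; the real GIVE logs may be structured differently.
    system: str              # name of the NLG system that gave the instructions
    completed: bool          # whether the user finished the treasure hunt
    duration_seconds: float  # time from start of the game to its end


def summarize(runs):
    """Return {system: (completion_rate, mean_time)}.

    mean_time is averaged over completed runs only and is None
    when a system has no completed runs.
    """
    by_system = defaultdict(list)
    for run in runs:
        by_system[run.system].append(run)

    summary = {}
    for system, group in by_system.items():
        times = [r.duration_seconds for r in group if r.completed]
        rate = len(times) / len(group)
        summary[system] = (rate, mean(times) if times else None)
    return summary
```

Averaging times over completed runs only is one possible design choice; including abandoned runs would mix two different kinds of outcome in a single number.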
GIVE is a theory-neutral, end-to-end evaluation effort for NLG systems. It offers research opportunities in text planning, sentence planning, realization, and situated communication. One particularly interesting aspect of situating the generation problem in a virtual environment is that spatial and relational expressions play a bigger role than in other NLG tasks. Beyond NLG, GIVE may be of interest as a testbed for improving the NLG components of dialogue systems, and to computational semanticists working on spatial language.
The Second GIVE Challenge (GIVE-2) is currently underway, and its evaluation period will start in February 2010. We invite you to have a look at the website for more information on how to participate. Last year, we ran the GIVE-1 Challenge, in which five NLG systems were evaluated using data from over 1100 game runs. To our knowledge, this made GIVE-1 the largest-ever NLG evaluation effort in terms of the number of experimental subjects.