Evaluating Natural Language Generation (NLG) systems is a notoriously hard problem: unlike natural language interpretation, where annotated corpora may provide a gold standard against which a system can be measured, there are generally multiple equally good outputs that an NLG system might produce. At the same time, access to human experimental subjects who could judge the quality of the system's output is usually too expensive for large-scale use. Nevertheless, there has recently been increased interest in shared tasks and new methodologies for evaluating and comparing NLG systems.
The Challenge on Generating Instructions in Virtual Environments (GIVE) is a novel approach to this evaluation problem. In the GIVE scenario, a human user performs a "treasure hunt" task in a virtual 3D environment. The NLG system's job is to generate, in real time, a sequence of natural-language instructions that will help the user perform this task. The crucial point is that users connect to the generation systems over the Internet. By logging how well they were able to follow the system's instructions, we can evaluate the quality of these instructions in terms of task completion rates and times, subjective measures such as helpfulness and friendliness, and runtime performance. Because the user and the system don't need to be physically in the same place, access to experimental subjects over the Internet becomes easy. GIVE has been shown to provide results that are consistent with, but more detailed than, those obtained from a traditional lab-based evaluation.
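To illustrate the kind of objective metrics described above, here is a minimal Python sketch of how task completion rates and mean completion times could be aggregated per system from interaction logs. The record format, field names, and data values are illustrative assumptions, not the actual GIVE log schema.

```python
# Hypothetical sketch of aggregating GIVE-style evaluation metrics.
# The GameRun fields and the sample data below are assumptions for
# illustration; the real GIVE logs contain much richer information.
from dataclasses import dataclass
from statistics import mean


@dataclass
class GameRun:
    player_id: str        # anonymous player identifier
    system: str           # name of the NLG system under evaluation
    completed: bool       # did the player finish the treasure hunt?
    duration_sec: float   # wall-clock time for the run


def evaluate(runs):
    """Compute per-system task completion rate and mean time of successful runs."""
    report = {}
    for s in {r.system for r in runs}:
        sys_runs = [r for r in runs if r.system == s]
        done = [r for r in sys_runs if r.completed]
        report[s] = {
            "completion_rate": len(done) / len(sys_runs),
            "mean_time_sec": mean(r.duration_sec for r in done) if done else None,
        }
    return report


# Illustrative sample data (not real challenge results).
runs = [
    GameRun("p1", "SystemA", True, 210.0),
    GameRun("p2", "SystemA", False, 600.0),
    GameRun("p3", "SystemB", True, 180.0),
    GameRun("p4", "SystemB", True, 240.0),
]
print(evaluate(runs))
```

Subjective measures such as helpfulness and friendliness come from post-game questionnaires rather than logs, so they would be aggregated separately.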
GIVE is a theory-neutral, end-to-end evaluation effort for NLG systems. It involves research opportunities in text planning, sentence planning, realization, and situated communication. One particularly interesting aspect of situating the generation problem in a virtual environment is that spatial and relational expressions play a bigger role than in other NLG tasks. Beyond NLG, GIVE could be useful as a testbed for improving components of dialogue systems, or for computational semanticists working on spatial language. To get an idea of how it works, please take a look at our EACL 2009 demo paper describing the software architecture, or visit the GIVE software page and try it out on your own.
Three installments of the GIVE Challenge have been run so far. In 2009, GIVE-1 evaluated five NLG systems using data from over 1100 valid game runs. One year later, GIVE-2 was completed, attracting seven NLG systems and more than 1800 players from 39 countries. The latest challenge, GIVE-2.5, took place in 2011-2012 and collected almost 700 runs for a total of eight participating NLG systems. To our knowledge, this makes GIVE the largest NLG evaluation effort ever in terms of the number of experimental subjects. The next edition of GIVE has not been planned yet, but we encourage everybody interested in GIVE to consider participating in the forthcoming GRUVE Challenge on Generating Routes under Uncertainty in Virtual Environments.