GIVE-1: Heat maps
On this page, we visualize, in the form of heat maps, the areas in which different NLG systems had difficulties. Each map colors the tiles of one evaluation world according to some measure of intensity: the average time spent on a tile, the number of users who were standing on the tile when they lost the game, or the number of users who were standing on the tile when they asked for help because they did not understand an instruction. Thus, "warmer" areas are those in which a system had particular difficulties. Below, we first link to the individual heat maps; then we explain in more detail how they were generated.
The heat maps
Time per tile:
Time per tile (quadratic):
Locations where users lost:
Locations where users asked for help:
How the heat maps are computed
For each combination of world and server, we extract from the database the average time spent on each tile and the number of times that a LostMessage or a DidNotUnderstand message was received on each tile. "Average" means that we divide the total time and the total counts by the number of valid games for the world/server combination. Both time spent on a tile and events occurring on a tile only count after the tutorial has been completed. Note that to determine where users lost, we use the tile where the user received the "lost" message. This may not be the same tile as the one where the alarm was triggered.
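The averaging step described above can be sketched as follows. This is only an illustration: the function name average_per_tile, the (tile, amount) record format, and the representation of tiles as coordinate pairs are assumptions, not the actual database schema or extraction code.

```python
from collections import defaultdict


def average_per_tile(records, num_valid_games):
    """Average a per-tile quantity over the valid games of one world/server pair.

    `records` is a hypothetical iterable of (tile, amount) pairs, already
    restricted to what happened after the tutorial. `amount` is either the
    time spent on the tile or 1 for each LostMessage / DidNotUnderstand event.
    """
    if num_valid_games <= 0:
        raise ValueError("need at least one valid game")
    totals = defaultdict(float)
    for tile, amount in records:
        totals[tile] += amount
    # Divide the totals by the number of valid games, not by the number of
    # records: a tile that was never visited in some game still counts that game.
    return {tile: total / num_valid_games for tile, total in totals.items()}
```

For example, two valid games that spent 2 and 4 seconds on the same tile would give that tile an average of 3 seconds.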
We then normalize the values by scaling the maximum value to one. In the "normalized for each world" case, we use the maximum value over all tiles and all servers for each world; in the "normalized for each world+server" case, we use the maximum value over all tiles for one specific combination of world and server. In the case of the average times, we ignore the top percentile of values as outliers when determining the maximum value, so outlier tiles can end up with values above one. "Quadratic" in the case of average times means that all values were squared to bring high values out more clearly against the background noise.
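A minimal sketch of the normalization follows. All names (world_values, reference_max, normalize) and the layout of the data as a dictionary keyed by (world, server) are hypothetical. The exact way the top percentile is cut off and the point at which values are squared are not specified above; here the largest 1% of values (rounded down) are skipped when choosing the maximum, and squaring is applied after scaling. For non-negative values, squaring before or after scaling gives the same result as long as the maximum is chosen from the same tile.

```python
def world_values(data, world):
    """Collect the tile values of all servers for one world.

    `data` maps (world, server) to a dict of tile -> average value.
    """
    return [v for (w, _server), tiles in data.items() if w == world
            for v in tiles.values()]


def reference_max(values, drop_top_percent=0):
    """Return the maximum, ignoring the top `drop_top_percent` percent of values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values to normalize")
    # Number of largest values treated as outliers (assumed: rounded down).
    drop = min(len(ordered) * drop_top_percent // 100, len(ordered) - 1)
    return ordered[len(ordered) - 1 - drop]


def normalize(tile_values, max_value, quadratic=False):
    """Scale values so that `max_value` maps to one; optionally square them."""
    if max_value <= 0:
        return {tile: 0.0 for tile in tile_values}
    scaled = {tile: v / max_value for tile, v in tile_values.items()}
    if quadratic:
        scaled = {tile: s * s for tile, s in scaled.items()}
    return scaled
```

In this sketch, the "normalized for each world" case would call reference_max(world_values(data, world)), while the "normalized for each world+server" case would call reference_max on the values of a single data[(world, server)] entry. For average times, drop_top_percent=1 would be passed in both cases.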
Then the tiles are colored in the heat map according to the scale below. Left/blue represents low values, and right/red represents high values. Black is zero, and white is one (or higher, in case of outliers).
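The mapping from a normalized value to a color could look roughly like the sketch below. Only the endpoints are taken from the description above (black for zero, white for one or higher, blue for low and red for high values). The hue ramp from blue to red in between and the function name tile_color are assumptions and need not match the actual color scale.

```python
import colorsys


def tile_color(value):
    """Map a normalized tile value to an (R, G, B) tuple with 0-255 channels."""
    if value <= 0:
        return (0, 0, 0)        # black: zero
    if value >= 1:
        return (255, 255, 255)  # white: one, or higher for outliers
    # Assumed ramp: hue runs from blue (240 degrees) at low values
    # down to red (0 degrees) at high values.
    hue = (240.0 / 360.0) * (1.0 - value)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))
```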