USR: Dialog Quality Annotations

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog (Mehri and Eskenazi, 2020)

This dataset was collected to assess dialog evaluation metrics. In our paper (accepted to ACL 2020), we use it to measure the quality of several existing word-overlap and embedding-based metrics, as well as our newly proposed USR metric.
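One common way to assess a metric against human judgements is rank correlation. The following is a minimal sketch of that idea, not necessarily the exact procedure used in the paper. The function names and the two score lists are hypothetical placeholders and are not taken from the released data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up placeholder values for illustration only (NOT from the dataset).
human_ratings = [1, 2, 2, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

rho = spearman(human_ratings, metric_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A metric whose scores rank responses in the same order as the human annotators yields a correlation near 1, while an unrelated metric yields a value near 0.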

Human quality judgements were collected on two datasets:

Amazon Topical-Chat, for which we evaluated 6 systems: (1) the original ground-truth, (2) 4 Transformer systems that differ in decoding (argmax decoding, and nucleus decoding with p = 0.3, 0.5, and 0.7), and (3) a newly generated human-written utterance.
Persona Chat, for which we evaluated 5 systems: (1) the original ground-truth, (2) 3 different models (KV-MemNN, Seq2Seq, LM), and (3) a newly generated human-written utterance.
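For quick reference, the systems listed above can be written down as a small lookup table. This is only a sketch: the dictionary name and the label strings are hypothetical and do not necessarily match the identifiers used in the released files.

```python
# Systems evaluated per dataset, as described above.
# Label strings are illustrative assumptions, not the dataset's field values.
EVALUATED_SYSTEMS = {
    "Topical-Chat": [
        "original ground-truth",
        "Transformer (argmax decoding)",
        "Transformer (nucleus decoding, p=0.3)",
        "Transformer (nucleus decoding, p=0.5)",
        "Transformer (nucleus decoding, p=0.7)",
        "new human-written utterance",
    ],
    "Persona Chat": [
        "original ground-truth",
        "KV-MemNN",
        "Seq2Seq",
        "LM",
        "new human-written utterance",
    ],
}

# Print how many systems were evaluated for each dataset.
for dataset, systems in EVALUATED_SYSTEMS.items():
    print(f"{dataset}: {len(systems)} systems")
```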

We have also released, or are in the process of releasing, our code on GitHub.

Contact me if you have any questions about the creation or usage of this data.