USR: An Unsupervised and Reference Free Evaluation Metric for Dialog (Mehri and Eskenazi, 2020)
This dataset was collected to assess dialog evaluation metrics. In our paper (accepted to ACL 2020), we use this data to measure the quality of several existing word-overlap and embedding-based metrics, as well as our newly proposed USR metric.
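As a rough illustration of what assessing a metric against human judgements can look like, the sketch below correlates a metric's scores with human quality ratings for the same responses. The ratings and scores are made-up placeholders rather than values from this dataset, the function names are hypothetical, and the exact evaluation procedure in the paper may differ.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: one human rating and one metric score per response.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

pearson_r = pearson(human_ratings, metric_scores)
spearman_rho = spearman(human_ratings, metric_scores)
```

A metric whose scores rise and fall with the human ratings gets a correlation close to 1, while a metric unrelated to human judgement gets a value near 0.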
The human quality judgements were performed on two datasets:
We have also released, or are in the process of releasing, our code on GitHub.
Contact me if you have any questions about the creation or usage of this data.