An Empirical Study on the Number of Items in Human Evaluation of Automatically Generated Texts

Authors:
  1. González-Corbelle, Javier
  2. Alonso-Moral, Jose M.
  3. Crujeiras, Rosa M.
  4. Bugarín-Diz, Alberto
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue Title: Procesamiento del Lenguaje Natural, No. 72, March 2024

Issue: 72

Pages: 45-55

Type: Article


Abstract

Human evaluation of neural models in Natural Language Generation (NLG) requires careful experimental design regarding the number of evaluators, the number of items to assess, and the number of quality criteria, among other factors, both for the sake of reproducibility and to ensure that significant conclusions can be drawn. Although there are some generic recommendations on how to proceed, no evaluation protocol has yet been established and widely accepted. In this paper, we empirically address the impact of the number of items to assess in the context of human evaluation of NLG systems. We first apply resampling methods to simulate the evaluation of different sets of items by each evaluator. We then compare the results obtained by evaluating only a limited set of items with those obtained by evaluating all outputs of the system for a given test set. The empirical findings validate the research hypothesis: well-known statistical resampling methods can help obtain significant results even when each evaluator assesses only a small number of items.
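To illustrate the kind of resampling the abstract describes, the sketch below simulates evaluators who each rate only a small subset of items and compares the resulting bootstrap estimate with the mean over the full test set. It is a minimal sketch under stated assumptions, not the authors' actual procedure: the 1-5 rating scale, the subset size of 30 items, and the helper bootstrap_subset_means are hypothetical choices made here for illustration.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ratings: one quality score per generated text in the full
# test set (e.g., a 1-5 Likert score averaged over evaluators).
full_ratings = rng.integers(1, 6, size=300).astype(float)
full_mean = full_ratings.mean()

def bootstrap_subset_means(ratings, n_items=30, n_resamples=10_000, rng=rng):
    """Simulate many evaluations in which each evaluator rates only
    n_items texts drawn with replacement from the full test set, and
    return the distribution of the resulting mean scores."""
    idx = rng.integers(0, len(ratings), size=(n_resamples, n_items))
    return ratings[idx].mean(axis=1)

subset_means = bootstrap_subset_means(full_ratings, n_items=30)

# 95% bootstrap confidence interval for the mean score based on small subsets
ci_low, ci_high = np.percentile(subset_means, [2.5, 97.5])
print(f"Full-set mean: {full_mean:.2f}")
print(f"Bootstrap 95% CI with 30 items per evaluator: [{ci_low:.2f}, {ci_high:.2f}]")

If the bootstrap interval obtained from small per-evaluator subsets is narrow and contains the full-set mean, this supports the hypothesis that a limited number of items per evaluator can still yield stable conclusions.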
