We conducted a human-subject study of named entity recognition (NER) on a noisy corpus of conversational music recommendation queries containing many irregular and novel named entities.

We evaluated human NER behaviour under these challenging conditions and compared it with that of the currently dominant NER systems, fine-tuned transformer models. Our goal was to understand the task well enough to guide the design of better evaluation methods and NER algorithms.

The results showed that NER in our context was quite hard for both humans and algorithms under a strict evaluation schema; humans had higher precision, while the model achieved higher recall, likely because of entity exposure, especially during pre-training; and entity types exhibited different error patterns (e.g., frequent typing errors for artists).
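To make the strict evaluation schema concrete, the following is a minimal illustrative sketch of strict span-level NER scoring, not the paper's actual evaluation code: a prediction counts as correct only when both the span boundaries and the entity type match the gold annotation exactly. The query, spans, and entity-type labels below are hypothetical examples.

```python
# Illustrative sketch of strict span-level NER scoring (hypothetical
# helper, not the paper's evaluation code). A predicted entity is a
# true positive only if its (start, end, type) matches gold exactly.

def strict_prf(gold, pred):
    """gold/pred: sets of (start, end, type) tuples for one utterance."""
    tp = len(gold & pred)  # exact boundary + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical query: "play dark side of the moon by pink floyd"
gold = {(5, 26, "WoA"), (30, 40, "Artist")}   # gold work-of-art + artist
pred = {(5, 26, "WoA"), (30, 35, "Artist")}   # boundary error on artist
p, r, f = strict_prf(gold, pred)
print(p, r, f)  # 0.5 0.5 0.5
```

Under this strict schema, the partially overlapping artist span earns no credit, which is why both humans and models score noticeably lower than under relaxed, overlap-based matching.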

The released corpus goes beyond predefined frames of interaction and can support future work in conversational music recommendation.

This paper has been accepted for publication in the proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023).