The growing interest in human-centered MIR motivates the development of perceptually grounded evaluation metrics. Despite the remarkable progress of lyrics-to-audio alignment systems in recent years, it remains unresolved whether the metrics used to assess their performance are perceptually grounded.
Although a tolerance window for errors was fixed at 0.3 s for the MIREX challenge, no experiment was conducted to establish the psychological validity of this threshold. Following an interdisciplinary approach informed by insights from psychology and musicology, we consider lyrics-to-audio alignment evaluation from a user-centered perspective.
In this paper, we call into question the perceptual robustness of the most widely used metric for evaluating this task. We investigate the perception of synchrony between audio and lyrics in two realistic experimental settings inspired by karaoke, and discuss the implications for evaluation metrics. The most striking findings are that the perceptual thresholds for synchrony between lyrics and audio are asymmetrical, and that rhythmic factors influence these thresholds.
This paper has been accepted for publication in the proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR 2021).