We present a novel automatic system for performing explicit content detection directly on the audio signal.

Our modular approach uses an audio-to-character recognition model, a keyword spotting model associated with a dictionary of carefully chosen keywords, and a Random Forest classification model for the final decision. To the best of our knowledge, this is the first explicit content detection system based on audio only.

Overview of the proposed modular approach
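
The data flow between the three modules can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the keyword list is a placeholder, the audio-to-character model is taken as already having produced a transcript, substring counting stands in for the keyword spotting model, and a simple rule stands in for the trained Random Forest. None of the names below come from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical placeholder keywords; the paper's actual dictionary is not reproduced here.
KEYWORDS: List[str] = ["badword", "curse"]


def keyword_features(transcript: str, keywords: Sequence[str] = KEYWORDS) -> List[int]:
    """Stand-in for keyword spotting: count each keyword in a character transcript.

    The transcript is assumed to be the output of the audio-to-character model.
    Matching is case-insensitive and counts non-overlapping occurrences.
    """
    text = transcript.lower()
    return [text.count(keyword.lower()) for keyword in keywords]


def any_keyword_rule(features: Sequence[int]) -> bool:
    """Stand-in for the trained Random Forest: flag a track if any keyword fired."""
    return any(count > 0 for count in features)


def detect_explicit(
    transcript: str,
    classifier: Callable[[Sequence[int]], bool],
    keywords: Sequence[str] = KEYWORDS,
) -> bool:
    """Chain the stages: transcript -> keyword features -> final decision."""
    return bool(classifier(keyword_features(transcript, keywords)))
```

Keeping the classifier as a plain callable means a trained model's prediction function could replace `any_keyword_rule` without touching the feature extraction step, which mirrors the modular design described above.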

We demonstrate the individual relevance of our modules on a set of sub-tasks and compare our approach to a lyrics-informed oracle and a naive end-to-end architecture. The results are encouraging, with an F1-score of 67% on an industrial-scale explicit content dataset.
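
For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from confusion counts. The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.

    Returns 0.0 when there are no true positives, false positives, or false negatives,
    since the score is undefined in that case.
    """
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    return 2 * true_positives / denominator


# Hypothetical counts, for illustration only.
example_score = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```

Because F1 ignores true negatives, it is a common choice when the positive class (here, explicit tracks) is the one of interest.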

This paper has been published in the proceedings of the 45th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020).