Google today announced the release of a new data set, the Free Universal Sound Separation data set, or FUSS for short, intended to support the development of AI models that can separate distinct sounds from recorded mixtures. The potential use cases are broad; commercially, for instance, models trained on FUSS could be used in corporate settings to extract speech from conference calls.
It follows a study conducted by Google and the Idiap Research Institute in Switzerland describing two machine learning models, a speaker recognition network and a spectrogram masking network, that together "significantly" reduced the speech recognition word error rate (WER) on multi-speaker signals. Elsewhere, tech giants including Alibaba and Microsoft have invested considerable time and resources in solving the sound separation problem.
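The article only names the two networks, but the general recipe behind speaker-conditioned spectrogram masking is well established: a speaker encoder produces an embedding from reference audio, and a mask network uses that embedding to decide which time-frequency bins of the mixture belong to the target speaker. Below is a minimal sketch assuming PyTorch; the module names (`SpeakerEncoder`, `MaskNet`), layer sizes, and STFT parameters are hypothetical illustrations, not the architecture from the Google/Idiap study.

```python
# Sketch of speaker-conditioned spectrogram masking, assuming PyTorch.
# All layer sizes and module names are hypothetical; the modules are
# untrained and randomly initialized, for shape demonstration only.
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
N_BINS = N_FFT // 2 + 1  # frequency bins per STFT frame

class SpeakerEncoder(nn.Module):
    """Stand-in for a speaker recognition network producing one embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(N_BINS, emb_dim, batch_first=True)

    def forward(self, ref_mag):              # (batch, frames, bins)
        _, (h, _) = self.rnn(ref_mag)
        return h[-1]                          # (batch, emb_dim)

class MaskNet(nn.Module):
    """Predicts a [0, 1] mask over the mixture spectrogram,
    conditioned on the target speaker's embedding."""
    def __init__(self, emb_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(N_BINS + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, N_BINS)

    def forward(self, mix_mag, spk_emb):
        emb = spk_emb.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([mix_mag, emb], dim=-1))
        return torch.sigmoid(self.out(h))     # mask in [0, 1]

def extract_target(mixture, reference):
    """Mask the mixture in the STFT domain and resynthesize the waveform."""
    window = torch.hann_window(N_FFT)
    mix_stft = torch.stft(mixture, N_FFT, HOP, window=window, return_complex=True)
    ref_stft = torch.stft(reference, N_FFT, HOP, window=window, return_complex=True)
    mix_mag = mix_stft.abs().transpose(1, 2)  # (batch, frames, bins)
    ref_mag = ref_stft.abs().transpose(1, 2)
    spk_emb = SpeakerEncoder()(ref_mag)
    mask = MaskNet()(mix_mag, spk_emb).transpose(1, 2)
    return torch.istft(mix_stft * mask, N_FFT, HOP, window=window)

print(extract_target(torch.randn(1, 16000), torch.randn(1, 16000)).shape)
```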
As Google Research scientists John Hershey, Scott Wisdom, and Hakan Erdogan explain in a blog post, most sound separation models assume that the number of sounds in a mixture is static, and they either separate mixtures of a small number of sound types (such as speech versus non-speech) or separate different instances of the same sound type (such as a first speaker versus a second speaker). The FUSS data set shifts the focus to the more general problem of separating a variable number of arbitrary sounds from one another.
To this end, the FUSS data set includes a diverse set of sounds, a realistic room simulator, and code to mix these elements together into multi-source, multi-class audio with ground truth. Drawing on FreeSound.org audio clips, filtered to exclude those that humans could not separate when mixed, the Google researchers compiled 23 hours of audio comprising 12,377 sounds useful for mixing, from which they generated 20,000 mixtures for training an AI model, 1,000 mixtures for validating it, and 1,000 mixtures for evaluating it.
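The actual mixing code lives in the FUSS repository; as a rough illustration of the recipe described above, here is a minimal sketch, assuming NumPy, of how variable-source-count mixtures with ground truth might be generated. The 1-to-4 source count per mixture and the random gain range are assumptions made for the example.

```python
# Illustrative sketch of variable-source-count mixing with ground truth,
# assuming NumPy. The 1-to-4 source count and gain range are assumptions;
# the actual recipe is in the FUSS repository.
import numpy as np

rng = np.random.default_rng(0)
SR, CLIP_SECONDS = 16000, 10

def make_mixture(source_pool, max_sources=4):
    """Pick a random number of sources, scale them, and sum into a mixture.
    Returns the mixture and the scaled sources (the ground truth)."""
    n = rng.integers(1, max_sources + 1)
    picks = rng.choice(len(source_pool), size=n, replace=False)
    sources = []
    for i in picks:
        gain = 10 ** (rng.uniform(-15, 0) / 20)   # random level in dB
        sources.append(gain * source_pool[i])
    mixture = np.sum(sources, axis=0)
    return mixture, np.stack(sources)

# Stand-in clips; in practice these would be the filtered FreeSound.org audio.
pool = [rng.standard_normal(SR * CLIP_SECONDS) for _ in range(12)]
mix, truth = make_mixture(pool)
print(mix.shape, truth.shape)  # e.g. (160000,) (3, 160000)
```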
The researchers say they built their own room simulator using Google's TensorFlow machine learning framework; given the locations of a sound source and a microphone, it generates the impulse response of a box-shaped room with frequency-dependent reflective properties. FUSS ships with the pre-computed room impulse responses used for each audio sample, along with the mixing code. It is complemented by a pre-trained, mask-based separation model that can reconstruct multi-source mixtures with high fidelity.
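Room impulse responses like these are applied to dry sources by convolution before summing them into a reverberant mixture. As a minimal sketch, assuming SciPy, the step might look like the following; the exponentially decaying noise burst is a toy stand-in for the pre-computed responses FUSS ships with.

```python
# Sketch of applying room impulse responses (RIRs) to sources, assuming SciPy.
# The decaying-noise RIR is a toy stand-in; FUSS ships pre-computed RIRs
# generated by its TensorFlow room simulator.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
SR = 16000

def toy_rir(seconds=0.4, rt60=0.3):
    """Exponentially decaying noise as a crude room impulse response."""
    t = np.arange(int(seconds * SR)) / SR
    return rng.standard_normal(t.size) * np.exp(-6.9 * t / rt60)

def reverberant_mixture(sources, rirs):
    """Convolve each dry source with its RIR and sum into one mixture.
    The convolved (wet) sources serve as the ground-truth references."""
    wet = [fftconvolve(s, h)[: s.size] for s, h in zip(sources, rirs)]
    return np.sum(wet, axis=0), np.stack(wet)

dry = [rng.standard_normal(SR * 2) for _ in range(3)]   # three dry sources
mix, wet_truth = reverberant_mixture(dry, [toy_rir() for _ in dry])
print(mix.shape, wet_truth.shape)
```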
In the future, the Google team plans to release the code for the room simulator and to extend it to handle more computationally expensive acoustic properties, materials with different reflective characteristics, and novel room shapes. "Our hope is [the FUSS data set] will lower the barrier to new research, and in particular enable rapid iteration and the application of novel techniques from other machine learning domains to the sound separation challenge," wrote Hershey, Wisdom, and Erdogan.
The FUSS data set is available on GitHub and will be used in the DCASE challenge as a component of the Institute of Electrical and Electronics Engineers (IEEE) Sound Event Detection and Separation task. The released sound separation model will serve as a baseline for that competition and as a benchmark against which to demonstrate progress in future experiments.