Spoken language recognition on Mozilla Common Voice: Audio Transformations | by Sergey Vilov

This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and data preprocessing, and in Part II we analysed the performance of several neural network classifiers.

The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could potentially be improved by adding more data. One very common way to get extra data is to synthesize it by performing various transformations on the available dataset.

In this article, we will consider 5 popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.

The tutorial notebook can be found here.

For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. This is the sentence "The burning fire had been extinguished."

import numpy as np
import librosa as lr
import IPython.display

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast')  # load signal
IPython.display.Audio(signal, rate=sr)

Original sample common_voice_en_100040 from MCV.

Original signal waveform (image by the author)

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR), defined here as the ratio between the maximal signal amplitude and the standard deviation of the noise. We will generate several noise levels, defined by SNR, and see how they change the signal.

SNRs = (5, 10, 100, 1000)  # signal-to-noise ratio: max amplitude over noise std

noisy_signal = {}
for snr in SNRs:
    noise_std = max(abs(signal)) / snr  # get noise std
    noise = noise_std * np.random.randn(len(signal))  # generate noise with given std
    noisy_signal[snr] = signal + noise

# play the noisiest (SNR=5) and the cleanest (SNR=1000) versions
IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))
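
To reuse this augmentation in a data pipeline, the noise step can be wrapped in a small function. The sketch below is only one possible way to do it: the helper name add_noise, the synthetic 440 Hz tone, and the names sr_demo and tone are illustrative assumptions and not part of the tutorial notebook. It keeps the same SNR definition as above (maximal amplitude over noise standard deviation) and checks that the injected noise has roughly the requested standard deviation.

```python
import numpy as np


def add_noise(signal, snr, rng=None):
    """Return a noisy copy of `signal` with noise std = max(|signal|) / snr.

    `add_noise` is a hypothetical helper, not taken from the tutorial notebook.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise_std = np.max(np.abs(signal)) / snr           # same SNR definition as above
    noise = noise_std * rng.standard_normal(len(signal))
    return signal + noise


# Synthetic stand-in for a recording (assumption): 1 s of a 440 Hz tone at 16 kHz.
sr_demo = 16000
t = np.arange(sr_demo) / sr_demo
tone = 0.5 * np.sin(2 * np.pi * 440 * t)

rng = np.random.default_rng(0)                          # fixed seed for reproducibility
noisy_tone = add_noise(tone, snr=10, rng=rng)

# The measured noise std should be close to max amplitude / SNR = 0.5 / 10.
measured_std = np.std(noisy_tone - tone)
print(f"target noise std: {0.5 / 10:.3f}, measured: {measured_std:.3f}")
```

Passing a seeded generator makes the augmentation reproducible, which can help when comparing models trained on augmented data.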