Spoken language recognition on Mozilla Common Voice — Audio Transformations. | by Sergey Vilov | Aug, 2023


Photo by Kelly Sikkema on Unsplash

This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and data preprocessing, and in Part II we analysed the performance of several neural network classifiers.

The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, the accuracy could potentially be improved by adding more data. A very common way to get additional data is to synthesize it by performing various transformations on the available dataset.

In this article, we will consider five popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.

The tutorial notebook can be found here.

For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. This is the sentence The burning fire had been extinguished.

import librosa as lr
import numpy as np
import IPython

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast') #load signal

IPython.display.Audio(signal, rate=sr)

Original sample common_voice_en_100040 from MCV.
Original signal waveform (image by the author)

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR), here defined as the ratio between the maximal signal amplitude and the standard deviation of the noise. We will generate several noise levels, defined via SNR, and see how they change the signal.

SNRs = (5,10,100,1000) #signal-to-noise ratio: max amplitude over noise std

noisy_signal = {}

for snr in SNRs:

    noise_std = np.max(np.abs(signal))/snr #get noise std
    noise = noise_std*np.random.randn(len(signal)) #generate noise with the given std

    noisy_signal[snr] = signal + noise

IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))

Signals obtained by superimposing noise with SNR=5 and SNR=1000 on the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform for several noise levels (image by the author)

So, SNR=1000 sounds almost like the unperturbed audio, while at SNR=5 one can only distinguish the strongest parts of the signal. In practice, the SNR level is a hyperparameter that depends on the dataset and the chosen classifier.

The simplest way to change the speed is just to pretend that the signal has a different sample rate. However, this will also change the pitch (how low or high the audio sounds). Increasing the sample rate will make the voice sound higher. To illustrate this, we will “increase” the sample rate of our example by a factor of 1.5:

IPython.display.Audio(signal, rate=sr*1.5)
Signal obtained by using a false sample rate for the original MCV sample common_voice_en_100040 (generated by the author).

Changing the speed without affecting the pitch is harder. One needs to use the Phase Vocoder (PV) algorithm. In brief, the input signal is first split into overlapping frames. Then, the spectrum within each frame is computed by applying the Fast Fourier Transform (FFT). The playing speed is then modified by resynthesizing frames at a different rate. Since the frequency content of each frame is not affected, the pitch stays the same. The PV interpolates between the frames and uses the phase information to achieve smoothness.

For our experiments, we will use the stretch_wo_loop time stretching function from this PV implementation.
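To make the approach concrete, here is a minimal sketch of what such a loop-free PV time stretch can look like, built on librosa's stft/istft. This is an illustrative reconstruction, not the linked implementation, whose signature and details may differ:

import librosa as lr
import numpy as np

def stretch_wo_loop(signal, factor, n_fft=2048, hop=512):

    D = lr.stft(signal, n_fft=n_fft, hop_length=hop) #analysis STFT: frequency bins x frames
    n_bins, n_frames = D.shape

    time_steps = np.arange(0, n_frames - 1, factor) #fractional analysis positions of the stretched frames
    omega = 2*np.pi*hop*np.arange(n_bins)/n_fft #expected phase advance per hop for each bin

    phase = np.angle(D[:, 0])
    D_stretched = np.zeros((n_bins, len(time_steps)), dtype=complex)

    for t, step in enumerate(time_steps):
        i = int(step)
        frac = step - i

        mag = (1 - frac)*np.abs(D[:, i]) + frac*np.abs(D[:, i + 1]) #interpolate magnitudes between neighbouring frames
        D_stretched[:, t] = mag*np.exp(1j*phase)

        dphi = np.angle(D[:, i + 1]) - np.angle(D[:, i]) - omega #measured deviation from the expected advance
        dphi -= 2*np.pi*np.round(dphi/(2*np.pi)) #wrap to [-pi, pi]
        phase += omega + dphi #accumulate phase for the next synthesis frame

    return lr.istft(D_stretched, hop_length=hop) #resynthesize at the same hop: ~1/factor of the original duration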

stretching_factor = 1.3

signal_stretched = stretch_wo_loop(signal, stretching_factor)
IPython.display.Audio(signal_stretched, rate=sr)

Signal obtained by varying the speed of the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after speed increase (image by the author)

So, the duration of the signal decreased since we increased the speed. However, one can hear that the pitch has not changed. Note that when the stretching factor is substantial, the phase interpolation between frames may not work well. As a result, echo artefacts may appear in the transformed audio.

To change the pitch without affecting the speed, we can use the same PV time stretch but pretend that the signal has a different sample rate, such that the total duration of the signal stays the same:

IPython.display.Audio(signal_stretched, rate=sr/stretching_factor)
Signal obtained by varying the pitch of the original MCV sample common_voice_en_100040 (generated by the author).

Why do we even bother with this PV when librosa already has time_stretch and pitch_shift functions? Well, these functions transform the signal back to the time domain. When you need to compute embeddings afterwards, you will lose time on redundant Fourier transforms. On the other hand, it is easy to modify the stretch_wo_loop function so that it yields the Fourier output without taking the inverse transform. One could probably also try to dig into the librosa code to achieve similar results.
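For reference, the corresponding librosa one-liners look like this; both return time-domain signals, which is exactly the overhead discussed above (n_steps=4 is just an illustrative value):

signal_fast = lr.effects.time_stretch(signal, rate=1.3) #1.3x faster, pitch preserved
signal_high = lr.effects.pitch_shift(signal, sr=sr, n_steps=4) #4 semitones higher, duration preserved

In a custom implementation like the sketch above, returning D_stretched instead of calling lr.istft yields the Fourier output directly.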

These two transformations were originally proposed in the frequency domain (Park et al. 2019). The idea was to save time on FFT by using precomputed spectra for audio augmentations. For simplicity, we will demonstrate how these transformations work in the time domain. The listed operations can easily be transferred to the frequency domain by replacing the time axis with frame indices (a frequency-domain sketch follows the time-masking example below).

Time masking

The idea of time masking is to cover up a random region of the signal. The neural network then has fewer chances to learn signal-specific temporal variations that do not generalize.

max_mask_length = 0.3 #maximum mask duration, proportion of signal length

L = len(signal)

mask_length = int(L*np.random.rand()*max_mask_length) #randomly choose mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly choose mask position

masked_signal = signal.copy()
masked_signal[mask_start:mask_start+mask_length] = 0

IPython.display.Audio(masked_signal, rate=sr)

Signal obtained by applying the time mask transformation to the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after time masking (the masked region is indicated in orange) (image by the author)
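As mentioned above, the same operation transfers to the frequency domain by masking frame indices instead of samples. A minimal sketch, assuming the STFT has already been precomputed (the 0.3 cap mirrors the time-domain example):

spec = lr.stft(signal) #precomputed complex spectrogram: frequency bins x frames

n_frames = spec.shape[1]

mask_length = int(n_frames*np.random.rand()*0.3) #randomly choose mask length in frames
mask_start = int((n_frames-mask_length)*np.random.rand()) #randomly choose mask position

masked_spec = spec.copy()
masked_spec[:, mask_start:mask_start+mask_length] = 0 #zero out the chosen frames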

Cut & splice

The idea is to replace a randomly chosen region of the signal with a random fragment from another signal having the same label. The implementation is almost the same as for time masking, except that a piece of another signal is inserted in place of the mask.

other_signal, sr = lr.load('./common_voice_en_100038.wav', res_type='kaiser_fast') #load second signal

max_fragment_length = 0.3 #maximum fragment duration, proportion of signal length

L = min(len(signal), len(other_signal))

mask_length = int(L*np.random.rand()*max_fragment_length) #randomly choose fragment length
mask_start = int((L-mask_length)*np.random.rand()) #randomly choose fragment position

synth_signal = signal.copy()
synth_signal[mask_start:mask_start+mask_length] = other_signal[mask_start:mask_start+mask_length]

IPython.display.Audio(synth_signal, rate=sr)

Synthetic signal obtained by applying the cut & splice transformation to the original MCV sample common_voice_en_100040 (generated by the author).
Signal waveform after the cut & splice transformation (the inserted fragment from the other signal is indicated in orange) (image by the author)
