Sunday, December 4, 2022

[Paper Review - Speech Technology, NLP] WaveNet: A Generative Model for Raw Audio (van den Oord et al., 2016)

Abstract

WaveNet: a deep generative neural network that produces raw audio waveforms

i) fully probabilistic (the waveform is modeled with random variables and probability distributions)

ii) autoregressive (the predictive distribution for each audio sample conditioned on all previous ones)

Applications: text-to-speech (where a single WaveNet can capture the characteristics of many different speakers with equal fidelity), music generation, phoneme recognition, etc.

1. Introduction

Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields SOTA generation.

WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b)

Dilated causal convolutions are used to deal with the long-range temporal dependencies needed for raw audio generation

Contributions: 

i) generates natural-sounding raw speech signals in TTS

ii) a single model can be used to generate different voices

iii) strong results when tested on a small speech recognition dataset and promising when used to generate other audio modalities such as music

2. WaveNet

i) each audio sample x_t is conditioned on the samples at all previous timesteps (see the factorization below)

ii) a stack of convolutional layers

iii) no pooling layers, thus the time dimensionality of output and that of input is the same as each other

iv) output: categorical value via softmax

v) optimization: update parameters to maximize log likelihood
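
The joint probability of a waveform x = {x_1, ..., x_T} is factorised as a product of conditionals (Eq. 1 in the paper); the second line simply writes out item v, the log-likelihood objective, in symbols:

```latex
% Autoregressive factorisation of the waveform distribution (Eq. 1 in the paper)
p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

% Item v written out: choose parameters \theta to maximise the log-likelihood
\max_{\theta} \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_1, \ldots, x_{t-1})
```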

Dilated causal convolutions

Causal convolution

i) By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data (a minimal sketch is given after this list)

ii) the prediction emitted by the model at timestep t only depends on the previous timesteps

iii) training: parallel, generation: sequential

iv) no recurrent connections, so training is typically faster than with RNNs

v) requires many layers or large filters to increase the receptive field --> dilated convolutions can tackle this!
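
A minimal sketch of a causal convolution, assuming PyTorch; the class name and shapes are illustrative, not from the paper. Left-padding by (kernel_size - 1) means the output at timestep t only sees inputs at timesteps <= t, and the output keeps the input's time length:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that never looks at future timesteps."""
    def __init__(self, channels, kernel_size=2):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))      # pad on the left (past) only
        return self.conv(x)              # same time length as the input
```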

Dilated convolution

i) A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step

ii) large receptive fields with fewer layers (a stack with exponentially increasing dilations is sketched after this list)

iii) the input's time resolution is preserved (less information loss than with pooling or strided downsampling)

iv) more discriminative because of more non-linear calculations
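
A minimal sketch of a stack of dilated causal convolutions, assuming PyTorch; names are illustrative. Dilations double at every layer (1, 2, 4, ..., 512) as in the paper, so with kernel size 2 the 10-layer stack below has a receptive field of 1 + (1 + 2 + ... + 512) = 1024 samples; a plain ReLU stands in for the gated activation unit covered later in the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    def __init__(self, channels, kernel_size=2, num_layers=10):
        super().__init__()
        self.pads, self.convs = [], nn.ModuleList()
        for i in range(num_layers):
            dilation = 2 ** i                                # 1, 2, 4, ..., 512
            self.pads.append((kernel_size - 1) * dilation)   # causal left-padding
            self.convs.append(nn.Conv1d(channels, channels,
                                        kernel_size, dilation=dilation))

    def forward(self, x):                                    # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            # ReLU here is only a placeholder for the gated activation unit below
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return x
```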

Softmax Distributions

i) a softmax distribution tends to work better because a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape

ii) quantization: a mu-law companding transformation reduces the number of possible values from 65,536 (16-bit audio) to 256 (see the sketch below)

iii) non-linear quantization (mu-law companding) reconstructs audio significantly better than a simple linear quantization scheme
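
A minimal sketch of mu-law companding and quantization, assuming NumPy and a waveform already scaled to (-1, 1). The transform f(x) = sign(x) * ln(1 + mu|x|) / ln(1 + mu) with mu = 255 is the one given in the paper; the bin mapping is just a straightforward way to turn it into 256 classes:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand with f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), then quantize."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((compressed + 1) / 2 * mu).astype(np.int64)   # class index in [0, 255]

def mu_law_decode(q, mu=255):
    """Map class indices back to waveform amplitudes (inverse companding)."""
    compressed = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu
```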

Gated Activation Units: the same gated activation unit as used in the gated PixelCNN

i) Element-wise multiplication of a filter and a gate (see the sketch below)

ii) Filter: dilated convolution followed by a tanh activation / extracts local features at each layer

iii) Gate: dilated convolution followed by a sigmoid activation / decides how much of the filter's information is passed to the next layer
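
A minimal sketch of the gated activation unit z = tanh(W_f * x) ⊙ σ(W_g * x), assuming PyTorch; module and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedActivation(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))                # causal padding
        filt = torch.tanh(self.filter_conv(x))     # filter: what features to extract
        gate = torch.sigmoid(self.gate_conv(x))    # gate: how much of them to pass on
        return filt * gate                         # element-wise multiplication
```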

Residual and Skip Connections

i) to speed up convergence and enable training of much deeper models

ii) 1 * 1 convolutions: a cheap (fewer computations) way of reshaping the channel dimension for the residual and skip paths (see the block sketch below)
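
A minimal sketch of one residual block with a skip connection, assuming PyTorch and the GatedActivation module sketched above; channel sizes and names are illustrative. The 1 * 1 convolutions reshape the gated output for the residual path and the skip path:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, residual_channels, skip_channels, dilation):
        super().__init__()
        self.gated = GatedActivation(residual_channels, dilation=dilation)
        self.res_conv = nn.Conv1d(residual_channels, residual_channels, 1)   # 1x1
        self.skip_conv = nn.Conv1d(residual_channels, skip_channels, 1)      # 1x1

    def forward(self, x):                  # x: (batch, residual_channels, time)
        z = self.gated(x)
        skip = self.skip_conv(z)           # summed over all blocks for the output head
        residual = self.res_conv(z) + x    # residual connection eases deep training
        return residual, skip
```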

Conditional WaveNets

i) an additional input h to model the conditional distribution p(x|h) of the audio

ii) by conditioning the model on other input variables, we can guide WaveNet's generation to produce audio with the required characteristics

iii) conditioning the model on other inputs in 2 ways: global conditioning, where a single latent representation h (e.g., a speaker embedding) influences the output distribution across all timesteps, and local conditioning, where a time series h_t is first upsampled with a transposed convolutional network to a new time series y = f(h) with the same resolution as the audio signal and then used in the activation unit (both variants are written out below)
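
The conditioning enters the gated activation unit; the two variants from the paper (per-layer indices omitted) are, for global conditioning with a single vector h and for local conditioning with the upsampled series y = f(h):

```latex
% Global conditioning: V^T h is a learned projection broadcast over all timesteps
\mathbf{z} = \tanh\left(W_{f} * \mathbf{x} + V_{f}^{T}\mathbf{h}\right)
        \odot \sigma\left(W_{g} * \mathbf{x} + V_{g}^{T}\mathbf{h}\right)

% Local conditioning: V * y is a 1x1 convolution over the upsampled series y = f(h)
\mathbf{z} = \tanh\left(W_{f} * \mathbf{x} + V_{f} * \mathbf{y}\right)
        \odot \sigma\left(W_{g} * \mathbf{x} + V_{g} * \mathbf{y}\right)
```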

Context Stacks: another way to increase the receptive field

i) a complementary approach is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal

ii) the larger WaveNet that processes only the recent part of the signal remains the main model; the context stack feeds it long-range information through local conditioning

3. Experiments

Multi-Speaker Speech Generation

i) Dataset: English multi-speaker corpus from the CSTR voice cloning toolkit (VCTK)

ii) The conditioning was applied by feeding the speaker ID to the model in the form of a one-hot vector

iii) Adding speakers resulted in better validation set performance compared to training solely on a single speaker

iv) The model picked up on other characteristics in the audio apart from the voice itself (e.g., the acoustics and recording quality)

Text-To-Speech

i) Datasets: single-speaker speech databases of North American English and Mandarin Chinese

ii) locally conditioned on linguistic features derived from the input texts; an additional variant was conditioned on logarithmic fundamental frequency (log F0) values on top of the linguistic features

iii) WaveNet conditioned on both linguistic features and log F0 was the winner in the subjective evaluations (paired comparison and MOS tests)!

Music

i) Datasets: the MagnaTagATune dataset and a YouTube piano dataset

ii) conditional music models, which can generate music given a set of tags specifying e.g. genre or instruments

Speech Recognition

i) Dataset: TIMIT

ii) speech recognition is a discriminative task rather than a generative one

iii) For this task, a mean-pooling layer was added after the dilated convolutions to aggregate the activations into coarser frames spanning 10 ms (see the sketch after this list)

iv) best score reported for a model trained directly on raw audio on TIMIT (18.8 PER on the test set)
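
A minimal sketch of the mean-pooling step, assuming PyTorch and 16 kHz audio so that a 10 ms frame corresponds to 160 samples (the sample count per frame is my inference from the sampling rate, not code from the paper):

```python
import torch
import torch.nn as nn

frame = 160                                  # 10 ms at 16 kHz (assumed sampling rate)
pool = nn.AvgPool1d(kernel_size=frame, stride=frame)

activations = torch.randn(1, 64, 16000)      # (batch, channels, 1 s of activations)
frames = pool(activations)                   # -> (1, 64, 100): one vector per 10 ms frame
print(frames.shape)
```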

4. Conclusion

i) Deep generative model for audio data (waveform)

ii) autoregressive, dilated convolution

iii) can be conditioned on other inputs in a global (e.g., speaker identity) or local way (e.g., linguistic features)

iv) promising results when applied to music audio modeling and speech recognition
