Abstract
WaveNet: a deep generative neural network that produces raw audio waveforms
i) fully probabilistic (models the waveform with random variables and probability distributions)
ii) autoregressive (the predictive distribution for each audio sample is conditioned on all previous ones)
Applications: text-to-speech (a single WaveNet can capture the characteristics of many different speakers with equal fidelity), music generation, phoneme recognition, etc.
1. Introduction
Modeling joint probabilities over pixels or words as products of conditional distributions with neural architectures yields state-of-the-art generation.
WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b)
Dilated causal convolutions to deal with the long-range temporal dependencies needed for raw audio generation
Contributions:
i) generates natural-sounding raw speech signals in TTS, as rated by human listeners
ii) a single model can be used to generate different voices
iii) strong results on a small speech recognition dataset and promising results when generating other audio modalities such as music
2. WaveNet
i) each audio sample x_t is conditioned on the samples at all previous timesteps
ii) a stack of convolutional layers
iii) no pooling layers, so the time dimensionality of the output matches that of the input
iv) output: a categorical distribution over the next sample value, via a softmax
v) optimization: update parameters to maximize the log-likelihood of the data
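In equation form (as in the paper), the joint probability of a waveform x = {x_1, ..., x_T} factorizes into a product of conditionals, and training maximizes the log-likelihood under this factorization:

p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})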
Dilated causal convolutions
Causal convolution
i) By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data.
ii) the prediction emitted by the model at timestep t only depends on the previous timesteps
iii) training: parallel, generation: sequential
iv) no recurrent connections, so causal models are typically faster to train than RNNs
v) they require many layers or large filters to increase the receptive field --> dilated convolutions (next) tackle this; a minimal causal-convolution sketch follows below
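Before adding dilation, here is a minimal PyTorch sketch (not the authors' code) of a causal convolution: causality is enforced simply by left-padding the time axis so the output at time t never sees inputs after t. The class name and channel sizes are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at current and past samples."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.left_pad = kernel_size - 1              # pad only on the left
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                            # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))             # shift inputs, keep length
        return self.conv(x)                          # same time length as input
```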
Dilated convolution
i) A dilated convolution is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step
ii) large receptive fields with fewer layers
iii) the input resolution is preserved (less information loss)
iv) more expressive, since each stacked dilated layer adds another non-linearity
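A sketch of the dilated variant, assuming the paper's filter size of 2 and dilations doubled per layer (1, 2, 4, ..., 512); channel counts are illustrative. With kernel size 2, the receptive field of such a stack is 1 + sum(dilations) samples.

```python
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """Causal 1-D convolution with a dilation factor (illustrative sketch)."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation   # larger left pad keeps causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

dilations = [2 ** i for i in range(10)]                # 1, 2, 4, ..., 512
stack = nn.ModuleList(DilatedCausalConv1d(32, 2, d) for d in dilations)
print(1 + sum(dilations))                              # receptive field: 1024 samples
```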
Softmax Distributions
i) a softmax distribution tends to work better: a categorical distribution is more flexible and can model arbitrary distributions, since it makes no assumptions about their shape
ii) quantization: a mu-law companding transformation reduces the number of possible values to 256 (8 bits)
iii) non-linear (mu-law) quantization reconstructs the signal noticeably better than a simple linear quantization scheme
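A small NumPy sketch of the mu-law companding transform given in the paper, with mu = 255 and 256 quantization levels (the input is assumed to be normalized to [-1, 1]; the exact rounding scheme here is my own choice):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress x in [-1, 1] non-linearly, then quantize to 256 integer levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1) / 2 * mu + 0.5).astype(np.int64)   # 0..255

def mu_law_decode(q, mu=255):
    """Map quantized levels back to an approximate waveform in [-1, 1]."""
    compressed = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(compressed) * ((1 + mu) ** np.abs(compressed) - 1) / mu
```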
Gated Activation Units: the same gated activation unit as used in the gated PixelCNN
i) Element-wise multiplication of filter and gate
ii) Filter: a dilated convolution followed by a tanh activation; extracts local features at each layer
iii) Gate: a dilated convolution followed by a sigmoid activation; decides how much of the filter's information is passed on to the next layer
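The gated unit can be written as a tiny helper combining the two branches; filter_conv and gate_conv stand for the dilated causal convolutions sketched above (a sketch, not the paper's implementation):

```python
import torch

def gated_activation(x, filter_conv, gate_conv):
    """z = tanh(W_f * x) ⊙ sigmoid(W_g * x), as in the gated PixelCNN."""
    return torch.tanh(filter_conv(x)) * torch.sigmoid(gate_conv(x))
```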
Residual and Skip Connections
i) to speed up convergence and enable training of much deeper models
ii) 1×1 convolutions: cheap to compute; used to reshape the number of channels for the residual and skip paths
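Continuing the sketch above (reusing the DilatedCausalConv1d class), one WaveNet-style layer might look like this: the gated output goes through two 1×1 convolutions, one feeding the residual path to the next layer and one feeding the skip path that is summed over all layers. Channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Gated dilated conv + 1x1 convs for the residual and skip paths (sketch)."""
    def __init__(self, channels, skip_channels, kernel_size=2, dilation=1):
        super().__init__()
        self.filter_conv = DilatedCausalConv1d(channels, kernel_size, dilation)
        self.gate_conv = DilatedCausalConv1d(channels, kernel_size, dilation)
        self.res_1x1 = nn.Conv1d(channels, channels, 1)        # back onto the residual path
        self.skip_1x1 = nn.Conv1d(channels, skip_channels, 1)  # collected across all layers

    def forward(self, x):
        z = torch.tanh(self.filter_conv(x)) * torch.sigmoid(self.gate_conv(x))
        skip = self.skip_1x1(z)
        residual = self.res_1x1(z) + x     # residual connection speeds up convergence
        return residual, skip
```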
Conditional WaveNets
i) an additional input h to model the conditional distribution p(x|h) of the audio
ii) by conditioning the model on other input variables, we can guide WaveNet's generation to produce audio with the required characteristics
iii) two ways of conditioning: global conditioning (a single latent representation h, e.g. a speaker embedding, influences the output distribution across all timesteps) and local conditioning (a time series h_t with lower temporal resolution than the audio is first upsampled with a transposed convolutional network to a new time series y = f(h) at the audio resolution, which is then used in the activation unit); see the equations below
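The conditional gated activations from the paper (k indexes the layer, \ast is convolution, \odot element-wise multiplication). For global conditioning, V^T h is a learned linear projection broadcast over every timestep; for local conditioning, V \ast y is a 1×1 convolution applied to the upsampled series y = f(h):

z = \tanh\left(W_{f,k} \ast x + V_{f,k}^{T} h\right) \odot \sigma\left(W_{g,k} \ast x + V_{g,k}^{T} h\right)

z = \tanh\left(W_{f,k} \ast x + V_{f,k} \ast y\right) \odot \sigma\left(W_{g,k} \ast x + V_{g,k} \ast y\right), \quad y = f(h)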
Context Stacks: another way to increase the receptive field
i) a complementary approach is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal
ii) the main WaveNet only sees a shorter range of the signal directly and is locally conditioned on the context stack's output
3. Experiments
Multi-speaker-speech generation
i) Dataset: English multi-speaker corpus from the CSTR Voice Cloning Toolkit (VCTK)
ii) conditioning was applied by feeding the speaker ID to the model as a one-hot vector (see the sketch after this list)
iii) adding speakers resulted in better validation-set performance than training on a single speaker alone
iv) the model also picked up on characteristics of the audio other than the voice itself (e.g., acoustics and recording quality)
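A hypothetical snippet for item ii): turning a speaker ID into a one-hot vector for global conditioning (109 is the number of VCTK speakers used in the paper; the projection mentioned in the comment corresponds to the V^T h term above):

```python
import torch
import torch.nn.functional as F

num_speakers = 109                        # VCTK speakers used in the paper
speaker_id = torch.tensor([7])            # illustrative speaker index
h = F.one_hot(speaker_id, num_classes=num_speakers).float()   # shape (1, 109)
# h is then linearly projected and added inside every gated activation,
# as in the global-conditioning equation above.
```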
Text-To-Speech
i) Dataset: single-speaker North American English and Mandarin Chinese speech datasets
ii) locally conditioned on linguistic features derived from the input text, plus logarithmic fundamental frequency (log F0) values
iii) the WaveNet conditioned on both linguistic features and log F0 performed best, outperforming the baseline TTS systems in subjective listening tests
Music
i) Datasets: the MagnaTagATune dataset and a YouTube piano dataset
ii) conditional music models, which can generate music given a set of tags specifying e.g. genre or instruments
Speech Recognition
i) Dataset: TIMIT
ii) a discriminative task rather than a generative one
iii) for this task, a mean-pooling layer was added after the dilated convolutions that aggregated the activations into coarser frames spanning 10 ms (see the sketch after this list)
iv) best score on TIMIT reported for a model trained directly on raw audio (18.8 PER on the test set)
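A sketch of item iii)'s pooling step, assuming 16 kHz TIMIT audio so that 10 ms corresponds to 160 samples; tensor shapes are illustrative:

```python
import torch
import torch.nn.functional as F

activations = torch.randn(1, 32, 16000)          # (batch, channels, time) - dummy data
frames = F.avg_pool1d(activations, kernel_size=160, stride=160)   # -> (1, 32, 100)
# Each output frame now spans 10 ms; further (non-causal) layers and a classifier
# would follow for phoneme prediction.
```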
4. Conclusion
i) Deep generative model for audio data (waveform)
ii) autoregressive, dilated convolution
iii) can be conditioned on other inputs in a global way (e.g., speaker identity) or a local way (e.g., linguistic features)
iv) promising results when applied to music audio modeling and speech recognition