Saturday, December 3, 2022

[Book Summary - Speech Technology] Introduction to Speech Technology (Holmes & Holmes, 2001)

Ch 4. Digital coding of speech

  • 3 things to consider in speech coding: data rate, speech quality, and algorithm complexity
  • Human cognitive processes cannot take account of an information rate in excess of a few tens of bits per second, implying a ratio of information transmitted to information used of between 1,000 and 10,000
  • 2 properties of speech communication: the restricted capacity of the human auditory system and the physiology of the speaking mechanism (based on the fact that the signal is known to be produced by a human talker)
  • 3 coding methods: simple waveform coders, analysis/synthesis systems, and intermediate systems

Simple waveform coders

  • copy the actual shape of the waveform produced by the microphone and its associated analogue circuits
  • consist of a band limiting filter, a sampler, and a device for coding the samples
  • types of simple waveform coders
    1. Pulse code modulation (PCM)
      i) used for feeding analogue signals into computers or other digital equipment for subsequent processing
      ii) not normally used due to the high required digit rate
      iii) does not exploit the two properties above (i.e., speech production and auditory perception), except for the limited bandwidth
    2. Delta modulation: uses its transmitted digital codes to generate a local copy of the input waveform, and chooses successive digital codes so that the copy tracks the input (see the sketch below)
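
To make the delta modulation idea concrete, here is a minimal 1-bit encoder/decoder sketch in Python (my illustration, not code from the book; the fixed step size is an assumption, since practical coders adapt it):

```python
# Minimal 1-bit delta modulation sketch (illustrative).
import numpy as np

def dm_encode(x, step=0.05):
    """Encode signal x as a stream of +1/-1 codes that track the waveform."""
    approx = 0.0
    codes = np.empty(len(x), dtype=np.int8)
    for i, sample in enumerate(x):
        # Choose the code that moves the local copy toward the input
        codes[i] = 1 if sample >= approx else -1
        approx += codes[i] * step  # update the local copy of the waveform
    return codes

def dm_decode(codes, step=0.05):
    """Reconstruct the waveform by accumulating the code stream."""
    return np.cumsum(codes.astype(float)) * step

t = np.linspace(0, 1, 8000)
x = 0.5 * np.sin(2 * np.pi * 5 * t)  # toy low-frequency input
reconstruction = dm_decode(dm_encode(x))
```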

Analysis/synthesis systems (vocoders)

  • analyze the speech signal in terms of parameters
  • the output does not need to resemble the original waveform in appearance, but should be perceptually similar
  • based on a model of speech production
  • the data are coded into frames representing speech spectra measured at intervals of 10-30 ms
  • types of vocoders
    1. channel vocoders
      i) principle: the spectrum is represented by the response of a bank of contiguous variable-gain bandpass filters, and the control signals for the channels are derived by measuring the short-term-average power from a similar set of filters fed with the input speech signal in the transmitter (see the sketch after this list)
      ii) limitation: a large number of channels is needed
      iii) performance: 15-20 channels are reasonable for communication purposes, operating at data rates of around 2,400 bits/s or lower
    2. sinusoidal coders
      i) represent the short-term spectrum of a speech signal as a sum of sinusoids specified in terms of frequency, amplitude, and phase (called sinusoidal transform coding (STC))
      ii) voiced speech is represented by harmonically related sinusoids, unvoiced speech by sinusoids with random phases
      iii) multi-band excitation (MBE) coding: voiced bands are coded as a combination of the relevant set of harmonic sinusoids; unvoiced bands with a frequency-domain method based on a white-noise excitation signal
      iv) operate in the 2,000-4,000 bits/s range
    3. linear predictive coding (LPC) vocoders
    4. formant vocoders
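
As an illustration of the channel vocoder principle in item 1, here is a rough analysis-side sketch (my own, not from the book; it approximates the bandpass filter bank with FFT band energies, and the channel count and frame length are typical values, not prescribed ones):

```python
# Rough channel-vocoder analysis sketch (illustrative).
import numpy as np

def channel_analysis(x, fs, n_channels=16, frame_ms=20):
    """Return the short-term average power in each channel, per frame.
    These per-channel levels are the control signals a channel vocoder
    would quantize and transmit."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame_len
    # Contiguous, equal-width channels up to Nyquist (a real vocoder
    # would space the channels perceptually).
    edges = np.linspace(0, fs / 2, n_channels + 1)
    freqs = np.fft.rfftfreq(frame_len, 1 / fs)
    powers = np.zeros((n_frames, n_channels))
    for f in range(n_frames):
        frame = x[f * frame_len:(f + 1) * frame_len]
        spec = np.abs(np.fft.rfft(frame)) ** 2  # short-term power spectrum
        for c in range(n_channels):
            band = (freqs >= edges[c]) & (freqs < edges[c + 1])
            powers[f, c] = spec[band].mean() if band.any() else 0.0
    return powers
```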

Intermediate systems

  • combine elements of both simple waveform coders and analysis/synthesis systems

Ch. 8 Automatic speech recognition

General principles of pattern matching

  • early methods (e.g., Hyde (1972)): rule-based approaches, which were not very successful due to co-articulation and the difficulty of phone identification
  • pattern-matching techniques: one way is to store example acoustic patterns (called templates) for all the words

Distance metrics

  • Filter-bank analysis: describes the speech as a sequence of feature vectors, which can be compared with stored templates for all the words in the vocabulary using a suitable distance metric
  • Normalization
    1. adding a small constant to the measured level before taking logarithms
    2. the square of the Euclidean distance in the multi-dimensional space
  • Dynamic time warping (DTW)
    • can deal with sequences of connected words (⇒ can solve the end-point detection problem)
    • by matching one word on to another in a way which applies the optimum non-linear timescale distortion to achieve the best match at all points
    • Score pruning: not allowing paths from relatively badly scoring points to propagate further in the DTW calculation
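
A minimal DTW sketch (my illustration), using the squared Euclidean distance mentioned above as the local distance metric; score pruning is omitted for clarity:

```python
# Minimal dynamic time warping sketch (illustrative).
import numpy as np

def dtw(a, b):
    """Cumulative distance of the best non-linear time alignment between
    two sequences of feature vectors a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Squared Euclidean distance as the local distance metric
            cost = np.sum((a[i - 1] - b[j - 1]) ** 2)
            D[i, j] = cost + min(D[i - 1, j],      # stretch a
                                 D[i, j - 1],      # stretch b
                                 D[i - 1, j - 1])  # diagonal match
    return D[n, m]
```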

Speech recognition and synthesis

  • Speech recognition: store representations of the words and choose the best match for the input
  • 3 broad kinds of synthesis
    1. Parametric: generate speech by manipulating acoustic parameters
    2. Physical (Frankenstein): model the physical speech production apparatus itself
    3. Concatenative (cheating): using pre-recorded units
  • The synthesis stage
    1. a decoding process: the waveform of the acoustical units must be reconstructed from their coded version
    2. a concatenation process: the sequence of acoustical units must be concatenated after an appropriate modification of their intrinsic prosody
  • Types of synthesis
  • Concatenative synthesis: synthesize the sounds by concatenating such elementary speech units as diphones and demi-syllables
    → pitch adjustment is a problem
  • Some pitch adjustment techniques
    1. Linear Predictive Coding (LPC): an encoding technique. Do multiple regression on the wave samples to predict any particular sample from the n samples that precede it (see the sketch after this list).
      Example) w0, w1, w2, ..., w9 (a 10-sample window), with n = 3:
        c1·w0 + c2·w1 + c3·w2 = w3
        c1·w1 + c2·w2 + c3·w3 = w4
        ...
        c1·w6 + c2·w7 + c3·w8 = w9
      → calculate c1, c2, and c3 by autoregression
      → the coefficients can then be used to reconstruct the waveform
    2. Pitch Synchronous Overlap and Add (PSOLA): a digital signal processing technique. The speech waveform is divided into small overlapping segments, which are then moved further apart or closer together in order to change the pitch.
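
A least-squares sketch of the LPC example above (order-3 predictor over a 10-sample window; the sample values are made up for illustration):

```python
# Estimate the predictor coefficients c1, c2, c3 from the example equations.
import numpy as np

def lpc_coeffs(w, order=3):
    """Fit c so that c1*w[t] + c2*w[t+1] + c3*w[t+2] ≈ w[t+3] for all t."""
    rows = np.array([w[t:t + order] for t in range(len(w) - order)])
    targets = w[order:]
    c, *_ = np.linalg.lstsq(rows, targets, rcond=None)  # least-squares fit
    return c

w = np.array([0.0, 0.3, 0.5, 0.4, 0.1, -0.2, -0.4, -0.3, 0.0, 0.3])  # toy window
c = lpc_coeffs(w)
predicted = np.array([w[t:t + 3] @ c for t in range(len(w) - 3)])  # ≈ w[3:]
```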

Ways to do pattern matching

  • Length
  • Filter banks: take a wave, break it into time windows, make a spectrum of each window, and then quantize those spectra into a vector of numbers
    → But how do we deal with data of different durations? Dynamic time warping (finding an optimal alignment between two given (time-dependent) sequences)

Front-end analysis

  • Things to consider: i) naturalness, ii) computational efficiency, and iii) results (performance assessment)
  • with regard to filter banks: i) overlapping time windows, ii) amplitude scaling, and iii) frequency scaling
    → about iii) frequency scaling:
    1. Mels: the idea is that low-frequency differences are perceived more precisely than high-frequency differences by humans, so scaling is done accordingly
    2. Cepstrum: a frequency decomposition of a time window: create a spectrum via Fourier analysis, log-transform it, and apply the Fourier transform again
    3. MEL frequency cepstral coefficients (MFCC)
      1. 25 ms overlapped time windows
      2. MEL-scaled cepstra
      3. keep the first 12 cepstral coefficients
      4. compute 12 additional delta values (a difference score between a coefficient and its horizontal neighbors in the adjacent frames)
      5. compute another 12 delta-delta values
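
A hand-rolled sketch of the MFCC steps above (illustrative only; the filter count and the use of np.gradient as a stand-in for the neighbor-difference delta scheme are my assumptions; real systems typically use a library such as librosa):

```python
# Sketch of an MFCC pipeline following the steps listed above.
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mfcc(x, fs, n_filters=26, n_ceps=12, frame_ms=25, hop_ms=10):
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    # Step 1: 25 ms overlapped, windowed frames
    frames = np.array([x[i:i + frame] * np.hamming(frame)
                       for i in range(0, len(x) - frame, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 2: triangular filters spaced evenly on the mel scale
    pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((frame + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filters, power.shape[1]))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fb[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    logmel = np.log(power @ fb.T + 1e-10)
    # Step 3: keep the first 12 cepstral coefficients
    ceps = dct(logmel, norm='ortho', axis=1)[:, :n_ceps]
    # Steps 4-5: delta and delta-delta values across frames
    delta = np.gradient(ceps, axis=0)
    ddelta = np.gradient(delta, axis=0)
    return np.hstack([ceps, delta, ddelta])  # 36-dimensional feature vectors
```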

Hidden Markov Models (HMMs)

  • A generative model that models sequential data stochastically
  • Markov model: represents a sequence consisting of states via a state transition probability matrix
  • Markov assumption: the observation at time t depends only on the most recent r observations (an order-r Markov model)
  • The basic structure: a finite number of states, a designated start state, a finite alphabet, a finite number of arcs, a probability associated with each arc, and for each state, a probability associated with each symbol
  • The current state is hidden and must be inferred from the observed information; the direct cause of each observation is unobservable, and only what results from the state(s) can be seen. Keep in mind that what is hidden is not the parameters, but the sequence of states the model goes through.
  • P(W|A) = P(A|W)P(W)/P(A); mapping to HMM notation, A corresponds to the observation sequence O and W to the model λ, so P(A|W) becomes P(O|λ)
  • Forward and backward procedures: to calculate the probability of the observation sequence given the model, P(O|λ)
  • The Viterbi algorithm: to find the optimal state sequence that best explains the observations (sketched below)
  • Baum-Welch re-estimation procedure: to optimize λ (the parameters of the model) so as to maximize P(O|λ)
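
A toy Viterbi sketch for a discrete-observation HMM (my illustration; λ = (π, A, B) in the notation above, with log-probabilities to avoid underflow):

```python
# Viterbi decoding: the most likely hidden state sequence for the data.
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi: (S,) start probabilities;
    A: (S, S) transition matrix; B: (S, V) emission matrix."""
    T, S = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]   # best log-score ending in each state
    psi = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + logA     # scores[i, j]: from state i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]           # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]
```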

https://blog.naver.com/jamiet1/221420317965

Some concepts

  • Finite state transducers (FST): a finite state automaton (FSA) that produces output as well as reading input, which makes it useful for parsing, while a bare FSA can only be used for recognizing (i.e., pattern matching); see the sketch below
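
A minimal FST sketch (my illustration): each transition maps a (state, input symbol) pair to a (next state, output symbol) pair, so the machine emits output while it recognizes:

```python
# Tiny deterministic FST: reads input symbols and emits output symbols.
def run_fst(transitions, start, finals, inputs):
    state, outputs = start, []
    for sym in inputs:
        if (state, sym) not in transitions:
            return None  # input rejected
        state, out = transitions[(state, sym)]
        outputs.append(out)
    return outputs if state in finals else None

# Toy transducer: uppercase each symbol while accepting the string
fst = {("q0", "a"): ("q0", "A"), ("q0", "b"): ("q0", "B")}
print(run_fst(fst, "q0", {"q0"}, "abba"))  # ['A', 'B', 'B', 'A']
```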
