Tuesday, December 20, 2022

[Book Summary - Information Theory] Information Theory: A Tutorial Introduction, Chs 1-2 by James V. Stone

1. Useful information 

    • (maximize) informative redundancy 

    • (minimize) non-informative redundancy 


2. Noise
 

3. Minimizing the "effort" required for transmitting each portion of an informative signal: selection in biology/language/culture 
  • What is "effort"?  
    i) articulatory effort? how much? (e.g., Spanish vs. English vowels) 

    ii) cognitive effort? 

    iii) the motor plan for producing similar sounds 

    iv) sounds including some noise 


4. Bits and binary digits 

  • Bits: splitting the possible (meaning) space in half / quantity of the info 
  • Binary digits: each place in a binary number (a number written with binary digits); a binary digit conveys at most 1 bit of information. 
  • Bytes: 8 binary digits. 

m outcomes = 2^n binary choices 

n binary choices = log2(m outcomes) 

n bits = log2(m outcomes) 
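
A quick sanity check of these relationships in Python (a minimal sketch; the example alphabet sizes are just for illustration):

    import math

    # n binary choices (bits) can distinguish m = 2**n equally likely outcomes.
    for n in range(1, 5):
        m = 2 ** n
        print(f"{n} binary choices -> {m} outcomes; log2({m}) = {math.log2(m)} bits")

    # A non-power-of-two alphabet follows the same rule: 26 equally likely
    # letters would need log2(26) bits each.
    print(math.log2(26))  # ~4.700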

 

 

5. Random variable: e.g., X(head) = 1, X(tail) = 0 / a kind of function  

6. So it is really a function (a mapping from outcomes to numbers), rather than an actual variable? 
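
A minimal sketch of the coin example, treating the random variable X as literally a mapping from outcomes to numbers (the dictionary representation is just one way to write it):

    import random

    # X maps each raw outcome of a coin flip to a number.
    X = {"head": 1, "tail": 0}

    outcome = random.choice(["head", "tail"])  # the underlying random experiment
    value = X[outcome]                         # the value the random variable takes
    print(outcome, "->", value)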

7. Schema for message transmission: a message is an ordered sequence of symbols corresponding to outcome values of random variables (a toy sketch of the full pipeline follows this list) 

  • Data: meaning… social… discourse… / linear symbols…? /  
  • Encoder: maps the message to codewords taken from a codebook (cf. symbols vs. codewords: they might be the same, but not necessarily; for example, frequent words can be encoded with the fewest symbols and infrequent words with the most) 
  • Input x: can be identical to the message, or transformed by some mapping (e.g., compressed by removing redundancy) 
  • Channel: noise…? (can obscure the speech - the brain filters out the noise…?) need to know the difference between noise and redundancy 
    • With regard to top-down processing: redundancy is useful in that it confirms… (e.g., Los Angeles) 
  • Output y: lossless, lossy 
  • Channel capacity: the maximum information per codeword; information rate = information per second, which can approach but not exceed the channel capacity…? 
    • Fewer contrasts: less information per position, so moving from position to position is fast, but more positions (more time overall) are needed 
    • More contrasts: more information per position, so moving from position to position is slower, but fewer positions (less time overall) are needed 


  • Noise reduces channel capacity
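
To make the moving parts of item 7 concrete, here is a toy end-to-end sketch (message → encoder → noisy channel → decoder). The codebook, symbols, and noise level are all made up for illustration and are not from the book:

    import random

    # Hypothetical codebook: frequent symbols get short codewords (cf. the encoder bullet above).
    CODEBOOK = {"e": "0", "t": "10", "a": "110", "q": "111"}
    DECODEBOOK = {v: k for k, v in CODEBOOK.items()}

    def encode(message):
        """Map each symbol of the message to its codeword (the input x)."""
        return "".join(CODEBOOK[s] for s in message)

    def channel(x, flip_prob=0.05):
        """Noisy binary channel: each binary digit is flipped with probability flip_prob."""
        return "".join(b if random.random() > flip_prob else str(1 - int(b)) for b in x)

    def decode(y):
        """Greedy prefix-code decoding of the (possibly corrupted) output y."""
        out, buf = [], ""
        for b in y:
            buf += b
            if buf in DECODEBOOK:
                out.append(DECODEBOOK[buf])
                buf = ""
        return "".join(out)

    message = "teaeat"
    x = encode(message)
    y = channel(x)
    print(message, "->", x, "->", y, "->", decode(y))

With flip_prob = 0 the output is lossless; with noise, some codewords are corrupted and the decoded message degrades, which is the sense in which noise reduces the usable capacity of the channel.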

8. Surprisal and entropy 

  • Surprisal = log(1/p(x)), or - log(p(x)) 
  • If we use log2, surprisal is measured in bits. 
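
Surprisal computed straight from the definition above (a minimal sketch; the example probabilities are arbitrary):

    import math

    def surprisal_bits(p):
        """Surprisal of an outcome with probability p, in bits: log2(1/p) = -log2(p)."""
        return -math.log2(p)

    print(surprisal_bits(0.5))    # 1.0 bit  (a fair coin flip)
    print(surprisal_bits(0.125))  # 3.0 bits (a 1-in-8 outcome)
    print(surprisal_bits(0.99))   # ~0.014 bits (barely surprising at all)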

9. Entropy 

  • Average surprisal of the elements in a system: H(X) = sum over outcomes x of p(x) log2(1/p(x)) 
  • How do we estimate this in practice? 
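
One common answer (under the i.i.d./stationarity assumptions listed below): estimate each outcome's probability by its relative frequency in a sample, then average the surprisals. A minimal plug-in sketch with made-up samples:

    import math
    from collections import Counter

    def entropy_bits(sample):
        """Plug-in entropy estimate: H(X) = sum of p(x) * log2(1/p(x)),
        with p(x) estimated by relative frequency in the sample."""
        counts = Counter(sample)
        n = len(sample)
        return sum((c / n) * math.log2(n / c) for c in counts.values())

    print(entropy_bits("abab"))      # 1.0 bit per symbol (two equally likely symbols)
    print(entropy_bits("aaab"))      # ~0.811 bits (skewed distribution -> lower entropy)
    print(entropy_bits("abcdefgh"))  # 3.0 bits (eight equally likely symbols)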
10. Entropy of a system

11. Some assumptions
  • Independent and Identically Distributed (a.k.a. i.i.d.)
  • Stationary: the distribution doesn't change through time
  • Ergodic: the statistics of the whole system can be recovered from a single, reasonably long sequence of measurements 
12. Uniform Information Density Hypothesis: if language is optimized for information transmission, it should tend toward roughly equal surprisal per symbol per unit time 
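
A toy illustration, assuming we already have per-word surprisals from some language model (the numbers below are invented): UID predicts a preference for the wording whose surprisal is spread more evenly over time.

    import statistics

    # Hypothetical per-word surprisals (bits) for two paraphrases of the same message.
    bursty = [0.5, 0.5, 9.0, 0.5, 0.5]   # information packed into one very surprising word
    even = [2.2, 2.2, 2.2, 2.2, 2.2]     # the same total information spread evenly

    for name, s in [("bursty", bursty), ("even", even)]:
        print(name, "total:", sum(s), "variance:", round(statistics.variance(s), 2))
    # UID: lower variance in surprisal per word (per unit time) is preferred, since
    # peaks risk exceeding the channel capacity and troughs waste it.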

Discussion: Why is the actual entropy of written English letters lower than the entropy estimated from the individual probabilities of letters in a big sample of text? Because the letters are not independent! Given the surrounding context, the distribution over the next letter is more skewed, which lowers the entropy. 
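
A minimal sketch of this point on a tiny made-up text (a real estimate would need a large corpus): the per-letter entropy computed from individual letter probabilities ignores context, while the conditional entropy of a letter given the previous letter does not, and the second number comes out lower.

    import math
    from collections import Counter

    text = "the theory of information is the theory of the theory"  # toy stand-in for a big sample

    def H(counts):
        """Plug-in entropy (bits) of a Counter of outcome counts."""
        n = sum(counts.values())
        return sum((c / n) * math.log2(n / c) for c in counts.values())

    # Entropy estimated from individual letter probabilities (pretends letters are independent).
    unigram_H = H(Counter(text))

    # Conditional entropy of a letter given the previous letter: the entropy of each
    # context's next-letter distribution, weighted by how often that context occurs.
    pairs = list(zip(text, text[1:]))
    by_context = {}
    for a, b in pairs:
        by_context.setdefault(a, Counter())[b] += 1
    total = len(pairs)
    conditional_H = sum(sum(c.values()) / total * H(c) for c in by_context.values())

    print(round(unigram_H, 3), ">", round(conditional_H, 3))  # context lowers the entropy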
