CHAPTER 2
For a source that produces 42 symbols with equal probability, the entropy of the source is
H = log2 42 bits/symbol
= 5.39 bits/symbol
For a source that produces two symbols A and B with probabilities of 0.6 and 0.4, respectively, the entropy is
H = −{0.6 log2 0.6 + 0.4 log2 0.4} = 0.971 bits/symbol
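Both figures are easy to verify with a few lines of code, for example in Python:

```python
import math

# Equiprobable source: H = log2(M) for M = 42 symbols.
print(math.log2(42))                              # ~5.39 bits/symbol

# Binary source with P(A) = 0.6 and P(B) = 0.4.
probs = [0.6, 0.4]
print(-sum(p * math.log2(p) for p in probs))      # ~0.971 bits/symbol
```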
In ASCII, each character is represented by seven bits, and the frequency of occurrence of the English letters is not taken into consideration at all. If the frequency of occurrence is taken into consideration, the most frequently occurring letters should be represented by short code words (say, 2 bits) and the less frequently occurring letters by longer code words. By Shannon's theory, then, ASCII is not an efficient coding technique.
Note, however, that an efficient coding technique involves considerable additional processing, which causes delay in decoding the text.
You can write a program that obtains the frequency of occurrence of the English letters. The program takes a text file as input and produces the frequency of occurrence of all the letters and the space; punctuation marks can be ignored, and all letters should be converted to either uppercase or lowercase. If you apply Shannon's formula for entropy to these frequencies, you will get a value close to 4.07 bits/symbol.
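A minimal sketch of such a program in Python is shown below. The file name is an assumption, and keeping the space as a symbol is a design choice, so the exact entropy you obtain will vary with the input text.

```python
import math
from collections import Counter

def letter_entropy(path):
    """Estimate first-order entropy (bits/symbol) of letters and spaces in a text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()          # fold everything to lowercase

    # Keep only the 26 English letters and the space; ignore punctuation and digits.
    symbols = [ch for ch in text if (ch.isalpha() and ch.isascii()) or ch == " "]

    counts = Counter(symbols)
    total = sum(counts.values())

    # Shannon entropy: H = -sum(p * log2 p) over all observed symbols.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

if __name__ == "__main__":
    print(f"Entropy: {letter_entropy('sample.txt'):.2f} bits/symbol")
```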
You can modify the above program to calculate the frequencies of two-letter combinations (aa, ab, ac, … ba, bb, … zy, zz). If you again apply the formula to these digram frequencies and divide the result by two to express it per letter, you will get a value close to 3.36 bits/symbol.
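The digram version might look like the following sketch. Counting overlapping pairs of adjacent symbols and halving the pair entropy are the main changes; as before, the exact value depends on the text used.

```python
import math
from collections import Counter

def digram_entropy(path):
    """Estimate entropy per symbol from two-letter (digram) statistics."""
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()

    symbols = [ch for ch in text if (ch.isalpha() and ch.isascii()) or ch == " "]

    # Count overlapping pairs of adjacent symbols.
    pairs = Counter(zip(symbols, symbols[1:]))
    total = sum(pairs.values())

    # Entropy of the pair distribution, divided by 2 to express it per symbol.
    h_pair = -sum((n / total) * math.log2(n / total) for n in pairs.values())
    return h_pair / 2

if __name__ == "__main__":
    print(f"Digram entropy: {digram_entropy('sample.txt'):.2f} bits/symbol")
```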