Shannon Entropy Equation

2024-06-17

One of the most fundamental ideas in information theory.

Logarithm Context

Okay, first: what does a logarithm actually do?

Suppose we take a logarithm of base 2 applied to 8. In other words: to what power do we have to raise 2 to get 8?

We need to raise it to the 3rd power, so log2(8) = 3.

Let’s think about this in another way.

A bit is either a 0 or 1. So if we have a number written with 3 bits (like 010 representing 2 or 110 representing 6), we can ask 3 questions to know what number we have:

  1. is the first bit zero or one?
  2. is the second bit zero or one?
  3. is the third bit zero or one?

If we answer all three, we know which number it is.

As additional context, the base-2 logarithm gives negative values for inputs between 0 and 1. For an input of 1 the result is 0 (2 raised to the 0th power is 1).
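
A quick check in Python (just a sketch using the standard math.log2):

from math import log2

print(log2(8))    # 3.0 -> 2**3 == 8, i.e. 3 yes/no questions, 3 bits
print(log2(1))    # 0.0 -> 2**0 == 1
print(log2(0.5))  # -1.0 -> inputs between 0 and 1 give negative values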

Information value/uncertainty

The formula below gives us the information value of an event given its probability, where p(x) is the probability of event x. The log is base 2 in this context.

h(x) = -log( p(x) )

So if an event has a probability of 50%, this formula gives us 1 bit. An event with 100% probability gives 0 bits. And for 10% we get around 3.322 bits.

So what does this mean? We need approximately 3 bits to encode this event.
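
Here is a small sketch of that formula in Python (information_value is just a name I'm using here, not anything standard):

from math import log2

def information_value(p: float) -> float:
    # h(x) = -log2(p(x)): information value in bits of an event with probability p
    return -log2(p)

print(information_value(0.5))   # 1.0 bit
print(information_value(1.0))   # 0.0 bits
print(information_value(0.1))   # about 3.322 bits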

But how? By using an efficient variable-length encoding algorithm like Huffman coding, where codeword lengths depend on the probabilities of occurrence.

The information value of each die-roll result is about 2.58 bits, so in a Huffman encoding we need codewords of up to 3 bits (we can’t have a partial bit in a computer) to encode it.
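
To make that concrete, here is a rough sketch of Huffman coding for a fair die (huffman_code_lengths is a helper I made up for this post; it only tracks codeword lengths, not the actual codewords). For six equally likely faces it produces a mix of 2- and 3-bit codewords, averaging about 2.67 bits per roll, close to the 2.58-bit information value:

import heapq
from math import log2

def huffman_code_lengths(probs: dict[str, float]) -> dict[str, int]:
    # Build a Huffman tree with a min-heap, tracking only how deep each
    # symbol ends up (its codeword length). Each heap entry is
    # (probability, tiebreaker, {symbol: depth so far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

die = {str(face): 1 / 6 for face in range(1, 7)}
lengths = huffman_code_lengths(die)
print(lengths)                                      # four 3-bit and two 2-bit codewords
print(sum(die[s] * l for s, l in lengths.items()))  # about 2.67 bits on average
print(log2(6))                                      # about 2.585 bits of information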

Lower probability events require more bits to encode => they carry more information.

Entropy Equation

H(X) = - Σ [for each event x in input X] ( p(x) * log( p(x) ) )

This entropy equation gives us the average information value (uncertainty), where the average is weighted by the probability of occurrence. So on average we would get an event with that information value.
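
For example, for a fair die every p(x) is 1/6, so the sum collapses to -log2(1/6) = log2(6) ≈ 2.585 bits, the same number used for the die above. A quick sketch:

from math import log2

# Entropy of a fair six-sided die: six events, each with probability 1/6.
fair_die = [1 / 6] * 6
print(-sum(p * log2(p) for p in fair_die))  # about 2.585 bits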

Uncertainty

In the context of encoding, the term information value is quite clear. But what does uncertainty mean?

Information reduces uncertainty: the more information, the less uncertainty.

So when we roll a die, we learn about 2.58 bits of information. Learn?

Not in the sense that we learn math, but rather that our uncertainty about the state of the world is reduced.

Before you roll a die, there are six possible outcomes, each with an equal probability. You don’t know which number will come up, so there’s uncertainty. When you roll it and see the result, the uncertainty is resolved - you now know the outcome.

Example Python code

from math import log2

# Probabilities of each possible outcome; they should sum to 1.
events = {
    "win": 0.7,
    "lose": 0.1,
    "draw": 0.19,
    "end of the world": 0.01
}

def information_entropy(X: dict[str, float]) -> float:
    # H(X) = -sum( p(x) * log2(p(x)) ) over all events x
    return -sum(p * log2(p) for p in X.values())

print(information_entropy(events))

This code gives a result of around 1.21 bits.
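
For comparison, if all four outcomes were equally likely, the entropy would be at its maximum of log2(4) = 2 bits; the skewed probabilities above are what bring it down to about 1.2 bits. A quick check reusing the same function and events from above:

# Same four outcomes, but equally likely.
uniform = {name: 0.25 for name in events}
print(information_entropy(uniform))  # 2.0 bits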