### Entropy Modulo a Prime (Continued)

#### Posted by Tom Leinster

In the comments last time, a conversation got going about *$p$-adic*
entropy. But here I’ll return to the original subject: entropy *modulo
$p$*. I’ll answer the question:

Given a “probability distribution” mod $p$, that is, a tuple $\pi = (\pi_1, \ldots, \pi_n) \in (\mathbb{Z}/p\mathbb{Z})^n$ summing to $1$, what is the right definition of its entropy $H_p(\pi) \in \mathbb{Z}/p\mathbb{Z}?$

How will we know when we’ve got the right definition? As I explained last time, the acid test is whether it satisfies the chain rule

$H_p(\gamma \circ (\pi^1, \ldots, \pi^n)) = H_p(\gamma) + \sum_{i = 1}^n \gamma_i H_p(\pi^i).$

This is supposed to hold for all $\gamma = (\gamma_1, \ldots, \gamma_n) \in \Pi_n$ and $\pi^i = (\pi^i_1, \ldots, \pi^i_{k_i}) \in \Pi_{k_i}$, where $\Pi_n$ is the hyperplane

$\Pi_n = \{ (\pi_1, \ldots, \pi_n) \in (\mathbb{Z}/p\mathbb{Z})^n : \pi_1 + \cdots + \pi_n = 1\},$

whose elements we’re calling “probability distributions” mod $p$. And if
God is smiling on us, $H_p$ will be essentially the *only* quantity
that satisfies the chain rule. Then we’ll know we’ve got the right
definition.

Black belts in functional equations will be able to use the chain rule and
nothing else to work out what $H_p$ must be. But the rest of us might like
an extra clue, and we have one in the definition of *real* Shannon entropy:

$H_\mathbb{R}(\pi) = - \sum_{i: \pi_i \neq 0} \pi_i \log \pi_i.$

Now, we saw last time that there is no logarithm mod $p$; that is, there is no nontrivial group homomorphism

$(\mathbb{Z}/p\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}.$

But there *is* a next-best thing: a homomorphism

$(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}.$

This is called the Fermat quotient $q_p$, and it’s given by

$q_p(n) = \frac{n^{p - 1} - 1}{p} \in \mathbb{Z}/p\mathbb{Z}.$

Let’s go through why this works.

The elements of $(\mathbb{Z}/p^2\mathbb{Z})^\times$ are the congruence classes mod $p^2$ of the integers not divisible by $p$. Fermat’s little theorem says that whenever $n$ is not divisible by $p$,

$\frac{n^{p - 1} - 1}{p}$

is an integer. This, or rather its congruence class mod $p$, is the Fermat quotient. The congruence class of $n$ mod $p^2$ determines the congruence class of $n^{p - 1} - 1$ mod $p^2$, and it therefore determines the congruence class of $(n^{p - 1} - 1)/p$ mod $p$. So, $q_p$ defines a function $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$. It’s a pleasant exercise to show that it’s a homomorphism. In other words, $q_p$ has the log-like property

$q_p(m n) = q_p(m) + q_p(n)$

for all integers $m, n$ not divisible by $p$.
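
These properties are easy to check numerically. Here is a small Python sketch (the function name is mine, not from the post) verifying that $q_p$ is well defined on $(\mathbb{Z}/p^2\mathbb{Z})^\times$ and that it has the log-like property:

```python
def fermat_quotient(n: int, p: int) -> int:
    """The Fermat quotient q_p(n) = (n^(p-1) - 1)/p mod p, for p not dividing n."""
    assert n % p != 0, "q_p is only defined on units"
    # The division is exact: n^(p-1) ≡ 1 (mod p) by Fermat's little theorem.
    return ((pow(n, p - 1) - 1) // p) % p

p = 7

# q_p(n) depends only on the class of n mod p^2 ...
assert all(fermat_quotient(n, p) == fermat_quotient(n + p * p, p)
           for n in range(1, 100) if n % p)

# ... and is a homomorphism: q_p(mn) = q_p(m) + q_p(n) mod p.
assert all(fermat_quotient(m * n, p)
           == (fermat_quotient(m, p) + fermat_quotient(n, p)) % p
           for m in range(1, 50) for n in range(1, 50) if m % p and n % p)
```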

In fact, it’s essentially unique: any other homomorphism $(\mathbb{Z}/p^2\mathbb{Z})^\times \to \mathbb{Z}/p\mathbb{Z}$ is a scalar multiple of $q_p$. (This follows from the classical theorem that the group $(\mathbb{Z}/p^2\mathbb{Z})^\times$ is cyclic.) It’s just like the fact that up to a scalar multiple, the real logarithm is the unique measurable function $\log : (0, \infty) \to \mathbb{R}$ such that $\log(x y) = \log x + \log y$, but here there’s nothing like measurability complicating things.

So: $q_p$ functions as a kind of logarithm. Given a mod $p$ probability distribution $\pi \in \Pi_n$, we might therefore guess that the right definition of its entropy is

$- \sum_{i : \pi_i \neq 0} \pi_i q_p(a_i),$

where $a_i$ is an integer representing $\pi_i \in \mathbb{Z}/p\mathbb{Z}$.

However, this doesn’t work. It depends on the choice of representatives $a_i$.
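
To see the failure concretely, here is a small Python computation (a sketch; the helper names are my own). Replacing one representative by another integer in the same class mod $p$ changes the proposed sum:

```python
def fermat_quotient(n, p):
    """q_p(n) = (n^(p-1) - 1)/p mod p, for n not divisible by p."""
    return ((pow(n, p - 1) - 1) // p) % p

def naive_entropy(reps, p):
    """The naive guess -Σ π_i q_p(a_i) mod p, computed from chosen representatives a_i."""
    return (-sum(a * fermat_quotient(a, p) for a in reps if a % p)) % p

p = 5
# Two choices of representatives for the same mod-5 distribution (2, 4):
print(naive_entropy([2, 4], p))  # → 0
print(naive_entropy([2, 9], p))  # → 1, even though 9 ≡ 4 (mod 5)
```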

To get the right answer, we’ll look at real entropy in a slightly different way. Define $\partial_\mathbb{R}: [0, 1] \to \mathbb{R}$ by

$\partial_\mathbb{R}(x) = \begin{cases} - x \log x & \text{if } x \neq 0, \\ 0 & \text{if } x = 0. \end{cases}$

Then $\partial_\mathbb{R}$ has the derivative-like property

$\partial_\mathbb{R}(x y) = x \partial_\mathbb{R}(y) + \partial_\mathbb{R}(x) y.$

A *linear* map with this property is called a derivation, so it’s
reasonable to call $\partial_\mathbb{R}$ a **nonlinear derivation**.

The observation that $\partial_\mathbb{R}$ is a nonlinear derivation turns out to be quite useful. For instance, real entropy is given by

$H_\mathbb{R}(\pi) = \sum_{i = 1}^n \partial_\mathbb{R}(\pi_i)$

($\pi \in \Pi_n$), and verifying the chain rule for $H_\mathbb{R}$ is done most neatly using the derivation property of $\partial_\mathbb{R}$.

An equivalent formula for real entropy is

$H_\mathbb{R}(\pi) = \sum_{i = 1}^n \partial_\mathbb{R}(\pi_i) - \partial_\mathbb{R}\biggl( \sum_{i = 1}^n \pi_i \biggr).$

This is a triviality: $\sum \pi_i = 1$, so $\partial_\mathbb{R}\bigl( \sum \pi_i \bigr) = 0$, so this is the same as the previous formula. But it’s also quite suggestive: $H_\mathbb{R}(\pi)$ measures the extent to which the nonlinear derivation $\partial_\mathbb{R}$ fails to preserve the sum $\sum \pi_i$.

Now let’s try to imitate this in $\mathbb{Z}/p\mathbb{Z}$. Since $q_p$ plays a similar role to $\log$, it’s natural to define

$\partial_p(n) = -n q_p(n) = \frac{n - n^p}{p}$

for integers $n$ not divisible by $p$. But the last expression makes sense
even if $n$ *is* divisible by $p$. So, we can define a function

$\partial_p : \mathbb{Z}/p^2\mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$

by $\partial_p(n) = (n - n^p)/p$. (This is called a $p$-derivation.) It’s easy to check that $\partial_p$ has the derivative-like property

$\partial_p(m n) = m \partial_p(n) + \partial_p(m) n.$
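
Both the exactness of the division by $p$ and this Leibniz-like property can be checked numerically. A Python sketch (the function name is mine):

```python
def p_derivation(n: int, p: int) -> int:
    """The p-derivation ∂_p(n) = (n - n^p)/p mod p, defined for ALL integers n."""
    # n - n^p is divisible by p for every integer n, by Fermat's little theorem.
    return ((n - n ** p) // p) % p

p = 5
assert p_derivation(0, p) == 0 and p_derivation(1, p) == 0

# The derivative-like property ∂_p(mn) = m ∂_p(n) + ∂_p(m) n mod p:
for m in range(-10, 20):
    for n in range(-10, 20):
        assert p_derivation(m * n, p) == (m * p_derivation(n, p)
                                          + p_derivation(m, p) * n) % p
```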

And now we arrive at the long-awaited definition. The **entropy mod $p$**
of $\pi = (\pi_1, \ldots, \pi_n)$ is

$H_p(\pi) = \sum_{i = 1}^n \partial_p(a_i) - \partial_p\biggl( \sum_{i = 1}^n a_i \biggr),$

where $a_i \in \mathbb{Z}$ represents $\pi_i \in \mathbb{Z}/p\mathbb{Z}$. This is independent of the choice of representatives $a_i$. And when you work it out explicitly, it gives

$H_p(\pi) = \frac{1}{p} \biggl( 1 - \sum_{i = 1}^n a_i^p \biggr).$

Just as in the real case, $H_p$ satisfies the chain rule, which is most easily shown using the derivation property of $\partial_p$.
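
Here is the definition in code, together with a spot check of the chain rule for $p = 5$ (a Python sketch; the function names and the particular distributions are my own choices):

```python
def entropy_mod_p(pi, p):
    """H_p(π) = (1 - Σ a_i^p)/p mod p, using the representatives a_i in {0, ..., p-1}."""
    a = [x % p for x in pi]
    assert sum(a) % p == 1, "not a probability distribution mod p"
    # 1 - Σ a_i^p is divisible by p, since Σ a_i^p ≡ Σ a_i ≡ 1 (mod p).
    return ((1 - sum(x ** p for x in a)) // p) % p

def compose(gamma, pis, p):
    """The composite distribution γ ∘ (π^1, ..., π^n): entries γ_i π^i_j."""
    return [(g * x) % p for g, pi in zip(gamma, pis) for x in pi]

p = 5
gamma = (2, 4)                 # 2 + 4 ≡ 1 (mod 5)
pi1, pi2 = (3, 3), (2, 4, 0)   # each sums to 1 mod 5

lhs = entropy_mod_p(compose(gamma, [pi1, pi2], p), p)
rhs = (entropy_mod_p(gamma, p)
       + sum(g * entropy_mod_p(pi, p) for g, pi in zip(gamma, [pi1, pi2]))) % p
assert lhs == rhs  # the chain rule holds for this example
```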

Before I say any more, let’s have some examples.

In the real case, the uniform distribution $u_n = (1/n, \ldots, 1/n)$ has entropy $\log n$. Mod $p$, this distribution only makes sense if $p$ does not divide $n$ (otherwise $1/n$ is undefined); but assuming that, we do indeed have $H_p(u_n) = q_p(n)$, as we’d expect.
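
We can confirm this numerically. A Python sketch (helper names are mine; `pow(n, -1, p)` for the inverse mod $p$ needs Python 3.8+):

```python
def entropy_mod_p(pi, p):
    """H_p(π) = (1 - Σ a_i^p)/p mod p."""
    a = [x % p for x in pi]
    assert sum(a) % p == 1
    return ((1 - sum(x ** p for x in a)) // p) % p

def fermat_quotient(n, p):
    """q_p(n) = (n^(p-1) - 1)/p mod p."""
    return ((pow(n, p - 1) - 1) // p) % p

p = 7
for n in range(1, 25):
    if n % p:                         # u_n only makes sense when p does not divide n
        u = (pow(n, -1, p),) * n      # the uniform distribution (1/n, ..., 1/n) mod p
        assert entropy_mod_p(u, p) == fermat_quotient(n, p)
```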

When we take our prime $p$ to be $2$, a probability distribution $\pi$ is just a sequence of bits like $(0, 0, 1, 0, 1, 1, 1, 0, 1)$ with an odd number of $1$s. Its entropy $H_2(\pi) \in \mathbb{Z}/2\mathbb{Z}$ turns out to be $0$ if the number of $1$s is congruent to $1$ mod $4$, and $1$ if the number of $1$s is congruent to $3$ mod $4$.
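
For instance, this description of $H_2$ can be checked exhaustively on short bit strings (Python sketch, reusing the formula above):

```python
from itertools import product

def entropy_mod_p(pi, p):
    """H_p(π) = (1 - Σ a_i^p)/p mod p."""
    a = [x % p for x in pi]
    assert sum(a) % p == 1
    return ((1 - sum(x ** p for x in a)) // p) % p

# Check H_2 on every bit string of length 9 with an odd number of 1s:
for bits in product((0, 1), repeat=9):
    k = sum(bits)
    if k % 2 == 1:
        expected = 0 if k % 4 == 1 else 1
        assert entropy_mod_p(bits, 2) == expected
```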

What about distributions on two elements? In other words, let $\alpha \in \mathbb{Z}/p\mathbb{Z}$ and put $\pi = (\alpha, 1 - \alpha)$. What is $H_p(\pi)$?

It takes a bit of algebra to figure this out, but it’s not too hard, and the outcome is that for $p \neq 2$, $H_p(\alpha, 1 - \alpha) = \sum_{r = 1}^{p - 1} \frac{\alpha^r}{r}.$ This function was, in fact, the starting point of Kontsevich’s note, and it’s what he called the $1\tfrac{1}{2}$-logarithm.
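
A quick numerical check that this polynomial agrees with $H_p$ on two-element distributions, here for $p = 7$ (Python sketch; function names are mine):

```python
def entropy_mod_p(pi, p):
    """H_p(π) = (1 - Σ a_i^p)/p mod p."""
    a = [x % p for x in pi]
    assert sum(a) % p == 1
    return ((1 - sum(x ** p for x in a)) // p) % p

def half_log(alpha, p):
    """Kontsevich's 1½-logarithm: Σ_{r=1}^{p-1} α^r / r mod p (1/r is the inverse mod p)."""
    return sum(pow(alpha, r, p) * pow(r, -1, p) for r in range(1, p)) % p

p = 7
for alpha in range(p):
    assert entropy_mod_p((alpha, 1 - alpha), p) == half_log(alpha, p)
```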

We’ve now succeeded in finding a definition of entropy mod $p$ that
satisfies the chain rule. That’s not quite enough, though. In principle,
there could be *loads* of things satisfying the chain rule, in which case,
what special status would ours have?

But in fact, up to the inevitable constant factor, our definition of
entropy mod $p$ is the *one and only* definition satisfying the chain rule:

**Theorem**  Let $(I : \Pi_n \to \mathbb{Z}/p\mathbb{Z})_{n \geq 1}$ be a sequence of functions. Then $I$ satisfies the chain rule if and only if $I = c H_p$ for some $c \in \mathbb{Z}/p\mathbb{Z}$.

This is precisely analogous to the characterization theorem for real entropy, except that in the real case some analytic condition on $I$ has to be imposed (continuity in Faddeev’s theorem, and measurability in the stronger theorem of Lee). So, this is excellent justification for calling $H_p$ the entropy mod $p$.

I’ll say nothing about the proof except the following. In Faddeev’s
theorem over $\mathbb{R}$, the tricky part of the proof involves the fact
that the sequence $(\log n)_{n \geq 1}$ is *not* uniquely characterized up
to a constant factor by the equation $\log(m n) = \log m + \log n$; to make
that work, you have to introduce some analytic condition. Over
$\mathbb{Z}/p\mathbb{Z}$, the tricky part involves the fact that the domain
of the “logarithm” (Fermat quotient) is not $\mathbb{Z}/p\mathbb{Z}$, but
$\mathbb{Z}/p^2\mathbb{Z}$. So, analytic difficulties are replaced by
number-theoretic difficulties.

Kontsevich didn’t actually write down a definition of entropy mod $p$ in his two-and-a-half-page note. He did exactly enough to show that there must be a unique sensible such definition… and left it there! Of course he could have worked it out if he’d wanted to, and maybe he even did, but he didn’t include it in the note.

Anyway, let’s return to the quotation from Kontsevich that I began my first post with:

> **Conclusion:** If we have a random variable $\xi$ which takes finitely many values with all probabilities in $\mathbb{Q}$ then we can define not only the transcendental number $H(\xi)$ but also its “residues modulo $p$” for almost all primes $p$!

In the notation of these posts, he’s saying the following. Let

$\pi = (\pi_1, \ldots, \pi_n)$

be a real probability distribution in which each $\pi_i$ is rational.
There are only finitely many primes that divide one or more of the
denominators of $\pi_1, \ldots, \pi_n$. For primes $p$ *not* belonging to
this exceptional set, we can interpret $\pi$ as a probability distribution
in $\mathbb{Z}/p\mathbb{Z}$. We can therefore take its mod $p$ entropy,
$H_p(\pi)$.

Kontsevich is playfully suggesting that we view $H_p(\pi) \in \mathbb{Z}/p\mathbb{Z}$ as the residue class mod $p$ of $H_\mathbb{R}(\pi) \in \mathbb{R}$.

There is more to this than meets the eye! Different real probability distributions can have the same real entropy, so there’s a question of consistency. Kontsevich’s suggestion only makes sense if

$H_\mathbb{R}(\pi) = H_\mathbb{R}(\pi') \implies H_p(\pi) = H_p(\pi').$

And this is true! I have a proof, though I’m not convinced it’s optimal. Does anyone see an easy argument for this?

Let’s write $\mathcal{H}^{(p)}$ for the set of real numbers of the form $H_\mathbb{R}(\pi)$, where $\pi$ is a real probability distribution whose probabilities $\pi_i$ can all be expressed as fractions with denominator not divisible by $p$. We’ve just seen that there’s a well-defined map

$[.] : \mathcal{H}^{(p)} \to \mathbb{Z}/p\mathbb{Z}$

defined by

$[H_\mathbb{R}(\pi)] = H_p(\pi).$

For $x \in \mathcal{H}^{(p)} \subseteq \mathbb{R}$, we view $[x]$ as the congruence class mod $p$ of $x$. This notion of “congruence class” even behaves something like the ordinary notion, in the sense that $[.]$ preserves addition.

(We can even go a bit further. Accompanying the characterization theorem
for entropy mod $p$, there is a characterization theorem for information
loss mod $p$, strictly analogous to the theorem that John Baez, Tobias
Fritz and I proved over $\mathbb{R}$. I won’t review that stuff here,
but the point is that an information loss is a *difference* of entropies,
and this enables us to define the congruence class mod $p$ of the
*difference* of two elements of $\mathcal{H}^{(p)}$. The same additivity holds.)

There’s just one more thing. In a way, the definition of entropy mod $p$ is unsatisfactory. In order to define it, we had to step outside the world of $\mathbb{Z}/p\mathbb{Z}$ by making arbitrary choices of representing integers, and then we had to show that the definition was independent of those choices. Can’t we do it directly?

In fact, we can. It’s a well-known miracle about finite fields $K$ that
*any* function $K \to K$ is a polynomial. It’s a slightly less well-known
miracle that any function $K^n \to K$, for any $n \geq 0$, is also a
polynomial.

Of course, multiple polynomials can induce the same function. For
instance, the polynomials $x^p$ and $x$ induce the same function
$\mathbb{Z}/p\mathbb{Z} \to \mathbb{Z}/p\mathbb{Z}$. But it’s still
possible to make a uniqueness statement. Given a function $F : K^n \to K$,
there’s a *unique* polynomial $f \in K[x_1, \ldots, x_n]$ that induces $F$
and is of degree less than the order of $K$ in each variable separately.

So, there must be a polynomial representing entropy, of degree less than $p$ in each variable. And as it turns out, it’s this one:

$H_p(\pi_1, \ldots, \pi_n) = - \sum_{\substack{0 \leq r_1, \ldots, r_n \lt p:\\r_1 + \cdots + r_n = p}} \frac{\pi_1^{r_1} \cdots \pi_n^{r_n}}{r_1! \cdots r_n!}.$

You can check that when $n = 2$, this is in fact the same polynomial $\sum_{r = 1}^{p - 1} \pi_1^r/r$ as we met before — Kontsevich’s $1\tfrac{1}{2}$-logarithm.
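
Here is a brute-force check of the polynomial formula against the earlier representative-based definition, for $p = 5$ and distributions of length $2$ and $3$ (Python sketch; everything is small enough to enumerate):

```python
from itertools import product
from math import factorial

def entropy_mod_p(pi, p):
    """H_p via representatives: (1 - Σ a_i^p)/p mod p."""
    a = [x % p for x in pi]
    assert sum(a) % p == 1
    return ((1 - sum(x ** p for x in a)) // p) % p

def entropy_poly(pi, p):
    """H_p as the polynomial -Σ_{r_1+...+r_n = p, 0 ≤ r_i < p} π^r / (r_1! ... r_n!)."""
    total = 0
    for rs in product(range(p), repeat=len(pi)):
        if sum(rs) == p:
            term = 1
            for x, r in zip(pi, rs):
                term = term * pow(x, r, p) % p
            for r in rs:
                # r! is invertible mod p since r < p.
                term = term * pow(factorial(r), -1, p) % p
            total = (total + term) % p
    return -total % p

p = 5
for n in (2, 3):
    for pi in product(range(p), repeat=n):
        if sum(pi) % p == 1:
            assert entropy_poly(pi, p) == entropy_mod_p(pi, p)
```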

It’s striking that this direct formula for entropy modulo a prime looks quite unlike the formula for real entropy,

$H_\mathbb{R}(\pi) = - \sum_{i : \pi_i \neq 0} \pi_i \log \pi_i.$

It’s also striking that in the case $n = 2$, the formula for real entropy is

$H_\mathbb{R}(\alpha, 1 - \alpha) = - \alpha \log \alpha - (1 - \alpha) \log(1 - \alpha),$

whereas mod $p$, we get

$H_p(\alpha, 1 - \alpha) = \sum_{r = 1}^{p - 1} \frac{\alpha^r}{r},$

which is a truncation of the Taylor series of $-\log(1 - \alpha)$. And yet, the characterization theorems for entropy over $\mathbb{R}$ and over $\mathbb{Z}/p\mathbb{Z}$ are strictly analogous.

As I see it, there are two or three big open questions:

Entropy over $\mathbb{R}$ can be understood, interpreted and applied in many ways. How can we understand, interpret or apply entropy mod $p$?

Entropy over $\mathbb{R}$ and entropy mod $p$ are defined in roughly analogous ways, and uniquely characterized by strictly analogous theorems. Is there a common generalization? That is, can we unify the two definitions and characterization theorems, perhaps proving a theorem about entropy over suitable fields?
