Let ${\mathcal M}=\{1,2,\dots,M\}$ be the set of messages (already compressed by Huffman coding or some other compression method, e.g., block source codes).

The channel $W:{\mathcal X}\to{\mathcal Y}$ is a stochastic map from the set ${\mathcal X}=\{0,1,\dots,q-1\}$ to the set ${\mathcal Y}=\{0,1,\dots,|{\mathcal Y}|-1\}.$ It is often expressed as a stochastic matrix $(W(y|x))_{x\in{\mathcal X}}^{y\in{\mathcal Y}},$ whose rows sum to $1,$ or as a graph.
Example: $$\begin{array}{c@{\quad}c|ccc}
 & & \multicolumn{3}{c}{{\mathcal Y}}\\
 & & 0 & 1 & 2\\
\hline
{\mathcal X} & 0 & 0.4 & 0.1 & 0.5\\
 & 1 & 0.8 & 0.05 & 0.15
\end{array}$$ The corresponding graph is bipartite, with one part formed of the elements of ${\mathcal X}$ and the other of the elements of ${\mathcal Y}.$ Vertices in ${\mathcal X}$ are connected by arrows to the vertices in ${\mathcal Y}.$ The arrow from $x\in {\mathcal X}$ to $y\in{\mathcal Y}$ is usually labelled by the probability $W(y|x).$ If $W(y|x)=0,$ the arrow is not drawn.
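To make the matrix representation concrete, here is a minimal simulation sketch (assuming NumPy; the helper names channel_use and send are ours, not standard): the matrix is stored row-stochastically, and one channel use samples an output letter from the row of the transmitted input letter.
\begin{verbatim}
import numpy as np

# Row-stochastic matrix of the example channel: X = {0,1}, Y = {0,1,2}.
# W[x][y] = W(y|x); each row sums to 1.
W = np.array([[0.4, 0.10, 0.50],
              [0.8, 0.05, 0.15]])

rng = np.random.default_rng(0)

def channel_use(x):
    """One use of the channel: sample y with probability W(y|x)."""
    return rng.choice(W.shape[1], p=W[x])

def send(xs):
    """n independent channel uses: maps a word over X to a word over Y."""
    return [channel_use(x) for x in xs]

print(send([0, 1, 1]))   # e.g. [2, 0, 0]; the output is random
\end{verbatim}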

The set ${\mathcal X}$ is called the input alphabet of the channel $W$ and the set ${\mathcal Y}$ is called the output alphabet. Elements of ${\mathcal X}$ and ${\mathcal Y}$ are sometimes called letters.
A code $C$ of length $n$ is a pair of mappings $(f,g),$ where $f:{\mathcal M}\to {\mathcal X}^n$ and $g: {\mathcal Y}^n\to {\mathcal M}$. Here $f$ is called the encoding map (the encoder) and $g$ is called the decoding map (the decoder).
The number $R(C){\stackrel{\triangle}{=}}\frac {1}{n}\log M$ is called the rate of the code.
The code $C$ assigns codewords of length $n$ over ${\mathcal X}$ to the messages. Thus we have the set of codewords $x_1, x_2,\dots, x_{M},$ where $x_i=(x_{i,1},x_{i,2},\dots, x_{i,n})$ is the value of $f(i),$ i.e., the codeword for the message $i\in {\mathcal M}.$ If $f$ is understood, then the set $C=\{x_1, x_2,\dots, x_M\}$ itself is called a code. The same encoding map $f$ can be coupled with different decoding maps $g$ to form a pair $(f,g),$ allowing a trade-off between performance and complexity.
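For concreteness, here is a toy instance of a pair $(f,g)$ with $M=2$ and $n=3$: the binary repetition encoder coupled with a majority-vote decoder (one of several decoders that could be paired with this $f$). A binary output alphabet is assumed here; the construction is illustrative only.
\begin{verbatim}
import math

M, n = 2, 3                      # two messages, block length 3

def f(i):
    """Encoder: message i in {1, 2} -> codeword in {0,1}^3 (repetition code)."""
    return (i - 1,) * n          # f(1) = (0,0,0), f(2) = (1,1,1)

def g(y):
    """Decoder: majority vote (one of many decoders usable with this f)."""
    return 2 if sum(y) > n / 2 else 1

R = math.log2(M) / n             # rate R(C) = (1/n) log M, log taken to base 2
print(R)                         # 0.333...
\end{verbatim}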
Definition: A code $C=(f,g)$ of length $n$ is said to attain the capacity ${\mathcal C}(W)$ of the channel $W$ if $R(C)={\mathcal C}(W)$ and $\Pr\{g(f(i))\ne i\}=0$ for all $i\in{\mathcal M}.$
Note: An individual code that attains capacity must have error probability exactly zero (as opposed to arbitrarily small). Zero error probability is possible only if there are input letters that do not mix, i.e., there exist $x_1,x_2\in {\mathcal X}$ such that $W(y|x_1)W(y|x_2)=0$ for all $y\in {\mathcal Y}.$ By itself this condition does not guarantee that the code attains capacity.
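The non-mixing condition can be tested mechanically: two input letters do not mix exactly when their rows in the stochastic matrix have disjoint supports. A sketch (the helper name is ours):
\begin{verbatim}
from itertools import combinations
import numpy as np

def has_nonmixing_pair(W):
    """True iff some pair of input letters x1 != x2 satisfies
    W(y|x1) * W(y|x2) = 0 for all y (disjoint output supports)."""
    return any(np.all(W[x1] * W[x2] == 0)
               for x1, x2 in combinations(range(W.shape[0]), 2))

W = np.array([[0.4, 0.10, 0.50],
              [0.8, 0.05, 0.15]])
print(has_nonmixing_pair(W))   # False: every output is reachable from both inputs
\end{verbatim}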

Let $(f,g)$ be fixed. The (maximal) error probability of the code is $$P_e(C){\stackrel{\triangle}{=}}\max_{1\le i\le M} \Pr\{g(f(i))\ne i\},$$ where the probability is taken over the transitions of the channel acting on the transmitted codeword $f(i).$
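Since the probability above is taken over the channel transitions, $P_e(C)$ can be estimated by simulation. A hedged sketch using the toy repetition code over a binary symmetric channel with crossover probability $0.1$ (our choice of illustrative parameters):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
p, n, M = 0.1, 3, 2                            # BSC(0.1), block length 3, 2 messages

def f(i):                                      # repetition encoder, as above
    return np.full(n, i - 1)

def g(y):                                      # majority-vote decoder
    return 2 if y.sum() > n / 2 else 1

def estimate_Pe(trials=100_000):
    """Monte Carlo estimate of P_e(C) = max_i Pr{ g(channel(f(i))) != i }."""
    worst = 0.0
    for i in range(1, M + 1):
        flips = rng.random((trials, n)) < p    # each bit flipped w.p. p
        ys = (f(i) + flips) % 2                # received words
        worst = max(worst, np.mean([g(y) != i for y in ys]))
    return worst

print(estimate_Pe())   # close to 3p^2(1-p) + p^3 = 0.028
\end{verbatim}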
Definition. A sequence of codes $C_i=(f_i,g_i)$ of lengths $n_i,$ $i=1,2,\dots,$ is said to achieve the capacity ${\mathcal C}(W)$ if
$\forall\delta>0,\epsilon>0$ there is a sufficiently large $i_0=i_0(\epsilon,\delta)$ such that for all $i\ge i_0$ both $R(C_i)> {\mathcal C}(W)-\delta$ and $P_e(C_i)\le \epsilon.$

A binary code $C$ is called linear if the encoding map $f$ is linear. Equivalently, $C$ is linear if the set of codewords $\{x_i\}$ is closed under addition modulo 2 (i.e., the sum of any two codewords is again a codeword).
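Closure under modulo-2 addition is straightforward to verify for a small codeword set; the following sketch (helper name ours) checks it for the even-weight code of length 3:
\begin{verbatim}
from itertools import combinations_with_replacement

def is_linear(C):
    """A binary code is linear iff the mod-2 sum of any two codewords is a
    codeword (taking a word plus itself shows 0...0 must belong to C)."""
    S = set(map(tuple, C))
    return all(tuple(a ^ b for a, b in zip(u, v)) in S
               for u, v in combinations_with_replacement(S, 2))

# Even-weight code of length 3: all words with an even number of ones.
C = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
print(is_linear(C))   # True
\end{verbatim}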
Polar codes give an example of a sequence of capacity-achieving binary linear codes. The length of a polar code is $N=2^n$ (note that the length is now denoted by $N,$ while $n$ denotes the number of iterations). The encoding map $f:{\mathcal M}\to {\mathcal X}^N$ is multiplication by the submatrix $G_n$ formed of the $RN$ rows of $H_2^{\otimes n}$ that correspond to the $RN$ smallest values of the Bhattacharyya parameters $Z(W_N^{(i)}).$
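A minimal sketch of this construction, specialized to the binary erasure channel BEC($\epsilon$), where $Z(W)=\epsilon$ and the Bhattacharyya parameters evolve by the exact recursion $Z\mapsto 2Z-Z^2$ and $Z\mapsto Z^2$ (for a general $W$ the values $Z(W_N^{(i)})$ must be computed or bounded by other means). Bit-ordering conventions (bit reversal) are glossed over, and all names are ours:
\begin{verbatim}
import numpy as np

def bhattacharyya_bec(eps, n):
    """Z(W_N^(i)) for BEC(eps), N = 2^n, via the exact BEC recursion
    Z -> 2Z - Z^2 (degraded branch) and Z -> Z^2 (upgraded branch)."""
    Z = np.array([eps])
    for _ in range(n):
        Z = np.concatenate([2 * Z - Z ** 2, Z ** 2])
    return Z

def polar_generator(n, R, eps):
    """Submatrix G_n: the RN rows of H_2^{\otimes n} whose indices have the
    RN smallest Bhattacharyya parameters Z(W_N^(i))."""
    H2 = np.array([[1, 0], [1, 1]])
    Hn = H2
    for _ in range(n - 1):
        Hn = np.kron(Hn, H2)                       # H_2^{\otimes n}, size N x N
    N = 2 ** n
    good = np.sort(np.argsort(bhattacharyya_bec(eps, n))[: int(R * N)])
    return Hn[good]

G = polar_generator(n=3, R=0.5, eps=0.5)           # N = 8, RN = 4
u = np.array([1, 0, 1, 1])                         # RN information bits
x = u @ G % 2                                      # codeword: x = u G_n mod 2
print(x)
\end{verbatim}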