Erasure codes for software engineers

In large-scale storage systems, erasure codes are essential for reducing storage costs and maintaining high durability. Erasure codes can also be used in more creative ways, such as reducing network latency by hedging requests 1. There are also many constraints to consider when choosing specific codes such as how often you need to perform repairs, available network capacity, data center redundancy needs, and durability guarantees.

In this post, we build up from simple linear systems in order to build the foundations for Reed-Solomon style codes and using tiny Python snippets along the way. We’ll also keep dependencies minimal and only use NumPy.

We’ll use this notation throughout:

k: data shards per stripe
m: parity shards per stripe
n = k + m: total shards per stripe

Storage overhead is n / k, the total shards over the data shards. For example, RS 10+4 means k = 10, m = 4, has 1.4x overhead, and tolerates any 4 shard erasures. By comparison, 3x replication has 3x overhead and tolerates 2 replica losses. Placement across disks, nodes, racks, and zones determines whether those shard losses are truly independent. Some modern codes intentionally add a bit more parity to reduce repair bandwidth. For example, locally repairable codes 2.

Here’s a quick comparison:

Scheme	Overhead	Stripe erasure tolerance	Data read to repair 1 lost shard
3x replication	3.0x	2 losses	1 shard
RS 10+4	1.4x	4 losses	10 shards
RS 12+2	1.17x	2 losses	12 shards

Warmup: solving a linear system Link to heading

A linear equation in two variables looks like:

$$ 2x + 3y = 7 $$

Two independent equations define a single point (x, y). For example:

$$ 2x + 3y = 7 \\ x - y = 1 $$

You can solve this by hand, but it is also just a matrix equation:

$$ A c = b $$

where A is the coefficient matrix, c is the unknowns, and b is the result vector.

Here is a tiny NumPy example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


import numpy as np

A = np.array([
    [2, 3],
    [1, -1],
], dtype=float)

b = np.array([7, 1], dtype=float)

c = np.linalg.solve(A, b)
print(c)  # [2. 1.]

Same idea, same tooling, if you move from 2 variables to 3+ variables.

A polynomial view Link to heading

Okay, here is where the ’trick’ comes in. The key is to encode data as coefficients of a polynomial. For example:

$$ M(x) = 2 + 3x + 4x^2 $$

The coefficients [2, 3, 4] are the data. If we evaluate M(x) at multiple points, we get more equations than unknowns, which gives redundancy.

This is the idea behind Reed-Solomon codes 3.

Encoding by evaluation Link to heading

We will:

Define a message polynomial.
Evaluate it at 5 points to create the codeword.
Lose two points and recover the coefficients from the remaining points.

1) Define the message polynomial Link to heading

Let:

$$ M(x) = 2 + 3x + 4x^2 $$

The coefficients [2, 3, 4] are the data.

1
2
3
4


import numpy as np
import matplotlib.pyplot as plt

coeffs = np.array([2, 3, 4], dtype=float)  # degree < 3

2) Choose evaluation points Link to heading

We will use 5 distinct points: 1, 2, 3, 4, 5.

1
2
3


num_points = 5
eval_points = np.arange(1, num_points + 1, dtype=float)
print("Evaluation points:", eval_points)

3) Encode: evaluate the polynomial Link to heading

We use Horner’s method to evaluate the polynomial (at the given points) and generate the codeword, which is just the list of outputs at those points. It’s a fast, standard way to evaluate polynomials.

1
2
3
4
5
6
7
8


def poly_eval(coeffs, x):
    result = 0.0
    for c in coeffs[::-1]:
        result = result * x + c
    return result

codeword = np.array([poly_eval(coeffs, x) for x in eval_points])
print("Encoded codeword (evaluations):", codeword)

4) Plot the polynomial and points (optional) Link to heading

This part is optional, but we can plot the polynomial at the different selected points in order to visualize what we’re doing.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12


x_dense = np.linspace(0, 6, 400)
y_dense = np.array([poly_eval(coeffs, x) for x in x_dense])

plt.figure(figsize=(8, 5))
plt.plot(x_dense, y_dense, label="Polynomial curve", color='blue')
plt.scatter(eval_points, codeword, color='red', zorder=5, label="Evaluation points")
plt.xlabel("x")
plt.ylabel("M(x)")
plt.title("Polynomial Evaluation and Codeword Points")
plt.legend()
plt.grid(True)
plt.show()

Polynomial evaluation and codeword points

5) Simulate erasures Link to heading

In this example, we simulate two erasures at indices 1 and 3 (that is, x = 2 and x = 4). Assume we only receive evaluations at indices 0, 2, 4.

1
2
3


known_indices = [0, 2, 4]
x_known = eval_points[known_indices]
y_known = codeword[known_indices]

6) Recover the polynomial Link to heading

To recover the coefficients, we write the equations in matrix form. A Vandermonde matrix is just the table of powers of the x-values (1, x, x^2, ...) so that A * coeffs = y. Solving that system gives us the original coefficients.

If there is a unique solution, then the matrix A must be invertible. That means we can do:

$$ coeffs = A^{-1} y $$

In our example, that means we can solve for the unknowns in one step. The x-values must be distinct for this to work.

Where do the linear equations come from? Pick an x, plug it into the polynomial, and you get an equation like:

$$ c_0 + c_1 x + c_2 x^2 = y $$

The unknowns are the coefficients c0, c1, c2, .... If there are k coefficients, you need at least k distinct equations (i.e., k different x values) to solve for them. The encoder’s job is to pick distinct x values, evaluate the polynomial, and send those outputs (the codeword).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


k = len(coeffs)  # degree < k, here k = 3
A = np.vander(x_known, N=k, increasing=True)

recovered_coeffs = np.linalg.solve(A, y_known)
print("Recovered polynomial coefficients:", recovered_coeffs)

if np.allclose(recovered_coeffs, coeffs):
    print("Successfully recovered the original polynomial coefficients!")
else:
    print("Recovery failed. Original coefficients:", coeffs)

Two erasure examples:

Lose two points (keep indices 0, 2, 4). This is the example above.
Lose two different points (keep indices 1, 2, 3). This also works because we still have 3 distinct points.

The rule is simple: a degree-2 polynomial needs any 3 distinct points.

This is exactly the erasure model: we lose symbols, but we know which ones.

Erasure vs. error correction (quick note): erasures mean we know which symbols are missing, while errors mean unknown symbols might be wrong. This post only covers erasures.

From linear equations to Reed-Solomon Link to heading

Reed-Solomon codes are polynomial evaluation codes over a finite field. They are MDS (maximum distance separable): any k out of n symbols recover the original data. The encoder chooses k data symbols (the coefficients) and outputs n evaluated symbols. Any k of those n are enough to reconstruct the polynomial, so you can tolerate n - k erasures.

Key point: the decoder is just solving a linear system.

Finite fields (Galois fields) Link to heading

The example above used real numbers, but storage systems use finite fields so that all arithmetic stays within a fixed symbol size (like a byte).

A Galois field GF(2^m) has exactly 2^m elements. Arithmetic wraps around inside this field, so every non-zero element has a multiplicative inverse, and all linear systems behave nicely for coding.

In practice, many Reed-Solomon deployments use GF(2^8) for byte symbols, though other symbol sizes are also used. The same linear algebra applies, but the math is done modulo a field polynomial instead of with reals.

If you want a tiny bit-level multiplication example in GF(2^3), see the appendix at the end.

A simple erasure code end-to-end Link to heading

Let k = 3 data symbols d0, d1, d2. We create n = 5 encoded symbols by choosing 5 distinct points and evaluating the polynomial:

$$ M(x) = d_0 + d_1 x + d_2 x^2 $$

The encoder sends M(1), M(2), M(3), M(4), M(5).

If any two are erased, the decoder picks any 3 remaining points, builds the Vandermonde matrix, and solves for [d0, d1, d2].

That is the entire idea: erasure codes are linear equations plus redundancy.

A full “hello” example (recover the original string) Link to heading

Okay, our equivalent “hello” test will tie all these concepts together. Here is a minimal example that encodes the string "hello", then simulates losing some encoded symbols and recovers the original string by solving for the coefficients again.

We will use a degree-4 polynomial code over the integers. We treat the bytes as the polynomial coefficients, evaluate it at x = 1..7, and use the extra points as parity. This is still a toy example (not a true Reed-Solomon over a finite field), but the mechanics are the same.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31


import numpy as np

def poly_eval(coeffs, x):
    result = 0.0
    for c in coeffs[::-1]:
        result = result * x + c
    return result

def fit_poly(x, y):
    A = np.vander(x, N=5, increasing=True)
    return np.linalg.solve(A, y)

msg = "hello"
data = np.frombuffer(msg.encode("utf-8"), dtype=np.uint8)

block = data  # b"hello"

x_all = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)

# Encode: treat bytes as polynomial coefficients, then evaluate.
coeffs = block.astype(float)
codeword = np.array([poly_eval(coeffs, x) for x in x_all])

# Simulate losing two encoded symbols.
known_indices = [1, 2, 4, 5, 6]  # x = 2, 3, 5, 6, 7
x_known = x_all[known_indices]
y_known = codeword[known_indices]

recovered_coeffs = fit_poly(x_known, y_known)
recovered_block = np.round(recovered_coeffs).astype(np.uint8)
print(bytes(recovered_block).decode("utf-8"))  # "hello"

Warning: this demo uses floating point + rounding, which is not exact. Real Reed-Solomon uses finite field arithmetic so every step is exact and stays in a fixed-width byte.

What matters in distributed storage Link to heading

If you are onboarding to erasure coding for storage systems, these are usually the practical design levers:

Stripe geometry (k + m).
You split each stripe into k data shards and m parity shards.
Storage overhead.
Overhead is (k + m) / k.
Fault tolerance.
You can tolerate up to m shard erasures per stripe (assuming shard placement avoids correlated failures).
Repair bandwidth.
In classic Reed-Solomon, repairing one lost shard usually reads k surviving shards. This is where local-repair codes can help.
Write/read amplification.
Small random writes often force read-modify-write work for parity updates.

This is why production systems choose codes based on more than durability: repair speed, network cost, and placement constraints often dominate.

Closing Link to heading

This is the simplest path from linear equations to Reed-Solomon. Of course, there is a lot that we did not cover and many ways to approach this subject, but this path is often approachable for people with limited exposure to linear algebra. In a follow-up, we can explore other families of error-correcting codes.

Appendix: optional GF(2^3) multiplication Link to heading

This is a tiny bit-level example. It is optional, and only here if you want to see what finite-field multiplication can look like in code.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


# Multiply two 3-bit values in GF(2^3) with modulus x^3 + x + 1 (0b1011).

def gf_mul(a, b):
    mod = 0b1011
    result = 0
    for _ in range(3):
        if b & 1:
            result ^= a
        b >>= 1
        carry = a & 0b100
        a = (a << 1) & 0b111
        if carry:
            a ^= mod & 0b111
    return result

print(bin(gf_mul(0b010, 0b111)))  # 0b101

If you want to suggest edits or improvements, feel free to open a pull request in the source repository 4.

References Link to heading

M. Brooker, “Erasure Coding versus Tail Latency,” Jan. 6, 2023. https://brooker.co.za/blog/2023/01/06/erasure.html
C. Huang et al., “Erasure Coding in Windows Azure Storage,” USENIX ATC, 2012. https://www.usenix.org/system/files/conference/atc12/atc12-final181_0.pdf
I. S. Reed and G. Solomon, “Polynomial Codes over Certain Finite Fields,” Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 2, pp. 300-304, 1960. DOI: https://sites.math.rutgers.edu/~zeilberg/akherim/ReedS1960.pdf
Source file for this post: https://github.com/tallwater/agriel-blog/blob/main/agriel-blog/content/posts/erasure-codes-basics.md