
Erasure Codes for Distributed Storage

  • Writer: Rishaab
  • Nov 10, 2022
  • 8 min read

Updated: Jul 16

Data is growing exponentially, and so is the need for massively scalable distributed storage systems. A distributed storage system is a complicated beast that has to ensure performance, scalability, fault tolerance, and high availability. Today's topic targets the availability aspect of the data.


Distributed storage systems ensure availability by replicating the data. An RFn (Replication Factor n) scheme replicates a data block to n nodes; for example, in RF3, data is replicated to 3 nodes. This is a standard mechanism in modern distributed storage systems. With advances in replication protocols, data is replicated to other nodes at a reasonable rate, so this mechanism is typically used in storage systems where the data is hot.




We know designing a system is all about trade-offs. A 3-way replication as described above requires 3x the storage capacity to save a single data block. This extra storage cost is acceptable because of the strict performance requirements expected of a hot storage system. But using a similar mechanism for a cold storage system is simply a waste of storage capacity and resources. Unlike a hot storage system, a cold storage system does not need a strict SLA on write/read performance. This holds true even for tiered storage, because tiering policies are generally set by the consumers; it is expected that data untouched for several months will take some time to read. So we need a different mechanism that ensures the availability of the data while consuming less storage, and this is where erasure codes come into the picture.


To understand what erasure codes are, we need to understand a bit of coding theory. Data transmission is prone to corruption and loss, so the received data might not be the same as the sent data. Some scheme must therefore be devised to detect such corruption and loss of data and then fix it. This is what coding theory achieves, and several well-known algorithms address these cases.


The basic idea of any coding algorithm is to add some redundant bits, called parity bits, to the original message. These parity bits help reconstruct the original message in case of corruption or loss.


Consider that the message "hello" has to be transmitted. To this message some extra characters (parity characters), say "ab", are appended, so the message "helloab" is transmitted. Say the receiver receives "elloab", where the letter "h" has been lost. Because of the extra parity characters, and using the magic of coding algorithms, we are able to recover the character "h". The string "helloab" is called the codeword.


Let's take the parity concept further and understand its use from a mathematical standpoint.


Consider that we have a set of points in a Cartesian plane represented as (xi, yi), where xi is some input with output yi. Say we have 3 such points: (x1, y1), (x2, y2), (x3, y3).


These points can be fit by a second-degree polynomial, y = ax² + bx + c.



Now we can extrapolate along this curve to get more data points, say (x4, y4) and (x5, y5). Since a second-degree polynomial is fully determined by 3 points, any 3 of our 5 data points are enough to solve for the coefficients a, b, and c.


Here, we can consider the data points (x4, y4) and (x5, y5) as parity points. When transmitting (x1, y1), (x2, y2), (x3, y3), we also send (x4, y4) and (x5, y5) over the wire. If any original data point, say (x1, y1), is lost, we can regenerate the output y1 by solving the polynomial equation, provided we know the input value x1. We recovered the lost value y1 using the redundant information, and this is exactly what erasure codes achieve.
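To make this concrete, here is a minimal Python sketch of the idea. The polynomial 2x² + 3x + 1 is an arbitrary choice for illustration; we extrapolate two parity points, pretend an original point was lost, and regenerate it by interpolating through any 3 survivors. Exact fractions are used to avoid floating-point error, which foreshadows the next section.

```python
from fractions import Fraction

def lagrange_eval(points, x):
    """Evaluate the unique polynomial through `points` at `x`
    using Lagrange interpolation (exact, thanks to fractions)."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# Hypothetical data: three points on y = 2x^2 + 3x + 1.
f = lambda x: 2 * x * x + 3 * x + 1
data = [(x, f(x)) for x in (1, 2, 3)]       # (x1,y1)..(x3,y3)
parity = [(x, f(x)) for x in (4, 5)]        # extrapolated parity points

# (x1, y1) is lost; any 3 surviving points pin down the polynomial.
survivors = data[1:] + parity[:1]           # (x2,y2), (x3,y3), (x4,y4)
assert lagrange_eval(survivors, 1) == f(1)  # y1 regenerated: 6
```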


Erasure codes are codes that transform the original message into a longer message (by adding parity bits) such that any erased data can be regenerated from the redundant information.


In practice, the math behind coding theory is a lot more complicated. We cannot simply use ordinary arithmetic (the kind we are used to) when writing this software, because it would be neither correct nor performant. We are concerned here with storage systems, which store arbitrary data, and doing ordinary arithmetic on such values does not always produce finite results. Take 4/3, whose expansion 1.333... never terminates: encoding it as a truncated 1.333 and then decoding (1.333 × 3 = 3.999) yields a lossy code. The regenerated data would not be the same as the original data, which is a showstopper for a storage system and for erasure codes in general.
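Here is a two-line Python sketch of that round trip, where truncating to 6 decimal places stands in for any finite-precision encoding:

```python
encoded = round(4 / 3, 6)   # 1.333333 -- the infinite tail is cut off
decoded = encoded * 3       # 3.999999, not the 4 we started with
assert decoded != 4         # the round trip is lossy
```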


So, what is the solution? The solution lies in some beautiful math called finite fields, also known as Galois fields. Finite fields follow different arithmetic rules and are a subject of their own, so we only cover the general idea here.


A finite field is a set of numbers on which the mathematical operations are well defined. For example, {0, 1} is a field with 2 elements: doing modular arithmetic over this field keeps values looping between 0 and 1. It is called GF(2), the Galois field of 2 elements, and addition, subtraction, multiplication, division, and inversion modulo 2 always yield either 0 or 1. Finite field theory generalizes this concept: for every prime number p, there exists a finite field with the numbers 0 to p − 1 on which the mathematical operations, performed modulo p, are well defined.
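As a quick sketch, here is modulo-p arithmetic in Python for GF(7) (the prime 7 is an arbitrary choice). Every operation lands back inside {0, ..., 6}, and every nonzero element has a multiplicative inverse:

```python
p = 7  # any prime works; 7 is an arbitrary choice

def inv(a):
    """Multiplicative inverse in GF(p), via Fermat's little theorem."""
    return pow(a, p - 2, p)

a, b = 4, 3
print((a + b) % p)          # 0 -- addition wraps around, never leaves 0..6
print((a * b) % p)          # 5
print(a * inv(b) % p)       # "4 / 3" in GF(7): inv(3) = 5, so 4 * 5 % 7 = 6
assert b * inv(b) % p == 1  # 3 * 5 = 15 = 1 (mod 7)
```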


Are finite fields restricted to prime sizes? Luckily, no, and this is where the concept of the extension field comes into the picture. For every prime p and positive integer n, there exists a field GF(pⁿ).


The elements of GF(pⁿ) are represented as polynomials of degree less than n with coefficients in GF(p). For example, GF(2²) has the numbers {0, 1, 2, 3}, which can be represented as follows.

Field Element | Polynomial | Binary representation
------------- | ---------- | ---------------------
0             | 0x + 0     | 00
1             | 0x + 1     | 01
2             | 1x + 0     | 10
3             | 1x + 1     | 11

The modulo in the extension field is done using a prime polynomial. A prime (irreducible) polynomial is analogous to a prime number: it cannot be factored. For example, x² + x + 1 is a prime polynomial, as it has no factors over GF(2). (Incidentally, its binary representation is 111, which is 7, a prime number, though that is a coincidence rather than a test for irreducibility.)


So, how would we represent 4/3, which produced an infinite sequence of digits? Consider GF(2⁴), using the prime polynomial x⁴ + x + 1 (the choice consistent with the numbers below). Then,


4/3 = (1/3) × 4 = 14 × 4 = 13


We can see that the number 13 is well defined and has an exact binary representation, and this is what finite fields achieve. You can refer to this link to perform various math on finite fields.
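Below is a small Python sketch of GF(2⁴) arithmetic that reproduces this computation. It assumes the prime polynomial x⁴ + x + 1; with it, 1/3 = 14 and 4/3 = 13, exactly as above:

```python
PRIME_POLY = 0b10011  # x^4 + x + 1

def gf16_mul(a, b):
    """Multiply two GF(2^4) elements: carry-less (XOR-based) multiply,
    reducing modulo the prime polynomial whenever degree 4 is reached."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:       # a's degree hit 4: subtract (XOR) the prime poly
            a ^= PRIME_POLY
    return result

def gf16_inv(a):
    """Brute-force inverse; fine for a 16-element field."""
    return next(x for x in range(1, 16) if gf16_mul(a, x) == 1)

assert gf16_inv(3) == 14      # 1/3 = 14
assert gf16_mul(14, 4) == 13  # so 4/3 = 14 * 4 = 13, matching the text
```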


Ok, now that we have covered the prerequisites from coding theory, let's look into erasure codes.


An [n, k] erasure code takes k-length data and encodes it into an n-length code. Put another way, the original data is split into k shards and encoded into n shards, where each coded shard is 1/k of the original data size. Such a scheme can tolerate the loss of up to n − k shards. So if each shard represents a disk, with n = 5 and k = 3 the system can tolerate 2 disk failures. This also aligns with our extrapolation-on-the-graph analogy.


So, how is this any better than replication? Remember that in the replication scheme, to survive 2 disk failures we need at least 3 copies of the data, hence 3x the original size. In the [5, 3] erasure coding scheme, we require


1 + (n − k)/k = 1 + (5 − 3)/3 = 1 + 2/3 = 5/3x the data


This is a huge saving considering the amount of data we store today. So if this scheme saves so much space, why don't we apply it everywhere? The problem is that reading data is more expensive with erasure codes, as we need to gather and merge the data from the shards. Most importantly, reconstructing the data after a loss is also expensive. That is why this scheme is commonly used for cold storage, although cloud platforms like Azure use a hybrid approach, i.e., replication plus erasure codes.


One popular erasure coding algorithm is Reed–Solomon. Reed–Solomon is more than an erasure code, but we will restrict ourselves to its use as one. The finite field theory we discussed earlier is used to perform all the math in Reed–Solomon.


Let's walk through a Reed–Solomon example using a matrix multiplication scheme. Say we have a file that we want to encode with a [6, 4] scheme. We split our file into 4 chunks, with each chunk placed on its own disk. Our encoder will encode these 4 data chunks into 6 coded chunks such that we survive 2 disk failures and can reconstruct the data from any 4 disks.


Consider that our file only contains numbers and has been split into 4 chunks like so,

Chunk | File content
----- | ------------
0     | 1 2 3 4
1     | 5 6 7 8
2     | 9 10 11 12
3     | 13 14 15 16

Now we need an encoder matrix to multiply with. Our encoder matrix has dimensions 6x4, since we need to produce 6 rows, with each row representing a disk. The multiplication looks like the following: encoder matrix × data chunks.


Here, the leftmost matrix is the encoder matrix. The encoder's sub-matrix from rows 0 to 3 is an identity matrix, so rows 0 to 3 of the resultant matrix contain exactly the file chunk data. Rows 4 and 5 of the resultant matrix contain different numbers: the parity.
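Since the multiplication is easier to see with concrete numbers, here is a sketch in Python with NumPy. The two parity rows are a hypothetical choice, made so that any 4 rows of the encoder remain invertible; real Reed–Solomon implementations derive them from Vandermonde or Cauchy matrices over a finite field:

```python
import numpy as np

# The 4 file chunks from the table above, one chunk per row (4x4).
D = np.arange(1, 17).reshape(4, 4)

# 6x4 encoder: the top 4 rows are the identity, so disks 0-3 store the
# data verbatim; the bottom 2 rows generate the parity disks.
E = np.vstack([np.eye(4, dtype=int),
               [[1, 1, 1, 1],
                [1, 2, 3, 4]]])

coded = E @ D     # one row per disk, 6 disks total
print(coded[:4])  # rows 0-3: exactly the original chunks
print(coded[4:])  # rows 4-5: parity
```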


So, what does this mean? If each row in the resultant matrix represents a disk, and rows are numbered from 0, then disks 0 to 3 contain the original data and are hence called data disks, while disks 4 and 5 are the parity disks. In case of failures, any 4 of these 6 disks can be used to retrieve the data. Notice that in the steady state we don't need to perform any decoding for reads, since the data disks contain the original data verbatim. This is a nice property of the scheme (such codes are called systematic).


So far so good. But what happens when we lose, say, disks 2 and 3? How do we recover the data?



We will use the power of matrix inversion. If we invert the encoder matrix (giving the decoder matrix) and multiply it by the resultant matrix, we get the original data back. But first, we need to strip the rows belonging to disks 2 and 3 from both matrices. And then,



Here, the first matrix is the inverse of the encoder matrix with rows 2 and 3 stripped; similarly, we stripped rows 2 and 3 from the resultant matrix. Multiplying them gives us back the original 4x4 data matrix. And this is the fundamental idea of erasure coding.
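Continuing the NumPy sketch from the encoding step (same hypothetical encoder), losing disks 2 and 3 and recovering via inversion looks like this:

```python
import numpy as np

# Same data and encoder as in the encoding sketch.
D = np.arange(1, 17).reshape(4, 4)
E = np.vstack([np.eye(4, dtype=int),
               [[1, 1, 1, 1],
                [1, 2, 3, 4]]])
coded = E @ D

# Disks 2 and 3 fail: strip those rows from the encoder and the coded data.
alive = [0, 1, 4, 5]
decoder = np.linalg.inv(E[alive].astype(float))  # the 4x4 decoder matrix

recovered = np.rint(decoder @ coded[alive]).astype(int)
assert (recovered == D).all()  # the original 4 chunks are regenerated
```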


But where did we use the finite field math? Well, the matrices above are just for illustration; in practice, we replace the elements of our matrices with finite field elements and perform the matrix multiplication and inversion using finite field arithmetic.


Knowing the internals is always good, but the good news is that we don't have to implement Reed–Solomon from scratch, as plenty of libraries are already available, including one in the Linux kernel.
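For example, from Python, here is a sketch using the reedsolo package (one option among many; the API shown follows its documentation for recent versions). Note this demo corrects in-place corruption; erasure decoding works similarly:

```python
# pip install reedsolo
from reedsolo import RSCodec

rsc = RSCodec(10)                   # 10 parity bytes: corrects up to 5 bad bytes
coded = rsc.encode(b"hello world")

corrupted = bytearray(coded)
corrupted[0] ^= 0xFF                # corrupt two bytes in transit
corrupted[3] ^= 0xFF

decoded = rsc.decode(corrupted)[0]  # recent versions return (msg, msg+ecc, errata)
assert bytes(decoded) == b"hello world"
```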


Real implementations and the underlying math are a lot more involved, but I have tried to keep things as simple as possible so that readers grasp the concept.


I hope you find it useful :-)

