Tag: Storage

Erasure Coding – a primer

Erasure Coding – a primer

A surefire [sic] way to get to look for another job in IT is to lose important data. Typically if a user in any organisation stores data he or she expects that data to be safe and always retrievable (and as we all know data loss in storage systems is unavoidable). Data also keeps growing, a corollary to Parkinson’s law is that data expands to fill the space available for storage, just like clutter around your house.

Because of the constant growth of data there is a greater need to both protect said data but also to simultaneously store it in a more space efficient way. If you look at large web-scale companies like Google, Facebook, and Amazon they need to store and protect incredible amounts of data, they do however not rely on traditional data protection schemes like RAID because it is simply not a good match with the hard disk capacity increases of late.

Sure sure, but I’m not Google…

Fair point, but take a look at the way modern data architectures are built and applied even in the enterprise space, looking at most hyper-converged infrastructure players for example they typically employ a storage replication scheme to protect data that resides on their platforms, for them they simply cannot not afford the long rebuild times associated with multi-Terabyte hard disks in a RAID based scheme. Same goes for most object storage vendors. As as example let’s take a 1TB disk, it’s typical sequential write sits around 115 MBps, so 1.000.000 MB / 115 MBps = approximately 8700 seconds which is nearly two and a half hours. If you are using 4TB disks then your rebuild time will be at least ten hours. In this case I am even ignoring the RAID calculation that needs to happen simultaneously and the other IO in the system that the storage controllers need to deal with.

RAID 5 protection example.

Let’s say we have 3 HDDs in a RAID 5 configuration, data is spread over 2 drives and the 3rd one is used to store the parity information. This is basically a exclusive or (XOR) function;

Let’s say I have 2 bits of data that I write to the system, disk 1 has the first bit, disk 2 the second bit, and disk 3 holds the parity bit (the XOR calculation). Now I can lose any of the 2 bits (disks) and the system is able to reconstruct the missing bit as demonstrated by the XOR truth table below;

Screen Shot 2016-07-14 at 14.06.39

Let’s say I write bit 1 and bit 0 to the system, 1 is stored on disk A and 0 is stored on disk B, if I lose disk A [1], I still have disk B [0] and the parity disk [1]. According to the table B [0] + parity [1] = 1 thus I can still reconstruct my data.

But as we have established that rebuilding these large disks is unfeasible, what the HCI builders do is replicate all data, typically 3 times, in their architecture as to protect against multiple component failures, this is of course great from an availability point of view but not so much from a usable capacity point of view.

Enter erasure coding.

So from a high level what happens with erasure coding is that when data is written to the system, instead of using RAID or simply replicating it multiple times to different parts of the environment, the system applies slightly more complex mathematical functions (including matrix, and Galois-Field arithmetic*) compared to the simple XOR we saw in RAID (strictly speaking RAID is also an implementation of erasure coding).

There are multiple ways to implement erasure coding of which Reed-Solomon seems to be the most widely adopted one right now, for example Microsoft Azure and Facebook’s cold-storage are said to have implemented it.

Since the calculation of the erasure code is more complex the often quoted drawback is that it is more CPU intensive than RAID. Luckily we have Intel who are not only churning out more capable and efficient CPUs but are also contributing tools, like the Intelligent Storage Acceleration Library (Intel ISA-L) to make implementations more feasible.

As the video above mentions you roughly get 50% more capacity with erasure coding compared to a triple mirrored system.

Erasure Coding 4,2 example.

Erasure codes are typically quite flexible in the way you can implement them, meaning that you can specify (typically as the implementor, not the end-user, but in some cases both) the number of data blocks to parity blocks. This then impacts the protection level and drive/node requirement. For example if you choose to implement a 4,2 scheme, meaning that each file will be split into 4 data chunks and for those 4 chunks 2 parity chunks are calculated, this means that in a 4,2 setup you require 6 drives/nodes.

The logic behind it can seem quite complex, I have linked to a nice video explanation by Backblaze below;

* http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.pdf

Backup is Boring!

Backup is Boring!

Yep, until it’s not.

When I was a consultant at a VAR a couple of years ago I implemented my fair share of backup and recovery solutions, products of different vendors which shall remain nameless, but one thing that always became clear was how excruciatingly painful the processes involved ended up being. Convoluted tape rotation schema’s, figuring out back-up windows in environments that were supposed to be operating in a 24/7 capacity, running out of capacity, missed pickups for offsite storage,… the experience consistently sucked.

I think it’s fair to say that there has not been a lot of innovation in this market for the last decade or so, sure vendors put out new versions of their solutions on a regular basis and some new players have entered the market, but the core concepts have largely remained unchanged. How many times do you sit around at lunch with your colleagues and discuss exciting new developments in the data protection space… exactly…

So when is the “until it’s not” moment then?

I’m obviously biased here but I think this market is ripe for disruption, if we take some (or most) of the pain out of the data protection process and make it a straightforward affair I believe we can bring real value to a lot of people.

Rubrik does this by providing a simple, converged data management platform that combines traditionally disparate backup software pieces (backup SW, backup agents, catalog management, backup proxies,…) and globally deduplicated storage in one easily deployable and scalable package.

No more jumping from interface to interface to configure and manage something that essentially should be a insurance policy for you business. (i.e. the focus should be on recovery, not backup). No more pricing and sizing individual pieces based on guesstimates, rather scale out (and in) if and when needed, all options included in the base package.

Because it is optimized for the modern datacenter (i.e. virtualization, scale-out architectures, hybrid cloud environments, flash based optimizations,…) it is possible to consume datamanagement as a service rather than through manual configuration. All interactions with the solution are available via REST APIs and several other consumption options are already making good use of this via community driven initiatives like the PowerShell Module and the VMware vRO plugin. (more info see please see: https://github.com/rubrikinc )


So essentially giving you the ability to say no to the “we have always done it this way” mantra, it is time to bring (drag?) backup and recovery into the modern age.


Intel and Micron 3D XPoint

Intel and Micron 3D XPoint


My day job is in networking but I do consider myself (on the journey to) a full stack engineer and like to dabble in lot’s of different technologies like, I’m assuming, most of us geeks do. Intel and Micron have been working on a seeming breakthrough that combines memory and storage in one non-volatile device that is cheaper than DRAM (typically computer memory) and faster than NAND (typically a SSD drive).

3D Xpoint

3D Xpoint, as the name implies, is a crosspoint structure, meaning 2 wires crossing each other, with “some material*” in between, it does not use transistors (like DRAM does) which makes it easier to stack (hence the 3D) —> for every 3 lines of metal you get 2 layers of this memory.

Screen Shot 2016-02-13 at 18.09.31.png

The columns contain a memory cell (the green section in the picture above) and a selector (the yellow section in the picture above), connected by perpendicular wires (the greyish sections in the picture above), allowing you to address each column individually by using one wire at the top and one wire at the bottom. These grids can be stacked 3 dimensionally to maximise density.
The memory can be accessed/modified by sending varied voltage to each selector, in contrast DRAM requires a transistor at each memory cell to access or modify it, this results in 3D XPoint being 10x more dense that DRAM and 1000x faster than NAND (at the array level, not at the individual device level).

3D XPoint can be connected via PCIe NVMe and has little wear effect over it’s lifetime compared to NAND. Intel will commercialise this in it’s Optane range both as an SSD disk and as DIMMS. (The difference between Optane and 3D XPoint is that 3D XPoint refers to the type of memory and Optane includes the memory and a controller package).

1000x faster, really?

In reality Intel is getting 7x performance compared to a NAND MLC SSD (on NVMe) today (at 4kB read), that is because of the inefficiencies of the storage stack we have today.

Screen Shot 2016-02-13 at 18.21.27.png

The I/O passes through the filesystem, storage stack, driver, bus/platform link (transfer and protocol i.e. PCIe/NVMe), controller firmware, controller hardware (ASIC), transfer from NAND to the buffers inside the SSD, etc. So 1000x is a theoretical number (and will show up on a lot of vendor marketing slides no doubt) but reality is a bit different.

So focus is and has been on reducing latency, for example work that has been done by moving to NVMe already reduced the controller latency by roughly 20 microseconds (no HBA latency and the command set is much simpler).

Screen Shot 2016-02-13 at 18.25.07

The picture above shows the impact of the bus technology, on the left side you see AHCI (SATA) and on the right NVMe, as you see there is a significant latency difference between the two. NVMe also provides a lot more bandwidth compared to SATA (about 6x more on PCIe NVMe Gen3 and more than 10x on Gen4).

Another thing that is hindering the speed improvements of 3D XPoint is replication latency across nodes (it’s storage so you typically want redundancy). To address this issue work is underway on things like “NVMe over Fabrics” to develop a standard for low overhead replication. Other improvements in the pipe are work on optimising the storage stack, mostly on the OS and driver level. For example, because the paging algorithms today were not designed with SSD in mind they try to optimise for seek time reduction etc, things that are irrelevant here so reducing paging overhead is a possibility.

They are also exploring “Partial synchronous completion”, 3D XPoint is so fast that doing an asynchronous return, i.e. setting up for an interrupt and then waiting for interrupt completion takes more time than polling for data. (we have to ignore queue depth i.e. assume that it will be 1 here).

Screen Shot 2016-02-13 at 19.03.27

Persistent memory

One way to overcome this “it’s only 7x faster problem”, altogether is to move to persistent memory. In other words you skip over they storage stack latency by using 3D XPoint as DIMMs, i.e. for your typical reads and writes there is no software involved, what little latency remains is now caused entirely by the memory and the controller itself.

Screen Shot 2016-02-13 at 19.15.21

To enable this “storage class memory” you need to change/enable some things, like a new programming model, new libraries, new instructions etc. So that’s a little further away but it’s being worked on. What will probably ship this year is the SSD model (the 7x improvement) which is already pretty cool I think.

* It’s not really clear right now what those materials entail exactly which is part of it’s allure I guess 😉