You snooze, you lose.

At the end of October 2016 I wrote a short blog post called “Data on tape is no longer useful”; its premise was that if your historical data is stored offline you can’t readily use it for other value-add services.

More recently Rubrik delivered the ability to take archived backup copies of your data and spin them up on-demand as public cloud resources, a capability called CloudOn. Initially this took VMware based virtual machines, archived them to Amazon S3, and, when the time came, automatically converted those workloads from VMware format and spun them up as AMI based EC2 instances.

Now with the release of version 4.1 we have added support for a similar scenario, but spinning up the workloads on Microsoft Azure instead of AWS. Additionally we added archiving support for Google Cloud Storage, enabling more and more multi-cloud capabilities. (CloudOn for Google Cloud Platform is currently not available.)

Multi-Cloud er… Multi-Pass

Since I’ve just written a post about all the things Rubrik is (was) doing with Microsoft I now need to add Microsoft Azure CloudOn to the list (I snoozed).

The idea is that you add Microsoft Azure Blob Storage as an archive location to one or more of your SLAs. Once the data has been archived off, you can use that archive copy to instantiate your workloads, which were originally running on VMware (VMDK based), on Microsoft Azure as Azure based VMs (VHD based).

You can opt to do this conversion completely on-demand, or choose to auto-convert the latest backup copy to save time during instantiation. The inner workings are a bit different depending on which scenario you choose, but generally speaking Rubrik takes the VMDK file, converts it to a VHD file, and uploads the VHD file to Azure Blob storage as page blobs (as opposed to block blobs, which are typically used for discrete storage objects like JPEGs, log files, etc. that you’d view as a file in your local OS).
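To make the page blob part concrete, here is a minimal sketch of what such an upload looks like with the azure-storage-blob (v12) Python SDK. Rubrik handles this internally; the connection string, container and file names below are placeholders, so treat it purely as an illustration of page blob semantics (512-byte aligned ranges), not of how Rubrik actually does it.

```python
# Illustrative only: uploading a fixed-size VHD as an Azure page blob with the
# azure-storage-blob (v12) SDK. Connection string, container and blob names
# are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(container="archive", blob="converted-vm.vhd")

with open("converted-vm.vhd", "rb") as f:
    f.seek(0, 2)
    size = f.tell()                      # a fixed VHD is already a multiple of 512 bytes
    blob.create_page_blob(size=size)     # allocate the page blob up front
    f.seek(0)
    offset = 0
    while True:
        chunk = f.read(4 * 1024 * 1024)  # upload in 4 MiB, 512-byte aligned ranges
        if not chunk:
            break
        blob.upload_pages(chunk, offset=offset, length=len(chunk))
        offset += len(chunk)
```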

We also added support for Azure Stack in 4.1. In this initial release we provide the same type of functionality as we do with Azure (public cloud), meaning we support Windows, Linux, and SQL Server (customer installed) workloads via the Rubrik Backup Service.

For a broader discussion of Azure Stack (among other things) I suggest listening to this wonderful episode of the Datanauts podcast with Jeffrey Snover, Technical Fellow at Microsoft, Chief Architect for Azure Infrastructure, and the creator of PowerShell.

If you want more details about the 4.1 release, please check out one (or all) of these fine posts:

https://www.penguinpunk.net/blog/rubrik-4-1-released-more-than-you-might-expect/ 
https://blog.edmorgan.info/rubrik/2017/09/27/Rubrik-41.html
https://thevirtualhorizon.com/2017/09/28/announcing-rubrik-4-1-the-microsoft-release/

Rubrik and Microsoft, rest assured.

Rubrik has been working with Microsoft’s solutions in various ways since version 2.3, with the initial support of Microsoft Azure as an archive location for long-term retention data. You could even argue the relationship started earlier than that, since we have supported application-consistent backups for Microsoft applications through our own VSS provider and requester (for Windows OSes on virtual machines) from the beginning.

But for this overview, I’ll cover the Microsoft specifics from version 2.3 through version 4.0, which is currently GA.

Azure Blob storage as an archive location

At the heart of Rubrik sit our SLA policies, which govern how we handle your data end-to-end. As part of an SLA policy you can define archival settings; these determine when we start moving data off the local Rubrik cluster and onto Azure Blob storage.

Azure Blob storage is a massively scalable object storage location for unstructured data. The nice thing about using it as an archive with Rubrik is that, because we index all data, we can search for specific files/folders in this unstructured data and only pull back those specific objects instead of needing to re-hydrate the entire fileset. Furthermore, the data we send to Azure Blob storage is encrypted.

Support for physical Windows and Microsoft SQL Server

In Rubrik CDM version 3.0 we added support for physical Windows and Microsoft SQL Server via the Rubrik Backup Service. The backup service is a very lightweight agent that you install on your physical Windows server; it runs in user space and doesn’t require a reboot, and once installed it is managed throughout its lifecycle by the Rubrik cluster itself. In other words, once the backup service is installed you don’t need to manage it manually. It also covers all applications on the host, meaning it provides both filesystem and SQL backup capabilities.

Rubrik also provides protection and data management of SQL Server databases that are installed across nodes in a Windows Server Failover Cluster. Additionally we support Always On availability groups: Rubrik detects that a database is part of an availability group, and in the event of a failover, protection will be transferred to another availability database in the same availability group. When protection is transferred, the Rubrik cluster transfers the existing metadata for history and data protection to the replacement database.

Rubrik Cloud Cluster in Microsoft Azure

Since Rubrik CDM v3.2 we support running a 4-node (minimum) cluster in Microsoft Azure. We use the DSv2-series VM sizes, which give us 4 vCPUs, 14 GiB RAM, 400 GiB SSD, and a maximum of 8 HDDs per VM.

Through the Rubrik Backup Service we are able to support native Azure workloads running either Windows or Linux, and native SQL Server. Since Cloud Cluster in Azure has the same capabilities, we can also archive to another Azure Blob storage location within Azure for long-term retention data (i.e. move backup data off the DSv2 based instances and onto more cost-effective Blob storage), and even replicate to another Rubrik cluster, either in Azure or another public cloud provider, or even to a physical Rubrik cluster on-premises.

Microsoft SQL Server Live Mount

One of the coolest new features in version 4.0 of Rubrik CDM, in my humble opinion, is the ability to Live Mount SQL databases. Rubrik uses the SSD tier to rapidly materialize any point-in-time copy of the SQL database and then exposes this fully writable DB to the SQL Server as a new database. In other words, you are not consuming any space on your production storage system to do this.

As a SQL DBA you can now easily restore individual objects between the original and the copy. Recovery time is greatly reduced, and the original backup copy is still maintained as an immutable object on Rubrik, safeguarding it against ransomware.

Microsoft Hyper-V

Last, but not least, is our support for Microsoft Hyper-V based workloads, which we achieve by integrating directly with the Hyper-V 2016 WMI-based APIs; this works independently of the underlying storage layer for the broadest possible support.

We leverage Hyper-V’s Resilient Change Tracking (RCT) to perform incremental forever backups. Older versions of Hyper-V are also supported through the use of the Rubrik backup service.

Independent of the source, be it virtual or physical, we can leverage the same SLA policy based system, avoiding the need to set individual, manual backup policies.

 

VMworld US – Sunday

This is my 7th VMworld event, but the first time attending VMworld in the US. I’ve been to Vegas before, but arriving this Saturday evening it felt different somehow; maybe it was the scorching heat or the onslaught of UFC/boxing fans at the MGM Grand, both of which were very “in your face”.

VMware signage welcoming you to Las Vegas

I decided to check out the OpeningActs event at the Beerhaus, which was a pretty neat experience. The first panel session, moderated by John Troyer (aka John Trooper), was on VMware’s continued relevance in today’s rapidly changing tech world, which spurred a lot of Amazon AWS remarks. I think the overall consensus was that VMware might be in “sustainability” mode but that it is still an excellent foundation in a lot of customer environments. “Basic infrastructure” might be boring, but it is of great importance nevertheless. I think the other angle that might have been overlooked a bit here is that Amazon needs this relationship with VMware too; it needs a way into that “boring” enterprise market.

The second panel session was on “How failing made me better”, moderated by Jody Tyrus.
My favourite quote from the session was, and I’m paraphrasing here: “I sometimes still feel like I’m a 12 year old in the body of a 42 year old that is about to be exposed”. Rebecca Fitzhugh wrote an excellent blog on her experience with dealing with failure here.

 

Unfortunately I had to skip the third session because I needed to get to the Expo for a Rubrik all-hands briefing before the Welcome Reception. After the reception I headed to the VMUG party at the House of Blues, which was pretty cool (because they served Stella Artois 😉 ). Michael Dell made an appearance and talked about his love for the VMUG and hinted at some cool Pivotal announcements coming this week.

After the party I needed to get my jet-lagged self to bed. Looking forward to a great VMworld week.

 

 

A year at Rubrik, moving to marketing; et tu, Brute?

I wasn’t really planning on writing a post on this initially but since I’m moving to a (technical) marketing role it seemed only fitting on second thought.
Switching from an SE role to technical marketing seems to typically bring about the “I’m moving to the dark side and betraying my technical roots” blog posts so here goes 😉

Actually I think my previous role as an SE for the EMEA region already had quite some tech marketing aspects to it. I traveled to a bunch of countries in Europe, the Middle East, and Africa spreading the Rubrik gospel, was a speaker at countless events, presented numerous webinars, and very occasionally wrote about some Rubrik related stuff on my personal blog. So when my bosses asked me what I wanted to do next I immediately blurted out “tech evangelism!”.

My role as an EMEA SE mainly revolved around covering the countries where we did not have a local presence yet (did you know EMEA consists of 128 countries? Me neither! But hello frequent flyer status) and I had a blast doing that, but since the company is growing like crazy, i.e. hiring local sales teams everywhere, I naturally started thinking about what could be next as well. I feel extremely lucky, and very humbled, to be joining the amazing Rubrik Tech Marketing team (Chris Wahl, Andrew Miller, Rebecca Fitzhugh) and only hope I can live up to their standard.

…googles “imposter syndrome”…

Andrew Miller already did a great job describing what the tech marketing function is/can be all about here. Or as Matthew Broberg jokingly (but not really though) described it in episode 94 of The Geek Whisperers:

It’s this weird unicorn position that seems to fit in the middle of a Venn diagram between sales, marketing, engineering, and some other stuff that we can’t quite put our finger on.

June also marked my one-year anniversary at Rubrik and, at the risk of sounding insincere, it has been an amazing ride so far. The company has more than doubled in size since I joined, seen massive customer adoption, gone through multiple successful funding rounds, and delivered very significant features release after release.

To infinity and beyond!

 

Having worked at large incumbent companies before I find it quite amazing what you can achieve if you’re not bogged down by big company inertia and office politics. Sticking with the rocket ship analogy imagine building one at a start-up vs. building one at a big established company.

 

On the left you have the potential result after 6 months of trying to build a rocket ship at a large established company, on the right the potential result at a start-up (oops, time to iterate baby!).

 

After 1 year… you get the idea…

All very tongue-in-cheek of course; both have their merits and drawbacks. In the end it’s about finding where you fit best, I think, and I certainly feel I’ve done that.

So in light of the new function, expect me to show up at more events, deliver more content, and be a bit more vocal on social media (but always trying hard to maintain technical integrity).

Rubrik Alta feature spotlight: AWS.

With the announcement of Rubrik CDM version 4, codenamed Alta, we have added tons of new features to the platform, but since most of the release announcements are focused on providing an overview of all the goodies, I wanted to focus more deeply on one specific topic, namely our integration with AWS.

Rubrik has a long history of successfully working with Amazon Web Services; we’ve had integration with Amazon S3 since our first version. In our initial release you could already use Amazon S3 as an archive location for on-premises backups: take local backups of VMware virtual machines, keep them on the local Rubrik cluster for a certain period of time (short-term retention), and put longer-term retention data into Amazon S3. The idea was to leverage cloud storage economics and resiliency for backup data, and at the same time have an active archive for longer-term retention data instead of an offline copy on tape. Additionally, the way our metadata system works allows us to retrieve only the specific bits of data you need to restore, instead of having to pull down the entire VMDK file and incur egress costs that could potentially kill the cloud storage economics benefit.
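To illustrate the general idea of a partial restore from object storage (and not Rubrik’s actual on-S3 layout), here is a minimal Python sketch using boto3 ranged GETs: given an index that maps a file to the byte ranges it occupies inside an archived object, you fetch just those ranges instead of the whole object. The bucket, key, and offsets below are made up.

```python
# Illustration of the concept only: fetch just the byte ranges you need from an
# archived object in S3 instead of downloading the whole thing. Bucket, key and
# offsets are hypothetical.
import boto3

s3 = boto3.client("s3")

def fetch_ranges(bucket, key, ranges):
    """ranges: list of (start, end) byte offsets, inclusive."""
    parts = []
    for start, end in ranges:
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        parts.append(resp["Body"].read())
    return b"".join(parts)

# e.g. restore a single 1 MiB file that lives at offset 42 MiB inside the archive object
data = fetch_ranges("my-archive-bucket", "vm-backup-0001.blob",
                    [(42 * 1024 * 1024, 43 * 1024 * 1024 - 1)])
```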

Also note there is no need to put a gateway device between the Rubrik cluster and Amazon S3; Rubrik natively leverages the S3 APIs.

The ability to archive to Amazon S3 is of course still here in version 4, but now all the supported sources besides VMware ESXi (Microsoft Hyper-V, Nutanix AHV, physical Windows/Linux, native SQL Server, and so on) can also leverage this capability.

Then in Rubrik CDM version 3.2 we added the ability to protect native AWS workloads by having a Rubrik cluster run inside AWS using EC2 for compute and EBS for storage.

We’ll run a 4-node (protecting your data using erasure coding) Rubrik cluster in your preferred AWS location (the Rubrik AMI is uploaded as a private image).
We use the m4.xlarge instance type, with 64 GB RAM, 256 GB SSD (General Purpose SSD, GP2) and 24 TB raw capacity (Throughput Optimized HDD, ST1), resulting in 15 TB usable capacity before deduplication and compression.
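As a rough sketch of what standing up those four nodes could look like with boto3 (the real procedure is driven by Rubrik’s deployment guide; the AMI ID, subnet, security group, and volume size below are placeholders I made up):

```python
# Sketch only: launching four m4.xlarge nodes, each with an st1 data volume.
# AMI ID, subnet, security group and volume size are placeholders; follow the
# official deployment guide for the real procedure.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",          # the private Rubrik AMI shared to your account
    InstanceType="m4.xlarge",
    MinCount=4,
    MaxCount=4,
    SubnetId="subnet-xxxxxxxx",
    SecurityGroupIds=["sg-xxxxxxxx"],
    BlockDeviceMappings=[
        {
            "DeviceName": "/dev/sdf",
            "Ebs": {"VolumeSize": 6144,          # ~6 TB of st1 per node
                    "VolumeType": "st1",
                    "DeleteOnTermination": False},
        }
    ],
)
print([i["InstanceId"] for i in resp["Instances"]])
```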

Once the Cloud Cluster is running you can protect your native AWS workloads using the connector-based approach, i.e. you can protect Windows and Linux filesets, and native SQL Server workloads, in the public cloud.

Additionally, since you can now have both a Rubrik cluster on-premises and a Rubrik Cloud Cluster, you can replicate from your in-house datacenter to your public cloud environment and vice versa, or replicate from one AWS region to another.

Since the Cloud Cluster has the same capabilities as the on-premises one, it can also back up your AWS EC2 workloads and then archive the data to S3, essentially going from EBS storage to S3. (Christopher Nolan really likes this feature.)

Version 4 of Rubrik CDM extends our AWS capabilities by delivering on-demand app mobility, called CloudOn. The idea is that you can now take an application that is running on-premises and move it on the fly to AWS for DR, dev/test, or analytics scenarios.

The way it works: just as since v1, you archive your data to Amazon S3. Once you decide to instantiate a workload in the public cloud, you select “Launch On Cloud” from the Rubrik interface, and a temporary Rubrik node (spun up on-demand in AWS, in the VPC of your choice) converts those VMs into cloud instances (i.e. going from VMware ESXi to AWS AMI images). Once complete, the temporary Rubrik node powers down and is purged.

Rubrik scans the configuration file of a VM to understand its characteristics (compute, memory, storage, etc.) and recommends a compatible cloud instance type so you are not left guessing what resources you need to consume on AWS.
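To give a feel for the kind of mapping involved, here is a toy Python sketch that picks the smallest EC2 instance type covering a source VM’s vCPU and memory. The type table is abbreviated and the selection logic is my own illustration, not Rubrik’s actual algorithm.

```python
# Toy illustration: pick the smallest EC2 instance type that covers the source
# VM's vCPU and memory. Abbreviated type table; not Rubrik's actual logic.
INSTANCE_TYPES = [
    # (name, vCPU, memory GiB), ordered small to large
    ("m4.large",    2,  8),
    ("m4.xlarge",   4, 16),
    ("m4.2xlarge",  8, 32),
    ("m4.4xlarge", 16, 64),
]

def recommend(vm_vcpu: int, vm_mem_gib: float) -> str:
    for name, vcpu, mem in INSTANCE_TYPES:
        if vcpu >= vm_vcpu and mem >= vm_mem_gib:
            return name
    return INSTANCE_TYPES[-1][0]  # fall back to the largest type

print(recommend(4, 12))   # -> m4.xlarge
print(recommend(8, 24))   # -> m4.2xlarge
```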

Alternatively we can also auto-convert the latest snapshot going to S3 so you don’t have to wait for the conversion action.

Data has gravity, meaning that once you accumulate more and more data it starts to make more sense to move the application closer to the data, since the other way around becomes less performant and more and more cost-prohibitive in terms of transport costs. So what if your data sits on-premises but your applications are running in the public cloud?
For instance, let’s say you want to perform business analytics using Amazon QuickSight but your primary data sources are in your private data center. Now you simply archive data to S3 as part of your regular Rubrik SLA (archive data older than x days), pick the point-in-time dataset you are interested in, use Rubrik to “Launch on Cloud”, and point Amazon QuickSight (or any other AWS service) to that particular data source.

Together these AWS integrations allow you to make use of the public cloud on your terms, in an extremely flexible way.

Marrying consumer convenience with enterprise prowess

If you look at the consumer applications we interface with on a daily basis, things like Facebook, Google, Twitter,… these all tend to be very easy to use and understand. Typically little to no explanation is needed on how to use them; you simply sign up and get going.

But behind the covers of these very straightforward interactions lies a pretty complex world of intricate components, a lot of moving parts that make the application tick. For a little insight into the back-end architecture of Facebook, for example, check out these videos: https://developers.facebook.com/videos/f8-2016/inside-facebooks-infrastructure-part-1-the-system-that-serves-billions/

The typical target audience of Facebook I would guess is completely unaware of this, and rightfully so.

Now think about your typical enterprise applications: probably a lot less straightforward to interact with, a lot of nerd knobs to turn, and a lot of certifications to be attained in mastering how to make best use of them.

“Well of course it is a lot more complex, because I need it to be!”  I hear you say, but does it really though?

What if most of the heavy lifting were taken care of by the system itself: internal algorithms that govern the state of the solution and make the application perform how you need it to, minimizing the interaction with the end-user and making that interaction as enjoyable as possible?

That is what the Rubrik Cloud Data Management solution is trying to achieve. Under the hood it is a very, very capable piece of equipment, but most of the complexity that comes with these enterprise capabilities has been automated away, and the little interaction that is left to the end-user is very straightforward, and dare I say enjoyable?

Matching enterprise data management capabilities with the simplicity of a consumer application is a lofty goal, but one worthy to pursue in my humble opinion. After all…

“Simplicity is the Ultimate Sophistication”

 

 

Atlas distributed filesystem, think outside the box.

Rubrik recently presented at Tech Field Day 12 and one of the sessions focused on our distributed filesystem, called Atlas. As one of the SEs at Rubrik I’m in the field every day (proudly) representing my company but also competing with other, more traditional backup and recovery vendors. What is more and more apparent is that these traditional vendors are also going down the appliance route to sell their solution into the market, and as such I sometimes get the pushback from potential customers saying they can also get an appliance-based offer from their current supplier, or not immediately grasping why this model can be beneficial to them.
A couple of things I want to clarify first: when I say “also going down the appliance route” I need to make clear that, for us, this is purely a way to offer the solution to market. There is nothing special about the appliance as such; all of the intelligence in Rubrik’s case lies in the software, and we even recently started to offer a software-only version in the form of a virtual appliance for ROBO use cases.
Secondly, some traditional vendors can indeed deliver their solution in an appliance-based model, be it their own branded one or pre-packaged via a partnership with a traditional hardware vendor. I’m not saying there is something inherently bad about this; simplifying the acquisition of a backup solution via an appliance-based model is great, but there the comparison stops. It will still be a legacy architecture with disparate software components, and typically these components (think media server, database server, search server, storage node, etc.) need individual love and care, daily babysitting if you will, to keep them going.
Lastly, from a component point of view our appliance consists of multiple independent (masterless) nodes that are each capable of running all tasks of the data management solution. In other words there is no need to protect, or indeed worry about, individual software and hardware components, as everything runs distributed and is able to sustain multiple failures while remaining operational.

There is no spoon (box)

So the difference lies in the software architecture, not the packaging; as such we need to look beyond the box itself and dive into why starting from a clustered, distributed system as a base makes much more sense in today’s information era.

The session at TFD12 was presented by Adam Gee and Roland Miller. Adam is the lead of Rubrik’s distributed filesystem, called Atlas. It shares some architectural principles with a previous filesystem Adam worked on while he was at Google, called Colossus; most items you store while using Google services end up on Colossus, and it itself is the successor to the Google File System (GFS), bringing the concept of a masterless cluster to it and making it much more scalable. Not a lot is available on the web in terms of technical details around Colossus, but you can read a high-level article on Wired about it here.

Atlas, which sits at the core of the Rubrik architecture, is a distributed filesystem, built from scratch with the Rubrik data management application in mind. It uses all the distributed local storage (DAS) resources available on all nodes in the cluster and pools them together into a global namespace. As nodes are added to the cluster the global namespace grows automatically, increasing capacity in the cluster. The local storage resources on each node consist of both SSDs and HDDs; the metadata of Atlas (and the metadata of the data management application) is stored in the metadata store (Callisto), which also runs distributed on all nodes in the SSD layer. The nodes communicate internally using RPCs, which are presented to Atlas by the cluster management component (Forge) in a topology-aware manner, giving Atlas the capability to provide data locality. This is needed to ensure that data is spread correctly throughout the cluster for redundancy reasons. For example, assuming we are using a triple mirror, we need to store data on 3 different nodes in the appliance; if the cluster then grows beyond 1 appliance, it makes more sense from a failure-domain point of view to move 1 copy of the data from one of the local nodes to the other appliance.
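A toy Python sketch of that topology-aware placement idea: spread the three copies across as many briks (failure domains) as possible before reusing one. This is my own simplification of the concept, not Atlas code.

```python
# Toy sketch of topology-aware replica placement: put copies on as many briks
# (failure domains) as possible before doubling up within one. Not Atlas code.
from itertools import groupby

def place_replicas(nodes, copies=3):
    """nodes: list of (node_id, brik_id). Returns node_ids chosen for the copies."""
    by_brik = {b: [n for n, _ in grp]
               for b, grp in groupby(sorted(nodes, key=lambda x: x[1]),
                                     key=lambda x: x[1])}
    placement = []
    # First pass: at most one replica per brik, preferring unused failure domains.
    for brik, members in by_brik.items():
        if len(placement) == copies:
            break
        placement.append(members[0])
    # Second pass (e.g. single-brik cluster): fill up with any remaining nodes.
    for node, brik in nodes:
        if len(placement) == copies:
            break
        if node not in placement:
            placement.append(node)
    return placement

# 8 nodes spread over 2 briks -> the replicas land on both appliances
cluster = [(f"node{i}", "brik1" if i < 4 else "brik2") for i in range(8)]
print(place_replicas(cluster))
```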

The system is self-healing in the sense that Forge publishes disk and node health status and Atlas reacts to it; again assuming a triple mirror, if a node or an entire brik (appliance) fails, Atlas will create a new copy of the data on another node to make sure the requested failure tolerance is met. Additionally Atlas runs a background task that checks the CRC of each chunk of data to ensure what you have written to Rubrik is available at time of recovery. See the article How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder on why that could be important.
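The general principle behind that kind of background scrubbing fits in a few lines: store a checksum alongside each chunk, periodically re-read and verify, and repair from a healthy replica on mismatch. zlib.crc32 is a stand-in here, not necessarily the exact checksum Atlas uses.

```python
# Background scrubbing in miniature: keep a CRC per chunk, re-verify later, and
# flag anything that no longer matches (so it can be re-replicated from a good
# copy). zlib.crc32 is just a stand-in checksum.
import zlib

def checksum(chunk: bytes) -> int:
    return zlib.crc32(chunk) & 0xFFFFFFFF

def scrub(chunks: dict, checksums: dict) -> list:
    """Return the ids of chunks whose stored CRC no longer matches the data."""
    return [cid for cid, data in chunks.items()
            if checksum(data) != checksums[cid]]

chunks = {"c1": b"hello", "c2": b"world"}
crcs = {cid: checksum(data) for cid, data in chunks.items()}
chunks["c2"] = b"w0rld"           # simulate silent corruption on disk
print(scrub(chunks, crcs))        # -> ['c2']
```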

The Atlas filesystem was designed with the data management application in mind: essentially the application takes backups and places them on Atlas, building snapshot chains (Rubrik performs an initial full backup and incrementals forever after that). The benefit is that we can instantly materialize any point-in-time snapshot without the need to re-hydrate data.

In the example above you have the first full backup at t0, and then 4 incremental backups after that. Let’s assume you want to instantly recover data at point t3: this is simply a metadata operation, pointing to the last time the blocks making up t3 were mutated; there is no data movement involved.

Taking it a step further, let’s now assume you want to use t3 as the basis and start writing new data to it from that point on. Any new data that you write to it (redirect-on-write) now goes to a log file; no content from the original snapshot is changed, as this needs to be an immutable copy (compliance). The use case here could be copy data management, where you want to present a copy of a dataset to internal dev/test teams.
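Here is a toy Python model of those two ideas: materializing a point in time by walking metadata only, and redirecting new writes to a log so the original snapshot stays immutable. It is a simplification of the concept, not Rubrik internals.

```python
# Toy model: a full backup plus incrementals form a chain; any point in time is
# materialized from metadata alone, and writes to a mounted copy go to a log
# (redirect-on-write) so the underlying snapshot is never modified.

def materialize(chain, t):
    """chain: list of {block_id: data} dicts, chain[0] = full backup at t0."""
    view = {}
    for snapshot in chain[: t + 1]:    # newer blocks overwrite older pointers
        view.update(snapshot)
    return view                        # metadata only: no blocks are copied

class LiveMount:
    """Writable view on top of an immutable snapshot."""
    def __init__(self, base_view):
        self.base = base_view          # never modified
        self.log = {}                  # new writes land here
    def write(self, block_id, data):
        self.log[block_id] = data
    def read(self, block_id):
        return self.log.get(block_id, self.base.get(block_id))

chain = [{"b1": "A0", "b2": "B0"},     # t0: full backup
         {"b2": "B1"},                 # t1: incremental
         {"b1": "A2"},                 # t2
         {"b3": "C3"}]                 # t3
mount = LiveMount(materialize(chain, 3))
mount.write("b1", "dev-change")
print(mount.read("b1"), mount.read("b2"))   # dev-change B1
```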

Atlas also dynamically prioritizes performance for certain operations in the cluster; for example, a backup ingest job will get a higher priority than a background maintenance task. Because each node also has a local SSD drive, Atlas can use it to place critical files on a higher performance tier, and it is also capable of tiering all data, placing hot blocks on SSDs. It also understands that each node has 3 HDDs and will use these to stripe the data of a file across all 3 on a single node to take advantage of the aggregate disk bandwidth, resulting in a performance improvement on large sequential reads and writes by utilizing read-ahead and write buffering respectively.

For more information on Atlas you can find the blogpost Adam Gee wrote on it here, or watch the TFD12 recording here.

Data on tape is no longer useful

Data is valuable; I think most people would agree with that. But not all data is treated equally, and the key is to enable all your data to remain active, in the most economical way possible.

Analyst firm IDC is predicting that the amount of data will more than double every two years until 2020, to an astounding 44 trillion gigabytes. Meanwhile data scientists are finding new and interesting ways to activate this enormous amount of information. IDC furthermore states that 86% of businesses believe all data has value, but at the same time 48% of businesses are not (capable of) storing all available data.

Organisations have a real need to store all this data, including historical information, especially as that data can now be activated through things like big data analytics. One of the sources of data can be your backups, which are typically not active, especially when stored on tape, and even more so when stored on tape and archived off-site.

What we are focusing on at Rubrik is managing the data through its entire lifecycle, independent of its location. We do this in a couple of ways: first we back up your primary data and store it on our appliances (we are both the backup software and the backup target); next, instead of archiving that data off to tape, and essentially rendering it useless (except for slow restores), we can archive/tier it to an object storage system so you can still store it in an economically feasible way.

 

When the data is sitting in the archive (on object storage, in the public cloud, or on NFS, depending on your needs) we can still “enable” it. Right now we can instantly search your data independent of its location (and obviously restore it), but we can also use backup copies of your data to provide them (using what we call Live Mount) to your in-house developers/testers/QA people, or provide copies to your data scientists who want to run analytics jobs against them, all without the need to store these copies on your primary storage system. Compare that to backup copies sitting on tape, in the dark, providing no additional value whatsoever; this way you can still extract value out of your data no matter when you created it.

Erasure Coding – a primer

A surefire way to end up looking for another job in IT is to lose important data. Typically, if a user in any organisation stores data, he or she expects that data to be safe and always retrievable (and as we all know, data loss in storage systems is unavoidable). Data also keeps growing; a corollary to Parkinson’s law is that data expands to fill the space available for storage, just like clutter around your house.

Because of the constant growth of data there is a greater need to both protect said data and simultaneously store it in a more space-efficient way. If you look at large web-scale companies like Google, Facebook, and Amazon, they need to store and protect incredible amounts of data; they do, however, not rely on traditional data protection schemes like RAID, because those are simply not a good match for the hard disk capacity increases of late.

Sure sure, but I’m not Google…

Fair point, but take a look at the way modern data architectures are built and applied even in the enterprise space. Most hyper-converged infrastructure players, for example, typically employ a storage replication scheme to protect data that resides on their platforms; they simply cannot afford the long rebuild times associated with multi-terabyte hard disks in a RAID-based scheme. The same goes for most object storage vendors. As an example, let’s take a 1 TB disk: its typical sequential write speed sits around 115 MBps, so 1,000,000 MB / 115 MBps = approximately 8,700 seconds, which is nearly two and a half hours. If you are using 4 TB disks then your rebuild time will be close to ten hours. In this case I am even ignoring the RAID calculation that needs to happen simultaneously and the other IO in the system that the storage controllers need to deal with.
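The back-of-the-envelope numbers from that paragraph, as a tiny calculation (best case: a constant sequential write rate, ignoring parity computation and any other I/O the controllers have to service):

```python
# Best-case rebuild time at a constant sequential write rate; parity math and
# competing I/O are ignored, so real-world rebuilds take even longer.
def rebuild_hours(capacity_tb: float, write_mbps: float = 115.0) -> float:
    mb = capacity_tb * 1_000_000          # TB expressed in MB (decimal units)
    return mb / write_mbps / 3600

for size in (1, 4, 8):
    print(f"{size} TB disk: ~{rebuild_hours(size):.1f} hours")
# 1 TB disk: ~2.4 hours
# 4 TB disk: ~9.7 hours
# 8 TB disk: ~19.3 hours
```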

RAID 5 protection example.

Let’s say we have 3 HDDs in a RAID 5 configuration: data is spread over 2 drives and the 3rd one is used to store the parity information. This is basically an exclusive or (XOR) function:

Let’s say I have 2 bits of data that I write to the system: disk 1 has the first bit, disk 2 the second bit, and disk 3 holds the parity bit (the XOR calculation). Now I can lose any one of the disks and the system is able to reconstruct the missing bit, as demonstrated by the XOR truth table below:

A | B | A XOR B (parity)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0

Let’s say I write bit 1 and bit 0 to the system: 1 is stored on disk A and 0 is stored on disk B. If I lose disk A [1], I still have disk B [0] and the parity disk [1]. According to the table, B [0] XOR parity [1] = 1, thus I can still reconstruct my data.
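The same example in a few lines of Python, extended from single bits to byte blocks (XOR works the same way per byte): the parity is the XOR of the data blocks, and a lost block is rebuilt by XOR-ing the survivors.

```python
# RAID 5 parity in miniature: parity = XOR of the data blocks, and a lost block
# is recovered by XOR-ing the surviving block(s) with the parity.
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

disk_a = b"\x01"                          # bit 1 from the example
disk_b = b"\x00"                          # bit 0
parity = xor_blocks(disk_a, disk_b)       # -> b"\x01"

# Disk A dies: rebuild it from disk B and the parity.
rebuilt_a = xor_blocks(disk_b, parity)
assert rebuilt_a == disk_a
print(rebuilt_a)                          # b'\x01'
```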

But as we have established that rebuilding these large disks is unfeasible, what the HCI builders do is replicate all data, typically 3 times, in their architecture so as to protect against multiple component failures. This is of course great from an availability point of view, but not so much from a usable capacity point of view.

Enter erasure coding.

So, from a high level, what happens with erasure coding is that when data is written to the system, instead of using RAID or simply replicating it multiple times to different parts of the environment, the system applies slightly more complex mathematical functions (including matrix and Galois field arithmetic*) compared to the simple XOR we saw in RAID (strictly speaking, RAID is also an implementation of erasure coding).

There are multiple ways to implement erasure coding, of which Reed-Solomon seems to be the most widely adopted one right now; for example, Microsoft Azure and Facebook’s cold storage are said to have implemented it.

Since the calculation of the erasure code is more complex, the often-quoted drawback is that it is more CPU intensive than RAID. Luckily we have Intel, who are not only churning out more capable and efficient CPUs but are also contributing tools, like the Intelligent Storage Acceleration Library (Intel ISA-L), to make implementations more feasible.

As the video above mentions you roughly get 50% more capacity with erasure coding compared to a triple mirrored system.

Erasure Coding 4,2 example.

Erasure codes are typically quite flexible in the way you can implement them, meaning that you can specify (typically as the implementor, not the end-user, but in some cases both) the ratio of data blocks to parity blocks. This then impacts the protection level and the drive/node requirement. For example, if you choose to implement a 4,2 scheme, each file will be split into 4 data chunks, and for those 4 chunks 2 parity chunks are calculated; this means that a 4,2 setup requires 6 drives/nodes and can survive the loss of any 2 of them.
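For reference, the raw space-efficiency math for a k,m scheme versus n-way mirroring, in a couple of lines (the exact savings you see in practice depend on the scheme chosen):

```python
# Usable-capacity comparison: a k+m erasure code stores k data chunks plus m
# parity chunks (usable fraction k / (k + m)), while an n-way mirror stores
# every byte n times (usable fraction 1 / n). A 4,2 code and a triple mirror
# both survive any two simultaneous failures.
def usable_fraction_ec(k: int, m: int) -> float:
    return k / (k + m)

def usable_fraction_mirror(copies: int) -> float:
    return 1 / copies

print(f"4,2 erasure coding : {usable_fraction_ec(4, 2):.0%} usable")   # 67%
print(f"triple mirror      : {usable_fraction_mirror(3):.0%} usable")  # 33%
```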

The logic behind it can seem quite complex; I have linked to a nice video explanation by Backblaze below:

* http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.pdf

Backup is Boring!

Yep, until it’s not.

When I was a consultant at a VAR a couple of years ago, I implemented my fair share of backup and recovery solutions, products of different vendors which shall remain nameless, but one thing that always became clear was how excruciatingly painful the processes involved ended up being. Convoluted tape rotation schemas, figuring out backup windows in environments that were supposed to be operating in a 24/7 capacity, running out of capacity, missed pickups for offsite storage,… the experience consistently sucked.

I think it’s fair to say that there has not been a lot of innovation in this market for the last decade or so. Sure, vendors put out new versions of their solutions on a regular basis and some new players have entered the market, but the core concepts have largely remained unchanged. How many times do you sit around at lunch with your colleagues and discuss exciting new developments in the data protection space… exactly…

So when is the “until it’s not” moment then?

I’m obviously biased here, but I think this market is ripe for disruption. If we take some (or most) of the pain out of the data protection process and make it a straightforward affair, I believe we can bring real value to a lot of people.

Rubrik does this by providing a simple, converged data management platform that combines traditionally disparate backup software pieces (backup SW, backup agents, catalog management, backup proxies,…) and globally deduplicated storage in one easily deployable and scalable package.

No more jumping from interface to interface to configure and manage something that should essentially be an insurance policy for your business (i.e. the focus should be on recovery, not backup). No more pricing and sizing individual pieces based on guesstimates; rather, scale out (and in) if and when needed, with all options included in the base package.

Because it is optimized for the modern datacenter (i.e. virtualization, scale-out architectures, hybrid cloud environments, flash-based optimizations,…) it is possible to consume data management as a service rather than through manual configuration. All interactions with the solution are available via REST APIs, and several other consumption options are already making good use of this via community-driven initiatives like the PowerShell module and the VMware vRO plugin (for more info please see: https://github.com/rubrikinc ).
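As a flavour of what driving the cluster over REST can look like from plain Python, here is an illustrative sketch. The endpoint paths are assumptions based on the v1 API pattern rather than authoritative documentation, and the host and credentials are placeholders; in practice, check the API docs or use the PowerShell module / vRO plugin mentioned above.

```python
# Illustrative only: the endpoint paths below are assumptions based on the v1
# API pattern, not authoritative; host and credentials are placeholders.
import requests

RUBRIK = "https://rubrik.example.com"

# Obtain an API token for subsequent calls (assumed /api/v1/session endpoint).
session = requests.post(f"{RUBRIK}/api/v1/session",
                        auth=("admin", "password"), verify=False)
token = session.json()["token"]
headers = {"Authorization": f"Bearer {token}"}

# List protected VMware VMs (assumed /api/v1/vmware/vm endpoint).
vms = requests.get(f"{RUBRIK}/api/v1/vmware/vm",
                   headers=headers, verify=False).json()
for vm in vms.get("data", []):
    print(vm.get("name"), vm.get("effectiveSlaDomainName"))
```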

So essentially this gives you the ability to say no to the “we have always done it this way” mantra; it is time to bring (drag?) backup and recovery into the modern age.