Category: Storage

Atlas distributed filesystem, think outside the box.

Rubrik recently presented at Tech Field Day 12 and one of the sessions focused on our distributed filesystem called Atlas. As one of the SEs at Rubrik I’m in the field every day (proudly) representing my company but also competing with other, more traditional backup and recovery vendors. What is becoming more and more apparent is that these traditional vendors are also going down the appliance route to sell their solution into the market. As a result I sometimes get pushback from potential customers who say they can also get an appliance based offer from their current supplier, or who do not immediately grasp why this model can be beneficial to them.
A couple of things I wanted to clarify first. When I say “also down the appliance route” I need to make clear that for us this is purely a way to bring the solution to market. There is nothing special about the appliance as such; all of the intelligence in Rubrik’s case lies in the software, and we recently even started to offer a software only version in the form of a virtual appliance for ROBO use cases.
Secondly, some traditional vendors can indeed deliver their solution in an appliance based model, be it their own branded one, or pre-packaged via a partnership with a traditional hardware vendor. I’m not saying there is something inherently bad about this, simplifying the acquisition of a backup solution via an appliance based model is great, but there the comparison stops. It will still be a legacy architecture with disparate software components (think media server, database server, search server, storage node, etc.) that typically need individual love and care, daily babysitting if you will, to keep them going.
Lastly from a component point of view our appliance consists of multiple independent (masterless) nodes that are each capable of running all tasks of the data management solution, in other words there is no need to protect, or indeed worry about, individual software and hardware components as everything is running distributed and able to sustain multiple failures while remaining operational.

There is no spoon (box)

So the difference lies in the software architecture, not the packaging. As such we need to look beyond the box itself and dive into why starting from a clustered distributed system as a base makes much more sense in today’s information era.

The session at TFD12 was presented by Adam Gee and Roland Miller. Adam is the lead of Rubrik’s distributed filesystem called Atlas, which shares some architectural principles with Colossus, a filesystem Adam previously worked on while he was at Google. Most items you store while using Google services end up on Colossus, which is itself the successor to the Google File System (GFS), bringing the concept of a masterless cluster to it and making it much more scalable. Not a lot is available on the web in terms of technical details around Colossus, but you can read a high level article on Wired about it here.

Atlas, which sits at the core of the Rubrik architecture, is a distributed filesystem, built from scratch with the Rubrik data management application in mind. It uses all local storage (DAS) resources available on all nodes in the cluster and pools them together into a global namespace. As nodes are added to the cluster the global namespace grows automatically, increasing capacity in the cluster. The local storage resources on each node consist of both SSDs and HDDs. The metadata of Atlas (and the metadata of the data management application) is stored in the metadata store (Callisto), which also runs distributed on all nodes in the SSD layer. The nodes communicate internally using RPC and are presented to Atlas by the cluster management component (Forge) in a topology aware manner, giving Atlas the capability to provide data locality. This is needed to ensure that data is spread correctly throughout the cluster for redundancy reasons. For example, assuming we are using triple mirror, we need to store data on 3 different nodes in the appliance. If the cluster then grows beyond 1 appliance, it makes more sense from a failure domain point of view to move 1 copy of the data from one of the local nodes to the other appliance.
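
To make the failure domain idea a bit more tangible, here is a minimal Python sketch of topology aware replica placement. It is purely an illustration: the function name place_replicas and the node and brik labels are made up for this example, and it does not claim to reflect how Atlas or Forge actually decide on placement.

```python
from collections import defaultdict
from itertools import zip_longest


def place_replicas(nodes, copies=3):
    """Pick `copies` distinct nodes, spread over as many appliances as possible.

    `nodes` is a list of (node_name, appliance_name) tuples (hypothetical labels).
    """
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")

    # Group the nodes per appliance (the failure domain in this sketch).
    by_appliance = defaultdict(list)
    for node, appliance in nodes:
        by_appliance[appliance].append(node)

    # Round-robin over the appliances: first node of each, then second, ...
    interleaved = [
        node
        for group in zip_longest(*by_appliance.values())
        for node in group
        if node is not None
    ]
    return interleaved[:copies]


single_brik = [("n1", "brik-a"), ("n2", "brik-a"), ("n3", "brik-a"), ("n4", "brik-a")]
two_briks = single_brik + [("n5", "brik-b"), ("n6", "brik-b")]

print(place_replicas(single_brik))  # all 3 copies on different nodes of brik-a
print(place_replicas(two_briks))    # 2 copies on brik-a, 1 copy on brik-b
```

With a single appliance the three copies simply land on three different nodes. Once a second appliance is present, one of the copies ends up there, which is the behaviour described in the triple mirror example.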

screen-shot-2016-11-19-at-18-58-09

The system is self healing in the way that Forge publishes the disk and node health status and Atlas can react to that, again assuming triple mirror, if a node or entire brik (appliance) fails Atlas will create a new copy of the data on another node to make sure the requested failure tolerance is met. Additionally Atlas also runs a background task to check the CRC of each chunk of data to ensure what you have written to Rubrik is available in time of recovery. See the article How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder on why that could be important.
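
The background CRC check can be sketched in Python as follows. This is only an illustration of the general technique: the helper names write_chunk and scrub and the dictionary based store are assumptions for the example, and the post does not say which CRC variant Atlas uses (CRC32 from Python's zlib module is used here).

```python
import zlib


def write_chunk(store, chunk_id, data):
    """Store the chunk together with the CRC32 computed at write time."""
    store[chunk_id] = (data, zlib.crc32(data))


def scrub(store):
    """Return the ids of chunks whose current CRC32 no longer matches the stored one."""
    return [cid for cid, (data, crc) in store.items() if zlib.crc32(data) != crc]


store = {}
write_chunk(store, "chunk-1", b"backup data")
write_chunk(store, "chunk-2", b"more backup data")

# Simulate silent corruption: flip one bit in chunk-2 without updating its CRC.
data, crc = store["chunk-2"]
store["chunk-2"] = (bytes([data[0] ^ 0x01]) + data[1:], crc)

print(scrub(store))  # only the corrupted chunk is reported
```

A chunk flagged by such a check could then be repaired from one of the other copies, in the same way a failed node is handled.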

screen-shot-2016-11-19-at-19-11-46

The Atlas filesystem was designed with the data management application in mind. Essentially the application takes backups and places them on Atlas, building snapshot chains (Rubrik performs an initial full backup and incremental forever after that). The benefit is that we can instantly materialize any point in time snapshot without the need to re-hydrate data.

screen-shot-2016-11-19-at-19-44-00

In the example above you have the first full backup at t0, and then 4 incremental backups after that. Let’s assume you want to instantly recover data at point t3. This is simply a metadata operation that follows pointers to the blocks making up t3 as of the last time they were mutated; there is no data movement involved.

Taking it a step further let’s now assume you want to use t3 as the basis and start writing new data to it from that point on. Any new data that you write to it (redirect-on-write) now goes to a log file, no content from the original snapshot is changed as this needs to be an immutable copy (compliance). The use case here could be copy data management where you want to present a copy of a dataset to internal dev/test teams.
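
The two ideas above (materializing a point in time through pointers, and redirect-on-write on top of an immutable snapshot) can be sketched in Python. This is a toy model under my own assumptions: the class names SnapshotChain and WritableClone and the block labels are hypothetical, and the real on-disk format of Atlas is not covered here.

```python
class SnapshotChain:
    """A full backup at index 0 followed by incrementals; each entry holds only changed blocks."""

    def __init__(self):
        self.snapshots = []

    def add(self, changed_blocks):
        self.snapshots.append(dict(changed_blocks))

    def materialize(self, t):
        """Build block -> snapshot index pointers for point in time t (metadata only)."""
        pointers = {}
        for index in range(t + 1):
            for block in self.snapshots[index]:
                pointers[block] = index  # later snapshots override earlier ones
        return pointers

    def read(self, t, block):
        return self.snapshots[self.materialize(t)[block]][block]


class WritableClone:
    """Redirect-on-write view on top of an immutable snapshot."""

    def __init__(self, chain, t):
        self.chain = chain
        self.t = t
        self.log = {}  # new writes land here, never in the snapshot

    def write(self, block, data):
        self.log[block] = data

    def read(self, block):
        if block in self.log:
            return self.log[block]
        return self.chain.read(self.t, block)


chain = SnapshotChain()
chain.add({0: "a0", 1: "b0", 2: "c0"})  # t0: full backup
chain.add({1: "b1"})                    # t1: incremental
chain.add({2: "c2"})                    # t2: incremental
chain.add({0: "a3"})                    # t3: incremental
chain.add({1: "b4"})                    # t4: incremental

print(chain.materialize(3))             # which snapshot holds each block as of t3

clone = WritableClone(chain, 3)
clone.write(1, "dev")                   # goes to the log, snapshot stays untouched
print(clone.read(1), chain.read(3, 1))
```

Materializing t3 only builds a small pointer map, and the write from the dev/test clone never touches the blocks that belong to the snapshot itself.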

screen-shot-2016-11-19-at-19-55-54

Atlas also dynamically provides performance for certain operations in the cluster, for example a backup ingest job will get a higher priority than a background maintenance task. Because each node also has a local SSD drive Atlas can use this to place critical files on a higher performance tier and is also capable of tiering all data and placing hot blocks on SSDs. It also understands that each node has 3 HDDs and will use these to stripe the data of a file across all 3 on a single node to take advantage of the aggregate disk bandwidth, resulting in a performance improvement on large sequential reads and writes by utilizing read-ahead and write buffering respectively.

For more information on Atlas you can find the blogpost Adam Gee wrote on it here, or watch the TFD12 recording here.

Erasure Coding – a primer

A surefire [sic] way to get to look for another job in IT is to lose important data. Typically if a user in any organisation stores data he or she expects that data to be safe and always retrievable (and as we all know data loss in storage systems is unavoidable). Data also keeps growing, a corollary to Parkinson’s law is that data expands to fill the space available for storage, just like clutter around your house.

Because of the constant growth of data there is a greater need to both protect said data but also to simultaneously store it in a more space efficient way. If you look at large web-scale companies like Google, Facebook, and Amazon they need to store and protect incredible amounts of data, they do however not rely on traditional data protection schemes like RAID because it is simply not a good match with the hard disk capacity increases of late.

Sure sure, but I’m not Google…

Fair point, but take a look at the way modern data architectures are built and applied even in the enterprise space. Most hyper-converged infrastructure players, for example, typically employ a storage replication scheme to protect data that resides on their platforms, because they simply cannot afford the long rebuild times associated with multi-terabyte hard disks in a RAID based scheme. Same goes for most object storage vendors. As an example let’s take a 1TB disk, its typical sequential write speed sits around 115 MBps, so 1,000,000 MB / 115 MBps = approximately 8700 seconds, which is nearly two and a half hours. If you are using 4TB disks then your rebuild time will be close to ten hours. In this case I am even ignoring the RAID calculation that needs to happen simultaneously and the other IO in the system that the storage controllers need to deal with.
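
The back-of-the-envelope calculation can be reproduced with a few lines of Python. The function name rebuild_hours is just something I made up for the example, and like the text it only looks at raw sequential write speed, ignoring parity calculations and other I/O.

```python
def rebuild_hours(capacity_tb, write_mb_per_s=115):
    """Best-case time to rewrite a whole disk at a constant sequential write speed."""
    capacity_mb = capacity_tb * 1_000_000  # decimal units, 1 TB = 1,000,000 MB
    return capacity_mb / write_mb_per_s / 3600


for tb in (1, 4):
    print(f"{tb} TB disk: roughly {rebuild_hours(tb):.1f} hours")
```

The rebuild time grows linearly with capacity while the write speed of a single spindle stays more or less the same, which is why this gets worse with every new disk generation.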

RAID 5 protection example.

Let’s say we have 3 HDDs in a RAID 5 configuration, data is spread over 2 drives and the 3rd one is used to store the parity information. This is basically an exclusive or (XOR) function:

Let’s say I have 2 bits of data that I write to the system, disk 1 has the first bit, disk 2 the second bit, and disk 3 holds the parity bit (the XOR calculation). Now I can lose any of the 2 bits (disks) and the system is able to reconstruct the missing bit as demonstrated by the XOR truth table below;

Screen Shot 2016-07-14 at 14.06.39

Let’s say I write bit 1 and bit 0 to the system, 1 is stored on disk A and 0 is stored on disk B, if I lose disk A [1], I still have disk B [0] and the parity disk [1]. According to the table B [0] + parity [1] = 1 thus I can still reconstruct my data.
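
The same reasoning in Python, first with the two single bits from the example and then with whole blocks of bytes. The variable names are mine and only serve the illustration.

```python
def parity(*bits):
    """XOR all given bits together."""
    result = 0
    for bit in bits:
        result ^= bit
    return result


# The single-bit example from the text.
disk_a, disk_b = 1, 0
disk_p = parity(disk_a, disk_b)      # parity disk holds 1
rebuilt_a = parity(disk_b, disk_p)   # disk A lost: rebuild it from B and parity
print(disk_p, rebuilt_a)

# The same idea applied byte by byte to whole blocks.
block_a = bytes([0b10110010, 0x0F, 0xA5])
block_b = bytes([0b01100001, 0xF0, 0x5A])
parity_block = bytes(x ^ y for x, y in zip(block_a, block_b))
rebuilt_block_a = bytes(x ^ y for x, y in zip(block_b, parity_block))
print(rebuilt_block_a == block_a)
```

Because XOR is its own inverse, any one of the three values can be recomputed from the other two, which is exactly what happens during a RAID 5 rebuild.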

But as we have established that rebuilding these large disks is unfeasible, what the HCI builders do is replicate all data, typically 3 times, in their architecture so as to protect against multiple component failures. This is of course great from an availability point of view but not so much from a usable capacity point of view.

Enter erasure coding.

So from a high level what happens with erasure coding is that when data is written to the system, instead of using RAID or simply replicating it multiple times to different parts of the environment, the system applies slightly more complex mathematical functions (including matrix, and Galois-Field arithmetic*) compared to the simple XOR we saw in RAID (strictly speaking RAID is also an implementation of erasure coding).

There are multiple ways to implement erasure coding of which Reed-Solomon seems to be the most widely adopted one right now, for example Microsoft Azure and Facebook’s cold-storage are said to have implemented it.

Since the calculation of the erasure code is more complex the often quoted drawback is that it is more CPU intensive than RAID. Luckily we have Intel who are not only churning out more capable and efficient CPUs but are also contributing tools, like the Intelligent Storage Acceleration Library (Intel ISA-L) to make implementations more feasible.

As the video above mentions you roughly get 50% more capacity with erasure coding compared to a triple mirrored system.

Erasure Coding 4,2 example.

Erasure codes are typically quite flexible in the way you can implement them, meaning that you can specify (typically as the implementor, not the end-user, but in some cases both) the number of data blocks to parity blocks. This then impacts the protection level and drive/node requirement. For example if you choose to implement a 4,2 scheme, meaning that each file will be split into 4 data chunks and for those 4 chunks 2 parity chunks are calculated, this means that in a 4,2 setup you require 6 drives/nodes.
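
To compare the drive requirement and capacity overhead of such a scheme with plain replication, here is a small Python sketch. The function name scheme_stats is hypothetical, and triple mirroring is modelled as 1 data copy plus 2 extra copies so that both can go through the same arithmetic.

```python
def scheme_stats(data_chunks, parity_chunks):
    """Basic properties of a scheme with k data chunks and m parity (or extra copy) chunks."""
    total = data_chunks + parity_chunks
    return {
        "drives_needed": total,
        "failures_tolerated": parity_chunks,
        "raw_per_usable": total / data_chunks,   # raw capacity consumed per usable unit
        "usable_fraction": data_chunks / total,  # share of raw capacity that holds data
    }


ec_4_2 = scheme_stats(4, 2)
triple_mirror = scheme_stats(1, 2)

print("4,2 erasure coding:", ec_4_2)
print("triple mirror:     ", triple_mirror)
```

Both setups survive the loss of 2 drives/nodes, but the 4,2 scheme consumes 1.5 units of raw capacity per usable unit where the triple mirror consumes 3.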

The logic behind it can seem quite complex, I have linked to a nice video explanation by Backblaze below;

* http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.pdf

Backup is Boring!

Yep, until it’s not.

When I was a consultant at a VAR a couple of years ago I implemented my fair share of backup and recovery solutions, products of different vendors which shall remain nameless, but one thing that always became clear was how excruciatingly painful the processes involved ended up being. Convoluted tape rotation schemes, figuring out backup windows in environments that were supposed to be operating in a 24/7 capacity, running out of capacity, missed pickups for offsite storage,… the experience consistently sucked.

I think it’s fair to say that there has not been a lot of innovation in this market for the last decade or so, sure vendors put out new versions of their solutions on a regular basis and some new players have entered the market, but the core concepts have largely remained unchanged. How many times do you sit around at lunch with your colleagues and discuss exciting new developments in the data protection space… exactly…

So when is the “until it’s not” moment then?

I’m obviously biased here but I think this market is ripe for disruption, if we take some (or most) of the pain out of the data protection process and make it a straightforward affair I believe we can bring real value to a lot of people.

Rubrik does this by providing a simple, converged data management platform that combines traditionally disparate backup software pieces (backup SW, backup agents, catalog management, backup proxies,…) and globally deduplicated storage in one easily deployable and scalable package.

No more jumping from interface to interface to configure and manage something that essentially should be an insurance policy for your business (i.e. the focus should be on recovery, not backup). No more pricing and sizing individual pieces based on guesstimates, rather scale out (and in) if and when needed, all options included in the base package.

Because it is optimized for the modern datacenter (i.e. virtualization, scale-out architectures, hybrid cloud environments, flash based optimizations,…) it is possible to consume data management as a service rather than through manual configuration. All interactions with the solution are available via REST APIs and several other consumption options are already making good use of this via community driven initiatives like the PowerShell Module and the VMware vRO plugin (for more info please see: https://github.com/rubrikinc ).

peter-gibbons2

So essentially giving you the ability to say no to the “we have always done it this way” mantra, it is time to bring (drag?) backup and recovery into the modern age.

 

Intel and Micron 3D XPoint

Introduction

My day job is in networking but I do consider myself (on the journey to) a full stack engineer and like to dabble in lots of different technologies like, I’m assuming, most of us geeks do. Intel and Micron have been working on what seems to be a breakthrough that combines memory and storage in one non-volatile device that is cheaper than DRAM (typically computer memory) and faster than NAND (typically an SSD).

3D XPoint

3D XPoint, as the name implies, is a crosspoint structure, meaning 2 wires crossing each other with “some material*” in between. It does not use transistors (like DRAM does), which makes it easier to stack (hence the 3D): for every 3 lines of metal you get 2 layers of this memory.

Screen Shot 2016-02-13 at 18.09.31.png

The columns contain a memory cell (the green section in the picture above) and a selector (the yellow section in the picture above), connected by perpendicular wires (the greyish sections in the picture above), allowing you to address each column individually by using one wire at the top and one wire at the bottom. These grids can be stacked 3 dimensionally to maximise density.
The memory can be accessed/modified by sending varied voltage to each selector. In contrast, DRAM requires a transistor at each memory cell to access or modify it. This results in 3D XPoint being 10x more dense than DRAM and 1000x faster than NAND (at the array level, not at the individual device level).

3D XPoint can be connected via PCIe NVMe and has little wear effect over its lifetime compared to NAND. Intel will commercialise this in its Optane range both as an SSD and as DIMMs. (The difference between Optane and 3D XPoint is that 3D XPoint refers to the type of memory and Optane includes the memory and a controller package).

1000x faster, really?

In reality Intel is getting 7x performance compared to a NAND MLC SSD (on NVMe) today (at 4kB read), that is because of the inefficiencies of the storage stack we have today.

Screen Shot 2016-02-13 at 18.21.27.png

The I/O passes through the filesystem, storage stack, driver, bus/platform link (transfer and protocol i.e. PCIe/NVMe), controller firmware, controller hardware (ASIC), transfer from NAND to the buffers inside the SSD, etc. So 1000x is a theoretical number (and will show up on a lot of vendor marketing slides no doubt) but reality is a bit different.

So focus is and has been on reducing latency, for example work that has been done by moving to NVMe already reduced the controller latency by roughly 20 microseconds (no HBA latency and the command set is much simpler).

Screen Shot 2016-02-13 at 18.25.07

The picture above shows the impact of the bus technology, on the left side you see AHCI (SATA) and on the right NVMe, as you see there is a significant latency difference between the two. NVMe also provides a lot more bandwidth compared to SATA (about 6x more on PCIe NVMe Gen3 and more than 10x on Gen4).

Another thing that is hindering the speed improvements of 3D XPoint is replication latency across nodes (it’s storage so you typically want redundancy). To address this issue work is underway on things like “NVMe over Fabrics” to develop a standard for low overhead replication. Other improvements in the pipe are work on optimising the storage stack, mostly on the OS and driver level. For example, because the paging algorithms today were not designed with SSD in mind they try to optimise for seek time reduction etc, things that are irrelevant here so reducing paging overhead is a possibility.

They are also exploring “Partial synchronous completion”, 3D XPoint is so fast that doing an asynchronous return, i.e. setting up for an interrupt and then waiting for interrupt completion takes more time than polling for data. (we have to ignore queue depth i.e. assume that it will be 1 here).

Screen Shot 2016-02-13 at 19.03.27

Persistent memory

One way to overcome this “it’s only 7x faster” problem altogether is to move to persistent memory. In other words you skip over the storage stack latency by using 3D XPoint as DIMMs, i.e. for your typical reads and writes there is no software involved, what little latency remains is now caused entirely by the memory and the controller itself.

Screen Shot 2016-02-13 at 19.15.21

To enable this “storage class memory” you need to change/enable some things, like a new programming model, new libraries, new instructions etc. So that’s a little further away but it’s being worked on. What will probably ship this year is the SSD model (the 7x improvement) which is already pretty cool I think.

* It’s not really clear right now what those materials entail exactly, which is part of its allure I guess 😉

 

vSphere hypervisor-based replication

vSphere Replication is VMware’s hypervisor-based (as opposed to storage array based) asynchronous replication solution (minimum RPO of 15 minutes). It works at the virtual machine (VMDK) level, whereas storage array replication usually works at the datastore (VMFS) level (one of the reasons why vVols will be interesting going forward).

Since vSphere replication happens at the host level it is storage agnostic, meaning that you can replicate between different storage arrays, for instance from your enterprise level array in production to a cheaper array in the DR site, or even from a storage array to local disk. Obviously VSAN could also be a good replication partner use case.

Within the RPO you as an admin specify (again 15 minutes is the minimum here, and you can set it per VMDK) the vSphere replication engine looks at the changed blocks that need to be sent over the network to the other site.

Now sending (potentially) lots of data over a network can be tricky depending on bandwidth and latency. vSphere replication uses CUBIC TCP to better cope with high bandwidth links where you potentially have lots of bandwidth but the round trip time is high. It is mainly about optimizing the congestion control algorithm in order to avoid low link utilization because of the way TCP generally works (see my post on TCP throughput for more info).
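
Why a high round trip time hurts can be shown with some simple arithmetic: a TCP sender can have at most one window of data in flight per round trip, so throughput is capped at window size divided by RTT. The Python sketch below illustrates that general relationship. It is not a model of CUBIC itself, and the function names and example numbers are my own.

```python
def max_throughput_mbps(window_bytes, rtt_ms):
    """Upper bound on TCP throughput for a given window size and round trip time."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1_000_000


def window_needed_bytes(link_mbps, rtt_ms):
    """Bandwidth-delay product: the window needed to keep the link full."""
    return link_mbps * 1_000_000 / 8 * (rtt_ms / 1000)


# A classic 64 KB window over a link with a 100 ms round trip time.
print(f"{max_throughput_mbps(65535, 100):.2f} Mbps")

# Window needed to fill a 1 Gbps link with the same round trip time.
print(f"{window_needed_bytes(1000, 100) / 1_000_000:.1f} MB")
```

On such a "long fat" link the window has to grow very large before the link is used efficiently, and how quickly the window grows (and recovers after packet loss) is exactly what the congestion control algorithm determines.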

In terms of bandwidth requirements you need to take into account the way vSphere replication works on a per VM model, it will not consume all the bandwidth that is available but depending on the data set, change rate and your RPO will intelligently decide what to replicate and when. There is a KB article that goes into detail about calculating the bandwidth requirements.

Screen Shot 2013-10-31 at 4.53.48 PM

It is included with vSphere Essentials Plus or higher. You need to download a single virtual appliance that performs a dual role: the vSphere Replication Management Server (VRMS) and the vSphere Replication Server (VRS).

You can only have one VRMS per site but you can have multiple VRS’s (up to 10). The VRMS performs configuration management and the VRS manages replica instances.

Initially you select the source and target location and VR will scan the entire disk on both sites (you could already have a copy of the VM sitting in the target site if you want) and generate a checksum for each block. It then compares both and figures out what needs to be initially replicated to get both sites in sync. This initial sync happens over TCP port 31031 and is called a full sync.
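
The checksum comparison during a full sync can be sketched in Python like this. Keep in mind that this is a simplified illustration: the block size, the use of SHA-256 and the function names are assumptions on my part, the actual checksum algorithm used by vSphere Replication is not covered in this post.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size, for illustration only


def block_checksums(disk):
    """Checksum every fixed size block of a disk image (bytes)."""
    return [
        hashlib.sha256(disk[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(disk), BLOCK_SIZE)
    ]


def blocks_to_replicate(source, target):
    """Indexes of source blocks that are missing or different at the target."""
    src = block_checksums(source)
    tgt = block_checksums(target)
    return [i for i, checksum in enumerate(src) if i >= len(tgt) or tgt[i] != checksum]


source_disk = b"AAAABBBBCCCCDDDD"
seed_copy = b"AAAAXXXXCCCC"  # an older copy that was already sitting at the target site

print(blocks_to_replicate(source_disk, seed_copy))  # only these blocks cross the wire
```

This is also why seeding a copy of the VM at the target site helps: only the blocks whose checksums differ have to be sent.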

After the initial sync the VR agent (VRA – inside the ESXi kernel) will (together with a passive virtual SCSI filter) track all I/O and keep an in-memory bitmap of all changed blocks. This bitmap is backed up with a persistent state file (psf) in the home directory of the VM. The psf file contains pointers to the changed blocks. Now the replication engine figures out based on your RPO when the time is right (you cannot schedule this yourself) to send the changed blocks to the replication site. This “ongoing” replication happens over TCP port 44046 initiated from the vmkernel management NICs.
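
A changed block bitmap is a simple structure, which the Python sketch below tries to show. The class name ChangeTracker, the block size and the method names are all hypothetical, and this says nothing about how the VR agent or the psf file are really implemented.

```python
class ChangeTracker:
    """Keep one dirty flag per block and hand out the changed blocks per replication cycle."""

    def __init__(self, disk_size, block_size):
        self.block_size = block_size
        n_blocks = -(-disk_size // block_size)  # ceiling division
        self.dirty = [False] * n_blocks

    def record_write(self, offset, length):
        """Mark every block touched by a write of `length` bytes at `offset`."""
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for block in range(first, last + 1):
            self.dirty[block] = True

    def drain(self):
        """Return the changed block numbers and reset the bitmap for the next cycle."""
        changed = [i for i, is_dirty in enumerate(self.dirty) if is_dirty]
        self.dirty = [False] * len(self.dirty)
        return changed


tracker = ChangeTracker(disk_size=64, block_size=8)
tracker.record_write(offset=10, length=4)    # touches block 1
tracker.record_write(offset=30, length=12)   # touches blocks 3, 4 and 5
tracker.record_write(offset=12, length=2)    # block 1 again, still only sent once

first_cycle = tracker.drain()
second_cycle = tracker.drain()
print(first_cycle, second_cycle)
```

Note that a block that is overwritten several times within one RPO window is only sent once, which keeps the amount of replicated data down.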

At the recovery site you have deployed your VRS to which the changed blocks will be sent. After the VRS has received all changed blocks for that replication cycle it will pass those off to the ESXi’s network file copy (NFC) service to write the blocks to its target storage. This means that the vSphere replication process is completely abstracted from the underlying storage and as such gives you flexibility in terms of underlying hardware being different at both ends.

VMware’s Storage Portfolio

I recently joined VMware as an SE, one of the reasons that motivated and influenced my decision to join is a thirst for technology, at VMware you get to work on such a broad and interesting technology stack that anyone is bound to find one or more things that are deeply interesting.

One (out of many) of those things that I find compelling is storage and as such my first blog post as an employee on storage seemed fitting.

In this post I’ll briefly cover VSAN, vFRC, Virsto, and vVols.

I won’t cover vSphere data services like VMFS, VAAI, VASA, Storage vMotion, Storage DRS, VADP, and vSphere Replication here.

VSAN

VSAN is a software based distributed storage solution that is integrated directly into the hypervisor (in contrast to running as a virtual appliance, like the vSphere Storage Appliance (VSA)). It uses directly connected storage (a combination of SSD for performance and spinning disks for capacity) and creates a distributed storage layer across multiple ESXi hosts (up to 8 hosts in the beta, rumored 16 hosts at GA in 2014).

One of the most obvious use cases for VSAN will be virtual desktops, in fact VSAN will be a part of the new Horizon View 5.3.

For more information on VSAN I suggest heading over to Cormac Hogan’s blog for an entire series on VSAN, or (AND!) Duncan Epping’s blog

vFRC

vSphere Flash Read Cache takes your direct attached flash resources (SSD and/or PCIe flash devices) on your vSphere host and allows you to utilize them as a dedicated (per VM) read cache for your virtual workloads. Like VSAN this solution is integrated into the hypervisor itself.

For more information on vFRC I recommend heading over to the Punching Clouds blog

VVols

VVols or virtual volumes is in technology preview at the moment (i.e. not available), the idea behind VVols is to make storage VM (VMDK) aware in contrast to VMFS aware (usually LUN or Volume based) for things like clones, snapshots, and replication.

For more information on Virtual Volumes I recommend reading Duncan Epping’s blog post.

You can also see a (tech preview) demo of vVols by NetApp here.

Virsto

VMware acquired Virsto earlier this year in an effort to optimize external block storage (i.e. external SAN only) performance and space utilization. It is no secret that block based storage performs best with sequential I/O, but when you start using virtualization you end up with mostly random I/O (if you have more than 1 VM that is). Virsto tries to solve the random I/O issue by intercepting the random I/O, writing it to a serialized log file and later de-staging it to your SAN storage, resulting in mostly sequential data. Virsto is installed as a (set of) virtual appliance(s). Virsto presented at Storage Field Day 2 before they were acquired by VMware.

Whitewater 3 – waves of innovation washing onto the shore

Riverbed recently released the latest edition of its cloud storage gateway, both upgrading the software and providing new hardware options.

What is Riverbed Whitewater?

Whitewater is an on-premise appliance that connects your internal network with a cloud storage provider, it easily integrates your existing back-up/archive infrastructure with cloud storage, leveraging the cloud as a low cost tier for long term storage.

wwa3 overview

Whitewater brings cloud scale cost and protection (cloud data durability is extremely high (11x9s) due to advanced cloud architectures) benefits into your existing infrastructure. At the same time Whitewater provides fast restore since the local cache will hold the most recent backup data.

In contrast to Riverbed Steelhead, the WAN optimization solution, Whitewater is single-ended (you only need 1 appliance in your datacenter), whereas Steelhead requires an appliance (or softclient) at both ends of the WAN connection.

On the front-end it presents itself as a CIFS/NFS share, providing easy integration with existing back-up applications, and on the back-end it connects to a cloud storage system using REST APIs.

wwa3 providers

Data that is written to Whitewater is deduplicated inline, and securely (encrypted at rest and in-flight) transferred to your cloud storage provider/system.

I’ve written about Whitewater a couple of times before;

http://filipv.net/2013/05/19/amazon-glacier-and-backup-economics/

http://filipv.net/2012/12/26/riverbed-whitewater-and-caringo-castor/

What’s new? – Bigger, Better, Faster

  • Whitewater now supports up to 2.88PB of source data locally cached
  • Up to 14.4PB of source data in the cloud
  • Scalability by optionally allowing you to connect disk shelf extensions
  • Faster performance, now ingesting up to 2.5TB/Hr
  • 10Gb connectivity
  • Ability to locally pin a data set
  • Ability to perform replication to a remote Whitewater (peer replication)
  • Symantec Enterprise Vault support

Storage shelf extensions

The current version allows you to connect 2 additional storage shelves, greatly expanding the local cache. This combined with local data pinning and peer replication makes it feasible to use the system as a backup to disk system without the cloud tier. But the main purpose of the solution remains leveraging cloud storage economics for long term retention.

wwa3 shelves

Locally pin a data set

If you have a particular data set for which your SLA to the business requires a shorter RTO you can optionally lock this data set on the local cache (changes will still be replicated to the peer and/or cloud storage). This way you can ensure that this data set will always be recovered from the local cache at LAN speed.

pinned data set

Peer replication

Another standard feature (at no additional license cost) in version 3 of Whitewater is the ability to replicate data to a peer Whitewater at a DR site.

Since Whitewater uses inline deduplication this means that the primary appliance will send only deduplicated (and encrypted) traffic towards the DR site, thus greatly reducing network transmissions. The secondary Whitewater first needs to acknowledge the data before it is replicated to the cloud as a 3rd tier.

wwa3 replication

Although we are only transferring deduplicated data we still allow you to control the bandwidth used for replication both to the peer Whitewater and the cloud.

wwa3 repl

Symantec Enterprise Vault support

Whitewater allows you to integrate the cloud as a storage vault for Symantec Enterprise Vault. Click here for more information on Enterprise Vault.

What if my datacenter is lost and I need to restore from the cloud?

First of all we would recommend replicating to a peer Whitewater in a DR site so you don’t incur cloud restore charges or transmission delays. But we do allow you to download a virtual Whitewater for FREE (read-only) which will allow you to quickly (or at least quicker since we are pulling out deduplicated data) restore your data and get back online.

free vwwa

A word about deduplication

In order to make cloud storage economically feasible Whitewater first deduplicates data before sending it to the cloud. With deduplication only unique data is stored on the disk, thus guaranteeing much more efficient utilization of any storage.

In the process of deduplication the incoming data stream is split into blocks. A fingerprint (digital signature) is created for each block to uniquely identify it, as well as a signature index. The index provides the list of references in order to determine if a block already exists on disk. When the deduplication algorithm finds an incoming data block that has been processed before (a duplicate), it does not store it again but it creates a reference to it. References are generated every time a duplicate is found. If a block is unique, the deduplication system writes it to disk.
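
The fingerprint and index mechanism described above fits in a few lines of Python. This is a generic sketch, not Whitewater's implementation: the class name DedupStore is hypothetical and SHA-256 is simply my pick for the fingerprint function.

```python
import hashlib


class DedupStore:
    """Store each unique block once and keep references for the duplicates."""

    def __init__(self):
        self.blocks = {}       # fingerprint -> block data (what is actually written to disk)
        self.references = {}   # fingerprint -> number of times the block was seen

    def write(self, block):
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in self.blocks:
            self.blocks[fingerprint] = block  # unique block: write it to disk
        self.references[fingerprint] = self.references.get(fingerprint, 0) + 1
        return fingerprint                    # duplicate or not, the caller gets a reference


store = DedupStore()
incoming = [b"AAA", b"BBB", b"AAA", b"AAA", b"CCC"]

# The "recipe" is the list of references needed to rebuild the original stream.
recipe = [store.write(block) for block in incoming]
restored = b"".join(store.blocks[fingerprint] for fingerprint in recipe)

print(len(incoming), "blocks written,", len(store.blocks), "blocks stored")
print(restored == b"".join(incoming))
```

Five blocks come in but only three are stored, and the list of references is enough to rebuild the original data on restore.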

Some deduplication techniques split each file into fixed length blocks, others, like Whitewater use variable length blocks. Fixed Block deduplication involves determining a block size (size varies based on the system but is fixed) and segmenting files/data into those block sizes.

Variable Block deduplication involves using algorithms to determine a variable block size. The data is split based on the algorithm’s determination. When something changes, e.g. data is added so that the blocks shift, the algorithm detects the shift so that the blocks that follow are not “lost”. Fixed block length deduplication cannot do this.

In the example below we have a fixed block length of 3, so the incoming data is “sliced” into blocks of 3 characters. The arrow indicates a change to the data, i.e. we add a new character (A) upstream, the result of which is, since the boundaries with fixed length do not change, that all blocks now contain different data and there are zero block matches meaning all blocks are unique and will be written to disk.

Fixed block

Notice how the variable block deduplication has seemingly random block sizes. While this does not look too efficient compared to fixed block, notice what happens when we add the same upstream element to the data. 

VBL

Since the variable block length algorithm has determined the boundary for this particular data to lie between C and BB only the first block (AABC) has changed and needs to be written to disk, the other blocks remain unchanged and can be referenced by the deduplication algorithm.
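
The difference between the two approaches can be reproduced with a toy example in Python. A big caveat: real variable block implementations typically derive boundaries from a rolling hash over a window of data, and Whitewater's actual algorithm is not described here. Cutting after every "C" character is just a stand-in I chose to keep the example readable, and the sample data is made up.

```python
def fixed_chunks(data, size=3):
    """Slice the data into blocks of a fixed length."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, marker="C"):
    """Toy content-defined chunking: close a block after every marker character."""
    chunks, current = [], ""
    for char in data:
        current += char
        if char == marker:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks


def new_chunks(before, after):
    """Blocks in `after` that were not seen in `before` and must be written to disk."""
    known = set(before)
    return [chunk for chunk in after if chunk not in known]


original = "ABCBBDACADDCBA"
changed = "A" + original  # one character added upstream

fixed_new = new_chunks(fixed_chunks(original), fixed_chunks(changed))
variable_new = new_chunks(variable_chunks(original), variable_chunks(changed))

print("fixed:   ", fixed_new)
print("variable:", variable_new)
```

With fixed blocks every block after the insert is new, while with content based boundaries only the first block (AABC) has to be written again.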

Because Whitewater uses variable segment length inline deduplication, it achieves higher dedupe ratios than fixed block length deduplication (see above). Once we have deduplicated the data we use LZ compression to compact it further. We see an average data set reduction of 10 to 30x depending on the source data.

dedupe ratio

If you are an existing Riverbed Whitewater customer you can download Whitewater 3.0.x here

VMware Branch Office Desktop with Granite and Atlantis ILIO

When using VMware View, or any other VDI based solution for that matter, across a Wide Area Network you need to think about certain limitations inherent in this setup that can potentially limit the user experience for your remote users.

Running your virtual desktops in the data center and connecting over the WAN.

If you decide to keep the virtual desktops in the data center and let your users connect remotely, the user experience will be impacted by the amount of bandwidth, the latency, the type of application, and the remoting protocol. In the case of VMware View we are using PCoIP* across the WAN. With Riverbed Steelhead you can use WAN optimization technology to optimize PCoIP; for example, Riverbed Steelhead can optimize printer mappings, drive mappings and USB redirection between the branch office and the data center.

riverbed pciop

Riverbed Steelhead also enables QoS for PCoIP giving fine bandwidth control and latency prioritization for virtual channels within a PCoIP stream, enabling fine-tuning of traffic including voice, video and display rendering.

Running your virtual desktops in the branch office.

To get round the bandwidth and latency issue you could also decide to host the virtual desktop VMs in the branch office. Riverbed Granite allows you to host the VMs remotely while the central management components still remain in the data center.

Branch

The net result is that you only need bandwidth for PCoIP in the branch office, where it is readily available and is not impacted by latency, and that the SAN hosting the virtual desktop VMs is less impacted by the IOPS requirement when booting and running the VMs, since they are now running on the local blockstore of the Granite appliance in the branch office. All while maintaining central management from the data center.

Now depending on the number of virtual desktops you need to run in the branch office you could be impacted by the amount of IOPS required. The Steelhead EX appliance in the branch, which runs the Granite Edge component, has a certain number of internal disks (HDD for Granite Blockstore, SSD for Steelhead Datastore), which translates to a maximum amount of IOPS available for your virtual desktops. The total amount of IOPS we can serve from Steelhead EX depends on the model.

Let’s assume you are running a big branch office and you require a large amount of IOPS to keep user experience optimal. Again you have several options: you could run multiple Steelhead EX + Granite appliances (Riverbed supports up to 80 branch offices connected to a single Granite Core in the data center), or you could use a solution like Atlantis Computing ILIO to leverage your server’s RAM to satisfy your IOPS requirements. Steelhead EX has a certain amount of memory depending on the model, or you can use an external VMware host chock full of RAM and connect that to the Granite component in the branch office.

So how many IOPS do you need to run your virtual desktops?

A lot has been written about the IOPS requirements for VDI. There are numerous whitepapers and VDI storage calculators out there that will give you some idea of the amount of IOPS you should expect, just be careful with steady state numbers vs booting the VM vs starting an application (application virtualization also helps reduce IOPS requirements here). The idea is that you want to provide a user experience that is at least as good as working locally, so your users won’t revolt. In general the Windows OS will consume as much disk IO or throughput to the hard drive as is available; additionally, Windows desktop workloads are write heavy (70-80% writes, 20-30% reads).

VMware itself has also provided a way to alleviate IOPS requirements with its View Storage Accelerator introduced in VMware View 5.1, this is a great addition to limit read IOPS (20-30% in Windows virtual desktops) but as such still leaves us with the write IOPS requirement.
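A quick back-of-the-envelope calculation shows why the write side matters so much. The numbers below (100 desktops, 20 steady state IOPS per desktop, 75% writes) are assumptions picked for illustration, not measurements; plug in the figures from your own assessment or storage calculator.

```python
def iops_split(desktops, iops_per_desktop, write_fraction):
    """Return (total, read, write) IOPS for a pool of virtual desktops."""
    total = desktops * iops_per_desktop
    writes = total * write_fraction
    reads = total - writes
    return total, reads, writes


# Hypothetical branch office: 100 desktops, 20 IOPS each, 75% writes
total, reads, writes = iops_split(desktops=100, iops_per_desktop=20,
                                  write_fraction=0.75)

print("total IOPS:", total)
print("read IOPS (the part a read cache can help with):", reads)
print("write IOPS (still hitting storage):", writes)
```

Even if a read cache absorbed every single read, three quarters of the load in this example would still land on the storage as writes.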

Atlantis ILIO is a virtual machine that is hosted on the same host running the virtual desktops (in our case the Steelhead EX or an external VMware host), it essentially presents the RAM of the host as a datastore where the virtual desktops are run from, providing IOPS from RAM (nanosecond latency as compared to microsecond latency when using flash based arrays or PCI based flash cards).

disklessVDI_diag_large

Inline deduplication further limits the amount of IOPS needed from the backend storage (in our case the Granite Edge), since fewer blocks are being transferred, and it also limits the amount of RAM required to run the virtual desktops.

A closer look at Atlantis ILIO.

First we need to differentiate between persistent and non-persistent VDI desktops. For non-persistent desktops ILIO has had a solution for some time to just run these desktops from RAM without needing persistent storage; when the server dies you just reboot the virtual desktop on another host and start working from there.

With persistent desktops there is a need to write data to persistent storage so your users’ adjustments aren’t lost after a reboot, and with the release of ILIO 4.0 this is now possible. I’ll further explore the persistent desktop use case since this is the most interesting one and has the bigger IOPS requirements.

The VM you install on the host is called the session host. This session host hosts the virtual desktops; it exposes the RAM of the host as an NFS or iSCSI mountpoint to which you attach the VMware datastore.

Once data needs to be written to the datastore, ILIO (In Line Image Optimization) performs inline deduplication and compression, and Windows I/O optimization (potentially fixing the I/O blender issue by optimizing random 4K blocks into sequential 64K blocks).

Atlantis-ILIO-Persistent-VDI-4.0

The Replication Host stores the persistent data of multiple Session Hosts and can further deduplicate data (across multiple hosts this time), which further reduces the storage requirements of the SAN/NAS. The replication host is responsible for making sure that any changes made to the desktop are saved to a persistent storage device, either SAN or NAS. In order to get the data from the RAM to the Replication host, Atlantis uses Fast Replication (you can run session and replication hosts on the same server if you want). So once you need to restart the virtual desktop on another host, the persistent state of the desktop is retrieved by the replication host from persistent storage. With all these features in the I/O path, Atlantis estimates that they only need around 3GB of persistent storage space per desktop.

See below for a demo of the Atlantis ILIO Persistent VDI 4 user experience

Of course this is not the only way to deal with user experience requirements for branch office VDI and a lot of options exist out there but as far as I am concerned this is definitely one of the coolest.

*PCoIP is not the only option you have as a remoting protocol, you can also use RDP or the HTML5 Blast “client”.

Disclaimer: This is my personal blog and this post is in no way endorsed or approved by Riverbed, I have not built this solution in a production network and cannot comment on real life feasibility.

Amazon Glacier and Backup economics

In the summer of last year Amazon announced Amazon Glacier, an extremely low cost storage service designed for data archiving and backup.

This makes it a very compelling solution for offloading your backup data to the cloud at low cost, but the point of a backup solution is not backing up your data, it is enabling restore of said data. The time it takes for the restore to complete must fit in your RTO (how long can the business wait before the data is back and usable), and this is where Amazon Glacier potentially falls down, because the SLA it adheres to for getting your data back is between 3 and 5 hours. This is the reason why it is primarily marketed as an archive solution, where the time constraints are less stringent and the cost of storing the archive takes precedence over the RTO. (If you need faster access to your data look at Amazon S3, but of course take into account the cost differential there).

glacier low cost

But have no fear, you can have your cake and eat it too, with Riverbed Whitewater you can leverage the low cost Amazon Glacier storage and still get fast restores. Whitewater is a tiered backup solution that ingests data from your existing, unmodified backup server, using inline deduplication to minimize the local storage required to maintain a full backup of your data locally, and sending the rest up to Amazon Glacier. Because most restores your users request are for relatively new data, chances are this data is stored on the local disks of the Whitewater appliance and the restore will be at LAN speed. The pricing of Amazon Glacier (see picture above) also assumes that storage retrieval will be infrequent (this is calculated in the pricing model), like say for archiving purposes, and with Whitewater it can be for backup purposes as well.
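To see how the retrieval delay interacts with your RTO, you can do the arithmetic yourself. The sketch below is a simplified model: the 100 GB restore size, the throughput figures and the 4 hour RTO are made-up assumptions, and only the worst case 5 hour retrieval delay comes from the Glacier SLA mentioned earlier.

```python
def restore_hours(size_gb, throughput_mb_s, retrieval_delay_h=0.0):
    """Estimated restore time: waiting for retrieval plus the actual transfer."""
    transfer_h = (size_gb * 1024) / throughput_mb_s / 3600
    return retrieval_delay_h + transfer_h


def meets_rto(hours, rto_h):
    """True when the restore completes within the recovery time objective."""
    return hours <= rto_h


RTO_HOURS = 4  # hypothetical business requirement

# Restore served from the local disks of the appliance at (assumed) LAN speed
local = restore_hours(size_gb=100, throughput_mb_s=100)

# Same restore straight from Glacier: worst case 5 hour retrieval delay,
# then a transfer over an (assumed) slower WAN link
glacier = restore_hours(size_gb=100, throughput_mb_s=10, retrieval_delay_h=5)

print(f"local:   {local:.2f} h, meets RTO: {meets_rto(local, RTO_HOURS)}")
print(f"glacier: {glacier:.2f} h, meets RTO: {meets_rto(glacier, RTO_HOURS)}")
```

With these numbers the local restore finishes in well under an hour, while the Glacier-only restore misses a 4 hour RTO before the first byte has even arrived.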

wwa glacier

So a serious reduction in data protection costs, eliminating tape, tape vaulting and disaster recovery storage sites. Improved DR readiness with secure, anywhere accessible Amazon Glacier storage services providing 11 9’s of durability. No need to change your existing backup application and processes, less storage used in Glacier because of our inline deduplication, and local LAN speed restores. End-to-end security, with data secured in flight and at rest with SSL v3 and AES 256 bit encryption.

Software Defined Shenanigans

Software defined anything (SDx) is the new black.

In July of last year VMware acquired Software Defined Networking (SDN) vendor Nicira and suddenly every network vendor had an SDN strategy; they must have reckoned the Google hits alone from people searching for SDN justified a change in vision.

Now VMware (among others) is further leading the charge by talking about the Software Defined Data Center (SDDC), wherein the entire data center is pooled, aggregated, and delivered as software, and managed by intelligent, policy-driven software.

Cloud and XaaS are so last year, SDx is where it’s at, it is the halo effect gone haywire.

A lot of networking and storage companies, both “legacy” and “start-up”, are scurrying around trying to figure out how to squeeze “Software-Defined” into their messaging.

So what defines software defined?

Software defined first appeared in the context of networking. Traditionally network devices were delivered as a monolithic appliance, but logically you can think of them as consisting of 3 parts: the data plane, the management plane, and the control plane.

SDN Logical (1)

The Data plane is relatively straightforward (no pun intended), it is where your data packets travel from point A to point B. When packets and frames arrive on the ingress ports of the network device, the forwarding table is what all routers and switches use to dispatch them to their egress ports.

The Management plane, besides providing management functions such as device access, OS updates, etc., also delivers the Forwarding table data from the Control plane towards the Data plane.

The Control plane is more involved. As networks become more sophisticated, the (routing) algorithms here can become pretty complex (and complexity often leads to bugs). These algorithms are not uniform, and they are dynamic, because they are expected to support a wide range of use cases and deployment scenarios.

The idea of SDN is to separate these planes, i.e. to separate the control/management function from the data function in order to increase flexibility. Now imagine you have moved the control function to a system that controls other functions in your data center as well, like creating virtual machines and storage. No longer are you limited by silos of control, you can potentially manage everything that is needed to deploy new applications (vm, network, storage, security, …) from a single point of control (single pane of glass?).
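The separation can be pictured with a toy model in Python. This is a conceptual sketch only: the `Switch` and `Controller` classes, the addresses and the port names are invented for the example and do not correspond to OpenFlow or to any vendor’s controller API.

```python
class Switch:
    """Data plane only: forwards using a table it did not compute itself."""

    def __init__(self, name):
        self.name = name
        self.forwarding_table = {}  # destination -> egress port

    def install(self, table):
        """Accept a forwarding table handed down by the controller."""
        self.forwarding_table = dict(table)

    def forward(self, destination):
        """Look up the egress port; unknown destinations are dropped."""
        return self.forwarding_table.get(destination, "drop")


class Controller:
    """Centralised control plane: decides the forwarding tables of all switches."""

    def __init__(self):
        self.switches = {}
        self.policy = {}  # switch name -> {destination: egress port}

    def register(self, switch):
        self.switches[switch.name] = switch

    def set_route(self, switch_name, destination, port):
        self.policy.setdefault(switch_name, {})[destination] = port

    def push(self):
        """Deliver the computed tables to every registered switch."""
        for name, switch in self.switches.items():
            switch.install(self.policy.get(name, {}))


controller = Controller()
s1, s2 = Switch("s1"), Switch("s2")
controller.register(s1)
controller.register(s2)

controller.set_route("s1", "10.0.0.2", "port2")
controller.set_route("s2", "10.0.0.1", "port1")
controller.push()

print(s1.forward("10.0.0.2"))   # known destination
print(s1.forward("10.0.0.99"))  # no route installed
```

The switches contain no routing logic at all; changing behaviour across the whole network means changing the policy in one place and pushing it out again.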

Is exposing API’s enough? 

It has always been possible to control functions in the network device programmatically, a lot of vendors are merely allowing you to control the existing control plane using API’s, I would argue this is not SDN, at least not in a purist sense, because it lacks scalability. (This point is very debatable I admit).

The aim is not having the control plane in each monolithic device, but rather have the intelligence outside, using OpenFlow for example, or the Big Network Controller from Big Switch Networks, allowing more flexibility and greater uniformity (one can dream).

Northbound API

The northbound API on a software-defined networking (SDN) controller enables applications and orchestration systems to program the network and request services from it. This is what the non-network vendors will use to integrate with your SDN, the problem today is that the service is not standardised (yet?), meaning that it is less open than we want it to be.

Is SDN the same as network virtualisation?

Network virtualisation adds a layer of abstraction (like all virtualisation) to the network, often using tunnelling or an overlay network across the existing physical network. Nicira uses STT, VMware already had VXLAN, Microsoft uses NVGRE, … I would argue that network virtualisation often is an underlying part of SDN.

Software Defined Storage (SDS)

In the world of storage we are also exposed to software defined, a lot of storage start-ups are using the SDS messaging to combat existing (or legacy as the start-ups would prefer) storage vendors to claim they have something new and improved. This, in my humble opinion, is not always warranted.

If you define SDS the same way as SDN, whereby the control plane is separated from the data plane, then this is enabled by the lower-level storage systems abstracting their physical resources into software. The same reasons prevail: dynamism, flexibility, more control,… These abstracted storage resources are then presented up to a control plane as “software-defined” services. The exposure and management of these services is done through an orchestration layer (like the northbound API in the SDN world). The quality and quantity of these services depends on the virtualisation and automation capabilities of the underlying hardware (is exposing API’s enough?).

Some would argue that because of the existing architectures of legacy storage systems this becomes more cumbersome and less flexible compared to the new start-up SDS players. Just as you have new players in SDN (Arista (even though they don’t seem to like the SDN terminology very much), Plexxi,…) baking these technologies in from the ground up, you have the same with storage vendors, but I would argue that the rate of innovation seems to be much higher here. A lot of new storage vendors (ExaBlox, Tintri, PureStorage, Nimble,..), a lot of new architectures (Fusion-IO, Pernixdata, SanDisk FlashSoft,…), a lot of acquisitions of flash based systems by legacy vendors, etc… all of this makes me believe that “legacy” storage vendors are not going the way of the dinosaur just yet. I do however think it will lead to a lot of confusion, like software only storage suddenly being SDS etc.

Is SDS the same as storage virtualisation?

Like network virtualisation in SDN, storage virtualisation in SDS can play its part. Storage virtualisation is another abstraction between the server and the storage array, and one such abstraction can be achieved by implementing a storage hypervisor. The storage hypervisor can aggregate multiple different arrays, of different vendors, or maybe even generic JBODs. The storage hypervisor tends to not, at least not always, use most of the capabilities of the arrays, instead treating them as generic storage. Datacore for example sells a storage hypervisor, so does Virsto which was acquired by VMware.

In a more traditional sense the IBM SVC, NetApp V-series, EMC VPLEX can be considered storage virtualisation, or more accurately storage federation. And then you have logical volume managers, LUNs, RAID sets, all abstraction, all “virtualisation”… so a lot of FUD will be incoming.

Is it all hype?

Of course not. Some of the messaging might be confusing, and some vendors like to claim they are part of the latest trend without much to show for it, but the industry is moving, fast, adding functionality to legacy systems and building new architectures to deliver (at least partially) on the promise of better. But as always, there is a lot of misinformation about certain capabilities in certain products, maybe a little too much talking and not enough delivering. I expect a great deal of consolidation in the next few years, both of companies and terminology, so look carefully at who is doing what and how this matches with your company’s strategy going forward. Exciting times ahead though.