Containers and persistent storage

Containers are a method of operating system virtualization that allows you to run an application and its dependencies in resource-isolated processes. Containers let you package an application’s code, configuration, and dependencies into easy-to-use building blocks that deliver environmental consistency, operational efficiency, developer productivity, and version control. Container images are immutable, which helps you deploy applications in a reliable and consistent way, independent of the deployment environment.

As containers continue to rise in popularity beyond the developer populace, the way these constructs are used becomes increasingly varied. Especially (but not exclusively) for enterprise applications, the question of persistent storage comes up more and more. It is a fallacy to think only stateless applications can or should be containerized; about half of the most popular applications on Docker Hub are stateful, databases for example. Think about monolithic applications versus microservices: a monolithic application typically requires state, and if you pull such an application apart into microservices, some of those services can be stateless containers but others will still require state.

I’ll mainly use Docker as the example for this post, but many other container technologies exist, like LXD, rkt, and OpenVZ; even Microsoft offers containers with Windows Server Containers, Hyper-V isolation, and Azure Container Service.

Running a stateless container using Docker is quite straightforward:

$ docker run --name demo-mysql -e MYSQL_ROOT_PASSWORD=secret -d mysql

When you execute docker run, the container process that runs is isolated in that it has its own file system, its own networking, and its own isolated process tree separate from the (local or remote) host.

A Docker container is created from a read-only template called a Docker image. The “mysql” part of the command refers to this image, i.e. the containerized application you want to run, which is pulled from a registry. The data you create inside a container is stored on a thin writable layer, called the container layer, that sits on top of the stack of read-only layers, called the image layers, present in the base Docker image. When the container is deleted the writable layer is deleted with it, so your data does not persist. In Docker, the storage driver is responsible for enabling and managing both the read-only image layers and the writable layer; both read and write speeds through these layers are generally considered slow.

Assuming you want persistent data for your containers, there are several ways to go about this. You can add a storage directory to a container’s virtual filesystem and map that directory to a directory on the host server. The data you create inside that directory in the container will be saved on the host, allowing it to persist after the container shuts down, and the directory can also be shared between containers. In Docker this is made possible by using volumes. You can also use bind mounts, but these are dependent on the directory structure of the host machine, whereas volumes are completely managed by Docker itself. Keep in mind though that these volumes don’t move with container workloads, as they are local to the host. Alternatively you can use volume drivers (Docker Engine volume plugins) to store data on remote systems instead of on the Docker host itself. If you are only interested in storing data in the container’s writable layer (i.e. on the Docker host itself), the Docker storage driver you choose determines which filesystem is supported.
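As a quick illustration, here is a minimal Docker Compose sketch (service name, volume name, and host path are hypothetical) showing the difference between a Docker-managed named volume and a host-dependent bind mount:

```yaml
# Hypothetical compose file; "dbdata" and the host path are illustrative.
services:
  db:
    image: mysql
    environment:
      MYSQL_ROOT_PASSWORD: secret
    volumes:
      - dbdata:/var/lib/mysql              # named volume, fully managed by Docker
      - /srv/db/conf:/etc/mysql/conf.d:ro  # bind mount, tied to this host's layout
volumes:
  dbdata: {}
```

The named volume survives the container being removed and recreated; the bind mount does too, but only because the data lives at a fixed path on that particular host.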

Typically you would create a volume using the volume driver of your choice in the following manner:

$ docker volume create --driver=pure -o size=32GB testvol1

And then start a container and attach the volume to it:

$ docker run -ti -v testvol1:/data mysql

Storage Vendors and Persistent Container Storage

Storage vendors have an incentive to make consuming their particular storage as easy as possible for these types of workloads so many of them are providing plug-ins to do just that.

One example is Pure Storage, who provide a Docker Volume Plugin for their FlashArray and FlashBlade systems; currently they support Docker, Swarm, and Mesos. Most other big-name storage vendors also have plugins available.

Then there are things like REX-Ray, an open-source storage management solution born out of the now defunct {code} by Dell EMC team. It allows you to use multiple different storage backends and serve those up as persistent storage for your container workloads.

On the virtualization front VMware has something called the vSphere Docker Volume Service which consists of two parts, the Docker Volume Plugin and a vSphere Installation Bundle (VIB) to install on the ESXi hosts. This allows you to serve up vSphere Datastores (be it Virtual SAN, VMFS, NFS based) as persistent storage to your container workloads.

Then there are newer companies that have been focusing solely on providing persistent storage for container workloads; one of them is Portworx. Portworx wants to provide another abstraction layer between the storage pool and the container workload. The idea is that they provide a “storage” container that can then be integrated with the “application” containers. You can do this manually, or you can integrate with a container scheduler like Docker Swarm using Docker Compose, for example (Portworx provides a volume driver).

Docker itself has built specific plugins as well; Cloudstor is one such volume plugin. It comes pre-installed and pre-configured in Docker swarms deployed through Docker for AWS, and data volumes can be backed by either EBS or EFS. Workloads running in a Docker service that require access to low-latency/high-IOPS persistent storage, such as a database engine, can use a “relocatable” Cloudstor volume backed by EBS. When multiple swarm service tasks need to share data in a persistent storage volume, you can use a “shared” Cloudstor volume backed by EFS. Such a volume and its contents can be mounted by multiple swarm service tasks, since EFS makes the data available to all swarm nodes over NFS.

Container Orchestration Systems and Persistent Storage

As most enterprise production container deployments will utilize some container orchestration system, we should also look at how external persistent storage is managed at this level. Kubernetes, for example, supports a volume plugin system (FlexVolume) that makes it relatively straightforward to consume different types of block and file storage. Additionally, Kubernetes recently started supporting an implementation of the Container Storage Interface (CSI), which helps accelerate vendor support for these storage plug-ins. Volume plugins are currently part of the core Kubernetes code and shipped with the core Kubernetes binaries, meaning that vendors wanting to add support for their storage system to Kubernetes (or even fix a bug in an existing volume plugin) must align themselves with the Kubernetes release process. With the adoption of CSI, the Kubernetes volume layer becomes extensible: third-party storage developers can write and deploy volume plugins exposing new storage systems in Kubernetes without having to touch the core Kubernetes code.

When using CSI with Docker, it relies on shared mounts (not Docker volumes) to provide access to external storage. With a mount, the external storage is mounted into the container; with volumes, a new directory is created within Docker’s storage directory on the host machine and Docker manages that directory’s contents.

To use CSI you will need to deploy a CSI driver; a number of storage vendors have these available in various stages of development. For example, there is a Container Storage Interface (CSI) Storage Plug-in for VMware vSphere.
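To make that concrete, here is a minimal sketch of how persistent storage is typically requested in Kubernetes: a PersistentVolumeClaim against a StorageClass that an installed CSI driver would provision (the class name below is hypothetical):

```yaml
# Hypothetical claim; "vendor-csi-fast" stands in for whatever StorageClass
# the vendor's CSI driver actually exposes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim
spec:
  storageClassName: vendor-csi-fast
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 32Gi
```

A pod then references the claim by name, and the CSI driver takes care of provisioning and attaching the underlying storage.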

Pre-packaged container platforms

Another angle vendors are taking to make it easier for enterprises to adopt these new platforms, including solving for persistence, is providing packaged solutions (i.e. making it turnkey). This is not new, of course; not too long ago we saw the same thing happen with OpenStack through the likes of VIO (VMware Integrated OpenStack), Platform9, Blue Box (acquired by IBM), etc. The public cloud providers, meanwhile, are moving more towards container-as-a-service (CaaS) models with Azure Container Service, Google Container Engine, etc.

One example of a packaged container platform is the Cisco Container Platform. It is provided as an OVA for VMware (meaning it provisions containers inside virtual machines, not on bare metal at the moment). Initially it is supported on Cisco’s HyperFlex platform, which provides the persistent storage layer via a Kubernetes FlexVolume driver. It can then communicate externally via Contiv, including talking to other components on the HX platform, like VMs running non-containerized workloads. For the load-balancing piece (between Kubernetes masters, for example) they bundle NGINX, and for monitoring and logging they bundle Prometheus and an ELK stack respectively.

Another example would be VMware PKS which I wrote about in my previous post.


Containers are ready for enterprise use today; however, some areas could do with a bit more maturity, one of them being storage. I fully expect to see continued innovation and tighter integrations as we figure out the validity of these use cases. A lot of progress has been made in the toolkits themselves, leading to the demise of earlier attempts like ClusterHQ/Flocker. As adoption continues, so will the maturity of these frameworks and plugins.

You snooze, you lose.

At the end of October 2016 I wrote a short blog post called “Data on tape is no longer useful”; its premise was that if your historical data is stored offline you can’t readily use it for other value-add services.

More recently Rubrik delivered the ability to take archived backup copies of your data and spin those up on demand as public cloud resources, a feature called CloudOn. Initially this took VMware-based virtual machines, archived them to Amazon S3, and, when the time came, automatically migrated those workloads from VMware to Amazon and spun them up as AMI-based EC2 instances.

Now with the release of version 4.1 we added the ability to support a similar scenario but spin up the workloads on Microsoft Azure instead of AWS. We also added archiving support for Google Cloud Storage, enabling more and more multi-cloud capabilities. (CloudOn for Google Cloud Platform is currently not available.)

Multi-Cloud er… Multi-Pass

Since I’ve just written a post about all the things Rubrik is (was) doing with Microsoft, I now need to add Microsoft Azure CloudOn to the list (I snoozed).

The idea is that you add Microsoft Blob Storage as an archive location to one or more of your SLAs, and once the data has been archived off you can use that archive copy to instantiate your workloads, which were originally running on VMware (VMware VMDK based), on Microsoft Azure as Azure-based VMs (VHD based).

You can opt to do this conversion completely on demand, or choose to auto-convert the latest backup copy to save time during instantiation. The inner workings are a bit different depending on which scenario you choose, but generally speaking Rubrik takes the VMDK file, converts it to a VHD file, and uploads the VHD file to Azure Blob storage as page blobs (versus block blobs, which are typically used for discrete storage objects like JPGs, log files, etc. that you’d view as a file in your local OS).

We also added support for Azure Stack in 4.1. In this initial release we provide the same type of functionality as we do with Azure (public cloud), meaning we support Windows, Linux, and (customer-installed) SQL based workloads via the Rubrik Backup Service.

For a broad discussion of Azure Stack (among other things) I suggest listening to this wonderful episode of the Datanauts podcast with Jeffrey Snover, Technical Fellow at Microsoft, Chief Architect for Azure Infrastructure, and the creator of PowerShell.

If you want to get more of the details about the 4.1 release, please check out one (or all) of these fine posts:

Erasure Coding – a primer

A surefire way to end up looking for another job in IT is to lose important data. Typically when a user in any organisation stores data, he or she expects that data to be safe and always retrievable (and, as we all know, component failures in storage systems are unavoidable). Data also keeps growing; a corollary to Parkinson’s law is that data expands to fill the space available for storage, just like clutter around your house.

Because of the constant growth of data there is a greater need to protect said data and simultaneously store it in a more space-efficient way. Large web-scale companies like Google, Facebook, and Amazon need to store and protect incredible amounts of data; they do not, however, rely on traditional data protection schemes like RAID, because these are simply not a good match for the hard disk capacity increases of late.

Sure sure, but I’m not Google…

Fair point, but take a look at the way modern data architectures are built and applied, even in the enterprise space. Most hyper-converged infrastructure players, for example, employ a storage replication scheme to protect data residing on their platforms; they simply cannot afford the long rebuild times associated with multi-terabyte hard disks in a RAID-based scheme. The same goes for most object storage vendors. As an example let’s take a 1 TB disk: its typical sequential write speed sits around 115 MB/s, so 1,000,000 MB / 115 MB/s ≈ 8,700 seconds, which is nearly two and a half hours. If you are using 4 TB disks then your rebuild time will be at least ten hours. And this even ignores the RAID calculation that needs to happen simultaneously and the other I/O in the system that the storage controllers need to handle.
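The back-of-the-envelope rebuild math above can be checked with a couple of lines of shell arithmetic:

```shell
# Sequential rebuild time estimate for a 1 TB disk at ~115 MB/s
size_mb=1000000
speed_mbps=115
seconds=$(( size_mb / speed_mbps ))   # ~8695 seconds
hours=$(( seconds / 3600 ))
minutes=$(( seconds % 3600 / 60 ))
echo "${hours}h ${minutes}m"          # roughly 2h 24m
```

Quadruple the disk size and the estimate lands at the ten-hours-plus figure quoted for 4 TB drives.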

RAID 5 protection example.

Let’s say we have 3 HDDs in a RAID 5 configuration: data is spread over 2 drives and the 3rd one is used to store the parity information. This parity is basically an exclusive OR (XOR) function.

Let’s say I write 2 bits of data to the system: disk 1 has the first bit, disk 2 the second bit, and disk 3 holds the parity bit (the XOR calculation). Now I can lose any one of the disks and the system is able to reconstruct the missing bit, as demonstrated by the XOR truth table below:

A   B   A XOR B
0   0   0
0   1   1
1   0   1
1   1   0

Let’s say I write bit 1 and bit 0 to the system: 1 is stored on disk A, 0 is stored on disk B, and the parity disk stores 1 XOR 0 = 1. If I lose disk A [1], I still have disk B [0] and the parity disk [1]. According to the table, B [0] XOR parity [1] = 1, so I can still reconstruct my data.
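The reconstruction in this example can be sketched with shell arithmetic, using ^ as the XOR operator:

```shell
# Two data bits and their XOR parity, as in the RAID 5 example
a=1; b=0
parity=$(( a ^ b ))          # stored on the parity disk: 1
# simulate losing disk A, then rebuild it from disk B and the parity
rebuilt_a=$(( b ^ parity ))
echo "$rebuilt_a"            # prints 1, the lost bit
```

Because XOR is its own inverse, the same expression rebuilds whichever single disk is lost.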

But as we have established that rebuilding these large disks is unfeasible, what the HCI builders do instead is replicate all data, typically 3 times, across their architecture to protect against multiple component failures. This is of course great from an availability point of view, but not so much from a usable capacity point of view.

Enter erasure coding.

So, from a high level, what happens with erasure coding is this: when data is written to the system, instead of using RAID or simply replicating it multiple times to different parts of the environment, the system applies slightly more complex mathematical functions (including matrix and Galois field arithmetic) compared to the simple XOR we saw in RAID (strictly speaking, RAID is also an implementation of erasure coding).

There are multiple ways to implement erasure coding, of which Reed-Solomon seems to be the most widely adopted right now; Microsoft Azure and Facebook’s cold storage, for example, are said to have implemented it.

Since the calculation of the erasure code is more complex, the often quoted drawback is that it is more CPU intensive than RAID. Luckily, Intel is not only churning out more capable and efficient CPUs but is also contributing tools, like the Intelligent Storage Acceleration Library (Intel ISA-L), to make implementations more feasible.

Roughly speaking, erasure coding gives you about 50% more usable capacity compared to a triple-mirrored system.

Erasure Coding 4,2 example.

Erasure codes are typically quite flexible in the way you can implement them, meaning that you can specify (typically as the implementor, not the end user, but in some cases both) the ratio of data blocks to parity blocks. This impacts the protection level and the drive/node requirement. For example, if you choose to implement a 4,2 scheme, each file will be split into 4 data chunks, and from those 4 chunks 2 parity chunks are calculated; a 4,2 setup therefore requires 6 drives/nodes and can tolerate the loss of any 2 of them.
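A quick sketch of the raw-capacity arithmetic behind a 4,2 scheme versus triple replication:

```shell
# Raw storage consumed per 100 units of user data
data_chunks=4; parity_chunks=2
ec_raw=$(( (data_chunks + parity_chunks) * 100 / data_chunks ))  # 4,2 erasure code
rep_raw=$(( 3 * 100 ))                                           # three full copies
echo "erasure coding: ${ec_raw}%, replication: ${rep_raw}%"
```

Storing 1.5x the data footprint instead of 3x is where the roughly 50% usable-capacity gain quoted for erasure coding comes from.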

The logic behind it can seem quite complex; Backblaze has published a nice video explanation if you want to dig deeper.


Backup is Boring!

Yep, until it’s not.

When I was a consultant at a VAR a couple of years ago I implemented my fair share of backup and recovery solutions, products of different vendors which shall remain nameless, but one thing that always became clear was how excruciatingly painful the processes involved ended up being. Convoluted tape rotation schemas, figuring out backup windows in environments that were supposed to be operating in a 24/7 capacity, running out of capacity, missed pickups for offsite storage... the experience consistently sucked.

I think it’s fair to say that there has not been a lot of innovation in this market for the last decade or so, sure vendors put out new versions of their solutions on a regular basis and some new players have entered the market, but the core concepts have largely remained unchanged. How many times do you sit around at lunch with your colleagues and discuss exciting new developments in the data protection space… exactly…

So when is the “until it’s not” moment then?

I’m obviously biased here, but I think this market is ripe for disruption; if we take some (or most) of the pain out of the data protection process and make it a straightforward affair, I believe we can bring real value to a lot of people.

Rubrik does this by providing a simple, converged data management platform that combines traditionally disparate backup software pieces (backup SW, backup agents, catalog management, backup proxies,…) and globally deduplicated storage in one easily deployable and scalable package.

No more jumping from interface to interface to configure and manage something that essentially should be an insurance policy for your business (i.e. the focus should be on recovery, not backup). No more pricing and sizing individual pieces based on guesstimates; rather, scale out (and in) if and when needed, with all options included in the base package.

Because it is optimized for the modern datacenter (i.e. virtualization, scale-out architectures, hybrid cloud environments, flash-based optimizations,...) it is possible to consume data management as a service rather than through manual configuration. All interactions with the solution are available via REST APIs, and several other consumption options already make good use of this via community-driven initiatives like the PowerShell module and the VMware vRO plugin.


So essentially it gives you the ability to say no to the “we have always done it this way” mantra; it is time to bring (drag?) backup and recovery into the modern age.


Intel and Micron 3D XPoint


My day job is in networking, but I consider myself (on the journey to becoming) a full-stack engineer and like to dabble in lots of different technologies, as, I’m assuming, most of us geeks do. Intel and Micron have been working on a seeming breakthrough that combines memory and storage in one non-volatile device that is cheaper than DRAM (typically used as computer memory) and faster than NAND (typically used in SSD drives).

3D Xpoint

3D XPoint, as the name implies, is a crosspoint structure, meaning 2 wires crossing each other with “some material*” in between. It does not use transistors (like DRAM does), which makes it easier to stack (hence the 3D): for every 3 lines of metal you get 2 layers of this memory.

[Image: 3D XPoint crosspoint structure, showing memory cells, selectors, and perpendicular wires]

The columns contain a memory cell (the green section in the image) and a selector (the yellow section), connected by perpendicular wires (the greyish sections), allowing you to address each column individually by using one wire at the top and one wire at the bottom. These grids can be stacked three-dimensionally to maximise density.
The memory can be accessed/modified by sending varied voltages to each selector; in contrast, DRAM requires a transistor at each memory cell to access or modify it. This results in 3D XPoint being 10x denser than DRAM and 1000x faster than NAND (at the array level, not at the individual device level).

3D XPoint can be connected via PCIe NVMe and shows little wear effect over its lifetime compared to NAND. Intel will commercialise it in its Optane range, both as SSD disks and as DIMMs. (The difference between Optane and 3D XPoint is that 3D XPoint refers to the type of memory, while Optane includes the memory plus a controller package.)

1000x faster, really?

In reality Intel is getting about 7x the performance of a NAND MLC SSD (on NVMe) today (at 4 kB reads), and that is because of the inefficiencies of today’s storage stack.

The I/O passes through the filesystem, the storage stack, the driver, the bus/platform link (transfer and protocol, i.e. PCIe/NVMe), the controller firmware, the controller hardware (ASIC), the transfer from NAND to the buffers inside the SSD, etc. So 1000x is a theoretical number (and will no doubt show up on a lot of vendor marketing slides) but reality is a bit different.

So the focus is, and has been, on reducing latency; for example, the move to NVMe already reduced controller latency by roughly 20 microseconds (there is no HBA latency and the command set is much simpler).

[Image: latency comparison between AHCI (SATA) and NVMe]

The picture above shows the impact of the bus technology: on the left side you see AHCI (SATA) and on the right NVMe, and as you can see there is a significant latency difference between the two. NVMe also provides a lot more bandwidth than SATA (about 6x more on PCIe NVMe Gen3 and more than 10x on Gen4).

Another thing hindering the speed improvements of 3D XPoint is replication latency across nodes (it’s storage, so you typically want redundancy). To address this issue, work is underway on things like NVMe over Fabrics, to develop a standard for low-overhead replication. Other improvements in the pipeline involve optimising the storage stack, mostly at the OS and driver level. For example, because today’s paging algorithms were not designed with SSDs in mind, they try to optimise for things like seek time reduction, which is irrelevant here, so reducing paging overhead is a possibility.

They are also exploring “partial synchronous completion”: 3D XPoint is so fast that doing an asynchronous return, i.e. setting up for an interrupt and then waiting for interrupt completion, takes more time than simply polling for the data (we have to ignore queue depth here, i.e. assume that it will be 1).

Persistent memory

One way to overcome this “it’s only 7x faster” problem altogether is to move to persistent memory. In other words, you skip the storage stack latency by using 3D XPoint as DIMMs; for your typical reads and writes there is no software involved, and what little latency remains is caused entirely by the memory and the controller itself.

To enable this “storage class memory” you need to change or enable a few things: a new programming model, new libraries, new instructions, etc. So that’s a little further away, but it’s being worked on. What will probably ship this year is the SSD model (the 7x improvement), which is already pretty cool I think.

* It’s not really clear right now what those materials are exactly, which is part of the allure I guess 😉


Software Defined Shenanigans

Software defined anything (SDx) is the new black.

In July of last year VMware acquired Software Defined Networking (SDN) vendor Nicira, and suddenly every network vendor had an SDN strategy; they must have reckoned the Google hits alone from people searching for SDN justified a change in vision.

Now VMware (among others) is further leading the charge by talking about the Software Defined Data Center (SDDC), wherein everything in the data center is pooled, aggregated, and delivered as software, and managed by intelligent, policy-driven software.

Cloud and XaaS are so last year; SDx is where it’s at. It is the halo effect gone haywire.

A lot of networking and storage companies, both “legacy” and “start-up”, are scurrying around trying to figure out how to squeeze “Software-Defined” into their messaging.

So what defines software defined?

Software defined first appeared in the context of networking. Traditionally network devices were delivered as monolithic appliances, but logically you can think of them as consisting of 3 parts: the data plane, the management plane, and the control plane.

The data plane is relatively straightforward (no pun intended): it is where your data packets travel from point A to point B. When packets and frames arrive on the ingress ports of the network device, the forwarding table is what all routers and switches use to dispatch frames and packets to their egress ports.

The management plane, besides providing management functions such as device access, OS updates, etc., also delivers the forwarding table data from the control plane to the data plane.

The control plane is more involved: as networks become more sophisticated, the (routing) algorithms here can be pretty complex (and complexity often leads to bugs). These algorithms are not uniform and dynamic because they are expected to support a wide range of use cases and deployment scenarios.

The idea of SDN is to separate these planes: split the control/management function from the data function to increase flexibility. Now imagine you have moved the control function to a system that also controls other functions in your data center, like creating virtual machines and storage. No longer are you limited by silos of control; you can potentially manage everything that is needed to deploy new applications (VMs, network, storage, security, ...) from a single point of control (a single pane of glass?).

Is exposing APIs enough?

It has always been possible to control functions in a network device programmatically, and a lot of vendors are merely allowing you to control the existing control plane using APIs. I would argue this is not SDN, at least not in a purist sense, because it lacks scalability (this point is very debatable, I admit).

The aim is not to have the control plane in each monolithic device, but rather to have the intelligence outside, using OpenFlow for example, or the Big Network Controller from Big Switch Networks, allowing more flexibility and greater uniformity (one can dream).

Northbound API

The northbound API on a software-defined networking (SDN) controller enables applications and orchestration systems to program the network and request services from it. This is what the non-network vendors will use to integrate with your SDN; the problem today is that this interface is not standardised (yet?), meaning it is less open than we want it to be.

Is SDN the same as network virtualisation?

Network virtualisation adds a layer of abstraction (like all virtualisation) to the network, often using tunnelling or an overlay network across the existing physical network. Nicira uses STT, VMware already had VXLAN, Microsoft uses NVGRE, ... I would argue that network virtualisation is often an underlying part of SDN.

Software Defined Storage (SDS)

In the world of storage we are also exposed to software defined: a lot of storage start-ups use SDS messaging to combat existing (or legacy, as the start-ups would prefer) storage vendors and to claim they have something new and improved. This, in my humble opinion, is not always warranted.

If you define SDS like SDN, whereby the control plane is separated from the data plane, this is enabled by the lower-level storage systems abstracting their physical resources into software. The same reasons prevail: dynamism, flexibility, more control, ... These abstracted storage resources are then presented up to a control plane as “software-defined” services. The exposure and management of these services is done through an orchestration layer (like the northbound API in the SDN world). The quality and quantity of these services depends on the virtualisation and automation capabilities of the underlying hardware (is exposing APIs enough?).

Some would argue that, because of the existing architectures of legacy storage systems, this becomes more cumbersome and less flexible compared to the new start-up SDS players. Just as you have new players in SDN (Arista (even though they don’t seem to like the SDN terminology very much), Plexxi, ...) baking these technologies in from the ground up, you have the same with storage vendors, but I would argue that the rate of innovation seems much higher here. A lot of new storage vendors (ExaBlox, Tintri, Pure Storage, Nimble, ...), a lot of new architectures (Fusion-io, PernixData, SanDisk FlashSoft, ...), a lot of acquisitions of flash-based systems by legacy vendors, etc. mean that I don’t believe “legacy” storage vendors are going the way of the dinosaur just yet. I do however think it will lead to a lot of confusion, like software-only storage suddenly being SDS, etc.

Is SDS the same as storage virtualisation?

Like network virtualisation in SDN, storage virtualisation can play its part in SDS. Storage virtualisation is another abstraction between the server and the storage array; one such abstraction can be achieved by implementing a storage hypervisor. The storage hypervisor can aggregate multiple different arrays, from different vendors, or maybe even generic JBODs. The storage hypervisor tends not to use most of the capabilities of the array (at least not always), instead treating it as generic storage. DataCore, for example, sells a storage hypervisor, and so does Virsto, which was acquired by VMware.

In a more traditional sense the IBM SVC, NetApp V-Series, and EMC VPLEX can be considered storage virtualisation, or more accurately storage federation. And then you have logical volume managers, LUNs, RAID sets: all abstraction, all “virtualisation”... so a lot of FUD will be incoming.

Is it all hype?

Of course not. Some of the messaging might be confusing, and some vendors like to claim they are part of the latest trend without much to show for it, but the industry is moving, fast: adding functionality to legacy systems and building new architectures to deliver (at least partially) on the promise of better. As always, though, there is a lot of misinformation about certain capabilities in certain products; maybe a little too much talking and not enough delivering. I expect a great deal of consolidation in the next few years, both of companies and of terminology, so look carefully at who is doing what and how this matches your company’s strategy going forward. Exciting times ahead though.

The RFP process is broken

What is the RFP process?

The purpose of the Request for Proposal (RFP) is to smooth out any vendor bias and get a real point-by-point comparison between the solutions/proposals of multiple parties, in order to arrive at the best proposal to fill a specific need.

Why is it broken?

This often leads to a strangely worded document, one that lacks room for interpretation in some sections and leaves broad room in others, and that will not lead to an innovative, cost-effective, long-term (quick, what’s another buzzword?) solution that is in the customer’s best interest.

The customer often has some idea of what he (thinks he) needs, sometimes based on past experiences, and this shines through in the RFP document in such a way that any creativity on the vendor’s end is pointless.

I need X amount of IOPS, I need 64kB length block dedupe, I need SSL offloading in hardware, I need X amount of throughput, it needs to have a mermaid logo on it,…

“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.”

Stephen W. Hawking

I am not saying the customer is stupid, far from it, but he is limiting what we can offer because of his narrow lens. When you force us to answer your questions, and only those, it dilutes our differentiation and you end up with the same old stuff you hate using today.

Vendors are more than happy to influence the RFP; most RFPs I’ve read (and had to answer) contained at least some wording taken directly from competitors’ documents. If you can help write the rules, it suddenly becomes a lot easier to win the game. I’m not saying all of them are rigged, but some are. When reading through some RFPs it’s sometimes clear beforehand who will win. I believe you will find yourself getting fewer and fewer RFP responses going forward if you play this game.*

More often than not, we are not allowed to have a conversation with the customer once the RFP has been received, which leaves us trying to interpret a document whose context we don’t always fully grasp. Furthermore, you often have to provide a response in a limited space/format that makes a lot of assumptions about what the possible answer could be.

We often have no clue about the budget. Does the customer want an Aston Martin or a Volkswagen? (they all want the Aston, at the Volkswagen price ofc)

I also want to end world hunger, but secretly I don’t have the budget for it.

“You know we’re sitting on four million pounds of fuel, one nuclear weapon and a thing that has 270,000 moving parts built by the lowest bidder. Makes you feel good, doesn’t it?”

Steve Buscemi’s character in Armageddon

Then at the end of it, the requesting party scores the responses and picks the best one based on the weight of certain sections (is price most important? or compliance with the often arbitrary features?). This scoring is based on an interpretation of the document, which often forced the vendor to word things a certain way, and in a confined space, to make the answer fit. The customer then takes these answers and compares them with the same biases that shaped the RFP in the first place.
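To make the scoring mechanics concrete, here is a minimal sketch of how this kind of weighted evaluation typically works. The section names, weights, and vendor scores are all invented for illustration; they are not from any real RFP.

```python
# Hypothetical weighted RFP scoring: each section carries a weight,
# each vendor response gets a 0-10 score per section, and the weighted
# sum decides the "winner". All names and numbers are invented.

weights = {"price": 0.40, "features": 0.35, "support": 0.25}

responses = {
    "Vendor A": {"price": 7, "features": 9, "support": 6},
    "Vendor B": {"price": 9, "features": 6, "support": 8},
}

def weighted_score(scores: dict) -> float:
    """Weighted sum of section scores (0-10 scale)."""
    return sum(weights[section] * score for section, score in scores.items())

for vendor, scores in responses.items():
    print(f"{vendor}: {weighted_score(scores):.2f}")
```

Note how sensitive the outcome is to the weights: whoever sets them (or helps write them, see below) has already half-decided the result.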

What good does it do?

In a bigger organization it can be used as a tool to get buy in from all the parties that have a stake in the solution. It gets the noses pointed in the same direction and avoids territorial battles, and finger pointing after the solution has been purchased.

This is fine of course, but don’t use the RFP process for this, don’t externalize your own lack of coordination and communication by forcing it onto an RFP.

Not all RFPs are created equal

I’ve answered many, many RFPs and some are a lot better (yay for you, vendors! :rolleyes:, I see you think) than others.

When your RFP is non-prescriptive, provides a more open format for crafting responses, includes the ability to ask questions, gives us insight into the business issue instead of the technical issue you are trying to solve (so we can put forward our best thinking to help the business, and are not forced to jump through hoops to follow someone else’s), and offers some sense of budget, it goes a long way toward getting a better-suited response.

Also don’t send out your RFP just before the holidays, that’s just mean 🙂

*Death by RFP: 7 Reasons Not To Respond

SSL Acceleration

One of the prerequisites for WAN optimization is that the traffic we are attempting to de-duplicate across the WAN is not encrypted; we need “clear-text” data in order to find data patterns, so that de-duplication is most effective.

But Steelhead can optimize SSL encrypted data by applying the same optimization methods, while still maintaining end-to-end security and keeping the trust model intact.

To better understand how we perform SSL optimization let’s first look at a simple example of requesting a secured webpage from a webserver.

SSL example

The encryption using a private key/public key pair ensures that the data can be encrypted by one key but can only be decrypted by the other key pair. The trick in a key pair is to keep one key secret (the private key) and to distribute the other key (the public key) to everybody.
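The public/private key idea described above can be demonstrated with a toy RSA key pair. The primes below are absurdly small, chosen purely so the arithmetic is visible; real SSL uses keys thousands of bits long generated by a vetted crypto library, so never use anything like this in practice.

```python
# Toy RSA with tiny primes, purely to illustrate the public/private
# key pair concept. NOT real cryptography: real keys are thousands of
# bits and come from a vetted library.

p, q = 61, 53                 # two secret primes
n = p * q                     # modulus, shared by both keys (3233)
phi = (p - 1) * (q - 1)       # 3120
e = 17                        # public exponent (coprime with phi)
d = pow(e, -1, phi)           # private exponent: modular inverse of e (Python 3.8+)

message = 42
ciphertext = pow(message, e, n)    # anyone holding the public key (e, n) can encrypt
decrypted = pow(ciphertext, d, n)  # only the private key holder (d) can decrypt

print(message, "->", ciphertext, "->", decrypted)
```

Encrypting with the public key and decrypting with the private key gives confidentiality; doing it the other way around (encrypting with the private key) is the basis of digital signatures, which is how the server proves its identity during the handshake.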

All of the same optimizations that are applied to normal non-encrypted TCP traffic, you can also apply to encrypted SSL traffic. Steelhead appliances accomplish this without compromising end-to-end security and the established trust model. Your private keys remain in the data center and are not exposed in the remote branch office location where they might be compromised.

The Riverbed SSL solution starts with Steelhead appliances that have a configured trust relationship, enabling them to exchange information securely over their own dedicated SSL connection. Each client uses unchanged server addresses and each server uses unchanged client addresses; no application changes or explicit proxy configuration is required. Riverbed uses a unique technique to split the SSL handshake.

The handshake (the sequence depicted above) is the sequence of message exchanges at the start of an SSL connection. In an ordinary SSL handshake, the client and server first establish identity using public-key cryptography, and then negotiate a symmetric session key to use for data transfer. When you use Riverbed’s SSL acceleration, the initial SSL message exchanges take place between the Web browser and the server-side Steelhead appliance. At a high level, Steelhead appliances terminate an SSL connection by making the client think it is talking to the server and making the server think it is talking to the client. In fact, the client is talking securely to the Steelhead appliances. This is done by configuring the server-side Steelhead appliance with proxy certificates and private keys for the purpose of emulating the server.

SSL SH example

When the Steelhead appliance poses as the server, there does not need to be any change to either the client or the server. The security model is not compromised—the optimized SSL connection continues to guarantee server-side authentication, and prevents eavesdropping and tampering. The connection between the two Steelheads is secured by the use of secure peering (a separate SSL tunnel running between the two appliances).

Citrix HDX and WAN optimization (part 1)


In this first of a two-part blog post about Citrix HDX I want to explore the impact of HDX on the Wide Area Network. Part one will serve as the introduction, and in part two I will test-run some of the scenarios described in part one.

HDX came to be because Citrix was finally getting competitive pressure on its Independent Computing Architecture (ICA) protocol from Microsoft with RDP version 7 and beyond and Teradici/VMware with PCoIP. (And arguably other protocols like Quest EOP Xstream, HP RGS, RedHat SPICE, etc.)

Citrix’s reaction to these competitive pressures has been to elevate the conversation above the protocol, stating that a great user experience is more than just a protocol, thus Citrix created the HDX brand to discuss all the elements in addition to ICA that Citrix claims allow it to deliver the best user experience.

HDX Brands

HDX is not a feature or a technology — it is a brand.

Short for “High Definition user eXperience,” HDX is the umbrella term that encapsulates several different Citrix technologies. Citrix has created HDX sub-brands, these include the list below and each brand represents a variety of technologies:

  • HDX Broadcast (ICA)
    • Capabilities for providing virtual desktops and applications over any network. This is the underlying transport for many of the other HDX technologies; it includes instant mouse click feedback, keystroke latency reduction, multi-level compression, session reliability, queuing and tossing.
  • HDX MediaStream
    • Capabilities for multimedia such as sound and video, using HDX Broadcast as its base, including client-side rendering (streaming the content to the local client device for playback via local codecs, with seamless embedding into the remote session).
    • Flash redirection (Flash v2), Windows Media redirection.
  • HDX Realtime
    • Capabilities for real-time communications such as voice and web cameras, using HDX Broadcast as its base; it includes EasyCall (VoIP integration) and bi-directional audio functionality.
  • HDX SmartAccess
    • Refers mainly to the Citrix Access Gateway (SSL VPN) and cloud gateway components for single sign-on.
  • HDX RichGraphics  (incl 3D, 3D PRO, and GDI+ remoting)
    • Capabilities for remoting high-end graphics using HDX Broadcast as its base; uses image acceleration and progressive display for graphically intense images (formerly known as Project Apollo).
  • HDX Plug-n-Play
    • Capabilities to provide connectivity for local devices and applications in a virtualized environment, including USB, multi-monitor support, smart card support, special folder redirection, universal printing, and file-type associations.
  • HDX WAN Optimization
    • Capabilities to locally cache bandwidth-intensive data and graphics, and to locally stage streamed applications (formerly known as IntelliCache, relying mostly on their Branch Repeater product line).
  • HDX Adaptive Orchestration
    • Capabilities that enable seamless interaction between the HDX technology categories. The central concept is that all these components work adaptively to tune the unified HDX offering for the best possible user experience.


The goal of this post is to provide an overview of these HDX sub-brands and technologies that directly relate to the network, and WAN optimization, in order to have a clearer understanding of marketing vs. technology impact.

Not every HDX feature is available on both XenApp and XenDesktop, (and now also VDI in-a-box after the acquisition of Kaviza) the table below shows the feature matrix for both:

HDX and the network

As stated before most of the HDX technologies are either existing ICA components or rely on ICA (HDX Broadcast) as a remoting protocol. As such we should be able to (WAN) optimize most of the content within HDX one way or another.

HDX MediaStream

HDX MediaStream is used to optimize the delivery of multimedia content, it interacts with the Citrix Receiver (ICA Client) to determine the optimal rendering location (see overview picture below) for Windows Media and Flash content.

Within HDX MediaStream the process of obtaining the multimedia content and displaying the multimedia content are referenced by the terms fetching and rendering respectively.

Within HDX MediaStream, fetching the content is the process of obtaining or downloading the multimedia content from a location external (Internet, Intranet, fileserver (for WMV only)) to the virtual desktop. Rendering utilizes resources on the machine to decompress and display the content within the virtual desktop. In a Citrix virtual desktop that is being accessed via Citrix Receiver, rendering of content can be executed by either the client or the hypervisor depending on the policies and environmental resources available.


Adaptive Display (server-side rendering) provides the ability to fetch and render multimedia content on the virtual machine running in the datacenter and send the rendered content over ICA to the client device. This translates to more bandwidth needed on the network than client-side rendering. However, in certain scenarios client-side rendering can use more bandwidth than server-side rendering; it is, after all, adaptive.

HDX MediaStream Windows Media Redirection (client side rendering) provides the ability to fetch Windows Media content (inclusive of WMV, DivX, MPEG, etc.) on the server and render the content within the virtual desktop by utilizing the resources on the client hosting Citrix Receiver (Windows or Linux). When Windows Media Redirection is enabled via Citrix policy, Windows video content is sent to the client through an ICA Virtual Channel in its native, compressed format for optimal performance. The processing capability of the client is then utilized to deliver smooth video playback while offloading the server to maximize server scalability. Since the data is sent in its native compressed format this should result in less bandwidth needed on the network than server side rendering.

HDX MediaStream Flash Redirection  (client side rendering) provides the ability to harness the bandwidth and processing capability of the client to fetch and render Flash content. By utilizing Internet Explorer API hooks, Citrix Receiver is able to securely capture the content request within the virtual desktop and render the Flash data stream directly on the client machine. Added benefits include increased server hypervisor scalability as the servers are no longer responsible for processing and delivering Flash multimedia to the client.

This usually decreases the wan bandwidth requirements by 2 to 4 times compared to Adaptive Display (server side rendering).

HDX MediaStream network considerations

In some cases, Windows Media Redirection (client-side rendering of the video) can use significantly more bandwidth than Adaptive Display (server-side rendering of the video).

In the case of low-bit-rate videos, Adaptive Display may use more bandwidth than the native bit rate of the Windows Media content. This extra bandwidth usage occurs because full-screen updates are being sent across the connection rather than the raw video content.

Packet loss over the WAN connection is the most restricting aspect of an enhanced end-user experience for HDX MediaStream.

Citrix Consulting Solutions recommends Windows Media Redirection (client-side rendering) for WAN connections with a packet loss less than 0.5%.

Windows Media Redirection requires enough available bandwidth to accommodate the video bit rate. This can be controlled using SmartRendering thresholds: SmartRendering controls when the video reverts to server-side rendering because the bandwidth is not available. Citrix recommends setting the threshold to 8 Mbps.
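Putting the recommendations above together, the rendering-location decision can be sketched as a simple rule set. This is an illustrative helper, not Citrix code; the function name and structure are my own, only the 0.5% packet-loss guideline and the 8 Mbps SmartRendering threshold come from the text.

```python
# Illustrative decision helper (not Citrix code) applying the rules of
# thumb above: redirect Windows Media to the client only when packet
# loss is below 0.5% and the available bandwidth covers the video bit
# rate, up to the recommended 8 Mbps SmartRendering threshold.

SMART_RENDERING_THRESHOLD_KBPS = 8_000  # Citrix-recommended ceiling

def rendering_location(bandwidth_kbps: float, loss_pct: float,
                       video_bitrate_kbps: float) -> str:
    if loss_pct >= 0.5:
        return "server"  # lossy WAN: fall back to Adaptive Display
    if video_bitrate_kbps > min(bandwidth_kbps, SMART_RENDERING_THRESHOLD_KBPS):
        return "server"  # not enough headroom for the native stream
    return "client"      # redirect: compressed media over the ICA virtual channel

print(rendering_location(2_000, 0.1, 464))    # standard WMV on a clean 2 Mbps link
print(rendering_location(2_000, 0.1, 6_500))  # HD stream exceeds the link
print(rendering_location(10_000, 1.0, 464))   # plenty of bandwidth, too much loss
```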

WAN optimization should provide the most benefit when the video is rendered on the client, since the data stream for the compressed Windows Media content is similar between client devices; once the video has been viewed by one person in the branch, very little bandwidth is consumed when other workers view the same video.

HDX RichGraphics 3D Pro

HDX 3D Pro can be used to deliver any application that is compatible with the supported host operating systems, but is particularly suitable for use with DirectX and OpenGL-driven applications, and with rich media such as video.

The computer hosting the application can be either a physical machine or a XenServer VM with Multi-GPU Passthrough. The Multi-GPU Passthrough feature is available with Citrix XenServer 6.0.

For CPU-based compression, including lossless compression, HDX 3D Pro supports any display adapter on the host computer that is compatible with the application that you are delivering. To use GPU-based deep compression, HDX 3D Pro requires that the computer hosting the application is equipped with a NVIDIA CUDA-enabled GPU and NVIDIA CUDA 2.1 or later display drivers installed. For optimum performance, Citrix recommends using a GPU with at least 128 parallel CUDA cores for single-monitor access.

To access desktops or applications delivered with XenDesktop and HDX 3D Pro, users must install Citrix Receiver. GPU-based deep compression is only available with the latest versions of Citrix Receiver for Windows and Citrix Receiver for Linux.

HDX 3D Pro supports all monitor resolutions that are supported by the GPU on the host computer. However, for optimum performance with the minimum recommended user device and GPU specifications, Citrix recommends maximum monitor resolutions for users’ devices of 1920 x 1200 pixels for LAN connections and 1280 x 1024 pixels for WAN connections.

Users’ devices do not need a dedicated GPU to access desktops or applications delivered with HDX 3D Pro.

HDX 3D Pro includes an image quality configuration tool that enables users to adjust in real time the balance between image quality and responsiveness to optimize their use of the available bandwidth.

HDX RichGraphics 3D Pro network considerations

HDX 3D Pro has significant bandwidth requirements depending on the encoding used (NVIDIA CUDA encoding, CPU encoding, or lossless).


When supported NVIDIA chipsets are utilized, HDX 3D Pro offers the ability to compress the ICA session into a video stream. This significantly reduces bandwidth and CPU usage on both ends by utilizing NVIDIA CUDA-based deep compression. If an NVIDIA GPU is not present to provide compression, the server CPU can be utilized to compress the ICA stream. This method, however, does introduce a significant impact on CPU utilization. The highest-quality method for delivering a 3D-capable desktop is the Lossless option. As the name implies, no compression of the ICA stream occurs, allowing pixel-perfect images to be delivered to the endpoint. This option exists for delivering medical imaging software that cannot have degraded image quality. This level of high-quality imaging does come at the price of very high bandwidth requirements.

HDX RichGraphics GDI and GDI+ remoting

GDI (Graphics Device Interface) and GDI+ remoting allows Microsoft Office specifically (although other apps, like WordPad, also use GDI) to be remoted to the client using native graphics commands instead of bitmaps. By using native graphics commands, it saves server-side CPU, saves network bandwidth, and eliminates visual artifacts, since the output doesn’t need to be compressed using image compression.

General network factors for Remoting protocols (including RDP/RemoteFX, ICA, PCoIP, Quest EoP,…)

  • Bandwidth – the protocols take all they can get; roughly 2 Mbps is required for a decent user experience (see planning bandwidth requirements below)
  • Latency – things start getting tough at 50 ms (sometimes even at 20 ms)
  • Packet loss – should stay under 1%

Planning bandwidth requirements for HDX (XenDesktop example)

Citrix publishes the numbers below in a medium (user load) user environment, this gives some indication as to what to expect in terms of network sizing.

  • MS Office-based: 43 Kbps
  • Internet: 85 Kbps
  • Printing (5 MB Word doc): 555-593 Kbps
  • Flash video (server rendered): 174 Kbps
  • Standard WMV video (client rendered): 464 Kbps
  • HD WMV video (client rendered): 1812 Kbps

These are estimates. If a user watches a WMV HD video with a bit rate of 6.5 Mbps, that user will require a network link with at least that much bandwidth. In addition to the WMV video, the link must also be able to support the other user activities happening at the same time.
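Using the per-activity estimates above, a rough branch-link sizing exercise is simple arithmetic. This sketch is purely illustrative (the user mix is invented); real sizing needs measurement of actual workloads.

```python
# Rough branch-link sizing from the per-activity Citrix estimates above.
# Illustrative only: the user mix below is invented.

KBPS = {
    "office": 43,                  # MS Office-based
    "internet": 85,                # web browsing
    "flash_server_rendered": 174,  # Flash video, server rendered
    "wmv_client_rendered": 464,    # standard WMV, client rendered
}

def branch_bandwidth_kbps(users: dict) -> int:
    """Sum per-activity bandwidth across concurrent users."""
    return sum(KBPS[activity] * count for activity, count in users.items())

# e.g. 20 Office users, 10 browsing, 2 watching a standard WMV video
demand = branch_bandwidth_kbps({"office": 20, "internet": 10,
                                "wmv_client_rendered": 2})
print(demand, "Kbps")  # 2638 Kbps
```

The sum only covers the steady-state estimates; as the text notes, a single high-bit-rate video can dwarf everything else on the link.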

Also, if multiple users are expected to be accessing the same type of content (videos, web pages, documents, etc.), integrating WAN Optimization into the architecture can drastically reduce the amount of bandwidth consumed. However, the amount of benefit is based on the level of repetition between users.

Note: Riverbed Steelhead can optimize ICA/HDX traffic extremely well; we even support the newer multi-stream ICA protocol. In part 2 of this blog I will demonstrate the effectiveness of Steelhead on HDX traffic and talk about our Citrix-specific optimizations, like our very effective Citrix QoS. Riverbed Steelheads also have the ability to decode the ICA Priority Packet Tagging that identifies the virtual channel from which each Citrix ICA packet originated.

As part of this capability, Riverbed specifically developed a packet-order queuing discipline that respects the ordering of ICA packets within a flow, even when different packets from a given flow are classified by Citrix into different ICA virtual channels. This allows the Steelhead to deliver very granular Quality of Service (QoS) enforcement based on the virtual channel in which the ICA data is transmitted. Most importantly, this feature prevents any possibility of out-of-order packet delivery as a result of Riverbed’s QoS enforcement; out-of-order packet delivery would cause significant degradation in performance and responsiveness for the Citrix ICA user. Riverbed’s packet-order queuing capability is patent-pending and not available from any other WAN optimization vendor.
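The packet-order queuing idea can be sketched as follows. This is a hypothetical illustration of the concept only, not Riverbed’s implementation: only the head packet of each flow is eligible to be sent, so a high-priority packet can never overtake an earlier packet of its own flow.

```python
from collections import deque

# Conceptual sketch of packet-order queuing (NOT Riverbed's code):
# dequeue the highest-priority eligible packet, where only the head of
# each flow is eligible, so intra-flow order is always preserved.

class PacketOrderQueue:
    def __init__(self):
        self._flows = {}    # flow_id -> deque of (priority, arrival_seq, payload)
        self._arrival = 0

    def enqueue(self, flow_id, priority, payload):
        """Lower priority number = more urgent."""
        self._flows.setdefault(flow_id, deque()).append(
            (priority, self._arrival, payload))
        self._arrival += 1

    def dequeue(self):
        # Only flow heads compete; among heads, prefer higher priority,
        # then earlier arrival.
        heads = [(fid, q[0]) for fid, q in self._flows.items() if q]
        if not heads:
            return None
        fid, _ = min(heads, key=lambda h: (h[1][0], h[1][1]))
        _, _, payload = self._flows[fid].popleft()
        return payload

q = PacketOrderQueue()
q.enqueue("flow1", priority=2, payload="f1-p1")  # low priority, first in flow1
q.enqueue("flow1", priority=0, payload="f1-p2")  # urgent, but same flow
q.enqueue("flow2", priority=1, payload="f2-p1")

print([q.dequeue() for _ in range(3)])
```

Note that the urgent "f1-p2" still waits behind "f1-p1": the scheduler trades a little priority strictness for guaranteed in-order delivery within each flow, which is exactly the property the paragraph above describes.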

Real-world impact can be seen in the picture below of a customer saving 14 GB of ICA traffic over a transatlantic link every month.

Name a famous Belgian

As a Belgian working for an American multinational company I often, mockingly, get asked to name 10 famous Belgians, and I must admit we Belgians are not that good at self-promotion, it seems (too busy making beer, chocolate, waffles, and “french fries”).

When thinking about the Internet, and more specifically the World Wide Web, the Internet’s first killer application if you will, no Belgians spring to mind either.

I give you Robert Cailliau, who together with the more well known Tim Berners-Lee made WWW a reality.

Another key application that relies on the Internet is Software as a Service (SaaS). For SaaS applications to work, the end user’s communication needs to travel the path between your location and the SaaS provider, and this path is the Internet.
So how does the Internet make sure you get fast access to the server where your SaaS application is running?

The Internet relies on a routing protocol, BGP, to get your request to the SaaS provider. BGP is used between ISPs as an interconnect to somewhat reliably stitch all these separate networks together (the routing protocol used within each ISP’s autonomous system can be different; it does not need to be BGP) so your packet gets where it needs to be.

Another well-kept secret, like Robert Cailliau, is that the Border Gateway Protocol (BGP) was not really designed to give you the fastest route between all these autonomous systems; how could it, since the Internet belongs to no one and everyone(’s request) should be treated equally (feel free to mentally picture a Guy Fawkes mask here).
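A toy example makes the point. In BGP’s best-path selection, one of the dominant tie-breakers is AS-path length: fewer hops through autonomous systems wins, regardless of how slow those hops are. The AS numbers and latencies below are invented for illustration.

```python
# Toy illustration: BGP commonly prefers the shortest AS path, which is
# not necessarily the lowest-latency path. AS numbers (from the private
# range) and latencies are invented.

routes = [
    {"as_path": [64501, 64510],         "latency_ms": 180},  # short but slow
    {"as_path": [64501, 64502, 64520],  "latency_ms": 40},   # longer but fast
]

bgp_choice = min(routes, key=lambda r: len(r["as_path"]))   # shortest AS path
fastest = min(routes, key=lambda r: r["latency_ms"])        # what you'd want

print("BGP picks:", bgp_choice["as_path"], "->", bgp_choice["latency_ms"], "ms")
print("Fastest:  ", fastest["as_path"], "->", fastest["latency_ms"], "ms")
```

(Real BGP selection considers local preference, MED, and other attributes before AS-path length, but none of them measure latency either, which is the point: the protocol optimizes for reachability and policy, not speed.)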

This is why, in order to provide a more predictable performance across the Internet, you need a solution like Steelhead Cloud Accelerator to:

  • A: Give you the fastest path across the Internet (using Akamai SureRoute)
  • B: Minimize the application and data overhead (using Steelhead transport, application, and data streamlining)