Category: Technology

Erasure Coding – a primer

A surefire way to end up looking for another job in IT is to lose important data. Typically, when users in an organisation store data they expect that data to be safe and always retrievable (and, as we all know, data loss in storage systems is never fully avoidable). Data also keeps growing; a corollary to Parkinson's law is that data expands to fill the space available for storage, just like clutter around your house.

Because of the constant growth of data there is a greater need to both protect said data and to store it in a more space-efficient way. Large web-scale companies like Google, Facebook, and Amazon need to store and protect incredible amounts of data, yet they do not rely on traditional data protection schemes like RAID, because RAID is simply not a good match for the hard disk capacity increases of late.

Sure sure, but I’m not Google…

Fair point, but take a look at the way modern data architectures are built and applied even in the enterprise space. Most hyper-converged infrastructure players, for example, employ a storage replication scheme to protect the data that resides on their platforms; they simply cannot afford the long rebuild times associated with multi-terabyte hard disks in a RAID-based scheme. The same goes for most object storage vendors. As an example, let's take a 1 TB disk: its typical sequential write speed sits around 115 MBps, so 1,000,000 MB / 115 MBps ≈ 8,700 seconds, which is nearly two and a half hours. If you are using 4 TB disks your rebuild time will be closer to ten hours. And that is ignoring the RAID calculation that needs to happen simultaneously and the other IO in the system that the storage controllers need to deal with.
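
If you want to play with that arithmetic yourself, here is a tiny back-of-the-envelope sketch in Python. It uses the same assumed 115 MBps sequential write speed and is a best-case estimate, ignoring the parity calculation and the competing IO mentioned above:

def rebuild_hours(capacity_tb, write_mbps=115):
    """Best-case rebuild time: disk capacity divided by sequential write speed."""
    seconds = (capacity_tb * 1_000_000) / write_mbps   # 1 TB is roughly 1,000,000 MB
    return seconds / 3600

for size_tb in (1, 4, 8):
    print(f"{size_tb} TB disk: ~{rebuild_hours(size_tb):.1f} hours")   # ~2.4, ~9.7, ~19.3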

RAID 5 protection example.

Let’s say we have 3 HDDs in a RAID 5 configuration: data is spread over 2 drives and the 3rd one is used to store the parity information (strictly speaking RAID 5 rotates the parity across all drives, but let's keep it simple). The parity is basically an exclusive OR (XOR) function.

Let’s say I have 2 bits of data that I write to the system: disk 1 has the first bit, disk 2 the second bit, and disk 3 holds the parity bit (the XOR of the two). Now I can lose any one of the three disks and the system is able to reconstruct the missing bit, as demonstrated by the XOR truth table below:

A    B    A XOR B (parity)
0    0    0
0    1    1
1    0    1
1    1    0

Let’s say I write bit 1 and bit 0 to the system: 1 is stored on disk A, 0 is stored on disk B, and the parity disk stores 1 XOR 0 = 1. If I lose disk A [1], I still have disk B [0] and the parity disk [1]. According to the table, B [0] XOR parity [1] = 1, so I can still reconstruct my data.
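
To make the parity idea concrete beyond single bits, here is a minimal Python sketch that XORs whole (made-up) data blocks together and rebuilds a lost block from the survivors; this is the essence of what a RAID rebuild does, just on a much larger scale:

from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks together."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

disk_a = b"\x01\xff\x10"                  # example data block on disk A
disk_b = b"\x00\x0f\x22"                  # example data block on disk B
parity = xor_blocks([disk_a, disk_b])     # what the parity disk stores

rebuilt_a = xor_blocks([disk_b, parity])  # disk A dies: XOR the survivors
assert rebuilt_a == disk_a
print(rebuilt_a.hex())                    # -> 01ff10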

But as we have established, rebuilding these large disks is unfeasible, so what the HCI builders do is replicate all data, typically 3 times, across their architecture to protect against multiple component failures. This is of course great from an availability point of view, but not so much from a usable capacity point of view.

Enter erasure coding.

So at a high level, what happens with erasure coding is that when data is written to the system, instead of using RAID or simply replicating it multiple times to different parts of the environment, the system applies slightly more complex mathematical functions (including matrix and Galois field arithmetic*) than the simple XOR we saw in RAID (strictly speaking, RAID is also an implementation of erasure coding).

There are multiple ways to implement erasure coding, of which Reed-Solomon seems to be the most widely adopted one right now; Microsoft Azure and Facebook's cold storage, for example, are said to have implemented it.

Since the calculation of the erasure code is more complex, the often-quoted drawback is that it is more CPU intensive than RAID. Luckily we have Intel, who are not only churning out more capable and efficient CPUs but are also contributing tools, like the Intelligent Storage Acceleration Library (Intel ISA-L), to make implementations more feasible.

As the video above mentions you roughly get 50% more capacity with erasure coding compared to a triple mirrored system.

Erasure Coding 4,2 example.

Erasure codes are typically quite flexible in the way you can implement them, meaning that you can specify (typically as the implementor, not the end user, but in some cases both) the ratio of data blocks to parity blocks. This then impacts the protection level and the drive/node requirement. For example, in a 4,2 scheme each file is split into 4 data chunks, and 2 parity chunks are calculated over those 4 chunks; such a setup requires 6 drives/nodes and can survive the loss of any 2 of them.
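
To illustrate the "any 4 out of the 6 chunks are enough" property, here is a toy Reed-Solomon style (4,2) code in Python over the small prime field GF(257). Real implementations work over GF(2^8) with lookup tables (which is exactly the kind of thing libraries like Intel ISA-L accelerate), so treat this purely as a sketch of the idea, not as production erasure coding:

P = 257  # small prime field; every non-zero element has a modular inverse

def interpolate_eval(points, x):
    """Evaluate, at x, the polynomial passing through `points` [(xi, yi), ...] mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse of den
    return total

def encode(data):                     # data: 4 values in 0..255
    points = list(enumerate(data))    # data chunks live at x = 0, 1, 2, 3
    return list(data) + [interpolate_eval(points, x) for x in (4, 5)]  # add 2 parity chunks

def decode(chunks):                   # chunks: {position: value}, any 4 of the 6
    points = list(chunks.items())[:4]
    return [interpolate_eval(points, x) for x in range(4)]

stored = encode([10, 20, 30, 40])                                    # 6 chunks in total
survivors = {i: v for i, v in enumerate(stored) if i not in (1, 4)}  # lose any 2 of them
print(decode(survivors))                                             # -> [10, 20, 30, 40]

Losing any two of the six chunks still leaves four points, which is enough to reconstruct the polynomial and therefore the original four data chunks.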

The logic behind it can seem quite complex; I have linked to a nice video explanation by Backblaze below.

* http://web.eecs.utk.edu/~plank/plank/papers/CS-96-332.pdf

Backup is Boring!

Yep, until it’s not.

When I was a consultant at a VAR a couple of years ago I implemented my fair share of backup and recovery solutions, products of different vendors which shall remain nameless, but one thing that always became clear was how excruciatingly painful the processes involved ended up being. Convoluted tape rotation schemes, figuring out backup windows in environments that were supposed to be operating in a 24/7 capacity, running out of capacity, missed pickups for offsite storage,… the experience consistently sucked.

I think it’s fair to say that there has not been a lot of innovation in this market for the last decade or so. Sure, vendors put out new versions of their solutions on a regular basis and some new players have entered the market, but the core concepts have largely remained unchanged. How many times do you sit around at lunch with your colleagues and discuss exciting new developments in the data protection space… exactly…

So when is the “until it’s not” moment then?

I’m obviously biased here but I think this market is ripe for disruption, if we take some (or most) of the pain out of the data protection process and make it a straightforward affair I believe we can bring real value to a lot of people.

Rubrik does this by providing a simple, converged data management platform that combines traditionally disparate backup software pieces (backup SW, backup agents, catalog management, backup proxies,…) and globally deduplicated storage in one easily deployable and scalable package.

No more jumping from interface to interface to configure and manage something that essentially should be an insurance policy for your business (i.e. the focus should be on recovery, not backup). No more pricing and sizing individual pieces based on guesstimates; rather, scale out (and in) if and when needed, with all options included in the base package.

Because it is optimized for the modern datacenter (i.e. virtualization, scale-out architectures, hybrid cloud environments, flash-based optimizations,…) it is possible to consume data management as a service rather than through manual configuration. All interactions with the solution are available via REST APIs, and several other consumption options are already making good use of this via community-driven initiatives like the PowerShell module and the VMware vRO plugin (for more info please see: https://github.com/rubrikinc ).

So essentially it gives you the ability to say no to the “we have always done it this way” mantra; it is time to bring (drag?) backup and recovery into the modern age.

 

Intel and Micron 3D XPoint

Introduction

My day job is in networking but I do consider myself (on the journey to become) a full stack engineer and like to dabble in lots of different technologies like, I’m assuming, most of us geeks do. Intel and Micron have been working on what seems to be a breakthrough that combines memory and storage in one non-volatile device that is cheaper than DRAM (typically computer memory) and faster than NAND (typically an SSD drive).

3D XPoint

3D XPoint, as the name implies, is a cross-point structure, meaning 2 wires crossing each other with “some material*” in between. It does not use transistors (like DRAM does), which makes it easier to stack (hence the 3D): for every 3 lines of metal you get 2 layers of this memory.

[Figure: 3D XPoint cross-point structure]

The columns contain a memory cell (the green section in the picture above) and a selector (the yellow section in the picture above), connected by perpendicular wires (the greyish sections in the picture above), allowing you to address each column individually by using one wire at the top and one wire at the bottom. These grids can be stacked 3 dimensionally to maximise density.
The memory can be accessed/modified by sending varied voltages to each selector; in contrast, DRAM requires a transistor at each memory cell to access or modify it. This results in 3D XPoint being 10x denser than DRAM and 1000x faster than NAND (at the array level, not at the individual device level).

3D XPoint can be connected via PCIe NVMe and has little wear effect over its lifetime compared to NAND. Intel will commercialise this in its Optane range, both as an SSD disk and as DIMMs. (The difference between Optane and 3D XPoint is that 3D XPoint refers to the type of memory, while Optane includes the memory and a controller package.)

1000x faster, really?

In reality Intel is getting 7x performance today compared to a NAND MLC SSD (on NVMe, at 4 kB reads); the reason it is not more is the inefficiency of the storage stack we have today.

The I/O passes through the filesystem, storage stack, driver, bus/platform link (transfer and protocol i.e. PCIe/NVMe), controller firmware, controller hardware (ASIC), transfer from NAND to the buffers inside the SSD, etc. So 1000x is a theoretical number (and will show up on a lot of vendor marketing slides no doubt) but reality is a bit different.

So the focus is, and has been, on reducing latency; for example, the move to NVMe already reduced controller latency by roughly 20 microseconds (there is no HBA latency and the command set is much simpler).

[Figure: AHCI (SATA) vs. NVMe latency comparison]

The picture above shows the impact of the bus technology: on the left side you see AHCI (SATA) and on the right NVMe, and as you can see there is a significant latency difference between the two. NVMe also provides a lot more bandwidth than SATA (about 6x more on PCIe NVMe Gen3 and more than 10x on Gen4).

Another thing that is hindering the speed improvements of 3D XPoint is replication latency across nodes (it’s storage, so you typically want redundancy). To address this issue, work is underway on things like “NVMe over Fabrics” to develop a standard for low-overhead replication. Other improvements in the pipe are optimisations of the storage stack, mostly at the OS and driver level. For example, because today’s paging algorithms were not designed with SSDs in mind, they try to optimise for seek time reduction and other things that are irrelevant here, so reducing paging overhead is a possibility.

They are also exploring “partial synchronous completion”: 3D XPoint is so fast that doing an asynchronous return, i.e. setting up an interrupt and then waiting for the interrupt to complete, takes more time than simply polling for the data (we have to ignore queue depth here, i.e. assume that it will be 1).
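
As a rough analogy, here is a small Python sketch with threads standing in for the device and the driver (a real driver would poll an NVMe completion queue instead, and the absolute numbers here are meaningless); it only illustrates the two waiting strategies, sleeping until woken up versus busy polling:

import threading, time

def wait_for_completion(style, device_latency_s=10e-6):
    """Compare sleeping until woken (interrupt-style) with busy polling."""
    done = threading.Event()
    threading.Timer(device_latency_s, done.set).start()  # the "device" completes later
    start = time.perf_counter()
    if style == "interrupt":
        done.wait()                    # give up the CPU, pay the wake-up/scheduling cost
    else:
        while not done.is_set():       # burn CPU, but notice the completion immediately
            pass
    return (time.perf_counter() - start) * 1e6  # microseconds

for style in ("interrupt", "poll"):
    print(style, f"{wait_for_completion(style):.0f} us")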

Persistent memory

One way to overcome this “it’s only 7x faster” problem altogether is to move to persistent memory. In other words, you skip over the storage stack latency by using 3D XPoint as DIMMs, i.e. for your typical reads and writes there is no software involved; what little latency remains is caused entirely by the memory and the controller itself.

To enable this “storage class memory” you need to change/enable some things, like a new programming model, new libraries, new instructions etc. So that’s a little further away but it’s being worked on. What will probably ship this year is the SSD model (the 7x improvement) which is already pretty cool I think.

* It’s not really clear right now what those materials entail exactly, which is part of its allure I guess 😉

 

Software Defined Shenanigans

Software defined anything (SDx) is the new black.

In July of last year VMware acquired Software Defined Networking (SDN) vendor Nicira, and suddenly every network vendor had an SDN strategy; they must have reckoned the Google hits alone from people searching for SDN justified a change in vision.

Now VMware (among others) is further leading the charge by talking about the Software Defined Data Center (SDDC), wherein everything in the data center is pooled, aggregated, and delivered as software, and managed by intelligent, policy-driven software.

Cloud and XaaS are so last year, SDx is where it’s at, it is the halo effect gone haywire.

A lot of networking and storage companies, both “legacy” and “start-up”, are scurrying around trying to figure out how to squeeze “Software-Defined” into their messaging.

So what defines software defined?

Software defined first appeared in the context of networking. Traditionally network devices were delivered as monolithic appliances, but logically you can think of them as consisting of 3 parts: the data plane, the management plane, and the control plane.

The data plane is relatively straightforward (no pun intended): it is where your data packets travel from point A to point B. When packets and frames arrive on the ingress ports of the network device, the forwarding table is what routers and switches use to dispatch them to their egress ports.

The management plane, besides providing management functions such as device access, OS updates, etc., also delivers the forwarding table data from the control plane towards the data plane.

The control plane is more involved: as networks become more sophisticated, the (routing) algorithms here can be pretty complex (and complexity often leads to bugs). These algorithms are not uniform, and have to be dynamic, because they are expected to support a wide range of use cases and deployment scenarios.

The idea of SDN is to separate these planes: split the control/management function from the data function to increase flexibility. Now imagine you have moved the control function to a system that also controls other functions in your data center, like creating virtual machines and storage. No longer are you limited by silos of control; you can potentially manage everything that is needed to deploy new applications (VM, network, storage, security, …) from a single point of control (single pane of glass?).
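
As a toy illustration of that separation (this is not any real OpenFlow or controller API; all names here are made up), you can think of it as one central controller computing forwarding state and pushing it down to otherwise dumb switches:

class Switch:
    """Data plane only: a forwarding table and a lookup."""
    def __init__(self, name):
        self.name = name
        self.flow_table = {}                      # match -> action

    def install_flow(self, match, action):        # the "southbound" interface
        self.flow_table[match] = action

    def forward(self, destination):
        return self.flow_table.get(destination, "drop")

class Controller:
    """Control plane: computes state and programs every switch it manages."""
    def __init__(self, switches):
        self.switches = switches

    def provision_path(self, destination, port_map):   # a "northbound" request
        for switch in self.switches:
            switch.install_flow(destination, f"out port {port_map[switch.name]}")

edge, core = Switch("edge1"), Switch("core1")
Controller([edge, core]).provision_path("10.0.0.0/24", {"edge1": 1, "core1": 4})
print(edge.forward("10.0.0.0/24"))                # -> out port 1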

Is exposing APIs enough?

It has always been possible to control functions in a network device programmatically, and a lot of vendors are merely allowing you to control the existing control plane using APIs. I would argue this is not SDN, at least not in a purist sense, because it lacks scalability (this point is very debatable, I admit).

The aim is not to have the control plane in each monolithic device, but rather to have the intelligence outside, using OpenFlow for example, or the Big Network Controller from Big Switch Networks, allowing more flexibility and greater uniformity (one can dream).

Northbound API

The northbound API on a software-defined networking (SDN) controller enables applications and orchestration systems to program the network and request services from it. This is what the non-network vendors will use to integrate with your SDN, the problem today is that the service is not standardised (yet?), meaning that it is less open than we want it to be.

Is SDN the same as network virtualisation?

Network virtualisation adds a layer of abstraction (like all virtualisation) to the network, often using tunnelling or an overlay network across the existing physical network. Nicira uses STT, VMware already had VXLAN, Microsoft uses NVGRE, … I would argue that network virtualisation is often an underlying part of SDN.

Software Defined Storage (SDS)

In the world of storage we are also exposed to software defined: a lot of storage start-ups are using the SDS messaging to combat existing (or legacy, as the start-ups would prefer) storage vendors and to claim they have something new and improved. This, in my humble opinion, is not always warranted.

If you define SDS like SDN, whereby the control plane is separated from the data plane, this is enabled by the lower-level storage systems abstracting their physical resources into software. The same reasons prevail: dynamism, flexibility, more control,… These abstracted storage resources are then presented up to a control plane as “software-defined” services. The exposure and management of these services is done through an orchestration layer (like the northbound API in the SDN world). The quality and quantity of these services depends on the virtualisation and automation capabilities of the underlying hardware (is exposing APIs enough?).

Some would argue that because of the existing architectures of legacy storage systems this becomes more cumbersome and less flexible compared to the new start-up SDS players. Just as you have new players in SDN (Arista (even though they don’t seem to like the SDN terminology very much), Plexxi,…) baking these technologies in from the ground up, you have the same with storage vendors, but I would argue that the rate of innovation seems to be much higher here. A lot of new storage vendors (ExaBlox, Tintri, PureStorage, Nimble,…), a lot of new architectures (Fusion-io, PernixData, SanDisk FlashSoft,…), and a lot of acquisitions of flash-based systems by legacy vendors mean that I don’t believe “legacy” storage vendors are going the way of the dinosaur just yet. I do however think it will lead to a lot of confusion, like software-only storage suddenly being SDS, etc.

Is SDS the same as storage virtualisation?

Like network virtualisation in SDN, storage virtualisation in SDS can play its part. Storage virtualisation is another abstraction between the server and the storage array; one such abstraction can be achieved by implementing a storage hypervisor. The storage hypervisor can aggregate multiple different arrays, of different vendors, or maybe even generic JBODs. The storage hypervisor tends not to use (at least not always) most of the capabilities of the arrays, instead treating them as generic storage. DataCore, for example, sells a storage hypervisor, and so did Virsto, which was acquired by VMware.

In a more traditional sense the IBM SVC, NetApp V-series, EMC VPLEX can be considered storage virtualisation, or more accurately storage federation. And then you have logical volume managers, LUNs, RAID sets, all abstraction, all “virtualisation”… so a lot of FUD will be incoming.

Is it all hype?

Of course not. Some of the messaging might be confusing, and some vendors like to claim they are part of the latest trend without much to show for it, but the industry is moving, fast, adding functionality to legacy systems and building new architectures to deliver (at least partially) on the promise of better. But as always, there is a lot of misinformation about certain capabilities in certain products, maybe a little too much talking and not enough delivering. I expect a great deal of consolidation in the next few years, both of companies and terminology, so look carefully at who is doing what and how this matches your company's strategy going forward. Exciting times ahead though.

The RFP process is broken

What is the RFP process?

The purpose of the Request For Proposal (RFP) is to smooth out any vendor bias and get a real point-by-point comparison between the solutions/proposals of multiple parties, in order to reach the best proposal to fill a specific need.

Why is it broken?

This often leads to a strangely worded document, one that lacks room for interpretation in some sections and leaves broad room in others, and that will not result in an innovative, cost-effective, long-term (quick, what’s another buzzword?) solution that is in the customer’s best interest.

The customer often has some idea of what he (thinks he) needs, sometimes based on past experiences and this shines through in the RFP document in such a way that any creativity on the vendor’s end is pointless.

I need X amount of IOPS, I need 64kB length block dedupe, I need SSL offloading in hardware, I need X amount of throughput, it needs to have a mermaid logo on it,…

“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.”

Stephen W. Hawking

I am not saying the customer is stupid, far from it, but he is limiting what we can offer because of his narrow lens. When you force us to answer your questions, and only those, it dilutes our differentiation and you end up with the same old stuff you hate using today.

The vendors are more than happy to influence the RFP; most RFPs I’ve read (and had to answer) contained at least some wording directly taken from competitors’ documents. If you can help write the rules, it suddenly becomes a lot easier to win the game. I’m not saying all of them are rigged, but some are. When reading through some RFPs it’s sometimes clear who will win it beforehand. I believe you will find you will be getting fewer and fewer RFP responses going forward if you play this game*

More often than not, we are not allowed to have a conversation with the customer once the RFP has been received; this leads to us trying to interpret a document whose context we don’t always fully grasp. Furthermore, you often have to provide a response in a limited space/format that makes a lot of assumptions about what the possible answer could be.

We often have no clue about the budget. Does the customer want an Aston Martin or a Volkswagen? (They all want the Aston, at the Volkswagen price, of course.)

I also want to end world hunger, but secretly I don’t have the budget for it.

“You know we’re sitting on four million pounds of fuel, one nuclear weapon and a thing that has 270,000 moving parts built by the lowest bidder. Makes you feel good, doesn’t it?”

Steve Buscemi’s character in Armageddon

Then at the end of it, the requesting party scores the responses and picks the best one based on the weight of certain sections (is price most important, or compliance with the often arbitrary features?). This scoring is based on the interpretation of the document, which often made the vendor word things a certain way, and in a certain space, to force them to fit in. The customer takes these answers and compares them based on the same biases that went into drafting the RFP.

What good does it do?

In a bigger organization it can be used as a tool to get buy-in from all the parties that have a stake in the solution. It gets the noses pointed in the same direction and avoids territorial battles and finger-pointing after the solution has been purchased.

This is fine of course, but don’t use the RFP process for this, don’t externalize your own lack of coordination and communication by forcing it onto an RFP.

Not all RFPs are created equal

I’ve answered many, many RFPs and some are a lot better (yay for you vendors! :rolleyes, I see you think) than others.

When your RFP is non-prescriptive, provides a more open format for crafting responses, includes the ability to ask questions, gives us an insight into the business issue instead of the technical issue you are trying to solve (so we can put forward our best thinking to help the business, and are not forced to jump through hoops to follow someone else’s), and some sense of budget, it goes a long way into getting a better suited response.

Also don’t send out your RFP just before the holidays, that’s just mean 🙂

*Death by RFP: 7 Reasons Not To Respond

SSL Acceleration

One of the prerequisites for WAN optimization is that the traffic we are attempting to de-duplicate across the WAN is not encrypted; we need “clear-text” data in order to find data patterns so that de-duplication works as well as possible.

But Steelhead can optimize SSL encrypted data by applying the same optimization methods, while still maintaining end-to-end security and keeping the trust model intact.

To better understand how we perform SSL optimization let’s first look at a simple example of requesting a secured webpage from a webserver.

[Figure: SSL handshake between a client and a web server]

The encryption using a private key/public key pair ensures that data encrypted with one key can only be decrypted with the other key of the pair. The trick in a key pair is to keep one key secret (the private key) and to distribute the other key (the public key) to everybody.
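
A small Python sketch of that property, using the third-party cryptography package (the message and the padding choices are just for illustration; in a real SSL/TLS handshake the asymmetric step is only used to establish identity and agree on a symmetric session key, as described below):

# pip install cryptography
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()     # this half you hand out to everybody

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = public_key.encrypt(b"hello, secure world", oaep)   # anyone can encrypt
plaintext = private_key.decrypt(ciphertext, oaep)               # only the private key holder can decrypt
print(plaintext)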

All of the same optimizations that are applied to normal, non-encrypted TCP traffic can also be applied to encrypted SSL traffic. Steelhead appliances accomplish this without compromising end-to-end security and the established trust model. Your private keys remain in the data center and are not exposed in the remote branch office location where they might be compromised.

The Riverbed SSL solution starts with Steelhead appliances that have a configured trust relationship, enabling them to exchange information securely over their own dedicated SSL connection. Each client uses unchanged server addresses and each server uses unchanged client addresses; no application changes or explicit proxy configuration is required. Riverbed uses a unique technique to split the SSL handshake.

The handshake (the sequence depicted above) is the sequence of message exchanges at the start of an SSL connection. In an ordinary SSL handshake, the client and server first establish identity using public-key cryptography, and then negotiate a symmetric session key to use for data transfer. When you use Riverbed’s SSL acceleration, the initial SSL message exchanges take place between the web browser and the server-side Steelhead appliance. At a high level, Steelhead appliances terminate an SSL connection by making the client think it is talking to the server and making the server think it is talking to the client. In fact, the client is talking securely to the Steelhead appliances. You do this by configuring the server-side Steelhead appliance to include proxy certificates and private keys for the purpose of emulating the server.

[Figure: split SSL handshake with client-side and server-side Steelhead appliances]

When the Steelhead appliance poses as the server, there does not need to be any change to either the client or the server. The security model is not compromised—the optimized SSL connection continues to guarantee server-side authentication, and prevents eavesdropping and tampering. The connection between the two Steelheads is secured by the use of secure peering (a separate SSL tunnel running between the two appliances).

Citrix HDX and WAN optimization (part 1)

Introduction

In this first of a two-part blog post about Citrix HDX I want to explore the impact of HDX on the Wide Area Network; part one will serve as the introduction, and in part two I will test run some of the scenarios described in part one.

HDX came to be because Citrix was finally getting competitive pressure on its Independent Computing Architecture (ICA) protocol from Microsoft with RDP version 7 and beyond and Teradici/VMware with PCoIP. (And arguably other protocols like Quest EOP Xstream, HP RGS, RedHat SPICE, etc.)

Citrix’s reaction to these competitive pressures has been to elevate the conversation above the protocol, stating that a great user experience is more than just a protocol, thus Citrix created the HDX brand to discuss all the elements in addition to ICA that Citrix claims allow it to deliver the best user experience.

HDX Brands

HDX is not a feature or a technology — it is a brand.

Short for “High Definition user eXperience,” HDX is the umbrella term that encapsulates several different Citrix technologies. Citrix has created HDX sub-brands; these are listed below, and each brand represents a variety of technologies:

  • HDX Broadcast (ICA)
    • Capabilities for providing virtual desktops and applications over any network. This is the underlying transport for many of the other HDX technologies; it includes instant mouse click feedback, keystroke latency reduction, multi-level compression, session reliability, queuing and tossing.
  • HDX MediaStream
    • Capabilities for multimedia such as sound and video, using HDX Broadcast as its base, including client-side rendering (streaming the content to the local client device for playing via local codecs, with seamless embedding into the remote session).
    • Flash redirection (Flash v2), Windows Media redirection.
  • HDX Realtime
    • Capabilities for real-time communications such as voice and web cameras, using HDX Broadcast as its base; it includes EasyCall (VoIP integration) and bi-directional audio functionality.
  • HDX SmartAccess
    • Refers mainly to the Citrix Access Gateway (SSL VPN) and cloud gateway components for single sign-on.
  • HDX RichGraphics  (incl 3D, 3D PRO, and GDI+ remoting)
    • Capabilities for remoting high-end graphics using HDX Broadcast as its base; uses image acceleration and progressive display for graphically intense images (formerly known as Project Apollo).
  • HDX Plug-n-Play
    • Capabilities to provide connectivity for local devices and applications in a virtualized environment, including USB, multi-monitor support, smart card support, special folder redirection, universal printing, and file-type associations.
  • HDX WAN Optimization
    • Capabilities to locally cache bandwidth-intensive data and graphics and to locally stage streamed applications (formerly known as IntelliCache, relying mostly on their Branch Repeater product line).
  • HDX Adaptive Orchestration
    • Capabilities that enable seamless interaction between the HDX technology categories. The central concept is that all these components work adaptively to tune the unified HDX offering for the best possible user experience.

The goal of this post is to provide an overview of these HDX sub-brands and technologies that directly relate to the network, and WAN optimization, in order to have a clearer understanding of marketing vs. technology impact.

Not every HDX feature is available on both XenApp and XenDesktop (and now also VDI-in-a-Box after the acquisition of Kaviza); the table below shows the feature matrix for both:

[Table: HDX feature matrix for XenApp and XenDesktop]

HDX and the network

As stated before most of the HDX technologies are either existing ICA components or rely on ICA (HDX Broadcast) as a remoting protocol. As such we should be able to (WAN) optimize most of the content within HDX one way or another.

HDX MediaStream

HDX MediaStream is used to optimize the delivery of multimedia content, it interacts with the Citrix Receiver (ICA Client) to determine the optimal rendering location (see overview picture below) for Windows Media and Flash content.

Within HDX MediaStream the process of obtaining the multimedia content and displaying the multimedia content are referenced by the terms fetching and rendering respectively.

Within HDX MediaStream, fetching the content is the process of obtaining or downloading the multimedia content from a location external to the virtual desktop (Internet, intranet, or file server (for WMV only)). Rendering utilizes resources on the machine to decompress and display the content within the virtual desktop. In a Citrix virtual desktop that is being accessed via Citrix Receiver, rendering of content can be executed by either the client or the hypervisor, depending on the policies and environmental resources available.

[Figure: fetching and rendering locations for multimedia content]

Adaptive Display (server-side rendering) provides the ability to fetch and render multimedia content on the virtual machine running in the datacenter and send the rendered content over ICA to the client device. This translates to more bandwidth needed on the network than client-side rendering. However, in certain scenarios client-side rendering can use more bandwidth than server-side rendering; it is, after all, adaptive.

HDX MediaStream Windows Media Redirection (client side rendering) provides the ability to fetch Windows Media content (inclusive of WMV, DivX, MPEG, etc.) on the server and render the content within the virtual desktop by utilizing the resources on the client hosting Citrix Receiver (Windows or Linux). When Windows Media Redirection is enabled via Citrix policy, Windows video content is sent to the client through an ICA Virtual Channel in its native, compressed format for optimal performance. The processing capability of the client is then utilized to deliver smooth video playback while offloading the server to maximize server scalability. Since the data is sent in its native compressed format this should result in less bandwidth needed on the network than server side rendering.

HDX MediaStream Flash Redirection  (client side rendering) provides the ability to harness the bandwidth and processing capability of the client to fetch and render Flash content. By utilizing Internet Explorer API hooks, Citrix Receiver is able to securely capture the content request within the virtual desktop and render the Flash data stream directly on the client machine. Added benefits include increased server hypervisor scalability as the servers are no longer responsible for processing and delivering Flash multimedia to the client.

This usually decreases the WAN bandwidth requirements by 2 to 4 times compared to Adaptive Display (server-side rendering).

HDX MediaStream network considerations

In some cases, Windows Media Redirection (client-side rendering of the video) can use significantly more bandwidth than Adaptive Display (server-side rendering of the video).

In the case of low bit rate videos, Adaptive Display may utilize more bandwidth than the native bitrate of the Windows Media content. This extra usage of bandwidth actually occurs since full screen updates are being sent across the connection rather than the actual raw video content.

Packet loss over the WAN connection is the most restricting aspect of an enhanced end-user experience for HDX MediaStream.

Citrix Consulting Solutions recommends Windows Media Redirection (client-side rendering) for WAN connections with a packet loss less than 0.5%.

Windows Media Redirection requires enough available bandwidth to accommodate the video bit rate. This can be controlled using SmartRendering thresholds. SmartRendering controls when the video reverts back to server-side rendering because the bandwidth is not available; Citrix recommends setting the threshold to 8 Mbps.

WAN optimization should provide the most benefits when the video is rendered on the client, since the data stream for the compressed Windows Media content is similar between client devices: once the video has been viewed by one person in the branch, very little bandwidth is consumed when other workers view the same video.

HDX RichGraphics 3D Pro

HDX 3D Pro can be used to deliver any application that is compatible with the supported host operating systems, but is particularly suitable for use with DirectX and OpenGL-driven applications, and with rich media such as video.

The computer hosting the application can be either a physical machine or a XenServer VM with Multi-GPU Passthrough. The Multi-GPU Passthrough feature is available with Citrix XenServer 6.0.

For CPU-based compression, including lossless compression, HDX 3D Pro supports any display adapter on the host computer that is compatible with the application that you are delivering. To use GPU-based deep compression, HDX 3D Pro requires that the computer hosting the application is equipped with an NVIDIA CUDA-enabled GPU and NVIDIA CUDA 2.1 or later display drivers. For optimum performance, Citrix recommends using a GPU with at least 128 parallel CUDA cores for single-monitor access.

To access desktops or applications delivered with XenDesktop and HDX 3D Pro, users must install Citrix Receiver. GPU-based deep compression is only available with the latest versions of Citrix Receiver for Windows and Citrix Receiver for Linux.

HDX 3D Pro supports all monitor resolutions that are supported by the GPU on the host computer. However, for optimum performance with the minimum recommended user device and GPU specifications, Citrix recommends maximum monitor resolutions for users’ devices of 1920 x 1200 pixels for LAN connections and 1280 x 1024 pixels for WAN connections.

Users’ devices do not need a dedicated GPU to access desktops or applications delivered with HDX 3D Pro.

HDX 3D Pro includes an image quality configuration tool that enables users to adjust in real time the balance between image quality and responsiveness to optimize their use of the available bandwidth.

HDX RichGraphics 3D Pro network considerations

HDX 3D Pro has significant bandwidth requirements depending on the encoding used (NVIDIA CUDA encoding, CPU encoding, and lossless).

When supported NVIDIA chipsets are utilized, HDX 3D Pro offers the ability to compress the ICA session into a video stream. This significantly reduces bandwidth and CPU usage on both ends by utilizing NVIDIA CUDA-based deep compression. If an NVIDIA GPU is not present to provide compression, the server CPU can be utilized to compress the ICA stream; this method, however, does introduce a significant impact on CPU utilization. The highest-quality method for delivering a 3D-capable desktop is the lossless option. As the name states, no lossy compression of the ICA stream occurs, allowing pixel-perfect images to be delivered to the endpoint. This option is available for delivering medical imaging software that cannot have degraded image quality. This level of high-quality imaging does come at the price of very high bandwidth requirements.

HDX RichGraphics GDI and GDI+ remoting

GDI (Graphics Device Interface) and GDI+ remoting allows Microsoft Office specifically (although other apps, like WordPad, also use GDI) to be remoted to the client using native graphics commands instead of bitmaps. By using native graphics commands, it saves server-side CPU, saves network bandwidth, and eliminates visual artifacts since the output doesn’t need to be compressed using image compression.

General network factors for Remoting protocols (including RDP/RemoteFX, ICA, PCoIP, Quest EoP,…)

  • Bandwidth – the protocols take all they can get; 2 Mbps is required for a decent user experience (see planning bandwidth requirements below)
  • Latency – at 50ms things start getting tough (sometimes even at 20ms)
  • Packet loss – should stay under 1%

Planning bandwidth requirements for HDX (XenDesktop example)

Citrix publishes the numbers below for a medium (user load) environment; this gives some indication of what to expect in terms of network sizing.

  • MS Office-based: 43 Kbps
  • Internet: 85 Kbps
  • Printing (5 MB Word doc): 555-593 Kbps
  • Flash video (server rendered): 174 Kbps
  • Standard WMV video (client rendered): 464 Kbps
  • HD WMV video (client rendered): 1812 Kbps

These are estimates. If a user watches a WMV HD video with a bit rate of 6.5 Mbps, that user will require a network link with at least that much bandwidth. In addition to the WMV video, the link must also be able to support the other user activities happening at the same time.

Also, if multiple users are expected to be accessing the same type of content (videos, web pages, documents, etc.), integrating WAN Optimization into the architecture can drastically reduce the amount of bandwidth consumed. However, the amount of benefit is based on the level of repetition between users.
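
As a very rough sizing sketch in Python, using the Citrix per-activity estimates above (the activity mix is a made-up example and ignores concurrency peaks and the WAN optimization savings just mentioned):

ESTIMATES_KBPS = {
    "office": 43,
    "internet": 85,
    "printing": 574,                  # midpoint of the 555-593 Kbps range
    "flash_server_rendered": 174,
    "wmv_client_rendered": 464,
    "hd_wmv_client_rendered": 1812,
}

def branch_link_kbps(user_activities):
    """Sum the estimated bandwidth for each concurrently active user."""
    return sum(ESTIMATES_KBPS[activity] for activity in user_activities)

# 10 users in Office, 5 browsing, 2 watching a standard WMV video at the same time:
mix = ["office"] * 10 + ["internet"] * 5 + ["wmv_client_rendered"] * 2
print(f"~{branch_link_kbps(mix) / 1000:.1f} Mbps")   # about 1.8 Mbps before any headroom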

Note: Riverbed Steelhead can optimize ICA/HDX traffic extremely well; we even support the newer multi-stream ICA protocol. In part 2 of this blog I will demonstrate the effectiveness of Steelhead on HDX traffic and talk about our Citrix-specific optimizations, like our very effective Citrix QoS. Riverbed Steelheads also have the ability to decode the ICA Priority Packet Tagging that identifies the virtual channel from which each Citrix ICA packet originated. As part of this capability, Riverbed specifically developed a packet-order queuing discipline that respects the ordering of ICA packets within a flow, even when different packets from a given flow are classified by Citrix into different ICA virtual channels. This allows the Steelhead to deliver very granular Quality of Service (QoS) enforcement based on the virtual channel in which the ICA data is transmitted. Most importantly, this feature prevents any possibility of out-of-order packet delivery as a result of Riverbed’s QoS enforcement; out-of-order packet delivery would cause significant degradation in performance and responsiveness for the Citrix ICA user. Riverbed’s packet-order queuing capability is patent-pending, and not available from any other WAN optimization vendor.

Real world impact can be seen in the picture below of a customer saving 14GB of ICA traffic over a transatlantic link every month.

[Figure: ICA traffic savings for a customer over a transatlantic link]

Name a famous Belgian

As a Belgian working for an American multinational company I often, mockingly, get asked to name 10 famous Belgians, and I must admit we Belgians are not that good at self-promotion it seems (too busy making beer, chocolate, waffles and “french fries”).

When thinking about the Internet, and more specifically the World Wide Web, the Internet’s first killer application if you will, no Belgians spring to mind either.

I give you Robert Cailliau, who together with the better-known Tim Berners-Lee made the WWW a reality.

Another key application that relies on the Internet to make it real is Software as a Service (SaaS). For SaaS applications to work, the end user's communication needs to travel the path between your location and the SaaS provider, and this path is the Internet.
So how does the Internet make sure you get fast access to the server where your SaaS application is running?

The Internet relies on a routing protocol (BGP) to get your request to the SaaS provider. BGP is used between ISPs as an interconnect to somewhat reliably tie all these separate networks together (the routing protocol used within each ISP's autonomous network does not need to be BGP) so your packet gets where it needs to be.

Another well-kept secret, like Robert Cailliau, is that the Border Gateway Protocol (BGP) was not really designed to give you the fastest route between all these autonomous systems; how could it, since the Internet belongs to no one and everyone(’s request) should be treated equally (feel free to mentally picture a Guy Fawkes mask here).

This is why, in order to provide a more predictable performance across the Internet, you need a solution like Steelhead Cloud Accelerator to:

  • A: Give you the fastest path across the Internet (using Akamai SureRoute)
  • B: Minimize the application and data overhead (using Steelhead transport, application, and data streamlining)

Your Salesforce.com, only faster.

As mentioned in my previous post, Riverbed has a joint SaaS optimization solution with Akamai called Steelhead Cloud Accelerator. In this blog post I will show you how to use this technology to accelerate your salesforce (both the people and the application).

The picture below is a diagram of the lab environment I’ll be using for this setup.

The lab uses a WAN simulator so we can simulate a transatlantic link towards Salesforce.com. For this simulation I have set the link to 200 ms latency and 512 Kbps.

For the Steelhead cloud functionality you need a specific firmware image, freely available to our customers on http://support.riverbed.com; you can recognize it by the -sca at the end of the version number (right-hand corner in the screenshot below).

Once you are using this firmware you get an additional option under Configure –> Optimization, called Cloud Accelerator (see screenshot above).

Here you can register the Steelhead in our cloud portal (which is running as a public cloud service itself, running on Amazon Web Services). You can also enable one or more of our currently supported SaaS applications (Google Apps, Salesforce.com, and Office 365).

When you register the appliance on the Riverbed Cloud Portal you need to grant the appliance cloud service to enable it.

Once the appliance is granted service, the status on the Steelhead itself will change to “service ready”.

So let’s first look at the unoptimized version of our SaaS application. As you can see in the screenshot below I have disabled the Steelhead optimization service so all connections towards Salesforce.com will be pass-through. You can also see the latency is 214ms on average and the bandwidth is 512Kbps.

I logged into Salesforce.com and am attempting to download a 24MB PowerPoint presentation, as you can see in the screenshot below this is estimated to take about 7 minutes to complete. Time for another nice unproductive cup of coffee…

If we now enable the optimization service on the Steelhead it will automatically detect that we are connecting to Salesforce.com and in conjunction with Akamai spin up a cloud Steelhead on the closest Akamai Edge Server next to the Salesforce.com datacenter I am currently using.

Looking at the current connections on the Steelhead you can see that my connections to Salesforce.com are now being symmetrically optimized by the Steelhead in the Lab and the Cloud Steelhead on the Akamai-ES.

Note the little lightning bolt in the notes section signifying that Cloud Acceleration is on.

Let’s attempt to download the presentation again.

Yeah, I think you could call that faster…

But that is not all, because we are using the same proven Steelhead technology including byte-level deduplication I can edit the PowerPoint file and upload it back to salesforce.com with a minimum of data transfer across the cloud.

I edited the first slide by changing the title and subtitle and will upload the changed file to my SaaS application, notice that the filename itself is also changed.

Looking at the current connections on the Steelhead you can see I am uploading the file at the same breakneck speed since I only need to transfer the changed bytes.

So there you have it, Salesforce.com at lightning speeds!

NOTE: I have not mentioned the SSL-based configuration needed to allow us to optimize HTTPS-based SaaS applications (as all of them are); I will cover this in a later post.

Voyager 1 is heading for the stars, talk about latency!

It has now been 35 years since Voyager 1's launch to Jupiter and Saturn; sooner or later, the workhorse spacecraft will bid adieu to the solar system and enter a new realm of space, the first time a manmade object will have escaped to the other side.

Voyager 1 is currently more than 11 billion miles from the sun. Its twin Voyager 2, which celebrated its launch anniversary 3 weeks ago, trails behind at 9 billion miles from the sun.

There are no full-time scientists left on the mission, but 20 part-timers analyze the data streamed back. Since the spacecraft are so far out, it takes 17 hours for a radio signal from Voyager 1 to travel to Earth. For Voyager 2, it takes about 13 hours. This means Voyager 1 has a round-trip latency of about 34 hours: if you send a command it takes 34 hours for the acknowledgement to get back, and that's a long time to find out whether you pressed the wrong button.

We at Riverbed deal with latency issues all the time, in an Enterprise setting we don’t tend to see round trip times of 34 hours but we do see the impact of even acceptable levels of latency on business applications every day.

Since there is a physical distance between two points, let's say your datacenter is in Belgium and your branch office is in New York, we need to cross a distance of about 4000 miles. Assuming this gives us a one-way latency of 100 ms*, i.e. 0.2 seconds round trip, it takes 0.2 seconds for your keystroke to be acknowledged between your branch and your datacenter, even at the speed of light.

* the speed of light over 4000 miles plus the added delay of network topology and the constraints of the actual cable (light does not actually travel at full light speed over a fiber cable because of refraction etc.; only in a vacuum does it reach 299,792,458 m/s, and electrical signals also only approximately reach the speed of light, even in theory)

The bad news is that a lot of applications compound this problem by needing a lot of round trips across the network to work, even before sending actual data. If you let these applications create all those network messages your users will start to feel like they are waiting 34 hours for your applications to get going (open your VPN to your remote company file share and start browsing directories, utter madness).

So what we do is avoid a lot of these round trips coming from the application so that the impact is greatly reduced. Let's say that instead of needing 300 round trips to get going you can do the same with 20: that's a 4-second wait instead of a minute. Utter bliss.
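
The arithmetic behind that claim, in Python, using the 0.2-second round trip from the Belgium-New York example above:

ROUND_TRIP_S = 0.2

def chattiness_penalty(round_trips):
    """Time spent purely waiting on the network before the app 'gets going'."""
    return round_trips * ROUND_TRIP_S

print(chattiness_penalty(300))   # 60.0 seconds -> a minute of waiting
print(chattiness_penalty(20))    # 4.0 seconds after cutting out the chatter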

See how I’ve cleverly avoided mentioning bandwidth? That’s because with application chattiness issues it generally doesn’t play a role: the application overhead messages are small enough to fit even the tiniest link, so latency is the secret application killer here.