Having just completed a survey of vendor functionality in the hybrid infrastructure management market, I found it apparent that the ability of IT operations to provide genuine end-to-end, performance-based SLAs in a multi-cloud and hybrid-cloud architecture is still seriously impaired by the lack of visibility into the network performance of the major cloud and internet service providers.
For the second year running, Thousand Eyes have published their Cloud Performance Benchmark. This is a fascinating and must-read report for those of you using Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), IBM Cloud or Alibaba Cloud to run important operational applications. It covers measurements between end users and the cloud, vendor-specific performance between availability zones (AZs) and between regions, and network performance between the cloud vendors. Capturing data over the course of a month and sampling the network at 10-minute intervals, Thousand Eyes ended up with 320 million data points for analysis. It is the depth and scope of the data captured that makes this report unique and important.
While overall performance figures for latency, jitter and packet loss are generally good, there were a number of anomalies where individual cloud providers in specific regions seemed to under-perform compared to their peers. The strength of the Thousand Eyes platform is in being able to dig deep into network traffic, show the paths that are being used and identify specific causes. Thus, the greater latency experienced on GCP traffic from Europe to Mumbai in India turned out to be caused by a lack of direct links on its core backbone network, resulting in traffic being routed west across the Atlantic, then across the Pacific and Southeast Asia before reaching Mumbai.
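For readers less familiar with the three metrics mentioned above, here is a minimal sketch of how they could be derived from a series of probe results. It is not the method used in the report: the function name `summarize`, the input format (round-trip times in milliseconds, with `None` for a lost probe) and the choice of jitter as the mean absolute difference between consecutive received samples are all assumptions made for illustration.

```python
from statistics import mean


def summarize(samples):
    """Summarize one network path from a list of probe results.

    samples: round-trip times in milliseconds, with None marking a lost probe.
    (Hypothetical input format, chosen only for this illustration.)
    """
    if not samples:
        raise ValueError("at least one sample is required")

    # Probes that actually came back.
    received = [s for s in samples if s is not None]

    # Packet loss: share of probes that never returned, as a percentage.
    loss_pct = 100.0 * (len(samples) - len(received)) / len(samples)

    # Latency: average round-trip time of the probes that returned.
    latency_ms = mean(received) if received else None

    # Jitter (assumed definition): mean absolute change between
    # consecutive received samples.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(diffs) if diffs else 0.0

    return {"latency_ms": latency_ms, "jitter_ms": jitter_ms, "loss_pct": loss_pct}


if __name__ == "__main__":
    # Four probes on one path, one of them lost.
    print(summarize([100.0, 110.0, None, 90.0]))
```

Run per path and per provider, a summary like this makes it easier to see where one region stands out from its peers, although it says nothing about why. That is where path-level visibility comes in.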
Intriguingly, AWS relies much more heavily on the public internet than its main competitors. As an AWS customer your traffic will be forced to travel further on the public internet before it accesses the core backbone network. This doesn’t necessarily add hugely to latency, but it does add levels of unpredictability into the network equation. Surprisingly, the chargeable AWS Global Accelerator service, designed to force traffic on to the AWS backbone sooner than the standard service, did not always result in latency improvements.
The way in which ISPs peer and route internet traffic also impacts overall performance, and this report highlights anomalous routing decisions by individual ISPs, such as passing internet traffic from California all the way to the US East Coast and back again to cloud provider regions on the West Coast.
For me, there are two key take-aways from reading this report. The first is that you cannot afford to make general assumptions about cloud provider performance when designing your global, or even regional, IT architecture. At the very least you should be using Thousand Eyes’ capabilities as a benchmarking tool. Secondly, I believe this report demonstrates the breadth, depth and ability of the underlying monitoring platform to provide much greater visibility into cloud networks and deliver greater assurance about performance than has been possible before.
We haven’t yet arrived at the point where a single monitoring platform can provide true, real-time, end-to-end application performance assurance. For a start, those cloud networks are multi-tenant and no cloud provider is going to offer an SLA with that level of assured performance. But the need for large numbers of siloed performance tool sets is much reduced. I would stick my neck out and say that you probably just need three: an Application Performance Management (APM) tool, a Hybrid Infrastructure Management/AIOps tool, and a tool like Thousand Eyes that can lift the lid and provide visibility into the cloud networking black box.