I was fortunate to be a delegate at Cloud Field Day 15. There were many discussions about the definition of cloud, multi-cloud, and even supercloud. We heard from six different vendors who provide solutions for almost every facet of cloud. I really enjoyed learning how Kentik addresses network observability in the cloud.
Observability refresher
Observability is an important cloud computing concept. TechTarget defines observability as “an element in control theory, which says that the internal states of IT systems can be deduced from the relationship between their inputs and outputs.”
Observability is not the same as monitoring. Monitoring depends on reviewing log information for anomalies, usually to verify whether a thing happened (or not). To find the root cause of an event, you’ll probably pore through lots of events from different monitoring streams. Observability looks at the outputs of a workflow (not just a system) to analyze the current state of that system.
There are three pillars of observability:
- Logs: Records of events, provided in human-readable formats.
- Metrics: Real-time operating data, or data measured over time. These are usually accessed via API.
- Traces: Representation of causally related distributed events (via O’Reilly). Traces provide an understanding of the entire lifecycle of a workflow.
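To make the three pillars a little more concrete, here is a minimal Python sketch of what each kind of record might look like. The class names, fields, and the `trace_path` helper are all hypothetical and chosen only for illustration; they are not taken from Kentik or any particular observability tool. The helper assumes each trace is a simple linear chain of spans.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEvent:
    """A log: a human-readable record of a single event."""
    timestamp: float
    message: str


@dataclass
class MetricPoint:
    """A metric: one numeric measurement taken at a point in time."""
    timestamp: float
    name: str
    value: float


@dataclass
class Span:
    """One step of a trace; parent_id links it to the step that caused it."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


def trace_path(spans: List[Span], trace_id: str) -> List[str]:
    """Follow parent links from the root span and return the services in order.

    Assumes (for illustration) that each span has at most one child.
    """
    by_parent = {s.parent_id: s for s in spans if s.trace_id == trace_id}
    path = []
    current = by_parent.get(None)  # the root span has no parent
    while current is not None:
        path.append(current.service)
        current = by_parent.get(current.span_id)
    return path


# Hypothetical spans, deliberately out of order, from two different traces.
spans = [
    Span("t1", "b", "a", "api", 40.0),
    Span("t1", "a", None, "frontend", 120.0),
    Span("t1", "c", "b", "database", 25.0),
    Span("t2", "x", None, "batch", 900.0),
]

print(trace_path(spans, "t1"))
```

The point of the sketch is the difference in shape: logs and metrics are standalone records, while the spans of a trace only make sense once you stitch them together by their causal links.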
From a cloud operations perspective, observability is critical to quickly find issues in a distributed application that could impact customers or company revenue.
Kentik pulls it all together for public cloud network observability
Kentik focuses on network observability.
They have a SaaS solution that analyzes telemetry and flow data from infrastructure network components (cloud, internet, and on-premises data center). This relieves companies from having to continually add new appliances to collect and analyze that data.
This is a great thing for customers: Kentik is storing and processing a lot of flow information.
They observe the state of your network, but they don’t do the remediation. However, they can connect to other applications like New Relic and Splunk to act when their observability infers a threat or issue.
Cloud economics is a cloud operations skill
As on-premises operators move into cloud operations roles, one important thing to understand is cloud economics. You have probably heard of #finops (or financial operations). This is the practice of keeping an organization within its cloud operations budget.
Observability is key for evaluating cloud economics. Kentik’s network observability can help answer questions about your network. For example:
- When will we be at capacity?
- What’s driving latency in a particular cloud region?
- What does distribution look like across regions?
- What should we be performance testing?
- Is the network the problem?
Kentik gathers cloud provider metrics to see how they impact performance. They use synthetic agents to test your infrastructure across the public infrastructure, and they have an open-source synthetic agent that you can use to test your private infrastructure. If you want a nice explanation of how this all works, check out this presentation:
They take 8M records and combine them into records with much more context that can get you beyond the “it’s slow” or “it’s not working” issues to understanding the underlying cause of performance issues. This network observability can even help you uncover hidden cloud costs.
Finding hidden cloud costs
How can operations help with hidden cloud costs? Mike Kygeris explained this. He told us about a customer who got a surprise AWS bill. They had migrated an application to the cloud, and it used S3 to store data. They decided to just send the data to S3 but didn’t consider that the data would have to go through a NAT gateway. The charge for that would have been about $150K a year.
But the traffic didn’t need to go through that gateway! During the migration, they should have sent the data through a VPC endpoint. Using observability to see the traffic patterns, they cut the transit costs and the NAT gateway cost.
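To get a feel for the arithmetic behind a bill like that, here is a rough back-of-the-envelope sketch in Python. The per-GB rates below are my assumptions for illustration, not figures from the presentation, so check current AWS pricing for your region. The sketch also assumes, purely to keep the math simple, that the whole $150K was NAT gateway data-processing charges; the function names are hypothetical.

```python
# Assumed rates for illustration only; verify against current AWS pricing.
ASSUMED_NAT_RATE_PER_GB = 0.045              # NAT gateway data processing
ASSUMED_GATEWAY_ENDPOINT_RATE_PER_GB = 0.0   # S3 gateway VPC endpoint


def annual_processing_cost(gb_per_month: float, rate_per_gb: float) -> float:
    """Yearly data-processing cost for a steady monthly traffic volume."""
    return gb_per_month * rate_per_gb * 12


def gb_per_month_for_annual_cost(annual_cost: float, rate_per_gb: float) -> float:
    """Monthly traffic volume that would produce the given yearly cost."""
    return annual_cost / 12 / rate_per_gb


# How much S3 traffic would it take to reach roughly $150K a year?
implied_gb = gb_per_month_for_annual_cost(150_000, ASSUMED_NAT_RATE_PER_GB)

via_nat = annual_processing_cost(implied_gb, ASSUMED_NAT_RATE_PER_GB)
via_endpoint = annual_processing_cost(implied_gb, ASSUMED_GATEWAY_ENDPOINT_RATE_PER_GB)

print(f"Implied traffic: {implied_gb / 1000:,.0f} TB per month")
print(f"Via NAT gateway: ${via_nat:,.0f} per year")
print(f"Via VPC endpoint: ${via_endpoint:,.0f} per year")
```

Under those assumptions, it takes a few hundred terabytes a month to reach that number, which is exactly the kind of traffic pattern that is easy to miss in a migration plan and easy to spot once you can see the flows.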
Understanding the best architecture can save money, and network observability can pinpoint when the architecture is not cost-optimized. Understanding this is an operations skill.
This presentation dove deep into these types of use cases.
Real Talk
Observability is critical to cloud operations. Network observability is vital when running any sort of distributed application. It helps answer questions, even the age-old question of “is it the network’s fault?”
But there is no way to manually evaluate all the telemetry from modern applications that observability requires: there is too much! Kentik is an interesting platform for network observability. They have a free trial to check out.
If you’re old-school ops like me, learning to use observability and becoming more fluent in cloud economics are good skills to have.