Metrics in Distributed Systems

16 July 2019 devTalks

Would you like to find out more about “Metrics in Distributed Systems”? Idan Tovi, Head of SRE at PayU,  shares with us more about it!

“Collecting metrics in a distributed system can be a real challenge, especially if the system wasn’t designed in advance. The number of different repositories and technologies, the way to collect the metrics, the number of different metrics, dashboards and alert creation are just some of the challenges.

Still, we managed to overcome those challenges in less than 3 months! In this article I will explain how.

How it All Started

The system we had in place served our company’s customers well, but had started to become overloaded. So we decided to build a new system, starting with a minimum set of principles:

  • API first
  • SMACK (Spark, Mesos, Akka, Cassandra and Kafka)
  • Full automation
  • Horizontal scale
  • Small teams

The advantage of this approach that we now had a team of 20 talented engineers searching/investigating/uncovering the microservices landscape in a high pace. The downside, however, was that we didn’t have any standardization regarding observability. So we knew we needed monitoring tools. We already had logs, but quickly we understood that we needed another level of observability except for that. So we decided to focus on metrics.

The Initial Approach

The technology stack we decided to use was Sysdig, which is a very nice commercial metrics solution, together with an open source package written by me, and oh what a bad package it is… 🙂 The package collected too many metrics and also caused a nice memory leak.

The fact that everyone on the engineering team wanted custom metrics for their own service, led to a huge number of metrics. This caused massive load on the Sysdig system which was no longer responsive and eventually not useful. Besides the system was still in its early stages with no real customers so who needs metrics anyway right?!? Logs were enough for now, so we accepted the downside of using Sysdig.

Taking it a Step Further

A year later the system started to take shape. The traffic started to shift from the legacy system and we kept adding more and more microservices. Just to give you a sense of what our stack included back then: DC/OS, LinkerD, Kafka, Cassandra, Druid, Spark, Elastic Search, Consul, Vault, 4 different programming languages and everything was dockerised and based on AWS. At this stage, our Infra team felt like they must have some better way to monitor this growing stack and they decided to give metrics a second chance.

This time we decided to go with InfluxDB. We started to collect the infrastructure metrics and then asked the developers to add some services metrics  However, Influx didn’t take it well. When we started to add the service metrics, Influx doesn’t handle large numbers of time series well.

Still, we weren’t yet using the full potential of the system and we had a limited number of services, so by making a couple of improvements to the application logs we didn’t feel the lack of metrics and Influx gave us some mid term solution for the infrastructure. We knew this could not scale but had more urgent things to handle.

And indeed, as another year passed, we started to onboard larger customers. The load on the system was growing fast and logs were no longer enough.

Giving Metrics Another Try

As always, first we chose the technologies.

We knew Prometheus become part of the CNCF (Cloud Native Computing Foundation) and because our system is a cloud-native one we thought we should give it a try. As part of this choice we had to re-write our open source package, so we decided to write it from scratch and learn from our mistakes. We decided to expose by default much fewer metrics and only valuable ones. Then we thought about what else we should do different, in order to succeed this time?!? And then we came up with the most important piece of the puzzle – we set up the SRE team, which took responsibility for the challenge. So for the first time, we had a team that was accountable for the observability of the system.

But how should we handle more than 200 different services and a growing tech stack???

The approach we took, is to divide the services into groups. So we started to think about how to group our services,  by summarizing the 3 most common practices of what you should monitor:

  • RED (Rate, Erros, Duration) – which is more focused on the application
  • USE – Utilization, Saturation, Error – which is a better fit for infrastructure
  • 4 Goldan signals (Latency, Traffic, Errors, and Saturation) 

We figured out that our services can be categorized by their functionality. So we defined our groups accordingly. Each group contains different metrics that we should monitor. In addition, each service can be part of more than one group; for example, a service with an API that also produces events to Kafka and is written in Scala, will be part of the 3 groups.

Of course, this approach is flexible and we can always add more and more groups as we grow and as our tech stack evolves.

The groups we choose to start with are:

  • General metrics like memory and CPU usage
  • Programming language specific like Node.js event look, JVM metrics and Golang goroutines.
  • Message broker Consumers
  • Message broker Producers


Taking an ‘Everything as Code’ Approach

We love think of everything as code. We automate everything, even our documentation site is automated (not just the deployment but even some of the content is automatically generated).

We decided to take the same approach with our dashboards and alerts. This is crucial because it helps developers embed this step in their pipeline as part of the provisioning. It also makes things deterministic, since we will always create the same alerts/dashboards for every service within the group.

In addition, it makes our dashboards and alerts reproducible in case of a major incident to our monitoring system. And last but not least, it makes collaboration easy: one of our biggest lessons in how to not block your engineering team from scaling up either in number of people or number of technologies is to make it easy to collaborate – so every engineer is more than welcome to add more or improve an existing group.

The Netflix Paved Road Culture

So far, we have technology, a team, a good approach  for grouping our services, a clear understanding of what we should monitor for each group and an easy way to start monitoring and contributing. But how should we bring this all together for our engineers.

Netflix Paved Road culture was the answer to this question. In a nutshell, it describes the relationship between Netflix teams (consumers and producers). The idea is to give freedom to the engineers, but also help them to focus on their main business logic.

We thus decided to build three metrics packages to collect metrics very easily from our application and we also created default panels for each group mentioned earlier. Those panels can be used via a simple CLI tool which is also packed in a docker container and makes it very easy to add a step in every service pipeline to add dashboards in Grafana and Alerts in the AlertManager. It actually can take less than 10 minutes to add and provision metrics, dashboards and alerts for a service.

In addition, from now on every new technology we will use that fits into an existing group has minimum requirements regarding the metrics it should expose. If there’s a technology which has no suitable group, everyone knows how to add it to our metrics ecosystem.

So far so good for application metrics, but what should we do with an infrastructure that is more unique and harder to divide into groups The answer is the same as we did with our services – EaC is the guiding principle. We added an automated way to upload full dashboards as part of the infrastructure provisioning/configuration pipeline (the same CLI tool for the apps). The main difference in this case, is that we upload a full dashboard.

Using those two main approaches and the right technologies, we managed to resolve a lack of observability of a pretty large technology stack and a lot of different services in less than 3 months, but obviously we weren’t able to complete that without learning our lessons from our previous failures.

In Summary

So to recap, those are the main takeaways I think everyone should take from our experience:

  1. As everything in a distributed system, it is always harder than with a monolith.
  2. If you can, design your metrics from the beginning. It is important to have metrics for observability and production debugging, but also in order to know you meet your SLOs.
  3. Either way, if you are closing the gaps like us or designing from the beginning, EaC and the Paved road are amazing principles and in my opinion, the only way to scale either your system and your engineering.
  4. Choose the right technologies for your type of system. Prometheus is amazing for cloud native distributed system, but there are a lot of great solutions out there that might fit you system better.
  5. Make someone responsible and accountable for your system production observability, otherwise it will pushed back in priority until it becomes critical and critical can be too late.
  6. Use the community, there is a huge amount of knowledge already built in the community – SRE is not something specific to Google anymore.
  7. And never give up – as we believe “fail fast, learn fast”.”