Partial trace sampling: a new approach to distributed trace sampling

Otmar Ertl

October 28, 2021

Dynatrace Research is developing a new way to achieve more accurate trace sampling.

Distributed traces are a key part of the accurate observability of distributed systems and microservices environments. However, traces need to be sampled to save on storage costs. Head-based and tail-based sampling methods have been used for many years, but they have limitations. This is why Dynatrace is researching a new approach, partial trace sampling, to fill the gaps left by those canonical sampling methods.

What is distributed tracing?

The complexity we have reached with distributed systems and microservice architectures makes observability a necessity to maintain healthy software. And next to metrics, events, and logs, distributed traces are an essential type of telemetry data that gives you a complete picture of your software environment and performance for end-to-end observability.

Distributed traces track and observe service requests as they flow through distributed systems from one service to another. Thanks to trace data, you can understand microservices environments in a way that is not possible manually. You can understand where failures happen and why.

Every “step” in a trace is referred to as a span. The “root span” is the first span in a trace, and a “child span” is a subsequent span.

In general, the more information you can collect and store about your system, the more accurate the results of its analysis are.

So, why do we need trace sampling?

Since storage and processing resources are not unlimited, sampling enables us to collect as much useful information as possible without storing too much.

But how do you choose which traces to sample?

There are two canonical ways to do this: head-based and tail-based.

Head-based trace sampling

In head-based sampling, the sampling decision is made randomly while the root span is being processed. It’s fast and simple to get up and running and has little impact on application performance. For some systems, random sampling can give sufficient visibility. For more complex systems, however, it can lead to poor coverage. For example, if the decision not to sample a trace is made at the start and the trace turns out to follow a rare path, that path goes unrecorded.
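To make the mechanism concrete, here is a minimal sketch of head-based sampling. The names `SpanContext` and `start_span` are made up for this illustration and do not come from any tracing library. The point it shows: only the root span flips a coin, and every child span inherits that decision.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanContext:
    # Hypothetical context object, for illustration only.
    trace_id: int
    sampled: bool  # decided once at the root, then propagated unchanged


def start_span(parent: Optional[SpanContext], rate: float,
               rng: random.Random) -> SpanContext:
    """Start a span; only the root span makes a random sampling decision."""
    if parent is None:
        # Root span: decide before anything is known about the rest of the trace.
        return SpanContext(trace_id=rng.getrandbits(64),
                           sampled=rng.random() < rate)
    # Child span: inherit the root's decision, so the trace is kept or dropped as a whole.
    return SpanContext(trace_id=parent.trace_id, sampled=parent.sampled)


rng = random.Random(42)
root = start_span(None, rate=0.1, rng=rng)
child = start_span(root, rate=0.1, rng=rng)
```

Because the decision is taken before any child span exists, nothing in this sketch can tell whether the trace will later take a rare path.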

Tail-based trace sampling

In tail-based sampling, the sampling decision is made when the request has been completed and all information about the trace has been collected. This method has the benefit of more intelligent sampling since rarer traces can be collected just as often as more common ones. However, incomplete traces have to be buffered on a collector service as the decision can only be made upon completion, causing significant communication and memory overhead.
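The buffering cost is easier to see in code. The following is a simplified sketch, not a real collector: `TailSampler`, the dictionary-based span format, and the `keep_if_error` policy are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Span = dict  # assumed span format: {"trace_id": int, "name": str, "error": bool}


class TailSampler:
    """Hypothetical collector that decides only after a trace has completed."""

    def __init__(self, keep: Callable[[List[Span]], bool]):
        self.keep = keep
        self.buffer: Dict[int, List[Span]] = defaultdict(list)
        self.stored: List[List[Span]] = []

    def on_span(self, span: Span) -> None:
        # Every span must be held in memory until its trace is complete.
        self.buffer[span["trace_id"]].append(span)

    def on_trace_complete(self, trace_id: int) -> bool:
        # The decision can use everything that is known about the whole trace.
        spans = self.buffer.pop(trace_id, [])
        if spans and self.keep(spans):
            self.stored.append(spans)
            return True
        return False


def keep_if_error(spans: List[Span]) -> bool:
    """Example policy (an assumption): keep traces containing an error span."""
    return any(span.get("error", False) for span in spans)
```

Every span passes through `buffer` whether or not its trace is eventually kept, which is where the communication and memory overhead comes from.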

What is the issue with canonical trace sampling methods?

These canonical methods have served observability for a long time. However, they are far from perfect.

In head-based sampling, you often do not know at the root span whether a trace is a rare one or a common one. Since the sampling decision is random, you cannot ensure that you get high coverage. Frequent request types dominate the sample, while rare request types yield few samples or none at all. This makes it challenging to choose an appropriate sampling rate, since you also need to take into account the constraints on data collection at the backend.

As already mentioned, tail-based sampling comes with significant additional network and memory costs, because of the required preprocessing. In addition, if the collector service needs to be scalable, additional complexity is introduced as the spans of the same trace have to be routed to the same collector instance. Adding or removing instances may even lead to unintended information loss.
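The routing problem can be illustrated with a deliberately naive scheme. The sketch below assumes that a collector instance is chosen as the trace ID modulo the number of instances. This is an assumption for illustration, not a description of how any particular collector routes spans.

```python
def route(trace_id: int, num_collectors: int) -> int:
    """Pick a collector instance so that all spans of a trace land together."""
    return trace_id % num_collectors


trace_ids = range(12)
before = {t: route(t, 3) for t in trace_ids}  # three collector instances
after = {t: route(t, 4) for t in trace_ids}   # a fourth instance is added
moved = [t for t in trace_ids if before[t] != after[t]]
print(f"{len(moved)} of {len(trace_ids)} traces now map to a different collector")
```

In this toy setup, 9 of the 12 trace IDs change their collector when the fourth instance is added. Any trace that is still being buffered at that moment would have its spans split across two instances, so neither of them sees the complete trace.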

What is the alternative?

To ensure that rarer parts of a trace are sampled as often as more common ones, Dynatrace Research proposes a flexible sampling method called “partial trace sampling”.

In partial trace sampling, the sampling rate may vary depending on the frequency with which a part of a branch is called.

For example, the head of the trace is sampled less frequently because it is called very often, while the deeper parts of the trace that are called less often are sampled more frequently. This ensures a more balanced sampling of spans across the board, which is less likely to happen with other approaches.
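One simple way to picture frequency-dependent rates is a policy that aims for roughly the same number of sampled spans per service. The sketch below is such a policy. It is an assumption made for illustration and not the method from the research paper, and the service names, call volumes, and the `sampling_rate` helper are all hypothetical.

```python
def sampling_rate(calls_per_minute: float, target_per_minute: float) -> float:
    """Hypothetical policy: sample just enough to reach a target span volume."""
    if calls_per_minute <= 0:
        return 1.0  # nothing observed yet, so keep whatever arrives
    return min(1.0, target_per_minute / calls_per_minute)


# Made-up call volumes: the frontend is hit far more often than deeper services.
calls = {"frontend": 10_000, "checkout": 500, "invoice-export": 20}
rates = {service: sampling_rate(n, target_per_minute=100)
         for service, n in calls.items()}
expected_sampled = {service: calls[service] * rates[service] for service in calls}
```

With these numbers the frontend ends up with a rate of about 1%, the checkout service with about 20%, and the rarely called export service keeps every span. The frontend and checkout services then each contribute about 100 sampled spans per minute instead of the frontend dominating the data.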

Are partially sampled traces useful?

In contrast to head- and tail-based sampling, varying sampling rates across the spans of the same trace often result in only fragments being collected, which makes the analysis more challenging.

While consistent span sampling maximizes the probability of capturing all spans of a trace by sharing the same random number for all sampling decisions within a trace, there will be many traces that are only partially sampled due to the differing span sampling rates. However, the information available in incomplete traces is also valuable because many queries do not need the full trace and only consider specific branches. For example, estimating how often one backend service calls another backend service does not require any information about the frontend.
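The shared random number can be sketched in a few lines. The code below assumes that each span has its own sampling rate and is kept when the trace-wide random number is smaller than that rate. The span names and rates are hypothetical, and `estimate_total` uses standard inverse-probability weighting, which is a simplification and may differ from the estimators in the research paper.

```python
from typing import Dict, List


def sample_trace(span_rates: Dict[str, float], r: float) -> List[str]:
    """Keep every span whose rate exceeds r, the random number shared by the trace."""
    return [name for name, rate in span_rates.items() if r < rate]


def estimate_total(sampled_rates: List[float]) -> float:
    """Estimate the true span count: a span kept at rate p stands for 1/p spans."""
    return sum(1.0 / rate for rate in sampled_rates)


# Hypothetical per-span sampling rates within one trace.
span_rates = {"frontend": 0.01, "checkout": 0.2, "invoice-export": 1.0}

full = sample_trace(span_rates, r=0.005)    # all three spans are kept
partial = sample_trace(span_rates, r=0.1)   # the frontend span is dropped
tail_only = sample_trace(span_rates, r=0.5) # only the deepest span is kept
```

Because all decisions compare against the same `r`, a span with a higher rate is always kept whenever a span with a lower rate is kept, so the collected fragments are as complete as the rates allow. A query about backend-to-backend calls can then work on the fragment alone: for instance, 20 sampled checkout spans at a rate of 0.2 give `estimate_total([0.2] * 20)`, which is an estimate of about 100 actual calls.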

What is the current state of research?

These ideas have been formulated in our research paper and are being used to define the new OpenTelemetry sampling specification, but it’s still a work in progress. Nonetheless, you can have a look at the proof of concept published on GitHub.

Partial trace sampling: a new approach to distributed trace sampling was originally published in Dynatrace Engineering on Medium.
