Profiling Individual Queries in a Concurrent System

March 15, 2024

Nathan Bronson

Principal Architect at Rockset

A good CPU profiler is worth its weight in gold. Measuring performance in-situ usually means using a sampling profile. They provide a lot of information while having very low overhead. In a concurrent system, however, it is hard to use the resulting data to extract high-level insights. Samples don’t include context like query IDs and application-level statistics; they show you what code was run, but not why.

This blog introduces trampoline histories, a technique Rockset has developed to efficiently attach application-level information (query IDs) to the samples of a CPU profile. This lets us use profiles to understand the performance of individual queries, even when multiple queries are executing simultaneously across the same set of worker threads.

Primer on Rockset

Rockset is a cloud-native search and analytics database. SQL queries from a customer are executed in a distributed fashion across a set of servers in the cloud. We use inverted indexes, approximate vector indexes, and columnar layouts to efficiently execute queries, while also processing streaming updates. The majority of Rockset’s performance-critical code is C++.

Most Rockset customers have their own dedicated compute resources referred to as virtual instances. Within that dedicated set of compute resources, however, multiple queries can execute at the same time. Queries are executed in a distributed fashion across all of the nodes, so this means that multiple queries are active at the same time in the same process. This concurrent query execution poses a challenge when trying to measure performance.

Concurrent query processing improves utilization by allowing computation, I/O, and communication to be overlapped. This overlapping is especially important for high QPS workloads and fast queries, which have more coordination relative to their fundamental work. Concurrent execution is also important for reducing head-of-line blocking and latency outliers; it prevents an occasional heavy query from blocking completion of the queries that follow it.

We manage concurrency by breaking work into micro-tasks that are run by a fixed set of thread pools. This substantially reduces the need for locks, because we can manage synchronization via task dependencies, and it also minimizes context switching overheads. Unfortunately, this micro-task architecture makes it difficult to profile individual queries. Callchain samples (stack backtraces) might have come from any active query, so the resulting profile shows only the sum of the CPU work.

Profiles that blend all of the active queries are better than nothing, but a lot of manual expertise is required to interpret the noisy results. Trampoline histories let us assign most of the CPU work in our execution engine to individual query IDs, both for continuous profiles and on-demand profiles. This is a very powerful tool when tuning queries or debugging anomalies.

DynamicLabel

The API we’ve constructed for adding application-level metadata to the CPU samples is called DynamicLabel. Its public interface is very simple:

class DynamicLabel {
  public:
    DynamicLabel(std::string key, std::string value);
    ~DynamicLabel();

    template <typename Func>
    std::invoke_result_t<Func> apply(Func&& func) const;
};

DynamicLabel::apply invokes func. Profile samples taken during that invocation will have the label attached.

Each query needs only one DynamicLabel. Whenever a micro-task from the query is run it is invoked via DynamicLabel::apply.

One of the most important properties of sampling profilers is that their overhead is proportional to their sampling rate; this is what lets their overhead be made arbitrarily small. In contrast, DynamicLabel::apply must do some work for every task regardless of the sampling rate. In some cases our micro-tasks can be pretty micro, so it is important that apply has very low overhead.

apply's performance is the primary design constraint. DynamicLabel's other operations (construction, destruction, and label lookup during sampling) happen orders of magnitude less frequently.

Let’s work through some ways we might try to implement the DynamicLabel functionality. We’ll evaluate and refine them with the goal of making apply as fast as possible. If you want to skip the journey and jump straight to the destination, go to the "Trampoline Histories" section.

Implementation Ideas

Idea #1: Resolve dynamic labels at sample collection time

The most obvious way to associate application metadata with a sample is to put it there from the beginning. The profiler would look for dynamic labels at the same time that it is capturing the stack backtrace, bundling a copy of them with the callchain.

Rockset's profiling uses Linux's perf_event, the subsystem that powers the perf command line tool. perf_event has many advantages over signal-based profilers (such as gperftools). It has lower bias, lower skew, lower overhead, access to hardware performance counters, visibility into both userspace and kernel callchains, and the ability to measure interference from other processes. These advantages come from its architecture, in which system-wide profile samples are taken by the kernel and asynchronously passed to userspace through a lock-free ring buffer.

Although perf_event has a lot of advantages, we can’t use it for idea #1 because it can’t read arbitrary userspace data at sampling time. eBPF profilers have a similar limitation.

Idea #2: Record a perf sample when the metadata changes

If it’s not possible to pull dynamic labels from userspace to the kernel at sampling time, then what about push? We could add an event to the profile every time that the thread→label mapping changes, then post-process the profiles to match up the labels.

One way to do this would be to use perf uprobes. Userspace probes can record function invocations, including function arguments. Unfortunately, uprobes are too slow to use in this fashion for us. Thread pool overhead for us is about 110 nanoseconds per task. Even a single crossing from the userspace into the kernel (uprobe or syscall) would multiply this overhead.

Avoiding syscalls during DynamicLabel::apply also prevents an eBPF solution, where we update an eBPF map in apply and then modify an eBPF profiler like BCC to fetch the labels when sampling.

edit: eBPF can be used to pull from userspace when collecting a sample, reading fsbase and then using bpfprobereaduser() to walk a userspace data structure that is attached to a threadlocal. If you have BPF permissions enabled in your production environment and are using a BPF-based profiler then this alternative can be a good one. The engineering and deployment issues are more complex but the result does not require in-process profile processing. Thanks to Jason Rahman for pointing this out.

Idea #3: Merge profiles with a userspace label history

If it's too expensive to record changes to the thread→label mapping in the kernel, what if we do it in the userspace? We could record a history of calls to DynamicLabel::apply, then join it to the profile samples during post-processing. perf_event samples can include timestamps and Linux's CLOCK_MONOTONIC clock has enough precision to appear strictly monotonic (at least on the x86_64 or arm64 instances we might use), so the join would be exact. A call to clock_gettime using the VDSO mechanism is a lot faster than a kernel transition, so the overhead would be much lower than that for idea #2.

The challenge with this approach is the data footprint. DynamicLabel histories would be several orders of magnitude larger than the profiles themselves, even after applying some simple compression. Profiling is enabled continuously on all of our servers at a low sampling rate, so trying to persist a history of every micro-task invocation would quickly overload our monitoring infrastructure.

Idea #4: In-memory history merging

The earlier we join samples and label histories, the less history we need to store. If we could join the samples and the history in near-realtime (perhaps every second) then we wouldn’t need to write the histories to disk at all.

The most common way to use Linux's perf_event subsystem is via the perf command line tool, but all of the deep kernel magic is available to any process via the perf_event_open syscall. There are a lot of configuration options (perf_event_open(2) is the longest manpage of any system call), but once you get it set up you can read profile samples from a lock-free ring buffer as soon as they are gathered by the kernel.

To avoid contention, we could maintain the history as a set of thread-local queues that record the timestamp of every DynamicLabel::apply entry and exit. For each sample we would search the corresponding history using the sample’s timestamp.

This approach has feasible performance, but can we do better?

Idea #5: Use the callchains to optimize the history of calls to `apply`

We can use the fact that apply shows up in the recorded callchains to reduce the history size. If we block inlining so that we can find DynamicLabel::apply in the call stacks, then we can use the backtrace to detect exit. This means that apply only needs to write the entry records, which record the time that an association was created. Halving the number of records halves the CPU and data footprint (of the part of the work that is not sampled).

This strategy is the best one yet, but we can do even better! The history entry records a range of time for which apply was bound to a particular label, so we only need to make a record when the binding changes, rather than per-invocation. This optimization can be very effective if we have multiple versions of apply to look for in the call stack. This leads us to trampoline histories, the design that we have implemented and deployed.

Trampoline Histories

If the stack has enough information to find the right DynamicLabel , then the only thing that apply needs to do is leave a frame on the stack. Since there are multiple active labels, we’ll need multiple addresses.

A function that immediately invokes another function is a trampoline. In C++ it might look like this:

__attribute__((__noinline__))
void trampoline(std::move_only_function<void()> func) {
    func();
    asm volatile (""); // prevent tailcall optimization
}

Note that we need to prevent compiler optimizations that would cause the function to not be present in the stack, namely inlining and tailcall elimination.

The trampoline compiles to only 5 instructions, 2 to set up the frame pointer, 1 to invoke func(), and 2 to clean up and return. Including padding this is 32 bytes of code.

C++ templates let us easily generate a whole family of trampolines, each of which has a unique address.

using Trampoline = __attribute__((__noinline__)) void (*)(
        std::move_only_function<void()>);

constexpr size_t kNumTrampolines = ...;

template <size_t N>
__attribute__((__noinline__))
void trampoline(std::move_only_function<void()> func) {
    func();
    asm volatile (""); // prevent tailcall optimization
}

template <size_t... Is>
constexpr std::array<Trampoline, sizeof...(Is)> makeTrampolines(
        std::index_sequence<Is...>) {
    return {&trampoline<Is>...};
}

Trampoline getTrampoline(unsigned idx) {
    static constexpr auto kTrampolines =
            makeTrampolines(std::make_index_sequence<kNumTrampolines>{});
    return kTrampolines.at(idx);
}

We’ve now got all of the low-level pieces we need to implement DynamicLabel:

DynamicLabel construction → find a trampoline that is not currently in use, append the label and current timestamp to that trampoline’s history
DynamicLabel::apply → invoke the code using the trampoline
DynamicLabel destruction → return the trampoline to a pool of unused trampolines
Stack frame symbolization → if the trampoline’s address is found in a callchain, look up the label in the trampoline’s history

Performance Impact

Our goal is to make DynamicLabel::apply fast, so that we can use it to wrap even small pieces of work. We measured it by extending our existing dynamic thread pool microbenchmark, adding a layer of indirection via apply.

{
    DynamicThreadPool executor({.maxThreads = 1});
    for (size_t i = 0; i < kNumTasks; ++i) {
        executor.add([&]() {
            label.apply([&] { ++count; }); });
    }
    // ~DynamicThreadPool waits for all tasks
}
EXPECT_EQ(kNumTasks, count);

Perhaps surprisingly, this benchmark shows zero performance impact from the extra level of indirection, when measured using either wall clock time or cycle counts. How can this be?

It turns out we’re benefiting from many years of research into branch prediction for indirect jumps. The inside of our trampoline looks like a virtual method call to the CPU. This is extremely common, so processor vendors have put a lot of effort into optimizing it.

If we use perf to measure the number of instructions in the benchmark we observe that adding label.apply causes about three dozen extra instructions to be executed per loop. This would slow things down if the CPU was front-end bound or if the destination was unpredictable, but in this case we are memory bound. There are plenty of execution resources for the extra instructions, so they don’t actually increase the program’s latency. Rockset is generally memory bound when executing queries; the zero-latency result holds in our production environment as well.

A Few Implementation Details

There are a few things we've done to improve the ergonomics of our profile ecosystem:

The perf.data format emitted by perf is optimized for CPU-efficient writing, not for simplicity or ease of use. Even though Rockset’s perf_event_open-based profiler pulls data from perf_event_open, we have chosen to emit the same protobuf-based pprof format used by gperftools. Importantly, the pprof format supports arbitrary labels on samples and the pprof visualizer already has the ability to filter on those tags, so it was easy to add and use the information from DynamicLabel.
We subtract one from most callchain addresses before symbolizing, because the return address is actually the first instruction that will be run after returning. This is especially important when using inline frames, since neighboring instructions are often not from the same source function.
We rewrite trampoline<i> to trampoline<0> so that we have the option of ignoring the tags and rendering a regular flame graph.
When simplifying demangled constructor names, we use something like Foo::copy_construct and Foo::move_construct rather than simplifying both to Foo::Foo. Differentiating constructor types makes it much easier to search for unnecessary copies. (If you implement this make sure you can handle demangled names with unbalanced < and >, such as std::enable_if<sizeof(Foo) > 4, void>::type.)
We compile with -fno-omit-frame-pointer and use frame pointers to build our callchains, but some important glibc functions like memcpy are written in assembly and don’t touch the stack at all. For these functions, the backtrace captured by perf_event_open's PERF_SAMPLE_CALLCHAIN mode omits the function that calls the assembly function. We find it by using PERF_SAMPLE_STACK_USER to record the top 8 bytes of the stack, splicing it into the callchain when the leaf is in one of those functions. This is much less overhead than trying to capture the entire backtrace with PERF_SAMPLE_STACK_USER.

Conclusion

Dynamic labels let Rockset tag CPU profile samples with the query whose work was active at that moment. This ability lets us use profiles to get insights about individual queries, even though Rockset uses concurrent query execution to improve CPU utilization.

Trampoline histories are a way of encoding the active work in the callchain, where the existing profiling infrastructure can easily capture it. By making the DynamicLabel ↔ trampoline binding relatively long-lived (milliseconds, rather than microseconds), the overhead of adding the labels is kept extremely low. The technique applies to any system that wants to augment sampled callchains with application state.

Rockset is hiring engineers in its Boston, San Mateo, London and Madrid offices. Apply to open engineering positions today.