DevConf.CZ 2025

Auto-instrumentation for GPU performance using eBPF
2025-06-14 , D0206 (capacity 154)

Modern AI workloads rely on large GPU fleets whose efficient utilisation is crucial due to high costs. However, gathering telemetry from these workloads to optimise performance is challenging because it requires manual instrumentation and adds performance overheads. Further, it does not produce telemetry in a standardised format for commonly used visualisation tools like Prometheus.

This talk explores the potential of leveraging eBPF to capture CUDA calls made to GPUs, including kernel launches and memory allocations. Data from these probes can be used to export Prometheus metrics, facilitating detailed analysis of kernel launch patterns and associated memory usage. This approach offers significant benefits as eBPF imposes minimal overhead and requires no intrusive instrumentation. Our implementation is also open-source and available on GitHub.


Experience level

Intermediate - attendees should be familiar with the subject

See also:

Marc Tuduri is Prometheus contributor, OpenTelemetry member and Software Engineer at Grafana.

This speaker also appears in:

Dominik started his journey in technology as an SRE, working on projects ranging from warehouse logistics and photobook designers to analyzing satellite imagery. During this time, he discovered his passion for developer tooling and making sure developers can focus on what they do best - build great software!

Now he is working as a Developer Experience Engineer at Grafana Labs, building tools to see clearly in the ever-changing world of software.

This speaker also appears in: