Upgrading Pinterest operational metrics – Pinterest Engineering


Three distributions from three hosts, each reporting one p90
The true p90 across all three hosts cannot be recovered from the three per-host p90s
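To make the caption above concrete, here is a minimal sketch with made-up latency samples (the fleets, host sizes, and helper names are all hypothetical, not taken from Pinterest's systems). Two fleets report identical per-host p90s, yet their fleet-wide p90s differ, so no function of the reported p90s alone can return the right answer for both.

```python
def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # integer ceil(q * n / 100)
    return ordered[max(rank, 1) - 1]


def per_host_p90s(fleet):
    """What each host would report if it only shipped its own p90."""
    return [percentile(host, 90) for host in fleet]


def true_p90(fleet):
    """The p90 over every raw sample in the fleet."""
    merged = [sample for host in fleet for sample in host]
    return percentile(merged, 90)


# Hypothetical latencies in milliseconds. Fleet A: the fast host takes most traffic.
fleet_a = [
    [10] * 89 + [50] + [60] * 10,   # 100 requests
    [20] * 8 + [100, 110],          # 10 requests
    [30] * 8 + [200, 210],          # 10 requests
]

# Fleet B: the slow host takes most traffic.
fleet_b = [
    [10] * 8 + [50, 60],            # 10 requests
    [20] * 8 + [100, 110],          # 10 requests
    [190] * 89 + [200] + [210] * 10,  # 100 requests
]

print(per_host_p90s(fleet_a), true_p90(fleet_a))  # [50, 100, 200] 60
print(per_host_p90s(fleet_b), true_p90(fleet_b))  # [50, 100, 200] 190
```

Averaging the three reported values gives roughly 117 ms for both fleets, which is wrong in both cases and in opposite directions.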
Metrics flow diagram

API decisions

Optimization 1: Caching synchronized hashmap lookups

Optimization 2: Thread-Local Stats

Optimization 3: Gauge API design

Before and after the MABS pipeline
Only the true p95 is stored, rather than one p95 for every fleet machine.
The MABS pipeline does not produce false spikes
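Storing a single fleet-wide percentile requires combining distributions rather than combining percentiles. The sketch below shows one common way to do that with fixed-bucket histograms. It is an illustration under stated assumptions and not necessarily how MABS works: the bucket bounds, function names, and sample data are all hypothetical.

```python
import bisect
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds (an assumption, not from the article).
BUCKET_BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000]


def to_histogram(latencies_ms):
    """Count samples per bucket; index len(BUCKET_BOUNDS_MS) is the overflow bucket."""
    hist = Counter()
    for value in latencies_ms:
        hist[bisect.bisect_left(BUCKET_BOUNDS_MS, value)] += 1
    return hist


def merge(histograms):
    """Bucket counts add exactly, so merging loses nothing beyond the bucket resolution."""
    return sum(histograms, Counter())


def histogram_percentile(hist, q):
    """Upper bound of the bucket that contains the q-th percentile (nearest rank)."""
    total = sum(hist.values())
    rank = max((q * total + 99) // 100, 1)  # integer ceil(q * total / 100)
    seen = 0
    for index in sorted(hist):
        seen += hist[index]
        if seen >= rank:
            if index < len(BUCKET_BOUNDS_MS):
                return BUCKET_BOUNDS_MS[index]
            return float("inf")  # sample fell beyond the largest bound
    raise ValueError("empty histogram")


# Each host ships bucket counts instead of a precomputed percentile.
host_histograms = [
    to_histogram([5] * 95 + [400] * 5),
    to_histogram([20] * 9 + [90]),
    to_histogram([40] * 9 + [200]),
]

fleet = merge(host_histograms)
print(histogram_percentile(fleet, 95))  # 100, i.e. the fleet p95 is at most 100 ms
```

The result is accurate only to the bucket width, which is the usual trade-off for being able to merge across hosts and store one series per metric.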


