"How NOT to Measure Latency" by Gil Tene

  • Published 27 Sep 2015
  • Time is Money. Understanding application responsiveness and latency is critical, but good characterization of bad data is useless. Gil Tene discusses some common pitfalls encountered in measuring latency and response-time behavior. He shows how simple, open-source tools can be used to improve and gain higher confidence in both latency measurement and reporting.
    Gil Tene
    AZUL SYSTEMS
    @giltene
    Gil Tene is CTO and co-founder of Azul Systems. He has been involved with virtual machine and runtime technologies for the past 25 years. His pet focus areas include system responsiveness and latency behavior. Gil is a frequent speaker at technology conferences worldwide, and an official JavaOne Rock Star. He pioneered the Continuously Concurrent Compacting Collector (C4) that powers Azul's continuously reactive Java platforms. In past lives, he also designed and built operating systems, network switches, firewalls, and laser-based mosquito interception systems.
  • Science & Technology

Comments • 16

  • @pranytt3485 • a year ago • +17

    Key takeaways for me:
    1. Most tools that capture response times report the 99th-percentile latency over some fixed window, e.g. every 30 seconds; Prometheus metrics, for example, are scraped every minute. But the real thing to look at is the max response time (see the recording sketch after this list).
    2. Gatling fixed the coordinated omission problem. Most of the other tools, like JMeter, still have it, so use Gatling for your load generation and reporting.
    3. I didn't fully understand coordinated omission, but I'm now informed that it is bad and needs to be watched out for.
    4. When a graph shows a sudden spike, it is an indication of possible coordinated omission. If a graph grows smoothly, that suggests the data is not bad. There may be exceptions to this rule.
    5. There is no point in looking at percentile graphs if you don't have performance goals set for your service. If you are comparing two systems and your target is 20 ms, you could plot graphs and see the maximum throughput each system supports while keeping latency at 20 ms.
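
    A minimal sketch of the kind of recording takeaway 1 points at, using Gil Tene's HdrHistogram Java library; the loop and the doRequest() helper are hypothetical stand-ins for whatever is being measured:

        import org.HdrHistogram.Histogram;

        public class LatencyRecorder {
            public static void main(String[] args) {
                // Track latencies up to 1 hour (in ns) with 3 significant decimal digits.
                Histogram histogram = new Histogram(3_600_000_000_000L, 3);

                for (int i = 0; i < 100_000; i++) {
                    long start = System.nanoTime();
                    doRequest(); // hypothetical operation under test
                    histogram.recordValue(System.nanoTime() - start);
                }

                // Report the max alongside the percentiles, not just a windowed p99.
                System.out.printf("p50=%d p99=%d p99.9=%d max=%d (ns)%n",
                        histogram.getValueAtPercentile(50.0),
                        histogram.getValueAtPercentile(99.0),
                        histogram.getValueAtPercentile(99.9),
                        histogram.getMaxValue());
            }

            private static void doRequest() {
                // Placeholder for the request being measured.
            }
        }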

  • @TheSuckerOfTheWorld • 8 years ago • +13

    Ten minutes in, and I already see the very obvious flaw that +Gil Tene pointed out in my day-to-day monitoring.
    Great talk!

  • @whitegelfling • 8 years ago • +8

    Coordinated omission: One issue here is one that is often encountered with metrics in business, which is that the bosses want simple, easy, and reliable numbers to look at. To the person behind the project, it looks like a system that irons out a rare case, without understanding the maths behind it.

  • @timothydsears • 8 years ago • +8

    Terrific talk about load testing and lazy thinking. The early part probably applies to anyone thinking about metrics for a complex system.

  • @TestAutomationTV • a year ago

    Nice talk, I've read good things about it. Now starting to listen, looking forward to finding some good stuff about performance testing.

  • @WilsonMar1 • 8 years ago • +1

    [6:52] I don't have the data. A common problem we have is that we plot only what is convenient. We only plot what gives us nice colorful charts. We choose the noise to display.

  • @minimaddu • 8 years ago • +5

    Great talk! I'm curious, we get most of our production response time stats from AWS load balancer logs. Is that an accurate measure of response time?

  • @Turalcar • a year ago

    I'm more used to graphs being split by request kind. To me, the first thing that jumped out was the large difference between the 50th and 75th percentiles.

  • @ericj1380 • 2 years ago • +2

    @12:04, is this because 5 page loads at 40 resources per page increase the chance of hitting above the p99?
    If that's the case, couldn't you just adjust each graph to be on a per-resource or per-page basis, which seems like it would directly reflect the percentile?
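
    A quick worked check of the arithmetic behind this question, assuming independent response times (the numbers are just the ones from the comment above):

        public class PercentileOdds {
            public static void main(String[] args) {
                int requests = 5 * 40; // 5 page loads x 40 resources per page
                double allUnderP99 = Math.pow(0.99, requests);
                // Prints ~0.134 and ~0.866: most sessions see at least one above-p99 response.
                System.out.printf("P(all %d under p99) = %.3f, P(at least one over) = %.3f%n",
                        requests, allUnderP99, 1 - allUnderP99);
            }
        }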

  • @ruimeireles1695 • 3 years ago • +1

    Can anyone write down all the tool names mentioned in the presentation? I can't find some of them, probably because I'm not spelling the names correctly.

  • @whitegelfling • 8 years ago • +8

    OK, I'm only a few minutes in and my brain hurts. I can't believe that people seriously ignore the max in things like this. Scary.

    • @MikkoRantalainen • 4 years ago • +1

      I agree. Only the maximum (worst-case latency) and the median latency are worth watching. Everything else is just noise.

    • @MikkoRantalainen • 4 years ago

      Note that "median" is not the target, the diffence between the worst case latency and median latency is the part of the picture that could get better if you fix the bad stuff. Getting median latency downwards often requires LOTS of changes to the system.

    • @MikkoRantalainen • 4 years ago • +1

      All well-made latency graphs should have the number of requests per second on the horizontal axis and the maximum response time on the vertical axis. The number of requests per second at which the maximum response time gets too high is the limit.
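
      A minimal sketch of a load step that would produce the data for such a graph, assuming each request is measured from its intended start time so that stalls still count against the max; doRequest() is a hypothetical stand-in:

          import java.util.concurrent.TimeUnit;

          public class RateStepTest {
              public static void main(String[] args) {
                  for (int rate : new int[] {100, 200, 400, 800}) { // offered req/s per step
                      long period = TimeUnit.SECONDS.toNanos(1) / rate;
                      long intended = System.nanoTime();
                      long maxNanos = 0;
                      for (int i = 0; i < rate * 10; i++) { // run each step for ~10 s
                          while (System.nanoTime() < intended) { } // wait for the scheduled slot
                          doRequest();
                          // Measure from the INTENDED start so queueing delay is not omitted.
                          maxNanos = Math.max(maxNanos, System.nanoTime() - intended);
                          intended += period;
                      }
                      System.out.printf("%d req/s -> max %.2f ms%n", rate, maxNanos / 1e6);
                  }
              }

              private static void doRequest() { /* placeholder for the operation under test */ }
          }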

    • @GeorgeTsiros • a year ago • +1

      That is why "how to measure" is, by itself, an entire class in physics courses (at least).

  • @tirumaraiselvan1 • 6 months ago

    19:34 — shouldn't that be 100 measurements of 100s each? 100 requests would be sent in that second, and each would be stalled for 100s.
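
    For context, HdrHistogram ships a correction aimed at exactly this kind of stall: recordValueWithExpectedInterval back-fills the histogram with the measurements the stall hid. A minimal sketch, assuming 100 req/s (one expected request every 10 ms) and a single 100 s stall; note that the library fills in a descending ramp (100 s, 100 s - 10 ms, 100 s - 20 ms, ...) rather than identical copies of 100 s:

        import java.util.concurrent.TimeUnit;
        import org.HdrHistogram.Histogram;

        public class OmissionCorrection {
            public static void main(String[] args) {
                Histogram histogram = new Histogram(3_600_000_000_000L, 3);

                long expectedInterval = TimeUnit.MILLISECONDS.toNanos(10); // 100 req/s
                long stalledResponse = TimeUnit.SECONDS.toNanos(100);      // one 100 s stall

                // Records the observed value plus synthetic values for the requests
                // that would have been issued (and stalled) during the pause.
                histogram.recordValueWithExpectedInterval(stalledResponse, expectedInterval);

                System.out.printf("corrected sample count: %d, max: %d s%n",
                        histogram.getTotalCount(),
                        TimeUnit.NANOSECONDS.toSeconds(histogram.getMaxValue()));
            }
        }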