Monitoring, managing and troubleshooting large scale networks

  • Added: June 1, 2015
  • Speaker:
    Peter Hoose, Facebook, Inc.
    Almost four years ago I came to NANOG and mostly complained about the state of network monitoring, which is par for the course for me. A lot has changed since then: we've solved many of the problems I raised and, perhaps more importantly, we've fundamentally changed how we manage, monitor and troubleshoot our network. We plan to share what we learned, what went well, and best of all, what went oh so terribly wrong.

    Our driving philosophy is that by taking an engineering approach to operations, you can greatly reduce the time it takes to discover, mitigate and resolve issues on your network. We analyzed our faults, our pain points and the work that consumed most of our time, which let us prioritize what to tackle first. We were surprised by what caused the most outages, and by how much impact minor network issues can have when they fall in the right place. As a result, the majority of the faults that occur in our network today are detected and mitigated automatically, without human intervention.

    We'll dive into some of the most interesting issues we've experienced in our network and how we narrowed them down, both before and after our new tooling and monitoring were deployed. We'll walk through specific examples of remediations and how the systems function. I'm lazy: I don't want to spend my time fixing known issues, I want to work on new problems, I want a challenge. This was the driving force behind our approach. If this sounds like you, then this talk is for you.

    One of the keys to this effort was a system called FBAR, which interacts with our devices to perform the tasks needed to resolve issues. We'll explain in detail how this works, as well as some of our earlier remediations. As a companion to this talk, David Swafford will be preparing a separate tutorial session to show you how to build your own system, much like FBAR, to help detect, isolate and remedy issues automatically.
- See more at: www.nanog.org/meetings/nanog6...
  • Science & Technology
