What is MTTR? - Key Incident Recovery Metrics to Reduce Downtime
- Added Aug 14, 2024
- What is MTTR?
In this video, we cover the key incident recovery metrics you need to reduce downtime.
Learn more at pagertree.com/...
This video was brought to you by PagerTree. On-Call. Simplified.
Start your free trial today at pagertree.com
***************************************************************
Transcript:
Uptime, MTTR, MTTA, and MTBF.
So many acronyms! But it all starts with downtime.
Downtime is time your solutions are out of action and unavailable for use.
It’s costly to fix, and ultimately breaks customer trust. Simply put, downtime is expensive!
So let’s take a look at how these key metrics help us manage downtime.
The first metric is Uptime.
Uptime is the percentage of time during which a company’s solutions are in action and available for use.
It’s calculated as “Total time less downtime divided by total time”
It’s a measure of availability, and in today’s connected world, you want to target at least 99.99% uptime.
Next, we have MTTR.
Mean Time to Resolution is the average time it takes to resolve an outage and restore services to end-users.
It’s calculated as “Total downtime divided by # of incidents”
MTTR is a measure of resilience. Incidents are bound to happen, but healthy systems bounce back quickly.
Our next metric is MTTA.
Mean Time to Acknowledge is the average time it takes for a new incident to be acknowledged.
It’s calculated as “Total time to acknowledge divided by # of incidents”
MTTA is a measure of responsiveness. The faster you respond, the sooner you can start working towards a solution.
Lastly, we have MTBF.
Mean Time Between Failures is the average time from one incident to the next.
It’s calculated as “Total Time less downtime divided by the # of incidents”
MTBF is a measure of reliability, and answers the question: How often do our systems go down?
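The four formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample figures (1,000 hours observed, 20 hours of downtime, 4 incidents, 60 minutes spent acknowledging) are hypothetical and do not come from the video.

```python
def uptime_percent(total_hours: float, downtime_hours: float) -> float:
    """Uptime: total time less downtime, divided by total time, as a percentage."""
    return (total_hours - downtime_hours) / total_hours * 100


def mttr_hours(downtime_hours: float, incidents: int) -> float:
    """MTTR: total downtime divided by the number of incidents (incidents > 0)."""
    return downtime_hours / incidents


def mtta_minutes(total_ack_minutes: float, incidents: int) -> float:
    """MTTA: total time to acknowledge divided by the number of incidents."""
    return total_ack_minutes / incidents


def mtbf_hours(total_hours: float, downtime_hours: float, incidents: int) -> float:
    """MTBF: total time less downtime, divided by the number of incidents."""
    return (total_hours - downtime_hours) / incidents


if __name__ == "__main__":
    # Hypothetical sample figures, chosen only to exercise the formulas.
    print(f"Uptime: {uptime_percent(1000, 20):.2f}%")
    print(f"MTTR:   {mttr_hours(20, 4):.1f} hours")
    print(f"MTTA:   {mtta_minutes(60, 4):.1f} minutes")
    print(f"MTBF:   {mtbf_hours(1000, 20, 4):.1f} hours")
```

Note that uptime and MTBF share the same numerator (total time less downtime); they differ only in what that uptime is divided by.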
Now that we’ve covered our key metrics, let’s test our knowledge with some real numbers.
Over a 30 day period, you have 5 incidents, 10 hours of total downtime, and 180 total minutes to acknowledge.
Can you calculate your uptime, MTTR, MTTA, and MTBF?
Let’s start with uptime.
Remember, uptime is calculated as “total time less downtime divided by total time”.
In this case, we have 720 hours less 10 hours of downtime divided by 720 hours.
This gives us an uptime of 98.61%
Ouch! That translates to about 121.8 hours of downtime each year (1.39% of the 8,760 hours in a year). Not good at all.
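The yearly figure appears to come from applying the downtime percentage to a full year. A minimal sketch, assuming a 365-day year (8,760 hours); the function name is made up for illustration. Using the rounded uptime of 98.61% reproduces the 121.8 hours quoted, while the unrounded uptime gives a slightly lower number.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def annual_downtime_hours(uptime_percent: float) -> float:
    """Scale the downtime share (100% minus uptime) to a full year."""
    return (100 - uptime_percent) / 100 * HOURS_PER_YEAR


# Uptime as quoted in the video, rounded to two decimals.
print(round(annual_downtime_hours(98.61), 1))  # 121.8

# Uptime computed from the raw figures: (720 - 10) / 720.
exact_uptime = (720 - 10) / 720 * 100
print(round(annual_downtime_hours(exact_uptime), 1))  # 121.7
```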
We need to drill in on the other metrics to see where we can improve.
Our mean time to resolution is 10 hours of downtime divided by 5 incidents, which gives us an MTTR of 2 hours per incident.
That’s not horrible, but it’s not great either. We’ll want to break this down by incident severity in order to better understand what’s going on. Are we consistently hovering at 2 hours, or do we have some wild variations in time to resolution?
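One way to do that breakdown is to group incidents by severity and average each group separately. The incident records below are invented purely for illustration (the video only gives the totals: 5 incidents and 10 hours of downtime), and the severity labels are an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incident log: (severity, downtime in hours). Sums to 10 hours.
incidents = [
    ("SEV1", 6.0),
    ("SEV2", 1.5),
    ("SEV2", 1.0),
    ("SEV3", 1.0),
    ("SEV3", 0.5),
]


def mttr_by_severity(records):
    """Return the mean downtime per severity level."""
    grouped = defaultdict(list)
    for severity, hours in records:
        grouped[severity].append(hours)
    return {severity: mean(hours) for severity, hours in grouped.items()}


overall = mean(hours for _, hours in incidents)
print(f"Overall MTTR: {overall} hours")
for severity, value in sorted(mttr_by_severity(incidents).items()):
    print(f"{severity}: {value} hours")
```

In this made-up data set the overall MTTR is still 2 hours, but a single 6-hour SEV1 incident accounts for most of the downtime, which is the kind of variation the average alone hides.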
Next, let’s take a look at mean time to acknowledge. Our MTTA is 180 minutes divided by 5 incidents for an average of 36 minutes per incident.
That’s way too slow, and there’s some low-hanging fruit here. Speeding up response times can be the quickest way to improve MTTR and reduce downtime.
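To see how much room there is, compare MTTA with MTTR. The sketch below assumes that the time spent waiting for acknowledgment counts toward downtime, which the video does not state explicitly; the function names and the 5-minute target are hypothetical.

```python
def ack_share_of_mttr(mtta_minutes: float, mttr_minutes: float) -> float:
    """Fraction of the mean resolution time spent waiting for acknowledgment."""
    return mtta_minutes / mttr_minutes


def mttr_with_new_mtta(mttr_minutes: float, old_mtta: float, new_mtta: float) -> float:
    """MTTR if only the acknowledgment time changes and the repair work stays the same."""
    return mttr_minutes - old_mtta + new_mtta


mttr = 2 * 60  # 2 hours per incident, in minutes
mtta = 36      # minutes per incident

print(f"{ack_share_of_mttr(mtta, mttr):.0%} of MTTR is acknowledgment time")  # 30%
print(mttr_with_new_mtta(mttr, mtta, 5), "minutes MTTR with a 5-minute MTTA")  # 89
```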
Lastly, our mean time between failures is calculated in the following manner: 720 total hours less 10 hours of downtime divided by 5 incidents.
This gives us an MTBF of 142 hours. A week has 168 hours, so that translates to more than 1 outage a week! We’ll definitely want to look at our logged incident trends and see if we can’t identify some root causes.
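The "more than 1 outage a week" figure follows from comparing the MTBF with the 168 hours in a week. A small sketch; the function name is hypothetical, and adding MTTR to get the full incident-to-incident cycle is my assumption rather than something stated in the video.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def outages_per_week(mtbf_hours: float, mttr_hours: float = 0.0) -> float:
    """Expected outages per week.

    With mttr_hours=0 this divides a week by the uptime between failures.
    Passing the MTTR as well uses the full cycle (uptime plus repair time),
    which matches incidents per calendar week.
    """
    return HOURS_PER_WEEK / (mtbf_hours + mttr_hours)


print(round(outages_per_week(142), 2))     # 1.18
print(round(outages_per_week(142, 2), 2))  # 1.17, same as 5 incidents in 30 days
```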
So there you have it. Uptime, MTTR, MTTA, and MTBF. These four key metrics keep a pulse on your IT operations, and help you ask the right questions to take action.
Start monitoring these key metrics today and get ahead of downtime!
First metric equation: total time - downtime % total time, voiceover: total time plus downtime % total time, good start
Very nice video that makes the parameters easy to understand.
How did we get 121.8 hours of downtime each year at the 2:10 timestamp?
Very informative video. Thank you.
This is great! I love it! But what about MTTD (Mean Time to Detection)? It affects MTTR: the sooner you detect an incident, the sooner you can start working on it and the better your MTTR.
Good information, thanks!
Very informative, but the problems they fix aren't easy and there is a personnel shortage, which affects their evaluation.
Very crisp
What is the basis of these formulas for uptime, etc.?
How did you translate 142 hours to 1 outage per week?
Thanks for that quick information.
Thank you
Where did you get 720?
24 hours × 30 days = 720 hours
Thank you.