MTTD

MTTD: Mean Time To Detect

Definition: The average time it takes to identify a problem or incident, measured from the moment it occurs.

How to Calculate: Sum of (Time to Detect for each incident) / Total number of incidents
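The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function name mean_time_to_detect and the two sample delays are hypothetical, and each delay is assumed to be already measured as (detection time minus incident start time).

```python
from datetime import timedelta


def mean_time_to_detect(detection_delays):
    """Return the mean of a list of timedelta detection delays."""
    if not detection_delays:
        raise ValueError("need at least one incident to compute MTTD")
    # Sum of (time to detect for each incident) / total number of incidents
    total = sum(detection_delays, timedelta())
    return total / len(detection_delays)


# Hypothetical delays for two unrelated incidents (illustrative values only)
delays = [timedelta(minutes=4), timedelta(minutes=10)]
mttd = mean_time_to_detect(delays)
print(mttd)  # 0:07:00
```

Working with timedelta values rather than bare numbers keeps the units explicit, which helps when delays range from seconds to hours.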

Scenario: An E-commerce Website

Imagine an e-commerce website that sells electronics. One day, a critical bug is introduced during a new feature deployment. This bug prevents customers from adding items to their shopping cart and proceeding to checkout.

Here’s how MTTD would apply:

Incident Start Time:

The bug is deployed and starts affecting users at 10:00 AM.

At this point, customers begin experiencing issues, but the development or operations team is not yet aware of the problem.

Detection Time for Incident 1 (Automated Monitoring):

The website has automated monitoring tools in place (e.g., error logging, performance monitoring).

At 10:05 AM, the monitoring system detects an unusually high number of “add to cart” failures and triggers an alert to the operations team.

Detection Time for Incident 1 = 5 minutes (10:05 AM – 10:00 AM)
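To make the automated detection step more concrete, here is a rough sketch of a threshold alert over a sliding time window. Everything in it is an assumption for illustration: the class name FailureRateAlert, the threshold of 10 failures, the 5-minute window, and the simulated rate of one "add to cart" failure every 30 seconds are not taken from any real monitoring tool or from the scenario itself.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureRateAlert:
    """Fire when at least `threshold` failures fall inside `window` (hypothetical design)."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.failures = deque()

    def record_failure(self, at):
        """Record one failure timestamp; return True if the alert should fire."""
        self.failures.append(at)
        # Drop failures that are older than the sliding window
        while self.failures and at - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.threshold


# Arbitrary date; the bug goes live at 10:00 AM as in the scenario
start = datetime(2024, 1, 1, 10, 0)
alert = FailureRateAlert(threshold=10, window=timedelta(minutes=5))

detected_at = None
for i in range(1, 21):
    # Assumed failure rate: one failed "add to cart" every 30 seconds
    t = start + timedelta(seconds=30 * i)
    if alert.record_failure(t):
        detected_at = t
        break

print(detected_at - start)  # 0:05:00
```

With these assumed numbers the tenth failure arrives at 10:05 AM, so the alert fires five minutes after the bug went live. A lower threshold would detect sooner but would also be more likely to fire on ordinary noise.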

Detection Time for Incident 2 (Customer Report):

Another customer, who didn’t trigger the automated alert threshold, tries to make a purchase at 10:15 AM and encounters the same issue.

They immediately contact customer support, and the support agent logs the issue as a critical bug.

Detection Time for Incident 2 = 15 minutes (10:15 AM – 10:00 AM)

Detection Time for Incident 3 (Manual Check):

A product manager, performing a routine check of key user flows, tries to add an item to the cart at 10:20 AM and notices the problem. They report it internally.

Detection Time for Incident 3 = 20 minutes (10:20 AM – 10:00 AM)

Calculating MTTD for this example:

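Let’s assume these are the only three “detections” that occurred for this specific bug before a resolution process truly began. Strictly speaking, MTTD is usually averaged across separate incidents, and a single incident’s time to detect is measured up to its first detection (5 minutes here). This example treats each detection as its own data point purely to illustrate the arithmetic.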

Total Time to Detect Incidents: 5 minutes + 15 minutes + 20 minutes = 40 minutes

Number of Incidents Detected: 3

MTTD = Total Time to Detect Incidents / Number of Incidents Detected

MTTD = 40 minutes / 3 incidents

MTTD ≈ 13.33 minutes
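The same arithmetic can be reproduced from the scenario's timestamps. The sketch below uses the times given in the example (10:00, 10:05, 10:15, and 10:20 AM); the calendar date and the variable names are arbitrary choices made for illustration. It also prints the earliest detection, since that is the figure many teams would record as the time to detect for a single incident.

```python
from datetime import datetime

# Arbitrary date; only the clock times come from the scenario
incident_start = datetime(2024, 1, 1, 10, 0)
detections = {
    "automated monitoring": datetime(2024, 1, 1, 10, 5),
    "customer report": datetime(2024, 1, 1, 10, 15),
    "manual check": datetime(2024, 1, 1, 10, 20),
}

# Detection delay in minutes for each channel
delays_min = {
    channel: (detected - incident_start).total_seconds() / 60
    for channel, detected in detections.items()
}

mttd = sum(delays_min.values()) / len(delays_min)
first = min(delays_min.values())

print(f"MTTD: {mttd:.2f} minutes")                # MTTD: 13.33 minutes
print(f"First detection: {first:.0f} minutes")    # First detection: 5 minutes
```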

Interpretation:

In this example, the Mean Time To Detect (MTTD) for this critical shopping cart bug was approximately 13.33 minutes. This means, on average, it took about 13 minutes from the moment the bug started impacting users until it was identified by the team (either through automation or manual reports).

Why is a low MTTD desirable?

A low MTTD is crucial because:

Minimizes Customer Impact: The faster you detect an issue, the less time it affects your users, leading to less frustration and potential loss of sales.

Reduces Financial Loss: For an e-commerce site, every minute of downtime or broken functionality can mean lost revenue.

Faster Resolution: Detection is the first step. A low MTTD lets your team start recovery work sooner, which helps keep Mean Time To Recover (MTTR) low and brings the system back to normal faster.

Indicates Effective Monitoring: A consistently low MTTD suggests that your monitoring tools, alerting systems, and incident response processes are working effectively.
