I function as our company's ITIL Problem Manager, both managing the process and leading Root Cause Analysis efforts.
The Problem Management Process I manage, along with the flow that a typical problem traverses, an example Problem Record from the tool (Redmine) we use to track Problems, what we do during review meetings, and how we report monthly and quarterly (more examples) to management.
Templates for reporting Problems upstream
I view Problem Management as an IT-specific instance of Risk Management and view its theoretical underpinnings in ways consonant with the following:
My favorite methodology for managing an RCA, which comes with a backing checklist, is from Advance7 and is described in detail in their Rapid Problem Resolution book and in various white papers.
I facilitate a hands-on workshop in which participants split into small groups and practice a simplified version of the RPR Methodology along with analysis skills, working through real-world RCAs.
During RCAs, I often set long-running packet captures going and later extract key frames from directories full of the capture files.
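A minimal sketch of the extraction step, assuming the captures sit in one directory as `*.pcap` files and that tshark is installed. The function names, the output naming scheme, and the display filter in the usage note are hypothetical; the only tshark options used are `-r` (read a capture file), `-Y` (apply a display filter), and `-w` (write the matching packets to a new file).

```python
import subprocess
from pathlib import Path


def build_extract_commands(capture_dir, display_filter, out_dir, pattern="*.pcap"):
    """Return one tshark command per capture file, in sorted filename order.

    Nothing is executed here, so the commands can be reviewed first.
    """
    out = Path(out_dir)
    commands = []
    for capture in sorted(Path(capture_dir).glob(pattern)):
        # tshark writes pcapng by default, so name the output accordingly.
        target = out / f"{capture.stem}-filtered.pcapng"
        commands.append([
            "tshark",
            "-r", str(capture),      # read this capture file
            "-Y", display_filter,    # keep only frames matching the filter
            "-w", str(target),       # write the kept frames here
        ])
    return commands


def extract_frames(capture_dir, display_filter, out_dir):
    """Run tshark once per capture file; raises if any run fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in build_extract_commands(capture_dir, display_filter, out_dir):
        subprocess.run(command, check=True)
```

For example, `extract_frames("captures", "tcp.analysis.retransmission", "extracted")` would leave one small file of retransmitted segments per original capture, which is usually easier to open and compare than the full set.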
What Takes Us Down?, published in the October 2012 issue of ;login:. My analysis of this data set suggests that timely Patching and proactive Testing can convert Unplanned incidents into Planned events, although I admit that the argument isn't compelling.
Summary of the data set:
I've charted statistics extracted from the database in several ways, none of which tells me a persuasive story. Note that the database starts October 2010 and ends June 2012.
- Count of Drama Events by Year: OK, looks like we experience fewer painful events these days than we did early in the century; perhaps we've become smarter in how we manage our environment or perhaps software has become more reliable.
- Duration of Drama Events by Year: this chart sums the time spent during really painful events -- it illustrates the same trend as the count chart above.
- Outages Activity by Month, Sum: this chart sums the count of Planned and Unplanned Outages and buckets them by month. The result suggests that January - April and September are bad months to be an IT service, and that techs perform less Planned work in August and September. But then, perhaps the distinctions aren't statistically significant and we're just seeing noise here.
- Outages Activity by Month, Average: same as above but averaged per month.
- Outages Activity by Year: we had a spike of Unplanned events in 2001 and then again in 2011 -- significant or not?
- Severity Count by Year: that spike in 2011 was driven by a surge in Minor events.
- Severity Duration by Year: a few long-running Minor and Major events in 2001 wash out the utility of this chart.
- Window Count of Planned Events by Year: we perform most of our Planned work during Shoulder time, meaning outside regular business hours (Prime) but not within our internal SLA-defined maintenance windows (SLA).
- Window Count of Unplanned Events by Year: most of our Unplanned events arrive during business hours (Prime). I suppose that makes sense: our users are exercising the environment and uncovering Software Bugs during these hours.
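The "significant or just noise?" question raised for the by-month chart can be checked with a chi-square goodness-of-fit test. The sketch below assumes the null hypothesis that outages are spread evenly across the twelve months, which ignores differing month lengths and any partial years in the data. The function names and the sample counts are made up for illustration and are not taken from the database.

```python
# Chi-square critical value for 11 degrees of freedom at the 5% level.
CRITICAL_DF11_P05 = 19.675


def chi_square_uniform(counts):
    """Chi-square statistic for observed counts against an even spread."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


def looks_like_noise(monthly_counts):
    """True when twelve monthly counts are consistent with an even spread.

    "Consistent" here means the statistic stays below the 5% critical
    value, so the even-spread hypothesis cannot be rejected.
    """
    if len(monthly_counts) != 12:
        raise ValueError("expected one count per calendar month")
    return chi_square_uniform(monthly_counts) < CRITICAL_DF11_P05


# Hypothetical January..December outage counts, not real data.
made_up_counts = [14, 13, 15, 14, 9, 10, 9, 8, 13, 10, 9, 10]
print(looks_like_noise(made_up_counts))
```

With counts this close together the test cannot distinguish the busy months from chance, even though a bar chart of the same numbers looks like a clear seasonal pattern. A much more lopsided distribution would be needed before the monthly differences counted as evidence.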