Service Level Agreements – Best Practices and Crucial Elements

Service Level Agreements (SLAs) have been used for many years. Most people working in IT and IT Service Management (ITSM) know the term, and some even think they understand service levels. Despite that long history, SLAs have acquired a bad reputation in many organizations. IT staff feel pressured to deliver against SLAs, and customers are often exasperated when IT tells them the SLAs are being met even though they are not happy with the IT services they receive.

How can this be? Surely using the service-level ideas from best-practice frameworks, such as ITIL (formerly known as Information Technology Infrastructure Library) and COBIT (Control Objectives for Information and Related Technologies), guarantees happy customers and services that satisfy their needs and expectations?

To misquote William Shakespeare: there lies the rub. Far too often, service levels, including their definitions, aims, and specific targets, are designed in isolation from the customer. They are based on outdated ideas from a time when IT thought it was supremely important and knew what the customer wanted. Granted, frameworks such as ITIL recommend that ITSM first obtain the requirements from the customer and then talk with IT to determine whether it can deliver them, resulting in a set of service levels to which IT and the customer formally agree in a service-level agreement.

In the real world, that almost never happened. IT presented some service levels to the customer, who didn’t fully understand what they meant but, after a short discussion, agreed to them and signed the SLA.

 

An example drawn from life

Availability, a very commonly used service level, illustrates the situation. Availability is usually expressed as a target percentage: “The system will be available 99.95% of the time.” The customer looks at that and thinks it’s good. The customer asks for 100%, but IT replies that although 100% availability is (just) possible, it would cost a considerable amount of money. The customer therefore agrees to 99.95% and signs an SLA to that effect. All is well for several months. Then, over a weekend, IT implements a new software release. Unfortunately, some issues occur, and the system is unavailable for an hour, until mid-morning on a busy trading day. At the end of the month, IT publishes the service-level report for the period. The users are angry because, although they and their businesses suffered a major disruption for an hour, the service report shows that all SLAs were met, with a system availability of 99.99% after rounding to two decimal places. The IT director can’t understand why the users are unhappy, because the IT team has exceeded the service levels.

The following month the scenario is repeated, with no service for an hour on a Monday morning. Availability is now reported as 99.98%. The users are still unhappy, and now so is the IT director, who accuses the IT team of slacking because the reported figure has slipped.

This phenomenon is known as the “watermelon effect” after the color status, Red/Amber/Green, which is often used to illustrate service-level achievement. In the example above, the service level is reported as green when viewed from the outside, but inside, from a customer perspective, it is red. Just like a watermelon!

How did this happen? Easy. It’s mathematics. No one told IT or the users that the availability calculation was based on the total downtime over the preceding 12 months, counting 24 hours a day, 7 days a week. A year that isn’t a leap year has 525,600 minutes. For a target of 99.95% measured on that basis, the system can be unavailable for a total of about 262 minutes during the year without breaching the SLA. In the example above, the system was unavailable for a total of 120 minutes over that period, still well within the service level.
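The arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and it assumes a non-leap year measured around the clock, as described above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(target_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (100 - target_pct) / 100


def availability_pct(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Unrounded availability percentage over the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


budget = allowed_downtime_minutes(99.95)     # downtime allowed per year
after_first_outage = availability_pct(60)    # one 60-minute outage
after_second_outage = availability_pct(120)  # two 60-minute outages

print(f"Downtime budget at 99.95%: {budget:.1f} minutes")
print(f"After 60 minutes down:  {after_first_outage:.2f}%")
print(f"After 120 minutes down: {after_second_outage:.2f}%")
```

Two full hours of disruption consume less than half of the yearly budget, which is why the report stays green.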

 

Best practice #1 – Understanding is key

It is crucial that everyone involved in setting, agreeing, achieving, managing, and using service levels fully understands how each service level is defined and how its achievement is calculated. If the customer in this example had been told that availability would be calculated over 24 hours a day, 7 days a week, totaled across the last year, then he or she would probably have rejected it. A much better service level would have counted only the hours the customer works or the business is open, and measured them over a single reporting period.
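To see how much the measurement window matters, the same one-hour outage can be scored against two different windows. The sketch below is illustrative only: the business-hours window of 21 working days of 8 hours each is an assumed figure, not one taken from the example.

```python
TARGET_PCT = 99.95
OUTAGE_MINUTES = 60

# Measurement windows, in minutes.
ROLLING_YEAR = 365 * 24 * 60    # 24x7 over a non-leap year
BUSINESS_MONTH = 21 * 8 * 60    # assumption: 21 working days of 8 hours


def availability_pct(downtime_minutes, window_minutes):
    """Availability percentage for a given measurement window."""
    return 100 * (1 - downtime_minutes / window_minutes)


results = {}
for label, window in (("rolling year, 24x7", ROLLING_YEAR),
                      ("one month, business hours", BUSINESS_MONTH)):
    pct = availability_pct(OUTAGE_MINUTES, window)
    results[label] = "met" if pct >= TARGET_PCT else "breached"
    print(f"{label}: {pct:.2f}% ({results[label]})")
```

The outage is identical in both cases; only the definition of the service level changes whether it counts as a breach.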

 

Best practice #2 – Outside-in thinking

Service levels should always be defined from the “outside in”: expressed in terms the users understand and matched to what they need to do their jobs. Always think from the viewpoint of the customer, outside of IT.

Imagine a different approach to the example. The users explained that they only used the systems on weekdays from 9 am to 5 pm, with a lunch break from 1 to 2 pm. Any loss of service during those working hours severely affected their work and how their companies did business. Outside those hours, they didn’t care whether the systems were available. The Human Resources (HR) function could manage without its particular system for as long as 4 hours. All of the other functions in the business could manage without their systems for 30 minutes, but there couldn’t be more than one outage per week.

The new head of ITSM defined new service levels that were much more closely aligned with the user requirements. The new availability targets set a maximum outage duration of 30 minutes for all systems apart from the HR system, which had a maximum of 4 hours. A further service level stated there could be no more than one outage per week. Achievement against these targets was calculated and reported every Monday, using the previous week’s data, plus a rolling summary of the previous 4, 12 and 52 weeks. Percentages were no longer calculated, as they meant nothing to the users.

Because the service levels now matched what the users required and understood, they now saw a true picture of how well IT fulfilled their requirements. IT now had a much better understanding of what the users needed, so it could better plan to satisfy their requirements.
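A weekly check against service levels of this kind might look like the following sketch. The `Outage` record and `weekly_findings` function are hypothetical, and it assumes outage durations are already measured within working hours and that an outage of exactly the limit still counts as acceptable.

```python
from dataclasses import dataclass

DEFAULT_MAX_OUTAGE_MINUTES = 30      # all systems except those listed below
MAX_OUTAGE_MINUTES = {"HR": 4 * 60}  # the HR system tolerates up to 4 hours
MAX_OUTAGES_PER_WEEK = 1


@dataclass
class Outage:
    system: str
    minutes: int  # duration within working hours only (assumption)


def weekly_findings(outages):
    """Return a list of plain-language breaches for one week's outages."""
    findings = []
    if len(outages) > MAX_OUTAGES_PER_WEEK:
        findings.append(
            f"{len(outages)} outages this week (limit {MAX_OUTAGES_PER_WEEK})")
    for outage in outages:
        limit = MAX_OUTAGE_MINUTES.get(outage.system, DEFAULT_MAX_OUTAGE_MINUTES)
        if outage.minutes > limit:  # exactly at the limit is treated as met
            findings.append(
                f"{outage.system} down {outage.minutes} min (limit {limit})")
    return findings


print(weekly_findings([Outage("Trading", 60)]))
print(weekly_findings([Outage("HR", 120)]))
```

The output is a list of statements users can read directly, with no percentages involved.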

 

Best practice #3 – Drive desired behaviors

Service levels drive behaviors. Best practice is to first understand what behavior is optimal, then design a service level that will influence people to behave that way. In addition, think hard about any harmful behaviors that might result. Human beings are very good at finding ways to meet a service level that do not deliver the intended results. Lastly, and most importantly, once the service levels are in use, the actual behaviors must be observed and scrutinized.

To illustrate the point, consider a service desk of a few years ago whose users complained that the agents took ages to answer their phone calls. The director in charge of the service desk introduced a new service level he thought would fix the problem:

“100% of telephone calls must be answered within four rings.”

Instead of the situation improving, user complaints increased, because now, most of the time, no one at the service desk answered their calls at all. The director visited the service desk to find out what was going on. Half of the phones were ringing and no one was answering them. The director demanded to know why, and he soon received the answer. If anyone answered a phone that had rung more than four times, then the service level was immediately breached, so the agents left those calls unanswered. This was an unintended behavior!

Hence, the behavioral implications of every service level must be completely considered before introduction.
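One way to spot this kind of gaming is to compute the measure twice, once the way it is reported and once including the calls that were never answered. The following sketch is hypothetical: how the service desk in the story actually counted calls is not known, and treating the ring limit as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    rings_before_answer: Optional[int]  # None means the call was never answered


def pct_answered_in_time(calls, max_rings, count_unanswered):
    """Percentage of calls answered within max_rings (inclusive, by assumption).

    If count_unanswered is False, unanswered calls are silently dropped from
    the calculation, which is the loophole described in the story.
    """
    pool = calls if count_unanswered else [
        c for c in calls if c.rings_before_answer is not None]
    if not pool:
        return 100.0
    in_time = sum(1 for c in pool
                  if c.rings_before_answer is not None
                  and c.rings_before_answer <= max_rings)
    return 100 * in_time / len(pool)


calls = [Call(2), Call(3), Call(None), Call(None)]
reported = pct_answered_in_time(calls, max_rings=4, count_unanswered=False)
experienced = pct_answered_in_time(calls, max_rings=4, count_unanswered=True)
print(reported)     # 100.0
print(experienced)  # 50.0
```

A large gap between the two figures is a sign that the service level is driving the wrong behavior.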

 

Best practice #4 – Set realistic service levels

In this last example, it was unrealistic to expect 100% compliance with the 4-ring service level. There will always be situations that prevent 100% of service levels being met. Setting an unachievable service level can have an unintended consequence: IT will completely disregard it. It is human nature to ignore a target that you know can’t be met, so a service level shouldn’t be set unless there is a very high probability it can be achieved. It doesn’t have to be easy to achieve, since stretch targets can drive improvements, but the IT staff must buy into the concept. No one likes to fail.

 

Best practice #5 – Review service-level targets

Too many organizations set service levels and then never change them. Unachievable service levels are ignored, and those that are easily achieved cause IT to become complacent. It is highly likely that the first service-level targets you set won’t be perfect. The service levels themselves, not just achievement against them, should be reviewed regularly and changed if necessary. If a target is always being exceeded, then you might be spending too much money by over-engineering the technology for the service. This comes back, again, to understanding what users actually need. Gold-plated service levels can look good, but their cost can exceed their true value. Where a higher level of service is genuinely valuable to users, tightening service levels that are consistently achieved will set a challenge for IT, forcing the staff to think about how to improve the delivery of service.

 

Best practice #6 – Use precise definitions

As the first example about availability showed, a precise definition of each service level is crucial; otherwise, simple errors can occur. Service levels for fixing issues are sometimes stated as, “95% of all incidents will be fixed within 5 minutes.” If an incident was fixed in 4 minutes, does it meet the service level? What about a fix of exactly 5 minutes? What does “fixed” mean? Who says it’s fixed, IT or the user? When does the clock start: when the user has the issue, when he or she reports it, or when the service desk has passed the incident to IT?

It’s a good idea to draft the wording of every service level carefully, to define specific terms where necessary, and, when using absolute numbers, to be very clear about whether the stated number is itself included. The example could have been better worded as, “95% of all incidents will be resolved to the satisfaction of the user in 5 minutes or less from when the service desk records the incident.”
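A precise definition translates directly into an unambiguous calculation. The sketch below encodes the reworded service level under stated assumptions: the clock starts when the service desk records the incident, stops when the user confirms resolution, and a resolution of exactly 5 minutes counts as met. The function names and timestamps are illustrative.

```python
from datetime import datetime, timedelta

RESOLUTION_LIMIT = timedelta(minutes=5)
TARGET_PCT = 95.0


def resolved_in_time(recorded_at, user_confirmed_at):
    """True if the user confirmed resolution in 5 minutes or less (inclusive)."""
    return user_confirmed_at - recorded_at <= RESOLUTION_LIMIT


def achievement_pct(incidents):
    """Percentage of (recorded_at, user_confirmed_at) pairs resolved in time."""
    met = sum(1 for recorded, confirmed in incidents
              if resolved_in_time(recorded, confirmed))
    return 100 * met / len(incidents)


start = datetime(2024, 1, 8, 9, 0, 0)  # arbitrary example timestamp
incidents = [
    (start, start + timedelta(minutes=4)),             # met
    (start, start + timedelta(minutes=5)),             # met: exactly 5 minutes
    (start, start + timedelta(minutes=5, seconds=1)),  # not met
]
pct = achievement_pct(incidents)
print(f"{pct:.1f}% resolved in time, target {TARGET_PCT}%")
```

Changing `<=` to `<` flips the result for the second incident, which is exactly the kind of ambiguity the wording has to settle.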

 

Best practice #7 – Manage disputes

You should design a process to manage any service-level disputes. This is particularly important when external suppliers provide the services. Typical disputes include disagreements about the meaning of a service level, disagreements about the calculations used to determine whether it has been achieved, and claims by a supplier that it wasn’t responsible for a failure to meet the service level. Such disputes can be a regular occurrence in multi-supplier or SIAM (service integration and management) ecosystems. A good approach is to encourage suppliers to stay focused on providing good service to users while any service-level disputes are managed in parallel.

 

Best practice #8 – Excusing causes

There can be very good reasons why a supplier fails to achieve a service level. It might have received incorrect or insufficient information from the user or the service desk, or another supplier could have been the actual cause of the failure. A mechanism known as an “excusing cause” can be very useful. If a supplier wants to claim it wasn’t responsible for a service-level breach, then, after the incident has been fixed, the supplier submits a formal claim detailing why it thinks it was not the cause. The ITSM team reviews the claim from the customer’s point of view, gathering evidence from the suppliers and users involved. If ITSM agrees with the claim, then the service-level achievement can be re-reported, discounting this particular event.
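Re-reporting after an accepted claim can be as simple as recalculating without the excused event. This is a sketch only: the event records are invented, and removing the excused event from the calculation entirely (rather than counting it as met) is one possible reading of “discounting”.

```python
def achievement_pct(events, discount_excused=False):
    """Percentage of events that met the service level.

    Each event is a dict with boolean keys "met" and "excused" (hypothetical
    field names). With discount_excused=True, excused events are left out of
    the calculation entirely.
    """
    counted = [e for e in events
               if not (discount_excused and e["excused"])]
    met = sum(1 for e in counted if e["met"])
    return 100 * met / len(counted)


# 18 events met the service level; 2 breached it, one with an accepted claim.
events = [{"met": True, "excused": False} for _ in range(18)]
events.append({"met": False, "excused": False})
events.append({"met": False, "excused": True})

as_first_reported = achievement_pct(events)
re_reported = achievement_pct(events, discount_excused=True)
print(f"First reported: {as_first_reported:.1f}%")
print(f"Re-reported:    {re_reported:.1f}%")
```

Keeping both figures available makes it clear what the users experienced as well as what the supplier is held accountable for.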

 

Best practice #9 – Critical service-level triggers

As well as setting service-level targets to be achieved, it is good practice to set a critical level, which is worse than the target. If this critical level is reached, then it triggers specific actions, such as calling the boss of the supplier at fault, withholding a proportion of service payments or, at worst, issuing a formal notice to terminate the contract. Repeated failures, where the same service level is breached period after period, can also be treated as reaching the critical level.
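The trigger logic can be sketched as follows. All thresholds are assumptions chosen for illustration: a target of 95%, a critical level of 85%, and three consecutive breached periods counting as a repeated failure.

```python
def classify(achieved_pct, target_pct, critical_pct):
    """Classify a single period as 'met', 'breached' or 'critical'."""
    if achieved_pct >= target_pct:
        return "met"
    if achieved_pct > critical_pct:
        return "breached"
    return "critical"  # the critical level has been reached or passed


def latest_status(history, target_pct=95.0, critical_pct=85.0, repeat_limit=3):
    """Status for the most recent period, escalating repeated breaches.

    history is a list of achieved percentages, oldest first. If the last
    repeat_limit periods all missed the target, the status is 'critical'
    even when no single period reached the critical level.
    """
    recent = history[-repeat_limit:]
    repeated = (len(recent) == repeat_limit
                and all(pct < target_pct for pct in recent))
    if repeated:
        return "critical"
    return classify(history[-1], target_pct, critical_pct)


print(latest_status([97.0]))                    # met
print(latest_status([96.0, 94.0]))              # breached
print(latest_status([80.0]))                    # critical
print(latest_status([96.0, 94.0, 93.0, 92.0]))  # critical
```

What happens on a "critical" result, such as escalation, withheld payments or a termination notice, is a contractual decision and sits outside the calculation.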

 

Summary

Best practice in service levels is all about using your brain. Think through every service level from the perspective of both IT and users. Be specific about how your service levels are defined, and understand the behaviors and consequences of each, both intended and unintended.

Cover image by Sharmila