How to deliver bad news to customers - Part 2

How to respond to customers when your systems break!

Jun 14, 2023

In the previous newsletter we talked about how to tell your customer that you are not going to deliver on a feature commitment. This post gets into how to communicate with your customers when your systems suffer an outage.

Again, the most important thing your customers are looking for is transparency. Yes, they care about their software starting to work again, but they care more about the details of the outage and also about how timely those details are made available to them.

Broadly, they care about three things: 1) What happened? 2) How does the outage affect them? 3) What are you doing to prevent issues like this from happening again?

You need a robust multi-channel approach to communicating with customers in the event of software outages. At a minimum, every software company needs the following channels in place to communicate effectively with its customers:

A web-based status page where your customers can get the overall status of the software they are paying for, e.g., https://health.aws.amazon.com/health/status
An internal incident management system that keeps all relevant internal stakeholders (product teams, customer support, customer success, etc.) informed, so that they are ready to tackle questions from customers
An online community where your customers can interact with their service provider (in this case, you and your company) and with other customers. Some examples of online communities: https://support.zendesk.com/hc/en-us/community/topics, https://aws.amazon.com/developer/community/
A mass emailing system, e.g., Mailgun, Twilio, etc.
If you sell to enterprises, you probably also have dedicated account managers for your customers

OK, so your system goes down. The first thing to do is to update the status page to acknowledge the outage. You don’t have to put in details if you don’t know them yet, but you have to put something on the status page to make sure your customers are informed. As your team works through the issue, keep the status page up to date. My general guideline to my teams is to update the status page every fifteen minutes, even if there is nothing to report beyond saying that the team is still working on it. As your team uncovers more details, add them to the status page, focusing specifically on what matters to your end customer. Here is an example from my past:

‘Due to a vendor disruption, media encoding in the application may be delayed but functional. If you encounter any issues with a video/file stuck encoding, you may save the page and try again later. We will update this page as soon as we have more information to share.’

Here is an example of all the other status messages associated with the incident above. I have removed some of the updates for the sake of brevity.

Identified - Due to a vendor disruption, media encoding in the application may be delayed but functional. If you encounter any issues with a video/file stuck encoding, you may save the page and try again later. We will update this page as soon as we have more information to share.

Update - The vendor has implemented a fix, and anticipates that encoding queues will be cleared by 7pm PST. We are continuing to monitor for any performance improvements or changes with media encoding services and other queues. If you encounter any issues uploading media, we recommend saving and returning to the page later. Thank you for your patience, and we will post updates to this page as soon as we have more information to share.

Resolved - The incident has been resolved. The vendor has finished processing the backlog of uploaded media.

The common theme across these updates is precision:

They are precise about what is not working and what is
They are precise about when the issue started happening
They are precise about how the incident affects the end customer and any actions they need to take

For partial outages (as in the case above), you don’t need to do much beyond posting regular updates to the status page and making sure your customers know the URL for the status page.
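
The fifteen-minute cadence is easy to lose track of in the middle of an incident, so it can help to have something nag whoever owns the status page. Here is a minimal Python sketch of that check. It assumes nothing about any particular status page product: the StatusUpdate class, the function names, and the timestamps are all hypothetical, and the state labels simply mirror the Identified / Update / Resolved example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: the interval is the fifteen-minute guideline from this post.
UPDATE_INTERVAL = timedelta(minutes=15)


@dataclass
class StatusUpdate:
    """One entry on the status page (hypothetical structure)."""
    state: str           # "Identified", "Update", or "Resolved"
    message: str
    posted_at: datetime


def next_update_due(updates):
    """Return when the next update is due, or None once the incident is resolved."""
    if not updates:
        raise ValueError("post an initial update as soon as the incident starts")
    latest = max(updates, key=lambda u: u.posted_at)
    if latest.state == "Resolved":
        return None
    return latest.posted_at + UPDATE_INTERVAL


def is_update_overdue(updates, now):
    """True if the status page has gone longer than the interval without an update."""
    due = next_update_due(updates)
    return due is not None and now >= due


# Example with made-up timestamps
start = datetime(2023, 6, 14, 17, 0)
updates = [StatusUpdate("Identified", "Media encoding may be delayed but functional.", start)]
print(is_update_overdue(updates, start + timedelta(minutes=10)))  # False
print(is_update_overdue(updates, start + timedelta(minutes=20)))  # True
```

In practice you would wire a check like this into whatever paging or chat tool your team already uses, so the reminder lands with the one person in charge of the status page.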

In the case of full application outages, where everything is down for everyone, or in the event of a security incident (large-scale phishing, compromised accounts, etc.), you as the leader need to address your customers directly. This is where the pitchforks will be pointed at you.

Again, the best way to show how to write a production incident follow-up is with an example. This is what I wrote when our managed Redis instance (used for caching) failed, bringing down the entire application for an hour:

“Hey, all! My name is Mahesh and I run Engineering here at <company_name_redacted>. First off, I want to deeply apologize for the disruption this incident has caused your business.
I want to share what we know so far about what happened, and what steps we are taking to mitigate future occurrences of this issue.
We have isolated the problem to a vendor product we use for caching application requests. We noticed this morning that caching servers were taking longer than usual to respond to requests. At that point, we engaged with the vendor to help us mitigate the issue. We’ve taken additional steps (spun up an additional caching server) to reduce response times down to normal levels. We have a few more changes scheduled for later tonight that we expect to protect against this issue. However, we are still investigating what exactly caused the cache cluster to start behaving abnormally in the first place. My team is actively working with the vendor to get to the bottom of this.
Additionally, we are accelerating our efforts (which were already underway) to continue to invest in our infrastructure, including caching (we are strongly considering moving to a new provider), such that we are better supporting and serving your businesses.
I really apologize for the disruption and impact this had on your businesses and for that I am truly sorry, and appreciate your patience as we work to get this resolved as quickly as possible.”

This message shares a lot of similarities with the status page updates mentioned a few paragraphs above, but is different in one big meaningful way. It is personal. It shows that I, the leader in charge, personally care about the incident.

In the event of a large-scale system outage that caused tremendous disruption for your customers, the single most important thing they are looking for is transparency. Not just transparency from the company, but transparency from an actual person. Specifically, they want to see accountability from the person directly responsible for the upkeep of the system they depend on daily. Just owning up to the outage and responding directly to customers will go a long way toward earning back their trust. You score even more points if you allow two-way communication with your customers, for example by engaging with them on social media and answering questions. Yes, you will get some (and in some cases, a lot of) hate from your customers in online interactions. I typically don’t engage with any sniping that comes my way, but I do engage with customers who are genuinely worried about the future of the software. For example, I won’t respond to posts like

‘Clearly you are incompetent, and the company needs a new engineering leader’

And yes, that is a real quote that got thrown at me by a customer.

But I do respond to questions like

‘There have been several incidents in the past few months, what is going on?’

With something like

‘Hey X, I apologize for the disruption we caused you. There have been 3 partial incidents (fixed in under 15 minutes) and 1 full application outage (fixed in 60 minutes) in the last six months. There wasn’t any common theme across the incidents except human error. To reduce the chance of human errors happening, we have increased our investment in test automation. So far, those errors haven’t happened again. I am sure there will be more human caused production incidents as we ship new features, but we will continue to keep working on creating a good balance between velocity of shipping and quality.’

The key is to respond with empathy, data, honesty and accountability.

If you have VIP customers, just sending an email or engaging with them on social media won’t be enough. You will have to get on the phone with them. The script is pretty much the same as above, but you have to personalize the conversation. When you talk to them, you have to show that you understand their business and the impact your system outage had on their bottom line. Lastly, walk them through the details of what caused the incident, why the mitigation steps took as long as they did, and the future roadmap items that will prevent issues like this from happening again.

Personal Anecdote

In a galaxy far, far away, I was a senior leader at an e-commerce company. I had inherited a set of systems that had extensive stability issues. Thankfully, the engineers I inherited were actually pretty good, so as soon as I joined, I refocused most of them on fixing the stability issues.

One of the operational issues I had to deal with was the amount of spam flowing through our systems. Every software company has its share of bad customers who want to subvert the software in malicious ways. One of the core features of our product was the ability to send mass emails, which was probably more loved by the bad actors than by our legitimate customers, and they loved to send spam. Spam is annoying to the people who receive it, but to the systems that send it, it can be a death knell. If your servers send too much spam, they will start getting blocked by various inbox providers like Gmail, Microsoft, Yahoo, etc. This means that ALL of the email you send will be blocked. Which is what happened to us.

We got blocked by Spamhaus, which is a very influential authority when it comes to email reputation. If Spamhaus thinks you are a bad actor, lots of inbox providers will think so as well. Once Spamhaus blocked us, 80% of all emails our system was sending were not getting delivered to customers. The email system was a critical component of our customers’ workflow, and when it went down, the pitchforks went up.

They were PISSED. They yelled at us through all available channels: social media, customer support, and email. One customer actually yelled at me on LinkedIn. In the beginning, I was just frustrated. In my mind, I was thinking, why don’t customers understand that yelling at me or my team is just distracting us from solving the issue?

In a moment of impossible serendipity and a bit of madness, I decided to read all the hate mail that I had gotten. I painfully went through a lot of it. Amidst the pitchforks, I noticed that there was actually a real ask. I realized that the biggest reason they were angry was that they didn’t know what the hell was going on, and nobody was able to answer their basic questions, like what had happened to their emails.

When I looked at the status page, I noticed that it had only one update on it since the incident started. Immediately, I put one person in charge of keeping that page up to date with information. I also brought the support team up to speed on what was going on, so that they could answer customers’ questions when they called the support line. Once regular updates started flowing through to customers, some of the anger died down. However, until the emails started flowing again, the fire wasn’t going to be put out.

The only way you can get Spamhaus to unblock you is to convince them you are not a wholesale supplier of spam, and the easiest way to do that is to show them that you are actually blocking spam from going out. Internally, we started aggressively blocking the spam accounts. After three days of aggressively removing bad actors and duct-taping some of the critical holes in the system, Spamhaus finally removed us from their block list. We were finally able to breathe again. I thought that fixing the issue would herald the lowering of the pitchforks, but I was mistaken. Multiple customers started social media threads calling into question our system’s overall reliability and, by proxy, the company’s credibility. I decided to address the customers directly via a social media post.

In the post, I first introduced myself to the group, as I was relatively new to the company. It is only fair that the customers knew the name of the person they were yelling at. Then I went into all the gory details about what happened and how we ended up getting blocked by Spamhaus. The truth was, we hadn’t been paying attention to spam for a very long time, with new, sexy, customer-facing features consistently getting prioritized over unsexy spam-blocking initiatives. The sins of the past are what put us in email jail, and I was pretty self-critical about it in my message to the customers. I didn’t blame a single person or a team for it; I blamed myself. It doesn’t matter who on your team made the wrong decision; in the end, the failures are your own as the leader of the team. Then I went into all the projects we were going to take on to prevent issues like this from happening again. Some of those changes would actually make it harder for customers to send email. We were going to make doubly sure customers were getting opt-ins from their own customers before sending them emails, which would introduce some additional work for our customers. I laid all of that out in my post as well, making it clear to them that I needed their help to keep the email flowing without disruption.

When I made it clear to customers that I wanted to make them a part of figuring out the solution, the temperature in the community went down considerably. I am not entirely sure if all companies have similar customers, but most customers just want to be included in your thought process. I wonder if most people are like that.

So to recap, the things that helped me calm my customers down were being self-critical, giving them the details, and including them in designing the fix. Oh, and also some refunds.

A quick recap of takeaways from this chapter

Understand your customer and personalize your responses to them
Take ownership and show accountability
Lean into transparency. Don’t be afraid to get into the details with your customers.
Develop a thick skin

Caroline Sheldon

I like the insights in this article. And I agree that it is all about transparency. One thing missing in the "system breaks" scenario is how you get your users to your status page in the first place. Example - Yesterday, AWS-east had an outage which affected a product I was using. The product just didn't load, but I had no feedback as a user as to what was happening. I believe that when possible, the product experience should also provide a message to the user and direct them to that web status page that you referenced.

1 reply by Mahesh Guruswamy

The Sensible Manager
