Strategies for creating concise, efficient communication between teams during incidents and operational surprises
Being on-call is stressful. On-call responders must be ready to spring into action whenever an alert comes in. Because on-call work is so demanding, communications must be precise and descriptive to minimize confusion and help responders assess and remedy the situation quickly.
With a diverse ecosystem of incident response, site reliability engineering, and on-call tools that seem to expand daily, there is no shortage of technologies to keep incident responders and stakeholders engaged and informed. However, these tools can’t automate the actual person-to-person communication between team members. While writing notifications and preparing runbooks in advance is possible, each incident requires team members to communicate outside pre-defined templates and procedures.
Handling the dynamic nature of an emerging incident is already tricky; why complicate it with confusing communication? Instead, organizations should help on-call responders work efficiently by strengthening the communication skills used during an incident.
Below we’ll share a high-level look at the importance of using unambiguous language during incident response. We’ll also outline strategies for creating concise, efficient communication between teams during stressful situations like high-severity incidents.
Everyone communicates differently. There are slight vocabulary differences across regions, organizations, and individual teams; idiosyncratic phrases emerge when discussing or solving a shared challenge.
These variations rarely introduce confusion or disruption in everyday communication. However, with tensions running high during an incident, any misunderstanding can lead to wasted time and potentially aggravate the situation by adding unnecessary frustration and friction.
Using familiar and unambiguous language, paired with practical and concise communication practices, will enable your team’s success. Here are some strategies to consider:
Although every incident varies slightly, nearly all are time-sensitive. When incident response is delayed, lost time is lost money, and the business bears the costs.
For that reason, nothing is more vital in incident communication than clarity. Every extraneous word, whether written or read, detracts from incident resolution.
When writing a message to an on-call responder, consider the reader. I can assure you that on-call responders do not want to look at a wall of text when they are woken up at 3 a.m. to respond to an incident. So keep it simple, using just the facts.
There is no shortage of jargon in DevOps. Unfortunately, while some terms help describe concepts and technologies, others can be more confusing than illuminating. This applies especially to newer terms that are not yet widely used and to terms that only a single team uses.
Every team’s on-call vocabulary will contain some amount of jargon and field-specific phrases. These terms are often the most effective way to communicate specific concepts. However, it is vital not to introduce new jargon during an incident. Any terminology used during the incident should be commonly used and understood across all teams. Strive for a mutually intelligible, shared vocabulary across the workplace.
Generic messages might help quickly report and initiate an incident, but they do not necessarily incite action. In addition, poor communication creates unnecessary stress for on-call responders, who may overreact (or underreact) if they cannot assess the scope of an incident and understand their role in an incident.
It is essential to tailor messages to the on-call responder’s needs to accelerate response time. Provide the most relevant and actionable information based on their role, skills, and responsibility in addressing the incident.
Consider what information will provide the fastest and most effective response. The most relevant details will pertain to the team member’s role in remediating the incident. For example, a security breach might require quick access to logs and audit trails, while a back-end infrastructure incident requires looking at in-depth metrics and network information.
Regardless of the situation, the language and the information should set expectations of an on-call responder’s role in an incident.
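As a rough illustration of tailoring a message to a responder's role, here is a minimal Python sketch. Everything in it is hypothetical: the `Incident` fields, the `ROLE_RESOURCES` mapping, the `build_page` helper, and the example URL are assumptions made for illustration, not part of any particular paging tool. It keeps the message to a few factual lines, states what is expected of the responder, and points to the data most relevant to their role.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical container for the facts a responder needs first."""
    title: str
    severity: str
    impact: str
    details_url: str


# Assumed mapping from responder role to the data that role should check first.
ROLE_RESOURCES = {
    "security": ["auth logs", "audit trail"],
    "infrastructure": ["service metrics", "network status"],
}


def build_page(incident: Incident, role: str, action: str) -> str:
    """Build a short, role-specific message: the facts, the ask, and a link."""
    lines = [
        f"[{incident.severity}] {incident.title}",
        f"Impact: {incident.impact}",
        f"Your role ({role}): {action}",
    ]
    resources = ROLE_RESOURCES.get(role, [])
    if resources:
        lines.append("Start with: " + ", ".join(resources))
    # Link to the full incident data instead of pasting it into the message.
    lines.append(f"Details: {incident.details_url}")
    return "\n".join(lines)


if __name__ == "__main__":
    incident = Incident(
        title="Login failures spiking",
        severity="SEV1",
        impact="Users cannot sign in",
        details_url="https://example.com/incidents/123",  # placeholder URL
    )
    print(build_page(incident, "security", "Check for credential stuffing"))
```

The same incident sent to an infrastructure responder would carry a different action and a different "Start with" line, while the headline facts stay identical for everyone.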
When it comes to incident response, many teams are eager to provide every detail that might support resolution. However, burying the most critical information among less important details can inadvertently delay response and recovery.
Failure to provide easy access to the essential incident data is equally harmful, as critical metrics and incident reports should be immediately accessible. If they are not in the initial message, the incident response interface should provide an easy way to access the data.
Incidents can be pretty stressful, but it’s critical to maintain a cool head and reflect this calm in your communication. Remain positive, non-emotional, composed, and precise. Approach the incident knowing the on-call responders can resolve it and strive to provide them with the most pertinent data they need.
It is vital to cultivate a blame-free culture in all incident response strategies. Responders must be unafraid to ask questions and engage collaboratively to resolve incidents. Equally important is to recognize failures as opportunities to learn and drive improvements across the ecosystem. Avoid using emotionally charged and negative words or phrases that could cause friction or derail recovery actions.
Additionally, during the resolution process and incident retrospective, do not allow individual blame to manifest. Avoid communicating details that might implicate an individual or team as responsible for the incident. Instead, keep the focus firmly on remediating the incident and on the opportunity to strengthen your systems and processes.
Giving thought to your on-call communication and vocabulary can go a long way. While it is unnecessary to create an exhaustive list of language for your team to follow, it is essential to instill in the entire team the idea of using commonly understood terms and practices to ensure effective on-call communication. By following the simple principles described above, teams can save time and minimize incident duration.
These principles of effective communication are helpful to practice regardless of an organization’s approach to incident response. However, when paired with a powerful connected workflow tool such as Transposit, teams can communicate more effectively than ever before. Connect with us if you are interested in learning more about how Transposit can facilitate team communications, generate dynamic, real-time documentation, send automated notifications, and trigger actions with just about any API.