Network issues are one of the main reasons for failures in IT services. So what else?

HOSTKEY
Apr 15, 2024

Explore the analysis report and survey results from Uptime Institute

Main Causes of Failures

According to the Uptime Institute Global Survey of IT and Data Center Managers (2024) and the Uptime Institute Annual Outage Analysis 2024, network and connectivity issues were the most commonly cited cause of IT service failures, named by 31% of the 442 surveyed respondents. Software problems in IT systems follow closely behind, accounting for 22% of the reported causes of downtime and outages. Other common causes of IT service failures include power supply (18%), issues related to third-party IT services (10%), and cooling (7%).

The Uptime Institute also analyzed reports of the largest publicly known data center (DC) outages. According to this analysis, the main causes of publicly reported IT service failures include:

  • IT software (configuration): 23%
  • Network (software/configuration): 22%
  • Power supply: 11%
  • Cyber attacks/ransomware: 11%
  • Fiber optic cable line damage: 10%
  • Fire: 9%
  • Cooling issues: 6%
  • Network (cable infrastructure): 4%
  • Vendor/partner problems: 2%
  • Capacity/demand: 1%
  • Other reasons: 1%

On this list, IT software is the single largest cause of failures. However, if network software and configuration are combined with fiber optic line damage and cable infrastructure problems, the network becomes the primary cause of outages in data centers (DCs) and services.

Andy Lawrence, Executive Director of the Uptime Institute Research Center, remarked on this during a webinar where the analysis results were presented:

If we add network software and their configurations to the list of reasons for failures alongside fiber optic trunk lines and cable infrastructure, it becomes evident that the network is indeed the primary cause of outages in data centers and related services.
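The combined network share can be checked with a few lines of arithmetic. The sketch below uses the percentages from the list of publicly reported outage causes. Grouping the three network-related categories under one heading is an assumption based on the remark above, and the variable names are illustrative only.

```python
# Shares of publicly reported outages by cause, in percent
# (figures taken from the list of causes in this article).
causes = {
    "IT software (configuration)": 23,
    "Network (software/configuration)": 22,
    "Power supply": 11,
    "Cyber attacks/ransomware": 11,
    "Fiber optic cable line damage": 10,
    "Fire": 9,
    "Cooling issues": 6,
    "Network (cable infrastructure)": 4,
    "Vendor/partner problems": 2,
    "Capacity/demand": 1,
    "Other reasons": 1,
}

# Assumption: these three categories are the ones counted as "network".
network_keys = [
    "Network (software/configuration)",
    "Fiber optic cable line damage",
    "Network (cable infrastructure)",
]

network_total = sum(causes[key] for key in network_keys)
it_software = causes["IT software (configuration)"]

print(f"Network combined: {network_total}%")  # Network combined: 36%
print(f"IT software: {it_software}%")         # IT software: 23%
print(f"All causes: {sum(causes.values())}%") # All causes: 100%
```

Under this grouping, network-related causes add up to 36% of publicly reported outages, compared with 23% for IT software.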

While the overall frequency and severity of failures continue to decrease, cyber attacks are becoming increasingly prevalent and contribute to many serious outages. According to the report, these attacks lead to widespread and severe disruptions.

Ransomware operators and other malicious actors can hold systems hostage for days or even weeks. In some extremely rare cases, the targeted company may cease operations permanently and never recover.

In most cases, the main issue is that data center management systems now connect to IP networks, which makes them more vulnerable to attacks. In the past, these control systems ran on their own private networks, separate from the corporate network. Today, they are IP-based and have become critical for network security. If attackers gain access to them, they can cause parts of the infrastructure to malfunction.

While primary IP systems receive regular patches to address security vulnerabilities, many of these OT (operational technology) devices, such as HVAC systems, backup power generators, and fire safety management systems, do not receive such updates. As a result, their vulnerabilities remain exposed and unpatched.

To mitigate the risks associated with cyber attacks and maintain data center operational reliability, organizations should:

  1. Regularly update and patch IP-based systems and OT devices to address security vulnerabilities.
  2. Implement robust network security measures, including firewalls, intrusion detection and prevention systems (IDS/IPS), and multi-factor authentication.
  3. Perform regular security audits, penetration tests, and simulations to identify and remediate potential weaknesses.
  4. Develop and practice incident response plans to quickly contain and recover from cyber attacks.

Is everything really as dire as it seems?

In the context of modern business operations, maintaining a high level of service reliability is crucial to preserving customer trust and minimizing negative impacts on reputation. A recent survey highlights the significance of this issue, revealing insights into the experiences of data center operators (DCOs) and IT service providers.

Image from Uptime Institute analysis report

The findings indicate that while some organizations continue to face challenges such as downtime incidents and potential service interruptions, many have managed to mitigate these risks effectively. Among respondents, 41% reported minor downtime issues classified as “registered outages with minimal impact or negligible effect on services.” Uptime also categorized additional incidents, which comprised 32% of the cases, as minimal in nature, causing only slight disruptions for users and clients.

However, it is essential not to overlook the serious implications of more significant disruptions. In the same survey, 6% of participants reported severe issues, including service or operation failures, financial losses, violations of claimed service quality standards, security concerns, and reputational damage potentially leading to client loss.

Despite these challenges, it is encouraging to note that only 4% of the respondents encountered substantial or catastrophic downtime incidents causing significant disruptions in services or operations. Nevertheless, this figure underscores the importance of vigilance, ongoing efforts to improve service quality, and robust contingency plans for potential issues.

In conclusion, organizations must remain committed to fostering a culture of continuous improvement, proactive monitoring, and adaptive strategies to respond effectively to evolving market demands and customer expectations. By doing so, they can mitigate the risks associated with downtime incidents and minimize any negative impacts on their reputation and overall business performance.

The Uptime Institute cites several examples of high-profile outages that significantly impacted organizations, highlighting the potential consequences of downtime incidents. One such example involves the US Federal Aviation Administration (FAA), which faced a failure due to software configuration errors. This error inadvertently caused remote file deletions in the system that warned pilots, resulting in delays or cancellations for more than 30,000 flights. This incident led to a decline in the stock prices of major airline companies.

Another case study features Australian telecommunications provider Optus, which experienced a costly outage due to network issues. These problems caused data transmission delays and subsequent difficulties for banking services. Additionally, 12 hours of phone service disruptions affected over 10 million users and 400,000 businesses across the country.

A further example is the cyberattack on Dish Network, in which attackers encrypted critical data, causing service disruptions for nearly 300,000 customers and a drop of more than 6% in the company’s stock value. These examples underscore the importance of robust contingency plans, vigilant monitoring, and continuous improvement in preventing, mitigating, or rapidly recovering from downtime events.

In all these cases, organizations faced severe reputational damage, financial losses, and operational disruptions that reverberated throughout their respective industries. These incidents emphasize the critical nature of maintaining high availability and service continuity to preserve customer trust, avoid significant penalties, and ultimately remain competitive in an increasingly interconnected world.

Power issues persist

Power supply issues continue to be a major concern despite improvements in design quality and the redundancy measures implemented in data centers. Power outages remain a critical factor in data center downtime, with one in ten reported incidents in 2023 attributed to electrical problems.

Uptime Institute surveys indicated that 30% of respondents experienced an outage directly resulting from power issues. Among these, 42% cited uninterruptible power supply (UPS) failures as the primary cause. The process of switching to backup generators was identified as the second most significant contributing factor by 30% of survey participants.
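To put these nested percentages in perspective, they can be converted into shares of all respondents. The sketch below assumes that the 42% and 30% figures are fractions of the respondents who experienced a power-related outage. The source does not state the resulting values, so they are illustrative arithmetic rather than reported findings.

```python
from fractions import Fraction

# Share of all respondents who experienced a power-related outage.
power_outage_share = Fraction(30, 100)

# Assumption: the shares below are fractions of that power-affected group.
ups_failure_share = Fraction(42, 100)         # UPS failure cited as primary cause
generator_transfer_share = Fraction(30, 100)  # switching to backup generators

# Multiply the nested shares to express each cause relative to all respondents.
ups_overall = power_outage_share * ups_failure_share
transfer_overall = power_outage_share * generator_transfer_share

print(f"UPS failures: {float(ups_overall * 100)}% of all respondents")
print(f"Generator transfer: {float(transfer_overall * 100)}% of all respondents")
```

Under this reading, UPS failures would correspond to roughly 12.6% of all respondents and generator transfer problems to about 9%.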

Generator failures accounted for 28% of all power-related incidents and, through difficulties with the automatic start of reserve systems, contributed to almost 18% of reported downtime instances.

The most commonly overlooked aspect in data center management is testing. Even though many data centers have redundant power systems in place, they often neglect regular tests, leading to real-world issues.

However, there are some positive developments. Among survey respondents from industrial companies, 39% reported an increase in electrical redundancy, and 37% observed similar enhancements in cooling and heating systems for data centers.

Colocation service providers and data centers themselves have also seen improvements in power redundancy (35%) and cooling capacity (33%). Additionally, 37% of cloud/hosting/SaaS providers experienced increased electrical redundancy, while 33% observed better performance in their cooling systems.

The foundation of an emergency is the human element

The human element remains a significant factor in data center (DC) failures, despite advances in technology and redundant systems. A survey revealed that 39% of respondents directly linked the failure to human error.

Image from Uptime Institute analysis report

Examples include:

  • 48% attributed the failure to employees violating standard operating procedures at the data center.
  • 45% cited incorrect processes or policies for staff members.
  • 23% blamed installation errors in equipment and software.

Other factors related to human error involve:

  • Service delivery issues: 19%.
  • Insufficient personnel: 15%.
  • Inadequate preventive maintenance frequency: 14%.
  • Data center design flaws or construction oversights: 10%.

Any system that was designed, installed, or configured by a human may contain an unintentional error that could lead to an outage.

In response to recent IT service and data center failures and their consequences, several trends have emerged:

  • Increased investment in redundant backup power and cooling systems.
  • Enhanced monitoring solutions for early detection of potential issues.
  • Emphasis on employee training and better adherence to procedures and policies.
  • Greater collaboration among IT stakeholders to mitigate risks and improve overall resilience.

