Implementing AIOps to Predict Runtime Failures

Digitalization continues to accelerate across many sectors. International Data Corporation (IDC) forecasts that $2.8 trillion will be spent on digital transformations worldwide in 2025 – more than double the amount invested in 2020. Within those transformation efforts, an increasing number of companies are turning to artificial intelligence for IT operations (AIOps): according to Gartner, 30% of large enterprises will be using AIOps platforms and digital experience monitoring technology to monitor their IT by 2023, up from 2% in 2018.

This is because the acceleration in digital transformation is giving rise to a rocketing number of digital services, which massively increases the pressure on IT operations teams who are drowning in a flood of events, logs, and metrics.

AIOps is a category of tools designed to bring the benefits of AI, most typically machine learning, to data generated by telemetry. The aim is to enable teams to evaluate and act on data faster, reduce manual tasks and human errors, and predict and so prevent problems.

The teams that stand to benefit include:

  • DevOps teams who solve development pipeline issues;
  • Site reliability engineers (SRE) who address operational, scale, and reliability problems; and
  • Operations and management (O&M) teams who are responsible for creating more value from infrastructure and assets through greater reliability and efficiency, and optimizing operational performance using technical analyses, overcoming technical glitches and carrying out corrective or preventive measures.

AIOps achieves this, Gartner says, “by combining big data and machine learning to automate IT operations processes, including event correlation, anomaly detection, and causality determination”. It enables the automated discovery of performance and event patterns, and detection of the root cause of performance anomalies, typically using machine learning, data mining, analytics, and visualization to observe operational status. This minimizes the impact of failures during day-to-day operations and proactively allocates computer resources.

With predictive analytics, AIOps solutions can analyze and correlate data to produce insights and trigger automated actions, enabling IT teams to control complex IT environments and assure applications’ performance. The ability to correlate and isolate issues is a leap forward for IT operations, as it cuts the time needed to detect issues and throws light on problems that might otherwise not have been found.

Automatic anomaly detection, alerts, and solution recommendations reduce overall downtime and the number of incidents and tickets. Dynamic resource optimization can also be automated by predictive analytics to assure applications’ performance while cutting down resource costs, even during peak periods.

A Survey of AIOps Methods for Failure Management, published in November 2021, states that, compared to traditional approaches, AIOps is:

  • Fast because it reacts independently and automatically to real-time problems, without requiring long manual debugging and analysis sessions;
  • Efficient because it draws on the holistic monitoring of infrastructure, removing data silos and improving issue visibility. By forecasting workload requirements and modelling request patterns, AIOps improves resource utilization, identifies performance bottlenecks, and reduces waste. By freeing staff from investigation and repair burdens, AIOps enables them to concentrate on other tasks; and
  • Effective because it allocates computer resources proactively and can offer a large set of actionable insights for root-cause diagnosis, failure prevention, fault localization, recovery, and other O&M activities.

AIOps everywhere

All of this explains why the AIOps market is expected to grow rapidly. Allied Market Research values the global AIOps market at $26.33 billion in 2020 and projects it will reach $644.96 billion by 2030 at a compound annual growth rate of 37.9% from 2021 to 2030.

That growth – and the anticipated benefits – will not be achieved without overcoming some obstacles; chief among them is data. Regardless of how smart the AI algorithms are, their usefulness is limited unless they can be applied to good-quality data. It is usual for an enterprise's data to be fragmented, resident in diverse systems – some of which were probably designed to be discrete – in many different formats.

This makes it difficult to gain end-to-end visibility of infrastructure and applications, and hinders AIOps' efforts to identify emerging patterns against which anomalies can be detected across operations. The key to addressing both these challenges – visibility and identification of emerging patterns – is to choose API-driven AI tools* because they can work with any source of data, from logs to databases and streams, to deliver intelligent, actionable insights.

The richest source of data of all, though, is data modelling, that is, bringing data together from different fields. APIs are key to this.

The critical importance of APIs

Open, standardized APIs are a highly pragmatic way of enabling secure, controlled access to and between systems, and are foundational to the success of AIOps. The maximum benefit is derived from APIs by reusing them wherever possible, and deploying them in a consistent fashion so that the process of integration can be repeated as required, instead of each integration being a time-consuming, expensive customized task that introduces greater risk and complexity.

MuleSoft’s 2022 Connectivity Benchmark Report (in conjunction with Vanson Bourne and Deloitte Digital) estimates that 38% of IT teams’ time is spent on designing, building, and testing new custom integrations between systems and data. Meanwhile, technical debt for integration is increasing, with organizations spending more than $3.6 million, on average, on the labor involved in custom integration.

In addition to the quality of the raw data, the analytics that are applied to it have a tremendous bearing on how the AI algorithms perform. The most helpful are the normal and binomial distributions. The normal distribution describes continuous data that is symmetrical around its mean, with a 'bell' shape. The binomial distribution describes the number of successes or failures in a fixed number of independent trials, and can be used to estimate the likelihood of problems occurring.
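
As a rough illustration of how the two distributions might be applied to operational data, the sketch below uses only Python's standard library. The latency parameters, the job failure rate, and the helper names `binomial_pmf` and `binomial_tail` are assumptions made up for this example, not values from any real system.

```python
import math
from statistics import NormalDist

# Hypothetical telemetry: response latency (ms), assumed to be roughly normal.
latency = NormalDist(mu=200.0, sigma=25.0)

# Probability that a single request takes longer than 250 ms.
p_slow = 1.0 - latency.cdf(250.0)


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k failures in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binomial_tail(k: int, n: int, p: float) -> float:
    """Probability of k or more failures in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


# Hypothetical: each of 50 nightly jobs fails independently with probability 2%.
p_three_or_more = binomial_tail(3, 50, 0.02)

print(f"P(latency > 250 ms)      = {p_slow:.4f}")
print(f"P(>= 3 of 50 jobs fail)  = {p_three_or_more:.4f}")
```

The normal model suits continuous measurements such as latency, while the binomial model suits counts of pass/fail outcomes. Both rest on assumptions (symmetry, independence) that real telemetry may only approximately satisfy.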

Outlier detection is key to safeguarding data quality and needs to be part of each stage of the machine learning process. It works by identifying and analyzing anomalous data and errors, then correcting or removing them so that the outliers – anomalies – do not skew results. If anomalies in training data are not fixed, the accuracy and usefulness of the model are compromised.
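
One common way to flag outliers before training is the modified z-score, which is based on the median and the median absolute deviation (MAD). The sketch below is a minimal example of that technique, not a description of any particular product. The function name `find_outliers`, the conventional 3.5 cut-off, and the CPU samples are all assumptions for illustration.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    Uses the median and the median absolute deviation (MAD), which are less
    distorted by the outliers themselves than the mean and standard deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # No spread around the median: nothing can be scored.
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical CPU utilisation samples (%), with one suspicious spike.
cpu = [41.0, 43.5, 40.2, 42.8, 44.1, 39.9, 97.3, 42.0, 41.7, 43.0]
outliers = find_outliers(cpu)
flagged = {i for i, _ in outliers}
cleaned = [v for i, v in enumerate(cpu) if i not in flagged]

print("flagged for review:", outliers)
print("kept for training: ", cleaned)
```

Flagged points are returned rather than silently dropped, so they can be analyzed first. In an operations context, a spike may be the very event worth investigating.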

A z-test uses statistics to determine whether two population means are different when the variances are known and the sample is large, while a t-test determines whether there is a significant difference between the means of two groups, which may be related to certain features, when the variances have to be estimated from the samples. Both are used to test hypotheses before they are applied.
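
The following sketch shows how the two tests might be computed with the standard library alone. The before/after response times and the "known" variances passed to the z-test are invented for the example. The t-test shown is Welch's variant, chosen here because it does not assume equal variances, and only its statistic and degrees of freedom are computed.

```python
from math import erfc, sqrt
from statistics import mean, variance


def two_sample_z_test(a, b, var_a, var_b):
    """Two-sided z-test for a difference in means, with known variances."""
    z = (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))
    p_value = erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def welch_t_statistic(a, b):
    """Welch's t statistic and approximate degrees of freedom.

    Variances are estimated from the samples. Turning t into a p-value needs
    the Student t distribution, which the standard library does not provide.
    """
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, dof


# Hypothetical response times (ms) before and after a deployment.
before = [120, 118, 125, 122, 119, 121, 124, 120, 123, 118]
after = [131, 128, 135, 129, 133, 130, 127, 134, 132, 129]

# The known variances are an assumption for the sake of the example.
z, p = two_sample_z_test(before, after, var_a=6.0, var_b=7.0)
t, dof = welch_t_statistic(before, after)
print(f"z = {z:.2f}, p = {p:.2g}")
print(f"t = {t:.2f}, dof = {dof:.1f}")
```

A small p-value suggests the shift in mean response time is unlikely to be due to chance alone. With samples this small, the t-test is usually the more defensible choice.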

Another critical success factor is that tools can be embedded in existing applications, for example by using widgets, without causing disruption, delay, or expense. Data can then be visualized, and analytics and predictive capabilities introduced seamlessly via metrics, scripts, sorting and more. A search facility, via an interactive web interface, means teams can search an organization’s data based on criteria including time, phrase, expression, and user events.

Creating predictive models

The best AIOps solutions offer a user interface-based approach to creating predictive models that do not involve writing custom code but can be configured to receive forecasted values based on a set of input parameters. Alerts can be configured to notify if the value in a dataset has changed beyond a set threshold. The solution should also allow the uploading of training data sets for the machine learning to ‘learn’ from.
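
Although such solutions are configured through a user interface rather than code, the logic behind a threshold alert can be sketched in a few lines. The `ThresholdAlert` class, its fields, and the `queue_depth` metric below are hypothetical and do not represent any product's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Hypothetical alert rule: fire when a value drifts too far from baseline."""

    metric: str
    baseline: float        # expected (for example, forecasted) value; must be non-zero
    max_change_pct: float  # allowed deviation, as a percentage of baseline

    def check(self, value: float) -> Optional[str]:
        """Return an alert message if value breaches the threshold, else None."""
        change_pct = abs(value - self.baseline) / abs(self.baseline) * 100.0
        if change_pct > self.max_change_pct:
            return (
                f"ALERT {self.metric}: {value} deviates {change_pct:.1f}% "
                f"from baseline {self.baseline}"
            )
        return None


rule = ThresholdAlert(metric="queue_depth", baseline=200.0, max_change_pct=25.0)
for observed in (210.0, 240.0, 300.0):
    message = rule.check(observed)
    print(message or f"ok {rule.metric}: {observed}")
```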

The machine learning models themselves also need to be adaptable so that they can be applied to multiple tasks. For example, a solution should have a configurable AI and machine learning module with support for algorithms such as:

Random Forest Regressor

which combines ensemble learning – using multiple models trained on random samples of the same data – with a decision tree framework: it builds multiple decision trees from randomly drawn samples of the data, then averages their outputs to produce a prediction (the classifier variant takes a majority vote instead).
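
To make the sample-then-average idea concrete, here is a deliberately tiny, standard-library sketch. It is a simplified stand-in rather than a full random forest: each "tree" is only a depth-one split (a stump) on one randomly chosen feature. A production system would more likely rely on a library implementation such as scikit-learn's `RandomForestRegressor`. The function names and the (CPU %, queue depth) to latency data are assumptions for illustration.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Fit a depth-one regression tree on one feature by minimising squared error."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        if not left or not right:
            continue
        lm, rm = mean(left), mean(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # Feature is constant in this sample: predict the mean.
        m = mean(targets)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        sample_rows = [rows[i] for i in idx]
        sample_targets = [targets[i] for i in idx]
        forest.append(fit_stump(sample_rows, sample_targets, feature))
    return forest


def predict(forest, row):
    """Average the predictions of all trees."""
    return mean(lm if row[f] <= thr else rm for f, thr, lm, rm in forest)


# Hypothetical training data: (CPU %, queue depth) -> latency in ms.
rows = [(20, 5), (30, 8), (40, 12), (55, 20), (70, 35), (85, 60), (90, 80), (95, 110)]
targets = [100, 110, 125, 150, 210, 320, 400, 520]

forest = fit_forest(rows, targets)
low = predict(forest, (25, 6))
high = predict(forest, (92, 90))
print(f"forecast latency under light load: {low:.0f} ms")
print(f"forecast latency under heavy load: {high:.0f} ms")
```

Averaging many weak, differently sampled trees tends to give smoother and more stable forecasts than any single tree, which is the property that makes the approach attractive for noisy operational data.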

Content-based recommender

works with data provided by users – either explicitly, such as by rating something, or implicitly, such as by clicking on a link – to create a profile used to make recommendations to that user.
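
A minimal sketch of the idea, under the assumption that each item is described by a small numeric feature vector and that the user's profile is the rating-weighted average of the items they rated. The runbook names, features, and ratings are invented for the example, and ratings are assumed to be positive.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_profile(items, ratings):
    """Rating-weighted average of the feature vectors of items the user rated."""
    total = sum(ratings.values())
    size = len(next(iter(items.values())))
    return [
        sum(items[name][i] * r for name, r in ratings.items()) / total
        for i in range(size)
    ]


def recommend(items, ratings, top_n=2):
    """Rank unrated items by similarity to the user's profile."""
    profile = build_profile(items, ratings)
    scored = [
        (cosine(profile, vec), name)
        for name, vec in items.items()
        if name not in ratings
    ]
    return [name for _, name in sorted(scored, reverse=True)[:top_n]]


# Hypothetical runbooks described by features: (database, network, storage).
runbooks = {
    "restart-db-replica": (1.0, 0.0, 0.2),
    "tune-connection-pool": (0.9, 0.3, 0.0),
    "flush-dns-cache": (0.0, 1.0, 0.0),
    "expand-volume": (0.1, 0.0, 1.0),
    "rotate-db-credentials": (1.0, 0.1, 0.0),
}
# Explicit feedback: the operator rated two runbooks as helpful (1 to 5).
ratings = {"restart-db-replica": 5, "tune-connection-pool": 4}

print(recommend(runbooks, ratings))
```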

Association rule learning

uses machine learning models to analyze data for patterns, or co-occurrences, in a database and so identify frequent if-then associations. These are the association rules, and each has two parts: the antecedent (if) and the consequent (then).
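
The strength of an if-then rule is usually expressed through its support (how often antecedent and consequent occur together) and its confidence (how often the consequent follows when the antecedent is present). The sketch below computes both for a handful of invented incident records. The event names are hypothetical, and the antecedent is assumed to occur at least once.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical incidents, each recorded as the set of events seen together.
incidents = [
    frozenset({"disk_full", "db_slow", "api_timeout"}),
    frozenset({"disk_full", "db_slow"}),
    frozenset({"db_slow", "api_timeout"}),
    frozenset({"disk_full", "db_slow", "api_timeout"}),
    frozenset({"cert_expiry"}),
]

rule_support = support(incidents, {"disk_full", "db_slow"})
rule_confidence = confidence(incidents, {"disk_full"}, {"db_slow"})
print(f"if disk_full then db_slow: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}")
```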

Apriori algorithm

mines relational databases for frequent item sets, which in turn feed association rule learning. It identifies frequent individual items in the database and extends them to ever larger item sets, provided those item sets appear sufficiently often in the database.
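
The level-by-level growth described above can be sketched as follows. This is a compact teaching version rather than an optimized implementation. The `apriori` function name, the 0.5 minimum support, and the incident records are assumptions for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def frequent(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    # Level 1: frequent individual items.
    current = frequent({frozenset([item]) for t in transactions for item in t})
    result = dict(current)
    size = 2
    while current:
        # Join step: merge frequent (size-1)-itemsets into size-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        current = frequent(candidates)
        result.update(current)
        size += 1
    return result


# Hypothetical incidents, each recorded as the set of events seen together.
incidents = [
    frozenset({"disk_full", "db_slow", "api_timeout"}),
    frozenset({"disk_full", "db_slow"}),
    frozenset({"db_slow", "api_timeout"}),
    frozenset({"disk_full", "db_slow", "api_timeout"}),
    frozenset({"cert_expiry"}),
]

frequent_sets = apriori(incidents, min_support=0.5)
for itemset, s in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.2f}")
```

The prune step is what keeps the search tractable: an item set can only be frequent if all of its subsets are, so most larger candidates are discarded without ever being counted.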

It's crucial to grasp that deploying AIOps is not about ‘fixing’ technology so much as a means of delivering improved, strategic business outcomes. The predictive element of AIOps could, for example, improve customer experience by solving problems before they impact customers, or prevent a hold-up at a manufacturing plant, or conversely stop a production line in anticipation of a fault, thereby avoiding or reducing wastage while maximizing output and use of assets.

Clearly, these and other potential benefits are not mutually exclusive; rather, each could impact some or, in some circumstances, all the others.

*Check out why Torry Harris Integration Solutions (THIS) won the Computing DevOps Excellence Award 2022 for the Best AI/ML Ops Tool, 4Sight