4Sight Case Study: AIOps detect service anomalies invisible to the human eye | Torry Harris Integration Solutions

Who

A UK-based mobile operator wanted to implement an AI solution to monitor web services and detect and flag deviations in their performance.

The operator is part of a larger group with operations in multiple countries. Its IT environment includes several legacy integration frameworks, independent web services, and a secure access gateway that connects billing, customer relationship management (CRM), order management, prepaid systems, and the service delivery platform (SDP), among others.

What

The operator used an Elasticsearch, Logstash, and Kibana (ELK) stack to monitor different middleware services based on their response times. This monitoring was proving tedious, time-consuming, and prone to human error. The main challenge was identifying services whose response time increased so slightly and gradually that problems only became obvious to the human eye after three to four days and, cumulatively, could break the terms of service level agreements (SLAs) over a 24-day period.

This hard-to-detect decay of service affected other services and took a lot of human resources and time to fix. It underlined the limitations of humans’ ability to monitor the services effectively in an increasingly complex environment.

How

Torry Harris Integration Solutions (THIS) introduced its 4Sight machine learning solution to explore three scenarios, as shown below.

Scenario 1: Gradual increase in response time

4Sight monitors the response time of different services, looking for gradual increases within various operating parameters, such as specific periods of time or numbers of transactions. Where a gradual increase is identified, the solution interprets the trend, predicts when the service will breach the terms of its SLA, and generates alerts.
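To make the idea of predicting an SLA breach from a gradual increase concrete, here is a minimal sketch. It is not 4Sight's actual method, which the case study does not detail: it substitutes a plain least-squares straight-line fit for the trend model. The function names, the use of hours as the time unit, and the sample figures are all assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def fit_trend(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of values = slope * times + intercept."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs of equal length")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all timestamps are identical")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope, intercept


def hours_until_breach(
    times: Sequence[float], values: Sequence[float], sla_seconds: float
) -> Optional[float]:
    """Hours from the last sample until the fitted line reaches the SLA limit.

    Returns None when the trend is flat or decreasing, and 0.0 when the
    fitted line has already reached the limit at the last sample.
    """
    slope, intercept = fit_trend(times, values)
    if slope <= 0:
        return None
    breach_time = (sla_seconds - intercept) / slope
    return max(0.0, breach_time - times[-1])


# Hypothetical daily averages: response time creeps up 0.1 s per day.
hours = [0.0, 24.0, 48.0, 72.0]
avg_response_seconds = [2.0, 2.1, 2.2, 2.3]
remaining = hours_until_breach(hours, avg_response_seconds, sla_seconds=3.0)
print(f"Projected SLA breach in about {remaining:.0f} hours")
```

A straight line is the simplest trend model; a production system would likely need to account for daily or weekly seasonality before extrapolating.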

Scenario 2: Sudden spike in the response time

If the response time of any service suddenly increases or decreases by more than a configurable value, 4Sight sends an alert about the deviation from the service’s normal behavior.
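A sudden-deviation check of this kind could be sketched as follows. This is an illustrative stand-in rather than the 4Sight implementation: it assumes "normal behavior" is approximated by the median of a rolling window of recent samples, and that the configurable value is a relative threshold. The class name `SpikeDetector` and its parameters are hypothetical.

```python
from collections import deque
from statistics import median


class SpikeDetector:
    """Flags samples that deviate sharply from the recent median."""

    def __init__(self, window: int = 20, threshold: float = 0.5, min_samples: int = 5):
        self.history = deque(maxlen=window)  # most recent response times
        self.threshold = threshold           # relative deviation that triggers an alert
        self.min_samples = min_samples       # warm-up before any alert is raised

    def observe(self, response_time: float) -> bool:
        """Record a sample; return True if it is a sudden increase or decrease."""
        is_spike = False
        if len(self.history) >= self.min_samples:
            baseline = median(self.history)
            if baseline > 0:
                deviation = abs(response_time - baseline) / baseline
                is_spike = deviation > self.threshold
        self.history.append(response_time)
        return is_spike


detector = SpikeDetector(window=10, threshold=0.5)
for sample in [1.0, 1.1, 0.9, 1.0, 1.0, 1.05, 2.0]:
    if detector.observe(sample):
        print(f"Alert: response time {sample} s deviates from recent behavior")
```

The median is used instead of the mean so that one earlier outlier in the window does not shift the baseline much.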

Scenario 3: Sudden breach of SLA

If any service breaches the SLA for response time, the monitoring solution flags the breach. The base value itself can adapt: for example, if the SLA stipulates 3 seconds for an “Offers” service as the fixed base value and the actual response time is 2 seconds, the solution dynamically adapts and makes that shorter time the new base for processing and predictions. An alarm can then be sent if there is a 10% increase or decrease relative to the dynamically derived base, or when a response time breaches an SLA term.
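The interplay between the fixed SLA value and the dynamically derived base can be illustrated with a small sketch. How 4Sight derives the base is not described in the case study, so the code assumes the caller supplies a typical observed response time. The class `ServiceBaseline`, its fields, and the alert labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceBaseline:
    sla_seconds: float          # fixed limit from the SLA, e.g. 3 s for "Offers"
    tolerance: float = 0.10     # 10% band around the derived base
    base_seconds: float = 0.0   # derived base; starts at the SLA value

    def __post_init__(self) -> None:
        if self.base_seconds <= 0:
            self.base_seconds = self.sla_seconds

    def adapt(self, typical_response_time: float) -> None:
        """Adopt the observed typical time as the new base, capped at the SLA."""
        self.base_seconds = min(typical_response_time, self.sla_seconds)

    def check(self, response_time: float) -> List[str]:
        """Return the alerts raised by a single response-time sample."""
        alerts = []
        if response_time > self.sla_seconds:
            alerts.append("SLA_BREACH")
        if abs(response_time - self.base_seconds) > self.tolerance * self.base_seconds:
            alerts.append("BASE_DEVIATION")
        return alerts


offers = ServiceBaseline(sla_seconds=3.0)
offers.adapt(2.0)          # service actually responds in about 2 s
print(offers.check(2.1))   # within 10% of the 2 s base
print(offers.check(2.5))   # more than 10% above the base, still under the SLA
print(offers.check(3.2))   # over the SLA limit and far from the base
```

Tracking deviation from the derived base means a service drifting from 2 seconds to 2.5 seconds raises an alert well before it reaches the 3-second SLA limit.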

Results
  • The machine learning capabilities were introduced into the existing ELK environment without any disruption and now rapidly detect and send alerts about anomalies based on the response time of services.
  • Automatic monitoring of the different services runs 24x7, requiring minimal human interaction, thereby freeing up the workforce.
  • The solution is efficient and cost-effective: the Kafka cluster runs on a commonplace server with 32 gigabytes of random access memory (RAM) that can process about 300 transactions per second.
  • Further, early detection of problems in any service allows a fast fix by developers, avoiding pressure on the server, which in turn prevents downtime.

Creating a solutions architecture

THIS used 4Sight to create an artificial neural network (ANN) to monitor end-to-end business services running on a Kafka system. An ANN comprises a collection of linked units or nodes, known as artificial neurons because they are loosely modelled on how neurons work in a biological brain: each can send a signal to other neurons. Once an artificial neuron receives and processes a signal, it can relay it to the others to which it is connected.

The signal is a real number, that is, a value expressing a continuous quantity such as a distance along a line. The output of each neuron is computed by applying a non-linear function to the sum of its inputs.
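In the standard formulation of an artificial neuron, the output is a non-linear function applied to a weighted sum of the incoming signals. The sketch below shows that computation for a single neuron. The weights, the bias term, and the choice of the sigmoid as the non-linear function are textbook conventions assumed here for illustration; the case study does not say what 4Sight uses.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted sum of the input signals plus a bias, passed through a sigmoid."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


# Two incoming signals whose weighted contributions cancel out.
print(neuron_output([1.0, 2.0], [0.5, -0.25], bias=0.0))
```

The value returned by one neuron is itself a real number, so it can be passed on as an input signal to the neurons connected to it.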

The operator's deployment works by forwarding the Logstash output from the existing ELK set-up to the Kafka system, which provides the response time and other details about each service to the neural network. Another option is to deploy a Kafka cluster to process different services in parallel to save time.
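On the consuming side, the events forwarded through Kafka have to be turned into per-service response-time series before a model can use them. The following sketch assumes each message is a JSON document with `service` and `response_time_ms` fields. Those field names are hypothetical, since the case study does not describe the message schema. The Kafka client is deliberately left out, so the function accepts any iterable of raw message payloads.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_response_times(messages: Iterable[bytes]) -> Dict[str, List[float]]:
    """Group response times by service, skipping malformed events."""
    per_service: Dict[str, List[float]] = defaultdict(list)
    for raw in messages:
        try:
            event = json.loads(raw)
            service = str(event["service"])                   # assumed field name
            response_time = float(event["response_time_ms"])  # assumed field name
        except (ValueError, KeyError, TypeError):
            continue  # not JSON, or a required field is missing or invalid
        per_service[service].append(response_time)
    return dict(per_service)


sample_messages = [
    b'{"service": "Offers", "response_time_ms": 2000}',
    b'not json',
    b'{"service": "Billing", "response_time_ms": "150.5"}',
    b'{"service": "Offers", "response_time_ms": 2100}',
]
print(group_response_times(sample_messages))
```

Keeping the series separate per service is also what makes it possible to process different services in parallel.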