Public Wiki

Project: Towards real-time IT service management systems: In-situ analysis of events and incidents using SAP HANA

Team: Prof. Dr. Rüdiger Zarnekow, Dipl.-Phys. Thorsten Pröhl

Research institution: Technische Universität Berlin

Project Description: Information technology service management (ITSM) is of increasing importance to IT organizations around the world, as information systems (IS) play a crucial role in private and public sector organizations. ITSM leads to a paradigm shift from a traditional IT organization to a customer-oriented IT service provider. The Information Technology Infrastructure Library (ITIL), the de facto standard for IT service management, is introduced for several reasons: the need to align the IT strategy with the business strategy, to increase the transparency and quality of IT processes, and at the same time to reduce the cost of IT services.

Key topics of ITIL are event and incident management, which describe how to manage the whole event and incident lifecycle. According to ITIL, an event is defined as “a change of state that has significance for the management of an IT service […]”. The term is also used to mean an alert or notification created by any IT service, configuration item or monitoring tool. In addition, IT organizations deal with a huge number of incidents in their daily business. ITIL describes an incident as “an unplanned interruption to an IT service or reduction in the quality of an IT service”. Furthermore, major incidents are described as “the highest category of impact. A major incident results in significant disruption to the business.” Thus, the identification and resolution of these major incidents is a matter of special importance.

In the domain of ITSM, there are various fields of application for big data analysis and in-memory technology. Real-time analysis of alerts, events, and incidents enables proactive actions and processes: incidents could be predicted, rapidly identified, and resolved in a timely manner. Moreover, serious issues, that is, major incidents, could be recognized among a crowd of “normal” incidents. In order to recognize these kinds of incidents, text analysis techniques and dependency resolution of IT services can be applied. In contrast to the traditional resolution and visualization of configuration items (CIs), the resolution of service dependencies now becomes possible. The demand for this kind of analysis is increasing due to the complexity of services and data center structures; cloud services and far-reaching value chains in particular call for service dependency analyses. In addition, the ITIL IT security process (“information security management”) is supported by analyses of user behavior and data. Near-term determination of capacity and availability issues supports capacity and availability management. On top of this, in-memory technologies offer new possibilities regarding the monitoring of service level agreements (SLAs). Usually, SLA reports are generated once a month. Now, real-time SLAs are conceivable, and predictive SLAs will support service managers by enabling them to start corrective actions in a timely manner. Dashboards visualize both kinds of SLAs in an appealing way, so service managers do not have to wait for recurring reports.
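To illustrate what dependency resolution of IT services can mean in practice, the following is a minimal sketch, not the project's implementation. It assumes a hypothetical dependency map (the names `webshop`, `order-db`, `vm-host-1`, and so on are invented for illustration) and walks it in reverse to find every service that is directly or transitively affected when one configuration item fails.

```python
from collections import deque

# Hypothetical dependency map (illustrative names only):
# each service or CI lists the items it directly depends on.
DEPENDS_ON = {
    "webshop": ["web-server", "order-db"],
    "reporting": ["order-db"],
    "web-server": ["vm-host-1"],
    "order-db": ["vm-host-2"],
    "mail": ["vm-host-1"],
}


def affected_services(failed_item, depends_on):
    """Return every item that directly or transitively depends on failed_item."""
    # Invert the map: item -> set of items that depend on it directly.
    dependents = {}
    for item, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(item)

    # Breadth-first search upwards through the inverted map.
    affected = set()
    queue = deque([failed_item])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


if __name__ == "__main__":
    print(sorted(affected_services("vm-host-2", DEPENDS_ON)))
```

In a setting like the one described here, the dependency map would presumably come from a configuration management database rather than a hard-coded dictionary; that source is an assumption of this sketch.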

This research project will focus on a subset of the envisaged new ITSM topics: it investigates the possibilities for real-time identification of events and (major) incidents. To this end, the authors will build a “live” system based on a large database and develop or adopt pattern matching techniques in order to handle events and (major) incidents. Finally, the in-situ monitoring of SLAs will be performed.
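As a rough illustration of how pattern matching could flag a major incident candidate among ordinary events, consider the sketch below. It is not the technique used in the project: the `Event` type, the regular expressions, the five-minute window, and the threshold of three hosts are all assumptions chosen for the example. The idea is simply that several distinct hosts reporting critical messages within a short time window is treated as a candidate.

```python
import re
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    host: str
    message: str


# Hypothetical text patterns that mark an event as critical.
CRITICAL_PATTERNS = [
    re.compile(r"\bCRITICAL\b", re.IGNORECASE),
    re.compile(r"connection (refused|timed out)", re.IGNORECASE),
    re.compile(r"\bhost (is )?down\b", re.IGNORECASE),
]


def is_critical(event):
    """True if the event message matches any critical pattern."""
    return any(p.search(event.message) for p in CRITICAL_PATTERNS)


def detect_major_incident(events, window_seconds=300.0, min_hosts=3):
    """Return the sorted hosts of the first time window in which at least
    min_hosts distinct hosts reported critical events, or None."""
    critical = sorted((e for e in events if is_critical(e)),
                      key=lambda e: e.timestamp)
    start = 0
    for end, event in enumerate(critical):
        # Shrink the window from the left until it spans at most window_seconds.
        while event.timestamp - critical[start].timestamp > window_seconds:
            start += 1
        hosts = {e.host for e in critical[start:end + 1]}
        if len(hosts) >= min_hosts:
            return sorted(hosts)
    return None


if __name__ == "__main__":
    sample = [
        Event(0, "web1", "HTTP CRITICAL - connection refused"),
        Event(30, "web2", "OK - load average 0.2"),
        Event(60, "db1", "PING CRITICAL - host is down"),
        Event(120, "ftp1", "FTP check: connection timed out"),
    ]
    print(detect_major_incident(sample))
```

A real system would likely combine such text patterns with the service dependency information mentioned earlier; the fixed thresholds here only keep the example short.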

http://www.ikm.tu-berlin.de/fileadmin/fg16/Forschungsprojekte/SAP_HANA_Project_800.png
IT service management system: high-level view of the underlying architecture

The above figure presents a high-level view of the underlying architecture. Within the project scope, we implement monitoring nodes based on Nagios or Icinga in order to monitor and control IT resources such as databases and web and FTP servers. Furthermore, these nodes observe virtual machine (VM) hosts and individual VMs. It is necessary to distinguish between agent-based and agentless monitoring; our nodes support both modes of operation. These monitoring systems collect event and sensor data from different sources and transfer them via VPN to our HANA instance, where they are continually loaded into the corresponding database tables. On the other side, a dashboard shows events, created incidents, and compliance with contracted SLAs. This proof of concept has different users, who have different information needs and therefore different dashboard views.
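The SLA compliance shown on such a dashboard could, for an availability target, be derived from the stored state changes roughly as follows. This is a simplified sketch under stated assumptions: the two states `"UP"` and `"DOWN"`, the `(timestamp, state)` tuple layout, and the 99.9 % default target are illustrative and do not reflect the project's actual schema or contracts.

```python
def availability(state_changes, period_start, period_end, initial_state="UP"):
    """Fraction of the period [period_start, period_end] spent in state "UP".

    state_changes is an iterable of (timestamp, state) tuples, where state
    is "UP" or "DOWN" (an assumed, simplified state model).
    """
    up_seconds = 0.0
    state = initial_state
    cursor = period_start
    for ts, new_state in sorted(state_changes):
        if ts <= period_start:
            # Changes before the period only set the starting state.
            state = new_state
            continue
        if ts >= period_end:
            break
        if state == "UP":
            up_seconds += ts - cursor
        cursor = ts
        state = new_state
    # Account for the time between the last change and the period end.
    if state == "UP":
        up_seconds += period_end - cursor
    return up_seconds / (period_end - period_start)


def sla_met(measured_availability, target=0.999):
    """True if the measured availability reaches the (assumed) target."""
    return measured_availability >= target


if __name__ == "__main__":
    changes = [(100, "DOWN"), (200, "UP")]
    value = availability(changes, 0, 1000)
    print(value, sla_met(value, target=0.95))
```

Because the calculation only needs the state changes inside the reporting period, it can be re-run whenever new events arrive instead of once a month.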

Last modified on Aug 27, 2013 11:19:22 AM