
Project: Performance Optimization of Data Mining Ensemble Algorithms on SAP HANA

Team: Prof. Dr. Christoph Engels, Prof. Dr. Christoph M. Friedrich, B.Sc. David Müller

Research institution: Fachhochschule Dortmund

Abstract: In the upcoming Future SOC Lab period, the project team of the University of Applied Sciences and Arts Dortmund aims to extend the use of ensemble methods on SAP HANA, building on the findings and results of the previous project period. To this end, more powerful and efficient ways of working with SAP HANA are considered, along with a comprehensive source code optimization. The main objectives are improved performance and increased functionality of the existing ensemble method.

Predictive statistical data mining has evolved further in recent years and remains an active field of research. Recent research provides new data mining methods that lead to better results in model identification and behave more robustly, especially in the domain of Predictive Analytics. Most analytic business applications, for instance demand prediction, fraud detection and churn prediction, directly improve financial outcomes, and even small improvements in prediction quality lead to enhanced financial effects. The application of new, sophisticated predictive data mining techniques therefore enables business processes to leverage hidden potential and should be considered seriously.

Especially for classification tasks, ensemble methods (such as Random Forests) show powerful behavior, including that:

  • they exhibit excellent accuracy
  • they scale up and are parallel by design
  • they are able to handle thousands of variables, many-valued categorical variables, extensive missing values and badly unbalanced data sets
  • they give an internal unbiased estimate of the test set error as primitives are added to the ensemble
  • they can hardly overfit
  • they provide a measure of variable importance
  • they enable an easy approach for outlier detection
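
The internal error estimate mentioned in the list above can be illustrated with a small sketch. It is a toy bagging ensemble of decision stumps written with the Python standard library only, not the project's HANA PAL or R implementation and not a full Random Forest. All names (make_data, fit_stump, bagged_oob_error) and the synthetic data are assumptions made for illustration. Each row is scored only by the ensemble members whose bootstrap sample did not contain it (the "out-of-bag" members), so the error is tracked as members are added without a separate test set.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data: label is 1 when x0 + x1 > 1, with 10% label noise."""
    data = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = int(x[0] + x[1] > 1.0)
        if rng.random() < 0.1:
            y = 1 - y  # flip the label to simulate noise
        data.append((x, y))
    return data


def fit_stump(sample, rng):
    """Fit a one-split decision stump on one randomly chosen feature."""
    feature = rng.randrange(2)
    best = None
    for threshold in [i / 10 for i in range(1, 10)]:
        for polarity in (0, 1):
            errors = sum(
                (int(x[feature] > threshold) ^ polarity) != y for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    _, threshold, polarity = best
    return feature, threshold, polarity


def predict_stump(stump, x):
    feature, threshold, polarity = stump
    return int(x[feature] > threshold) ^ polarity


def bagged_oob_error(data, n_estimators, seed=0):
    """Return the out-of-bag error after each ensemble member is added."""
    rng = random.Random(seed)
    n = len(data)
    votes = [[0, 0] for _ in range(n)]  # out-of-bag votes per row: [class 0, class 1]
    history = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample with replacement
        in_bag = set(idx)
        stump = fit_stump([data[i] for i in idx], rng)
        # Only rows the stump has not seen during fitting receive its vote.
        for i in range(n):
            if i not in in_bag:
                votes[i][predict_stump(stump, data[i][0])] += 1
        scored = [(v, data[i][1]) for i, v in enumerate(votes) if v[0] + v[1] > 0]
        wrong = sum((v[1] > v[0]) != bool(y) for v, y in scored)  # ties go to class 0
        history.append(wrong / len(scored) if scored else float("nan"))
    return history


if __name__ == "__main__":
    data = make_data(300, random.Random(1))
    for size, err in enumerate(bagged_oob_error(data, 25, seed=2), start=1):
        print(size, round(err, 3))
```

Because every bootstrap sample leaves out roughly a third of the rows, each member adds out-of-bag votes for those rows, which is why the estimate requires no held-out data. The stumps here are deliberately weak, so the sketch shows the bookkeeping rather than the accuracy a real ensemble would reach.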

This project transfers these techniques to the HANA environment, where two implementations, one in R and one in HANA PAL, are investigated (see figure, where the R-Server is shown in green and the AFL-PAL Server is depicted in orange).

Because the AFL PAL option makes use of the computing power of the HANA hardware and minimizes data transfer from the database to the analytic function, we favor this approach (following the paradigm “bring the analytic function to the data”). The project implements both the PAL option and the R option and compares them on example classification problems in terms of performance and accuracy. At a later stage, all methods should be invoked via the HANA studio workflow builder on the client.

Link to the previous project: public/20132007

Last modified on May 20, 2014 1:53:15 PM
