Project: Raising the power of ensemble methods using SAP HANA

Team: Prof. Dr. Christoph Engels, Prof. Dr. Christoph M. Friedrich, B.Sc. David Müller

Research institution: University of Applied Sciences and Arts Dortmund, Forschungsschwerpunkt iBIS

Abstract: Ensemble methods (such as Random Forests, Quantile Forests, Gradient Boosting Machines and their variants) have demonstrated outstanding performance in data mining. This project aims to exploit this potential in the powerful HANA environment.

Predictive statistical data mining has evolved considerably in recent years and remains an active field of research. Recent research has produced new data mining methods that yield better results in model identification and behave more robustly, especially in the domain of Predictive Analytics. Most analytic business applications, for instance demand prediction, fraud detection and churn prediction, directly improve financial outcomes. Even small improvements in prediction quality translate into financial gains. The application of new, sophisticated predictive data mining techniques therefore enables business processes to leverage hidden potential and should be considered seriously.

Especially for classification tasks, ensemble methods (such as Random Forests) have a number of powerful properties:

  • they exhibit excellent accuracy
  • they scale up and are parallel by design
  • they can handle thousands of variables, many-valued categorical variables, extensive missing values and badly unbalanced data sets
  • they give an internal unbiased estimate of the test set error as primitives are added to the ensemble
  • they are resistant to overfitting
  • they provide a measure of variable importance
  • they enable an easy approach to outlier detection
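
The internal error estimate mentioned in the list can be illustrated with a minimal sketch. It is not the project's implementation and is not a full Random Forest: it uses plain bagging of decision stumps (no per-split feature subsampling), pure Python, and a synthetic toy data set. The names fit_stump, predict_stump and bagged_stumps_with_oob are hypothetical. Each stump is trained on a bootstrap sample, and the rows left out of that sample ("out-of-bag" rows) are used to track the error as stumps are added.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-split tree: (feature, threshold, left_label, right_label)
    chosen to minimize the number of training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, (f, t, l_lab, r_lab))
    return best[1]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def bagged_stumps_with_oob(X, y, n_estimators=25, seed=0):
    """Train stumps on bootstrap samples and record the out-of-bag (OOB)
    error of the majority vote after each stump is added."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    oob_votes = [Counter() for _ in range(n)]  # votes from stumps that did not see row i
    oob_curve = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample with replacement
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        stumps.append(stump)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:
                oob_votes[i][predict_stump(stump, X[i])] += 1
        # Score only rows that have received at least one OOB vote so far.
        scored = [(votes.most_common(1)[0][0], y[i])
                  for i, votes in enumerate(oob_votes) if votes]
        oob_curve.append(
            sum(p != t for p, t in scored) / len(scored) if scored else None)
    return stumps, oob_curve


# Synthetic toy data (an assumption for illustration): the label depends on feature 0 only.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

stumps, oob_curve = bagged_stumps_with_oob(X, y)
print("OOB error after each added stump:", oob_curve)
```

Because every row is scored only by stumps that never saw it during training, the OOB curve behaves like a test set error without requiring a separate hold-out set.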

This project transfers these techniques to the HANA environment, where two implementations are investigated: one in R and one in HANA PAL (see figure, where the R server is shown in green and the AFL-PAL server is depicted in orange).

Because the AFL PAL option makes use of the computing power of the HANA hardware and minimizes data transfer from the database to the analytic function, we favor this approach (following the paradigm “bring the analytic function to the data”). The project implements both the PAL option and the R option and compares them on example classification problems in terms of performance and accuracy. At a later stage, all methods should be invocable via the HANA studio workflow builder on the client.
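
The comparison described above could be organized along the following lines. This is only a sketch under stated assumptions: it does not call R or HANA PAL, the names evaluate and compare are hypothetical, and the two candidates are trivial stand-in functions where a real setup would wrap calls to the respective implementation. It measures accuracy and wall-clock prediction time on the same labeled data.

```python
import time


def evaluate(predict, X, y):
    """Return accuracy and wall-clock prediction time for one classifier callable."""
    start = time.perf_counter()
    predictions = [predict(row) for row in X]
    elapsed = time.perf_counter() - start
    accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
    return {"accuracy": accuracy, "seconds": elapsed}


def compare(candidates, X, y):
    """Evaluate every named candidate on the same data."""
    return {name: evaluate(predict, X, y) for name, predict in candidates.items()}


# Stand-ins (assumptions): a real comparison would plug in wrappers around
# the R option and the PAL option instead of these toy rules.
X_test = [[0.1], [0.4], [0.6], [0.9]]
y_test = [0, 0, 1, 1]
candidates = {
    "r_option_stand_in": lambda row: int(row[0] > 0.5),
    "pal_option_stand_in": lambda row: int(row[0] > 0.3),
}

report = compare(candidates, X_test, y_test)
for name, result in report.items():
    print(name, result)
```

Using identical inputs for both candidates keeps the accuracy figures directly comparable; for the database-side option, the timing would also need to include any data transfer, which this sketch does not model.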

Link to the previous project: public/20131005

Last modified on Oct 2, 2013 9:14:20 AM
