source link: https://blogs.sap.com/2022/09/21/python-hana_ml-classification-training-with-aplgradientboostingbinaryclassifier/
September 21, 2022 5 minute read

Python hana_ml: Classification Training with APL(GradientBoostingBinaryClassifier)

I am writing this blog to show how to train a classification model with APL using the Python package hana_ml.  With APL, you can automate preprocessing to some extent.

Environment

The environment is as follows.

  • Python: 3.7.14 (Google Colaboratory)
  • HANA: Cloud Edition 2022.16
  • APL: 2209

Python packages and their versions.

  • hana_ml: 2.14.22091801
  • pandas: 1.3.5
  • scikit-learn: 1.0.2

As for HANA Cloud, I activated the script server and created my users.  I am not aware of any other special configuration, but I may have missed something, since our HANA Cloud instance was created quite a while ago.

I didn’t use HDI here, to keep the environment simple.

Python Script

1. Install Python packages

Install the Python package hana_ml, which is not pre-installed on Google Colaboratory.

As for pandas and scikit-learn, I used the pre-installed versions.

!pip install hana_ml

2. Import modules

Import the required Python modules.

import pprint

from hana_ml.algorithms.apl.apl_base import get_apl_version
from hana_ml.algorithms.apl.gradient_boosting_classification \
    import GradientBoostingBinaryClassifier
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
from hana_ml.model_storage import ModelStorage
from hana_ml.visualizers.unified_report import UnifiedReport
import pandas as pd
from sklearn.datasets import make_classification

3. Connect to HANA Cloud

Connect to HANA Cloud and check its version.

The ConnectionContext class handles the connection to HANA.  You can check the APL version with the get_apl_version function.

HOST = '<HANA HOST NAME>'
SCHEMA = USER = '<USER NAME>'
PASS = '<PASSWORD>'
conn = ConnectionContext(address=HOST, port=443, user=USER,
                         password=PASS, schema=SCHEMA)
print(conn.hana_version())

# The APL version is shown in the APL.Version.ServicePack row
print(get_apl_version(conn))
4.00.000.00.1660640318 (fa/CE2022.16)
                                      name                                            value
0                        APL.Version.Major                                                4
1                        APL.Version.Minor                                              400
2                  APL.Version.ServicePack                                             2209
3                        APL.Version.Patch                                                1
4                                 APL.Info                     Automated Predictive Library
5                     AFLSDK.Version.Major                                                2
6                     AFLSDK.Version.Minor                                               16
7                     AFLSDK.Version.Patch                                                0
8                              AFLSDK.Info                                           2.16.0
9               AFLSDK.Build.Version.Major                                                2
10              AFLSDK.Build.Version.Minor                                               13
11              AFLSDK.Build.Version.Patch                                                0
12        AutomatedAnalytics.Version.Major                                               10
13        AutomatedAnalytics.Version.Minor                                             2209
14  AutomatedAnalytics.Version.ServicePack                                                1
15        AutomatedAnalytics.Version.Patch                                                0
16                 AutomatedAnalytics.Info                              Automated Analytics
17                             HDB.Version                           4.00.000.00.1660640318
18                     SQLAutoContent.Date                                       2022-04-19
19                  SQLAutoContent.Version                                     4.400.2209.1
20                  SQLAutoContent.Caption  Automated Predictive SQL Library for Hana Cloud

4. Create test data

Create test data using scikit-learn.

There are 3 features and 1 target variable.

def make_df():
    X, y = make_classification(n_samples=1000, 
                               n_features=3, n_redundant=0)
    df = pd.DataFrame(X, columns=['X1', 'X2', 'X3'])
    df['CLASS'] = y
    return df

df = make_df()
print(df)
df.info()

Here is the dataframe overview.

           X1        X2        X3  CLASS
0    0.964229  1.995667  0.244143      1
1   -1.358062 -0.254956  0.502890      0
2    1.732057  0.261251 -2.214177      1
3   -1.519878  1.023710 -0.262691      0
4    4.020262  1.381454 -1.582143      1
..        ...       ...       ...    ...
995 -0.247950  0.500666 -0.219276      1
996 -1.918810  0.183850 -1.448264      0
997 -0.605083 -0.491902  1.889303      0
998 -0.742692  0.265878 -0.792163      0
999  2.189423  0.742682 -2.075825      1

[1000 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      1000 non-null   float64
 1   X2      1000 non-null   float64
 2   X3      1000 non-null   float64
 3   CLASS   1000 non-null   int64  
dtypes: float64(3), int64(1)
memory usage: 31.4 KB

5. Define table and upload data

Define the HANA table and upload the data using the function “create_dataframe_from_pandas”.

The function is very useful, since it defines the table and uploads the data in a single call.  Please check its options for further details; a couple of them are sketched after the basic call below.

TRAIN_TABLE = 'PAL_TRAIN'
dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE,
                             schema=SCHEMA, 
                             force=True, # True: truncate and insert
                             replace=True) # True: Null is replaced by 0
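
As a side note, here is a hedged sketch with two more options I find convenient.  The parameter names “table_structure” and “chunk_size” are my reading of the hana_ml documentation, not part of the original post, so please verify them against your hana_ml version.

# Sketch only: override a guessed column type and control the insert batch size.
dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE,
                             schema=SCHEMA,
                             force=True,                            # True: truncate and insert
                             replace=True,                          # True: Null is replaced by 0
                             table_structure={'CLASS': 'INTEGER'},  # explicit SQL type for a column
                             chunk_size=50000)                      # rows per insert batch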

6. Split data into train and test dataset

Split the dataset using the function “train_test_val_split”.  The function needs a key column, so I added one with the function “add_id”.

train, test, _ = train_test_val_split(dfh.add_id(), 
                                      testing_percentage=0.2,
                                      validation_percentage=0)
print(f'Train shape: {train.shape}, Test Shape: {test.shape}')
Train shape: [800, 5], Test Shape: [200, 5]
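
If the target classes were imbalanced, a stratified split would usually be preferable.  A minimal sketch, assuming “train_test_val_split” supports partition_method='stratified' together with stratified_column (my assumption from the PAL documentation, please verify):

# Sketch only: stratify the partition on the target column so both splits
# keep the original class proportions.
train, test, _ = train_test_val_split(dfh.add_id(),
                                      testing_percentage=0.2,
                                      validation_percentage=0,
                                      partition_method='stratified',
                                      stratified_column='CLASS')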

7. Training

Train a gradient boosting model using the class “GradientBoostingBinaryClassifier”.  Please note that the class AutoClassifier is deprecated.

model = GradientBoostingBinaryClassifier()
model.fit(train, label='CLASS', key='ID', build_report=True)
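
The call above trains with APL’s default hyperparameters.  As a hedged sketch, a few of them can be passed to the constructor; the keyword names below reflect my understanding of the APL gradient boosting options, so confirm them with help(GradientBoostingBinaryClassifier).

# Sketch only: explicit hyperparameters instead of the APL defaults.
tuned_model = GradientBoostingBinaryClassifier(
    eval_metric='AUC',           # metric used to pick the best iteration
    max_iterations=500,          # maximum number of boosting iterations
    early_stopping_patience=20,  # stop when no improvement for 20 iterations
    learning_rate=0.05,
    max_depth=4)
tuned_model.fit(train, label='CLASS', key='ID', build_report=True)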

8. Training result

8.1. Unified Report

The model report is generated with the code below.  Please see my other article “Python hana_ml: PAL Classification Training(UnifiedClassification)” for the report content, which is basically the same.

model.generate_notebook_iframe_report()
model.generate_html_report('apl')

8.2. Score

The score function returns the mean average accuracy.

# score: mean average accuracy.  cannot output other metrics
score = model.score(test)
print(score)

8.3. Summary

The get_summary function returns the model summary.

model.get_summary().deselect('OID').collect()
(screenshot: apl_summary.jpg)

8.4. Metrics

The get_performance_metrics function returns the metrics information.

>> pprint.pprint(model.get_performance_metrics())

{'AUC': 0.991,
 'BalancedClassificationRate': 0.964590677634156,
 'BalancedErrorRate': 0.03540932236584404,
 'BestIteration': 69,
 'ClassificationRate': 0.9646017699115044,
 'CohenKappa': 0.9291813552683117,
 'GINI': 0.4823,
 'KS': 0.9195,
 'LogLoss': 0.12414480396790141,
 'PredictionConfidence': 0.991,
 'PredictivePower': 0.982,
 'perf_per_iteration': {'LogLoss': [0.617163,
                                    0.554102,
                                    0.499026,
<omit>
                                    0.125448,
                                    0.125588]}}
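
Since perf_per_iteration holds the LogLoss value of every boosting iteration, the learning curve can be plotted directly from this dictionary.  A small sketch with matplotlib (my addition, not part of the original post):

# Sketch: plot the per-iteration LogLoss and mark the best iteration.
import matplotlib.pyplot as plt

metrics = model.get_performance_metrics()
log_loss_curve = metrics['perf_per_iteration']['LogLoss']

plt.plot(range(1, len(log_loss_curve) + 1), log_loss_curve)
plt.axvline(metrics['BestIteration'], linestyle='--', label='BestIteration')
plt.xlabel('Iteration')
plt.ylabel('LogLoss')
plt.legend()
plt.show()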

8.5. Statistical Report

The get_debrief_report function returns several types of statistical reports.  Please see “Statistical Reports” in the SAP HANA APL Reference Guide.

reports = ['Statistics_Partition',
           'Statistics_Variables',
           'Statistics_CategoryFrequencies',
           'Statistics_GroupFrequencies',
           'Statistics_ContinuousVariables',
           'ClassificationRegression_VariablesCorrelation',
           'ClassificationRegression_VariablesContribution',
           'ClassificationRegression_VariablesExclusion',
           'Classification_BinaryClass_ConfusionMatrix']

for report in reports:
    print('\n'+report)
    display(model.get_debrief_report(report).deselect('Oid').head(3).collect())
(screenshot: apl_sreport1.jpg)

8.6. Indicators

The get_indicators function returns all indicators in a unified format.

model.get_indicators().collect()
(screenshot: apl_indicators-1.jpg)

8.7. Model info

The get_model_info function returns several types of reports.

for model_info in model.get_model_info():
    print('\n', model_info.source_table['TABLE_NAME'])
    display(model_info.deselect('OID').head(3).collect())
(screenshot: apl_model_info-1.jpg)

9. Predict

You can predict with the predict function.  Setting “APL/ApplyExtraMode” to “Individual Contributions” adds per-feature contribution columns to the output, as shown below.

>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>> apply_out = model.predict(test)
>> print(apply_out.head(3).collect())

   ID  TRUE_LABEL  PREDICTED  gb_score_CLASS  gb_contrib_X1  gb_contrib_X2  gb_contrib_X3  gb_contrib_constant_bias
0  12           0          0        2.592326      -0.222146       3.193908      -0.383197                  0.003759
1  13           1          1       -4.876161       0.141867      -4.717393      -0.304394                  0.003759
2  19           1          1       -4.074210       0.433828      -4.438335      -0.073464                  0.003759
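
Since the apply-out result contains both TRUE_LABEL and PREDICTED, you can also compute further metrics on the client side, which complements the score function above.  A minimal sketch with scikit-learn (my addition):

# Sketch: collect the predictions into pandas and evaluate them with scikit-learn.
from sklearn.metrics import classification_report, confusion_matrix

pred_df = apply_out.select('TRUE_LABEL', 'PREDICTED').collect()
print(confusion_matrix(pred_df['TRUE_LABEL'], pred_df['PREDICTED']))
print(classification_report(pred_df['TRUE_LABEL'], pred_df['PREDICTED']))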

10. Save model

Save the model with the class “ModelStorage” and its function “save_model”.

ms = ModelStorage(conn)
# ms.clean_up()
model.name = 'My classification model name'
ms.save_model(model, if_exists='replace')

You can list the saved models.

# display(ms.list_models())
pprint.pprint(ms.list_models().to_dict())
{'CLASS': {0: 'hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier'},
 'JSON': {0: '{"model_attributes": {"name": "My classification model name", '
             '"version": 1, "log_level": 8, "model_format": "bin", "language": '
             '"en", "label": "CLASS", "auto_metric_sampling": false}, '
             '"fit_params": {}, "artifacts": {"schema": "I348221", '
             '"model_tables": ["HANAML_APL_MODELS_DEFAULT"], "library": '
             '"APL"}, "pal_meta": {}}'},
 'LIBRARY': {0: 'APL'},
 'MODEL_REPORT': {0: None},
 'MODEL_STORAGE_VER': {0: 1},
 'NAME': {0: 'My classification model name'},
 'SCHEDULE': {0: '{"schedule": {"status": "inactive", "schedule_time": "every '
                 '1 hours", "pid": null, "client": null, "connection": '
                 '{"userkey": "your_userkey", "encrypt": "false", '
                 '"sslValidateCertificate": "true"}, "hana_ml_obj": '
                 '"hana_ml.algorithms.pal.xx", "init_params": {}, '
                 '"fit_params": {}, "training_dataset_select_statement": '
                 '"SELECT * FROM YOUR_TABLE"}}'},
 'STORAGE_TYPE': {0: 'default'},
 'TIMESTAMP': {0: Timestamp('2022-09-21 08:57:33')},
 'VERSION': {0: 1}}
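
A saved model can later be reloaded with “load_model” and used for prediction without retraining.  A short sketch using the name and version saved above:

# Sketch: reload the persisted model from ModelStorage and reuse it.
loaded_model = ms.load_model('My classification model name', version=1)
apply_out2 = loaded_model.predict(test)
print(apply_out2.head(3).collect())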

11. Close connection

Last but not least, close the connection.

conn.close()
