
SAP HANA Cloud Machine Learning Challenge “I quit!” – understanding metrics

source link: https://blogs.sap.com/2023/01/09/sap-hana-cloud-machine-learning-challenge-i-quit-understanding-metrics/

I participated in the SAP HANA ML Challenge – Employee Churn and I came in second place.

I would like to thank the entire SAP expert team for the initiative and organization!

Table of contents

1. Getting started
2. Understanding the challenge
3. Exploratory data analysis and pre-processing
4. Baseline model with initial data
5. Understanding statistics
6. Training model with additional data
7. Reducing the number of features
8. Balancing training data with SMOTE
9. Balancing test data and increasing training data
10. Finding top SMOTE
11. Building top models
12. Analyzing top models
13. Visualizing statistics
14. Further steps

1. Getting started

The challenge was an opportunity for me to get hands-on experience with the Python machine learning client for SAP HANA (hana-ml). The participants had continuous support from SAP team experts during open office hours, discussions in the SAP Community Groups, and tutorials on GitHub. My previous experience came from Kaggle and SAP open courses. Participants without an SAP HANA Cloud instance, myself included, received credentials to execute Jupyter notebooks in the cloud. The first step was to set up the environment: I installed Anaconda, created a dedicated environment, and installed SciPy and hana-ml.

conda create --name ml_hana python=3.8
conda install -c anaconda scipy
pip install hana-ml
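
The connection test itself is not shown in this blog. A minimal sketch, assuming a ConnectionContext from hana_ml.dataframe and placeholder host, user, and password taken from the challenge credentials:

from hana_ml.dataframe import ConnectionContext

# placeholder credentials - replace with the values provided for the challenge
conn = ConnectionContext(address='<hana-cloud-host>', port=443, user='<user>', password='<password>', encrypt='true')
print(conn.hana_version())  # a version string confirms the connection works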

After successfully testing the connection, I was ready to dig deeper into the problem we needed to solve.

2. Understanding the challenge

The task is to build a model that predicts whether an employee will stay or quit. The collected data was provided in a CSV file. To look into the data we can use Python libraries locally for small tasks and hana_ml for resource-consuming tasks. For data processing I used mainly the Pandas DataFrame and the HANA ML DataFrame; the two libraries are aligned in terms of methods, but the syntax is specific to each. I received all the required support from the expert team and moved on with my exploratory data analysis. I focused on building a model with the best results, building alternatives to it, and applying A/B testing on the deployed models. I presented the results of the final model with visualization libraries.

The machine learning problem of the challenge is a classification problem with two classes: Stay, labeled No, and Quit, labeled Yes. I call them class No and class Yes, and the class we have to predict is Yes. Counting the rows gives class No : class Yes = 17015 : 2100, roughly 8 : 1. Looking at these proportions, the data is highly imbalanced, with a majority class No and a minority class Yes, and the minority class Yes is the one I have to predict.
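
The class counts can be computed directly on the HANA ML DataFrame; a minimal sketch, assuming the CSV has already been loaded into a HANA ML DataFrame df_remote with the target column FLIGHT_RISK:

# count rows per class inside SAP HANA, only the small result is collected locally
df_remote.agg([('count', 'FLIGHT_RISK', 'Count')], group_by='FLIGHT_RISK').collect()
# expected: No 17015, Yes 2100 - roughly 8:1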

The final Jupyter notebook contains 270 cells. In this blog I will only go through the main parts. The entire code is on GitHub, published with my last run. As with any code, it could be improved further if I had to maintain it in the long run.

3. Exploratory data analysis and pre-processing

Analyzing feature distributions.

# In [26]
# EDA features distributions
col_exc = ['ID']
for col in df_remote_add.columns:
    if col not in col_exc:
        print(df_remote_add.agg([('count', col, 'Count')], group_by=col).collect())

The data contains NULL values; I filled the numeric columns with -999 and the string columns with '-999'.

# In [31]:
df_remote = df_remote.fillna(-999)
df_remote = df_remote.fillna("-999")

Feature engineering.

# In [34]
# TIMEINPREVPOSITIONMONTH - feature engineering group in intervals
# I will check by importance if this matters
df_remote = df_remote.select \
    ('*',("""CASE WHEN TIMEINPREVPOSITIONMONTH <=12 THEN 'PREV_POS (,12]'
                  WHEN TIMEINPREVPOSITIONMONTH > 12 AND TIMEINPREVPOSITIONMONTH <=24 THEN 'PREV_POS (12,24]'
                  WHEN TIMEINPREVPOSITIONMONTH > 24 AND TIMEINPREVPOSITIONMONTH <=36 THEN 'PREV_POS (24,36]'
                  WHEN TIMEINPREVPOSITIONMONTH > 36 AND TIMEINPREVPOSITIONMONTH <=48 THEN 'PREV_POS (36,48]'
                  WHEN TIMEINPREVPOSITIONMONTH > 48 AND TIMEINPREVPOSITIONMONTH <=60 THEN 'PREV_POS (48,60]'
                  WHEN TIMEINPREVPOSITIONMONTH > 60 THEN 'TIMEPREV_POS (60,]'
              END""",'TIMEINPREVPOS_INT'))
df_remote = df_remote.to_tail('FLIGHT_RISK')
df_remote.head(3).collect()

4. Baseline model with initial data

I split the data into training and test sets and fit the training data to build the first baseline model.

# In [42]:
from hana_ml.algorithms.pal.partition import train_test_val_split as split
d_train, d_test, d_val = split(data=df_remote, partition_method='stratified', stratified_column=target, validation_percentage=0)

To build the model I used HybridGradientBoostingTree wrapped in a function.

# In [47]:
# Function to use repeatedly
def model_fit(d_train):
    HGBT_MODEL = UnifiedClassification('HybridGradientBoostingTree')
    # ... fits HGBT_MODEL on d_train and returns it (full parameters in the notebook on GitHub)
# In [56]:
HGBT_MODEL.statistics_.collect() # MODEL statistics

Out [56]:

STAT_NAME STAT_VALUE CLASS_NAME
1 RECALL 0.9959588537839824 No
2 PRECISION 0.9249402934152167 No
3 F1_SCORE 0.9591367415531578 No
4 SUPPORT 2722 No
5 RECALL 0.34523809523809523 Yes
6 PRECISION 0.9133858267716536 Yes
7 F1_SCORE 0.5010799136069115 Yes
8 SUPPORT 336 Yes
9 ACCURACY 0.9244604316546763 None

Looking into the model statistics, at first glance I could consider myself very lucky with ACCURACY = 0.92. But looking at SUPPORT No = 2722 and SUPPORT Yes = 336, I recognize that the validation data used for the model metrics is imbalanced, and the high ACCURACY comes from the majority class No with RECALL No = 0.99, while RECALL Yes = 0.34. Since I am interested in class Yes, the results are actually very poor. What is RECALL? I spent some time breaking down the confusion matrix into its basic elements, because if I do not feel confident with the metrics I will not be able to improve the prediction model. Computers are fast and accurate at calculations, but reasoning is a human task.

5. Understanding statistics

To calculate the metrics of the confusion matrix all we need is algebra: first we count TP (true positives), FP (false positives), FN (false negatives), and TN (true negatives), and with these four elements we can calculate all the metrics. If the classes are imbalanced I can rely on RECALL, because each class's recall is calculated only from the actual rows of that class: RECALL Positive = TP / (TP + FN) and RECALL Negative = TN / (TN + FP). To calculate PRECISION I need the true predictions (TP) of one class together with the false predictions (FP) coming from the other class. To calculate ACCURACY I need all four elements TP, FP, FN, and TN, and with imbalanced classes the result is misleading, because the extremely good results of the majority class outweigh the poor results of the minority class and make the overall number look good. I also read the blog Failure of Classification Accuracy for Imbalanced Class Distributions to check my thinking.
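
As a sanity check, the formulas can be written down in a few lines of Python; this helper is my own illustration and is not part of the challenge notebook:

# metrics of a binary confusion matrix, with Yes treated as the positive class
def cm_metrics(tp, fp, fn, tn):
    recall_pos = tp / (tp + fn)                 # RECALL Yes: uses only actual Yes rows
    recall_neg = tn / (tn + fp)                 # RECALL No: uses only actual No rows
    precision_pos = tp / (tp + fp)              # PRECISION Yes: FP comes from the other class
    precision_neg = tn / (tn + fn)              # PRECISION No
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # ACCURACY: uses all four elements
    return recall_pos, recall_neg, precision_pos, precision_neg, accuracy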

Let's dig into the confusion matrix principles, best explained with an example.

Predicted No Predicted Yes
Actual No True No False Yes
Actual Yes False No True Yes

Assume one built a model with extremely good metrics for majority class No.

Majority class No : minority class Yes = 909 : 91 (roughly 900 : 100)

Assume confusion matrix results.

Predicted No Predicted Yes
Actual No 900 9
Actual Yes 82 9

Model statistics.

RECALL No 0.99 extremely good
PRECISION No 0.92 very good
F1_SCORE No 0.95 very good
SUPPORT No 909
RECALL Yes 0.10 very poor
PRECISION Yes 0.50 poor
F1_SCORE Yes 0.17 very poor
SUPPORT Yes 91
ACCURACY 0.91 very good

If I wanted to rely on all the metrics, I would force the classes in the confusion matrix to be balanced, scaling the Actual No row down by a factor of ten so that SUPPORT No matches SUPPORT Yes.

Predicted No Predicted Yes
Actual No 90 1
Actual Yes 82 9

Model statistics adjusted.

RECALL No 0.99 very good (same)
PRECISION No 0.52 poor
F1_SCORE No 0.68 good
SUPPORT No 91
RECALL Yes 0.10 very poor (same)
PRECISION Yes 0.90 very good
F1_SCORE Yes 0.18 very poor
SUPPORT Yes 91
ACCURACY 0.54 poor

Now I can see that with balanced classes RECALL No and RECALL Yes remain the same, PRECISION No drops significantly, PRECISION Yes increases significantly, but F1_SCORE Yes remains very poor, and ACCURACY drops from 0.91 (very good) to 0.54 (poor).
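
Plugging both example matrices into the cm_metrics helper above reproduces these numbers; with Yes as the positive class, TP = 9, FN = 82, and FP/TN come from the Actual No row:

# imbalanced example: Actual No = (900, 9), Actual Yes = (82, 9)
print(cm_metrics(tp=9, fp=9, fn=82, tn=900))  # RECALL Yes 0.10, PRECISION Yes 0.50, ACCURACY 0.91
# balanced example: Actual No row scaled down to (90, 1)
print(cm_metrics(tp=9, fp=1, fn=82, tn=90))   # RECALL Yes 0.10, PRECISION Yes 0.90, ACCURACY 0.54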

The model statistics are calculated on validation data split off from the training data. The validation data is not fitted on directly, but it is used during training for iterative optimization.

To see how the model really performs, we have to calculate the metrics on completely unseen data: the test data.

Prediction statistics wrapped in a local function.

# In [61]:
f_score_res(HGBT_MODEL,d_test)[1].collect()

Out [61]:

STAT_NAME STAT_VALUE CLASS_NAME
1 RECALL 0.9964747356051704 No
2 PRECISION 0.9247546346782988 No
3 F1_SCORE 0.9592760180995475 No
4 SUPPORT 1702 No
5 RECALL 0.34285714285714286 Yes
6 PRECISION 0.9230769230769231 Yes
7 F1_SCORE 0.5000000000000001 Yes
8 SUPPORT 210 Yes
9 ACCURACY 0.9246861924686193 None

RECALL Yes = 0.34 on the test data is very poor, and ACCURACY = 0.92 does not tell the true story.

In order to analyze the performance of the subsequent approaches, I collect the results in a local pandas DataFrame, wrapped in a local function that adds each subsequent model's results as a new column.

# In [64]:
def f_stat_all(p_model, p_d_test, p_col_name):
    # ... collects the model and prediction statistics under column p_col_name and returns both (full code on GitHub)
df_model_stat_last # model metrics
df_pred_stat_last  # prediction metrics

I will store the initial baseline results in column INIT.

# In [65]:
df_model_stat_last, df_pred_stat_last = f_stat_all(HGBT_MODEL, d_test, 'INIT')

I will take a look at the initial feature importance.

# In [68]:
# Feature importance
HGBT_MODEL.importance_.sort('IMPORTANCE', desc=True).collect()

Out [68]:

The top feature by importance is  FUNCTIONALAREACHANGETYPE.

6. Training model with additional data

Meanwhile, additional data was provided; the row count matches the initial data.

# In [19]:
df_data_add.shape, df_remote.shape # check rows and columns

Out [19]:

((19115, 4), [19115, 43])

Joining of initial data with additional data.

# In [70]:
df_remote_new = df_remote.set_index(tab_id).join(df_remote_add.set_index(tab_id)) 

After fitting, the model statistics are the following:

Out [75]:

1 RECALL 0.979059515062454 No
4 SUPPORT 2722 No
5 RECALL 0.6101190476190477 Yes
8 SUPPORT 336 Yes

I observe that RECALL Yes in the model statistics increased to a good 0.61.

Analyzing the feature importance, I observe that the new top feature, HRTRAINING, comes from the additional data.

# In [76]:
df_imp = HGBT_MODEL.importance_.sort('IMPORTANCE', desc=True).collect()

Prediction statistics.

Out [78]:

1 RECALL 0.9800235017626322 No
4 SUPPORT 1702 No
5 RECALL 0.5714285714285714 Yes
8 SUPPORT 210 Yes

RECALL Yes in the prediction statistics is 0.57, which is acceptable, though lower than the 0.61 from the model statistics.

I conclude that the predictions have improved.

7. Reducing the number of features

Looking further into the data, I notice groups of dependent features, identified by their descriptions and content.

# In [87]:
col_age_dep = ['AGE', 'AGE_GROUP5', 'AGE_GROUP10', 'GENERATION'] # age group

# In [90]:
col_imp_prev_loc = ['PREVIOUS_COUNTRY', 'PREVCOUNTRYLAT', 'PREVCOUNTRYLON', 'PREVIOUS_REGION'] # previous country group

# In [92]:
col_imp_curr_loc = ['CURRENT_REGION', 'CURRENT_COUNTRY', 'CURCOUNTRYLAT', 'CURCOUNTRYLON'] # current country group

# In [94]:
col_imp_tenure = ['TENURE_MONTHS','TENURE_INTERVAL_YEARS','TENURE_INTERVALL_DESC'] # tenure group

From each group, I select only the top feature by importance and fit the model again.

RECALL Yes from the prediction statistics improves from 0.57 to 0.61, which means the model performs better with only the top feature from each group.

# In [104]:
f_score_res(HGBT_MODEL, d_test[cols_imp])[1].collect()

Out [104]:

1 RECALL 0.9782608695652174 No
4 SUPPORT 1702 No
5 RECALL 0.6190476190476191 Yes
8 SUPPORT 210 Yes

For the country groups, I prefer an adjusted selection by meaning, keeping PREVIOUS_COUNTRY and CURRENT_COUNTRY as the top features.

# In [110]:
# Reducing features with meaning preference
COL_IMP_PREVIOUS_COUNTRY_MEANING = ['PREVIOUS_COUNTRY','PREVCOUNTRYLAT', 'PREVCOUNTRYLON', 'PREVIOUS_REGION']
COL_IMP_CURRENT_REGION_MEANING = ['CURRENT_COUNTRY','CURRENT_REGION', 'CURCOUNTRYLAT', 'CURCOUNTRYLON']

Fitting with meaning preference.

# In [114]:
f_score_res(HGBT_MODEL, d_test[cols_imp_mn])[1].collect() # meaning preference

Prediction statistics.

Out [114]:

1 RECALL 0.981198589894242 No
4 SUPPORT 1702 No
5 RECALL 0.6238095238095238 Yes
8 SUPPORT 210 Yes

I observe that RECALL Yes stays almost the same, 0.62.

8. Balancing training data with SMOTE

A quick introduction to imbalanced data can be found in the blog Data Yoga – It is all about finding the right balance. For oversampling the minority class Yes I will use the SMOTE technique.

Checking class proportions from train data.

# In [124]:
f_class_count(d_train).collect()

Out [124]:

FLIGHT_RISK Count
0 No 13612
1 Yes 1680

I know that classes are highly imbalanced and I want to find a way to fix this with SMOTE.

I wrapped the calculation of max SMOTE into a local function.

# In [126]:
# return max smote amount
def f_smote_nmax(dfh):

The max SMOTE is 710.
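
As an illustration of where 710 comes from: the maximum SMOTE amount is the oversampling percentage that brings the minority class up to the size of the majority class. A possible sketch of the function (my own reconstruction, not the notebook code):

def f_smote_nmax(dfh):
    # (majority count - minority count) / minority count, expressed in percent
    counts = f_class_count(dfh).collect()['Count']
    n_major, n_minor = counts.max(), counts.min()
    return int((n_major - n_minor) / n_minor * 100)  # (13612 - 1680) / 1680 * 100 -> 710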

Generating train data with max SMOTE to get balanced classes.

# In [129]:
d_train_sm = f_smote(d_train,f_smote_nmax(d_train)) # generate smote with max amount

Checking the proportions of the classes with max SMOTE.

# In [130]:
f_class_count(d_train_sm).collect() #check classes count for Majority No:Minority Yes = 1:1

Out [130]:

FLIGHT_RISK Count
0 No 13612
1 Yes 13608
# In [134]:
# Fit with smote balanced No:Yes with all features
HGBT_MODEL=model_fit(d_train_sm)

Prediction statistics.

Out [137]:

1 RECALL 0.945358401880141 No
3 F1_SCORE 0.9600238663484487 No
4 SUPPORT 1702 No
5 RECALL 0.8047619047619048 Yes
8 SUPPORT 210 Yes

RECALL Yes jumps from 0.62 to 0.80, which means SMOTE improves the model.

Fitting the model with max SMOTE and the important features.

# In [144]:
# Fit with smote balanced No:Yes with important features
HGBT_MODEL=model_fit(d_train_sm[cols_imp])

Prediction statistics

Out [146]:

1 RECALL 0.9195064629847238 No
2 PRECISION 0.9726538222498446 No
4 SUPPORT 1702 No
5 RECALL 0.7904761904761904 Yes
8 SUPPORT 210 Yes

RECALL Yes with the important features and max SMOTE is almost the same, 0.79.

9. Balancing test data and increasing training data

I notice that SUPPORT Yes < SUPPORT No in the test data. To fix this, I keep only as many class No rows in the test set as there are class Yes rows, so that SUPPORT No = SUPPORT Yes. The remaining class No rows I add to the training data, because I do not want to waste precious data.
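
A minimal sketch of this rebalancing, assuming hana_ml DataFrame filter, head, and union calls (the variable names and the exact notebook code may differ):

# keep all Yes rows and an equal number of No rows in the test set
d_test_yes = d_test.filter("FLIGHT_RISK = 'Yes'")            # 210 rows
d_test_no = d_test.filter("FLIGHT_RISK = 'No'").head(210)    # keep as many No rows as Yes rows
d_test_bl = d_test_no.union(d_test_yes)                      # balanced test set, 210 : 210
# the remaining No rows from d_test are appended to d_train to form d_train_bl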

Balanced test data.

# In [164]:
f_class_count(d_test_bl).collect() # check test classes are balanced

Out [164]:

FLIGHT_RISK Count
0 Yes 210
1 No 210

The training data now has additional rows.

# In [166]:
d_train_target_count = f_class_count(d_train_bl) # added No from d_test that are not used (unseen)
d_train_target_count.collect()

Out [166]:

FLIGHT_RISK Count
0 No 15104
1 Yes 1680

Applying SMOTE to train data.

# In [168]:
f_class_count(d_train_bl_sm).collect() # check classes count for Majority No : Minority Yes = 1:1

Out [168]:

FLIGHT_RISK Count
0 No 15104
1 Yes 15103
# In [171]:
# Fit with balanced d_train_bl_sm important features
HGBT_MODEL=model_fit(d_train_bl_sm[cols_imp])

Prediction statistics.

# In [173]:
f_score_res(HGBT_MODEL, d_test_bl[cols_imp])[1].collect()

Out [173]:

STAT_NAME STAT_VALUE CLASS_NAME
1 RECALL 0.919047619047619 No
2 PRECISION 0.8427947598253275 No
3 F1_SCORE 0.8792710706150342 No
4 SUPPORT 210 No
5 RECALL 0.8285714285714286 Yes
6 PRECISION 0.9109947643979057 Yes
7 F1_SCORE 0.8678304239401496 Yes
8 SUPPORT 210 Yes
9 ACCURACY 0.8738095238095238 None

RECALL Yes has a small improvement to 0.82.

Since SUPPORT Yes = SUPPORT No, I can now rely on all the statistics. I notice that ACCURACY is 0.87, which is very good.

10. Finding top SMOTE

I want to execute SMOTE in steps of 100 and analyze the ACCURACY from the prediction statistics.

# In [188]:
# SMOTE with steps
# fire on all cylinders
dict_models_smote, df_model_statistics_sm, df_predicted_statistics_sm = \
f_smote_models(d_train_bl, d_test_bl, cols_imp, 100)
# printed output: ls_smote: [0, 100, 200, 300, 400, 500, 600, 700, 799]

This is intensive processing in SAP HANA Cloud, with a wall time of 8min 34s.
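
The helper itself is not shown in the blog; a rough sketch of the loop (my own reconstruction, reusing f_smote, f_smote_nmax, and model_fit from earlier cells) could look like this:

def f_smote_models(p_d_train, p_d_test, p_cols_imp, p_step):
    # hypothetical reconstruction: fit one model per SMOTE amount from 0 up to the maximum
    nmax = f_smote_nmax(p_d_train)
    ls_smote = list(range(0, nmax, p_step)) + [nmax]
    print('ls_smote:', ls_smote)
    lc_models = {}
    for amount in ls_smote:
        d_train_sm = f_smote(p_d_train, amount) if amount > 0 else p_d_train
        lc_models[amount] = model_fit(d_train_sm[p_cols_imp])
    # the notebook additionally collects model and prediction statistics per amount into two pandas DataFrames
    return lc_models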

The model statistics are collected in a local pandas DataFrame.

# In [190]:
df_model_statistics_sm

The prediction statistics are collected in another local pandas DataFrame.

# In [191]:
df_predicted_statistics_sm

Selecting the SMOTE amount with the best prediction ACCURACY using a local function.

# In [194]:
# SUPPORT Yes = SUPPORT No - ACCURACY reliable
f_accuracy_descending(df_predicted_statistics_sm)

Out [194]:

STAT_NAME         ACCURACY
STAT_VALUE_700    0.8738095238095238
STAT_VALUE_200    0.8666666666666667
STAT_VALUE_500    0.8666666666666667
STAT_VALUE_400    0.8642857142857143
STAT_VALUE_600    0.8642857142857143
STAT_VALUE_799    0.8642857142857143
STAT_VALUE_300    0.8547619047619047
STAT_VALUE_100    0.8523809523809524
STAT_VALUE_0      0.7928571428571428

11. Building top models

I fit models with smote_best_amount, adding one feature at a time in order of importance.

# In [207]:
# FEATURES IMPORTANCE TOP n
# fire on all cylinders
def f_smote_model_top_f(p_d_train, p_cols_imp, p_d_test, p_smote_best_amount, p_cols_imp_start, p_cols_imp_step, p_verbosity):
# ...
   return lc_dict_models_top_f, lc_df_model_statistics, lc_df_predicted_statistics
# In [212]:
%%time
# FEATURES IMPORTANCE TOP n
# fire on all cylinders
dict_models_top_f, df_model_statistics_top_f, df_predicted_statistics_top_f = \
f_smote_model_top_f(d_train_bl, sm_best_cols_imp, d_test_bl, smote_best_amount, 1, 1, True)

This is very intensive processing in SAP HANA Cloud, with a wall time of 21min 6s.
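
Only the signature is shown above; a rough sketch of the loop body (my own reconstruction, with tab_id and target as defined earlier in the notebook) might be:

def f_smote_model_top_f(p_d_train, p_cols_imp, p_d_test, p_smote_best_amount, p_cols_imp_start, p_cols_imp_step, p_verbosity):
    # hypothetical reconstruction: refit with a growing, importance-ranked feature list
    lc_dict_models_top_f = {}
    d_train_sm = f_smote(p_d_train, p_smote_best_amount)
    for n in range(p_cols_imp_start, len(p_cols_imp) + 1, p_cols_imp_step):
        cols = [tab_id] + p_cols_imp[:n] + [target]  # key, top-n features, label
        model = model_fit(d_train_sm[cols])
        lc_dict_models_top_f[f'STAT_VALUE_top{n}'] = (model, cols, len(cols), f'top{n}')
        if p_verbosity:
            print(f'fitted model with top {n} features')
    # the notebook also collects model and prediction statistics per feature count
    return lc_dict_models_top_f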

Selecting the feature set with the top ACCURACY using a local function.

# In [217]:
# SUPPORT Yes = SUPPORT No ACCURACY is reliable
f_accuracy_descending(df_predicted_statistics_top_f)[:10]

With the best SMOTE and best top features ACCURACY increases to 0.89.

I stored all models and other dependencies in the dictionary dict_models_top_f to access them easily.

# In [218]:
def f_sel_top_pred_acc(p_df,p_n):
    s =  f_accuracy_descending(p_df) # sorted series
    ls_s = s[:p_n+1].index.to_list() # sorted list with last n+1
    key = ls_s[-1] # last element
    model = dict_models_top_f[key][0]
    cols = dict_models_top_f[key][1]
    cols_len = dict_models_top_f[key][2]
    cols_name = dict_models_top_f[key][3]

    return [key, model, cols, cols_len, cols_name]
# In [219]:
# key, model, cols, cols_len, cols_name Top 1
top_1_imp_ls = f_sel_top_pred_acc(df_predicted_statistics_top_f,1)
top_1_imp_ls[0], top_1_imp_ls[1], top_1_imp_ls[2], top_1_imp_ls[3], top_1_imp_ls[4]

Out [219]:

('STAT_VALUE_top15',
<hana_ml.algorithms.pal.unified_classification.UnifiedClassification at 0x25c02d19430>,
['EMPLOYEE_ID',
'HRTRAINING',
'FUNCTIONALAREACHANGETYPE',
'SICKDAYS',
'PROMOTION_WITHIN_LAST_3_YEARS',
'TIMEINPREVPOSITIONMONTH',
'EMPLOYMENT_TYPE_2',
'PREVIOUS_CAREER_PATH',
'RISK_OF_LOSS',
'CURRENT_COUNTRY',
'SALARY',
'JOBLEVELCHANGETYPE',
'AGE',
'CHANGE_IN_PERFORMANCE_RATING',
'PREVIOUS_COUNTRY',
'LINKEDIN',
'FLIGHT_RISK'],
17,
'top15')

12. Analyzing top models

First I analyze the top 1 reason code for the top 1 model's predictions.

# In [233]:
# Top 1 contributing feature to the prediction
f_score_res(top_1_imp_ls[1], d_test_bl)[0].\
    select('EMPLOYEE_ID', 'SCORE', 'CONFIDENCE', 'REASON_CODE', 
                      ('json_query("REASON_CODE", \'$[0].attr\')', 'Top1'), 
                      ('json_query("REASON_CODE", \'$[0].pct\')', 'PCT_1') ).head(5).collect()

Out[233]:

EMPLOYEE_ID SCORE CONFIDENCE REASON_CODE Top1 PCT_1
0 10033 No 0.998697 [{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,… “SICKDAYS” 28.49847521037374
1 10034 No 0.997572 [{“attr”:”SICKDAYS”,”val”:-2.1461261874013997,… “SICKDAYS” 28.604720941656255
2 10036 Yes 0.884731 [{“attr”:”HRTRAINING”,”val”:1.2600622007181038… “HRTRAINING” 17.883981590687534
3 10056 Yes 0.992384 [{“attr”:”SICKDAYS”,”val”:3.2852997983870959,”… “SICKDAYS” 30.97615277828276
4 10064 No 0.985765 [{“attr”:”HRTRAINING”,”val”:-1.738132753015509… “HRTRAINING” 27.416934166112573

I can compare the test data FLIGHT_RISK (actual) with the prediction SCORE.
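
A possible shape of this helper (my own reconstruction, assuming f_score_res returns the scored result set as its first element):

def f_actual_vs_score(p_model, p_d_test):
    # join the actual label to the scored predictions on the key column
    res = f_score_res(p_model, p_d_test)[0]        # EMPLOYEE_ID, SCORE, CONFIDENCE, REASON_CODE
    actual = p_d_test.select(tab_id, target)       # EMPLOYEE_ID, FLIGHT_RISK
    return actual.set_index(tab_id).join(res.set_index(tab_id))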

# In [236]:
print(f"Top 1 voting with features {top_1_imp_ls[4]}") # Top 1 voting
f_actual_vs_score(top_1_imp_ls[1], d_test_bl).head(5).collect()

Top 1 model (top15 features) prediction and top 1 reason code.

Out [236]:

EMPLOYEE_ID FLIGHT_RISK SCORE CONFIDENCE REASON_CODE
0 10033 No No 0.998697 [{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,…
1 10034 No No 0.997572 [{“attr”:”SICKDAYS”,”val”:-2.1461261874013997,…
2 10036 Yes Yes 0.884731 [{“attr”:”HRTRAINING”,”val”:1.2600622007181038…
3 10056 Yes Yes 0.992384 [{“attr”:”SICKDAYS”,”val”:3.2852997983870959,”…
4 10064 No No 0.985765 [{“attr”:”HRTRAINING”,”val”:-1.738132753015509…

I can further analyze the top 5 REASON_CODE entries with code wrapped in a local function.

# In [240]:
# Top 5 contributing features to the prediction per row
def f_contrib_top5f_head(p_model, p_d_test, p_rows):
    # ... returns the top 5 reason-code features and their contribution percentages for the first p_rows predictions (full code on GitHub)
# In [241]:
f_contrib_top5f_head(top_1_imp_ls[1], d_test_bl, 3)

Out [241]:

EMPLOYEE_ID SCORE CONFIDENCE REASON_CODE Top PCT
0 10033 No 0.998697 [{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,… “SICKDAYS” 28.49847521037374
3 10033 No 0.998697 [{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,… “HRTRAINING” 20.004226264298525
6 10033 No 0.998697 [{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,… “PREVIOUS_CAREER_PATH” 13.468698519836725
9 10033 No 0.998697 [{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,… “LINKEDIN” 6.355941761599763
12 10033 No 0.998697 [{“attr”:”SICKDAYS”,”val”:-2.3005012586457966,… “FUNCTIONALAREACHANGETYPE” 6.23681257187179

13. Visualizing statistics

Presenting the model results with a visualized model report, as explained in the blog "ML in SAP Data Warehouse Cloud".

# In [243]:
from hana_ml.visualizers.unified_report import UnifiedReport
UnifiedReport(top_1_imp_ls[1]).build().display()

Variable Importance

I can analyze the contributions of the features with the shapley_explainer, as presented in the library example.

# In [257]:
from hana_ml.visualizers.model_debriefing import TreeModelDebriefing
shapley_explainer = TreeModelDebriefing.shapley_explainer(
    feature_data=d_test_bl.head(500).select(features_shapely),
    reason_code_data=pred_res_shapely.head(500).select('REASON_CODE'))
shapley_explainer.summary_plot()

Shapley explainer

Confusion matrix for the validation data from model training, as explained in the tutorial on GitHub.

# In [244]:
import matplotlib.pyplot as plt
from hana_ml.visualizers.metrics import MetricsVisualizer
f, ax1 = plt.subplots(1,1)
mv1 = MetricsVisualizer(ax1, title='Model confusion matrix')
ax1 = mv1.plot_confusion_matrix(top_1_imp_ls[1].confusion_matrix_, normalize=False)

Model confusion matrix

# In [245]:
f, ax1 = plt.subplots(1,1)
mv1 = MetricsVisualizer(ax1, title = 'Model confusion matrix %')
ax1 = mv1.plot_confusion_matrix(top_1_imp_ls[1].confusion_matrix_, normalize=True)

Model confusion matrix

The confusion matrix for the predictions is wrapped in local functions.

# In [259]:
def f_cf_mx_values(p_model, p_d_test):
    # ... computes the confusion matrix values and the annotation labels for the test predictions (full code on GitHub)
# In [267]:
import seaborn as sns
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

Prediction confusion matrix
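
A possible sketch of these helpers (my own reconstruction; the actual notebook code may differ), computing the confusion matrix of actual versus predicted labels on the test set and the annotation labels for the heatmap:

import numpy as np
import pandas as pd
import seaborn as sns

def f_cf_mx_values(p_model, p_d_test):
    # cross-tabulate actual FLIGHT_RISK against predicted SCORE for the test rows
    df = f_actual_vs_score(p_model, p_d_test).collect()
    cf_matrix = pd.crosstab(df['FLIGHT_RISK'], df['SCORE']).values   # rows/cols ordered No, Yes
    groups = ['True No', 'False Yes', 'False No', 'True Yes']
    labels = np.asarray([f'{g}\n{c}' for g, c in zip(groups, cf_matrix.flatten())]).reshape(2, 2)
    return cf_matrix, labels

cf_matrix, labels = f_cf_mx_values(top_1_imp_ls[1], d_test_bl[top_1_imp_ls[2]])
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')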

14. Further steps

Regarding further model exploration, I could use as a starting point the Automated Predictive Library's GradientBoostingBinaryClassifier, the tutorial Fraud Detection Auto ML Script, the blog Hands-On Tutorial: Automated Predictive (APL) in SAP HANA Cloud, and the blog SAP HANA Cloud Machine Learning Challenge. I would also increase the test data percentage in the split and use ensemble techniques for voting.

Thinking in terms of transfer learning, as in computer vision: what if we could collect data across different companies and industries, build specialized or generalized models, shake hands with developers, and integrate predictions into applications to be used internally or externally as an on-demand service? The data required for monthly evaluation could be collected from internal and external sources with coded extractors or scraping tools.

Since I am an end-to-end developer of Business Intelligence reports, I am looking forward to getting hands-on experience with serving data from an SAP environment or other external sources to build models, consuming machine learning models, and integrating predictions into business apps.

