Democratising Machine learning with H2O

Overview of H2O: the open source, distributed in-memory machine learning platform

Photo by Pixabay from Pexels

It is important to make AI accessible to everyone for the sake of social and economic stability.

Kaggle Days is a two-day event where data science enthusiasts can meet face to face, exchange knowledge, and compete together. Kaggle Days San Francisco just concluded and, as is customary, Kaggle also organised a hackathon for the participants. I had been following Kaggle Days on Twitter, and the following tweet from Erin LeDell (Chief Machine Learning Scientist at H2O.ai) caught my eye.

Source: Twitter

I have been experimenting with H2O for quite some time and have found it seamless and intuitive for solving ML problems. Seeing it perform so well on the leaderboard, I thought it was time to write an article that makes it easier for others to transition into the world of H2O.

H2O.ai: The company behind H2O

H2O.ai is based in Mountain View, California, and offers a suite of machine learning platforms. H2O’s core strength is its high-performing ML components, which are tightly integrated. H2O.ai was named a Visionary in the Gartner Magic Quadrant for Data Science Platforms in the report released in January 2019.

Source: Gartner (January 2019)

Let’s take a brief look at the offerings of H2O.ai:

H2O.ai Products and Solutions

H2O

H2O is an open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms and also has AutoML functionality. H2O’s core code is written in Java, and its REST API allows access to all the capabilities of H2O from an external program or script. The platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, along with a built-in web interface, Flow.

Since H2O is the main focus of this article, we shall get to know more about it later.
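To make the REST API point a little more concrete, here is a minimal sketch of how an external script might query a running H2O node over plain HTTP. It assumes a node is listening on localhost:54321 (the address used later in this article for Flow) and that the cluster status endpoint is /3/Cloud. The helper names cloud_status_url and fetch_cluster_status are my own and are not part of H2O.

```python
import json
from urllib.request import urlopen


def cloud_status_url(host="localhost", port=54321):
    """Build the URL of the cluster status endpoint (assumed to be /3/Cloud)."""
    return f"http://{host}:{port}/3/Cloud"


def fetch_cluster_status(host="localhost", port=54321, timeout=5):
    """Ask a running H2O node for its cluster status and return the parsed JSON.

    This only works while an H2O node is actually running at host:port.
    """
    with urlopen(cloud_status_url(host, port), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no running server.
print(cloud_status_url())
```

With a node running, calling fetch_cluster_status() should return the JSON description of the cluster as a Python dictionary. The R and Python clients talk to the same kind of HTTP endpoints under the hood.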

H2O Sparkling Water

Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities of Spark. Sparkling Water is ideal for H2O users who need to manage large clusters for their data processing needs and want to transfer data from Spark to H2O (or vice versa).

H2O4GPU

H2O4GPU is an open source, GPU-accelerated machine learning package with APIs in Python and R that allows anyone to take advantage of GPUs to build advanced machine learning models.

H2O Driverless AI

Driverless AI’s UI

H2O Driverless AI is H2O.ai’s flagship product for automatic machine learning. It fully automates some of the most challenging and productive tasks in applied data science such as feature engineering, model tuning, model ensembling and model deployment. With Driverless AI, data scientists of all proficiency levels can train and deploy modelling pipelines with just a few clicks from the GUI. Driverless AI is a commercially licensed product with a 21-day free trial version.

What is H2O

The latest version called H2O-3 is the third incarnation of H2O. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. H2O can easily and quickly derive insights from the data through faster and better predictive modelling.

High-Level Architecture

H2O can import data from multiple sources and has a fast, scalable, distributed compute engine written in Java. Here is a high-level overview of the platform.

A High-Level architecture of h2o

Supported Algorithms

H2O supports many commonly used machine learning algorithms.

Algorithms supported by H2O

Installation

H2O offers an R package that can be installed from CRAN and a Python package that can be installed from PyPI. In this article, I shall be working only with the Python implementation. You may also want to look at the documentation for complete details.

Pre-requisites

Python
Java 7 or later, which you can get at the Java download page. To build H2O or run H2O tests, the 64-bit JDK is required. To run the H2O binary from the command line or through the R or Python packages, only the 64-bit JRE is required.

Dependencies:

pip install requests
pip install tabulate
pip install "colorama>=0.3.8"
pip install future

pip install

pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o

conda

conda install -c h2oai h2o=3.22.1.2

Note: When installing H2O from pip in OS X El Capitan, users must include the --user flag. For example:

pip install -f http://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o --user

For R installation, please refer to the official documentation here.

Testing installation

Every new Python session begins by initializing a connection between the Python client and the H2O cluster. A cluster is a group of H2O nodes that work together; when a job is submitted to a cluster, all the nodes in the cluster work on a portion of the job.
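The idea of every node working on a portion of the job can be illustrated with a toy sketch in plain Python. This is only a conceptual picture and not how H2O is implemented: the "nodes" here are ordinary function calls, and the helper names are made up for the example. Each node computes a partial result on its own chunk of rows, and the partial results are then combined.

```python
def split_into_chunks(rows, n_nodes):
    """Deal the rows out into n_nodes roughly equal chunks."""
    return [rows[i::n_nodes] for i in range(n_nodes)]


def node_partial_sum(chunk):
    """Work done by a single node: a partial sum and a row count."""
    return sum(chunk), len(chunk)


def distributed_mean(rows, n_nodes=4):
    """Combine the per-node partial results into one global mean."""
    partials = [node_partial_sum(chunk) for chunk in split_into_chunks(rows, n_nodes)]
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(row_count for _, row_count in partials)
    return total / count


print(distributed_mean(list(range(1, 101)), n_nodes=4))  # 50.5
```

Because only the small partial results need to be exchanged, the same pattern keeps working when the data is far too large for a single machine.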

To check if everything is in place, open a Jupyter notebook and type in the following:

import h2o
h2o.init()

This starts a local H2O cluster. On executing the cell, some information is printed in tabular form, including the number of nodes, total memory and Python version. If you need to report a bug, make sure you include all this information. Note that h2o.init() first tries to connect to an H2O instance that is already running and starts a new one only if none is found.

Running h2o.init() (in Python)

By default, the H2O instance uses all the cores and about 25% of the system’s memory. If you wish to allocate a fixed amount of memory instead, you can specify it in the init function. Let’s say we want to give the H2O instance 4GB of memory and restrict it to 2 cores.

# Allocate resources: 2 cores and 4 GB of memory (an integer max_mem_size is read as gigabytes)
h2o.init(nthreads=2, max_mem_size=4)

Now our H2O instance is using only 2 cores and around 4GB of memory. However, we will go with the default method.

Importing Data with H2O in Python

After the installation is successful, it’s time to get our hands dirty by working on a real-world dataset. We will be working on a Regression problem using the famous wine dataset. The task here is to predict the quality of white wine on a scale of 0–10 given a set of features as inputs.

Here is a link to the GitHub repository in case you want to follow along, or you can view it on my Binder by clicking the image below.

Data

The data belongs to the white variants of the Portuguese “Vinho Verde” wine.

Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
CSV file: https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

Data Import

Here we import the data from a local CSV file. The command is very similar to pandas.read_csv, and the data is stored in memory as an H2OFrame.

wine_data = h2o.import_file("winequality-white.csv")
wine_data.head(5)  # head() displays the first 10 rows by default; here we ask for 5

Displaying the first 5 rows of the dataset

EDA

Let us explore the dataset to get some insights.

wine_data.describe()

Exploring some of the columns of the dataset

All the features here are numbers and there aren’t any categorical variables. Now let us also look at the correlation of the individual features.

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10,10))
corr = wine_data.cor().as_data_frame()
corr.index = wine_data.columns
sns.heatmap(corr, annot = True, cmap='RdYlGn', vmin=-1, vmax=1)
plt.title("Correlation Heatmap", fontsize=16)
plt.show()
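Each cell of the heatmap is a correlation coefficient between two columns. Assuming cor() returns Pearson coefficients, the number in a cell can be reproduced with the short sketch below, written in plain Python so the arithmetic is visible. The input values are toy numbers and are not taken from the wine dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Toy values, not taken from the wine dataset.
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))  # -1.0
```

Values near 1 or -1 mean two features move together almost linearly, which is why vmin and vmax are fixed at -1 and 1 in the heatmap call.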

Modeling with H2O

We shall build a regression model to predict the quality of the wine. There are a lot of algorithms available in the H2O module for both classification and regression problems.

Splitting data into Test and Training sets

Since we have only one dataset, let’s split it into a training part and a testing part so that we can evaluate the model’s performance. We shall use the split_frame() function.

wine_split = wine_data.split_frame(ratios = [0.8], seed = 1234)

wine_train = wine_split[0] # using 80% for training
wine_test = wine_split[1] #rest 20% for testing

print(wine_train.shape, wine_test.shape)
(3932, 12) (966, 12)
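Notice that 3932 rows is close to, but not exactly, 80% of the 4898 rows. As far as I know, split_frame() assigns rows probabilistically rather than cutting the frame at an exact position. The sketch below imitates that idea with Python's own random module; it is not H2O's algorithm, so the counts it prints will not match the ones above.

```python
import random


def probabilistic_split(n_rows, train_ratio=0.8, seed=1234):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in range(n_rows):
        (train if rng.random() < train_ratio else test).append(row)
    return train, test


# 4898 = 3932 + 966, the row counts printed above.
train_rows, test_rows = probabilistic_split(4898)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one of the two parts, and the training share hovers around 80% without hitting it exactly. Fixing the seed makes the split reproducible from run to run.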

Defining Predictor Variables

predictors = list(wine_data.columns) 
predictors.remove('quality')  # Since we need to predict quality
predictors

Generalized Linear Model

We shall build a Generalized Linear Model (GLM) with default settings. GLMs estimate regression models for outcomes following exponential family distributions. In addition to the Gaussian (i.e. normal) distribution, these include the Poisson, binomial, and gamma distributions. You can read more about GLM in the documentation.

# Import the function for GLM
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Set up GLM for regression
glm = H2OGeneralizedLinearEstimator(family = 'gaussian', model_id = 'glm_default')

# Use .train() to build the model
glm.train(x = predictors,
          y = 'quality',
          training_frame = wine_train)

print(glm)

GLM model’s parameters on the Training set
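With the Gaussian family, a GLM is essentially a linear regression. The sketch below fits the simplest possible case, one predictor and no regularization, using the closed-form least-squares solution. It is meant only to show what the intercept and coefficient in the output represent. As far as I know, H2O's GLM applies regularization by default, so its coefficients will generally differ from a plain least-squares fit. The data is a toy example, not the wine dataset.

```python
def fit_simple_linear(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated from y = 1 + 2x, not from the wine dataset.
intercept, slope = fit_simple_linear([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)  # 1.0 2.0
```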

Now, let’s check the model’s performance on the test dataset:

glm.model_performance(wine_test)
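For a regression model, a performance report of this kind typically includes error measures such as MSE, RMSE and MAE. Here is a small sketch of how those three are computed, using made-up quality scores rather than real model output. The function name regression_metrics is my own.

```python
import math


def regression_metrics(actual, predicted):
    """Return MSE, RMSE and MAE for paired actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Made-up quality scores, purely for illustration.
metrics = regression_metrics(actual=[6, 5, 7, 6], predicted=[5, 5, 8, 6])
print(metrics["mse"], round(metrics["rmse"], 4), metrics["mae"])  # 0.5 0.7071 0.5
```

RMSE is in the same units as the target, so an RMSE of about 0.7 would mean predictions are typically off by a bit less than one quality point.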

Making Predictions

Now we use the GLM model to make predictions on the test dataset.

predictions = glm.predict(wine_test)
predictions.head(5)

Similarly, you could use other supervised algorithms like Distributed Random Forest, Gradient Boosting Machines, and even Deep Learning. You could also tune the hyperparameters.
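Hyperparameter tuning usually starts by listing the candidate values for each setting and then trying every combination. H2O ships its own grid search utility for this; the sketch below shows only the underlying idea in plain Python. The candidate values are hypothetical, and expand_grid is a helper written for this example.

```python
from itertools import product


def expand_grid(grid):
    """Turn a dict of candidate values into a list of parameter combinations."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# Hypothetical candidate values for a tree-based model.
candidates = expand_grid({"ntrees": [50, 100], "max_depth": [3, 5, 7]})
print(len(candidates))  # 6
print(candidates[0])    # {'ntrees': 50, 'max_depth': 3}
```

Each dictionary would be passed to a model, trained, and scored on held-out data, and the best-scoring combination kept.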

H2OAutoML: Automatic Machine Learning

Automated machine learning (AutoML) is the process of automating the end-to-end application of machine learning to real-world problems. AutoML makes machine learning accessible even to people with no major expertise in the field. H2O’s AutoML automates the training and tuning of models.

H2O AutoML: Available Algos

In this section, we shall be using the AutoML capabilities of H2O to work on the same regression problem of predicting wine quality.

Importing the AutoML Module

from h2o.automl import H2OAutoML
aml = H2OAutoML(max_models = 20, max_runtime_secs=100, seed = 1)

Here AutoML will train at most 20 base models and run for at most 100 seconds, stopping at whichever limit is reached first. The default runtime is 1 hour.
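My understanding is that max_models and max_runtime_secs act as two independent budgets, and the run stops as soon as either one is used up. The sketch below mimics that stopping logic with a stand-in training function. It is a simplified picture, not H2O's scheduler, and run_with_budget is a name invented for the example.

```python
import time


def run_with_budget(train_one_model, max_models=20, max_runtime_secs=100):
    """Call train_one_model repeatedly until either budget is exhausted."""
    started = time.monotonic()
    models = []
    while len(models) < max_models:
        if time.monotonic() - started >= max_runtime_secs:
            break  # time budget spent before the model budget
        models.append(train_one_model(len(models)))
    return models


# A stand-in "training" function that returns a label instantly.
trained = run_with_budget(lambda i: f"model_{i}", max_models=3, max_runtime_secs=100)
print(trained)  # ['model_0', 'model_1', 'model_2']
```

With only 100 seconds on a real dataset, the time budget is often the one that runs out first, so fewer models may be trained than max_models allows.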

Training

aml.train(x=predictors, y='quality', training_frame=wine_train, validation_frame=wine_test)

Leaderboard

Now let us look at the AutoML leaderboard.

print(aml.leaderboard)

AutoML Leaderboard

The leaderboard displays the top 10 models built by AutoML along with their performance metrics. The best model, placed at the top, is a Stacked Ensemble.

The leader model is stored as aml.leader.

Contribution of Individual Models

Let us look at the contribution of the individual models for this meta-learner.

metalearner = h2o.get_model(aml.leader.metalearner()['name'])
metalearner.std_coef_plot()

XRT (Extremely Randomized Trees) has the maximum contribution, followed by Distributed Random Forests.
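In a stacked ensemble, the metalearner is itself a model whose inputs are the base models' predictions. Assuming a linear metalearner (I believe a GLM is H2O's default), each coefficient in the plot is the weight given to one base model. The sketch below shows how such weights turn base predictions into a final prediction for a single row. All numbers are hypothetical and are not read from the model trained above.

```python
def ensemble_predict(base_predictions, coefficients, intercept=0.0):
    """Weighted sum of the base models' predictions for a single row."""
    return intercept + sum(
        coefficients[name] * prediction
        for name, prediction in base_predictions.items()
    )


# Hypothetical coefficients and predictions, not read from the model above.
coefficients = {"XRT": 0.5, "DRF": 0.3, "GBM": 0.2}
base_predictions = {"XRT": 6.0, "DRF": 5.0, "GBM": 7.0}
print(round(ensemble_predict(base_predictions, coefficients), 2))  # 5.9
```

A base model with a larger coefficient pulls the final prediction more strongly towards its own output, which is what a taller bar in the plot indicates.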

Predictions

preds = aml.leader.predict(wine_test)

The code above is the quickest way to get started. To learn more about H2O AutoML, it is worth taking a look at the in-depth AutoML tutorial (available in R and Python).

Shutting Down

h2o.shutdown()

Using Flow — H2O’s Web UI

In the final leg of this article, let us have a quick overview of H2O’s open source web UI called Flow. Flow is a web-based interactive computational environment where you can combine code execution, text, mathematics, plots and rich media into a single document, much like Jupyter Notebooks.

Launching Flow

Once H2O is up and running, all you need to do is point your browser to http://localhost:54321 and you’ll see the Flow user interface.

Launching H2O flow

Flow Interface

Here is a quick look at the Flow interface. You can read more about using and working with it here.

H2O’s flow interface

Flow is designed to help data scientists rapidly and easily create models, import files, split data frames and do all the things that would normally require quite a bit of typing in other environments.

Working

Let’s work through the same wine example, but this time with Flow. The following video walks through model building and prediction using Flow and is largely self-explanatory.

Demonstration of H2O Flow

Conclusion

H2O is a powerful tool and, given its capabilities, it can really transform the data science process for good. The capabilities and advantages of AI should be made available to everybody and not a select few. This is the real essence of democratisation, and democratising data science is essential for resolving the real problems threatening our planet.
