
Kubeflow Components and Pipelines


I want to keep things simple, so this post covers components, pipelines and experiments. With pipelines and components you get the basics required to build ML workflows.

There are many more tools integrated into Kubeflow, and I will cover them in upcoming posts.

Kubeflow originated at Google. Its mission is to make deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

Photo by Ryan Quintal on Unsplash

Goals

  • Demonstrate how to build pipelines.
  • Demonstrate how to create components.
  • Demonstrate how to use components.
  • Demonstrate how to run pipelines and experiments inside of a Notebook.
  • Provide easy-to-understand and ready-to-use examples.

Pipelines

Component

Code that performs one step in the pipeline; in other words, a containerized implementation of an ML task.

A component is analogous to a function: it has a name, parameters, return values and a body. Each component in a pipeline executes independently and has to be packaged as a Docker image.

The components.

Graph

The representation of the dependencies between the components. It shows the steps your pipeline executes.

The graph.

Pipeline

A pipeline describes the machine learning workflow; it includes the components and the graph.

The pipeline.

Run

A run is a single execution of a pipeline. All runs are kept so they can be compared.

A run is a single execution of a Pipeline.

Recurring runs

Recurring runs can be used to repeat a pipeline run on a schedule, which is useful if we want to train an updated model version on new data regularly.
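
For illustration, here is a minimal sketch of how a recurring run could be created with the kfp SDK client (covered later in this post); the experiment name, job name, cron expression and pipeline package path are placeholders, not part of the example repository.

import kfp

# Sketch only: schedule a compiled pipeline to run every day at 02:00.
# All names and paths below are placeholders.
client = kfp.Client()
experiment = client.create_experiment('recurring-training')
job = client.create_recurring_run(
    experiment_id=experiment.id,
    job_name='nightly-training',
    cron_expression='0 0 2 * * *',  # Kubeflow cron format includes a seconds field
    pipeline_package_path='sample_pipeline.pipeline.zip')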

Experiment

Similar to a workspace, it contains the different runs, which can be compared.

An overview of all runs in this specific experiment.

Component Types

Kubeflow contains two types of components, one for rapid development and one for re-usability.

Lightweight Component

Used for fast development in a notebook environment. Fast and easy because there is no need to build container images.

Lightweight components cannot be reused.
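
As a minimal sketch (assuming the kfp SDK is installed), a lightweight component is just a Python function that gets converted into a pipeline operation; the function here is a made-up placeholder, not part of the example repository.

import kfp.components as comp

def add(a: float, b: float) -> float:
    # Placeholder logic standing in for a real ML task.
    return a + b

# Convert the function into a pipeline operation; no container image build is needed.
add_op = comp.func_to_container_op(add)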

Reusable Component

Can be reused by loading it into a Kubeflow pipeline. It is a containerized component.

Requires more implementation time.

Reusable Component

In this section, you will get the basics of a reusable component.

Component Structure

A component itself is simple and consists of just a few parts:

  • The component logic.
  • A component specification as YAML.
  • A Dockerfile, which is required to build the container.
  • A readme to explain the component and its inputs and outputs.
  • A helper script to build the component and push it to a Docker repository.
Parts of a reusable Kubeflow component.

Component Specification

This specification describes the container component data model for Kubeflow Pipelines. It is written in YAML format (component.yaml) and has three parts:

  • Metadata describes the component itself, such as its name and description.
  • Interface defines the inputs and outputs of the component.
  • Implementation specifies how the component should be executed (see the sketch below).
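
To make this concrete, here is a hedged sketch of what such a component.yaml could look like for a training component; the input and output names, command and file path are assumptions, only the image name matches the build script shown later.

name: Train model
description: Trains a model and writes it to the given output path
inputs:
  - {name: training_data_path, type: String, description: 'Path to the training data'}
outputs:
  - {name: model_path, type: String, description: 'Path of the trained model'}
implementation:
  container:
    image: gcr.io/ml-training/kubeflow/training/train:latest
    command: [python, /app/train.py]
    args: [
      --training-data-path, {inputValue: training_data_path},
      --model-path, {outputPath: model_path},
    ]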

Handling Input

A component usually requires some kind of input, like a path to our training data or the name of our model. It can consume multiple inputs.

  • Define the inputs in the component.yaml.
  • Define the inputs as arguments for our container.
  • Parse the arguments in the component logic (see the sketch below).
The training component might require training data as input and produces a model as output.
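
As a sketch of the last step, the component logic can parse those arguments with Python's argparse; the argument names are assumptions and would have to match the component.yaml.

# train.py (sketch) -- argument names are assumptions.
import argparse

def parse_arguments():
    parser = argparse.ArgumentParser(description='Training component')
    parser.add_argument('--training-data-path', type=str, required=True,
                        help='Path to the training data')
    parser.add_argument('--model-path', type=str, required=True,
                        help='Where the trained model is written')
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_arguments()
    print('Training with data from', args.training_data_path)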

Handling Output

The output is required to pass data between components. It is important to know that each component in a pipeline executes independently.

  • Components run in different processes and cannot share data.
  • The process for passing small data differs from large data.

For small data

  • Values can be passed directly as output.

For large data

  • Large data has to be serialized to files to be passed between components.
  • Upload the data to our storage system.
  • Pass a reference to this file to the next component.
  • The next component in the pipeline takes this reference and downloads the serialized data (see the sketch below).
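
Here is a hedged sketch of both cases, assuming Google Cloud Storage as the storage system; the bucket and file names are placeholders.

import os
from google.cloud import storage

def write_small_output(output_path: str, value: str):
    # Small values are written to the file path Kubeflow provides via {outputPath: ...};
    # the parent directory may not exist inside the container yet.
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, 'w') as f:
        f.write(value)

def upload_large_output(local_file: str, bucket_name: str, blob_name: str) -> str:
    # Large data is uploaded to storage; only the reference is passed to the next component.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_file)
    return 'gs://{}/{}'.format(bucket_name, blob_name)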

Dockerize Component

Each component is a container image, which requires a Dockerfile in order to build the image.

After the image is built, we push the component container image to the Google Container Registry. Building and uploading the image takes just a few lines of code:

# build_image.sh
image_name=gcr.io/ml-training/kubeflow/training/train
image_tag=latest
full_image_name=${image_name}:${image_tag}
docker build -t "${full_image_name}" .
docker push "$full_image_name"

With the first docker push, you might get the following error message:

You don’t have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

In that case, just run the following gcloud command and push again:

$ gcloud auth configure-docker

Use the Pipeline

Load the Component

Everyone with access to the Docker repository and the component.yaml can use the component in their pipelines.

Load a component from a component.yaml URL.

The component can then be loaded based on the component.yaml.

import kfp

# Load the reusable component from its published component.yaml.
operation = kfp.components.load_component_from_url(
    'https://location-to-your-yaml/component.yaml')
help(operation)

Create the Pipeline

The dsl package is provided by the pipeline SDK and is used to define and interact with pipelines. dsl.pipeline is a decorator for Python functions that define a pipeline.

from kfp import dsl

@dsl.pipeline(
    name='The name of the pipeline',
    description='The description of the pipeline'
)
def sample_pipeline(parameter):
    # Each step calls the operation loaded from the component.yaml.
    concat = operation(parameter=parameter)

Compile the Pipeline

To compile the pipeline we use the compiler.Compiler().compile() function, which is again part of the pipeline SDK. The compiler generates a compressed YAML definition which Kubeflow Pipelines uses to create the Kubernetes execution resources.

from kfp import compiler

pipeline_func = sample_pipeline
pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'
compiler.Compiler().compile(pipeline_func,
                            pipeline_filename)

Create an Experiment

Pipelines are always part of an experiment. Experiments can be created with the Kubeflow Pipelines client kfp.Client() and cannot be removed at the moment.

EXPERIMENT_NAME = 'sample-experiment'  # placeholder, choose your own experiment name

client = kfp.Client()
try:
    experiment = client.get_experiment(experiment_name=EXPERIMENT_NAME)
except Exception:
    # The experiment does not exist yet, so create it.
    experiment = client.create_experiment(EXPERIMENT_NAME)
    
print(experiment)

Run the Pipeline

To run a pipeline we use the experiment ID and the compiled pipeline created in the previous steps. client.run_pipeline runs the pipeline and provides a direct link to the Kubeflow experiment.

run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, 
                                 run_name, 
                                 pipeline_filename)

Examples on GitHub

I created a basic pipeline which demonstrates everything presented in this post. To keep things simple, the pipeline does not contain any ML-specific implementation.

https://github.com/SaschaHeyer/Machine-Learning-Training/tree/master/kubeflow/reusable-component-training

