
Kubeflow Components and Pipelines


I want to keep things simple, so this post covers components, pipelines and experiments. With pipelines and components you get the basics required to build ML workflows.

There are many more tools integrated into Kubeflow, and I will cover them in upcoming posts.

Kubeflow originated at Google. Its mission is to make deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

Photo by Ryan Quintal on Unsplash

Goals

  • Demonstrate how to build pipelines.
  • Demonstrate how to create components.
  • Demonstrate how to use components.
  • Demonstrate how to run pipelines and experiments inside of a Notebook.
  • Provide easy-to-understand and ready-to-use examples.

Pipelines

Component

Code that performs one step in the pipeline; in other words, a containerized implementation of an ML task.

A component is analogous to a function: it has a name, parameters, return values and a body. Each component in a pipeline executes independently and has to be packaged as a Docker image.

The components.

Graph

The representation of the dependencies between the components. It shows the steps your pipeline executes.

The graph.

Pipeline

A pipeline describes the machine learning workflow; it includes the components and the graph.

The pipeline.

Run

A run is a single execution of a pipeline. All runs are kept so they can be compared.

A run is a single execution of a Pipeline.

Recurring runs

Recurring runs can be used to repeat a pipeline run on a schedule, which is useful if we want to train an updated model version on new data regularly.
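
For illustration, here is a minimal sketch of how a recurring run could be created with the kfp SDK client (covered later in this post); the experiment name, job name, cron expression and pipeline package path are placeholders, not part of the example repository.

import kfp

# Sketch only: schedule a compiled pipeline to run every day at 02:00.
# All names and paths below are placeholders.
client = kfp.Client()
experiment = client.create_experiment('recurring-training')
job = client.create_recurring_run(
    experiment_id=experiment.id,
    job_name='nightly-training',
    cron_expression='0 0 2 * * *',  # Kubeflow cron format includes a seconds field
    pipeline_package_path='sample_pipeline.pipeline.zip')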

Experiment

Similar to a workspace, it contains the different runs, which can be compared.

An overview of all runs in this specific experiment.

Component Types

Kubeflow contains two types of components, one for rapid development and one for re-usability.

Lightweight Component

Used for fast development in a notebook environment. Fast and easy because there is no need to build container images.

Lightweight components cannot be reused.
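
As a minimal sketch (assuming the kfp SDK is installed), a lightweight component is just a Python function that gets converted into a pipeline operation; the function here is a made-up placeholder, not part of the example repository.

import kfp.components as comp

def add(a: float, b: float) -> float:
    # Placeholder logic standing in for a real ML task.
    return a + b

# Convert the function into a pipeline operation; no container image build is needed.
add_op = comp.func_to_container_op(add)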

Reusable Component

Can be reused by loading it into a Kubeflow pipeline. It is a containerized component.

Requires more implementation time.

Reusable Component

In this section, you will get the basics of a reusable component.

Component Structure

A component itself is simple and consists of just a few parts:

  • The component logic.
  • A component specification as YAML.
  • A Dockerfile, which is required to build the container.
  • A readme to explain the component and its inputs and outputs.
  • A helper script to build the component and push it to a Docker repository.
Parts of a reusable Kubeflow component.

Component Specification

This specification describes the container component data model for Kubeflow Pipelines. It is written in YAML format (component.yaml) and has three parts:

  • Metadata describes the component itself, such as its name and description.
  • Interface defines the inputs and outputs of the component.
  • Implementation specifies how the component should be executed (see the sketch below).
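
To make this concrete, here is a hedged sketch of what such a component.yaml could look like for a training component; the input and output names, command and file path are assumptions, only the image name matches the build script shown later.

name: Train model
description: Trains a model and writes it to the given output path
inputs:
  - {name: training_data_path, type: String, description: 'Path to the training data'}
outputs:
  - {name: model_path, type: String, description: 'Path of the trained model'}
implementation:
  container:
    image: gcr.io/ml-training/kubeflow/training/train:latest
    command: [python, /app/train.py]
    args: [
      --training-data-path, {inputValue: training_data_path},
      --model-path, {outputPath: model_path},
    ]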

Handling Input

A component usually requires some kind of input, like a path to our training data or the name of our model. It can consume multiple inputs.

  • Define the inputs in the component.yaml.
  • Define the inputs as arguments for our container.
  • Parse the arguments in the component logic (see the sketch below).
The training component might require training data as input and produces a model as output.
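
As a sketch of the last step, the component logic can parse those arguments with Python's argparse; the argument names are assumptions and would have to match the component.yaml.

# train.py (sketch) -- argument names are assumptions.
import argparse

def parse_arguments():
    parser = argparse.ArgumentParser(description='Training component')
    parser.add_argument('--training-data-path', type=str, required=True,
                        help='Path to the training data')
    parser.add_argument('--model-path', type=str, required=True,
                        help='Where the trained model is written')
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_arguments()
    print('Training with data from', args.training_data_path)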

Handling Output

The output is required to pass data between components. It is important to know that each component in a pipeline executes independently.

  • Components run in different processes and cannot share data.
  • The process for passing small data differs from large data.

For small data

  • Values can be passed directly as output.

For large data

  • Large data has to be serialized to files to be passed between components.
  • Upload the data to our storage system.
  • Pass a reference to this file to the next component.
  • The next component in the pipeline takes this reference and downloads the serialized data (see the sketch below).
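
Here is a hedged sketch of both cases, assuming Google Cloud Storage as the storage system; the bucket and file names are placeholders.

import os
from google.cloud import storage

def write_small_output(output_path: str, value: str):
    # Small values are written to the file path Kubeflow provides via {outputPath: ...};
    # the parent directory may not exist inside the container yet.
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, 'w') as f:
        f.write(value)

def upload_large_output(local_file: str, bucket_name: str, blob_name: str) -> str:
    # Large data is uploaded to storage; only the reference is passed to the next component.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_file)
    return 'gs://{}/{}'.format(bucket_name, blob_name)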

Dockerize Component

Each component is a container image, which requires a Dockerfile in order to build the image.

After the image is built, we push the component container image to the Google Container Registry. Building and uploading the image takes just a few lines of code:

# build_image.sh
image_name=gcr.io/ml-training/kubeflow/training/train
image_tag=latest
full_image_name=${image_name}:${image_tag}
docker build -t "${full_image_name}" .
docker push "$full_image_name"

With the first docker push, you might get the following error message:

You don’t have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

In that case, just run the following gcloud command and push again:

$ gcloud auth configure-docker

Use the Pipeline

Load the Component

Everyone with access to the Docker repository and the component.yaml can use the component in their pipelines.

Load a component from a component.yaml URL.

The component can then be loaded based on the component.yaml.

import kfp

# Load the reusable component from its published component.yaml.
operation = kfp.components.load_component_from_url(
    'https://location-to-your-yaml/component.yaml')
help(operation)

Create the Pipeline

The dsl package is provided by the pipeline SDK and is used to define and interact with pipelines. dsl.pipeline is a decorator for Python functions that define a pipeline.

from kfp import dsl

@dsl.pipeline(
    name='The name of the pipeline',
    description='The description of the pipeline'
)
def sample_pipeline(parameter):
    # Each step calls the operation loaded from the component.yaml.
    concat = operation(parameter=parameter)

Compile the Pipeline

To compile the pipeline we use the compiler.Compiler().compile() function, which is again part of the pipeline SDK. The compiler generates a compressed YAML definition which Kubeflow Pipelines uses to create the Kubernetes execution resources.

from kfp import compiler

pipeline_func = sample_pipeline
pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'
compiler.Compiler().compile(pipeline_func,
                            pipeline_filename)

Create an Experiment

Pipelines are always part of an experiment. Experiments can be created with the Kubeflow Pipelines client kfp.Client() and cannot be removed at the moment.

EXPERIMENT_NAME = 'sample-experiment'  # placeholder, choose your own experiment name

client = kfp.Client()
try:
    experiment = client.get_experiment(experiment_name=EXPERIMENT_NAME)
except Exception:
    # The experiment does not exist yet, so create it.
    experiment = client.create_experiment(EXPERIMENT_NAME)
    
print(experiment)

Run the Pipeline

To run a pipeline we use the experiment ID and the compiled pipeline created in the previous steps. client.run_pipeline runs the pipeline and provides a direct link to the Kubeflow experiment.

run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, 
                                 run_name, 
                                 pipeline_filename)

Examples on GitHub

I created a basic pipeline which demonstrates everything presented in this post. To keep things simple, the pipeline does not contain any ML-specific implementation.

https://github.com/SaschaHeyer/Machine-Learning-Training/tree/master/kubeflow/reusable-component-training

