Kubeflow Components and Pipelines
source link: https://www.tuicool.com/articles/QBVFNfa
I want to keep things simple, so we cover components, pipelines, and experiments. With pipelines and components, you get the basics required to build ML workflows.
There are many more tools integrated into Kubeflow and I will cover them in the upcoming posts.
Kubeflow originated at Google. Its mission: making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.
Goals
- Demonstrate how to build pipelines.
- Demonstrate how to create components.
- Demonstrate how to use components.
- Demonstrate how to run pipelines and experiments inside of a Notebook.
- Easy-to-understand and ready-to-use examples.
Pipelines
Component
Code that performs one step in the pipeline. In other words, a containerized implementation of an ML task.
A component is analogous to a function: it has a name, parameters, return values and a body. Each component in a pipeline executes independently and has to be packaged as a Docker image.
Graph
The representation of the relationships between the components. It shows the steps your pipeline executes.
Pipeline
A pipeline describes the machine learning workflow, it includes the components and the graph.
Run
A run is a single execution of a pipeline; all runs are kept for comparison.
Recurring runs
Recurring runs can be used to repeat a pipeline run. This is useful if we want to train an updated model version on new data on a schedule.
Experiment
Similar to a workspace, an experiment contains the different runs. Runs can be compared.
Component Types
Kubeflow contains two types of components: one for rapid development and one for reusability.
Lightweight Component
Used for fast development in a notebook environment. Fast and easy because there is no need to build container images.
Lightweight components cannot be reused.
Reusable Component
Can be reused by loading it into a Kubeflow pipeline. It is a containerized component.
Requires more implementation time.
Reusable Component
In this section, you will get the basics of a reusable component.
Component Structure
A component itself is simple and consists of just a few parts:
- The component logic
- A component specification as YAML.
- A Dockerfile which is required to build the container.
- A readme to explain the component and its inputs and outputs.
- A helper script to build the component and push it to a Docker repository.
Component Specification
This specification describes the container component data model for Kubeflow Pipelines. It is written in YAML format (component.yaml).
- Metadata describes the component itself, such as its name and description.
- Interface defines the input and the output of the component.
- Implementation specifies how the component should be executed.
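Putting the three parts together, a component.yaml might look like the following sketch; the component name, image, and argument names are hypothetical:

```yaml
name: Train model
description: Trains a model on the given training data
inputs:
  - {name: training_data_path, type: String}
outputs:
  - {name: model_path, type: String}
implementation:
  container:
    image: gcr.io/ml-training/kubeflow/training/train:latest
    command: [python, train.py]
    args: [
      --training-data-path, {inputValue: training_data_path},
      --model-path, {outputPath: model_path},
    ]
```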
Handling Input
A component usually requires some kind of input, like a path to our training data or the name of our model. It can consume multiple inputs.
- Define inputs in the component.yaml
- Define the inputs as arguments for our container.
- Parse arguments in the component logic.
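The third step, parsing the arguments inside the component logic, can be sketched with the standard library's argparse; the argument names below are hypothetical:

```python
import argparse

def parse_component_args(argv=None):
    """Parses the arguments the container receives from the pipeline."""
    parser = argparse.ArgumentParser(description='Training component')
    parser.add_argument('--training-data-path', type=str, required=True)
    parser.add_argument('--model-name', type=str, default='model')
    return parser.parse_args(argv)

# Inside the container, parse_component_args() would read sys.argv;
# here we pass the values explicitly for illustration.
args = parse_component_args(['--training-data-path', 'gs://my-bucket/data.csv'])
```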
Handling Output
The output is required to pass data between components. It is important to know that each component in a pipeline executes independently.
- Components run in different processes and cannot share data.
- The process for passing small data differs from large data.
For small data
- Values can be passed directly as output.
For large data
- Large data has to be serialized to files to be passed between components.
- Upload the data to our storage system.
- And pass a reference to this file to the next component.
- The next component in the pipeline will take this reference and download the serialized data.
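The large-data steps above can be sketched as follows; `upload_to_storage` is a hypothetical helper standing in for whatever storage client (e.g. GCS) the component actually uses:

```python
import json
import os
import tempfile

def write_output_reference(data, output_ref_path):
    """Serializes data to a file and writes a reference for the next component."""
    # Serialize the large artifact to a local file.
    fd, artifact_path = tempfile.mkstemp(suffix='.json')
    with os.fdopen(fd, 'w') as f:
        json.dump(data, f)
    # In a real component the file would be uploaded first, e.g.:
    #   uri = upload_to_storage(artifact_path)  # hypothetical helper
    uri = 'file://' + artifact_path
    # Only the reference crosses the component boundary; the next component
    # downloads and deserializes the file itself.
    with open(output_ref_path, 'w') as f:
        f.write(uri)
    return uri
```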
Dockerize Component
Each component is a container image, which requires a Dockerfile in order to build the image.
After the image is built, we push the component container image to the Google Container Registry.
Building and uploading the container image to the Google Container Registry takes just a few lines of code:
# build_image.sh
image_name=gcr.io/ml-training/kubeflow/training/train
image_tag=latest
full_image_name=${image_name}:${image_tag}

docker build -t "${full_image_name}" .
docker push "$full_image_name"
It can happen that the first docker push fails with the following error message:
You don’t have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
In that case, just run the following gcloud command and push again:
$ gcloud auth configure-docker
Use the Pipeline
Load the Component
Everyone with access to the Docker repository and the component.yaml can use the component in pipelines.
The component is then loaded based on its component.yaml:
operation = kfp.components.load_component_from_url(
    'https://location-to-your-yaml/component.yaml')
help(operation)
Create the Pipeline
The dsl package is provided by the pipeline SDK and is used to define and interact with pipelines. dsl.pipeline is a decorator for Python functions which returns a pipeline.
@dsl.pipeline(
name='The name of the pipeline',
description='The description of the pipeline'
)
def sample_pipeline(parameter):
    concat = operation(parameter=parameter)
Compile the Pipeline
To compile the pipeline we use the compiler.Compiler().compile()
function, which is again part of the pipeline SDK. The compiler generates a YAML definition which is used by Kubernetes to create the execution resources.
pipeline_func = sample_pipeline
pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'
compiler.Compiler().compile(sample_pipeline, pipeline_filename)
Create an Experiment
Pipelines are always part of an experiment, which can be created with the Kubeflow Pipelines client kfp.Client(). Experiments cannot be removed at the moment.
client = kfp.Client()
try:
    experiment = client.get_experiment(experiment_name=EXPERIMENT_NAME)
except:
    experiment = client.create_experiment(EXPERIMENT_NAME)
print(experiment)
Run the Pipeline
To run a pipeline we use the experiment ID and the compiled pipeline created in the previous steps. client.run_pipeline
runs the pipeline and provides a direct link to the Kubeflow experiment.
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename)
Examples on GitHub
I created a basic pipeline which demonstrates everything presented in this post. To keep things simple the pipeline does not contain any ML specific implementation.