How to train and deploy in Azure ML

Created: July 18, 2022

Topic: Azure ML

Introduction

My goal for this post is to show you the simplest way to train your model and deploy it in the cloud, using Azure ML. Why would you want to train and deploy in the cloud? Training in the cloud will allow you to handle larger ML models and datasets than you could train on your development machine. And deploying your model in the cloud will allow your system to scale to many more inference requests than a development machine could handle.

I recommend reading my introduction to Azure ML before reading this post, but you should still be able to follow along even if you don’t.

Any task you want to accomplish using Azure ML can be done in three ways: using the Azure ML CLI, the Python SDK, or the Studio UI. This post will cover all three, which I hope will enable you to choose the best approach for each scenario you encounter in the future.

I’ve created two GitHub repositories to accompany this post. The first GitHub repo shows how to train and deploy using the Azure ML CLI, and the second GitHub repo shows how to use the Python SDK.

Azure setup

The following steps describe how to set up your machine to start using Azure ML:

You need to have an Azure subscription. You can get a free subscription to try it out.
Create a resource group.
Create a new machine learning workspace by following the “Create the workspace” section of the documentation. Keep in mind that you’ll be creating a “machine learning workspace” Azure resource, not a “workspace” Azure resource, which is entirely different!
If you have access to GitHub Codespaces, click on the “Code” button in either GitHub repo (aml_command_cli or aml_command_sdk), select the “Codespaces” tab, and then click on “New codespace.”
Alternatively, if you plan to use your local machine:
- Install the Azure CLI by following the instructions in the documentation.
- Install the ML extension to the Azure CLI by following the “Installation” section of the documentation.
In a terminal window, log in to Azure by executing az login --use-device-code.
Set your default subscription by executing az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>". You can verify your default subscription by executing az account show, or by looking at ~/.azure/azureProfile.json.
Set your default resource group and workspace by executing az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_WORKSPACE>". You can verify your defaults by executing az configure --list-defaults or by looking at ~/.azure/config.
You can now open the Azure Machine Learning studio, where you’ll be able to see and manage all the machine learning resources we’ll be creating.
Although not essential to run the code in this post, I highly recommend installing the Azure Machine Learning extension for VS Code.
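If you prefer to check the defaults from a script rather than by opening ~/.azure/config, the sketch below reads them with Python's standard library. It assumes the file uses INI syntax with a [defaults] section, and the group and workspace values are placeholders, not real resource names.

```python
import configparser

# Sample contents of ~/.azure/config after running `az configure --defaults ...`.
# The group and workspace values are placeholders for illustration only.
SAMPLE_CONFIG = """\
[defaults]
group = my-resource-group
workspace = my-workspace
"""


def read_azure_defaults(config_text: str) -> dict:
    """Return the key/value pairs stored in the [defaults] section, if any."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("defaults"):
        return {}
    return dict(parser["defaults"])


defaults = read_azure_defaults(SAMPLE_CONFIG)
print(defaults)
```

To inspect your real settings, pass in the text of the file instead, for example with Path.home() / ".azure" / "config" from pathlib.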

Project setup

If you have access to GitHub Codespaces, click on the “Code” button in each GitHub repo, select the “Codespaces” tab, and then click on “New codespace.”

Alternatively, you can set up your local machine with the right conda environment using the following steps.

Install and activate the conda environment for the CLI project:

cd aml_command_cli
conda env create -f environment.yml
conda activate aml_command_cli

In a different terminal tab, repeat the steps for the SDK project:

cd aml_command_sdk
conda env create -f environment.yml
conda activate aml_command_sdk

Training and inference on your development machine

Your development machine may be your local machine, a GitHub Codespace, or an Azure ML compute instance. I’ll first discuss how to train a machine learning model and use it for inference on your development machine. In later sections, I’ll cover how you can use Azure ML to train and deploy your model at scale in the cloud.

The data we’ll be using in our example is small, so we’ll run our training and inference code using the complete dataset during development. If your data is large, you may need to use just a subset during development. It’s always a good idea to test your code out on your dev machine first, since that involves less overhead than running in the cloud.
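For larger datasets, one simple way to get a development subset is to pick a fixed random sample of example indices. The sketch below is only an illustration of that idea: the function name and the subset size are made up, and the accompanying repos don't need it because they train on the full dataset.

```python
import random


def pick_subset_indices(dataset_size: int, subset_size: int, seed: int = 0) -> list:
    """Pick a reproducible random subset of example indices."""
    subset_size = min(subset_size, dataset_size)
    rng = random.Random(seed)  # fixed seed, so every run selects the same examples
    return sorted(rng.sample(range(dataset_size), subset_size))


# The Fashion MNIST training set has 60,000 images; keep 1,000 of them.
indices = pick_subset_indices(dataset_size=60_000, subset_size=1_000)
print(len(indices))
```

Using a fixed seed keeps the subset stable between runs, which makes it easier to compare results while you iterate on the code.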

The GitHub projects associated with this post contain code that trains a model on the Fashion MNIST dataset. If you’re not familiar with this dataset or with the machine learning code, I recommend reading my introduction to PyTorch article. I won’t describe the training code in detail here, but I do want to call your attention to the portion of the code that saves the model:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/src/train.py

def save_model(model_dir, model: nn.Module) -> None:
    code_paths = ["neural_network.py", "utils_train_nn.py"]
    full_code_paths = [
        Path(Path(__file__).parent, code_path) for code_path in code_paths
    ]
    ...
    mlflow.pytorch.save_model(pytorch_model=model,
                              path=model_dir,
                              code_paths=full_code_paths)

As you can see, this code saves the model using the open-source MLflow framework. This brings us several benefits. First of all, it allows us to visualize any metrics we log during training without having to manually create any graphs. Here’s how I logged the loss and accuracy in the training code:

def train(data_dir: str, model_dir: str, device: str) -> None:
    ...
    for epoch in range(epochs):
        ...
        metrics = {
            "training_loss": training_loss,
            "training_accuracy": training_accuracy,
            "validation_loss": validation_loss,
            "validation_accuracy": validation_accuracy
        }
        mlflow.log_metrics(metrics, step=epoch)

    save_model(model_dir, model)

And here’s how I visualize these metrics locally:

mlflow ui

When I run this command, I get a link that I can click on to see the graphs.

Another benefit of MLflow is that I can invoke the trained model on my dev machine, which helps me to fully test it out before I deploy it in the cloud. MLflow supports test data in CSV and JSON forms, and I include both types of test data in the project, to give you options. Here are the commands I use:

cd aml_command_cli
mlflow models predict --model-uri "model" --input-path "test_data/images.csv" --content-type csv
mlflow models predict --model-uri "model" --input-path "test_data/images.json" --content-type json

You can see the code I wrote to generate the CSV and JSON data in the generate_images.py file.

The output of the JSON prediction command looks like this:

[{"0": 3.5685975551605225, "1": -7.8351311683654785, "2": 12.533431053161621, "3": 1.6915751695632935, "4": 6.009798049926758, "5": -6.79791784286499, "6": 7.569240570068359, "7": -6.589715480804443, "8": -2.000182628631592, "9": -8.283203125}, {"0": -3.6867828369140625, "1": -5.797521591186523, "2": -3.2098610401153564, "3": -2.2174417972564697, "4": -2.5920114517211914, "5": 3.298574686050415, "6": -0.4601913094520569, "7": 4.433833599090576, "8": 1.1174960136413574, "9": 5.766951560974121}]

If you’re familiar with the Fashion MNIST dataset, you may have guessed that the keys in each prediction dictionary correspond to clothing items, and the values are scores: the higher the score, the more likely that clothing item is the correct one. For the first image in this example, key 2 has the highest value, which corresponds to “Pullover”.

You might be wondering why the model doesn’t simply return the string “Pullover”. That’s a slightly more advanced scenario, which I’ll cover in a future post. But I think that this basic scenario has value too. For example, if you’re localizing your app to different languages, you might want to translate the prediction to a string in your client app rather than on the server.
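As an illustration of what that client-side translation could look like, here is a minimal sketch that picks the highest-scoring key in each prediction and maps it to a class name. The label list is the standard Fashion MNIST ordering, the scores are the ones from the output shown earlier (rounded), and the function name is just a choice made for this example.

```python
# Standard Fashion MNIST class names, indexed by label.
LABELS = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def top_label(prediction: dict) -> str:
    """Return the class name whose key has the highest score."""
    best_key = max(prediction, key=prediction.get)
    return LABELS[int(best_key)]


# Scores copied from the prediction output above, rounded to two decimals.
predictions = [
    {"0": 3.57, "1": -7.84, "2": 12.53, "3": 1.69, "4": 6.01,
     "5": -6.80, "6": 7.57, "7": -6.59, "8": -2.00, "9": -8.28},
    {"0": -3.69, "1": -5.80, "2": -3.21, "3": -2.22, "4": -2.59,
     "5": 3.30, "6": -0.46, "7": 4.43, "8": 1.12, "9": 5.77},
]

print([top_label(p) for p in predictions])  # ['Pullover', 'Ankle boot']
```

A localized client would swap LABELS for a list of translated names while keeping the same index order.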

Once you’re able to get a good prediction for your model on your development machine, you’re ready to move training to the cloud.

Training and deploying using the Azure ML CLI

You’ll need to create a few Azure ML resources in order to train and deploy a model using Azure ML. In this section, I’ll show you how to create those resources using the Azure ML CLI. Check out my introductory article for an overview of the major resources supported by Azure ML. Here are the resources we’ll use for this simple scenario:

Compute: We’ll create a cluster of CPU machines to run training in the cloud.
Data: We’ll copy our Fashion MNIST data to the cloud so that it’s easily accessible to our training job.
Job: We’ll create a CommandJob (the simplest type of Job supported by Azure ML) to train the model.
Model: Once the training job produces a model, we’ll register it with Azure ML so that we can deploy it as an endpoint.
Managed Online Endpoint: We’ll use this particular type of endpoint to make predictions because it’s designed to process smaller requests and give near-immediate responses.
Managed Online Deployment: Our endpoint can accommodate one or more deployments; we’ll just use one.

Let’s create our compute resource. We’ll start by defining the details of the compute we want in a YAML file. As you can see below, we name our resource “cluster-cpu” and specify that we want between 0 and 4 machines. We also specify that we want a machine of size Standard_DS4_v2. How do you know which machine size to choose? You can learn more about that in my blog post about compute.

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/cluster-cpu.yml

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cluster-cpu
type: amlcompute
size: Standard_DS4_v2
location: westus2
min_instances: 0
max_instances: 4

Notice that we also specify a schema, which gives us intellisense and warns us of errors when editing this file. If you press “Ctrl + Space” with the cursor on a new line, you’ll see that VS Code tells you all the other properties that can go in this file. And if you type a non-supported property name, VS Code will alert you by underlining the property name with red squiggles.

How do you know the schema URI for each resource? You can find URIs for all resources in the documentation, or you can use the Azure ML extension for VS Code. If you have this extension installed, you can go to the left menu in VS Code, click on the symbol for the extension, select your Azure subscription, and pick your desired ML workspace. You can then browse your existing cloud resources by navigating the tree. Clicking on the “+” icon to the right of a resource name generates a new YAML file for that resource type, populated with the appropriate schema and a few commonly used properties.

Screenshot showing the Azure ML extension to VS Code.

Now that we have the compute details specified, we need to instruct Azure ML to create the resource in the cloud. This can be done by executing the following command in the terminal:

az ml compute create -f cloud/cluster-cpu.yml

You can verify that your resource was created by visiting the Azure ML Studio. Click on “Compute” in the left menu, then “Compute clusters,” and you should see a cluster named “cluster-cpu” listed on that page.

Congratulations! You created your first Azure ML resource! :)

We can follow similar steps to create the data resource. Our YAML configuration file specifies that we want to upload the “data” local folder into the cloud, and register it under the name “data-fashion-mnist”:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/data.yml

$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: data-fashion-mnist
description: Fashion MNIST Dataset.
path: ../data/
type: uri_folder

We can execute a similar CLI command in the terminal:

az ml data create -f cloud/data.yml

We can then go to the Azure ML Studio, click on “Data,” and verify that a data resource with name “data-fashion-mnist” was created. If you click on the resource name, and then on “Explore,” you’ll see all the Fashion MNIST data files listed there.

Next we’ll create the job resource. In order to train our model, we need to specify the following information:

The compute hardware used, which is the compute cluster we defined earlier.
The software environment we want installed on that hardware. For more information about environments, check out my blog post on the topic.
Where our training code is located, which in our case is within the “src” directory.
Inputs to the training code. In our scenario, that’s just the data resource we created earlier.
Outputs of the training code. In our scenario, we output the trained model.

You can see that all of this information is specified in the YAML definition file:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/job.yml

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

type: command
description: Trains a simple neural network on the Fashion-MNIST dataset.
compute: azureml:cluster-cpu

inputs:
  fashion_mnist:
    path: azureml:data-fashion-mnist@latest
outputs:
  model:
    type: mlflow_model

code: ../src
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: conda.yml
command: python train.py --data_dir ${{inputs.fashion_mnist}} --model_dir ${{outputs.model}}
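When the job runs, Azure ML replaces the ${{inputs.fashion_mnist}} and ${{outputs.model}} placeholders with concrete paths, so the training script only has to read two ordinary command-line arguments. The sketch below shows one way a script could parse them. It is an assumption about how such a script might look, not a copy of the train.py in the repo, and the paths in the usage line are made up.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two arguments that the job's command line passes in."""
    parser = argparse.ArgumentParser(description="Train on Fashion MNIST.")
    parser.add_argument("--data_dir", required=True,
                        help="Folder containing the training data.")
    parser.add_argument("--model_dir", required=True,
                        help="Folder where the trained model is saved.")
    return parser.parse_args(argv)


# Simulate the command line with made-up paths.
args = parse_args(["--data_dir", "/tmp/data", "--model_dir", "/tmp/model"])
print(args.data_dir, args.model_dir)
```

Because the script sees plain paths, the same code can run unchanged on a development machine and in the cloud.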

Creating and starting the job in the cloud should look familiar by now:

az ml job create -f cloud/job.yml

This command works, but it’s not quite enough for our scenario. That’s because once the job finishes training the model, we want to register the model, and that requires a reference to the run (the job instance). One way to get this reference is by running the following command (assuming you’re using bash or zsh):

run_id=$(az ml job create -f cloud/job.yml --query name -o tsv)

The --query parameter specifies that we want to query the JSON returned by the command using the JMESPath query language, as you can see in the documentation for the “az ml job create” command. The -o parameter specifies that we want the output to be formatted as tab-separated values, which for a single field means just the bare value. You can learn more about output formats in the documentation.
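To make the effect of --query name -o tsv concrete, here is a rough Python equivalent: parse the JSON that the command prints and keep only the name field. The sample response is trimmed and made up for illustration; a real response contains many more fields, and the name will be whatever Azure ML assigns to your run.

```python
import json

# Trimmed, made-up example of the JSON printed by `az ml job create`.
SAMPLE_RESPONSE = """
{
  "name": "example_run_name",
  "type": "command",
  "status": "Starting"
}
"""


def extract_run_id(response_text: str) -> str:
    """Return the job's name, which is what `--query name -o tsv` prints."""
    return json.loads(response_text)["name"]


run_id = extract_run_id(SAMPLE_RESPONSE)
print(run_id)
```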

In the Azure ML Studio, you can go to “Jobs” and look for the name “aml_command_cli.” Click on it, and you’ll see all the runs associated with this job definition. If you execute the CLI command again, it will add another entry to this page. You’ll need to wait a few minutes for the run to complete. Once it has completed, training is done and the ML model is ready for use.

When training has completed, you can create an Azure ML resource for the trained model. I could have created another YAML file with the model specification, but I want to show you a different way of creating a resource. Since I only have three properties to set in this case, I can provide the values on the command line:

az ml model create --name model-command-cli --path "azureml://jobs/$run_id/outputs/model" --type mlflow_model

Keep in mind that we need to specify (using --type mlflow_model) that the model was created using MLflow. Also, notice the syntax that I’m using to refer to the output of the run. I like the fact that I can create the model directly from the run output, without having to download it first to my local machine. But you could download the model to use locally if you wanted to, with the following command:

az ml job download --name $run_id --output-name "model"

You can verify that your model was created correctly by going to the Azure ML Studio, clicking on “Models,” and then looking for the “model-command-cli,” which is the name we specified in the CLI command.
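The path passed to az ml model create follows a simple pattern: the run ID, followed by the name of the output as declared under outputs in the job YAML file (model in our case). If you script this step in Python, a small helper like the following could build the path. The function name and the sample run ID are hypothetical.

```python
def job_output_uri(run_id: str, output_name: str = "model") -> str:
    """Build the azureml:// path that points at a named output of a job run."""
    if not run_id:
        raise ValueError("run_id must not be empty")
    return f"azureml://jobs/{run_id}/outputs/{output_name}"


print(job_output_uri("example_run_name"))
# azureml://jobs/example_run_name/outputs/model
```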

Great! We now have a trained model, and want to create an endpoint that we can use to invoke it. In Azure ML, an endpoint can have several deployments, which specify the compute and model we want to use. This is useful, for example, if you want to direct some percentage of your traffic to one deployment and the rest to another. But we’ll keep it simple here, and use a single deployment that handles all traffic. You can see here the YAML definitions for the endpoint and deployment:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/endpoint.yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: endpoint-command-cli
auth_mode: key

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/deployment.yml

$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: endpoint-command-cli
model: azureml:model-command-cli@latest
instance_type: Standard_DS4_v2
instance_count: 1

And here are the commands we can execute to create these resources:

az ml online-endpoint create -f cloud/endpoint.yml
az ml online-deployment create -f cloud/deployment.yml --all-traffic

As usual, you can go to the Azure ML Studio, click on “Endpoints,” and see your endpoint and deployment creation in progress. The deployment creation in particular may take several minutes.
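The --all-traffic flag sends every request to the “blue” deployment. To build some intuition for what a split between two deployments would mean, here is a toy simulation of percentage-based routing. It only illustrates the concept and is not how Azure ML routes requests internally. The “green” deployment is hypothetical, and requiring the percentages to sum to 100 is a rule of this sketch.

```python
import random
from collections import Counter


def validate_traffic(traffic: dict) -> None:
    """Check that every percentage is non-negative and that they sum to 100."""
    if any(percent < 0 for percent in traffic.values()):
        raise ValueError("traffic percentages must not be negative")
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")


def pick_deployment(traffic: dict, rng: random.Random) -> str:
    """Choose a deployment name with probability proportional to its percentage."""
    validate_traffic(traffic)
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)
traffic = {"blue": 90, "green": 10}  # "green" is a hypothetical second deployment
counts = Counter(pick_deployment(traffic, rng) for _ in range(1_000))
print(counts)
```

With a 90/10 split, roughly nine out of ten simulated requests land on “blue”, which is the kind of gradual rollout that multiple deployments make possible.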

Once the endpoint and deployment are created, we’re ready to invoke the endpoint, which we can do with the following command:

az ml online-endpoint invoke --name endpoint-command-cli --request-file test_data/images_azureml.json

You may have noticed that this is not the same JSON file I used to test the model locally with MLflow. The only difference is that this file wraps the JSON we used previously in a dictionary with the key “input_data”, which is currently a requirement for Azure ML. You can look at the test image generation code to see how I generated the images_azureml.json file. Also, keep in mind that the CSV format is not supported by Azure ML at the moment.
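The wrapping step itself is small. The sketch below shows the idea with a placeholder payload. The real files are produced by the test image generation code in the repo, so treat the function name and the sample values here as illustrative only.

```python
import json


def wrap_for_azure_ml(payload):
    """Wrap an MLflow-style JSON payload in a dictionary with the key "input_data"."""
    return {"input_data": payload}


# Placeholder payload standing in for the contents of the local JSON test file.
local_payload = [[0.0, 0.5], [1.0, 0.25]]

azureml_payload = wrap_for_azure_ml(local_payload)
print(json.dumps(azureml_payload))
```

To convert a file on disk, you would load it with json.load, wrap the result, and write it back out with json.dump.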

Before we wrap up, you might want to delete the endpoint, to avoid getting charged:

az ml online-endpoint delete --name endpoint-command-cli

And that’s all there is to it.

Training and deploying using the Azure ML Python SDK

All the Azure ML CLI steps that I presented in the previous section can also be accomplished using the Azure ML Python SDK. You can look at this GitHub repo to see how, including instructions on how to run it. I won’t delve into the details, except I want to call your attention to how similar the SDK syntax is to the YAML syntax. This makes it very intuitive to port your implementation from one method to the other, and to mix the two methods in the same project.

For example, let’s compare the YAML and Python SDK code used to create the compute cluster:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/cluster-cpu.yml

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cluster-cpu
type: amlcompute
size: Standard_DS4_v2
location: westus2
min_instances: 0
max_instances: 4

https://github.com/bstollnitz/aml_command_sdk/blob/master/aml_command_sdk/cloud/pipeline_job.py

    cluster_cpu = AmlCompute(
        name="cluster-cpu",
        type="amlcompute",
        size="Standard_DS4_v2",
        location="westus2",
        min_instances=0,
        max_instances=4,
    )
    ml_client.begin_create_or_update(cluster_cpu)

As you can see, the YAML and Python SDK code are very similar. Once you understand the CLI method well, it should be easy to learn the Python SDK.
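One way to see how close the two are: the fields of the YAML file, minus the $schema line, can be passed straight through as keyword arguments. The sketch below demonstrates this with a stand-in dataclass so that it runs without the Azure ML SDK installed. ComputeSpec and to_sdk_kwargs are hypothetical names created for this example, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass
class ComputeSpec:
    """Hypothetical stand-in that takes the same keyword arguments as the SDK call above."""
    name: str
    type: str
    size: str
    location: str
    min_instances: int
    max_instances: int


# The fields of cluster-cpu.yml, written out as a Python dictionary.
yaml_fields = {
    "$schema": "https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json",
    "name": "cluster-cpu",
    "type": "amlcompute",
    "size": "Standard_DS4_v2",
    "location": "westus2",
    "min_instances": 0,
    "max_instances": 4,
}


def to_sdk_kwargs(fields: dict) -> dict:
    """Drop YAML-only keys (those starting with "$") and keep the rest as kwargs."""
    return {key: value for key, value in fields.items() if not key.startswith("$")}


spec = ComputeSpec(**to_sdk_kwargs(yaml_fields))
print(spec)
```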

Training and deploying using the Azure ML Studio

Using the Azure ML Studio should also be intuitive once you’re familiar with the CLI method. Instead of creating resources by writing a YAML file and executing a command, you create them directly using the UI. The decisions you need to make when creating each resource are basically the same as the ones we specified in the YAML files.

For example, let’s see how you would create the compute cluster. You would click on “Compute” in the left menu, then “Compute clusters” in the top menu, then “+ New.”

Screenshot of Azure ML Studio UI for creating a compute cluster

A window opens that allows you to choose the compute location and VM size. One advantage of using the UI to create your compute cluster is that you get a lot of help when choosing the VM size, including information about what each machine is optimized for, how many cores you have available in your subscription, and the cost of using each machine per hour. You can select a recommended machine size, or search all machines.

After pressing Next, a new window opens that allows you to choose a compute name, the minimum and maximum number of nodes, and a few other settings.

You can create other types of resources in a similar way.

The Azure ML Studio is a great option to create resources that you’ll only need to create once, because it offers so much guidance. However, resource creation is not as easily repeatable in the Studio as it is with the CLI or SDK, so it’s not as efficient for resources that you plan to re-create often.

Conclusion

In this post, you learned how to train and deploy a model using three methods: the Azure ML CLI, the Azure ML Python SDK, and the Azure ML Studio. Now you’re well prepared to choose the appropriate method for each scenario you’ll encounter in the future.
