Explore data and train models

12/27/2024

Contents

Skill 2.1: Explore data by using data assets and datastores
Skill 2.2: Create models by using the Azure Machine Learning Designer
Skill 2.3: Use automated machine learning to explore optimal models

Skill 2.3: Use automated machine learning to explore optimal models

Azure Machine Learning service’s automated ML capability is based on a breakthrough from the Microsoft Research division. It is distinct from competing solutions in the market. The approach combines ideas from collaborative filtering and Bayesian optimization. This combination allows it to search an enormous space of possible machine learning pipelines intelligently and efficiently. Essentially, it acts as a recommender system for machine learning pipelines. Just as streaming services recommend movies for users, automated ML recommends machine learning pipelines for datasets.

Use automated machine learning for tabular data

Imagine you’re a data scientist working for a telecom company. Your task is to develop a machine learning model to predict customer churn based on various customer attributes. You decide to use Azure Machine Learning’s Automated Machine Learning (AutoML) feature to quickly build and deploy the model. We’ll break this down into setting up your environment, preparing your tabular data, and then using AutoML on your tabular data by looking at a real-world scenario.

Working with tabular data in Azure Machine Learning

This section covers how you can use objects like MLTable for data processing. The MLTable object can be used with your tabular data (for example, a CSV file containing customer churn data). MLTable is a feature of Azure Machine Learning that allows you to define and save a series of data loading steps for tabular data. This makes it easier to reproduce data loading in different environments and share it with team members. MLTable supports various data sources, including CSV and Parquet files. Figure 2-8 shows selecting AutoML in the Designer.

FIGURE 2.8 AutoML in the workspace

Here’s how you can use MLTable with AutoML for tabular data:

Define Data Loading Steps: Use the mltable Python SDK (to clarify, mltable must be used via Python and not the UI) to define the steps for loading and preprocessing your data. This includes specifying the data source, filtering rows, selecting columns, and creating new columns based on the data.
Save Data Loading Steps: Once you have defined the data loading steps, you can save them into an MLTable file. This file contains the serialized steps, making it easy to reproduce the data loading process.
Load Data into a Pandas DataFrame: You can load the data defined by an MLTable into a Pandas DataFrame. This is useful for exploring the data and performing additional preprocessing before training a model.
Use MLTable with AutoML: When setting up an AutoML experiment for tabular data, you can use an MLTable as the data input. AutoML will automatically apply the data loading steps defined in the MLTable and use the resulting DataFrame for model training.
Create a Data Asset: To share the MLTable with team members and ensure reproducibility, you can create a data asset in Azure Machine Learning. This stores the MLTable in cloud storage and makes it accessible through a friendly name and version number.
Use Data Asset in Jobs: You can reference the data asset in Azure Machine Learning jobs, such as training or inference jobs. This allows you to use the same data loading steps consistently across different experiments and pipelines.

Here’s an example of how to turn a CSV file into an MLTable using the SDK:

import mltable

# Define the data source (CSV file)
paths = [{'file': 'path/to/your/data.csv'}]

# Create an MLTable from the CSV file
tbl = mltable.from_delimited_files(paths)

# Apply any additional data loading steps (e.g., filtering, column selection)
tbl = tbl.filter("col('some_column') > 0")
# keep_columns is the mltable method for column selection
tbl = tbl.keep_columns(["column1", "column2"])

# Save the data loading steps into an MLTable file
tbl.save("./your_mltable_directory")

In this example, the CSV file is turned into an MLTable with some filtering and column selection steps. The resulting MLTable can then be used with AutoML for training machine learning models on tabular data.

Now that we understand how to work with tabular data in a pipeline, we can look at some specific scenarios for using AutoML on tabular data for a customer churn prediction pipeline. You can also use the Designer with AutoML and tabular data. Figure 2-9 shows creating a new AutoML run in the Designer.

FIGURE 2.9 Advanced Features of Automated Machine Learning for model development

Select and understand training options, including preprocessing and algorithms

Automated Machine Learning (AutoML) in Azure Machine Learning is a powerful tool that automates the process of selecting the best machine learning algorithms and hyperparameters for your data. This simplifies the machine learning workflow, making it accessible to data scientists, analysts, and developers, regardless of their expertise in machine learning. In the following section, we will look at automating machine learning concepts including training data, validation, featurization, preprocessing, distributed training, model selection, and ensemble learning in the context of Azure’s AutoML capabilities.

Automated Machine Learning: AutoML in Azure provides various training options to cater to different requirements and preferences. These options are designed to optimize the model development process, ensuring efficiency and effectiveness in training machine learning models.
Training Data and Validation: AutoML allows users to provide training data in different formats, including MLTable for tabular data. Users can specify separate datasets for training and validation or let AutoML automatically split the training data for validation purposes. This helps in evaluating the model’s performance and avoiding overfitting. For time-series forecasting, AutoML supports advanced configurations like rolling-origin cross-validation to ensure robust model evaluation.
Featurization and Preprocessing: AutoML automates the featurization and preprocessing steps, which are crucial for preparing the data for model training. This includes handling missing values, encoding categorical variables, and scaling numerical features. Users can customize these steps by specifying featurization settings, such as blocking certain transformers or defining custom transformations. This flexibility allows users to tailor the data preprocessing to their specific needs, ensuring that the input data is in the optimal format for training.
Distributed Training: For large datasets and complex models, AutoML supports distributed training. This allows the training process to be distributed across multiple compute nodes, significantly reducing the training time. Users can specify the number of nodes to use for training, enabling parallel execution of model training. Distributed training is particularly beneficial for tasks like deep learning and NLP, where the computational requirements are high.
Model Selection and Hyperparameter Tuning: AutoML automates the selection of machine learning algorithms and the tuning of hyperparameters. It iterates through a predefined list of algorithms and tests different hyperparameter combinations to find the best-performing model. Users can control the number of iterations and set limits on the training time to manage computational resources effectively.
Ensemble Models: AutoML supports ensemble models, which combine predictions from multiple models to improve accuracy. It uses techniques like voting and stacking to create ensembles, automatically selecting the best models to include in the ensemble based on their performance.
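
To make the voting idea concrete, here is a minimal sketch of majority (hard) voting over class labels. It is a simplified illustration only and not AutoML's own ensembling code; the function name majority_vote and the three lists of predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class labels from several models by majority vote, row by row."""
    combined = []
    for row_votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count
        combined.append(Counter(row_votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three models for the same five customers
model_a = ["churn", "stay", "stay", "churn", "stay"]
model_b = ["churn", "churn", "stay", "stay", "stay"]
model_c = ["stay", "churn", "stay", "churn", "stay"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Each customer receives the label that at least two of the three models agree on, so a single model's mistake can be outvoted by the others.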

Table 2-1 outlines the algorithms that are supported by Automated Machine Learning (AutoML) in Azure Machine Learning for various learning tasks.

TABLE 2-1 Automated Machine Learning algorithms

Task Type	Algorithms
Classification	Logistic Regression, Light GBM, Gradient Boosting, Decision Tree, K Nearest Neighbors, Linear SVC, Support Vector Classification (SVC), Random Forest, Extremely Randomized Trees, Xgboost, Naive Bayes, Stochastic Gradient Descent (SGD)
Regression	Elastic Net, Light GBM, Gradient Boosting, Decision Tree, K Nearest Neighbors, LARS Lasso, Stochastic Gradient Descent (SGD), Random Forest, Extremely Randomized Trees, Xgboost*
Time Series Forecasting	AutoARIMA, Prophet, Elastic Net, Light GBM, K Nearest Neighbors, Decision Tree, LARS Lasso, Extremely Randomized Trees*, Random Forest, TCNForecaster, Gradient Boosting, ExponentialSmoothing, SeasonalNaive, Average, Naive, SeasonalAverage
Image Classification	MobileNet, ResNet, ResNeSt, SE-ResNeXt50, ViT
Image Classification Multi-label	Refer to ClassificationMultilabelPrimaryMetrics Enum
Image Object Detection	YOLOv5, Faster RCNN ResNet FPN, RetinaNet ResNet FPN
NLP Text Classification Multi-label	Refer to supported algorithms for NLP tasks
NLP Text Named Entity Recognition (NER)	Refer to supported algorithms for NLP tasks

Algorithms marked with an asterisk (*) are default models.

For NLP tasks, AutoML supports a range of pretrained text DNN models from the BERT family, including but not limited to BERT, DistilBERT, RoBERTa, XLM-RoBERTa, and XLNet.

Before showing an example of how you can select and use various training options in Automated Machine Learning (AutoML) with the Azure Machine Learning Python SDK v2, we need to list some of the options that are available:

Primary Metric: This is the metric that AutoML will optimize for model selection. Common metrics include accuracy for classification tasks and normalized_root_mean_squared_error for regression tasks.
Validation Strategy: AutoML supports several validation strategies such as cross-validation and train-validation splits. This helps in evaluating the model’s performance on unseen data.
Max Trials: This specifies the maximum number of different algorithm and parameter combinations that AutoML will try before selecting the best model.
Max Concurrent Trials: This is the maximum number of trials that can run in parallel, which can speed up the training process.
Timeout: You can set a maximum amount of time for the AutoML experiment. Once the time limit is reached, AutoML will stop trying new models.
Featurization: AutoML can automatically preprocess and featurize the input data, which includes handling missing values, encoding categorical variables, and more.
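
As a rough illustration of what featurization involves, the sketch below fills missing numeric values with the column mean and one-hot encodes a categorical column. It uses plain Python and is not how AutoML implements these steps internally; the helper names and the sample columns (monthly_charges, contract) are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


# Hypothetical customer attributes with one missing numeric value
monthly_charges = [20.0, None, 40.0, 60.0]
contract = ["monthly", "yearly", "monthly", "two-year"]

imputed = impute_mean(monthly_charges)
encoded, categories = one_hot(contract)

print(imputed)
print(categories)
print(encoded)
```

The missing charge is replaced by the mean of the three observed values, and each contract type becomes its own indicator column, which is the kind of numeric input most algorithms expect.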

The following code example shows how to configure these training options in AutoML using the Azure Machine Learning Python SDK:

from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl, Input
from azure.identity import DefaultAzureCredential

# Set up the MLClient
credential = DefaultAzureCredential()
subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)

# Define the training data (a folder that contains an MLTable file)
training_data_input = Input(type=AssetTypes.MLTABLE, path="./data/training_data/")

# Configure the AutoML job
automl_job = automl.classification(
    compute="your-compute-cluster",
    experiment_name="automl_classification_example",
    training_data=training_data_input,
    target_column_name="target",
    primary_metric="accuracy",
    validation_data_size=0.2,  # hold out 20% of the training data for validation
    enable_model_explainability=True,
)

# Limits are set on the job object rather than passed to automl.classification()
automl_job.set_limits(
    timeout_minutes=60,
    max_trials=100,
    max_concurrent_trials=4,
)

# Submit the AutoML job
submitted_job = ml_client.jobs.create_or_update(automl_job)
print(f"Submitted job: {submitted_job.name}")

# Get the URL to monitor the job
print(f"Monitor your job at: {submitted_job.services['Studio'].endpoint}")

In this example, we’ve configured the primary metric as accuracy, set a validation data split of 20%, limited the maximum number of trials to 100, allowed up to 4 trials to run concurrently, and set a timeout of 60 minutes. We’ve also enabled model explainability to interpret the model’s predictions.

You can adjust these options based on your specific requirements and the nature of your dataset. Whether you’re a seasoned data scientist or a developer new to machine learning, AutoML provides the tools you need to develop and deploy machine learning models with ease. In the next section, we will look at the last piece of the above example: evaluating an Automated Machine Learning Run according to responsible AI guidelines.

Evaluate an automated machine learning run, including responsible AI guidelines

Depending on the type of machine learning task (classification, regression, etc.), different metrics are used to evaluate the model’s performance.

Classification metrics

Classification metrics include accuracy, precision, and recall, each defined as a ratio over counts of correct and incorrect predictions, as well as summary metrics such as F1 Score and AUC-ROC (area under the receiver operating characteristic curve). Later chapters explore how to monitor classification models with accuracy, F1 Score, or AUC-ROC to detect model drift and decide when to retrain, so it is important to understand the definitions of the following classification metrics:

Accuracy: Proportion of correct predictions
Precision: Ratio of true positives to all positive predictions
Recall: Ratio of true positives to all actual positives
F1 Score: Harmonic mean of precision and recall
AUC-ROC: Area under the Receiver Operating Characteristic curve
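
The definitions above can be checked by hand on a small example. The following sketch computes accuracy, precision, recall, and F1 Score from true and predicted labels using plain Python; the function name and the sample labels are hypothetical, and AutoML computes these metrics for you during a run.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical churn labels: 1 = churned, 0 = stayed
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this sample there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.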

Regression metrics

Not all supervised machine learning problems are classification problems. Regression problems involve predicting a continuous response variable, such as forecasting demand for a new product line, and they require their own set of performance metrics to measure the error between predicted and actual values. Here are a few important regression metrics that you are likely to encounter in the real world as well as in exam questions:

Mean Absolute Error (MAE): Average of absolute differences between predicted and actual values
Mean Squared Error (MSE): Average of squared differences between predicted and actual values
Root Mean Squared Error (RMSE): Square root of MSE
R-squared: Proportion of variance in the dependent variable that is predictable from the independent variables
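
As a worked example of these formulas, the sketch below computes MAE, MSE, RMSE, and R-squared for a handful of demand forecasts. The function name and the sample values are hypothetical and chosen so the arithmetic is easy to follow.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R-squared from actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R-squared compares the model's squared error with that of always
    # predicting the mean of the actual values
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual and forecast demand for four weeks
y_true = [100, 150, 200, 250]
y_pred = [110, 140, 210, 230]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The errors are -10, 10, -10, and 20, which gives an MAE of 12.5 and an MSE of 175. RMSE is the square root of 175 (about 13.23), and R-squared is 1 - 700/12500 = 0.944.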

Using evaluation metrics in AutoML

When you run an AutoML experiment, it automatically calculates and logs these metrics for each model. You can access these metrics through the Azure Machine Learning Studio or programmatically using the SDK.

Visualizations for model evaluation

AutoML provides various visualizations to help you understand the model’s performance:

Confusion Matrix: For classification tasks, this shows the number of correct and incorrect predictions for each class.
ROC Curve: For binary classification, this plots the true positive rate against the false positive rate at various threshold levels.
Precision-Recall Curve: For binary classification, this shows the trade-off between precision and recall for different threshold levels.
Residuals Plot: For regression tasks, this shows the difference between actual and predicted values.
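
To show where the points on an ROC curve come from, here is a minimal sketch that sweeps the decision threshold over a set of predicted scores and then estimates the area under the curve with the trapezoid rule. The function names and sample scores are hypothetical; AutoML draws this chart for you in the studio.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Use each distinct score as a threshold, from highest to lowest
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under the line through consecutive points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels (1 = churned) and predicted churn probabilities
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

Here the model ranks a churned customer above a retained one in 8 of the 9 possible pairs, and the computed area is 8/9, which matches the usual interpretation of AUC-ROC as a ranking quality measure.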

After the AutoML run is complete, you can retrieve the best model based on the primary metric you specified. You can then evaluate this model on a test dataset to get a sense of its real-world performance.

Here’s an example of how you can retrieve and evaluate the best model from an AutoML run:

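# Requires the mlflow and azureml-mlflow packages in addition to azure-ai-ml
import mlflow
from mlflow.tracking import MlflowClient
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Set up the MLClient
credential = DefaultAzureCredential()
subscription_id = "your-subscription-id"
resource_group = "your-resource-group"
workspace_name = "your-workspace-name"
ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)

# Point MLflow at the workspace's tracking server
tracking_uri = ml_client.workspaces.get(name=workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)
mlflow_client = MlflowClient()

# Find the best child run of the AutoML parent job
parent_run = mlflow_client.get_run("automl_job_name")
best_child_run_id = parent_run.data.tags["automl_best_child_run_id"]
best_run = mlflow_client.get_run(best_child_run_id)

# Review the metrics that AutoML logged for the best model
print(best_run.data.metrics)

# Download the best model's files so it can be scored on a test dataset
local_path = mlflow.artifacts.download_artifacts(
    run_id=best_run.info.run_id,
    artifact_path="outputs",
    dst_path="./artifact_downloads",
)
print(f"Best model downloaded to: {local_path}")

In this example, we look up the best child run of a completed AutoML job through MLflow, print the metrics that AutoML logged for it, and download the model files. The downloaded model can then be scored on a separate test dataset to estimate its real-world performance.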

Predicting customer churn with Azure AutoML

Suppose you are a data scientist tasked with creating a machine learning model to predict customer churn for a telecom company. To accomplish this, you decide to leverage Azure’s Automated Machine Learning (AutoML) feature, which simplifies the process of building and deploying models. Here’s a step-by-step guide to help you prepare tabular data for use with Automated Machine Learning capabilities (see Figure 2-10 for an example using the Designer):

Set Up Your Environment: Create an Azure Machine Learning workspace. This is your centralized environment for managing and monitoring your machine learning models.
Install the Azure Machine Learning SDK v2 for Python: Run pip install azure-ai-ml in your terminal. This SDK enables you to interact with Azure Machine Learning services and resources programmatically.
Prepare Your Tabular Data: Gather your dataset. Ensure that your dataset includes various customer attributes and a churn label indicating whether the customer has churned.
Format Your Data: Structure your data in a tabular format with rows representing individual customers and columns representing attributes. The target column should be the churn label.
Upload Your Dataset to Azure: Convert your dataset to an MLTable and upload it to Azure. MLTable is a tabular data format supported by Azure AutoML.

FIGURE 2.10 Data connection and feature preparation in Azure Machine Learning

Specify the task type as classification since you’re predicting a binary outcome (churn or no churn). Choose accuracy as your primary metric to evaluate model performance. Also, decide on your data splitting strategy (e.g., cross-validation or train-validation split). Determine the maximum duration for the experiment (timeout minutes) and the maximum number of trials (max trials). This helps in managing computational resources and experiment time.
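
To illustrate the cross-validation option, the sketch below shows how k-fold cross-validation partitions row indices so that every row is used for validation exactly once. It is a simplified illustration with a hypothetical function name; it does not shuffle or stratify the rows, which practical implementations usually do.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Ten hypothetical customer rows split into five folds
folds = list(k_fold_indices(n_rows=10, n_folds=5))
for train, validation in folds:
    print(f"validate on {validation}, train on {len(train)} rows")
```

With five folds, each model is trained on 80% of the rows and validated on the remaining 20%, and the reported metric is typically averaged across the folds.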

Run your AutoML experiment

The following code shows how to use the Azure Machine Learning SDK to submit your AutoML experiment for execution. It walks through configuring AutoML in detail, including limits such as the timeout and the maximum number of trials. Keep an eye on the experiment’s progress through the Azure Machine Learning Studio or SDK. You can review the performance of different models as they are generated.

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, automl, Input
from azure.ai.ml.constants import AssetTypes

# Set up workspace
credential = DefaultAzureCredential()
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<WORKSPACE_NAME>"
ml_client = MLClient(credential, subscription_id, resource_group, workspace)

# Prepare data
train_data_input = Input(type=AssetTypes.MLTABLE, path="./data/customer_churn_data")

# Configure AutoML experiment
classification_job = automl.classification(
    compute="<COMPUTE_NAME>",
    experiment_name="customer_churn_prediction",
    training_data=train_data_input,
    target_column_name="Churn",
    primary_metric="accuracy",
    n_cross_validations=5,
)

# Set limits (optional)
classification_job.set_limits(
    timeout_minutes=60,
    max_trials=20,
)

# Run the experiment
returned_job = ml_client.jobs.create_or_update(classification_job)
print(f"Created job: {returned_job}")

Use automated machine learning for computer vision

Imagine you are a data scientist tasked with developing a model to classify animal images. Your goal is to utilize Azure Automated Machine Learning (AutoML) for computer vision tasks to accomplish this.

Setting up the environment

The first step is to create an Azure Machine Learning workspace, which acts as a centralized hub for managing resources, running experiments, tracking your models’ progress, and deploying them. Next, install the Azure Machine Learning CLI v2 and Python SDK v2, which give you the tools to interact with Azure Machine Learning services from the command line and from code.

Selecting the task type

In this project, the task type selected is image classification, which serves as a cornerstone determining the approach and algorithms utilized by AutoML for model training. Image classification involves categorizing images into predefined classes or categories based on their visual features. This choice significantly influences the techniques employed during the training phase, as well as the algorithms leveraged to optimize model performance.

Image classification tasks typically require specialized algorithms capable of understanding and extracting meaningful features from images to accurately classify them. AutoML, being an automated machine learning platform, adapts its approach based on the specified task type. For image classification, it employs algorithms specifically designed to process image data efficiently, such as convolutional neural networks (CNNs). CNNs are particularly well-suited for image-related tasks due to their ability to automatically learn hierarchical representations of visual features from the input images.

Furthermore, the choice of image classification as the task type underscores the importance of selecting appropriate evaluation metrics and validation strategies tailored to this specific problem domain. Metrics such as accuracy, precision, recall, and F1-score are commonly used to assess the performance of image classification models. Additionally, techniques like cross-validation or stratified sampling may be employed to ensure robust evaluation and prevent overfitting. Therefore, the decision to focus on image classification guides the entire workflow of model training within the AutoML framework, shaping the selection of algorithms, evaluation metrics, and validation strategies to achieve optimal results.

Preparing the data

Your next step is to organize your labeled image data. Convert this data to JSONL (JSON Lines) format, ensuring that each line contains an image URL and the corresponding label. If your data is in a different format, such as Pascal VOC or COCO, convert it to JSONL using available helper scripts. A minimum of 10 images is recommended to start the training process. Here is an example of the JSONL format for an image URL and a label that can take the values “cat”, “dog”, “bird”, “car”, and “tree”:

{"image_url": "http://example.com/image1.jpg", "label": "cat"}
{"image_url": "http://example.com/image2.jpg", "label": "dog"}
{"image_url": "http://example.com/image3.jpg", "label": "bird"}
{"image_url": "http://example.com/image4.jpg", "label": "car"}
{"image_url": "http://example.com/image5.jpg", "label": "tree"}

Create an MLTable for your training and validation data using Azure CLI or Python SDK. This involves specifying the path to your JSONL files and defining any necessary data transformations. MLTable serves as a structured representation of your data for AutoML.
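
Before the MLTable can point at a JSONL file, that file has to be written. The following minimal sketch serializes a list of labeled images to JSONL using only the Python standard library. The function name and the sample records are hypothetical, and the two fields simply mirror the example lines shown earlier in this section.

```python
import json


def to_jsonl(records):
    """Serialize (image_url, label) pairs as one JSON object per line."""
    return "\n".join(
        json.dumps({"image_url": url, "label": label}) for url, label in records
    )


# Hypothetical labeled images
labeled_images = [
    ("http://example.com/image1.jpg", "cat"),
    ("http://example.com/image2.jpg", "dog"),
]

jsonl_text = to_jsonl(labeled_images)
print(jsonl_text)

# Round-trip check: every line must parse back into a dict with both keys
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
```

Writing each record with json.dumps rather than string formatting ensures that quotes and special characters in URLs or labels are escaped correctly.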

Setting up compute for training

Choose a GPU-enabled compute target, such as the NC or ND series VMs, to train your computer vision models. The choice of compute target affects the speed and efficiency of model training.

Configure your AutoML experiment by setting parameters like the task type, primary metric, and job limits (e.g., timeout_minutes, max_trials, and max_concurrent_trials). This step involves defining the boundaries and objectives of the model training process. Figure 2-11 shows the menu for submitting an Automated ML Job including basic settings and Task settings like task type mentioned previously.

FIGURE 2.11 Submitting an AutoML job in Azure Machine Learning

Evaluating and deploying the model

After training, evaluate the best model based on the primary metric in accordance with the responsible AI guidelines covered earlier. Register this model in your Azure Machine Learning workspace and deploy it as a web service for making predictions. This final step makes your model accessible for real-world applications. Figure 2-12 shows selecting computer vision task-specific options in AutoML and the different options available as well as where to select data for training.

FIGURE 2.12 Select a computer vision task type using AutoML

Use automated machine learning for natural language processing (NLP)

Imagine again that you are a data scientist aiming to develop a natural language processing (NLP) model for classifying movie reviews into genres. You plan to use Azure Automated Machine Learning (AutoML) for NLP tasks. Figure 2-13 shows the high-level architecture for configuring AutoML to perform NLP tasks in Azure Machine Learning; however, in this chapter, we’ll concentrate specifically on Automated Machine Learning for NLP tasks.

FIGURE 2.13 NLP using AutoML

Setting up the environment

The first step is to create an Azure Machine Learning workspace, which acts as a centralized platform for managing and overseeing NLP models. Configure a GPU training compute within the workspace as well, because the parallel processing power of GPUs makes training large-scale NLP models far more efficient. Then install the Azure Machine Learning CLI v2 and Python SDK v2 so that you can build NLP pipelines, run experiments, and deploy models in your workspace from the command line and from code.

Selecting the NLP task

For this project, choose text_classification as your NLP task. This task involves classifying each movie review into a specific genre. Organize your dataset in a CSV format with columns for the review text and the corresponding genre labels. Ensure that the data is labeled correctly for the classification task. Figure 2-14 shows how to configure an AutoML experiment for an NLP task.

FIGURE 2.14 NLP processing task
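
As a small illustration of the dataset layout described above, the sketch below writes movie reviews and genre labels to CSV text and then checks that every row has a label. The column name review and the sample rows are hypothetical; genre matches the label column used in this section.

```python
import csv
import io

# Hypothetical labeled reviews; note the comma inside the second review
reviews = [
    ("A detective hunts a killer through rainy streets.", "thriller"),
    ("Two friends road-trip across the country, with chaos at every stop.", "comedy"),
]


def write_csv(rows):
    """Write (review, genre) rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["review", "genre"])
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = write_csv(reviews)
print(csv_text)

# Read the text back and confirm that no row is missing its label
rows = list(csv.DictReader(io.StringIO(csv_text)))
missing_labels = [row for row in rows if not row["genre"].strip()]
print(f"{len(rows)} rows, {len(missing_labels)} missing labels")
```

Using the csv module instead of joining strings by hand keeps reviews that contain commas or quotes in a single column, which avoids misaligned labels.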

Configuring the AutoML experiment

Define your experiment settings, including the task type (text_classification), compute target, and data inputs. Set the label column name to the name of the genre label column in your dataset.

Submit your AutoML job for training using the Azure CLI or Python SDK. Monitor the progress of the job and review the generated models in Azure Machine Learning Studio.

After training, evaluate the best model based on its performance metrics. Register this model in your Azure Machine Learning workspace and deploy it as a web service for making predictions. The following code example shows how to submit an AutoML NLP job:

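from azure.ai.ml import MLClient, Input, automl
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

# Connect to the workspace
ml_client = MLClient(
    DefaultAzureCredential(), "<SUBSCRIPTION_ID>", "<RESOURCE_GROUP>", "<WORKSPACE_NAME>"
)

# Define the NLP task and settings; each data folder holds an MLTable file
# that points at the corresponding CSV
nlp_job = automl.text_classification(
    compute="gpu-cluster",
    training_data=Input(type=AssetTypes.MLTABLE, path="path/to/training_data/"),
    validation_data=Input(type=AssetTypes.MLTABLE, path="path/to/validation_data/"),
    target_column_name="genre",
)

# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(nlp_job)
print(f"Created job: {returned_job.name}")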

We can also use Azure ML for fine-tuning natural language processing tasks. Figure 2-15 shows an example of fine-tuning a large language model (LLM).