
The current state of MLOps for machine learning engineers




This article was contributed by Aymane Hachcham, data scientist and contributor to neptune.ai

MLOps refers to the operation of machine learning in production. It combines DevOps with lifecycle tracking, reusable infrastructure, and reproducible environments to operationalize machine learning at scale across an entire organization. The term MLOps was first coined by Google in their paper on Machine Learning Operations, although it does have roots in software operations. Google’s goal with this paper was to introduce a new approach to developing AI products that is more agile, collaborative, and customer-centric. MLOps is an advanced form of traditional DevOps and ML/AI that mostly focuses on automation to design, manage, and optimize ML pipelines.

Machine learning on top of DevOps

MLOps is based on DevOps, which is a modern practice for building, delivering, and operating corporate applications effectively. DevOps began a decade ago as a method for rival tribes of software developers (the Devs) and IT operations teams (the Ops) to interact.

MLOps helps data scientists monitor and keep track of their solutions in real-life production environments. The real work that happens behind the scenes when pushing to production involves significant challenges in terms of both raw performance and managerial discipline. Datasets are huge, constantly expanding, and can change in real time. AI models need regular monitoring via rounds of experimentation, adjustment, and retraining.


Lifecycle tracking

Lifecycle tracking is a process that enables various team members to track and manage the life cycle of a product from inception to deployment. The system keeps track of all the changes made to the product throughout this process and allows each user to revert to a previous version if necessary.

Lifecycle tracking focuses on the iterative model development phase where you attempt a variety of modifications to bring your model’s performance to the desired level. A modest modification in training information can sometimes have a significant influence on performance. Because there are multiple layers of experiment tracking involving model training metadata, model versions, training data, etc., you may decide to choose a platform that can automate all these processes for you and manage scalability and team collaboration.

Model training metadata

During the course of a project (especially if there are several individuals working on the project), your experiment data may be spread across multiple devices. In such instances, it can be difficult to control the experimental process, and some knowledge is likely to be lost. You may choose to work with a platform that offers solutions to this issue.
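
As a rough illustration of the idea (not any particular platform's API), here is a minimal sketch that writes one metadata record per training run to a shared folder so that parameters and metrics are not scattered across machines; the folder layout and field names are assumptions for this example:

import json
import time
import uuid
from pathlib import Path

def log_run(params: dict, metrics: dict, registry_dir: str = "experiment_runs") -> str:
    """Append one JSON record per training run to a shared directory."""
    run_id = uuid.uuid4().hex[:8]
    record = {
        "run_id": run_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,     # hyperparameters used for this run
        "metrics": metrics,   # evaluation results for this run
    }
    out_dir = Path(registry_dir)
    out_dir.mkdir(exist_ok=True)
    (out_dir / f"{run_id}.json").write_text(json.dumps(record, indent=2))
    return run_id

# Example usage
run_id = log_run({"rf__max_depth": 40}, {"auc": 0.78})

A dedicated experiment-tracking platform handles the same bookkeeping with collaboration, search, and scalability built in.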

Hyperparameter logging

The best way to track the hyperparameters of your different model versions is with a configuration file. These are simple text files with a preset structure and standard libraries to parse them, such as Python's built-in JSON encoder and decoder or PyYAML; JSON, YAML, and cfg files are common formats. Below is an example of a YAML file for a credit scoring project:

project: ORGANIZATION/project-I-credit-scoring
name: cs-credit-default-risk

parameters:
# Data preparation
  n_cv_splits: 5
  validation_size: 0.2
  stratified_cv: True
  shuffle: 1
# Random forest
  rf__n_estimators: 2000
  rf__criterion: gini
  rf__max_features: 0.2
  rf__max_depth: 40
  rf__min_samples_split: 50
  rf__min_samples_leaf: 20
  rf__max_leaf_nodes: 60
  rf__class_weight: balanced
# Post Processing
  aggregation_method: rank_mean
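
For reference, a minimal sketch of reading such a file with PyYAML, assuming the configuration above is saved as config.yaml next to the script:

import yaml  # provided by the PyYAML package

with open("config.yaml") as f:
    config = yaml.safe_load(f)

print(config["project"])                         # ORGANIZATION/project-I-credit-scoring
print(config["parameters"]["rf__n_estimators"])  # 2000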

One way to do this is with Hydra, a new Facebook AI project that streamlines the setup of more sophisticated machine learning experiments.

The key takeaways from Hydra are:

  • You may compose your hyperparameter configuration dynamically.
  • You can pass additional arguments not found in the configuration to the CLI.

Hydra is more versatile and allows you or your MLOps engineer to override complicated configurations (including config groups and hierarchies). The library is well-suited for deep-learning projects and is more reliable than a simple YAML file.

A minimalist example should look like the following:

# Use your previous yaml config file:
project: ORGANIZATION/project-I-credit-scoring
name: cs-credit-default-risk

parameters:
# Data preparation
  n_cv_splits: 5
  validation_size: 0.2
  stratified_cv: True
  shuffle: 1
# Random forest
  rf__n_estimators: 2000
  rf__criterion: gini
  rf__max_features: 0.2
  rf__max_depth: 40
  rf__min_samples_split: 50
  rf__min_samples_leaf: 20
  rf__max_leaf_nodes: 60
  rf__class_weight: balanced

Then create the Hydra entry-point script that loads this configuration:

import hydra
from omegaconf import DictConfig, OmegaConf

# In recent Hydra versions, config_path points to the directory and
# config_name to the file (without the .yaml extension).
@hydra.main(version_base=None, config_path=".", config_name="hydra-config")
def parameter_config(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))           # prints the config in a reader-friendly way
    print(cfg.parameters.rf__n_estimators)  # access values from your config file

if __name__ == "__main__":
    parameter_config()

When you start training your model, Hydra will log and print the configuration you’ve given:

name: cs-credit-default-risk
parameters:
  n_cv_splits: 5
  rf__class_weight: balanced
  rf__criterion: gini
  rf__max_depth: 40
  rf__n_estimators: 2000
  shuffle: 1
  stratified_cv: true
  validation_size: 0.2

project: ORGANIZATION/project-I-credit-scoring

Solid AI infrastructure

The AI infrastructure is the backbone of every AI project. In order for an AI company to be successful, it needs a solid network, servers, and storage solutions. This includes not only hardware but also the software tools that enable them to iterate quickly on machine learning algorithms. It’s extremely important that these solutions are scalable and can adapt as needs change over time.

Objectives and KPIs: key for MLOps engineers

Two principal categories fall under the scope of MLOps: predictive and prescriptive. Predictive MLOps is about predicting the outcome of a decision based on historical data, while prescriptive MLOps is about providing recommendations for decisions before they are made.

And those two categories abide by four general principles:

  1. Don’t overthink which aim to directly optimize; instead, track various indicators at first
  2. For your initial aim, select a basic, observable, and accountable metric
  3. Establish governance objectives
  4. Enforce fairness and privacy

In terms of code, one could establish multiple requirements for perfectly functional production code. However, the real challenge comes when ML models run inference in production and are exposed to vulnerabilities they were never tested against. Testing is therefore a tremendously important part of the process and deserves a lot of attention.

Proper testing workflow should always account for the following rules:

  • Perform automated regression testing (as sketched below)
  • Check code quality using static analysis
  • Employ continuous integration
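
To illustrate the first rule, here is a minimal, hypothetical regression check that could be wired into a pytest suite or a CI step; the baseline value and the candidate model are assumptions for the credit-scoring example:

from sklearn.metrics import roc_auc_score

BASELINE_AUC = 0.75  # assumed performance of the currently deployed model

def check_no_regression(candidate_model, X_val, y_val) -> None:
    """Fail the build if the candidate model drops below the known baseline."""
    auc = roc_auc_score(y_val, candidate_model.predict_proba(X_val)[:, 1])
    assert auc >= BASELINE_AUC, f"Candidate AUC {auc:.3f} is below baseline {BASELINE_AUC:.3f}"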

Principal KPIs in MLOps

There is no one-size-fits-all solution when it comes to MLOps KPIs. The metrics you or your MLOps engineer want to monitor will depend on your specific goals and environment. Start by considering what you need to optimize, how quickly you need to make changes, and what kind of data you can collect, then settle on a small set of KPIs to keep an eye on whenever ML software is deployed in production.
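
As one example of such a KPI, here is a minimal sketch of a population stability index (PSI), a common way to quantify how far live feature values have drifted from the training distribution; the rule of thumb that values above roughly 0.1 to 0.25 warrant investigation is a convention, not a universal standard:

import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare the distribution of a feature at training time vs. in production."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero and log(0) for empty bins
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# Example: PSI close to 0 means little drift; larger values suggest investigation.
psi = population_stability_index(np.random.normal(0, 1, 5000), np.random.normal(0.3, 1, 5000))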

Hybrid MLOps infrastructure

The advent of MLOps has seen new-age businesses moving their data centers into the cloud. This trend has shown that companies looking for agility and cost efficiency can easily switch to a fully managed platform for their infrastructure management needs.

Hybrid MLOps capabilities are defined as those that have some interaction with the cloud while also having some interaction with local computing resources. Local compute resources can include laptops running Jupyter notebooks and Python scripts, HDFS clusters storing terabytes of data, web apps serving millions of people globally, on-premises AWS Outposts, and a plethora of additional applications.

Many companies and MLOps engineers, in response to increased regulatory and data privacy concerns, are turning to hybrid solutions to handle data localization. Furthermore, an increasing number of smart edge devices are fueling creative new services across sectors. Because these devices create large amounts of complicated data that must frequently be processed and evaluated in real-time, IT directors must determine how and where to process that data.

How to implement a hybrid MLOps process for MLOps engineers

A robust AI infrastructure heavily relies on an active learning data pipeline. When used correctly, the data pipeline may dramatically accelerate the development of ML models. It can also lower the cost of developing ML models.

Team integration

Continuous integration and continuous delivery (CI/CD) describe the processes of automatically building, integrating, and releasing software. Machine learning extends the integration step with data and model validation, while the delivery step handles the difficulties of deploying machine learning systems.

Machine learning experts and MLOps engineers devote a significant amount of work to troubleshooting and enhancing model performance. CI/CD tools save time and automate as much manual work as feasible. Some tools used in business are:

  • GitHub Actions
  • GitLab CI/CD
  • Jenkins
  • CircleCI

Continuous training

CT (Continuous Training), a notion specific to MLOps, is all about automating model retraining. It covers the whole model lifetime, from data intake through measuring performance in production. CT guarantees that your algorithm is updated as soon as there is evidence of deterioration or a change in the environment.
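
A minimal sketch of the idea, assuming hypothetical helpers (compute_live_auc and trigger_training_pipeline) that stand in for your monitoring and orchestration tooling:

ACCEPTABLE_AUC = 0.72  # assumed business threshold for live performance

def maybe_retrain(compute_live_auc, trigger_training_pipeline) -> bool:
    """Kick off retraining as soon as live performance degrades below the threshold."""
    if compute_live_auc() < ACCEPTABLE_AUC:
        trigger_training_pipeline()  # e.g. start the scheduled retraining job
        return True
    return False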

Model training pipeline

A model training pipeline is an important part of the ongoing training process and the overall MLOps workflow. It trains and retrains models on a regular basis, freeing up data scientists to focus on building new models for other business challenges.

Each time the pipeline performs a new training run, the following sequence of operations takes place:

  • Data ingestion: Obtaining fresh data from external repositories or feature stores, where data is preserved as reusable “features” tailored to specific business scenarios.
  • Data preparation: A crucial step; if data anomalies are detected, the pipeline can be paused immediately until data engineers resolve the issue.
  • Model training and validation: In the most basic case, the model is trained on newly imported and processed data or features. However, you may conduct numerous training runs in parallel or in sequence to find the ideal parameters for a production model. Inference is then run and tested on specific sets of data to assess the model’s performance.
  • Data versioning: Data versioning is the technique of preserving data artifacts in the same way as code versions are saved in software development.

An MLOps engineer can implement all of these steps with end-to-end pipeline software that covers the full workflow.
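
As a rough illustration, here is a minimal skeleton of that sequence in Python, with each stage passed in as a placeholder function to be backed by real tooling:

def run_training_pipeline(ingest, prepare, train, validate, version_data):
    raw = ingest()                    # data ingestion from a repository or feature store
    clean, anomalies = prepare(raw)   # data preparation and anomaly detection
    if anomalies:
        raise RuntimeError("Pipeline paused: data anomalies need engineer review")
    model, metrics = train(clean)     # model training (possibly several runs)
    if not validate(model, metrics):  # model validation against acceptance criteria
        raise RuntimeError("Model rejected during validation")
    version_data(clean)               # data versioning alongside the model artifact
    return model, metrics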

Model registry

Once the model is trained and ready for a production setup, it is pushed into a model registry, which serves as a centralized repository of metadata for all published models. For each model, a set of entries serves as its metadata, for example:

project: ORGANIZATION/project-I-credit-scoring
model_version: model_v_10.0.02

identifiers:
  - version
  - name
  - version_date
  - remote_path_to_serialized_model
  - model_stage_of_deployment
  - datasets_used_for_training
  - runtime_metrics
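
A minimal sketch of writing such an entry to a simple file-based registry; the field values (storage path, dataset name, metrics) are assumptions, and a real registry would typically be a database or a managed service:

import json
from datetime import date
from pathlib import Path

def register_model(registry_dir: str, entry: dict) -> Path:
    """Store one JSON document per published model version."""
    path = Path(registry_dir) / f"{entry['name']}_{entry['version']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entry, indent=2))
    return path

register_model("model_registry", {
    "name": "cs-credit-default-risk",
    "version": "model_v_10.0.02",
    "version_date": str(date.today()),
    "remote_path_to_serialized_model": "s3://bucket/models/model_v_10.0.02.pkl",  # assumed path
    "model_stage_of_deployment": "staging",
    "datasets_used_for_training": ["credit_scoring_training_set"],                # assumed name
    "runtime_metrics": {"auc": 0.78},
})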

Model serving

Model serving is the model deployment stage. The most recent option, Model-as-a-Service, is now the most popular because it simplifies deployment by isolating the machine learning component from the software code. This means that you or your MLOps engineer can change the model version without re-deploying the application.

Generally speaking, there are three main ways to deploy an ML model:

  • On an IoT device
  • Inside an embedded or consumer application
  • On a dedicated web service available via a REST API (see the sketch after this list)
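
As an illustration of the third option, here is a minimal sketch of a REST prediction endpoint using FastAPI; the serialized model path, the flat feature-vector input, and the file name in the run command are assumptions for the credit-scoring example:

import pickle
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = pickle.load(open("model_v_10.0.02.pkl", "rb"))  # assumed serialized model artifact

class CreditRequest(BaseModel):
    features: list[float]  # assumed flat feature vector expected by the model

@app.post("/predict")
def predict(request: CreditRequest):
    score = model.predict_proba([request.features])[0][1]
    return {"default_probability": float(score)}

# Run with: uvicorn serve:app --port 8000  (assuming this file is saved as serve.py)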

A number of platforms provide SDKs and APIs for model serving.

You or your MLOps engineer can also launch multiple models for the same service to perform testing in production. For example, you can test competing model versions: this strategy involves simultaneously deploying several models with comparable results to determine which one is superior. The process is similar to A/B testing, except that you can compare more than two models at the same time.

Model monitoring

Upon release, the model’s performance may be influenced by a variety of circumstances, ranging from an initial mismatch between research and real data to changes in customer behavior. Typically, machine learning models do not show errors right away, but their predictions do have an impact on the eventual results. Inadequate insights might result in poor company decisions and, as a result, financial losses. Software tools that MLOps engineers could consider for model monitoring include MLWatcher, DbLue, and Qualdo.

Don’t forget that managing any form of company IT infrastructure is not easy. There are constant concerns about security, performance, availability, pricing, and other factors. Hybrid cloud solutions are no exception; they introduce additional layers of complexity that make IT management even more difficult. To avoid such problems, businesses and MLOps engineers should put processes like anomaly detection and early alerting in place, and be ready to trigger ML retraining pipelines as soon as problems arise.

Aymane Hachcham is a data scientist and contributor to neptune.ai

DataDecisionMakers

Welcome to the VentureBeat community!

DataDecisionMakers is where experts, including the technical people doing data work, can share data-related insights and innovation.

If you want to read about cutting-edge ideas and up-to-date information, best practices, and the future of data and data tech, join us at DataDecisionMakers.

You might even consider contributing an article of your own!
