What is MLOps and why is it so critical for enterprise AI?

In our conversations, we have noticed that concerns about AI tend towards two extremes. On one end, corporate leaders and governments are worried about ethics and trust in “black-box” AI systems that harness vast amounts of data. On the other end, we hear about the challenges of deploying and scaling AI. Models written on a data scientist’s laptop rarely make it past experimentation. They are “thrown over the wall” and never become part of the real-time intelligent systems that power business applications.

We think that the problem is not with model development. The challenge lies in the infrastructure between the model and the rest of the enterprise software stack.

The remedy to mistrust in “black-box” AI is not so much to open it up, but to build a robust and performant AI system that is also explainable, fair and easy-to-maintain. Machine learning operations (MLOps) is promising in this arena. 

What is machine learning operations (MLOps)?

 

MLOps (a compound of “machine learning” and “operations”) is the discipline that combines technology and practices to provide a scalable and governed means of deploying machine learning models in production environments.

It is an emergent machine learning practice that draws from DevOps approaches to increase automation in machine learning systems. MLOps helps provide a technology “paved road” for data scientists and IT operations teams to work together throughout the lifecycle of the machine learning model. 

Even though the MLOps market is still nascent, with technology solutions for effective model management emerging only in the last year or two, we expect MLOps to become a major component of the AI solution landscape. It has been predicted to be a major trend for 2020, and the MLOps market is projected to exceed $4 billion within a few years.

How does MLOps help productionise AI? 

 

Productionising AI is the process of turning a prototype model into a version that can be more easily mass-produced. This requires modern software engineering and big data capabilities. AI algorithms need to be converted into containerised micro-services that remain performant as demand increases. Production systems need to be able to handle massive volumes of training data, complex deep learning algorithms and high-throughput demand for real-time predictions.

Even the most experienced data teams capable of building solid machine learning models struggle to productionise AI in a timely manner. 

By abstracting away the development infrastructure, MLOps can help speed up this production process. In particular, MLOps:

  • Speeds up the process of experimenting and developing models 
  • Improves model tracking and version control
  • Reduces the time to deployment
  • Automates machine learning lifecycle management
  • Enables AI governance-by-design

MLOps tools are agnostic to the application they are used for. As such, they can be used to scale a range of machine learning applications across an enterprise.

MLOps speeds up the process of experimenting and developing models 

One key aspect of MLOps is building a process in which data science teams can test their models - a critical checkpoint prior to deployment. 

Most data scientists have advanced scientific programming skills. However, they often struggle with testing their code at scale. An MLOps engineer can help by providing access to powerful cloud computing hardware, allowing data teams to test their models with larger data sets. In doing so, MLOps can significantly speed up the complex process of machine learning research and development, allowing models to be built more quickly.

MLOps improves model tracking and version control 

Before model deployment, it is essential that the machine learning model complies with regulatory guidelines specific to the area of application. Fields like finance and healthcare typically require some degree of transparency in decision making. Many other verticals are also becoming increasingly aware of issues with fairness, safety, and liability when it comes to the output of machine learning models. 

MLOps enables permission control. It improves traceability by automatically tracking changes made to models. It also captures the data needed for audit trails, such as model parameters, metadata on the datasets used, the source code that describes the algorithm, a record of who trained the model and a time log. This minimises risk and promotes compliance with regulations.
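To make this concrete, here is a minimal Python sketch of what a single audit trail entry might hold. The `ModelAuditRecord` class, its field names and the `fingerprint` method are hypothetical names we chose for illustration; they are not taken from any particular MLOps tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelAuditRecord:
    """One entry in a hypothetical model audit trail."""

    model_name: str
    version: int
    parameters: dict        # model parameters used for training
    dataset_metadata: dict  # e.g. dataset name and row count
    code_revision: str      # e.g. a source-control commit identifier
    trained_by: str         # who trained the model
    trained_at: str = field(
        # Time log: defaults to the current UTC time in ISO 8601 format.
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Return a SHA-256 hash of the record's contents.

        Storing this hash alongside the record makes later changes
        to any field detectable.
        """
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example values below are made up for illustration.
record = ModelAuditRecord(
    model_name="churn-classifier",
    version=3,
    parameters={"max_depth": 6, "learning_rate": 0.1},
    dataset_metadata={"name": "customers-2020-01", "rows": 120000},
    code_revision="abc1234",
    trained_by="data.scientist@example.com",
)
print(record.fingerprint())
```

In a real system such records would typically be written automatically by the training pipeline rather than by hand, so that no model reaches production without one.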

MLOps reduces the time to deployment

Model deployment is the process of integrating the machine learning model into an existing production environment. This is a critical step in evaluating how the model performs in the wild. This process is challenging because it requires the coordination of different parties, including data scientists, engineers and project managers. The computational frameworks that power the machine learning model may also be different from the environment it is being deployed to. 

MLOps helps mitigate these issues by abstracting the code integration process, allowing data scientists to deploy their models in an automated fashion.

MLOps automates and consolidates disparate processes

MLOps can also support building a CI/CD pipeline. CI/CD stands for continuous integration and continuous delivery. A CI/CD pipeline automates the steps in the software development process, such as building updated code, running unit tests, running integration tests and deploying to staging environments. An MLOps engineer can build the bridges that connect these disparate processes and speed up the development cycle.
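The core idea of such a pipeline can be sketched in a few lines of Python: stages run in a fixed order, and a failure stops everything that follows. The `run_pipeline` function and the stage names are our own illustrative assumptions; in practice these stages are usually declared in the configuration of a CI service rather than written by hand.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a step that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, step in stages:
        if not step():
            print(f"Stage '{name}' failed; stopping the pipeline.")
            break
        completed.append(name)
    return completed


# Placeholder steps: a real pipeline would build, test and deploy here.
stages = [
    ("build", lambda: True),
    ("unit tests", lambda: True),
    ("integration tests", lambda: False),  # simulated failure
    ("deploy to staging", lambda: True),
]

completed = run_pipeline(stages)
```

Because the simulated integration tests fail, the deployment stage never runs. This is the property that keeps a broken model or broken code out of the staging environment.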

MLOps enables ML lifecycle management

Even if the model was tested comprehensively before deployment, it is still difficult to predict how it will perform when encountering new data. Model performance diminishes when the distribution of incoming data changes. Moreover, the commonly used tools for monitoring software are not easily adaptable for use with a machine learning product. 

MLOps provides methods for detecting data drift. It can also implement key metrics to monitor model performance. If performance drops below a certain threshold, the team is alerted to build a new model.
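As one possible illustration of drift detection, the Python sketch below compares the distribution of a feature in recent data against a reference sample using the Population Stability Index (PSI). PSI is our choice for this example and only one of several possible methods. The function names, the bin edges and the 0.2 threshold are assumptions to be tuned for each use case.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: List[float]) -> List[float]:
    """Fraction of values in each bin defined by sorted inner edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # Bin index = number of edges at or below the value.
        idx = sum(1 for e in edges if v >= e)
        counts[idx] += 1
    # A small floor avoids log(0) and division by zero for empty bins.
    return [max(c / len(values), 1e-6) for c in counts]


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], edges: List[float]
) -> float:
    """PSI between two samples; 0 means identical binned distributions."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


DRIFT_THRESHOLD = 0.2  # assumed cut-off, not a universal constant


def has_drifted(
    reference: Sequence[float],
    current: Sequence[float],
    edges: List[float],
    threshold: float = DRIFT_THRESHOLD,
) -> bool:
    """Return True when the PSI exceeds the alerting threshold."""
    return population_stability_index(reference, current, edges) > threshold


# Synthetic example: training data spread over [0, 1), live data
# concentrated in the upper half of that range.
edges = [0.25, 0.5, 0.75]
training_sample = [i / 100 for i in range(100)]
live_sample = [0.5 + i / 200 for i in range(100)]

if has_drifted(training_sample, live_sample, edges):
    print("Data drift detected: alert the team.")
```

A monitoring job could run a check like this on a schedule for each input feature and raise an alert when any of them crosses the threshold.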

Changing out an old model for a new one can cause problems when an application is being used by customers in real-time. This process needs to occur without interrupting services. 

One approach that MLOps engineers use is the canary rollout. In a canary rollout, only a small portion of model inference requests is served by the new model. This strategy allows issues to be identified while affecting only a small number of users. In this way, MLOps enables seamless iteration of models.
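The routing decision behind a canary rollout can be sketched as follows. This is a minimal Python illustration under our own assumptions: the `route_request` function, the use of a request ID as the routing key and the 5% share are hypothetical, and production systems usually delegate this to a load balancer or serving platform.

```python
import hashlib


def route_request(request_id: str, canary_percent: float) -> str:
    """Return "canary" or "stable" for a given request.

    Hashing the ID yields a deterministic, roughly uniform bucket in
    [0, 10000), so the same ID is always routed to the same model.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    # canary_percent is expressed in percent, e.g. 5 means 5%.
    return "canary" if bucket < canary_percent * 100 else "stable"


# Send roughly 5% of requests to the new model.
routes = [route_request(f"request-{i}", canary_percent=5) for i in range(10_000)]
canary_share = routes.count("canary") / len(routes)
print(f"Share of requests served by the new model: {canary_share:.1%}")
```

Using a hash rather than a random draw keeps routing consistent for a given ID. If the ID identifies a user instead of a single request, each user sees the same model for the whole rollout. Raising `canary_percent` step by step then shifts traffic to the new model without interrupting the service.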

MLOps allows for governance principles to be designed into systems

Poorly governed AI solutions can lead to poor predictions that are biased or unfair, causing damage to brand reputation and consumer trust in the organisation. This can be caused by anti-patterns in the enterprise AI strategy, such as the use of black-box AI systems, and getting locked into non-scalable solutions. 

MLOps provides the engineering practices and processes needed to build a well-governed AI system. These include:

  • Visibility of metrics. The ability to monitor and understand how your systems are functioning at every level.
  • Version control. The ability to collaborate, iterate and roll back where necessary.
  • Automation. Automating routine tasks for data scientists and engineers so that they can focus on the creative aspects.

Summary

We are excited to see enterprises beginning to understand the benefits of AI to automate, scale and improve their business operations. In order for AI/ML to have a significant impact, the machine learning lifecycle must seamlessly integrate into business operations. 

MLOps provides the ML infrastructure and tools to automate this lifecycle without interrupting normal operations. 

It also builds a common language and working structure between data scientists and data engineers. This way, they can each focus on what they do best, and collectively achieve the outcome of building scalable models to extract the greatest predictive power.

Importantly, MLOps makes governance-by-design possible, supporting the design and operation of systems that are visible, maintainable and auditable.

MLOps is a must-have in any organisation’s toolbox to unleash the full potential of AI-generated business value.