AI Monitoring with Bedrock
Enterprises want to trust the AI systems they are using. But many assume that AI systems will work perfectly once they are in production, or hope that the system will be intelligent enough to troubleshoot on its own. In this article, one of BasisAI's senior software engineers explains why this is a faulty and costly assumption, and how you can easily monitor AI systems throughout the entire machine learning lifecycle.
All AI models degrade. Environmental changes affect input data, and the underlying concepts drift. To ensure AI systems function as intended, system owners need ongoing monitoring. This can be costly and difficult to justify: it is more appealing to allocate manpower and budget to developing new AI-driven products. Yet we have seen how machine learning mishaps can put a dampener on AI innovation or, worse, lead to reputational damage.
As AI adoption accelerates, enterprises need to ensure their algorithms keep delivering the intended outcomes through responsible AI practices. In this article, we show you how Bedrock can help you achieve optimum AI performance with ongoing monitoring of your AI system throughout the machine learning lifecycle.
Bedrock is a machine learning (ML) platform that enables data science teams to train ML models, conduct batch ML inferencing and deploy ML models as an API service for real-time inference. It is made up of two main components:
- A REST API service that provides you with the features of Bedrock such as orchestration and monitoring; and
- A fully managed environment, also known as a workload cluster, that runs in your private cloud, e.g. Amazon Web Services (AWS), Google Cloud Platform (GCP), or Azure. Your data, training artefacts and workload never leave your cloud environment.
What Bedrock does for you:
- Enables end-to-end deployment by training ML models and serving them as REST endpoints.
- Promotes infrastructure automation by abstracting away the k8s cluster to help you orchestrate ML workloads on a supported cloud provider of your choice.
- Provides metrics for models and applications serving their inference results.
- Retains control and governance through viewer, editor, and administrator roles for entities on Bedrock, managed via a single pane of glass.
Before we move on to discuss the technical features of Bedrock that enable end-to-end AI monitoring, two other resources we highly recommend are a quick introduction to Bedrock and Why we built Bedrock.
How Bedrock monitors model performance during training
The first step to monitoring ML models in production is to collect their baseline performance during training. To do so, Bedrock computes and logs various metrics to track model performance.
Bedrock plots several useful metrics to track model performance. Metrics built into the Bedrock user interface include Precision-Recall, the AUC (area under the curve) ROC (receiver operating characteristic) curve, and the Confusion Matrix. Users may also log custom metrics using the bdrk client library.
Bedrock also helps users track CPU and memory usage for training and batch scoring runs. This information helps users allocate the right amount of resources to reduce cost.
CPU usage is measured by taking the rate of increase in the number of consumed CPU seconds. A single counter is used to track the total amount of time your application spends on the CPU. That number is then divided by the rate interval (2 minutes by default) to compute the fraction of CPU used during that interval. In a multi-core setup, this fraction may be greater than 1.
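A minimal sketch of that calculation (the function name and sample numbers are ours, for illustration only):

```python
def cpu_fraction(cpu_seconds_start, cpu_seconds_end, interval_seconds=120):
    """Fraction of CPU used over a rate interval, derived from a
    monotonically increasing counter of consumed CPU seconds.

    With the default 2-minute interval, a single-core process that was
    busy the whole time yields 1.0; on a multi-core setup the fraction
    may exceed 1.
    """
    delta = cpu_seconds_end - cpu_seconds_start
    return delta / interval_seconds

# A process consumed 180 CPU-seconds over a 2-minute window:
# two fully busy cores give a fraction of 1.5.
print(cpu_fraction(1000.0, 1180.0))  # 1.5
```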
Memory usage is measured by sampling the amount of resident memory in bytes. It is a gauge metric whose value is taken directly from the container's cgroup resource files. Because of the fixed sampling rate, short-lived spikes during a training run may not be captured. All metrics are retained for 30 days unless a longer retention period is specifically requested by our users.
Compare Model Versions
To compare the performance of different model versions trained using the same pipeline, Bedrock provides users with multiple plots. Users can easily export training metrics to Bedrock by adding logging code to train.py. The example below demonstrates logging charts and metrics for visualisation on the Bedrock user interface.
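To make the shape of such a script concrete, here is a minimal sketch that computes a confusion matrix and precision/recall in plain Python; the actual logging calls would use the bdrk client library, whose API we do not reproduce here:

```python
def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix for a binary classifier: [[TN, FP], [FN, TP]]."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return [[tn, fp], [fn, tp]]

def precision_recall(y_true, y_pred):
    """Precision and recall derived from the confusion matrix."""
    (_, fp), (fn, tp) = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy validation labels and predictions from a training run.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))   # [[3, 1], [1, 3]]
print(precision_recall(y_true, y_pred))   # (0.75, 0.75)
```

In a real train.py, values like these would be passed to the bdrk logging calls so the charts appear on the Bedrock user interface.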
How Bedrock monitors model performance in production
Common challenges of productionising AI include ensuring new changes do not break existing systems, and timely tracking of improvements in model performance at serving time. As a workload orchestrator, Bedrock provides a holistic overview through:
- Regular updates and maintenance of the ML infrastructure.
- Insights into the performance of deployed models and the utilisation of the infrastructure through the platform.
- Alerts to users whenever there are issues, with rectification within the service-level agreements.
Here are some examples of Bedrock functionalities that enable ongoing monitoring of model performance:
Model endpoint metrics
When deploying a new model server to production systems, users have to ensure new changes do not break existing systems. Bedrock automatically collects and exposes service level metrics to data scientists.
Throughput measures the rate of requests to a particular model endpoint. Bedrock measures both the aggregated metrics from the load balancer as well as the individual rate hitting each model server. Response time measures the latency of serving each request. Users may select various percentiles from p50 to p99 for a holistic view of their endpoint's latency.
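A nearest-rank percentile over a window of latency samples shows why the higher percentiles matter: a single slow request is invisible at p50 but dominates p95 and p99. This is an illustrative sketch, not Bedrock's implementation:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest observed value such that
    at least p% of samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

# Ten request latencies in milliseconds; one outlier at 250 ms.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
# p50 stays at 13 ms, while p95 and p99 surface the 250 ms outlier.
```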
After a model is deployed in production, a question often asked by users is when to retrain a model. This requires looking at several model health indicators. One way is to compare the feature distribution between training and production.
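One common way to quantify such a comparison is the population stability index (PSI). The sketch below is our illustration of that general technique over pre-binned distributions, not necessarily the measure Bedrock uses internally:

```python
import math

def psi(train_fracs, prod_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions.
    A common rule of thumb: PSI < 0.1 is stable, while PSI > 0.25
    signals a shift worth investigating."""
    total = 0.0
    for t, p in zip(train_fracs, prod_fracs):
        t = max(t, eps)  # clamp to avoid log(0) on empty bins
        p = max(p, eps)
        total += (p - t) * math.log(p / t)
    return total

# Fraction of rows per feature bin at training vs. in production.
train = [0.25, 0.25, 0.25, 0.25]
prod_same = [0.25, 0.25, 0.25, 0.25]
prod_shifted = [0.10, 0.20, 0.30, 0.40]
print(psi(train, prod_same))     # 0.0 — identical distributions
print(psi(train, prod_shifted))  # ~0.23 — shift worth investigating
```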
If you would like to learn more about enabling feature distribution tracking for your production models, follow our tutorial here: https://docs.basis-ai.com/guides/tutorials/detect-feature-drift.
Assuming the trained model is accurate, the prediction distribution can often yield actionable insights for the business. In the synthetic example above, the model predicts customer churn based on call data. The data from the last 30 minutes shows a churn rate of more than 60%. Once the business understands when the churn is occurring, it can investigate further and take appropriate action to reduce customer churn.
Model rot occurs when model inference becomes inaccurate at serving time. Bedrock enables users to detect model rot by posting back the ground truth data. For example, the figure below shows the inference distribution of a binary classifier. You can easily compare the production distribution in the selected time range with that of training.
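As an illustration of the idea behind ground-truth post-back, the toy class below compares rolling production accuracy against the accuracy measured at training time; the class, names and thresholds are our own construction, not Bedrock's implementation:

```python
from collections import deque

class ModelRotDetector:
    """Toy sketch: flag model rot when rolling production accuracy
    (computed from posted-back ground truth) falls below the
    training-time baseline by more than a tolerance."""

    def __init__(self, training_accuracy, window=100, tolerance=0.05):
        self.training_accuracy = training_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # True = correct prediction

    def post_back(self, prediction, ground_truth):
        self.outcomes.append(prediction == ground_truth)

    def is_rotting(self):
        if not self.outcomes:
            return False
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.training_accuracy - self.tolerance

detector = ModelRotDetector(training_accuracy=0.90)
for _ in range(80):
    detector.post_back(1, 1)   # correct predictions
for _ in range(20):
    detector.post_back(1, 0)   # wrong predictions
print(detector.is_rotting())   # rolling accuracy 0.80 < 0.85 -> True
```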
Service uptime alerts
Using signals from multiple indicators, Bedrock is also able to alert users of potential issues with their model deployment. For example, if 60% of the features have drifted from their training distribution, there could be a problem with the underlying data source. Users will be able to configure these thresholds within Bedrock to be alerted of such issues.
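A minimal sketch of such a threshold check, continuing the 60% example (the function, feature names and structure are ours, for illustration):

```python
def should_alert(feature_drift_flags, threshold=0.6):
    """Alert when the fraction of drifted features reaches a
    user-configured threshold (0.6 mirrors the 60% example above)."""
    drifted = sum(feature_drift_flags.values())
    return drifted / len(feature_drift_flags) >= threshold

# Per-feature drift verdicts, e.g. from comparing distributions.
flags = {"age": True, "income": True, "tenure": True,
         "region": False, "plan": False}
print(should_alert(flags))  # 3 of 5 features drifted (60%) -> True
```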
How Bedrock promotes extensibility
Bedrock is designed to play well with existing DevOps infrastructure. To support the open metrics standard, Bedrock does not adopt a proprietary metrics format for storage. Instead, metrics are exported in the Prometheus exposition format so that other visualisation systems can be built around the collected metadata. In addition, Bedrock is designed to protect the privacy of user data: statistics of the distributions used for training are always stored in the workload cluster. Find out how we implemented these at our engineering blog.
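For example, a single metric rendered in the Prometheus text exposition format looks like this; the metric name and labels are illustrative, not Bedrock's actual metric names:

```python
def to_exposition(name, value, labels=None, help_text=None, metric_type="gauge"):
    """Render one metric in the Prometheus text exposition format:
    optional # HELP and # TYPE comment lines, then the sample line."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines)

print(to_exposition(
    "model_request_latency_p99_seconds", 0.25,
    labels={"endpoint": "churn-model"},
    help_text="99th percentile request latency",
))
```

Because the format is plain text, any Prometheus-compatible scraper or dashboard can consume it without Bedrock-specific tooling.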
Conclusion: AI monitoring promotes responsible AI
Ongoing monitoring of AI systems allows for rapid and responsible AI model development. When AI systems are monitored throughout the machine learning lifecycle, total model failure and unintended biases can be averted by retraining the model with minimal interruption to existing systems. By tracking the right features, enhancements for optimal model performance can take place in a timely manner. Bedrock automates these ongoing monitoring processes, significantly reducing maintenance costs for you, while allowing you to retain control of your data and processes on your private cloud.
For a good understanding of the run metrics we encourage users to track, i.e. AUC, ROC, F1, etc., refer to the following articles:
- Tracking and sustaining the performance of predictive models: https://medium.com/@QuantumBlack/tracking-and-sustaining-the-performance-of-predictive-models-d3116b9976bb
- Why is machine learning deployment hard?: https://towardsdatascience.com/why-is-machine-learning-deployment-hard-443af67493cd
- How to deploy machine learning models: https://christophergs.com/machine%20learning/2019/03/17/how-to-deploy-machine-learning-models/
- Continuous delivery for machine learning: https://martinfowler.com/articles/cd4ml.html
- Reliable Insights, a blog on monitoring, scale and operational sanity: https://www.robustperception.io/blog
- Datmo on GitHub: https://github.com/datmo/datmo
ThoughtWorks explains the use cases for prediction logging using the open-source Fluentd, Elasticsearch, and Kibana stack. The implementation is similar to Bedrock's prediction store.
| | Pattern 1 (REST API)* | Pattern 2 (Shared DB)* | Pattern 3 (Streaming) | Pattern 4 (Mobile App) |
| --- | --- | --- | --- | --- |
| Prediction | On the fly | Batch | Streaming | On the fly |
| Prediction result delivery | Via REST API | Through the shared DB | Streaming via message queue | Via in-process API on mobile |
| Latency for prediction | Moderate | High | Very low | Low |
| System management difficulty | Moderate | Easy | Very hard | Moderate |
* Bedrock currently supports patterns 1 and 2 via model endpoints and batch scoring pipelines respectively.
After deployment, multiple factors could contribute to model performance degradation:
- Feature distribution changed (covariate drift)
- Relationship between the features and the target changed (concept drift)
- Prior distribution drift (the business's prior assumptions no longer hold)
- Business behaviour affected by model predictions (feedback loop)
Sometimes retraining is not sufficient, or is too expensive:
- A good alerting threshold depends on business insight (a garden sprinkler may tolerate a three-minute delay, but a fire alarm cannot)
- Lack of ground truth data (one proposed solution is to reweigh samples in the original data set using similar, fresh samples and retrain on the adjusted data set: https://blog.smola.org/post/4110255196/real-simple-covariate-shift-correction)
Tracking model performance over time uses a simple thought experiment to illustrate why tracking feature and prediction distributions alone is not enough to determine model performance degradation.