Introducing Boxkite: open source model observability for MLOps teams

So you’re putting machine learning (ML) models into production, but how do you know how they behave when exposed to production data?

Most teams start by monitoring their ML models just like regular software microservices: looking at latencies and error rates. And while latency and error rates are necessary for understanding model behaviour in production, they're not sufficient.

ML models are fundamentally different from regular software microservices, and that makes observability of models different too: models automatically learn patterns from the data they're trained on. When the data that the model is exposed to in production varies significantly from the training data, models can quickly degrade.

So how can you monitor the changing distributions of model inputs and outputs in an easy way that works well with the tools your DevOps teams are already using?

Enter Boxkite

Boxkite is an open source library that makes it easy to track data drift and model drift for machine learning models.

Data drift: how did the input data vary from when I trained my model to what it’s getting in production?

Model drift: how do the predictions/classifications that my model makes in production vary from those it made at training time?

How do I use it?

When you train the model

Just import Boxkite and add a call to the ModelMonitoringService.export_text method. This tells Boxkite to export a text file containing histograms of the statistical distributions of your model's features (the data the model was trained on) and inferences (the results of running your model against the training data).

This way, Boxkite takes a snapshot of the shape of the input and output of the model at training time, as well as the model itself. In this example, we’re logging the histogram to MLflow along with the model file itself, so that they can be tracked and versioned there together.

Record features & predictions at runtime

In your model serving code (assuming your model is running in, say, a flask server), initialise the ModelMonitoringService class with the histogram_file collected at training time:
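A minimal sketch of that initialisation, assuming a baseline collector that reads the training-time histogram from a local path (the BaselineMetricCollector class name and its path argument are assumptions based on common Boxkite usage; consult the Boxkite docs for the exact API):

```python
from boxkite.monitoring.collector import BaselineMetricCollector
from boxkite.monitoring.service import ModelMonitoringService

# Load the training-time baseline histogram downloaded alongside the model
# (e.g. fetched from MLflow at server start-up).
monitor = ModelMonitoringService(
    baseline_collector=BaselineMetricCollector(path="./histogram.txt"),
)
```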

Then, when doing inference, call the monitor.log_prediction method and Boxkite will automatically compare the production-time distributions to the training-time distributions:
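For example, in a Flask route (a sketch: the /predict route, request shape, and the `model` and `monitor` objects are assumed from your own serving code, and log_prediction's keyword arguments are assumptions — verify them against the Boxkite docs):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.json["features"]
    score = model.predict([features])[0]
    # Record this request's features and output so Boxkite can compare
    # production distributions against the training-time baseline.
    monitor.log_prediction(
        request_body=request.data, features=features, output=score,
    )
    return {"result": score}
```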

Simply expose metrics and use our Grafana dashboard

Next, expose a Prometheus-format /metrics endpoint that simply calls into the ModelMonitoringService's export_http method:
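Continuing the Flask sketch above (assumption: export_http returns the Prometheus-format metrics text as the first element of its return value — adapt this to the actual return shape documented by Boxkite):

```python
@app.route("/metrics", methods=["GET"])
def metrics():
    # Serve the current feature/inference distributions, plus the baseline,
    # in Prometheus exposition format for scraping.
    return monitor.export_http()[0]
```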

You can then configure your Prometheus instance to scrape your model server. How you do this depends on your setup: you might need to add a prometheus.io/scrape: "true" annotation to your pods, or Prometheus might already be configured to scrape your model server.
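If your Prometheus discovers targets via the conventional prometheus.io/* pod annotations, the pod spec might look like this (the port value here is a placeholder for whatever port your model server listens on):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
```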

What you get
Lastly, configure the Grafana dashboard to get the following!

You can either copy and paste the JSON for the dashboard into an existing dashboard or configure dashboard auto-provisioning in your Grafana instance as shown in the end to end example.
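For the auto-provisioning route, a minimal Grafana dashboard provider config looks like the following (the provider name and dashboards path are illustrative placeholders; point `path` at wherever you mount the dashboard JSON):

```yaml
apiVersion: 1
providers:
  - name: boxkite
    type: file
    options:
      path: /var/lib/grafana/dashboards
```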

The dashboard gives you:

  1. Baseline distribution over training set for a categorical feature.
  2. Baseline distribution over training set for a continuous variable.
  3. Baseline distribution of predictions for this continuous (regression) model.
  4. How the same values vary at runtime (in production, across multiple HA model servers).
  5. The KL divergence and Kolmogorov–Smirnov (K-S) test statistics, tracking how far these distributions have diverged over time!

Boxkite works with other MLOps tools

The end-to-end demo of Boxkite shows how it works well with other tools like Kubeflow and MLflow:

In this example:

  1. ML engineer trains model in Jupyter notebook, Boxkite generates histogram, both are logged to MLflow.
  2. ML engineer deploys model to Kubernetes cluster (using kubectl).
  3. Model server downloads model artifacts and histograms from MLflow.
  4. Production traffic hits the model server, it records distribution of data & inferences it receives and how it compares to training time metrics.
  5. Prometheus scrapes the multiple HA model servers and aggregates their metrics.
  6. Grafana dashboard queries Prometheus and computes KL/KS divergence on the fly.
  7. ML engineers & DevOps users can see model & data drift in real time and alert when divergence exceeds thresholds.

Try it out today

Boxkite is fully open source and available at
Follow one of our tutorials to easily get started and see how Boxkite works with other tools:

See Installation & User Guide for how to use Boxkite in any environment.

As always... any questions, direct them to