
Comet Model Production Monitoring (MPM)

Comet's Model Production Monitoring (MPM) helps you maintain high-quality ML models by monitoring and alerting on defective predictions from models deployed in production. MPM supports all model types and allows you to monitor all your models in one place. It integrates with Comet Experiment Management, allowing you to track model performance from training to production.

Comet MPM is an optional module that can be installed alongside Experiment Management (EM).

Helm chart / Kubernetes

To proceed with applying the Helm chart, first refer to our Helm chart documentation, which covers the necessary Helm commands. Then follow the steps below to enable MPM (Model Production Monitoring).

To enable Comet MPM using the Helm chart, update your override values by setting the enabled field in the mpm section to true:

# ...
comet:
  # ...
  mpm:
    enabled: true
# ...

Note, however, that MPM will NOT work without the additional Druid and Airflow stacks, which must be configured as outlined below.

Resource Requirements

Our general recommendation is to run 3 MPM pods, each with access to 16 vCPUs/cores and 32 GiB of memory/RAM. This should support an average load of about 4,500 predictions per second.

As usual, we recommend isolating your MPM pods from other workloads on your Kubernetes cluster, or setting aggressive resource reservations/requests, to ensure optimal performance.

Ideally, these should also be in a separate node pool from the one used for the core Comet application/Experiment Management. This is not necessary if you are not using the EM product (or use it only minimally), or if you set aggressive resource reservations/requests for MPM.

We recommend a dedicated pool of three 16 vCPU/32 GiB nodes (or however many nodes match your MPM replicaCount). See the Nodes section under Druid and Airflow below for further details about node pools.

To specify the node pool, the nodeSelector setting lets you define a mapping of labels that must be present on any node on which the MPM pods will be scheduled.

# ...
backend:
  # ...
  mpm:
    nodeSelector:
      agentpool: comet-mpm
    # tolerations: []
    # affinity: {}
# ...
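
If the MPM node pool is tainted to keep other workloads off of it, you can also uncomment and populate the tolerations (and, if needed, affinity) fields shown above. The sketch below is an example only, assuming a hypothetical taint of comet-workload=mpm:NoSchedule applied to the pool; substitute whatever taint you actually use.

# ...
backend:
  # ...
  mpm:
    nodeSelector:
      agentpool: comet-mpm
    # Example toleration matching a hypothetical taint on the MPM node pool
    tolerations:
      - key: comet-workload
        operator: Equal
        value: mpm
        effect: NoSchedule
# ...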

To set aggressive resource reservations/requests or to adjust the pod count, you can use the following settings:

WARNING: When setting aggressive resource reservations, you must either have spare nodes or much larger nodes if you wish to maintain availability when updating the pods. Otherwise you will not have enough capacity to run more than your configured pod count and will need to scale down and back up to replace pods.

# ...
backend:
  # ...
  mpm:
    # replicaCount: 3
    memoryRequest: 32Gi
    # memoryLimit: 32Gi
    cpuRequest: 16000m
    # cpuLimit: 16000m
# ...

Druid and Airflow

MPM depends on Druid DB and Airflow for orchestrating and storing data. Our Helm chart also installs the Druid and Airflow subcharts.

When running MPM with Druid and Airflow, be sure to also set the following in the Helm values:

# ...
backend:
  # ...
  mpm:
    # ...
    druid:
      enabled: true
# ...
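
Taken together with the comet.mpm.enabled setting shown earlier, a minimal override for an MPM-enabled installation looks roughly like the sketch below. The exact set of toggles can vary by chart version, so check the chart's default values for any additional Airflow-related settings.

# ...
comet:
  # ...
  mpm:
    enabled: true
backend:
  # ...
  mpm:
    druid:
      enabled: true
# ...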

We recommend assigning Druid to its own node pool.

The nodeSelector lets you define a mapping of labels that must be present on any node on which the Druid pods will be scheduled.

# ...
druid:
  # ...
  broker:
    nodeSelector:
      nodegroup_name: druid
  # ...
  coordinator:
    nodeSelector:
      nodegroup_name: druid
  # ...
  overlord:
    nodeSelector:
      nodegroup_name: druid
  # ...
  historical:
    nodeSelector:
      nodegroup_name: druid
  # ...
  middleManager:
    nodeSelector:
      nodegroup_name: druid
  # ...
  router:
    nodeSelector:
      nodegroup_name: druid
  # ...
  zookeeper:
    nodeSelector:
      nodegroup_name: druid
# ...

Nodes

Druid Nodes

We need four VMs to run the Druid stack. The specifications for these VMs are as follows:

  • Number of Instances: 4
  • Specifications per Instance:
    • vCPUs: 8
    • Memory: 32 GiB
    • Storage: 1000 GiB

Airflow Nodes

We need three VMs to run the Airflow stack. The specifications for these VMs are as follows:

  • Number of Instances: 3
  • Specifications per Instance:
    • vCPUs: 2
    • Memory: 4 GiB
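
If you also want to pin the Airflow pods to a dedicated node pool, the upstream Airflow chart accepts a top-level nodeSelector value. The sketch below assumes the Airflow subchart is exposed under an airflow key in the override values (mirroring the druid key above); check the chart's default values for the exact key names before using it.

# ...
airflow:
  # ...
  nodeSelector:
    nodegroup_name: airflow
# ...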

On-Prem Baremetal Deployment

For running the stack on-prem with baremetal servers, the total compute and storage resources required are as follows:

Compute Resources

  • CPU: 22 cores
    • 16 cores for Druid
    • 6 cores for Airflow
  • Memory: 128 GiB
    • 74 GiB for Druid
    • 12 GiB for Airflow
    • remainder for expansion

Storage Resources

  • PV Storage: 1400 GiB (for Druid PVCs)
  • Node Storage: 100 GiB per node
  • Total Storage: 2100 GiB (1400 GiB of PV storage plus 100 GiB of node storage on each of the 7 Druid and Airflow nodes)

These specifications ensure that our Druid and Airflow stacks run efficiently with the required resources for processing and querying data.

Druid Pod Resources and Replicas

In cases of excessively high (or low) utilization, you may want to adjust the resource requests/limits for the Druid pods. This can be done from the Helm values, in the druid section alongside the other dependency configuration. That section also lets you set pod replica counts. Example:

# ...
druid:
  # ...
  broker:
    replicaCount: 2
    resources:
      requests:
        cpu: 3          # Adjust CPU request to 3 cores
        memory: 12Gi    # Adjust memory request to 12Gi
      limits:
        cpu: 4          # Optionally set a CPU limit
        memory: 14Gi    # Adjust memory limit to 14Gi
# ...

WARNING: When changing replicas or resource requests/limits, you'll need to adjust your node counts and/or sizes to accommodate them. When setting aggressive resource reservations, you must either have spare nodes or much larger nodes if you wish to maintain availability when updating the pods. Otherwise you will not have enough capacity to run more than your configured pod count and will need to scale down and back up to replace pods.

Jul. 9, 2024