Introduction to monitoring and alerting

A high-level approach to monitoring and alerting with CHT applications

This guide applies to all production instances of the CHT, both 3.x (3.9.0 and later) and 4.x.

Be sure to see how to deploy a solution to monitor and alert on production CHT instances.

Each deployment will experience different stresses on its resources. If an alert turns out to be a false positive, tune its threshold so you can avoid it in the future. Any thresholds for alerts, and even what is alerted on, are guidelines only, not a guarantee of uptime.

Monitoring vs Alerting

Monitoring allows CHT admins to see statistics about their server, often over time. This is helpful for keeping an eye on growth in your deployment (e.g. the number of active users or the number of reports per region). Do not assume that these dashboards will be checked regularly enough to notice a problem, such as a spike in the number of feedback documents.

Alerting is a push mechanism designed to notify users who can act on the alert. Alerts can be sent over SMS, email, Slack, WhatsApp, or any other channel that reaches the right people.

Monitoring and alerting should be set up together. Monitoring sets the baseline, and alerting then tells admins when a metric has gone beyond that baseline to a critical state. Certain metrics, such as uptime, likely do not need a visualization on a dashboard, but the monitoring system should still be the authority that sends an alert when the service has restarted unexpectedly.

Outside the CHT

Be sure to monitor the important items that the CHT depends on in order to be healthy. You should alert when any of these is close to its maximum (e.g. disk space) or minimum (e.g. days left on a valid TLS certificate); see the sketch after this list:

  • Domain expiration with registrar
  • TLS certificate expiration
  • Disk & swap space
  • CPU utilization
  • Memory utilization
  • Network utilization
  • Process count
  • OS Uptime
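As a starting point, a small script can cover the TLS certificate and disk checks from the list above. This is a minimal sketch; the hostname and both thresholds are assumptions to replace with values that fit your deployment.

```python
# Minimal sketch of host-level checks. The hostname and both thresholds are
# assumptions -- replace them with values that fit your deployment.
import shutil
import socket
import ssl
import time

HOST = "cht.example.com"   # assumption: your CHT instance's domain
TLS_MIN_DAYS = 14          # assumption: alert when fewer days of validity remain
DISK_MAX_PERCENT = 80      # assumption: alert when disk usage exceeds this

def tls_days_remaining(host: str, port: int = 443) -> float:
    """Return the number of days until the certificate presented by host expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_epoch = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_epoch - time.time()) / 86400

def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the given mount point."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

if tls_days_remaining(HOST) < TLS_MIN_DAYS:
    print(f"ALERT: TLS certificate for {HOST} expires in under {TLS_MIN_DAYS} days")
if disk_used_percent() > DISK_MAX_PERCENT:
    print(f"ALERT: disk usage on / is above {DISK_MAX_PERCENT}%")
```

The same pattern extends to the other host-level items (CPU, memory, swap, process count); most teams get these from an existing tool such as Prometheus node exporter rather than hand-rolled scripts.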

Inside the CHT

The monitoring API was added in 3.9.0. It does not require any authentication, so it can easily be used with third-party tools, which do not need a CHT user account.

All metrics should be monitored over time so that you can easily see longitudinal patterns when debugging an outage or slowdown.
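For example, a scheduled job can poll the API and keep every snapshot so that history exists when you need it. This is a minimal sketch, assuming a hypothetical instance at cht.example.com and that the /api/v2/monitoring endpoint is available on your CHT version; older 3.x releases may only expose /api/v1/monitoring.

```python
# Minimal sketch of polling the monitoring API and keeping a history of
# snapshots. The hostname and endpoint version are assumptions -- confirm
# which monitoring endpoint your CHT version serves.
import json
import time
import urllib.request

URL = "https://cht.example.com/api/v2/monitoring"  # assumption: adjust host and API version

def fetch_snapshot(url: str = URL) -> dict:
    """Fetch the current monitoring JSON; no authentication is needed."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)

# Append a timestamped snapshot to a line-delimited JSON file so values can be
# compared over time (a time-series database works just as well here).
with open("monitoring-history.jsonl", "a") as history:
    history.write(json.dumps({"ts": time.time(), "data": fetch_snapshot()}) + "\n")
```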

Specifics of monitoring

Explosive Growth

Many of the values in the monitoring API do not mean much in isolation. For example, if an instance has 10,714,278 feedback docs, is that bad? If the instance is years old and has thousands of users, this is normal. If it is 4 months old and has 100 users, this is a dire problem!

You should monitor these metrics for unexpected growth, measured as percent change over 24 hours. A reasonable starting point is to flag growth of more than 5% compared to the prior day, though the right threshold is a judgment call for each deployment. These metrics are marked as growth in the table below.
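The growth check itself is a simple percent-change calculation. This is a minimal sketch; the stored counts and the 5% threshold are illustrative assumptions to tune per deployment.

```python
# Minimal sketch of the 24-hour growth check. The counts and the 5% threshold
# are illustrative assumptions; tune the threshold per deployment.
GROWTH_THRESHOLD_PERCENT = 5

def growth_percent(previous: float, current: float) -> float:
    """Percent change from the value 24 hours ago to the current value."""
    if previous == 0:
        return float("inf") if current > 0 else 0.0
    return (current - previous) / previous * 100

feedback_yesterday = 10_100_000   # illustrative value from yesterday's snapshot
feedback_today = 10_714_368       # illustrative value from today's snapshot

change = growth_percent(feedback_yesterday, feedback_today)
if change > GROWTH_THRESHOLD_PERCENT:
    print(f"ALERT: feedback docs grew {change:.1f}% in the last 24 hours")
```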

Non-Zero Values

Other values should always be zero, and you should alert when they are not. You may opt to alert only when they are non-zero for more than 24 hours. These are marked as non-zero in the table below.

Zero or Near Zero Values

Finally, these values should never be zero, and you should alert when they are zero or very close to it. You may opt to alert only when they remain at zero for more than 24 hours. They're marked with zero in the table below.
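Both the non-zero and zero rules reduce to comparing the latest values against expectations. This is a minimal sketch; the values are illustrative, the metric names match the table below, and how each value is read out of the monitoring JSON is left to your tooling.

```python
# Minimal sketch applying the non-zero and zero rules. Values are illustrative.
current_values = {
    "Messaging Outgoing State Failed": 0,
    "Outbound Push Backlog": 12,          # illustrative: would trigger an alert
    "Sentinel Backlog": 0,
    "Date Uptime": 1_626_508.148,
}

SHOULD_BE_ZERO = [        # "non-zero" rows: alert when these are not zero
    "Messaging Outgoing State Failed",
    "Outbound Push Backlog",
    "Sentinel Backlog",
]
SHOULD_NOT_BE_ZERO = [    # "zero" rows: alert when these drop to zero
    "Date Uptime",
]

for name in SHOULD_BE_ZERO:
    if current_values[name] != 0:
        print(f"ALERT: {name} is {current_values[name]}, expected 0")

for name in SHOULD_NOT_BE_ZERO:
    if current_values[name] == 0:
        print(f"ALERT: {name} is 0, expected a non-zero value")
```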

Elements, types and samples

The names below are extrapolated from the paths in the JSON returned by the API and should be easy to find when viewing the Monitoring API URL on your CHT instance:

Name                               Type       Example Value
Conflict Count                     growth     23,318
CouchDB Medic Doc Count            growth     16,254,271
CouchDB Medic Fragmentation        growth     1.4366029665729645
CouchDB Sentinel Doc Count         growth     15,756,449
CouchDB Sentinel Fragmentation     growth     2.388733774539664
CouchDB Users Doc Count            growth     535
CouchDB Users Fragmentation        growth     2.356411021364134
CouchDB Users Meta Doc Count       growth     10,761,549
Feedback Count                     growth     10,714,368
Messaging Outgoing State Due       growth     3,807
Messaging Outgoing State Failed    non-zero   0
Outbound Push Backlog              non-zero   0
Sentinel Backlog                   non-zero   0
Date Uptime                        zero       1,626,508.148
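If your tooling works on the raw JSON, a small helper that walks a dotted path can map these names to values. This is a minimal sketch; the example path feedback.count is an assumption based on the naming above, so confirm every path against the response from your own instance's monitoring URL.

```python
# Minimal sketch of reading one metric from the monitoring JSON by a dotted
# path. The path "feedback.count" is an assumption -- verify it against the
# JSON your own instance returns.
from typing import Any

def get_path(snapshot: dict, dotted_path: str) -> Any:
    """Walk a dotted path (e.g. 'feedback.count') through nested dictionaries."""
    value: Any = snapshot
    for key in dotted_path.split("."):
        value = value[key]
    return value

snapshot = {"feedback": {"count": 10_714_368}}   # stand-in for the real API response
print(get_path(snapshot, "feedback.count"))      # -> 10714368
```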