CHT Watchdog Setup

Setting up Grafana and Prometheus with the CHT

These instructions apply to both CHT 3.x (3.12 and later) and CHT 4.x.

Medic maintains CHT Watchdog, an opinionated configuration of Prometheus (including json_exporter) and Grafana that can easily be deployed using Docker. It is supported on CHT 3.12 and later, including CHT 4.x. With this solution, a CHT deployment can easily get longitudinal monitoring and push alerts via email, Slack, or other mechanisms. All tools are open source and have no licensing fees.

The solution provides both an overview dashboard and a detail dashboard. Here is a portion of the overview dashboard:

Screenshot of Grafana Dashboard showing data from Prometheus

Prometheus supports four metric types: Counter, Gauge, Histogram, and Summary. Currently, the CHT only provides Counter and Gauge type metrics. When building panels for Grafana dashboards, Prometheus Functions can be used to manipulate the metric data. Refer to the Grafana Documentation for best practices on building dashboards.
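
For example, a dashboard panel might apply rate() to a Counter such as cht_messaging_outgoing_total to get a per-second rate, while a Gauge such as cht_sentinel_backlog_count can be graphed directly. As a rough sketch, the same expression can also be tested outside Grafana against the Prometheus HTTP API (this assumes the Prometheus port 9090 is reachable from where you run the command, which may not be the case by default):

# Evaluate a PromQL expression via the Prometheus HTTP API (sketch only);
# rate() converts the Counter into a per-second rate over a 1-hour window.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(cht_messaging_outgoing_total[1h])'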

Prerequisites

To follow these steps you will need a server with Docker and the Docker Compose plugin installed, git, and a CHT instance running version 3.12 or later.

Setup

These instructions have been tested against Ubuntu, but should work on any OS that meets the prerequisites. They follow a happy path that assumes you only need to set a secure password and specify the URL(s) to monitor:

  1. Run the following commands to clone this repository, initialize your .env file, create a secure password and create your data directories:

    cd ~
    git clone https://github.com/medic/cht-watchdog.git
    cd cht-watchdog
    cp cht-instances.example.yml cht-instances.yml
    cp grafana/grafana.example.ini grafana/grafana.ini
    mkdir -p grafana/data && mkdir -p prometheus/data
    sudo apt install -y wamerican  # ensures /usr/share/dict/words is present for the shuf call below
    cp .env.example .env
    password=$(shuf -n7 /usr/share/dict/words --random-source=/dev/random | tr '\n' '-' | tr -d "'" | cut -d'-' -f1,2,3,4,5,6,7)
    sed -i -e "s/password/$password/g" .env
    echo;echo "Initial project structure created! To log into Grafana in the browser:";echo 
    echo "    username: medic"
    echo "    password: ${password}";echo
    

    If you’re using docker-compose v2.x, it doesn’t support relative paths and you’ll have to edit your .env file to change the paths to absolute paths.

    Note that in step 4 below you’ll need the username and password, which are printed after you run the commands above.

  2. Edit the cht-instances.yml file to list the URLs of your CHT instances. You may include as many instances as you like.

    Here is an example:

    - targets:
      - https://subsub.sub.example.com
      - https://cht.domain.com
      - https://website.org
    
  3. Run the following command to deploy the stack:

    cd ~/cht-watchdog
    docker compose up -d
    
  4. Grafana is available at http://localhost:3000. See the output from step 1 for your username and password.
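
    Optionally, confirm Grafana is responding before logging in (a quick sanity check; /api/health is a standard Grafana endpoint):

    # Should return JSON including "database": "ok" once Grafana is ready
    curl -s http://localhost:3000/api/health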

If you would like to do more customizing of your deployment, see “Additional Configuration”.

Upgrading

Before upgrading, you should back up both your current configuration files and your Prometheus and Grafana data directories.
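
A minimal backup sketch, assuming the default paths created during setup (stop the stack first so the Prometheus and Grafana data are consistent; sudo is needed because the data directories are written by the containers):

cd ~/cht-watchdog
docker compose down
# Archive the configuration files plus both data directories
sudo tar -czf ~/cht-watchdog-backup-$(date +%F).tar.gz \
  .env cht-instances.yml grafana/grafana.ini grafana/data prometheus/data
docker compose up -d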

Prometheus, Grafana and JSON Exporter

To upgrade these dependencies, update the version numbers set in your .env file (or leave them set to latest). Then run the following commands:

docker compose pull
docker compose up -d
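
If you prefer to pin explicit versions rather than track latest, one option (a sketch; the version numbers below are purely illustrative, not recommendations) is to update the variables in .env before pulling:

# Illustrative pins only; substitute the image tags you have actually tested
sed -i \
  -e 's/^GRAFANA_VERSION=.*/GRAFANA_VERSION=10.4.2/' \
  -e 's/^PROMETHEUS_VERSION=.*/PROMETHEUS_VERSION=v2.52.0/' \
  -e 's/^JSON_EXPORTER_VERSION=.*/JSON_EXPORTER_VERSION=v0.6.0/' \
  .env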

CHT Watchdog

When you see a new version in the GitHub repository, first review the release notes and upgrade instructions. Then, run the following commands to deploy the new configuration, being sure to replace TAG with the tag name of the release (e.g. 1.1.0):

cd ~/cht-watchdog
git fetch
git -c advice.detachedHead=false checkout TAG
docker compose pull
docker compose down
docker compose up -d --remove-orphans

Additional Configuration

When making any changes to your CHT Watchdog configuration (e.g. adding or removing instances in the cht-instances.yml file), make sure to restart all services to pick up the changes:

cd ~/cht-watchdog
docker compose down
docker compose up -d

couch2pg Data

With the release of 1.1.0, CHT Watchdog supports ingesting couch2pg data read from a Postgres database (Postgres 9.x and later are supported).

  1. Copy the two example config files so you can add the correct contents to them. Do so by running these commands:

    cd ~/cht-watchdog
    cp exporters/postgres/postgres-instances.example.yml exporters/postgres/postgres-instances.yml
    cp exporters/postgres/postgres_exporter.example.yml exporters/postgres/postgres_exporter.yml
    
  2. Edit the postgres-instances.yml file you just created and add your target Postgres connection URL, along with the root URL of your CHT instance as the label value. For example, if your Postgres server were db.example.com and your CHT instance were cht.example.com, the config would be:

    - targets: [db.example.com:5432/cht]
      labels:
        cht_instance: cht.example.com
    
  3. Edit postgres_exporter.yml so that the auth_modules object for your Postgres instance has the proper username and password. Using our db.example.com example from above and assuming a password of super-secret and a username of pg_user, the config would be:

    db.example.com:5432/cht: # Needs to match the target URL in postgres-instances.yml
      type: userpass
      userpass:
        username: pg_user
        password: super-secret
      options:
        sslmode: disable
    
  4. Start your instance up, being sure to include both the existing docker-compose.yml file and the docker-compose.postgres-exporter.yml file:

    cd ~/cht-watchdog
    docker compose -f docker-compose.yml -f exporters/postgres/docker-compose.postgres-exporter.yml up -d
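
    To confirm the exporter came up cleanly, the same pair of compose files can be used to list the services and tail their logs (just a quick check):

    docker compose -f docker-compose.yml -f exporters/postgres/docker-compose.postgres-exporter.yml ps
    docker compose -f docker-compose.yml -f exporters/postgres/docker-compose.postgres-exporter.yml logs --tail=20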
    

Prometheus Retention and Storage

By default, historical monitoring data will be stored in Prometheus (PROMETHEUS_DATA directory) for 60 days (configurable by PROMETHEUS_RETENTION_TIME). A longer retention time can be configured to allow for longer-term analysis of the data. However, this will increase the size of the Prometheus data volume. See the Prometheus documentation for more information.

Local storage is not suitable for storing large amounts of monitoring data. If you intend to store multiple years’ worth of metrics, you should consider integrating Prometheus with a Remote Storage.
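
For example, to raise retention to one year (a sketch; PROMETHEUS_RETENTION_TIME and the 1y format come from the Environment Variables reference below), edit .env and re-run compose so the Prometheus container is recreated:

sed -i -e 's/^PROMETHEUS_RETENTION_TIME=.*/PROMETHEUS_RETENTION_TIME=1y/' .env
docker compose up -d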

Alerts

This configuration includes a number of pre-provisioned alert rules. Additional alerting rules (and other contact points) can be set in the Grafana UI.

See both the high-level Grafana alerting Documentation and provisioning alerts in the UI for more information.

Deleting provisioned alert rules

The provisioned alert rules shipped with CHT Watchdog are intended to be generally applicable to most CHT deployments. However, not all of the alert rules will necessarily be useful for everyone. If you would like to delete any of the provisioned alert rules, you can do so with the following steps:

  1. In Grafana, navigate to “Alerting” and then “Alert Rules” and click the eye icon for the rule you want to delete. Copy the Rule UID, which can be found on the right and is a 10-character value like mASYtCQ2j.

  2. Create a delete-rules.yml file:

    cd ~/cht-watchdog
    cp grafana/provisioning/alerting/delete-rules.example.yml grafana/provisioning/alerting/delete-rules.yml
    
  3. Update your new delete-rules.yml file to include the Rule UID(s) of the alert rule(s) you want to delete.

  4. Restart Grafana:

    docker compose restart grafana
    

If you ever want to re-enable the alert rules you deleted, you can simply remove the Rule UID(s) from the delete-rules.yml file and restart Grafana again.
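
For illustration, a populated delete-rules.yml might look like the following (a sketch assuming the standard Grafana alert-provisioning format; defer to the structure in delete-rules.example.yml shipped with your version):

cat > grafana/provisioning/alerting/delete-rules.yml <<'EOF'
apiVersion: 1
deleteRules:
  - orgId: 1
    uid: mASYtCQ2j   # the Rule UID copied from the Grafana UI in step 1
EOF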

Modifying provisioned alert rules

The provisioned alert rules cannot be modified directly. Instead, you can copy the configuration of a provisioned alert into a new custom alert with the desired changes. Then, remove the provisioned alert.

  1. Open the alert rule you would like to modify in the Grafana alert rules UI and select the “Copy” button.
  2. Update the copied alert rule with the desired changes and save it into a new Evaluation group.
  3. Remove the provisioned alert.

Configuring Contact Points

Grafana supports sending alerts via a number of different methods. Two likely options are Email and Slack.

Email

To support sending email alerts from Grafana, you must update the smtp section of your grafana/grafana.ini file with your SMTP server configuration. Then, in the web interface, add the desired recipient email addresses in the grafana-default-email contact point settings.
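
As a rough guide (every value below is a placeholder, not a default), the [smtp] section might end up looking like the commented sketch here; restart Grafana afterwards so the change takes effect:

# Example [smtp] values to set in grafana/grafana.ini (placeholders only):
#   [smtp]
#   enabled = true
#   host = smtp.example.com:587
#   user = alerts@example.com
#   password = your-smtp-password
#   from_address = watchdog@example.com
docker compose restart grafana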

Slack

Slack alerts can be configured within the Grafana web GUI for the specific rules you would like to alert on.

Configuration Reference

Environment Variables

All the variables in the .env file:

Name | Default | Description
GRAFANA_ADMIN_USER | medic | Username for the Grafana admin user
GRAFANA_ADMIN_PASSWORD | password | Password for the Grafana admin user
GRAFANA_VERSION | latest | Version of the grafana/grafana-oss image
GRAFANA_PORT | 3000 | Port on the host where Grafana will be available
GRAFANA_BIND | 127.0.0.1 | Interface Grafana will bind to. Change to 0.0.0.0 if you want to expose to all interfaces.
GRAFANA_DATA | ./grafana/data | The host directory where Grafana data will be stored
GRAFANA_PLUGINS | grafana-discourse-datasource | Comma separated list of plugins to install (e.g. grafana-clock-panel,grafana-simple-json-datasource)
JSON_EXPORTER_VERSION | latest | Version of the prometheuscommunity/json-exporter image
PROMETHEUS_VERSION | latest | Version of the prom/prometheus image
PROMETHEUS_DATA | ./prometheus/data | The host directory where Prometheus data will be stored
PROMETHEUS_RETENTION_TIME | 60d | Length of time that Prometheus will store data (e.g. 15d, 6m, 1y)

CHT Metrics

All CHT metrics in Prometheus:

OpenMetrics name | Type | Label(s) | Description
cht_api_* | N/A | | API server metrics (see prometheus-api-metrics). Requires CHT Core 4.3.0 or later. Includes stats like server response time in seconds and response size in bytes.
cht_conflict_count | Gauge | | Number of doc conflicts which need to be resolved manually.
cht_connected_users_count | Gauge | | Number of users that have connected to the api recently. By default the time interval is 7 days. Otherwise it is equal to the connected_user_interval parameter value used when making the /monitoring request.
cht_couchdb_doc_del_total | Counter | medic, sentinel, medic-users-meta, _users | The number of deleted docs in the db.
cht_couchdb_doc_total | Counter | medic, sentinel, medic-users-meta, _users | The number of docs in the db.
cht_couchdb_fragmentation | Gauge | medic, sentinel, medic-users-meta, _users | The fragmentation of the db; lower is better, "1" is no fragmentation.
cht_couchdb_update_sequence | Counter | medic, sentinel, medic-users-meta, _users | The number of changes in the db.
cht_date_current_millis | Counter | | The current server date in millis since the epoch, useful for ensuring the server time is correct.
cht_date_uptime_seconds | Counter | | How long API has been running.
cht_feedback_total | Counter | | Number of feedback docs created, usually indicative of client side errors.
cht_messaging_outgoing_last_hundred | Gauge | group, status | Counts of last 100 messages that have received status updates.
cht_messaging_outgoing_total | Counter | status | Counts of the total number of messages.
cht_outbound_push_backlog_count | Gauge | | Number of changes yet to be processed by Outbound Push.
cht_replication_limit_count | Gauge | | Number of users that exceeded the replication limit of documents.
cht_sentinel_backlog_count | Gauge | | Number of changes yet to be processed by Sentinel.
cht_version | N/A | app, node, couchdb | Version information for the CHT instance (recorded in labels)
couch2pg_progress_sequence | Counter | medic, medic-logs, medic-sentinel, medic-users-meta, _users | The number of db changes that have been processed by couch2pg. Requires couch2pg metrics be enabled.
