
Prometheus with High Availability

This example shows how to configure Grafana, Business Studio, and Prometheus to work together on alerts.

Prometheus is a free, open-source system for monitoring and observability. A high-availability (HA) use case is a good way to show what it can do.

System setup

Before diving into the step-by-step configuration, let's review the system that we will monitor and create alerts for. It consists of two environments: Alpha and Charlie.

Alpha cluster

In the Alpha environment, there are two mirrored business engines: engine1.alpha and engine2.alpha. They are connected to the Grafana cluster, which has two mirrored Grafana instances, Grafana1 and Grafana2.

Alpha environment.

PostgreSQL stores the business engine and Grafana configuration. It can be configured as master-slave or HA. Prometheus collects metrics from the business engines and Grafana.

Charlie cluster

In the Charlie environment, there are three business engines: engine1.charlie, engine2.charlie, and engine3.charlie. They serve one Grafana cloud instance.

Charlie environment.

PostgreSQL stores the business engine configuration. Prometheus collects performance metrics from all three business engines.

Business Engine metrics endpoints

Every business engine exposes two metrics endpoints for collecting performance data:

  • Port 3001 for API server
  • Port 3002 for Scheduler
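
With five business engines and two endpoints each, there are 10 addresses to scrape. The following is a minimal sketch that enumerates them, assuming the engine names from the system setup are resolvable hostnames. The service labels `api-server` and `scheduler` and the function name `scrape_targets` are illustrative and do not come from the product.

```python
# Business engine names as listed in the system setup (assumed to be hostnames).
ENGINES = {
    "alpha": ["engine1.alpha", "engine2.alpha"],
    "charlie": ["engine1.charlie", "engine2.charlie", "engine3.charlie"],
}

# Ports from the list above: 3001 = API server, 3002 = Scheduler.
# The label strings themselves are illustrative.
SERVICE_PORTS = {"api-server": 3001, "scheduler": 3002}


def scrape_targets(engines=ENGINES, ports=SERVICE_PORTS):
    """Return one (environment, service, "host:port") tuple per metrics endpoint."""
    targets = []
    for environment, hosts in engines.items():
        for host in hosts:
            for service, port in ports.items():
                targets.append((environment, service, f"{host}:{port}"))
    return targets


if __name__ == "__main__":
    for environment, service, address in scrape_targets():
        print(environment, service, address)
```

Each `host:port` string is the kind of value that would typically go into a Prometheus scrape target list and later show up as the `instance` label.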

Use case

We want to monitor the CPU usage of all five business engines separately for each service, which gives 10 instances to monitor. When any of them exceeds 2%, we want to:

  • Create a Grafana annotation.
  • Write logs with the alert payload.
  • Create a file on the designated JSON server with all the details of the event.
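
The rule itself is a per-instance comparison against a fixed threshold. The sketch below illustrates that logic only. The payload fields, the function names, and the sample readings are hypothetical and are not the format that Business Studio or Grafana actually produces.

```python
import json

CPU_THRESHOLD_PERCENT = 2.0  # threshold from the use case above


def exceeding_instances(cpu_by_instance, threshold=CPU_THRESHOLD_PERCENT):
    """Return the instances whose CPU usage is strictly above the threshold."""
    return sorted(
        instance
        for instance, cpu_percent in cpu_by_instance.items()
        if cpu_percent > threshold
    )


def alert_payload(instance, cpu_percent, threshold=CPU_THRESHOLD_PERCENT):
    """Build a hypothetical alert payload that names the offending instance."""
    return {
        "instance": instance,
        "cpu_percent": cpu_percent,
        "threshold_percent": threshold,
    }


# Illustrative readings only, not measured data.
sample = {"engine1.alpha:3001": 2.7, "engine1.alpha:3002": 0.4}
for instance in exceeding_instances(sample):
    print(json.dumps(alert_payload(instance, sample[instance])))
```

Keeping the instance name in the payload is what lets the log record and the JSON file say which business engine and service caused the alert.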

Grafana

In Grafana, we want a time series visualization in which CPU usage above 2% is immediately visible.

Time series visualization in Grafana that shows CPU usage separately for each of the 10 instances.

Dashboard variables

It is important to note that our use case requires firing an alert only for a particular business engine/service. In all alert messages (in the log records and JSON file notes), we want to know which specific business engine/service causes the problem.

To make this possible, every observed metric must carry this detail: the business engine and service it belongs to. To fire alerts separately for each instance, we use Grafana dashboard variables.

Dashboard variable configuration

Grafana dashboard variable configuration.

Create a dashboard variable:

  1. Type: Query.

  2. Name: instance.

  3. Data source: prometheus, with the following settings:

    • Query: Label values
    • Label: instance
    • Metric: nodejs_version_info
  4. Check that the Preview of values displays all 10 services to be monitored.
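
To double-check the preview outside Grafana, the same label values can be requested from the Prometheus HTTP API, which exposes `/api/v1/label/<label_name>/values` with an optional `match[]` series selector. The sketch below only builds the request URL. The base URL `http://prometheus.example:9090` and the function name `label_values_url` are assumptions for illustration.

```python
from urllib.parse import urlencode


def label_values_url(base_url, label, metric):
    """Build the Prometheus HTTP API URL that lists the values of a label,
    restricted to series of the given metric."""
    query = urlencode({"match[]": metric})
    return f"{base_url.rstrip('/')}/api/v1/label/{label}/values?{query}"


# Hypothetical Prometheus address; replace it with the real one.
print(label_values_url("http://prometheus.example:9090", "instance", "nodejs_version_info"))
```

If the variable is configured as described, the `data` array in the response should contain the same 10 `instance` values that appear in the Grafana preview.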

Time Series panel configuration

Grafana time series panel configuration.
  1. Select the configured Prometheus data source.
  2. Specify a query to extract the user CPU usage.
  3. Specify a query to extract the system CPU usage.
  4. Use the dashboard variable to select among all 10 services to monitor.
  5. Select the Time series visualization.
  6. Set up the threshold so that values above 2 are out of the allowable range.
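
The exact queries are not spelled out here, so the following is only a sketch of what the two CPU queries might look like. It assumes that the business engines expose the default Node.js process metrics `process_cpu_user_seconds_total` and `process_cpu_system_seconds_total` (an assumption based on the `nodejs_version_info` metric used for the variable) and that the dashboard variable is named `instance`. The helper name `cpu_percent_query` is hypothetical.

```python
def cpu_percent_query(metric, label="instance", variable="instance",
                      window="$__rate_interval"):
    """Build a PromQL expression for the CPU usage in percent, filtered by a
    Grafana dashboard variable. The metric name is an assumption."""
    return f'rate({metric}{{{label}=~"${variable}"}}[{window}]) * 100'


# Assumed metric names; verify them against the actual /metrics output.
print(cpu_percent_query("process_cpu_user_seconds_total"))
print(cpu_percent_query("process_cpu_system_seconds_total"))
```

Multiplying the per-second rate of CPU seconds by 100 yields a percentage of one core, which matches a threshold expressed as 2.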

Business Studio

Business Studio can be connected to one or all business engines within a cluster.

Business engines for Alpha and Charlie clusters configured in Business Studio.

Alert history

The section is coming soon...

Alert rule configuration

The section is coming soon...

Data preview

The section is coming soon...