Prometheus with High Availability

The following example shows how to configure Grafana, Business Studio, and Prometheus to work together on alerts.

Prometheus is a free, open-source technology for system monitoring and observability. A high-availability (HA) use case is a good way to show what it can do.

System setup

Before diving into the step-by-step configuration, let's review the system we will monitor and create alerts for. It consists of two environments: Alpha and Charlie.

Alpha cluster

In the Alpha environment, there are two mirrored business engines: engine1.alpha and engine2.alpha. They are connected to a Grafana cluster with two mirrored Grafana instances, Grafana1 and Grafana2.

PostgreSQL stores the business engine and Grafana configuration (it can be configured as master-slave or HA). Prometheus collects metrics from the business engines and Grafana.

Charlie cluster

In the Charlie environment, there are three business engines: engine1.charlie, engine2.charlie, and engine3.charlie. They serve one Grafana cloud instance.

PostgreSQL stores the business engine configuration. Prometheus collects performance metrics from all three business engines.

Business Engine metrics endpoints

Every business engine provides two metrics endpoints to collect performance data:

Port 3001 for API server
Port 3002 for Scheduler
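As a minimal sketch, the five engines and two ports above expand into the ten endpoints that Prometheus needs to scrape. The engine names and ports come from this page; the dictionary layout, the service labels, and the assumption that the engine names are resolvable hostnames are illustrative only.

```python
# Engine names are taken from the system setup; using them as hostnames is an assumption.
ENGINES = {
    "alpha": ["engine1.alpha", "engine2.alpha"],
    "charlie": ["engine1.charlie", "engine2.charlie", "engine3.charlie"],
}

# Ports are taken from the list above; the service labels are illustrative.
SERVICE_PORTS = {"api-server": 3001, "scheduler": 3002}


def scrape_targets(engines, ports):
    """Return one (environment, service, host:port) tuple per metrics endpoint."""
    return [
        (env, service, f"{host}:{port}")
        for env, hosts in engines.items()
        for host in hosts
        for service, port in ports.items()
    ]


targets = scrape_targets(ENGINES, SERVICE_PORTS)
for env, service, target in targets:
    print(f"{env:8} {service:11} {target}")
```

Each printed `host:port` pair would typically become one entry in the Prometheus scrape configuration and later show up as the value of the `instance` label.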

Use case

We want to monitor the CPU usage of all five business engines separately for each provided service (10 instances in total). If any of them exceeds 2%, we want to:

Create a Grafana annotation.
Write logs with the alert payload.
Create a file on the designated JSON server with all the details of the event.
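The rule itself can be sketched in a few lines. This is only an illustration of the threshold logic and of what a per-instance payload might contain. It is not how Grafana or Business Studio evaluates alerts internally, and the payload field names and sample readings are hypothetical.

```python
import json
from datetime import datetime, timezone

CPU_THRESHOLD_PERCENT = 2.0  # from the use case: alert above 2% CPU


def find_breaches(cpu_by_instance, threshold=CPU_THRESHOLD_PERCENT):
    """Return only the instances whose CPU usage is strictly above the threshold."""
    return {
        instance: value
        for instance, value in cpu_by_instance.items()
        if value > threshold
    }


def build_payload(instance, value, observed_at):
    """Build a hypothetical alert payload; the field names are illustrative."""
    return {
        "instance": instance,
        "cpuPercent": value,
        "thresholdPercent": CPU_THRESHOLD_PERCENT,
        "observedAt": observed_at.isoformat(),
    }


# Made-up sample readings for three of the ten monitored instances.
samples = {
    "engine1.alpha:3001": 0.8,
    "engine1.alpha:3002": 2.4,
    "engine3.charlie:3001": 2.0,
}
breaches = find_breaches(samples)
observed_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
for instance, value in breaches.items():
    print(json.dumps(build_payload(instance, value, observed_at)))
```

Only the instance that crosses the threshold produces a payload, which mirrors the requirement that every alert names the specific business engine/service.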

Grafana

In Grafana, we want a time series visualization in which CPU usage above 2% is immediately visible.

Dashboard variables

It is important to note that our use case requires firing an alert only for a particular business engine/service. In all alert messages (in the log records and JSON file notes), we want to know which specific business engine/service causes the problem.

To make this possible, all observed metrics need to carry this detail: the business engine/service. To fire alerts separately for each instance, we use Grafana dashboard variables.

Dashboard variable configuration

Create a dashboard variable:

Type Query.
Name instance.
Data source prometheus:
- Query Label values,
- Label instance,
- Metric nodejs_version_info.
The Preview of values then displays all 10 services that need to be monitored.
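To check outside Grafana which values the variable will resolve to, the same lookup can be made against the Prometheus HTTP API, which exposes label values at `/api/v1/label/<label>/values` and accepts a `match[]` series selector. The sketch below only builds the request URL and parses a response body. The base URL and the sample response are placeholders, not values from this setup.

```python
import json
from urllib.parse import urlencode


def label_values_url(base_url, label, metric):
    """Build the Prometheus API URL that lists values of `label` on series of `metric`."""
    query = urlencode({"match[]": metric})
    return f"{base_url.rstrip('/')}/api/v1/label/{label}/values?{query}"


def parse_label_values(body):
    """Extract the list of label values from a Prometheus API response body."""
    response = json.loads(body)
    if response.get("status") != "success":
        raise ValueError(f"unexpected response status: {response.get('status')}")
    return response["data"]


# Placeholder Prometheus address; replace it with the real one.
url = label_values_url("http://prometheus.example:9090", "instance", "nodejs_version_info")
print(url)

# Made-up response body with two of the ten expected instances.
sample_body = '{"status": "success", "data": ["engine1.alpha:3001", "engine1.alpha:3002"]}'
print(parse_label_values(sample_body))
```

In the Grafana query editor, the equivalent variable query is usually written as `label_values(nodejs_version_info, instance)`.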

Time Series panel configuration

Select the configured Prometheus data source.
Specify a query to extract the user CPU usage.
Specify a query to extract the system CPU usage.
Check the dashboard variable that lists all 10 services to monitor.
Select the time series panel.
Set up the threshold where values above 2 are out of the allowable range.
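The page does not list the two queries, so the following is a sketch under assumptions. It assumes the engines expose the standard Node.js process metrics `process_cpu_user_seconds_total` and `process_cpu_system_seconds_total`, that usage is shown as a percentage, and that a one-minute rate window is acceptable. The `$instance` reference is the dashboard variable configured above.

```python
def cpu_percent_query(metric, label="instance", variable="instance", window="1m"):
    """Build a PromQL expression for CPU usage in percent, filtered by a dashboard variable."""
    return f'rate({metric}{{{label}=~"${variable}"}}[{window}]) * 100'


# Assumed metric names; verify them against the engines' metrics endpoints.
user_cpu_query = cpu_percent_query("process_cpu_user_seconds_total")
system_cpu_query = cpu_percent_query("process_cpu_system_seconds_total")

print(user_cpu_query)
print(system_cpu_query)
```

With the values expressed as percentages, a panel threshold of 2 corresponds directly to the 2% limit in the use case.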

Business Studio

Business Studio can be connected to one or all business engines within a cluster.

Alert history

The section is coming soon...

Alert rule configuration

The section is coming soon...

Data preview

The section is coming soon...
