This project shows how to monitor virtual machines and web applications using tools like Prometheus and Grafana, and how to use Alertmanager to send e-mails alerting team members about the status of the application.
In the context of DevOps monitoring, the terms "Black box exporter," "Node exporter," and "Alertmanager" refer to components within the Prometheus ecosystem, a popular open-source monitoring and alerting toolkit. Each of these components plays a distinct role in gathering, processing, and managing metrics and alerts. Here’s a detailed explanation and summary of each:
The Blackbox Exporter is used to probe endpoints (websites, APIs, etc.) and services from the outside, treating them as "black boxes."
- It performs various types of probes (HTTP, HTTPS, DNS, TCP, ICMP) to check the availability and performance of services.
- Configurable probes allow for detailed checks and validations, such as ensuring a web page returns the expected content.
- Results of these probes are exported as metrics, which Prometheus can scrape and analyze.
- Monitoring the uptime and performance of external services and endpoints.
- Ensuring service level agreements (SLAs) are met by checking the availability of critical web services.
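To see what the exporter returns, you can hit its /probe endpoint directly. A quick sketch, assuming the Blackbox Exporter is running locally on its default port 9115 with the standard http_2xx module:

curl "http://localhost:9115/probe?target=https://prometheus.io&module=http_2xx"

The response includes metrics such as probe_success (1 on success, 0 on failure) and probe_duration_seconds, which are exactly what Prometheus scrapes.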
The Node Exporter is designed to expose hardware and OS metrics from *nix systems (Linux, Unix).
- Collects a wide variety of system metrics such as CPU usage, memory usage, disk I/O, network statistics, and more.
- Metrics are gathered using system calls and various kernel interfaces, ensuring accuracy and relevance.
- Exposes these metrics over HTTP in a format that Prometheus can scrape.
- Monitoring the health and performance of individual servers.
- Collecting detailed system-level metrics to diagnose performance issues or resource bottlenecks.
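To preview what the Node Exporter exposes, you can fetch its metrics page directly; a sketch assuming an exporter already running on the default port 9100:

curl -s http://localhost:9100/metrics | grep node_memory_MemAvailable_bytes

Each line of output is a metric name plus its current value in the Prometheus text exposition format.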
Alertmanager handles alerts sent by client applications such as the Prometheus server.
- Receives, deduplicates, groups, and routes alerts to various notification channels (email, Slack, PagerDuty, etc.).
- Allows for complex alerting rules and logic, such as silencing alerts during maintenance windows or escalating alerts based on severity.
- Supports templating to customize alert messages and notifications.
- Centralized management of alerts generated by Prometheus.
- Ensuring that alerts reach the right teams or individuals, minimizing alert fatigue and ensuring critical issues are addressed promptly.
- Customizing alert notifications to include relevant information, improving incident response times.
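For example, silencing alerts during a maintenance window can be done with amtool, the CLI that ships with Alertmanager. A sketch, assuming Alertmanager is reachable on localhost:9093 and the InstanceDown alert defined later in this guide:

amtool silence add alertname=InstanceDown --comment="planned maintenance" --duration=2h --alertmanager.url=http://localhost:9093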
In summary, within the Prometheus ecosystem for DevOps monitoring:
- Blackbox Exporter probes external services to ensure they are up and performing as expected.
- Node Exporter collects and exposes metrics from the system hardware and OS, providing insight into the performance and health of servers.
- Alertmanager processes and manages alerts generated by Prometheus, ensuring they are routed to the appropriate channels and handled according to defined rules.
These components work together to provide comprehensive monitoring and alerting capabilities, helping DevOps teams maintain high availability, performance, and reliability of their systems and services.
Create two EC2 instances (monitoring and VM), both t2.medium with 20 GB of storage.
Install the following using wget and extract each archive:
- Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0-rc.0/prometheus-2.53.0-rc.0.linux-amd64.tar.gz
Extract using this command:
tar -xvf <filename>
Delete the downloaded archive and rename the extracted directory to something shorter.
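As a concrete sketch for Prometheus (the directory name matches the release downloaded above; adjust for your version):

tar -xvf prometheus-2.53.0-rc.0.linux-amd64.tar.gz
rm prometheus-2.53.0-rc.0.linux-amd64.tar.gz
mv prometheus-2.53.0-rc.0.linux-amd64 prometheus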
- Blackbox Exporter
Repeat the download/extract/rename process for the Blackbox Exporter and Alertmanager.
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
- Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
Clone the web app:
git clone https://github.com/UzonduEgbombah/BoardGame.git
Install the Node Exporter:
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
Repeat the same process to extract, delete the archive, and rename the directory.
cd into the node_exporter directory and run:
./node_exporter &
Copy the VM's IP address, append port 9100, and open it in a browser to confirm the metrics page loads.
To ensure the BoardGame app runs, install the following (an install sketch appears just below these steps):
- Java
- Maven
Then build the project:
mvn package
cd into target and start the app:
java -jar database_service_project-0.0.2.jar
Now you can copy the VM IP with port 8080 to access the BoardGame web app.
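A minimal sketch of the Java and Maven install, assuming an Ubuntu-based instance and OpenJDK 17 (any JDK version the project supports will do):

sudo apt update
sudo apt install -y openjdk-17-jdk maven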
On your monitoring server, cd into the prometheus directory and start it with the command below:
./prometheus &
Access it with the server's IP and port 9090.
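Besides the web UI, you can confirm Prometheus is up and already scraping targets through its HTTP query API; a quick sketch using the built-in up metric:

curl 'http://localhost:9090/api/v1/query?query=up'

A healthy setup returns JSON listing each target with a value of 1.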
Now let's set up the alert rules.
Create a new file in the prometheus directory:
vi alert_rules.yml
Paste the following rules, then save and exit with :wq
groups:
- name: alert_rules                # Name of the alert rules group
  rules:
  - alert: InstanceDown
    expr: up == 0                  # Expression to detect instance down
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

  - alert: WebsiteDown
    expr: probe_success == 0       # Expression to detect website down (Blackbox probe failed)
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "The website at {{ $labels.instance }} is down."
      summary: "Website down"

  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 25   # Expression to detect low memory
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host out of memory (instance {{ $labels.instance }})"
      description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 50   # Expression to detect low disk space
    for: 1s
    labels:
      severity: warning
    annotations:
      summary: "Host out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 50% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

  - alert: HostHighCpuLoad
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node_exporter",mode="idle"}[5m])) * 100) > 80   # CPU usage = 100 minus idle percentage
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Host high CPU load (instance {{ $labels.instance }})"
      description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

  - alert: ServiceUnavailable
    expr: up{job="node_exporter"} == 0   # Expression to detect service unavailability
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Service Unavailable (instance {{ $labels.instance }})"
      description: "The service {{ $labels.job }} is not available\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

  - alert: HighMemoryUsage
    expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 > 90   # Expression to detect high memory usage
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High Memory Usage (instance {{ $labels.instance }})"
      description: "Memory usage is > 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"

  - alert: FileSystemFull
    expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10   # Expression to detect file system almost full
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "File System Almost Full (instance {{ $labels.instance }})"
      description: "File system has < 10% free space\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Now open prometheus.yml and uncomment the "alert_rules.yml" entry under rule_files (the full file is shown further below).
cd into the alertmanager directory, start it, and access it on port 9093:
./alertmanager &
For the rules added in alert_rules.yml to take effect, Prometheus must be restarted. Find the process with pgrep, kill it, and start it again:
pgrep prometheus
kill <pid>
./prometheus &
The rules now show up in Prometheus; open the Alerts tab to see them.
- The & at the end of each command ensures the process runs in the background.
- Ensure that you have configured the prometheus.yml and alertmanager.yml configuration files correctly before starting the services.
- Adjust the firewall and security settings to allow the necessary ports (typically 9090 for Prometheus, 9093 for Alertmanager, 9115 for Blackbox Exporter, and 9100 for Node Exporter) to be accessible.
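On EC2 the firewall is the instance's security group. A sketch using the AWS CLI, where sg-0123456789abcdef0 is a placeholder for your own group ID (and 0.0.0.0/0 should ideally be narrowed to your own IP):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 9090 --cidr 0.0.0.0/0

Repeat for ports 9093, 9100, 9115, and 8080, or open them in the EC2 console instead.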
Here is the prometheus.yml used on the monitoring server:

global:
  scrape_interval: 15s      # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # Alertmanager endpoint

rule_files:
  - "alert_rules.yml"    # Path to alert rules file
  # - "second_rules.yml" # Additional rule files can be added here

scrape_configs:
  - job_name: "prometheus"        # Job name for Prometheus
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]   # Target to scrape (Prometheus itself)

  - job_name: "node_exporter"     # Job name for node exporter
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["3.110.195.114:9100"]   # Target node exporter endpoint

  - job_name: "blackbox"          # Job name for blackbox exporter
    metrics_path: /probe          # Path for blackbox probe
    params:
      module: [http_2xx]          # Module to look for HTTP 200 response
    static_configs:
      - targets:
          - http://prometheus.io        # HTTP target
          - https://prometheus.io       # HTTPS target
          - http://3.110.195.114:8080/  # HTTP target with port 8080 (the BoardGame app)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 13.235.248.225:9115   # Blackbox exporter address
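This file can be validated the same way as the rules before a restart:

./promtool check config prometheus.yml

This checks the config itself plus every rule file it references.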
And the alertmanager.yml, which routes every alert to the e-mail receiver:

route:
  group_by:
    - alertname
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: email-notifications

receivers:
  - name: email-notifications
    email_configs:
      - to: uzonduegbombah419@gmail.com
        from: test@gmail.com
        smarthost: smtp.gmail.com:587
        auth_username: uzonduegbombah419@gmail.com
        auth_identity: uzonduegbombah419@gmail.com
        auth_password: "<your-gmail-app-password>"   # replace with your own Gmail app password
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal:
      - alertname
      - dev
      - instance
- Remember to generate and set up your own Gmail app password for auth_password; the placeholder above will not authenticate.
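amtool, which ships alongside the alertmanager binary, can validate this file before you start the service:

./amtool check-config alertmanager.yml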
Now restart Prometheus and Alertmanager; your result should look similar.
To make sure this setup works, I shut down the running BoardGame jar and confirmed I also got an e-mail.
Reflected successfully.
The e-mail came in after about 1 minute, matching the for: 1m window on the alert rule.
Okay, one more test: kill the Node Exporter and see what happens:
pgrep node_exporter
kill <pid>
Reflected successfully.
Now let's wait for the e-mail.
Uzondu Egbombah