Alerts With Alertmanager
List Of Alerts
- health score < 0.95
- ingress performance
- vm heap usage ratio > 80%
- file descriptor > 75%
- job queue > 10 over x minutes
- job success ratio < 50%
- master executor count > 0
- good http request ratio < 90%
- offline nodes > 5 over 30 minutes
- healthcheck duration > 0.002
- plugin updates available > 10
- An alert that triggers if any of the health reports are failing
- An alert that triggers if the file descriptor usage on the master goes above 80% (`vm.file.descriptor.ratio` -> `vm_file_descriptor_ratio`)
- An alert that triggers if the JVM heap memory usage is over 80% for more than a minute (`vm.memory.heap.usage` -> `vm_memory_heap_usage`)
- An alert that triggers if the 5 minute average of HTTP/404 responses goes above 10 per minute for more than five minutes (`http.responseCodes.badRequest` -> `http_responseCodes_badRequest`)
Alert Manager Configuration
We can configure Alertmanager via the Prometheus Helm chart. All the configuration elements below are part of the `prom-values.yaml` we used when installing Prometheus via Helm.
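As a quick orientation, this is roughly how the two parts we touch fit together in `prom-values.yaml` - a minimal sketch based on the Prometheus Helm chart's value layout, with placeholder content only:
serverFiles:
  alerts:
    groups: []            # the Prometheus alerting rules, filled in throughout this guide
alertmanagerFiles:
  alertmanager.yml:
    route: {}             # where alerts get sent
    receivers: []         # e.g., a Slack webhook receiver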
Get Slack Endpoint
There are many ways to get the alerts out; for all the options, you can read the Prometheus documentation. In this guide, I've chosen to use Slack, as I personally find it convenient. Slack has a guide on creating webhooks; once you've created an app, you can retrieve an endpoint which you can use directly in the Alertmanager configuration.
Alerts Configuration
We configure the alerts within Prometheus itself via a ConfigMap. We configure the body of the alert configuration file via `serverFiles.alerts.groups` and `serverFiles.rules`.
We can have a list of rules and a list of groups of rules. For more information on how you can configure these rules, consult the Prometheus documentation.
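To show where these keys sit, here is a minimal sketch of the nesting, assuming the layout of the Prometheus Helm chart's values file; the group and rule below are placeholders, not the alerts used later in this guide:
serverFiles:
  rules: {}                            # a second rule file; takes the same group structure
  alerts:
    groups:
      - name: example                  # a named group of alerting rules
        rules:
          - alert: ExampleAlwaysFiring # placeholder alert for illustration
            expr: vector(1) > 0        # any PromQL expression evaluating to true/false
            for: 1m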
Alert Example
Below is an example of an alert. It has the following fields:
- `alert`: the name of the alert
- `expr`: the query that should evaluate to `true` or `false`
- `for` (optional): how long the expression must evaluate to `true` before the alert fires
- `labels` (optional): key-value pairs that encode extra information on the alert; you can use these to select a different receiver (e.g., email vs. Slack, or different Slack channels)
- `annotations`: we're expected to fill in `summary` and `description` as shown below; they become the header and body of the alert
- alert: JenkinsTooManyJobsQueued
expr: sum(jenkins_queue_size_value) > 5
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} too many jobs queued"
description: "{{ $labels.app_kubernetes_io_instance }} has {{ $value }} jobs stuck in the queue"
Alertmanager Configuration
We use Alertmanager to determine what to do with alerts once they fire. We configure this in the same `prom-values.yaml` file, in this case under `alertmanagerFiles.alertmanager.yml`.
We can create different routes that match on labels or other values. For simplicity's sake - this guide is not about Alertmanager's capabilities - we stick to the most straightforward example, without any such matching or grouping (a sketch of a label-matching route follows the example below). For more information on configuring routes, please read the Prometheus configuration documentation.
alertmanagerFiles:
alertmanager.yml:
global: {}
route:
group_by: [alertname, app_kubernetes_io_instance]
receiver: default
receivers:
- name: default
slack_configs:
- api_url: '<REPLACE_WITH_YOUR_SLACK_API_ENDPOINT>'
username: 'Alertmanager'
channel: '#notify'
send_resolved: true
title: "{{ .CommonAnnotations.summary }} "
text: "{{ .CommonAnnotations.description }} {{ .CommonLabels.app_kubernetes_io_instance}} "
title_link: http://my-prometheus.com/alerts
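For illustration only, here is a sketch of what such a label-matching route could look like; the `severity: critical` matcher and the `critical` receiver are assumptions for the example, not part of the configuration used in this guide:
alertmanagerFiles:
  alertmanager.yml:
    route:
      group_by: [alertname, app_kubernetes_io_instance]
      receiver: default                 # fallback for anything that doesn't match below
      routes:
        - match:
            severity: critical          # only alerts carrying the label severity=critical
          receiver: critical
    receivers:
      - name: default
        slack_configs:
          - api_url: '<REPLACE_WITH_YOUR_SLACK_API_ENDPOINT>'
            channel: '#notify'
      - name: critical
        slack_configs:
          - api_url: '<REPLACE_WITH_YOUR_SLACK_API_ENDPOINT>'
            channel: '#alerts-critical'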
Group Alerts
You can group alerts that are similar, or that are the same alert with different trigger values (warning vs. critical).
serverFiles:
alerts:
groups:
- name: healthcheck
rules:
- alert: JenkinsHealthScoreTooLow
# alert info
- alert: JenkinsTooSlowHealthCheck
# alert info
- name: jobs
rules:
- alert: JenkinsTooManyJobsQueued
# alert info
- alert: JenkinsTooManyJobsStuckInQueue
# alert info
The Alerts
Behold, my awesome - eh, simple example - alerts. These are by no means the best alerts to create, and by no means alerts you should directly put into production. Please see them as examples to learn from!
Caution
One thing to note especially: the values for `expr` and `for` are generally set very low. This is intentional, so they are easy to copy, paste, and test. They should be relatively easy to trigger, so you can learn about the relationship between the situation in your master and the alert firing.
Too Many Jobs Queued
Fires when there are too many jobs queued in the Jenkins master: more than 10 jobs in the queue for at least 10 minutes.
- alert: JenkinsTooManyJobsQueued
expr: sum(jenkins_queue_size_value) > 10
for: 10m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} too many jobs queued"
description: "{{ $labels.app_kubernetes_io_instance }} has {{ $value }} jobs stuck in the queue"
Jobs Stuck In Queue
Sometimes jobs depend on other jobs, which means they're not just in the queue; they're stuck in the queue.
- alert: JenkinsTooManyJobsStuckInQueue
expr: sum(jenkins_queue_stuck_value) by (app_kubernetes_io_instance) > 5
for: 5m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} too many jobs stuck in the queue"
description: " {{ $labels.app_kubernetes_io_instance }} has {{ $value }} jobs in queue"
Jobs Waiting Too Long To Start
If jobs are generally waiting a long time to start - for a build agent to become available or otherwise - we want to know. This value is not very useful (although not completely useless) if you only have PodTemplates as build agents: in that case, it is the time between the job being scheduled and the Pod being scheduled in Kubernetes.
- alert: JenkinsWaitingTooMuchOnJobStart
expr: sum (jenkins_job_waiting_duration) by (app_kubernetes_io_instance) > 0.05
for: 1m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} waits too long for jobs"
description: "{{ $labels.app_kubernetes_io_instance }} is waiting on average {{ $value }} seconds to start a job"
health score < 1
By default, each Jenkins master has a health check consisting of four values. Some plugins will add an entry, such as the CloudBees ElasticSearch Reporter for CloudBees Core. The score ranges from 0 to 1 and will likely show `0.25`, `0.50`, `0.75`, or `1` as values.
- alert: JenkinsHealthScoreTooLow
expr: sum(jenkins_health_check_score) by (app_kubernetes_io_instance) < 1
for: 5m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} has a too low health score"
description: " {{ $labels.app_kubernetes_io_instance }} has a health score lower than 100%"
Ingress Too Slow
This alert looks at the ingress controller request duration. It fires if less than 95% of requests complete within 0.25 seconds, measured over a 5 minute window.
- alert: AppTooSlow
expr: sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="0.25"}[5m])) by (ingress) / sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress) < 0.95
for: 5m
labels:
severity: notify
annotations:
summary: "Application - {{ $labels.ingress }} - is too slow"
description: " {{ $labels.ingress }} - More than 5% of requests are slower than 0.25s"
HTTP Requests Too Slow
These are the HTTP requests handled by Jenkins' web server itself. We should hold this to much stricter standards than the ingress controller, which goes through many more layers.
- alert: JenkinsTooSlow
expr: sum(http_requests{quantile="0.99"} ) by (app_kubernetes_io_instance) > 1
for: 3m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} is too slow"
description: "{{ $labels.app_kubernetes_io_instance }} More than 1% of requests are slower than 1s (request time: {{ $value }})"
Too Many Plugin Updates
I always prefer having my instance up-to-date, don't you? So why not send an alert if there are more than X plugins waiting for an update.
- alert: JenkinsTooManyPluginsNeedUpdate
expr: sum(jenkins_plugins_withUpdate) by (app_kubernetes_io_instance) > 3
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} too many plugin updates"
description: " {{ $labels.app_kubernetes_io_instance }} has {{ $value }} plugins that require an update"
File Descriptor Ratio > 40%
According to CloudBees' documentation, the file descriptor ratio should not exceed 40%.
Warning
I don't truly know the correct level for this metric, so whether this should be `0.0040` or `0.40`, I'm not sure. Also, does this make sense in containers with remote storage? Before you put this in production, please re-evaluate it!
- alert: JenkinsTooManyOpenFiles
expr: sum(vm_file_descriptor_ratio) by (app_kubernetes_io_instance) > 0.040
for: 5m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} has too many open files"
description: " {{ $labels.app_kubernetes_io_instance }} instance has used {{ $value }} of available open files"
Job Success Ratio < 50%
Please, please do not use job success ratios to punish people. But if it is at all possible - which it almost certainly is - keep a respectable level of success. When practicing Continuous Integration, a broken build is a stop-the-world event: fix it before moving on.
A 100% success rate is what you should strive for. It is OK not to achieve it; still, you should stay as close to it as possible and not let broken builds rot.
- alert: JenkinsTooLowJobSuccessRate
expr: sum(jenkins_runs_success_total) by (app_kubernetes_io_instance) / sum(jenkins_runs_total_total) by (app_kubernetes_io_instance) < 0.5
for: 5m
labels:
severity: notify
annotations:
summary: "{{$labels.app_kubernetes_io_instance}} has a too low job success rate"
description: "{{$labels.app_kubernetes_io_instance}} instance has less than 50% of jobs being successful"
Offline nodes > 5 over 10 minutes
Having nodes offline for quite some time is usually a bad sign. A single offline node can be a static agent that can be enabled or reconnected at will, so it isn't bad on its own. Having multiple nodes offline for a long period likely points to an issue somewhere, though.
- alert: JenkinsTooManyOfflineNodes
expr: sum(jenkins_node_offline_value) by (app_kubernetes_io_instance) > 5
for: 10m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} has too many offline nodes"
description: "{{ $labels.app_kubernetes_io_instance }} has {{ $value }} nodes that have been offline for a while"
healthcheck duration > 0.002
The health check within Jenkins is Jenkins talking to itself, which means it is generally really fast. We should be very strict here: if Jenkins starts having trouble measuring its own health, it is a first sign of trouble.
- alert: JenkinsTooSlowHealthCheck
expr: sum(jenkins_health_check_duration{quantile="0.999"}) by (app_kubernetes_io_instance) > 0.001
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} responds too slow to health check"
description: " {{ $labels.app_kubernetes_io_instance }} is responding too slow to the regular health check"
GC Throughput Too Low
OK, here I am on thin ice. I'm not a JVM expert, so this is just for inspiration. I do not know what a reasonable value for triggering an alert here would be. I'd say: test it!
- alert: JenkinsTooLowGCThroughput
expr: 1 - sum(vm_gc_G1_Young_Generation_time) by (app_kubernetes_io_instance) / sum(vm_uptime_milliseconds) by (app_kubernetes_io_instance) < 0.99
for: 30m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} too low GC throughput"
description: "{{ $labels.app_kubernetes_io_instance }} has too low Garbage Collection throughput"
vm heap usage ratio > 70%
According to the CloudBees guide on tuning the JVM - which redirects to Oracle - the ratio of JVM heap memory usage should not exceed about 60%. So if we get over 70% for quite some time, expect trouble. As with any of these values, please do not take my word for it, and make sure you understand it yourself.
- alert: JenkinsVMMemoryRatioTooHigh
expr: sum(vm_memory_heap_usage) by (app_kubernetes_io_instance) > 0.70
for: 3m
labels:
severity: notify
annotations:
summary: "{{$labels.app_kubernetes_io_instance}} too high memory ratio"
description: "{{$labels.app_kubernetes_io_instance}} has a too high VM memory ratio"
Uptime Less Than Two Hours
I absolutely love servers that have excellent uptime. Running services in containers makes that a thing of the past - such a shame. Still, I'd like my applications - such as Jenkins - to be up for reasonable lengths of time.
In this case, we get notified about masters that have restarted - for example, when OOMKilled by Kubernetes. We also get an alert when a new master is created, which, if there's self-service involved, is a nice bonus.
- alert: JenkinsNewOrRestarted
expr: sum(vm_uptime_milliseconds) by (app_kubernetes_io_instance) / 3600000 < 2
for: 3m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} has low uptime"
description: " {{ $labels.app_kubernetes_io_instance }} has low uptime and was either restarted or is a new instance (uptime: {{ $value }} hours)"
Full Example
server:
ingress:
enabled: true
annotations:
ingress.kubernetes.io/ssl-redirect: "false"
nginx.ingress.kubernetes.io/ssl-redirect: "false"
resources:
limits:
cpu: 100m
memory: 1000Mi
requests:
cpu: 10m
memory: 500Mi
alertmanager:
ingress:
enabled: true
annotations:
ingress.kubernetes.io/ssl-redirect: "false"
nginx.ingress.kubernetes.io/ssl-redirect: "false"
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 5m
memory: 10Mi
kubeStateMetrics:
resources:
limits:
cpu: 10m
memory: 50Mi
requests:
cpu: 5m
memory: 25Mi
nodeExporter:
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 5m
memory: 10Mi
pushgateway:
resources:
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 5m
memory: 10Mi
serverFiles:
alerts:
groups:
- name: jobs
rules:
- alert: JenkinsTooManyJobsQueued
expr: sum(jenkins_queue_size_value) > 5
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} too many jobs queued"
description: "{{ $labels.app_kubernetes_io_instance }} has {{ $value }} jobs stuck in the queue"
- alert: JenkinsTooManyJobsStuckInQueue
expr: sum(jenkins_queue_stuck_value) by (app_kubernetes_io_instance) > 5
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} too many jobs stuck in the queue"
description: " {{ $labels.app_kubernetes_io_instance }} has {{ $value }} jobs in queue"
- alert: JenkinsWaitingTooMuchOnJobStart
expr: sum (jenkins_job_waiting_duration) by (app_kubernetes_io_instance) > 0.05
for: 1m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} waits too long for jobs"
description: "{{ $labels.app_kubernetes_io_instance }} is waiting on average {{ $value }} seconds to start a job"
- alert: JenkinsTooLowJobSuccessRate
expr: sum(jenkins_runs_success_total) by (app_kubernetes_io_instance) / sum(jenkins_runs_total_total) by (app_kubernetes_io_instance) < 0.60
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} has a too low job success rate"
description: " {{ $labels.app_kubernetes_io_instance }} instance has a job success ratio of {{ $value }}"
- name: uptime
rules:
- alert: JenkinsNewOrRestarted
expr: sum(vm_uptime_milliseconds) by (app_kubernetes_io_instance) / 3600000 < 2
for: 3m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} has low uptime"
description: " {{ $labels.app_kubernetes_io_instance }} has low uptime and was either restarted or is a new instance (uptime: {{ $value }} hours)"
- name: plugins
rules:
- alert: JenkinsTooManyPluginsNeedUpdate
expr: sum(jenkins_plugins_withUpdate) by (app_kubernetes_io_instance) > 3
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} too many plugin updates"
description: " {{ $labels.app_kubernetes_io_instance }} has {{ $value }} plugins that require an update"
- name: jvm
rules:
- alert: JenkinsTooManyOpenFiles
expr: sum(vm_file_descriptor_ratio) by (app_kubernetes_io_instance) > 0.040
for: 5m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} has too many open files"
description: " {{ $labels.app_kubernetes_io_instance }} instance has used {{ $value }} of available open files"
- alert: JenkinsVMMemoryRatioTooHigh
expr: sum(vm_memory_heap_usage) by (app_kubernetes_io_instance) > 0.70
for: 3m
labels:
severity: notify
annotations:
summary: "{{$labels.app_kubernetes_io_instance}} too high memory ratio"
description: "{{$labels.app_kubernetes_io_instance}} has a too high VM memory ratio"
- alert: JenkinsTooLowGCThroughput
expr: 1 - sum(vm_gc_G1_Young_Generation_time) by (app_kubernetes_io_instance) / sum(vm_uptime_milliseconds) by (app_kubernetes_io_instance) < 0.99
for: 30m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} too low GC throughput"
description: "{{ $labels.app_kubernetes_io_instance }} has too low Garbage Collection throughput"
- name: web
rules:
- alert: JenkinsTooSlow
expr: sum(http_requests{quantile="0.99"} ) by (app_kubernetes_io_instance) > 1
for: 3m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} is too slow"
description: "{{ $labels.app_kubernetes_io_instance }} More than 1% of requests are slower than 1s (request time: {{ $value }})"
- alert: AppTooSlow
expr: sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{le="0.25"}[5m])) by (ingress) / sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress) < 0.95
for: 5m
labels:
severity: notify
annotations:
summary: "Application - {{ $labels.ingress }} - is too slow"
description: " {{ $labels.ingress }} - More than 5% of requests are slower than 0.25s"
- name: healthcheck
rules:
- alert: JenkinsHealthScoreTooLow
expr: sum(jenkins_health_check_score) by (app_kubernetes_io_instance) < 1
for: 5m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} has a too low health score"
description: " {{ $labels.app_kubernetes_io_instance }} has a health score lower than 100%"
- alert: JenkinsTooSlowHealthCheck
expr: sum(jenkins_health_check_duration{quantile="0.999"}) by (app_kubernetes_io_instance) > 0.001
for: 1m
labels:
severity: notify
annotations:
summary: " {{ $labels.app_kubernetes_io_instance }} responds too slow to health check"
description: " {{ $labels.app_kubernetes_io_instance }} is responding too slow to the regular health check"
- name: nodes
rules:
- alert: JenkinsTooManyOfflineNodes
expr: sum(jenkins_node_offline_value) by (app_kubernetes_io_instance) > 3
for: 1m
labels:
severity: notify
annotations:
summary: "{{ $labels.app_kubernetes_io_instance }} has too many offline nodes"
description: "{{ $labels.app_kubernetes_io_instance }} has {{ $value }} nodes that have been offline for a while"
alertmanagerFiles:
alertmanager.yml:
global: {}
route:
group_by: [alertname, app_kubernetes_io_instance]
receiver: default
receivers:
- name: default
slack_configs:
- api_url: '<REPLACE_WITH_YOUR_SLACK_API_URL>'
username: 'Alertmanager'
channel: '#notify'
send_resolved: true
title: "{{ .CommonAnnotations.summary }} "
text: "{{ .CommonAnnotations.description }} {{ .CommonLabels.app_kubernetes_io_instance}} "
title_link: http://my-prometheus.com/alerts