This repo provides a framework that developers can use to specify system SLOs without requiring in-depth Prometheus knowledge.
Run the following command outside of a `go.mod` repository to install the `slo-builder` binary:

```shell
go install github.com/gocardless/slo-builder/cmd/slo-builder@latest
```
You also need to ensure your `go/bin` directory is available in your `$PATH`:

```shell
export PATH="$HOME/go/bin:$PATH"
```
See `example-definitions.yaml` and `example-rules.yaml` for full examples of all available SLO templates.
SLOs are often formulated in business terms first, then translated into monitoring system rules. Good SLOs should be formed as a ratio of good events to total events, and come with an associated error budget: the margin of error you'd expect to consume in normal operation.
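For instance, an error ratio for an HTTP service is conventionally expressed as failed requests over total requests. A generic PromQL sketch (the metric name is illustrative, not something this framework requires):

```
# Fraction of requests failing with 5xx over the last 5 minutes.
sum(rate(http_requests_total{code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
```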
By forcing a homogeneous format on every SLO, it becomes possible to apply generically useful rules to all different types of SLO. This matters all the more because the implementation of such rules is tricky, and the learning required to produce them is substantial.
You start with a system, often with a number of SLIs. You then:
1. Formulate SLOs in business terms
2. Implement the SLOs in the monitoring system (Prometheus)
3. Write multi-window alerts for burning error budgets
System components will need different categories of SLO: an SLO for HTTP requests will be structured differently from one for a batch processing system, for example. This framework offers a collection of predefined templates that map to different types of system, and can help someone unfamiliar with SLOs quickly produce rules that Just Work.
When formulating SLOs in business terms (1), you can use these predefined templates to help inform your selections. This framework can then produce the rules (2) that generate a common input to multi-windowed alerts (3), which are included at the end of this SLO pipeline.
Every SLO has some common behaviour, as represented by the `BaseSLO` type. An example core SLO definition would be:
```go
BaseSLO{
	Name:        "MarkPaymentsAsPaidMeetsDeadline",
	ErrorBudget: 0.1,
}
```
In rule form, we produce a `job:slo_definition:none` rule which tracks the parameters of the base SLO. This writes a time series that can be inspected to see how the SLO definition changed over time.
We also produce a `job:slo_error_budget:ratio` rule which will be used at the end of the SLO pipeline to apply alerting rules. Each of these rules has a `name` label that is assumed to be unique to each SLO, allowing Prometheus to join series on the `name` label.
```
job:slo_definition:none{name="MarkPaymentsAsPaidMeetsDeadline",error_budget="0.1"} 1.0
job:slo_error_budget:ratio{name="MarkPaymentsAsPaidMeetsDeadline"} 0.1
```
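Because every rule carries the SLO's `name`, later stages can join the budget against other series. As a sketch (using the `job:slo_error:ratio<I>` series introduced further down, not a rule the framework necessarily emits), the hourly budget burn rate could be queried as:

```
# Multiples of the error budget burned over the last hour, per SLO,
# joined on the shared `name` label.
job:slo_error:ratio1h
  / on (name) group_left ()
job:slo_error_budget:ratio
```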
As an example, take a batch process that transitions many payments into a paid state, to which we want to apply an SLO.
```go
var MarkPaymentsAsPaidMeetsDeadline = BatchProcessingSLO{
	BaseSLO:  BaseSLO{"MarkPaymentsAsPaidMeetsDeadline", 0.1},
	Deadline: time.Duration(2) * time.Hour,
	Volume: `
	1.5 * max_over_time(
		(
			sum by (namespace, release) (
				increase(paysvc_mark_payments_as_paid_marked_as_paid_total[8h])
			)
		)[60d:1h]
	)`,
	Throughput: `
	sum by (namespace, release) (
		rate(paysvc_mark_payments_as_paid_marked_as_paid_total[1m])
	) > 0`,
}
```
Users provide a time by which an entire batch must complete, along with an estimate of maximum volume and a current measurement of throughput. This is enough information to infer a target throughput (how fast items need processing to hit the deadline), which can be used to score each minute of the job's activity.
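In other words, the target throughput is the expected volume spread over the deadline: for a two-hour deadline, volume divided by 7,200 seconds. A recording rule of roughly this shape (a sketch; the framework's actual expression may differ) reproduces the target value in the rules below:

```yaml
# Sketch: items/second needed to clear the expected volume within the 2h
# deadline. 2058877.53186143 / 7200 ≈ 285.955, matching the series below.
- record: job:slo_batch_throughput_target:max
  expr: job:slo_batch_volume:max / (2 * 60 * 60)
```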
In total, we produce three rules for this specific SLO:
```
job:slo_batch_volume:max{name="MarkPaymentsAsPaidMeetsDeadline",namespace="production",release="paysvc-live"} 2058877.53186143
job:slo_batch_throughput_target:max{name="MarkPaymentsAsPaidMeetsDeadline",namespace="production",release="paysvc-live"} 285.95521275853196
job:slo_batch_throughput:interval{name="MarkPaymentsAsPaidMeetsDeadline",namespace="production",release="paysvc-live"} 369.7091257289802
```
These rules are then consumed by a generic batch processing rule that translates them into `job:slo_error:ratio<I>` rules for each common alert window. The most important rule is the generation of an error 'score' for the batch process, which looks like:
```yaml
- record: job:slo_batch_error:interval
  expr: |
    1.0 - clamp_max(
      job:slo_batch_throughput:interval / job:slo_batch_throughput_target:max,
      1.0
    )
```
In this context, `job:slo_batch_error:interval` is the error score for each interval of the throughput given in the SLO specification. For MarkPaymentsAsPaid, with the target throughput of 285 payments/s calculated over a 1m window, you'd have the following scores:
| Throughput | `job:slo_batch_throughput_target:max` | `job:slo_batch_error:interval` |
|---|---|---|
| 0 | 285 | 100% |
| 100 | 285 | 65% |
| 285 | 285 | 0% |
| 300 | 285 | 0% |
This means you burn your error budget when the batch job performs below the target throughput, and the rate at which you burn it depends on how significantly you miss it. It's also important to note that minutes where the throughput greatly exceeds the target don't 'recoup' error budget: this is an implementation decision, and might be the wrong choice.
Every SLO template conforms to our definition of an SLO, which is something that has a name, an associated error budget, and a constantly refreshed error ratio. In Prometheus terms, that means your SLOs will eventually produce the following time series:
```
job:slo_error:ratio1m
job:slo_error:ratio5m
job:slo_error:ratio30m
job:slo_error:ratio1h
job:slo_error:ratio2h
job:slo_error:ratio6h
job:slo_error:ratio1d
job:slo_error:ratio7d
```
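These windowed ratios are plausibly just the per-interval error scores averaged over each window; for the batch template, a rule along these lines would produce them (a sketch, shown for the 1h window only; the framework's actual rules may differ):

```yaml
# Sketch: roll the per-interval error scores up into a windowed error ratio.
- record: job:slo_error:ratio1h
  expr: avg_over_time(job:slo_batch_error:interval[1h])
```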
As we get these series for every SLO, we can write generic alerting rules that work across any SLO. Building useful alerts on SLO measurements is more complex than it might seem, so leveraging generic alerts is a huge benefit for simplicity.
We use a combination of the SRE Workbook and SoundCloud's Alerting on SLOs like Pros to form multi-window error budget burn alerts. 'Multi-window' means alerts trigger only when error budget is being burned over both a short and a long interval: this reduces false positives and improves alert reset time, so alerts resolve as soon as the problem has been corrected rather than hours after.
Depending on the urgency of the detected error, we'll either page an on-call engineer or open a ticket to handle the error budget burn in business hours. The detection sensitivity windows are listed here:
| Alert | Long Window | Short Window | `for` Duration | Burn Rate Factor | Error Budget Consumed |
|---|---|---|---|---|---|
| Page | 1h | 5m | 2m | 14.4 | 2% |
| Page | 6h | 30m | 15m | 6 | 5% |
| Ticket | 1d | 2h | 1h | 3 | 10% |
| Ticket | 3d | 6h | 1h | 1 | 10% |
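As a sketch, the first row of that table corresponds to an alerting rule of roughly the following shape (the rule name and labels the framework actually emits may differ):

```yaml
- alert: SLOErrorBudgetBurnPage
  expr: |
    # Fire only when both the long (1h) and short (5m) windows are burning
    # error budget at 14.4x, per the first row of the table above.
        job:slo_error:ratio1h > on (name) group_left () (14.4 * job:slo_error_budget:ratio)
    and
        job:slo_error:ratio5m > on (name) group_left () (14.4 * job:slo_error_budget:ratio)
  for: 2m
  labels:
    severity: page
```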
Every SLO created with this framework is automatically subscribed to these alerts. Where they get routed (both who is paged, and where a ticket gets created) depends on the team assigned to the SLO.
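For example, if each alert carried labels identifying its team and urgency (the label names and receivers below are hypothetical, not ones slo-builder is known to set), Alertmanager routing might look like:

```yaml
# Sketch only: route page-severity SLO alerts to the owning team's pager,
# and ticket-severity alerts to their issue tracker.
route:
  receiver: default
  routes:
    - match: {team: payments, severity: page}
      receiver: payments-pagerduty
    - match: {team: payments, severity: ticket}
      receiver: payments-issue-tracker
receivers:
  - name: default
  - name: payments-pagerduty
  - name: payments-issue-tracker
```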