theme | highlightTheme |
---|---|
night | monokai |
a non-exhaustive opinionated guide
Building predictable systems that you can
reason about
For example:
- Did the latest release impact performance?
- Is there a correlation between system load and latency?
- Why is kafka partition 2 so hot?
--
- Metrics — coarse-grain
- Logs — complete freedom!
- Traces — fine-grain
--
"Pillars" are marketing
Emit data to a product that can query it.
The product often limits the precision you can include.
Metrics are low-fidelity aggregates: they tell you of a failure but not why.
Tracing is logging with opinions + tooling
Multiple sources of truth suffer weak correlation
--
Canonical Logs: Wide and structured
- Uncover unknown unknowns
- Useful to everyone
- https://www.honeycomb.io/blog/why-honeycomb-black-swans-unknown-unknowns-and-the-glorious-future-of-doom
- https://www.thoughtworks.com/en-au/radar/techniques/observability-2-0
- https://thenewstack.io/modern-observability-is-a-single-braid-of-data/
- https://baselime.io/blog/canonical-log-lines
--
Focus on good logs
- context
- correlation
- level
Cannot predict future questions
Add context ✅ not data
Do not add whole request/response payloads; these contain data which your observability tooling is not sanctioned to store.
--
{
"time": "2021-07-25T04:12:50Z",
"application": "authorizer@3.0.1",
"msg": "authorized",
"user_id": "123",
"groups": ["a", "b"],
"cache_used": "1627186370",
"request_id": "a39b28c9",
"correlation_id": "d4289bd7"
}
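A minimal sketch of how a wide, structured event like this might be emitted from Go. It assumes the standard library `log/slog` JSON handler; the helper names `emitAuthorized` and `renderAuthorized` are hypothetical, and the field names are modelled on the example on this slide.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// emitAuthorized writes one wide event describing an authorization decision.
// The function name and signature are illustrative assumptions.
func emitAuthorized(logger *slog.Logger, userID string, groups []string, requestID, correlationID string) {
	logger.Info("authorized",
		slog.String("application", "authorizer@3.0.1"),
		slog.String("user_id", userID),
		slog.Any("groups", groups),
		slog.String("request_id", requestID),
		slog.String("correlation_id", correlationID),
	)
}

// renderAuthorized logs into a buffer and decodes the line again,
// so the resulting fields can be inspected.
func renderAuthorized() (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	emitAuthorized(logger, "123", []string{"a", "b"}, "a39b28c9", "d4289bd7")

	var fields map[string]any
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := renderAuthorized()
	if err != nil {
		panic(err)
	}
	fmt.Println(fields["msg"], fields["user_id"], fields["correlation_id"])
}
```

Every value is its own field, so the log store can filter or group on any of them without parsing the message.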
--
--
{
"msg": "Task finished: ThingProcessor: duration=3.014"
}
- hard to parse
- slow to filter (using a `like` operation)
- ambiguous unit
--
{
"msg": "Task finished",
"processor": "ThingProcessor",
"duration_ms": 3.014
}
--
--
Q: Which service should log?
graph TD;
queue --> validator;
validator --> sender;
--
graph TD;
controller <--> queue;
controller <--> validator;
controller <--> sender;
The controller handles flow, errors, and logging.
--
func (s *Service) Process() (err error) {
	// prepare log context
	start := time.Now()
	log := StandardLogger.WithField("service", "controller")
	// emit exactly one log line on exit, whether or not an error occurred
	defer func() {
		if err != nil {
			log = log.WithError(err)
		}
		duration := time.Since(start)
		log.WithField("duration_ms", duration.Milliseconds()).
			Info("done")
	}()
	// get next work
	var work Work
	work, err = queue.Pop()
	if err != nil {
		return fmt.Errorf("failed to get work: %w", err)
	}
	log = log.WithField("work_id", work.ID)
	// validate
	if err = s.validator.IsValid(work.Body); err != nil {
		return fmt.Errorf("invalid work: %w", err)
	}
	// send
	if err = s.sender.Send(work.Body); err != nil {
		return fmt.Errorf("failed to send: %w", err)
	}
	// commit work
	work.Delete()
	return nil
}
a cross-component concern: find consensus
--
Examples
- event context: correlation_id, request_id
- business context: user_id, asset_id
- application context: application, version, environment
--
reaching consensus through tooling
package appcontext

type SystemContext struct {
	Application string `json:"application,omitempty"`
	Version     string `json:"version,omitempty"`
	Environment string `json:"environment,omitempty"`
}

func WithSystemContext(ctx context.Context, val SystemContext) context.Context {
	return context.WithValue(ctx, key, val)
}

func GetSystemContext(ctx context.Context) (val SystemContext, ok bool) {
	val, ok = ctx.Value(key).(SystemContext)
	return
}
...
Broadly categorize an event
Reach consensus on meaning
--
--
fatal: The system cannot continue
FATAL: failed to connect to database
--
error: Failed to do the job
ERROR: timeout while saving
--
warning: Processing degraded but can continue
WARN: config unset; using default
--
info: System did what you asked it to do
INFO: user created
INFO: batch complete
--
debug: Low-level supporting steps.
Usually disabled due to poor signal-to-noise ratio.
Danger zone: Take care with sensitive data!
--
--
ERROR: client is not authorized
This belongs in the response to the client:
401 Unauthorized
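A hedged sketch of the alternative: the client receives the 401, and the service does not report an error of its own. The `requireToken` middleware, the `demo` helper, and the decision to still record an info-level line are assumptions for illustration; it uses `net/http` and `log/slog` from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
)

// requireToken (hypothetical) rejects requests without an Authorization header.
func requireToken(logger *slog.Logger, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") == "" {
			// The failure is the client's, so it goes in the response.
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			// The service behaved correctly, so this is info, not error.
			logger.Info("request rejected",
				"status", http.StatusUnauthorized,
				"path", r.URL.Path,
			)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// demo sends an unauthenticated request and reports the response status
// and whether the log line was written at info level.
func demo() (status int, loggedAtInfo bool) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewTextHandler(&buf, nil))
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/things", nil)
	requireToken(logger, ok).ServeHTTP(rec, req)

	return rec.Code, bytes.Contains(buf.Bytes(), []byte("level=INFO"))
}

func main() {
	status, loggedAtInfo := demo()
	fmt.Println(status, loggedAtInfo)
}
```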
--
Uninteresting plumbing
INFO: executed 'SELECT * FROM foo'
INFO: parsed JSON
a.k.a. I was prototyping and accidentally committed it
--
Predicting the future
INFO: about to handle request
Trust your error handling!
you'll get it wrong the first time; iterate