Canaries to the Rescue: Catching Service Issues Before the End User

 2 years ago
source link: https://hackernoon.com/canaries-to-the-rescue-catching-service-issues-before-the-end-user-38em359h

@memattchungMatt Chung

Solo-entrepreneur helping companies scale their Dev/Ops.

You launched your service and are rapidly onboarding customers. You're moving fast, deploying one new feature after another. But with the uptick in releases, bugs are creeping in, and you find yourself having to troubleshoot, roll back, squash bugs, and then redeploy changes. You're moving fast but breaking things.

What can you do to detect issues quickly, before your customers report them? Use canaries.

In this post, you'll learn what canaries are, see example code, and pick up best practices and other considerations, including the maintenance and financial costs of running them.

What is a canary?

Source: grass-lifeisgood/Shutterstock

Back in the early 1900s, canaries were used by miners for detecting carbon monoxide and other dangerous gases.

Miners would bring their canaries down with them into the coal mine, and when a canary stopped chirping, it was time for everyone to evacuate immediately.

In the context of computing systems, canaries perform end-to-end testing, aiming to exercise the entire software stack of your application: they behave like your end users, emulating customer behavior. Canaries are just pieces of software that are always running and constantly monitoring the state of your system; they emit metrics into your monitoring system (more discussion on monitoring in a separate post), which then triggers an alarm when a defined threshold is breached.

What do canaries offer?

They answer the question: "Is my service running?" More sophisticated canaries can offer a deeper look into your service.

Instead of canaries just emitting a binary 1 or 0 — up or down — they can be designed such that they emit more meaningful metrics that measure latency from the client's perspective.
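
As a rough sketch of what "latency from the client's perspective" could look like in code, the helper below times an arbitrary probe and returns both the up/down value and the elapsed time. The name timed_probe and the idea of passing the probe in as a callable are assumptions made for illustration; they are not from any particular monitoring library.

```python
import time


def timed_probe(probe):
    """Run `probe` (a zero-argument callable) and return (success, latency_ms).

    `probe` is a hypothetical stand-in for whatever request your canary
    makes; any exception it raises is treated as a failed run.
    """
    start = time.perf_counter()
    try:
        probe()
        success = 1
    except Exception:
        success = 0
    # Measure wall-clock time as the client experienced it, success or not.
    latency_ms = (time.perf_counter() - start) * 1000.0
    return success, latency_ms
```

Both values can then be published as separate metrics, so an alarm can watch for slow responses as well as outright failures.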

First steps with building your canary

If you don't have any canaries monitoring your system, you don't necessarily have to start by rolling your own. Your first canary can require little to no code. One way to gain immediate visibility into your system is to use a synthetic monitoring service such as BetterUptime, Pingdom, or StatusCake. These services offer a web interface that lets you configure HTTP(S) endpoints for their canaries to poll periodically. When their systems detect an issue (e.g., a failing TCP connection or a bad HTTP response), they can send you email or text notifications.

Or, if your systems are deployed in Amazon Web Services, you can write Python or Node scripts that integrate with CloudWatch.

But if you are interested in developing your own custom canaries that do more than a simple probe, read on.

Where to begin

Remember, canaries should behave just like real customers. Your customer might be a real human being or another piece of software. Regardless of the type of customer, you'll want to start simple.

Similar to the managed services described above, your first canary should start with emitting a simple metric into your monitoring system, indicating whether the endpoint is up or down.

For example, if you have a web service, perform a vanilla HTTP GET. When successful, the canary will emit http_get_homepage_success=1 and, on failure, http_get_homepage_success=0.
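
A minimal sketch of such a probe, using only the Python standard library, might look like the following. The publish_metric function is a hypothetical stand-in for your monitoring client, and the opener parameter is an assumption added so the check can be exercised without a network.

```python
import urllib.error
import urllib.request


def publish_metric(name, value):
    # Hypothetical stand-in: replace with a call to your monitoring client.
    print(f"{name}={value}")


def http_get_succeeds(url, timeout=5, opener=urllib.request.urlopen):
    """Return 1 if a GET to `url` answers with a 2xx status, else 0."""
    try:
        with opener(url, timeout=timeout) as response:
            return 1 if 200 <= response.status < 300 else 0
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, and timeouts count as "down".
        # urlopen also raises HTTPError (a URLError) for 4xx/5xx responses.
        return 0


def run_once(url):
    # Emit the up/down metric for a single probe of the homepage.
    publish_metric("http_get_homepage_success", http_get_succeeds(url))
```

Calling run_once with your homepage URL on a schedule (a cron job, for instance) is enough for a first canary.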

Example canary - monitoring cache layer

Imagine you have a simple key/value store system that serves as a caching layer. To monitor this layer, every minute, our canary will: 1) perform a write 2) perform a read 3) validate the response.

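import logging
import time

# cache_put, cache_get, and publish_metric are assumed to be provided by
# your cache client and your monitoring library.

while True:
    successful_run = False

    try:
        # 1) Write a known key/value pair.
        put_response = cache_put('foo', 'bar')
        write_successful = put_response == 'OK'
        publish_metric('cache_engine_successful_write', int(write_successful))

        # 2) Read the key back and 3) validate the response.
        value = cache_get('foo')
        read_successful = value == 'bar'
        publish_metric('cache_engine_successful_read', int(read_successful))

        # The run counts as successful only if both checks passed.
        successful_run = write_successful and read_successful
    except Exception as error:
        logging.exception("Canary failed due to error: %s", error)
    finally:
        # Always report the overall result, then wait for the next cycle.
        publish_metric('cache_engine_canary_successful_run', int(successful_run))
        time.sleep(60)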

Cache engine failure during deployment

With this canary in place emitting metrics, we might then choose to integrate the canary with our code deployment pipeline. In one test, I triggered a code deployment (riddled with bugs); the canary detected the issue and triggered an automatic rollback.
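
As a rough illustration of the rollback decision itself (not the pipeline used for this post), a deployment system could look at the canary's most recent data points and roll back only when the last few are all failures. The function name, the list-of-0/1 input, and the default of three data points are assumptions chosen for the sketch.

```python
def should_roll_back(datapoints, required_failures=3):
    """Decide whether a deployment should be rolled back.

    `datapoints` is a list of canary results (1 = success, 0 = failure),
    oldest first. Returns True only when the most recent
    `required_failures` data points are all failures.
    """
    recent = datapoints[-required_failures:]
    if len(recent) < required_failures:
        return False  # not enough data yet to make a decision
    return all(value == 0 for value in recent)
```

Requiring several consecutive failures keeps one flaky run from reverting an otherwise healthy deployment.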

Best Practices

The code example above is deliberately simple. As you build out your own canaries, keep the following best practices in mind:

  • The canaries should NOT interfere with the real user experience.
    Although a good canary should test different behaviors/states of your system, they should in no way interfere with the real user experience. That is, their side effects should be self-contained.
  • They should always be on, always running, and should be testing at regular intervals. Ideally, the canary runs frequently (e.g., every 15 seconds, every 1 minute).
  • Alarms on your canary's metrics should fire only after more than one failing data point. If an alarm fires on a single data point, you increase the likelihood of false alarms, engaging your service teams unnecessarily.
  • Integrate the canary into your continuous integration/continuous deployment pipeline. Essentially, the deployment system should monitor the metrics that the canary emits, and if an error is detected for more than N minutes, the deployment should automatically roll back (more on the safety of automated rollbacks in a separate post).
  • When rolling your own canary, do more than just inspect the HTTP headers. Success criteria should be more than verifying that the HTTP status code is a 200 OK. If your web services return payload in the form of JSON, analyze the payload and verify that it's both syntactically and semantically correct.
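
To make the last point concrete, here is a small sketch of a payload check that goes beyond the status code. The expected shape, a JSON object whose "status" field equals "ok", is purely hypothetical; substitute whatever your service actually returns.

```python
import json


def payload_is_valid(body, expected_key="status", expected_value="ok"):
    """Return True if `body` is well-formed JSON with the expected content.

    The default key and value are hypothetical placeholders.
    """
    try:
        payload = json.loads(body)
    except (TypeError, ValueError):
        return False  # syntactically invalid: not JSON at all
    if not isinstance(payload, dict):
        return False  # valid JSON, but not the object we expect
    # Semantic check: the field must be present and hold the expected value.
    return payload.get(expected_key) == expected_value
```

A canary that runs this check will catch cases where the service answers 200 OK but serves an error page or an empty result.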

Cost of canaries

Of course, canaries are not free. Regardless of whether or not you rely on a third-party service or roll your own, you'll need to be aware of the maintenance and financial costs.

Maintenance

A canary is just another piece of software. The underlying implementation might be just a few bash scripts cobbled together or a full-blown client application. In either case, you need to maintain them just like any other code package.

Financial costs

How often is the canary running? How many instances of the canary are running? Are they geographically distributed to test from different locations? These are some of the questions that you must ask since they impact the cost of running them.
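
A back-of-the-envelope calculation can help answer these questions. The sketch below assumes a 30-day month and a flat per-run price; both, along with the function names, are illustrative assumptions rather than any provider's actual pricing model.

```python
def runs_per_month(interval_seconds, instances, locations, days=30):
    """Estimate how many canary runs happen in a month."""
    seconds_per_month = days * 24 * 60 * 60
    runs_per_canary = seconds_per_month // interval_seconds
    return runs_per_canary * instances * locations


def monthly_cost(runs, price_per_run):
    """Multiply the run count by a hypothetical flat price per run."""
    return runs * price_per_run
```

For example, a single canary running once a minute from one location comes to 43,200 runs in a 30-day month; running it every 15 seconds from several locations multiplies that quickly.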

Beyond canaries

When building systems, you want a canary that behaves like your customer, one that lets you detect issues as soon as your service chokes. If you are vending an API, then your canary should exercise the different URIs. If you are testing the front end, then your canary can be programmed to mimic a customer using a browser, with libraries such as Selenium.

Canaries are a great place to start if you are just launching a service. But there's a lot more work required to create an operationally robust service. You'll want to inject failures into your system. You also need a crystal-clear understanding of how your system should behave when its dependencies fail. These are some of the topics that I'll cover in the next series of blog posts.

Let's Connect

Let's connect and talk more about software and DevOps. Follow me on Twitter: @memattchung

Previously published on https://blog.mattchung.me/2021/06/21/is-my-service-up-and-running-canaries-to-the-rescue/.
