Introducing Pulse: Envoy Mobile’s stats library

 4 years ago
source link: https://eng.lyft.com/pulse-stats-apis-from-envoy-mobile-6db71c6a2f22

Earlier this year, we published a deep dive after the v0.2 release of Envoy Mobile. A lot has happened since then: Envoy Mobile is now enabled in the production Lyft apps, and the open source project recently joined the CNCF. Today, we are excited to announce Pulse, Envoy Mobile’s stats solution.

Stats APIs for mobile? Why?

In February, we published a blog post detailing how Envoy Mobile enabled a new range of end-to-end observability of our network. Pulse is a continuation of this work.

Real-time observability is essential for server-side development. There are existing industry standard open source solutions, such as Prometheus and StatsD. Service owners are accustomed to instrumenting and monitoring their services with various metric types like counters, gauges, and histograms. In comparison, on mobile, observability has conventionally focused on crash reporting and event tracking:

Crash reporting: Apps typically use third party reporting tools focused on crashes and exceptions, such as Crashlytics or Bugsnag.

  • Pros: These tools usually report events with relatively low latency (usually minutes-level).
  • Cons: The reporting is only focused on crashes and exceptions. Additionally, these tools are usually not configurable or extensible. For instance, it’s difficult to integrate them with alarm systems like PagerDuty.

Event tracking: Many apps use in-house or third party systems to gather analytics events like user interactions, diagnostic data, etc. for different purposes.

  • Pros: Compared with crash reporting systems, analytics event systems usually allow for custom structured data. This makes it possible to run ad-hoc queries to gather insights about specific features of the apps.
  • Cons: Data usage is relatively high (compared with crash reporting). Therefore, data reporting resolution is usually in the longer-than-minutes range.

To reduce the time needed for anomaly detection and to save developers’ time when triaging issues in different areas, Pulse provides a set of easy-to-use APIs for mobile engineers to report time-series data (think data points) from their apps. These time-series are then fed into observability systems and used to render real-time stats.

At Lyft, we recently started integrating real-time stats into our clients. Using these stats, we were able to build a dashboard that tracks app crashes and enable a corresponding set of PagerDuty alerts. Previously, engineers relied solely on the Bugsnag UI and fished for crashes for specific releases. It was manual and time consuming. Time to root cause issues was delayed by our inability to know immediately when our apps were not behaving as expected. Now, when the app crash metric spikes, our mobile on-call engineers get paged and immediately start acting on the incident.

A few weeks ago, the app crash metric spiked for Lyft’s rider app. Thanks to an alert driven by these stats, the on-call engineers were paged immediately. They acted on the issue quickly, merged a hot-fix, and minimized the potential impact of the incident.

[Figure: app_crash count metric spiked at ~9:55am PT on Nov. 9]

Another valuable aspect of real-time stats is monitoring specific areas of the Lyft apps. Pulse’s stats APIs are flexible and can be applied to any area of the codebase. For example, if the count of taps on the “Request a Ride” button (a relevant interaction for the Lyft rider app) suddenly spikes or drops, it likely suggests a problem worth investigating.

What are the APIs and how does Pulse work?

Currently, Pulse supports two types of stats: Counter and Gauge.

Counter: a value that can be incremented.

Many mobile apps use analytics for tracking occurrences of an event. Conceptually, this practice acts as a counter. Taking the example from the previous section, traditionally when a user tapped the “Request a Ride” button, an analytics event was emitted. With Pulse, the app instead has a “request_ride” counter that is incremented every time the user taps the button.
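To make the counter idea concrete, here is a minimal, language-neutral sketch in Python. The `Counter` class, its `increment` method, and the `on_request_ride_tapped` handler are hypothetical names chosen for illustration; they are not Pulse's actual API, which is described in the Envoy Mobile documentation.

```python
import threading


class Counter:
    """Hypothetical counter for illustration only (not Pulse's real API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._value = 0
        # Taps may arrive from different threads, so guard the value.
        self._lock = threading.Lock()

    def increment(self, count: int = 1) -> None:
        # A counter only ever goes up; reject negative deltas.
        if count < 0:
            raise ValueError("a counter can only be incremented")
        with self._lock:
            self._value += count

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


request_ride = Counter("request_ride")


def on_request_ride_tapped() -> None:
    # Assumed tap handler: bump the counter instead of emitting an analytics event.
    request_ride.increment()


# Simulate three taps on the "Request a Ride" button.
for _ in range(3):
    on_request_ride_tapped()
```

The point of the sketch is the shape of the data: a single named value that only grows, which is cheap to report and easy to alert on.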

Gauge: a value that can be incremented or decremented.

Gauge stats are less common in mobile development than Counters. A simple example is a gauge that reports the number of network connections in flight for an app at a given time. This number is a useful metric to observe in real time: an anomaly (a sudden spike or drop) most likely means something is wrong.
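The in-flight connections example could be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical `Gauge` class with `add`, `sub`, and `set` methods and a hypothetical `track_connection` helper; none of these names are taken from Pulse itself.

```python
import threading
from contextlib import contextmanager


class Gauge:
    """Hypothetical gauge for illustration only (not Pulse's real API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._value = 0
        self._lock = threading.Lock()

    def add(self, amount: int = 1) -> None:
        with self._lock:
            self._value += amount

    def sub(self, amount: int = 1) -> None:
        with self._lock:
            self._value -= amount

    def set(self, value: int) -> None:
        with self._lock:
            self._value = value

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


connections_in_flight = Gauge("connections_in_flight")


@contextmanager
def track_connection():
    # Raise the gauge when a request starts, lower it when the request ends,
    # even if the request fails with an exception.
    connections_in_flight.add(1)
    try:
        yield
    finally:
        connections_in_flight.sub(1)


# Two overlapping requests: the gauge reads 2 while both are open.
with track_connection():
    with track_connection():
        peak = connections_in_flight.value
```

Unlike a counter, the gauge returns to its baseline once the work finishes, so a value that stays high (or never rises) is itself a signal.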

We plan to support a third type of stat: Histogram.

Histogram: a histogram samples observations and counts them in configurable buckets.

For example, a histogram could report request durations or time-to-interact on fresh app starts.
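Since Histogram support is still planned, the following is only a sketch of the general bucketing technique, not of Pulse's eventual API. The `Histogram` class, the bucket bounds, and the sample durations are all made up for illustration.

```python
import bisect


class Histogram:
    """Hypothetical histogram with configurable, inclusive upper bounds."""

    def __init__(self, name, upper_bounds):
        self.name = name
        self.upper_bounds = sorted(upper_bounds)
        # One slot per configured bucket, plus a final overflow slot
        # for observations larger than every bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value,
        # i.e. the smallest bucket that can hold the observation.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.total += value


# Assumed bucket bounds in milliseconds: <=100, <=250, <=500, <=1000, overflow.
request_duration_ms = Histogram("request_duration_ms", [100, 250, 500, 1000])

for duration in [80, 120, 240, 700, 1500]:
    request_duration_ms.observe(duration)
```

After these five observations, the bucket counts are `[1, 2, 0, 1, 1]`: one fast request, two in the 100 to 250 ms range, one between 500 ms and 1 s, and one in the overflow bucket.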

If you want to read more about Pulse’s APIs and how to use them, check the documentation, along with some example apps for both iOS and Android.

Reporting stats to the server

Client stats are received server-side by a gRPC service based on StatsD. The service expects a list of time-series stats serialized as a standard Prometheus MetricFamily. The service then flushes the stats to Lyft’s internal observability systems, where they’re used to populate dashboards and alarms. More details on the full end-to-end system are described in a previous observability blog post.
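As a rough illustration of the payload shape, the sketch below builds one MetricFamily-like entry per stat using plain Python dictionaries. The field names (`name`, `type`, `metric`, `counter`, `gauge`, `value`, `timestamp_ms`) mirror the Prometheus `MetricFamily` message, but the helper functions are hypothetical, and a real client would use generated protobuf classes and a gRPC stub, neither of which is shown here.

```python
import time


def to_metric_family(name, metric_type, value, timestamp_ms):
    """Build a dict shaped like a Prometheus MetricFamily (illustrative only)."""
    if metric_type not in ("COUNTER", "GAUGE"):
        raise ValueError(f"unsupported metric type: {metric_type}")
    # The value field is named after the type, e.g. {"counter": {"value": 2.0}}.
    field = metric_type.lower()
    return {
        "name": name,
        "type": metric_type,
        "metric": [
            {field: {"value": float(value)}, "timestamp_ms": timestamp_ms},
        ],
    }


def build_flush_payload(counters, gauges, now_ms=None):
    """Turn in-memory stat values into the list sent on each flush."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    families = [
        to_metric_family(name, "COUNTER", value, now_ms)
        for name, value in sorted(counters.items())
    ]
    families += [
        to_metric_family(name, "GAUGE", value, now_ms)
        for name, value in sorted(gauges.items())
    ]
    return families


# Assumed snapshot of client-side stats at flush time.
payload = build_flush_payload(
    counters={"app_crash": 2, "request_ride": 15},
    gauges={"connections_in_flight": 3},
    now_ms=1_000,
)
```

Each entry carries its own name, type, and timestamped value, which is what lets the receiving service forward the stats to dashboards and alarms without knowing anything about the app that produced them.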

Current status and next steps

We are currently working on adding support for Histogram stats to Pulse’s APIs, and are also adding the ability to tag specific stats. With tagging, developers will be able to enrich their stats with different dimensions of information (for example, by adding app_foreground or app_background tags to the app_crash stats mentioned above, providing more insight into the conditions during which the app crashes happen).
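Tagging is not yet part of Pulse, so the following is only a guess at what the idea might look like in practice. The `TaggedCounter` class and its methods are hypothetical; the tag names `app_foreground` and `app_background` come from the example above.

```python
from collections import defaultdict


class TaggedCounter:
    """Hypothetical counter that keeps one value per distinct tag set."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(int)

    def increment(self, tags=(), count=1):
        # A frozenset makes the tag combination hashable and order-independent.
        self._values[frozenset(tags)] += count

    def value(self, tags=()):
        return self._values.get(frozenset(tags), 0)

    def total(self):
        # Summing across tag sets recovers the untagged app_crash count.
        return sum(self._values.values())


app_crash = TaggedCounter("app_crash")

# Assumed crash reports: two while in the foreground, one in the background.
app_crash.increment(tags={"app_foreground"})
app_crash.increment(tags={"app_foreground"})
app_crash.increment(tags={"app_background"})
```

With this shape, a dashboard could plot the overall crash count from `total()` and break it down by app state using the per-tag values.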

Our goal is to make time-series metrics not only a tool for mobile development, but a necessity: it should become second nature for development teams to consider what stats they want to monitor on their mobile clients to track performance. We’ve already begun piloting Pulse at Lyft with other feature teams, and are excited to share more of our discoveries as we make progress on Pulse.

In the meantime, feel free to try it out, contribute, or review our roadmap!

Acknowledgements

We would like to thank the following teammates for their work and contributions:

Michael Schore, Jose Nino, Alan Chiu, Michael Rebello, Rafal Augustyniak, Don Yu, and Miguel Juarez

