PipelineDB 1.0.0 – High-Performance Time-Series Aggregation for PostgreSQL

source link: https://www.tuicool.com/articles/hit/Jr67Zv

TL;DR

  • PipelineDB 1.0 is a PostgreSQL extension for high-performance time-series aggregation via continuous SQL queries. Get started here.
  • PipelineDB enables realtime reporting use cases at scale, where only summary data is needed
  • PipelineDB is now licensed under the liberal Apache 2.0 license

Overview

Just over three years ago we published the very first release of PipelineDB, which began as a fork of PostgreSQL. About thirty seconds after that, people from all corners of the earth began enthusiastically asking us if we could make PipelineDB an extension of PostgreSQL rather than a standalone fork. We've always empathized with this sentiment from a user's perspective. In addition to the technical drawbacks, running a fork carries a very real psychological burden as well.

And from our own perspective, maintaining a fork has come with its own unique challenges that often weigh heavily against the upside. So as PipelineDB has matured over the years based on usage and feedback from thousands of organizations worldwide -- including several in the Fortune 100 -- and we as a company have found our financial footing with a reliable business model, the time became right to invest heavily in giving our users what they've asked for: PipelineDB factored out into a standard PostgreSQL extension.

Our dear friends: we genuinely, seriously, from the bottom of our hearts could not be more excited and humbled to give that to you today with PipelineDB 1.0.0.

What is PipelineDB?

PipelineDB is an open-source PostgreSQL extension that enables realtime reporting use cases by continuously aggregating large streams of time series data down into summary data. It shines the most at larger scales, where storing lots of raw time-series data and aggregating it over and over again becomes inefficient. The amount of value that PipelineDB adds is directly proportional to the amount of continuous aggregation that an analytics use case can benefit from. PipelineDB should be used for analytics use cases that only require summary data, like realtime reporting dashboards.

PipelineDB challenges the widespread paradigm of storing all raw data in a database and then querying it on demand. Instead, PipelineDB enables users to run continuous aggregations over streaming time-series data, and to store only the compact output of these continuous queries as incrementally updated table rows that can be read with minimal query latency.

Our need for this “inverted database” was born out of our exhilarating experience building large-scale data infrastructure in the Ad Tech industry (hey AdRoll), where the volumes of raw data we were dealing with required breaking new ground technologically just to make even the most basic sense of our streaming time-series data in a low-latency manner. A simple yet powerful observation began to crystallize in our minds:

Everything we needed to know in realtime about these massive, perpetually growing datasets was well known by us in advance, by definition. If queries are not known in advance, there is likely not a realtime need.

So, in scenarios where queries are known in advance, they can and should be run continuously in order to make the data infrastructure that powers these realtime analytics applications simpler, faster, and cheaper than they would otherwise be with the traditional “store first, query later” data processing model.

Our use cases at AdRoll nearly always comprised some form of summary data: aggregations, unique user counts, sliding window queries, time-series charts, and other things that could be displayed as compact visualizations suited for intuitive human consumption. One benefit of continuous computation is that summary data is always available for low-latency lookups. Realtime analytics use cases should never need to compute data more than once, especially over very large datasets; systems should only need to glance at the always up-to-date aggregate results. Analytics use cases without a realtime need should be handled by a different system, one that does store granular data for ad hoc analysis.

Back then, continuous aggregation at scale necessitated some extremely innovative but complex infrastructure. It was hard to build, but it worked. And most importantly, we saw the data infrastructure we were building at AdRoll as solving an extremely general problem that could be solved far more easily in the future by the right product. That realization brought us to where we are today: PipelineDB 1.0.

PipelineDB in Action

SQL is a natural choice for a product that enables high-performance continuous aggregation. In addition to allowing continuous aggregations to be expressed intuitively, it is also useful for running further SQL queries on the continuously updating aggregate output, which is simply stored in regular tables anyway. In PipelineDB, SQL is even used to write time-series events to a stream, since streams are also structured like tables:

INSERT INTO events_stream (ts, value) VALUES (now(), '0ef346ac');

To perform an aggregation over this stream, we can use PipelineDB's most fundamental abstraction: the continuous view. A continuous view is an aggregation query defined over a stream of time-series data. For example, here's a continuous view that aggregates unique values in our stream, bucketing them by hour:

CREATE VIEW uniques AS
 SELECT hour(ts), COUNT(DISTINCT value) FROM events_stream
GROUP BY hour;

Even if billions and billions of rows are written to events_stream , our continuous view ensures that only one physical row per hour is actually persisted within the database. As soon as the continuous view reads new incoming events and the distinct count is updated to reflect the new information, the raw events are discarded. They are never stored in PipelineDB, although almost everybody writes their raw data somewhere else as well. And that simple idea enables PipelineDB users to achieve two important things:

  • Enormous levels of raw event throughput on modest hardware footprints
  • Extremely low read query latencies

It also means that the traditional coupling between data volumes ingested and data volumes stored is broken, facilitating not just high performance, but high performance sustained indefinitely. Data volumes can grow without bound while the aggregate data stored in PipelineDB remains constant or grows only minimally: a continuous view bucketed by hour stores at most 24 × 365 = 8,760 rows per year, no matter how many raw events are ingested.

While our simple example is a canonical illustration of the core concept behind PipelineDB, it generalizes to virtually any SQL aggregation query you can imagine running on streams of time-series data.

The other type of continuous query that PipelineDB allows you to define is what we call a continuous transform. Unlike continuous views -- which store aggregate state in incrementally updated tables -- continuous transforms are stateless and simply apply a transformation to a stream, writing the result out to another stream. For example, this continuous transform unpacks a PostgreSQL JSON event into separate columns:

CREATE VIEW unpack WITH (action=transform) AS
 SELECT
  (payload->>'timestamp')::timestamp AS ts,
  payload->>'user_id' AS user_id,
  payload->>'url' AS url
 FROM json_stream;

The output of this transform can then be consumed by downstream continuous views and transforms:

CREATE VIEW hourly_uniques AS
 SELECT hour(ts), COUNT(DISTINCT user_id)
 FROM output_of('unpack')
GROUP BY hour;

And realtime unique counts can be retrieved with a simple SELECT query against the continuous view:

SELECT * FROM hourly_uniques WHERE hour >= now() - interval '4 hours';

        hour         | count
---------------------+-------
 2018-10-24 00:00:00 | 44123
 2018-10-24 01:00:00 | 56332
 2018-10-24 02:00:00 | 64847
 2018-10-24 03:00:00 | 72905

PipelineDB ships with a rich set of builtin aggregates and other functionality indispensable for high-performance time-series analytics:

  • HyperLogLog-based distinct counting, merging, and manipulation
  • Bloom filters for set membership analysis
  • Top-K and “heavy hitters” tracking
  • Distributions and percentiles analysis
  • Much more -- and if we don't have something you need, feel free to let us know!
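
To give a feel for how these fit together, here is a sketch of a single continuous view combining several of these aggregates. It assumes a stream with user_id and url columns, and the aggregate names (hll_agg, bloom_agg, topk_agg) follow the PipelineDB documentation; check the docs for the exact signatures in your version:

```sql
-- Sketch only: several builtin aggregates in one continuous view.
-- Assumes a stream with user_id and url columns; aggregate names
-- follow the PipelineDB docs, verify signatures for your version.
CREATE VIEW hourly_stats AS
 SELECT hour(ts),
        hll_agg(user_id) AS uniques,    -- HyperLogLog sketch of distinct users
        bloom_agg(url) AS urls_seen,    -- Bloom filter for set membership checks
        topk_agg(url, 10) AS top_urls   -- top-10 "heavy hitters" by URL
 FROM clickstream
GROUP BY hour;
```

Because the stored values are mergeable sketches rather than plain numbers, hourly rows can later be combined into daily or weekly rollups without revisiting any raw events.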

In addition to the rich set of functionality available for defining continuous queries, continuous queries may also be chained together into arbitrarily complex topologies of continuous computation. Each continuous query produces its own output stream of incremental updates, which can be consumed by another continuous query like any other stream.
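
As a sketch of such chaining, a downstream continuous view could consume the output stream of the uniques view defined earlier. Per the PipelineDB docs, each row in an output stream carries the old and new versions of the updated aggregate row; the example below assumes that layout:

```sql
-- Sketch: consume the incremental updates of the "uniques" continuous view.
-- Output-stream rows expose (old, new) tuples of the updated aggregate row;
-- here we track the largest hourly distinct count observed so far.
CREATE VIEW peak_hourly_uniques AS
 SELECT max((new).count) AS peak
 FROM output_of('uniques');
```

The downstream view is itself incrementally updated, so arbitrarily deep topologies stay cheap to read at every level.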

Technical Notes

We began laying the groundwork for delivering PipelineDB as an extension beginning with version 0.9.7. Each release since then has moved PipelineDB incrementally closer to a standalone extension, so version 1.0.0 does not include any radical interface changes. The main things that have changed are:

  • Non-standard syntax has been removed. CREATE CONTINUOUS ... syntax has been replaced by CREATE VIEW , and CREATE STREAM ... has been replaced by CREATE FOREIGN TABLE ... (docs)
  • Configuration parameters are now qualified by pipelinedb . For example, continuous_query_num_workers is now pipelinedb.num_workers . (docs)
  • PostgreSQL's pg_dump , pg_restore , and pg_upgrade tooling is now used instead of the PipelineDB variants ( pipeline-dump , pipeline-restore , pipeline-upgrade )
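
Concretely, a stream that would previously have been declared with CREATE STREAM is now declared as a foreign table on the pipelinedb foreign server. This is a sketch per the docs; the SET form is for illustration, and some parameters may only be settable in postgresql.conf:

```sql
-- 1.0-style stream declaration: a foreign table on the pipelinedb server.
CREATE FOREIGN TABLE events_stream (ts timestamp, value text)
 SERVER pipelinedb;

-- Qualified configuration parameter, e.g. the number of worker processes
-- (illustrative; some parameters must be set in postgresql.conf).
SET pipelinedb.num_workers = 2;
```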

We have also renamed certain functions and aggregates to be descriptive about what problem they solve for you. Previous versions of PipelineDB had some fancy aggregates that operated on data structures useful for streaming computation, but we felt that their names were too opaque and thus not as helpful as they could be for users:

  • “Top-K” now represents Filtered-Space-Saving (docs)
  • “Distributions” now refer to T-Digests (docs)
  • “Frequency” now refers to Count-Min-Sketch (docs)

Beyond that, almost everything else in this release is internal rework that makes PipelineDB adhere to PostgreSQL’s standard extension interface. You won’t notice much difference in terms of how it feels to use PipelineDB, and we’ve taken great care to ensure that is the case.

PipelineDB 1.0.0 currently supports PostgreSQL versions 10.1, 10.2, 10.3, 10.4, and 10.5. Binaries supporting PostgreSQL 11 will be published soon and will not require a new version release.

Licensing

Another important topic we'd like to address is licensing. Since its release, PipelineDB had been licensed under GPL version 3. But packaging PipelineDB as an extension to an existing database warranted serious consideration about what PipelineDB's license should be. The debate was extensive and interesting. There are compelling arguments to be made both for more restrictive open-source licensing and for liberal licensing, each option with its own set of benefits and drawbacks.

Ultimately we decided to relicense PipelineDB under the Apache 2.0 open-source license. We felt that a liberal license is more consistent with our philosophy about open-source software, and more in line with our values as a company. Greater adoption means a stronger community, maximized product usage and feedback, and a broader conduit into a sustainable business model that allows us to keep reinvesting in product engineering.

We’ve been lucky enough to have a genuinely amazing community of users whose feedback, ideas and input have been instrumental in making PipelineDB what it is today. And we want to do everything we can to make PipelineDB accessible to even more people and organizations.

What’s Coming in the Next Release

Now that PipelineDB is packaged as a relatively lightweight extension, you can expect faster release cycles. Along with any necessary maintenance on 1.0.0, the release following this one is currently slated to add one major new area of functionality: automated partitioning for continuous views.

It is a very common usage pattern with PipelineDB to summarize time-series data by some temporal component (e.g. minute, hour, day), so partitioning by time range will allow users to keep queries against their continuous views as performant as possible as they grow over time. We're keeping the work slated for the follow-up release to 1.0.0 minimal in order to get partitioning out to users as soon as we possibly can.

Stay tuned, and enjoy PipelineDB 1.0.0 !

