Seafowl: a database for analytics at the edge
source link: https://www.splitgraph.com/blog/seafowl
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Seafowl: a database for analytics at the edge
Our new project: a CDN-friendly analytical database that's up to 10x faster than PostgreSQL and up to 5x faster than Splitgraph.
For the past few months, we've been hard at work on something interesting and it's finally ready for people to start experimenting with.
Seafowl is our new standalone project, inspired by the design of modern cloud data warehouses like Snowflake and BigQuery, as well as the lessons we learned over the past four years of building Splitgraph and the Splitgraph DDN.
It's a CDN and HTTP-cache friendly database designed for analytics at the edge: dashboards, notebooks, visualizations and other data-driven read-intensive Web applications that need to be responsive while operating on large datasets.
curl https://demo.seafowl.io/q/30dcf5da00f94e3fbd28cf98242f7de3b164a440cadfaa6f7de74369ffc2177d.csv \
-i -XGET -H "Content-Type: application/json" \
-d@- <<EOF
{"query": "SELECT country_of_production, COUNT(*) AS count \
FROM supply_chains \
GROUP BY 1 ORDER BY 2 DESC"}
EOF
HTTP/2 200
...
cf-cache-status: HIT
{"country_of_production":"BRAZIL","count":2386600}
{"country_of_production":"ARGENTINA","count":260293}
{"country_of_production":"INDONESIA","count":155751}
{"country_of_production":"ECUADOR","count":96842}
{"country_of_production":"PARAGUAY","count":29848}
...
Fast analytics...
Seafowl is written in Rust and relies on Apache DataFusion to plan and execute queries. For analytical queries, it's:
- up to 10x faster than vanilla PostgreSQL
- ...or up to 4x faster than PostgreSQL with the
parquet_fdw
extension - ...or around the level of DuckDB in performance, sometimes beating it (apart from some pathological queries where it has memory issues, as well as queries that use a lot of subqueries where the optimizer might not be able to build an optimal query plan)
We have some benchmarks which aren't as in-depth as TPC-DS, but do show that Seafowl makes a lot of seemingly heavy SQL queries now feasible for powering interactive applications.
For SELECT
queries, Seafowl supports a large subset of the PostgreSQL dialect.
If there's something missing, you can
write a user-defined function
for Seafowl in anything that compiles to WebAssembly.
You can write data to Seafowl with:
- standard SQL statements
(including
CREATE TABLE AS
statements for future dbt support!) - by uploading a CSV or a Parquet file
- by setting up a "live" table that points to a file on an HTTP server
You can also "bake" a dataset into a Docker image by bundling Seafowl together with its data. This is a nifty scaling strategy for smaller (well, up to 1GB, which can be as big as the 55M row subset of the New York Taxi dataset we use for our benchmarks) and read-only data.
...at the edge
"The edge" is a handwavy and marketing-y term, but it describes what Seafowl is all about well enough. You can deploy Seafowl to the edge or cache query results at the edge.
Seafowl ships as a single 50MB binary that starts in 10ms, making it easily deployable to modern serverless environments, anywhere in the world. Our own tutorial uses Fly.io as an example.
Seafowl's query execution API follows HTTP cache semantics. This means you can put Seafowl behind a CDN like Cloudflare or a cache like Varnish and have query results cached and delivered to your users in milliseconds.
Even without a dedicated cache, you can get the benefits of caching query results in your user's browser (to see it in action, go to our demo, press F12, go to the Network tab and refresh the page).
Try it out!
Seafowl is now available to download and experiment with, though, like with all early stage projects, it likely has plenty of bugs and missing features. As we continue its development, we might break APIs or change storage formats in a backwards-incompatible way.
Feel free to check out our Observable demo, go through our quickstart or through a longer tutorial in which you'll deploy Seafowl to Fly.io, put it behind Cloudflare or Varnish and reproduce our Observable demo notebook.
When something breaks, please do raise a GitHub issue. And, if it doesn't (or doesn't break badly enough), give us a GitHub star!
Future features
There are many features we're planning for Seafowl. Where appropriate, we'll also aim to upstream these changes into DataFusion itself.
JSON support
Seafowl currently supports a limited subset of PostgreSQL operators and
functions. The most notable missing feature is the JSON data type for storage of
arbitrary semi-structured objects, as well as helper functions like
json_extract_path
/ json_agg
.
See the GitHub issue for more details.
Live querying
You can currently query Parquet/CSV files over HTTP from Seafowl, but you can't query any other databases "live", like you could with Splitgraph. It's always been difficult for us to write PostgreSQL foreign data wrappers in C, since we'd have to do a lot of work to correctly push down various parts of the PostgreSQL query tree. With the extensibility of DataFusion and the memory safety of Rust, we expect manipulating the query plan to be much easier.
See the GitHub issue for more details.
Advanced PostgreSQL client support
Splitgraph has always been mostly PostgreSQL-first, with great support for PostgreSQL clients (including ones that run complex introspection queries, like Metabase).
With Seafowl, we flipped this upside down: it's HTTP-first, with basic
PostgreSQL wire protocol support as an afterthought. Perhaps it would be
interesting to polish that more, add more information_schema
tables and make
Seafowl queryable by existing PostgreSQL clients, dashboarding and BI tools.
This would also open up more data sources, for example, using Airbyte to load data into Seafowl.
...Seafowl Cloud?
No open-source project announcement nowadays is complete without teasing a managed version of it.
In our case, we already have Splitgraph and we're currently working on a better UI for it, one that moves us from obscure concepts of "data images" and "repositories" to more familiar ones like "tables" and "databases". Over time, we're also planning on migrating Splitgraph's query execution engine from PostgreSQL to Seafowl.
With that in mind, we could provide a management interface for your Seafowl instance or, later on, a true multi-tenant pay-for-what-you-use analytical query execution API, powered by Seafowl.
In the words of a certain company, "we're just getting started".
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK