UNION ALL, pt. II: polymorphic data
以下为 快照 页面，建议前往来源网站查看，会有更好的阅读体验。
UNION ALL, pt. II: polymorphic data
In the previous post we started discussing the UNION ALL operator, and some ideas around it, such as constant datasets, SQL notation and vertical joins.
It turns out that there is an interesting connection between UNION ALL and polymorphic data. There seems to be at least three kinds of polymorphic data (I’m not sure if this is a commonly-used term, I borrowed it from the polymorphic associations in ActiveRecord, the Ruby ORM):
A noun could be linked to several other nouns (e.g., you can add images to posts, to comments and to user’s profile);
Several different nouns could be treated similarly (e.g. you can sell both books and washing machines, and their price and weight could be handled in the same way);
Some either/or/or data structures could be represented in interesting ways. This is going to be the topic of today’s post.
Distributed calculation service
We’ll implement the service that calculates arithmetic expressions.
The client submits an expression, such as “2+2”, and receives the unique ID that corresponds to this expression;
There are many distributed worker processes; one of them will pick up this expression and calculate it;
The result will be stored in the database, available for retrieval by the client;
If the expression is invalid (e.g. “2/0” or “4 +”), the error is generated and stored in the database, again available for retrieval by the client;
We’re going to use a hybrid model with a separate queue processor such as Kafka and a database: we don’t really want to discuss database-based queue implementations, ever. But we will treat the database as a reliability baseline, and tolerate the queue processor sometimes being down.
We’ll begin exploring the design space of the database part starting with the implementation that uses four narrow tables. Then we’ll move to the single-table implementation and see what happens along the way.
CREATE TABLE exprs ( id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT, expression VARCHAR(255) NOT NULL -- e.g., "2+2" ); CREATE TABLE exprs_pending ( id INTEGER NOT NULL PRIMARY KEY -- no other fields ); CREATE TABLE exprs_results ( id INTEGER NOT NULL PRIMARY KEY, result VARCHAR(100) NOT NULL -- e.g. "4" ); CREATE TABLE exprs_errors ( id INTEGER NOT NULL PRIMARY KEY, error_text VARCHAR(100) NOT NULL -- e.g. "Division by zero" );
Here is what happens in each scenario:
When the client submits a new expression to calculate, we add a row to the exprs table, thus receiving a new unique ID, and in the same transaction we insert a row into the exprs_pending table; then we submit a new task to the queue processor (using the received ID);
When the queue processor schedules a new worker with one of the IDs, it first checks if this ID is still in the exprs_pending table, and if not then it does nothing. Otherwise, the worker calculates the expression and inserts the result value into the expr_results table; in the same transaction it removes the row from the exprs_pending table;
If the expression could not be calculated due to syntax error, division by zero or something like that, the error is inserted into expr_errors table, and in the same transaction the row is removed from exprs_pending;
When the client wants to retrieve the calculation results, we first check exprs_results table and return the result if it is ready; otherwise, we check exprs_errors table and return the error information if it is there; otherwise, we check exprs_pending table and return the “pending” response.
If the queue processor failed when we tried to submit a new task, we still have a record in the exprs_pending table; this allows us to periodically reprocess this table, re-submitting all tasks to the queue processor.
This is a very simple and straightforward implementation. It is minimal in a sense that there are no NULL values, and each table has predictable behaviour that intuitively corresponds to the expected traffic:
The exprs table grows linearly with the size of our business, and it is append-only;
The exprs_pending table is almost always very small (if our workers are able to process traffic in time); this table has twice as much write traffic as other tables (one insert and one delete for each task), but because the table is so narrow it’s going to be very efficient;
The exprs_results table mostly tracks the behavior of exprs table (if the average rate of incorrect expressions submitted by the clients stays low);
The exprs_errors table is also linear to the traffic, but with a pretty small coefficient (if it starts to grow faster then we have some misbehaving clients, and we should monitor it).
Let’s create a view that presents all those tables as a single dataset. Why we’re doing that is a separate interesting question.
CREATE VIEW v_exprs AS SELECT e.id, e.expression, ‘pending’ AS status, NULL AS result, NULL AS error_text FROM exprs_pending ep INNER JOIN exprs.e ON ep.id = e.id UNION ALL SELECT e.id, e.expression, ‘processed’ AS status, er.result, NULL AS error_text FROM exprs_results er INNER JOIN exprs.e ON er.id = e.id UNION ALL SELECT e.id, e.expression, ‘error’ AS status, NULL AS result, ee.error_text FROM exprs_errors ee INNER JOIN exprs.e ON ee.id = e.id;
This view corresponds to a wide table with the following schema:
CREATE TABLE exprs_all ( id INTEGER NOT NULL PRIMARY KEY AUTO_INCREMENT, expression VARCHAR(255) NOT NULL, status VARCHAR(16) NOT NULL, result VARCHAR(100), error_text VARCHAR(100) );
What’s interesting here is that many people would begin with implementing exactly this table in the first place, and won’t bother with splitting it into narrow tables. It’s trivial to write down how each of the scenarios listed above would be handled in this schema.
We could say that the design of this table is less than minimal. This substack is called “Minimal modeling” because I want to understand if there is a useful definition of minimality.
Why is it not minimal?
This design requires storing NULL values;
In practice it requires an index on ‘status’ field, to improve the performance of typical queries;
It needs a ‘status’ field (it was not needed in the narrow schema);
The behavior of the wide table is not as straightforward: there are many updates happening, including updates on indexed columns;
This is not a critique of a traditional approach, even though “minimal” definitely sounds cooler. In practice wide tables work perfectly well for a wide range of tasks and traffic patterns. We really just want to analyze this design space to death.
Here are some random notes that could be useful for the future discussions.
1/ In the previous post we said that “UNION ALL [...] provides a way to cleanly split processing of independent datasets”. It really seems that the key word here is “independent”. We can see that in the scenarios breakdown for the narrow tables: in each case only a small subset of tables is used. If we write down the scenarios for the wide table, we’ll see that a single table is used in all cases, by definition. This creates a dependency between the scenarios.
We’re going to explore the nature of data dependencies in the future posts.
2/ It seems that constant values in a table column is a bit of an anti-pattern (from the minimality point of view). We’ve discussed pure and structural attributes, and it feels that this statement would be important in the discussion of other structural types.
3/ It seems that neither of both schemas completely solve the data integrity problem. You can find a lot of tangled attributes and rows, with potential discrepancies just shifting below the problem surface and popping up here and there.
Introducing CHECK constraints in the wide table as an attempt to solve this would only work for this particular either/or data structure: namely, if both attributes in branches are scalar (“result” and “error_text” in our case). We could change the problem a little bit: clients submit PDF files, and our workers extract all the images from them. This change makes CHECK constraints non-applicable again.
4/ The distribution of values in the ‘status’ field is not balanced. Normally the majority of values would be “processed”, some small percentage would be “error”, and only a handful of rows should be in a “pending” state. This makes the index not as efficient as it could be.
The goal of this post is to show that the traditional wide table schema for this problem really seems to be a UNION ALL of independent datasets. We already know that UNION ALL both combines them and provides a way to split them out if needed. So we could have analyzed the wide schema and would arrive at the four-table implementation. Thus, we have an interesting analysis tool that may become useful again when we’ll discuss the polymorphic data in depth.
P.S.: follow me on Twitter: https://twitter.com/alexeymakhotkin.