21

Supporting all kinds of outer references in derived tables (lateral, or not)

 4 years ago
source link: https://www.tuicool.com/articles/mi2iiuE
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

ZnEjUfR.jpg!web

(Image credit: Pixabay).

Inmy earlier post, I showed how MySQL, since version 8.0.14, has support for LATERAL derived tables. With LATERAL, a JOIN can have a second table – a subquery-based derived table – be defined based on values from columns of the first table, and thus be re-calculated for each row of the first table. Typically:

SELECT ... FROM t1, LATERAL (SELECT ... FROM t2
                ^            WHERE t2.col=t1.col ... ) AS derived;
                |                           |
                |                           |
                +---------------------------+

We say that in the second table ( derived ),  t1.col is a “lateral outer reference” to the first table t1 . The word “lateral” hints at the fact that the referenced table is placed “next to”, “on the side of” the derived table (i.e. both are part of the same FROM clause). The arrow goes “laterally”.

While implementing this LATERAL feature, I added another related one at the same time: support for non -lateral outer references in derived tables.

A formal definition of the feature may be too dry matter here; the design doc ( here ) would be ever dryer. I’ll rather show a practical example, so you can see what it’s useful for.

Let’s pick some sample hierarchical data, which I already used in some earlier posts:

CREATE TABLE employees (
id INT PRIMARY KEY NOT NULL,
name VARCHAR(100) NOT NULL,
manager_id INT NULL,
INDEX (manager_id),
FOREIGN KEY (manager_id) REFERENCES employees (id)
);
 
INSERT INTO employees VALUES
(333, "Yasmina", NULL), # Yasmina is the CEO (manager_id is NULL)
(198, "John", 333), # John has ID 198 and reports to 333 (Yasmina)
(692, "Tarek", 333),
(29, "Pedro", 198),
(4610, "Sarah", 29),
(72, "Pierre", 29),
(123, "Adil", 692);

And let’s ask the question: how many direct and indirect reports does each person have?

Here’s my solution:

SELECT emp.*,
(
  WITH RECURSIVE reports AS
  (
    SELECT emp.id
    UNION ALL
    SELECT e.id
      FROM reports AS rep JOIN employees AS e
        ON rep.id = e.manager_id
  )
  SELECT COUNT(*)-1 FROM reports
) AS count_of_all_reports
FROM employees AS emp;

Description: for each employee (each row of ’emp’ – line 13):

  1. evaluate a scalar subquery (lines 2-12) which:
  2. builds a CTE (lines 3-10), by recursively finding all direct and indirect reports of the employee
  3. counts the rows of the CTE (line 11), substracting one to not count the employee
  4. returns the count.

Result:

+------+---------+------------+----------------------+
| id   | name    | manager_id | count_of_all_reports |
+------+---------+------------+----------------------+
|   29 | Pedro   |        198 |                    2 |
|   72 | Pierre  |         29 |                    0 |
|  123 | Adil    |        692 |                    0 |
|  198 | John    |        333 |                    3 |
|  333 | Yasmina |       NULL |                    6 |
|  692 | Tarek   |        333 |                    1 |
| 4610 | Sarah   |         29 |                    0 |
+------+---------+------------+----------------------+
7 rows in set (0.02 sec)

And that brings me to the feature announced in the introduction. Please look at the CTE’s definition: it starts recursion from “SELECT emp.id” which is a reference to the current employee we want to count for; this “emp.id” comes from a row of “emp”, which comes from outside of the CTE.

If we draw an arrow from “reference” to “referenced column”, this arrow starts from the CTE, traverses its boundaries, traverses the surrounding scalar subquery’s boundaries too, and reaches to the top query. That is why it’s not a lateral reference.

SELECT emp.*,
(
  WITH RECURSIVE reports AS
  (           +----------------------------------+
              |                                  |
    SELECT emp.id                                |
    UNION ALL                                    |
    SELECT e.id                                  |
      FROM reports AS rep JOIN employees AS e    |
        ON rep.id = e.manager_id                 |
  )                                              | crosses CTE's bounds
  SELECT COUNT(*)-1 FROM reports                 |
) AS count_of_all_reports                        | crosses scalar subquery's bounds
FROM employees AS emp;                           |
                   ^                             |
                   |                             |
                   +-----------------------------+ reaches to farthest outside

Before MySQL 8.0.14 this wasn’t possible (MySQL would complain that it doesn’t know what “emp.id” is, in the CTE’s definition).

Now MySQL detects this reference; it concludes that the scalar subquery and its contained CTE must be re-calculated for every row of “emp.id”.

Looking at EXPLAIN for our query:

+----+--------------------+------------+------------+------+---------------+------------+---------+--------+------+----------+------------------------+
| id | select_type        | table      | partitions | type | possible_keys | key        | key_len | ref    | rows | filtered | Extra                  |
+----+--------------------+------------+------------+------+---------------+------------+---------+--------+------+----------+------------------------+
|  1 | PRIMARY            | emp        | NULL       | ALL  | NULL          | NULL       | NULL    | NULL   |    7 |   100.00 | NULL                   |
|  2 | DEPENDENT SUBQUERY |            | NULL       | ALL  | NULL          | NULL       | NULL    | NULL   |    3 |   100.00 | NULL                   |
|  3 | DEPENDENT DERIVED  | NULL       | NULL       | NULL | NULL          | NULL       | NULL    | NULL   | NULL |     NULL | No tables used         |
|  4 | UNCACHEABLE UNION  | rep        | NULL       | ALL  | NULL          | NULL       | NULL    | NULL   |    2 |   100.00 | Recursive; Using where |
|  4 | UNCACHEABLE UNION  | e          | NULL       | ref  | manager_id    | manager_id | 5       | rep.id |    1 |   100.00 | Using index            |
+----+--------------------+------------+------------+------+---------------+------------+---------+--------+------+----------+------------------------+

we see that MySQL has recognized that the scalar subquery is “dependent” (depends on outer data), and the same is true for the derived table. It has also seen that the content of the UNION in the CTE is “uncacheable” – cannot be “cached”, must be re-calculated every time.

To recap, starting from MySQL 8.0.14:

  • by default, when parsing a derived table’s definition, MySQL accepts non-lateral outer references, as in the example query above.
  • if you add the LATERAL keyword, MySQL also accepts lateral outer references; in other words, it additionally searches in the FROM clause containing the derived table.

Note: there are other solutions to the reports-count problem. One solution is to build a large result in one pass using one recursive CTE, to list all connections between all employees and each of their indirect managers, then use this large result to aggregate

for each manager; I had coded it at the end of this post . It works, but it’s hard to read.  What  we do above, instead, is to generate smaller sets from the hierarchy, piece by piece. So it’s “walk a part of the hierarchy / aggregate / repeat” instead of “walk the whole  hierarchy / aggregate”.

That’s all for today. I hope I have demonstrated that the new feature gives you more expressive power when writing queries, and of course one less “oh apparently the DBMS doesn’t like it”  :grinning:

Thank you for using MySQL!


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK