52

Peter Zaitsev: Finding Table Differences on Nullable Columns Using MySQL Generat...

 5 years ago
source link: https://www.tuicool.com/articles/hit/mUr2amv
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

bE7FbqY.jpg!web Some time ago, a customer had a performance issue with an internal process. He was comparing, finding, and reporting the rows that were different between two tables. This is simple if you use a LEFT JOIN and an  IS NULL comparison over the second table in the WHERE clause, but what if the column could be null? That is why he used UNION, GROUP BY and a HAVING clauses, which resulted in poor performance.

The challenge was to be able to compare each row using a LEFT JOIN over NULL values.

The challenge in more detail

I’m not going to use the customer’s real table. Instead, I will be comparing two sysbench tables with the same structure:

CREATE TABLE `sbtest1` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned DEFAULT NULL,
  `c` char(120) DEFAULT NULL,
  `pad` char(60) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k` (`k`,`c`,`pad`)
) ENGINE=InnoDB

It is sightly different from the original sysbench schema, as this version can hold NULL values. Both tables have the same number of rows. We are going to set to NULL one row on each table:

update sbtest1 set k=null where limit 1;
update sbtest2 set k=null where limit 1;

If we execute the comparison query, we get this result:

mysql> select "sbtest1",a.* from 
    -> sbtest1 a left join 
    -> sbtest2 b using (k,c,pad) 
    -> where b.id is null union 
    -> select "sbtest2",a.* from 
    -> sbtest2 a left join 
    -> sbtest1 b using (k,c,pad)
    -> where b.id is null;
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | id   | k    | c                                                                                                                       | pad                                                         |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| sbtest1 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
| sbtest2 | 4462 | NULL | 64568100364-99474573987-46567807085-85185678273-10829479379-85901445105-43623848418-63872374080-59257878609-82802454375 | 07052127207-33716235481-22978181904-76695680520-07986095803 |
+---------+------+------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
2 rows in set (3.00 sec)

As you can see, column k is NULL. In both cases it failed and reported those rows to be different. This is not new in MySQL, but it would be nice to have a way to sort this issue out.

Solution

The solution is based on GENERATED COLUMNS with a hash function (md5) and stored in a binary(16) column:

ALTER TABLE sbtest1 
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL, 
ADD INDEX (real_id);
ALTER TABLE sbtest2 
ADD COLUMN `real_id` binary(16) GENERATED ALWAYS AS (unhex(md5(concat(ifnull(`k`,'NULL'),ifnull(`c`,'NULL'),ifnull(`pad`,'NULL'))))) VIRTUAL,
ADD INDEX (real_id);

Adding the index is also part of the solution. Now, let’s execute the query using the new column to join the tables:

mysql> select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null union
    -> select "sbtest2",a.k,a.c,a.pad from
    -> sbtest2 a left join
    -> sbtest1 b using (real_id)
    -> where b.id is null;
Empty set (2.31 sec)

We can see an improvement in the query performance—it now takes 2.31 sec whereas before it was 3.00 sec—and that the result is as expected. We could say that that’s all, and no possible improvement can be made. However, is not true. Even though the query is running faster, it is possible to optimize it in this way:

mysql> select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null) union
    -> select "sbtest2",a.k,a.c,a.pad
    -> from sbtest2 a
    -> where a.id in (select a.id
    ->   from sbtest2 a left join
    ->   sbtest1 b using (real_id)
    ->   where b.id is null);
Empty set (1.60 sec)

Why is this faster? The first query is performing two subqueries. Each subquery is very similar. Let’s check the explain plan:

mysql> explain select "sbtest1",a.k,a.c,a.pad from
    -> sbtest1 a left join
    -> sbtest2 b using (real_id)
    -> where b.id is null;
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key     | key_len | ref              | rows   | filtered | Extra                                |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE      | a     | NULL       | ALL  | NULL          | NULL    | NULL    | NULL             | 315369 |   100.00 | NULL                                 |
|  1 | SIMPLE      | b     | NULL       | ref  | real_id       | real_id | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+-------------+-------+------------+------+---------------+---------+---------+------------------+--------+----------+--------------------------------------+

As you can see, it is performing a full table scan over the first table and using real_id to join the second table. The real_id is a generated column, so it needs to execute the function to get the value to join the second table. That means that it’s going to take time.

If we analyze the subquery of the second query:

mysql> explain select "sbtest1",a.k,a.c,a.pad
    -> from sbtest1 a
    -> where a.id in (select a.id
    ->   from sbtest1 a left join
    ->   sbtest2 b using (real_id)
    ->   where b.id is null);
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
| id | select_type  | table       | partitions | type   | possible_keys | key        | key_len | ref              | rows   | filtered | Extra                                |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+
|  1 | SIMPLE       | a           | NULL       | index  | PRIMARY       | k          | 187     | NULL             | 315369 |   100.00 | Using where; Using index             |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>    | <auto_key> | 4       | sbtest.a.id      |      1 |   100.00 | NULL                                 |
|  2 | MATERIALIZED | a           | NULL       | index  | PRIMARY       | real_id    | 17      | NULL             | 315369 |   100.00 | Using index                          |
|  2 | MATERIALIZED | b           | NULL       | ref    | real_id       | real_id    | 17      | sbtest.a.real_id |     27 |    10.00 | Using where; Not exists; Using index |
+----+--------------+-------------+------------+--------+---------------+------------+---------+------------------+--------+----------+--------------------------------------+

We are going to see that it is performing a full index scan over the first table, and that the generated column has never been executed. That is how we can go from an inconsistent result of three seconds, to a consistent result of 2.31 seconds, to finally reach a performant query using the faster time of 1.60 seconds.

Conclusions

This is not the first blog post that I’ve done about generated columns. I think that it is a useful feature for several scenarios where you need to improve performance. In this particular case, it’s also presenting a workaround to expected inconsistencies with LEFT JOINS with NULL values. It is also important to mention that this improved a process in a real world scenario.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK