MySQL: IS NULL conditions and indexes

With time these Roman columns are going NULL

in Baalbak, Lebanon

It is not uncommon for an application to use WHERE conditions like this:

WHERE status = 'DELETED' OR status IS NULL

Actually, this particular one comes from the real world, some years ago.

If you run EXPLAIN , such conditions typically only cause the type column to be show ref_or_null . Whereas, without the check on NULL , it will be ref .

But does this mean that only a small detail in the execution will change, while the query will still be extremely fast? In case no one told you, any good database professional has one answer that is almost always valid. And that answer is: it depends . I know it’s frustrating, but maybe I can alleviate the frustration by explaining why it depends.

A look at numbers

I created a test table with slightly more than 3.5M rows. I built an index on the columns (a, b) , in this order, and another on (b) only. Then I’ve run the following queries, and I felt good.

mysql> SELECT COUNT(*) FROM t WHERE (a = 2 OR a IS NULL) AND (b =  5);
+----------+
| COUNT(*) |
+----------+
|      212 |
+----------+
1 row in set (0.01 sec)

mysql> SELECT COUNT(*) FROM t WHERE (a = 2) AND (b =  5 OR b IS NULL);
+----------+
| COUNT(*) |
+----------+
|      120 |
+----------+
1 row in set (0.01 sec)

As you can see, if we add an IS NULL condition to any column in the index, the query remains fast. The problem is when we add it to more than one column.

mysql> SELECT COUNT(*) FROM t WHERE (a = 2 OR a IS NULL) AND (b =  5 OR b IS NULL);
+----------+
| COUNT(*) |
+----------+
|  1466664 |
+----------+
1 row in set (1 min 21.32 sec)

Blimey! It’s dead slow, right? You may think that this depends on the number of rows. We are selecting a much higher number of rows, that’s the reason for the slowness. We can test it easily: I replaced all NULL s with 0 values and repeated the query:

mysql> SELECT COUNT(*) FROM t WHERE (a = 2 OR a IS NULL) AND (b =  5 OR b IS NULL);
+----------+
| COUNT(*) |
+----------+
|      120 |
+----------+
1 row in set (0.02 sec)

Very fast, as expected! But, is this slowdown really normal? Let’s see what happens if we look for 0 values:

mysql> SELECT COUNT(*) FROM t WHERE (a = 2 OR a = 0) AND (b =  5 OR b = 0);
+----------+
| COUNT(*) |
+----------+
|  2457536 |
+----------+
1 row in set (1.93 sec)

We selected even more rows this time, but the query took less than 2 seconds. Still slow, but it’s huge improvement: the other version took 81 seconds!

Query plans

What is the difference between these two queries? The first could not use the right index, because ref_or_null cannot be used on multiple columns . It used the index on (b) instead.:

mysql> EXPLAIN SELECT COUNT(*) FROM t WHERE (a = 2 OR a IS NULL) AND (b =  5 OR b IS NULL) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: ref_or_null
possible_keys: idx_b,idx_a_b
          key: idx_b
      key_len: 10
          ref: NULL
         rows: 1815359
     filtered: 11.43
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

The other version checks two regular values per column, so it uses range on the right index:

<pre><code>mysql> EXPLAIN SELECT COUNT(*) FROM t WHERE (a = 2 OR a = 0) AND (b =  5 OR b = 0) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: range
possible_keys: idx_b,idx_a_b
          key: idx_a_b
      key_len: 10
          ref: NULL
         rows: 1908763
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)</code></pre>

A range search can involve multiple indexes. The Extra column confirms that the query is executed by only reading the idx_a_b index.

More details on this optimisation are in IS NULL optimization , in MySQL documentation.

Equal followed by range

You may or may not be familiar with this aspect of query optimisation, that I will probably describe in a later post. If you know what I’m talking about, you may at least be glad to know that the “= must precede >” rule works with both versions of the query. In other words, it is not affected by the ref_or_null search type.

mysql> EXPLAIN SELECT COUNT(*) FROM t WHERE (a = 2000 OR a = 0) AND (b >  5 OR b = 0) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: range
possible_keys: idx_b,idx_a_b
          key: idx_a_b
      key_len: 10
          ref: NULL
         rows: 1761559
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.01 sec)

mysql> EXPLAIN SELECT COUNT(*) FROM t WHERE (a = 2000 OR a IS NULL) AND (b >  5 OR b IS NULL) \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: range
possible_keys: idx_b,idx_a_b
          key: idx_a_b
      key_len: 10
          ref: NULL
         rows: 4
     filtered: 100.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

To NULL or not to NULL

There may be several reasons to use a non-value. Some of them are conceptual; I will try to explain in a later post that the SQL NULL marker does not address them, because it is inherently wrong.

Other reasons may be technical. For example, Using NULL as default value by FromDual’s Shinguz, argues that using NULL makes tables smaller.

But I believe that – ideally – we should avoid anything that makes a good query plan impossible. This is something you may want to think about, before declaring your next column NULL able.

What do you think about this? Please let me know in the comments.

If you found this article useful,

you may consider to contact me for the training:

MySQL Query Optimisation

Toodle pip,
Federico

Photo credit: link

A look at numbers

Query plans

Equal followed by range

To NULL or not to NULL

Recommend

windows下GitHub的SSH key配置

Custom defn macros with clojure.spec

Leaf：美团分布式ID生成服务开源

图解：链表的快慢指针，解决 80% 的链表面试题

java~api返回值的标准化

Usability improvements in GCC 9

Java虚拟机如何处理异常

100 Most Popular Languages on GitHub in 2019

Queuesim: A simple bounded queue simulator

活动 Web 页面人机识别验证的探索与实践

About Joyk