
SQL That Meant Well: Applying Conditions in the Subquery vs the Outer Query


This is a new series that has been percolating in my head for a while. When I help clients with SQL, we're often adding (or removing) indexes to improve SQL performance. It has been said (though not by me) that SQL rewriting is obsolete because the Db2 optimizer will take care of it. But there are still ways to write a query that force a much more expensive access plan than is needed. This series covers SQL that returned the required results, but did so in a way that could have been written much more efficiently.

Much of what is reviewed here would be true for any RDBMS, but I know it to be true for these queries on Db2 for Linux, Unix, and Windows. Table or column names may be changed to protect the innocent.

This particular SQL was found in an analysis of the package cache using my favorite SQL for finding problem SQL in the package cache. In several different time frames analyzed, it was found to be consuming more than 10% of the rows read by all the queries in the package cache.
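The linked query is more detailed, but as a rough sketch of the idea, a query along these lines using the MON_GET_PKG_CACHE_STMT table function will surface the statements reading the most rows (the column list here is illustrative, not the full version):

-- Sketch: top statements in the package cache by rows read
SELECT ROWS_READ
     , NUM_EXECUTIONS
     , SUBSTR(STMT_TEXT, 1, 100) AS statement_start
FROM TABLE(MON_GET_PKG_CACHE_STMT(NULL, NULL, NULL, -2)) AS t
ORDER BY ROWS_READ DESC
FETCH FIRST 10 ROWS ONLY;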

SQL Statement or Fragment Found in the Wild

SELECT *
FROM    (SELECT
                     ROW_NUMBER() OVER (PARTITION BY person_id 
                        ORDER BY person_id, environment_start_dt DESC ) seq
                   , person_id
                   , sample1
                   , sample2
                   , sample3
                   , environment_no
                   , environment_start_dt
        FROM
        schema.ENVIRONMENT
        ) AS recentenv
WHERE
       seq = 1
       and person_id=?
;

Access Plan

Explaining this query showed most of the expense was in the table scan of schema.ENVIRONMENT.

Rows
     RETURN
     (   1)
      Cost
       I/O
       |
   1.91471e-06
     FILTER
     (   2)
     29995.8
      17868
       |
     573015
     TBSCAN
     (   3)
     29880.2
      17868
       |
     573015
     SORT
     (   4)
     29859.9
      17868
       |
     573015
     TBSCAN
     (   5)
     29505.3
      17868
       |
     573015
 TABLE: SCHEMA
 ENVIRONMENT
       Q1
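For anyone who wants to reproduce this kind of formatted plan, a sketch of one common workflow, assuming the explain tables already exist and substituting your own database name:

-- Populate the explain tables for the statement in question
EXPLAIN PLAN FOR
SELECT *
FROM   (SELECT ROW_NUMBER() OVER (PARTITION BY person_id
                  ORDER BY person_id, environment_start_dt DESC) seq
             , person_id, sample1, sample2, sample3
             , environment_no, environment_start_dt
        FROM schema.ENVIRONMENT) AS recentenv
WHERE  seq = 1
       AND person_id = ?;

-- Then format the most recently explained statement from the shell:
--   db2exfmt -d <dbname> -1 -o plan.txt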

A Better Way to Get the Same Results

We tried this query instead to reduce the amount of data that actually needed to be processed:

SELECT *
FROM    (SELECT
                     ROW_NUMBER() OVER (PARTITION BY person_id 
                        ORDER BY person_id, environment_start_dt DESC ) seq
                   , person_id
                   , sample1
                   , sample2
                   , sample3
                   , environment_no
                   , environment_start_dt
        FROM
        schema.ENVIRONMENT
        WHERE person_id=?
        ) AS recentenv
WHERE
       seq = 1
;

Modified Access Plan

Looking at the access plan, this syntax made a world of difference:

Total Cost:             32.5022
        Query Degree:           1

                Rows
               RETURN
               (   1)
                Cost
                 I/O
                 |
                  1
               FILTER
               (   2)
               32.5022
                  3
                 |
               1.09716
               FETCH
               (   3)
               32.5014
                  3
             /---+----\
         1.09716      573015
         IXSCAN   TABLE: SCHEMA
         (   4)    ENVIRONMENT
         21.6697        Q1
            2
           |
         573015
     INDEX: DB2INST1
    IDX_ENVIRONMENT_01
           Q1

The Difference

That’s a 99.9% reduction in the timeron cost (from 29,995.8 down to 32.5) – not only do we need to evaluate far fewer rows, but there is also an index that we can use to get those rows.

If we look at what we’re asking Db2 to do with each query, the difference is easy to understand:


Original Query: First, build a result set that returns six columns for every row in the ENVIRONMENT table, along with a column ranking the rows for each person_id from most recent environment start date (1) to least recent. After you have that, find the rows that match a specific person_id, and out of those, return the one row with a rank of 1.

Modified Query: First, build a result set that returns six columns, along with a column ranking the rows from most recent environment start date (1) to least recent, for only a specific person_id. Then return the row with a rank of 1.

Because the condition is applied in the inner query, and this column is nearly unique (any one value matches only a row or two), the number of rows that even need to be evaluated and sorted is much lower. The first query had to evaluate and work with over 500,000 rows to return one. The second query on average has to deal with one or two. This requires far fewer I/Os. Even if the table is already entirely in the buffer pool, the modified query requires fewer CPU cycles to execute. With the modified query, we're also getting the data in order from an index, which luckily allows us to eliminate the sort as well.
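The plan shows that index as IDX_ENVIRONMENT_01. Its actual definition isn't included in the plan output, but an index shaped roughly like this (the column list is an assumption based on the plan, not the real DDL) would support both the predicate and the ordering:

-- Hypothetical definition; the real IDX_ENVIRONMENT_01 may differ
CREATE INDEX db2inst1.idx_environment_01
    ON schema.environment (person_id, environment_start_dt DESC);

An equality predicate on the leading column plus the matching sort order is what lets Db2 fetch the qualifying row or two directly and skip the sort.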

