
Analyzing Amazon Aurora Slow Logs with pt-query-digest


In this blog post we shall discuss how you can analyze slow query logs from Amazon Aurora for MySQL (referred to as Amazon Aurora in the rest of this blog). The tools and techniques explained here apply to the other MySQL-compatible services available under Amazon Aurora. However, we'll focus specifically on analyzing slow logs from Amazon Aurora version 2 (MySQL 5.7 compatible) using pt-query-digest. We believe there is a bug in Aurora where it logs really big numbers for query execution and lock times for otherwise really fast queries.

So, the main steps we need are:

  1. Enable slow query logging on your Amazon Aurora DB parameter group, apply the change when appropriate.
  2. Download the slow log(s) that cover the time period you are interested in investigating, and optionally concatenate them.
  3. Run pt-query-digest on the downloaded logs and check the results.

Enable slow query logging

For our testing we decided to capture all the SELECT queries hitting our Amazon Aurora instance, mainly because we had a sysbench OLTP read-only workload that wouldn't really produce many slow queries. An easy way to do so is to enable slow query logging and set long_query_time to 0. To achieve that, we created a new DB parameter group and applied it to our test Aurora instance with the following three parameters:

slow_query_log=1
long_query_time=0
min_examined_row_limit=0

Once you have the above configuration applied to Amazon RDS, you will be able to see slow query logs being created in the Amazon RDS console.
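
If you prefer the AWS CLI to the RDS console, a DB parameter group can be created, configured, and attached to the instance along these lines. This is a sketch: the parameter group name is made up for illustration, and the parameter group family must match your Aurora version (aurora-mysql5.7 for Aurora version 2).

$ aws rds create-db-parameter-group --db-parameter-group-name aurora-slowlog-pg --db-parameter-group-family aurora-mysql5.7 --description "Capture all queries in the slow log"
$ aws rds modify-db-parameter-group --db-parameter-group-name aurora-slowlog-pg --parameters "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate" "ParameterName=long_query_time,ParameterValue=0,ApplyMethod=immediate" "ParameterName=min_examined_row_limit,ParameterValue=0,ApplyMethod=immediate"
$ aws rds modify-db-instance --db-instance-identifier perconasupport --db-parameter-group-name aurora-slowlog-pg --apply-immediately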

Download the log file

You can download the log file of your choice using either the Amazon RDS console OR you can use the following AWS CLI command to achieve the same:

$ aws rds download-db-log-file-portion --db-instance-identifier perconasupport  --starting-token 0 --output text --log-file-name slowquery/mysql-slowquery.log.2018-09-03.09 > mysql-slowquery.log.2018-09-03.09

Depending on the size of the chosen log file, the above command will take some time to complete the download.
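
If you are not sure which file covers your time window, you can list the available slow log files first; Aurora rotates them hourly, as the hour suffix in the file names suggests. When the period you care about spans multiple hours, download each file and concatenate them before running the analysis. The second command below is just an illustration of that step:

$ aws rds describe-db-log-files --db-instance-identifier perconasupport --filename-contains slowquery
$ cat mysql-slowquery.log.2018-09-03.08 mysql-slowquery.log.2018-09-03.09 > mysql-slowquery.log.combined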

Run pt-query-digest on the log file

Once the file has been downloaded, you can analyze it using the following pt-query-digest command.

$ pt-query-digest --group-by fingerprint --order-by Query_time:sum mysql-slowquery.log.2018-09-03.09
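
In this command, --group-by fingerprint aggregates queries by their abstracted form and --order-by Query_time:sum ranks them by total execution time. If the report is long you may want to cap the number of queries shown and save the output to a file; --limit is a standard pt-query-digest option, and the report file name below is just an example:

$ pt-query-digest --group-by fingerprint --order-by Query_time:sum --limit 10 mysql-slowquery.log.2018-09-03.09 > digest-report.txt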

On our Aurora test slow log file, the initial results didn’t look right so we had to apply a workaround. Here is the header of the initial results from pt-query-digest:

# 456.2s user time, 2.5s system time, 43.80M rss, 141.48M vsz
# Current date: Tue Sep 4 15:54:21 2018
# Hostname: aahmed-GL503VD
# Files: mysql-slowquery.log.2018-09-03.09
# Overall: 5.13M total, 60 unique, 1.43k QPS, 507.43Gx concurrency _______
# Time range: 2018-09-03T08:00:04 to 2018-09-03T09:00:03
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 1826227663297288s 1us 18446744073710s 355917782s 761us 80127878922s 93us
# Lock time 1401952549601936s 0 18446744073710s 273229812s 44us 70205933577s 23us
# Rows sent 94.71M 0 100 19.35 97.36 37.62 0.99
# Rows examine 216.26M 0 300 44.19 299.03 84.74 0.99
# Query size 196.24M 5 1.24k 40.08 72.65 18.90 36.69
 
# Profile
# Rank Query ID Response time Calls R/Call 
# ==== ====================== =========================== ======= ========
# 1 0xE81D0B3DB4FB31BC5... 1346612317380813.0000 73.7% 3194111 421592210.5966 18... SELECT sbtest?
# 2 0x9934EF6887CC7A638... 147573952589685.0625 8.1% 319381 462062403.8051 18... SELECT sbtest?
# 3 0x8D589AFA4DFAEEED8... 110680464442264.1094 6.1% 319411 346514254.1812 18... BEGIN
# 4 0xFF7C69F51BBD3A736... 92233720368565.1875 5.1% 319388 288782673.0139 18... SELECT sbtest?
# 5 0xFFFCA4D67EA0A7888... 73786976294861.9844 4.0% 321238 229695665.8143 18... COMMIT
# MISC 0xMISC 55340232221335.8281 3.0% 657509 84166501.4796 0.0 <43 ITEMS>

What's wrong with the above results is that the total Exec time and Lock time are impossibly large numbers. Digging deeper revealed a problem with the slow logs themselves: some entries had huge values logged for Query_time and Lock_time. In our case, of the 5.13 million queries in the log file, only 111 had the anomaly. Even so, that was enough to skew the results.

# Time: 2018-09-03T08:41:47.363522Z
--
SELECT c FROM sbtest1 WHERE id=24278;
# Time: 2018-09-03T08:41:49.363224Z
# User@Host: perconasupport[perconasupport] @ [172.30.2.111] Id: 20869
# Query_time: 18446744073709.550781 Lock_time: 18446744073709.550781 Rows_sent: 1 Rows_examined: 1
SET timestamp=1535964109;
SELECT c FROM sbtest2 WHERE id=989322;
# Time: 2018-09-03T08:41:49.363296Z
--
BEGIN;
# Time: 2018-09-03T08:41:53.362947Z
# User@Host: perconasupport[perconasupport] @ [172.30.2.111] Id: 20873
# Query_time: 18446744073709.550781 Lock_time: 18446744073709.550781 Rows_sent: 1 Rows_examined: 1
SET timestamp=1535964113;
SELECT c FROM sbtest1 WHERE id=246889;
# Time: 2018-09-03T08:41:53.363003Z

Incorrect logging
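
Before deciding how to handle these entries, it's worth confirming how rare they are. A simple grep for the bogus value gives a quick count of the affected entries; on this file it should agree with the 111 anomalous queries mentioned above:

$ grep -c "Query_time: 18446744073709" mysql-slowquery.log.2018-09-03.09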

The above two queries are, in fact, really fast, but for some reason the execution time and lock time are logged incorrectly in the slow query log. Since the number of such records is statistically negligible compared to the total number of queries, we decided to ask pt-query-digest to ignore them using the command line parameter --attribute-value-limit. The default value of this parameter is 0. We set it to 2^32 (4294967296) so that the abnormally large values in the slow query log would be ignored. The pt-query-digest command then became:

$ pt-query-digest --group-by fingerprint --order-by Query_time:sum --attribute-value-limit=4294967296 mysql-slowquery.log.2018-09-03.09

This caused the 111 queries with the bad log times to be ignored, and the results looked good. In our case, the ignored queries were bad variants of queries for which good versions existed; you can tell this because the number of unique queries remained the same after the bad variants were ignored. However, this may not always hold true, and you should expect to lose some fidelity, especially if you are analyzing a smaller slow log.

# 441s user time, 450ms system time, 38.19M rss, 111.76M vsz
# Current date: Tue Sep 4 16:23:33 2018
# Hostname: aahmed-GL503VD
# Files: mysql-slowquery.log.2018-09-03.09
# Overall: 5.13M total, 60 unique, 1.43k QPS, 0.30x concurrency __________
# Time range: 2018-09-03T08:00:04 to 2018-09-03T09:00:03
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 1096s 1us 198ms 213us 761us 431us 93us
# Lock time 180s 0 103ms 34us 44us 161us 23us
# Rows sent 94.71M 0 100 19.35 97.36 37.62 0.99
# Rows examine 216.26M 0 300 44.19 299.03 84.74 0.99
# Query size 196.24M 5 1.24k 40.08 72.65 18.90 36.69
 
# Profile
# Rank Query ID Response time Calls R/Call V/M Ite
# ==== =========================== ============== ======= ====== ===== ===
# 1 0xE81D0B3DB4FB31BC558CAE... 400.1469 36.5% 3194111 0.0001 0.00 SELECT sbtest?
# 2 0xF0C5AE75A52E847D737F39... 161.4065 14.7% 319453 0.0005 0.00 SELECT sbtest?
# 3 0xFFFCA4D67EA0A788813031... 155.8740 14.2% 321238 0.0005 0.00 COMMIT
# 4 0x8D589AFA4DFAEEED85FFF5... 107.9827 9.9% 319411 0.0003 0.00 BEGIN
# 5 0x9934EF6887CC7A6384D1DE... 94.1002 8.6% 319381 0.0003 0.00 SELECT sbtest?
# 6 0xFF7C69F51BBD3A736EEB1B... 79.9279 7.3% 319388 0.0003 0.00 SELECT sbtest?
# 7 0xA729E7889F57828D3821AE... 75.3969 6.9% 319398 0.0002 0.00 SELECT sbtest?
# MISC 0xMISC 21.1212 1.9% 18658 0.0011 0.0 <41 ITEMS>
 
# Query 1: 1.27k QPS, 0.16x concurrency, ID 0xE81D0B3DB4FB31BC558CAEF5F387E929 at byte 358647353
# Scores: V/M = 0.00
# Time range: 2018-09-03T08:00:04 to 2018-09-03T08:42:00
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 62 3194111
# Exec time 36 400s 10us 198ms 125us 332us 300us 80us
# Lock time 74 134s 0 26ms 42us 49us 154us 27us
# Rows sent 3 3.01M 0 1 0.99 0.99 0.11 0.99
# Rows examine 1 3.01M 0 1 0.99 0.99 0.11 0.99
# Query size 57 112.37M 32 38 36.89 36.69 0.53 36.69
# String:
# Databases perconasupport
# Hosts 172.30.2.111
# Users perconasupport
# Query_time distribution
# 1us
# 10us ################################################################
# 100us ##############
# 1ms #
# 10ms #
# 100ms #
# 1s

That number looks familiar

The really big number 18446744073709.550781 seemed to ring a bell. A quick web search revealed that it could be a regression of an old bug in MySQL’s code. The following bugs were found to have the same value being reported for query exec time & query lock time.

  1. https://bugs.mysql.com/bug.php?id=59757
  2. https://bugs.mysql.com/bug.php?id=63524
  3. https://bugs.mysql.com/bug.php?id=35396
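
Incidentally, the value itself explains a lot: 18446744073709.550781 is what you get when a wrapped-around unsigned 64-bit microsecond counter (roughly 2^64 microseconds) is converted to seconds and stored as a double, which is consistent with the underflow described in those bug reports. A quick back-of-the-envelope check with awk reproduces the exact value seen in the log:

$ awk 'BEGIN { printf "%.6f\n", 2^64 / 1e6 }'
18446744073709.550781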

Once slow logs were enabled, we used this sysbench command to generate the workload for the Amazon Aurora instance. You might like to try it yourself. Please note that we used sysbench version 1.0.14.

$ sysbench --db-driver=mysql --mysql-user=perconasupport --mysql-host=perconasupport-1234567.cgmobiazycdv.eu-west-1.rds.amazonaws.com --mysql-password=XXXXXXX  --mysql-db=perconasupport --range_size=100 --table_size=1000000 --tables=2 --threads=6 --events=0 --time=600 --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua run
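
Note that the test tables have to be created and populated before the run step. If you want to reproduce this, a prepare invocation along the following lines should work, reusing the same connection options (the endpoint and password above are placeholders):

$ sysbench --db-driver=mysql --mysql-user=perconasupport --mysql-host=perconasupport-1234567.cgmobiazycdv.eu-west-1.rds.amazonaws.com --mysql-password=XXXXXXX --mysql-db=perconasupport --table_size=1000000 --tables=2 --threads=6 /usr/share/sysbench/oltp_read_only.lua prepare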

If you are an Amazon Aurora user, have you found any problems analyzing slow query logs? You are welcome to use the comments section, below, to let me know.

Percona Toolkit

pt-query-digest is part of Percona Toolkit, a collection of advanced open source command-line tools, developed and used by the Percona technical staff. Percona Toolkit is open source and free to download and use.

