
MySQL 101: Basic MySQL Server Triage

source link: https://www.percona.com/blog/2021/04/07/mysql-101-basic-mysql-server-triage/

So your MySQL server has crashed.  What do you do now?  When a server is down, in my opinion, there are two essential steps, and neither should be neglected:

  1. Save diagnostic information for root cause analysis (RCA).
  2. Get the server back up and running.

Too many people rush to Step #2 and lose pertinent diagnostics from Step #1.  Likewise, others spend too much time on Step #1 and delay getting to Step #2 and restoring service.  The goal is to collect diagnostics as quickly as possible for later review while getting service restored as fast as possible.

As a Technical Account Manager (TAM) assisting on server restoration calls, I have seen both issues at play.  Technical resources have a tendency to get so bogged down in trying to understand the cause of the server outage that they forget the downtime is costing the business money.  The desire to crawl through server logs, review metrics, pore over system metrics, and so on, can be too tempting for some who are concerned that important diagnostic data will be lost when service is restored.  This is a valid concern, but there must be a middle ground.

Conversely, many, especially those in management, will demand that service be restored immediately so business functions can continue.  Of course, after the service is back up, the demand for an RCA will come.  Sadly, many metrics, and some logs, are lost when a server is bounced.  Below are basic guidelines on what to collect for MySQL.  The steps are in no particular order.

  1. Save a copy of the MySQL Error Log.
    Shell
    sudo cp /path/to/datadir/*.log /some/where/safe
  2. Make a copy of the MySQL configuration file.
    Shell
    sudo cp /path/to/my.cnf /some/where/safe
  3. Make a copy of system logs and save them somewhere on persistent storage in a location that will not be overwritten.  Consider doing something like the following on Linux:
    Shell
    sudo cp /var/log/syslog /some/where/safe/syslog
    sudo cp /var/log/messages /some/where/safe/messages
    sudo journalctl -e > /some/where/safe/journalctl.txt
  4. If MySQL is still running and you can log in, get some MySQL metrics.  You will want to save the output into files somewhere.
    Shell
    sudo mysqladmin -i10 -c10 proc > /some/where/safe/mysql_procs.txt
    sudo mysqladmin -i10 -c10 ext > /some/where/safe/mysql_ext.txt
    mysql -e "SHOW GLOBAL VARIABLES;" > /some/where/safe/global_variables.txt
    mysql -e "SHOW ENGINE INNODB STATUS\G" > /some/where/safe/innodb_status.txt
  5. If MySQL is running and you have Percona Toolkit, you should collect some pt-stalk output.
    Shell
    sudo ./pt-stalk --no-stalk --iterations=2 --sleep=30 --dest=/some/where/safe -- --user=root --password=<mysql-root-pass>
  6. If you have space and time, a copy of the database files (data directory in MySQL) could be helpful.  Certainly, for many installations, getting all of the data files will be impossible.  If it is a small database and space and time allow, it can be best to get all the files just in case.
    Shell
    sudo cp -R /path/to/datadir /some/where/safe/datadir
  7. Copy database logs and save them somewhere safe for later review.  Systems like Percona XtraDB Cluster (PXC) will create GRA files during an issue, which can be very helpful for determining the root cause.  By combining the GRA header file with the contents of the GRA log files, you can use the mysqlbinlog command to get the records of the transactions causing issues (a rough decoding sketch follows this list).  More information can be found in one of our older blog posts, Percona XtraDB Cluster (PXC): what about GRA_*.log files?
    Shell
    sudo cp /path/to/data/dir/GRA* /some/where/safe/datadir/
  8. Save system metrics pertaining to CPU, I/O, and memory usage:
    Shell
    sudo mpstat -A 1 60 > /some/where/safe/mpstat.txt
    sudo vmstat 1 60 > /some/where/safe/vmstat.txt
    sudo iostat -dmx 1 60 > /some/where/safe/iostat.txt
  9. Save system info.
    Shell
    sudo cat /proc/cpuinfo > /some/where/safe/cpuinfo.txt
  10. If you have Percona Toolkit, the following would be very helpful:
    Shell
    sudo pt-summary > /some/where/safe/pt-summary.txt
    sudo pt-mysql-summary > /some/where/safe/pt-mysql-summary.txt
  11. Get hardware diagnostics.
    Shell
    # disk info
    sudo df -k > /some/where/safe/df_k.txt
    sudo lsblk -o KNAME,SCHED,SIZE,TYPE,ROTA > /some/where/safe/lsblk.txt
    sudo lsblk --all > /some/where/safe/lsblk-all.txt
    # lv/pv/vg only for systems with LVM
    sudo lvdisplay --all --maps > /some/where/safe/lvdisplay-all-maps.txt
    sudo pvdisplay --maps > /some/where/safe/pvdisplay-maps.txt
    sudo pvs -v > /some/where/safe/pvs_v.txt
    sudo vgdisplay > /some/where/safe/vgdisplay.txt
    # nfsstat for systems with NFS mounts
    sudo nfsstat -m > /some/where/safe/nfsstat_m.txt
    sudo nfsiostat 1 120 > /some/where/safe/nfsiostat.txt
    # Collect hardware information
    sudo dmesg > /some/where/safe/dmesg.txt
    sudo free -m > /some/where/safe/free_m.txt
    sudo dmesg -T > /some/where/safe/dmesg_t.txt
    ulimit -a > /some/where/safe/ulimit_a.txt
    sudo cat /proc/sys/vm/swappiness > /some/where/safe/swappiness
    sudo numactl --hardware > /some/where/safe/numactl-hardware.txt
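
As referenced in Step #7 above, here is a rough sketch of how a GRA file could be decoded with mysqlbinlog.  This is only an illustration: it assumes you have a binlog header file matching your MySQL version (the linked blog post explains where to get one), and the file names GRA_header and GRA_1_23.log are example names, not ones your server will necessarily produce.
    Shell
    # Prepend a binlog format header so mysqlbinlog can parse the writeset
    cat GRA_header GRA_1_23.log > /some/where/safe/GRA_1_23.binlog
    # Decode it; -vvv reconstructs row events as commented pseudo-SQL
    mysqlbinlog -vvv /some/where/safe/GRA_1_23.binlog > /some/where/safe/GRA_1_23.txt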

It goes without saying that it would be best to script the above into a useful bash script you can run when there is an issue.  Just be sure to test the script in advance of an issue.
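
As a minimal sketch only (paths, the destination directory, and collection timings are placeholders to adapt; it assumes MySQL client credentials are available via something like ~/.my.cnf and that the tools used are installed), such a script might look like this:
    Shell
    #!/bin/bash
    # Rough triage-collection sketch -- run as root; test and adapt before relying on it
    DEST=/some/where/safe/triage-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$DEST"

    # Logs and configuration
    cp /path/to/datadir/*.log "$DEST/" 2>/dev/null
    cp /path/to/my.cnf "$DEST/"
    journalctl -e > "$DEST/journalctl.txt"

    # MySQL metrics, only if the server still responds
    if mysqladmin ping >/dev/null 2>&1; then
        mysqladmin -i10 -c10 proc > "$DEST/mysql_procs.txt" &
        mysqladmin -i10 -c10 ext > "$DEST/mysql_ext.txt" &
        mysql -e "SHOW GLOBAL VARIABLES;" > "$DEST/global_variables.txt"
        mysql -e "SHOW ENGINE INNODB STATUS\G" > "$DEST/innodb_status.txt"
    fi

    # System metrics, collected in parallel; wait for everything to finish
    vmstat 1 60 > "$DEST/vmstat.txt" &
    iostat -dmx 1 60 > "$DEST/iostat.txt" &
    mpstat -A 1 60 > "$DEST/mpstat.txt" &
    wait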

Again, the goal is to preserve diagnostic data that could be useful for determining the root cause of the issue later, after service is restored.  Just don’t get caught up in looking through the above diagnostics!  Certainly, more data is better, but the above is a great starting point.  As time goes on, you may realize you wish you had other metrics and can add them to your script or Standard Operating Procedure (SOP).

Naturally, adding monitoring like Percona Monitoring and Management (PMM) would be a great option that can save you a lot of time and collect even more trends over time which can be extremely helpful.

With the above diagnostics, you would have a ton of information to find the root cause in the event of an issue.  Now, you can sort through the diagnostics.  Of course, if you need help with that, Percona can help you here as well.

