
MySQL 101: Basic MySQL Server Triage

source link: https://www.percona.com/blog/2021/04/07/mysql-101-basic-mysql-server-triage/

So your MySQL server has crashed.  What do you do now?  When a server is down, in my opinion, there are two essential steps, and neither should be neglected:

  1. Save diagnostic information for root cause analysis (RCA).
  2. Get the server back up and running.

Too many people rush to Step #2 and lose pertinent diagnostics from Step #1.  Likewise, others spend too much time on Step #1 and delay getting to Step #2 and restoring service.  The goal is to collect diagnostics as quickly as possible for later review while getting service restored as fast as possible.

As a Technical Account Manager (TAM) assisting on server restoration calls, I have seen both issues at play.  Technical resources have a tendency to get so bogged down in trying to understand the cause of the server outage that they forget the downtime is costing the business money.  The desire to crawl through server logs, review metrics, pore over system metrics, and so on, can be too tempting for some who are concerned that important diagnostic data will be lost when service is restored.  This is a valid concern, but there must be a middle ground.

Conversely, many, especially those in management, will demand that service be restored immediately so business functions can continue.  Of course, after the service is back up, the demand for an RCA will come.  Sadly, many metrics, and some logs, are lost when a server is bounced.  Below are basic guidelines on what to collect for MySQL.  The steps are in no particular order.

  1. Save a copy of the MySQL Error Log.
    Shell
    sudo cp /path/to/datadir/*.log /some/where/safe
  2. Make a copy of the MySQL configuration file.
    Shell
    sudo cp /path/to/my.cnf /some/where/safe
  3. Make a copy of system logs and save them somewhere on persistent storage in a location that will not be overwritten.  Consider doing something like the following on Linux:
    Shell
    sudo cp /var/log/syslog /some/where/safe/syslog
    sudo cp /var/log/messages /some/where/safe/messages
    sudo journalctl -e > /some/where/safe/journalctl.txt
  4. If MySQL is still running and you can log in, get some MySQL metrics.  You will want to save the output into files somewhere.
    Shell
    sudo mysqladmin -i10 -c10 proc > /some/where/safe/mysql_procs.txt
    sudo mysqladmin -i10 -c10 ext > /some/where/safe/mysql_ext.txt
    mysql -e "SHOW GLOBAL VARIABLES;" > /some/where/safe/global_variables.txt
    mysql -e "SHOW ENGINE INNODB STATUS\G" > /some/where/safe/innodb_status.txt
  5. If MySQL is running and you have Percona Toolkit, you should collect some pt-stalk output.
    Shell
    sudo ./pt-stalk --no-stalk --iterations=2 --sleep=30 --dest=/some/where/safe -- --user=root --password=<mysql-root-pass>
  6. If you have space and time, a copy of the database files (data directory in MySQL) could be helpful.  Certainly, for many installations, getting all of the data files will be impossible.  If it is a small database and space and time allow, it can be best to get all the files just in case.
    Shell
    sudo cp -R /path/to/datadir /some/where/safe/datadir
  7. Copy database logs and save them somewhere safe for later review.  Systems like Percona XtraDB Cluster (PXC) will create GRA files during an issue, which can be very helpful for determining the root cause.  By combining the GRA header file with the contents of the GRA log files, you can use the mysqlbinlog command to get the records of the transactions causing issues (a rough decoding sketch follows this list).  More information can be found in one of our older blog posts, Percona XtraDB Cluster (PXC): what about GRA_*.log files?
    Shell
    sudo cp /path/to/data/dir/GRA* /some/where/safe/datadir/
  8. Save system metrics pertaining to CPU, I/O, and memory usage:
    Shell
    sudo mpstat -A 1 60 > /some/where/safe/mpstat.txt
    sudo vmstat 1 60 > /some/where/safe/vmstat.txt
    sudo iostat -dmx 1 60 > /some/where/safe/iostat.txt
  9. Save system info.
    Shell
    sudo cat /proc/cpuinfo > /some/where/safe/cpuinfo.txt
  10. If you have Percona Toolkit, the following would be very helpful:
    Shell
    sudo pt-summary > /some/where/safe/pt-summary.txt
    sudo pt-mysql-summary > /some/where/safe/pt-mysql-summary.txt
  11. Get hardware diagnostics.
    Shell
    # disk info
    sudo df -k > /some/where/safe/df_k.txt
    sudo lsblk -o KNAME,SCHED,SIZE,TYPE,ROTA > /some/where/safe/lsblk.txt
    sudo lsblk --all > /some/where/safe/lsblk-all.txt
    # lv/pv/vg only for systems with LVM
    sudo lvdisplay --all --maps > /some/where/safe/lvdisplay-all-maps.txt
    sudo pvdisplay --maps > /some/where/safe/pvdisplay-maps.txt
    sudo pvs -v > /some/where/safe/pvs_v.txt
    sudo vgdisplay > /some/where/safe/vgdisplay.txt
    # nfsstat for systems with NFS mounts
    sudo nfsstat -m > /some/where/safe/nfsstat_m.txt
    sudo nfsiostat 1 120 > /some/where/safe/nfsiostat.txt
    # Collect hardware information
    sudo dmesg > /some/where/safe/dmesg.txt
    sudo free -m > /some/where/safe/free_m.txt
    sudo dmesg -T > /some/where/safe/dmesg_t.txt
    ulimit -a > /some/where/safe/ulimit_a.txt
    sudo cat /proc/sys/vm/swappiness > /some/where/safe/swappiness
    sudo numactl --hardware > /some/where/safe/numactl-hardware.txt
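
As referenced in Step #7 above, here is a rough sketch of how a GRA file could be decoded with mysqlbinlog.  This is only an illustration: it assumes you have a binlog header file matching your MySQL version (the linked blog post explains where to get one), and the file names GRA_header and GRA_1_23.log are example names, not ones your server will necessarily produce.
    Shell
    # Prepend a binlog format header so mysqlbinlog can parse the writeset
    cat GRA_header GRA_1_23.log > /some/where/safe/GRA_1_23.binlog
    # Decode it; -vvv reconstructs row events as commented pseudo-SQL
    mysqlbinlog -vvv /some/where/safe/GRA_1_23.binlog > /some/where/safe/GRA_1_23.txt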

It goes without saying that it would be best to script the above into a useful bash script you can run when there is an issue.  Just be sure to test the script in advance of an issue.
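
As a minimal sketch only (paths, the destination directory, and collection timings are placeholders to adapt; it assumes MySQL client credentials are available via something like ~/.my.cnf and that the tools used are installed), such a script might look like this:
    Shell
    #!/bin/bash
    # Rough triage-collection sketch -- run as root; test and adapt before relying on it
    DEST=/some/where/safe/triage-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$DEST"

    # Logs and configuration
    cp /path/to/datadir/*.log "$DEST/" 2>/dev/null
    cp /path/to/my.cnf "$DEST/"
    journalctl -e > "$DEST/journalctl.txt"

    # MySQL metrics, only if the server still responds
    if mysqladmin ping >/dev/null 2>&1; then
        mysqladmin -i10 -c10 proc > "$DEST/mysql_procs.txt" &
        mysqladmin -i10 -c10 ext > "$DEST/mysql_ext.txt" &
        mysql -e "SHOW GLOBAL VARIABLES;" > "$DEST/global_variables.txt"
        mysql -e "SHOW ENGINE INNODB STATUS\G" > "$DEST/innodb_status.txt"
    fi

    # System metrics, collected in parallel; wait for everything to finish
    vmstat 1 60 > "$DEST/vmstat.txt" &
    iostat -dmx 1 60 > "$DEST/iostat.txt" &
    mpstat -A 1 60 > "$DEST/mpstat.txt" &
    wait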

Again, the goal is to preserve diagnostic data that could be useful for determining the root cause of the issue later, after service is restored.  Just don’t get caught up in looking through the above diagnostics!  Certainly, more data is better, but the above is a great starting point.  As time goes on, you may realize you wish you had other metrics and can add them to your script or Standard Operating Procedure (SOP).

Naturally, adding monitoring like Percona Monitoring and Management (PMM) would be a great option that can save you a lot of time and collect even more trends over time which can be extremely helpful.

With the above diagnostics, you would have a ton of information to find the root cause in the event of an issue.  Now, you can sort through the diagnostics.  Of course, if you need help with that, Percona can help you here as well.

