5

MySQL Bugs: #108012: Data corruption after graceful restart

 1 year ago
source link: https://bugs.mysql.com/bug.php?id=108012
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Data corruption after graceful restart

Bug #108012 Data corruption after graceful restart
Submitted: 28 Jul 7:10 Modified: 29 Jul 14:17
Reporter: Aleksei Faians Email Updates:
Status: Not a Bug Impact on me: None 
Category:MySQL Server Severity:S1 (Critical)
Version:8.0.30 OS:Linux
Assigned to: MySQL Verification Team CPU Architecture:x86 (x86-64)
[28 Jul 7:10] Aleksei Faians
Description:
I am running InnoDB cluster in Multi-Primary mode. Today I upgraded it from 8.0.29 to 8.0.30 and restarted one of the nodes after it successfully joined the cluster after upgrade. It failed to start. Here's the log:

2022-07-28T06:49:15.578613Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.30).
2022-07-28T06:49:19.039242Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2022-07-28T06:49:21.041788Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'
2022-07-28T06:49:22.477842Z 0 [System] [MY-010910] [Server] /opt/mysql/bin/mysqld: Shutdown complete (mysqld 8.0.30)  MySQL Community Server - GPL.
2022-07-28T06:49:22.837595Z 0 [System] [MY-010116] [Server] /opt/mysql/bin/mysqld (mysqld 8.0.30) starting as process 1871329
2022-07-28T06:49:22.847355Z 0 [Warning] [MY-013907] [InnoDB] Deprecated configuration parameters innodb_log_file_size and/or innodb_log_files_in_group have been used to compute innodb_redo_log_capacity=2147483648. Please use innodb_r
edo_log_capacity instead.
2022-07-28T06:49:22.850039Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-07-28T06:49:23.469149Z 1 [ERROR] [MY-012962] [InnoDB] The redo log file ./#innodb_redo/#ib_redo1 size 647168 is not a multiple of innodb_page_size
2022-07-28T06:49:23.469187Z 1 [ERROR] [MY-012930] [InnoDB] Plugin initialization aborted with error Generic error.
2022-07-28T06:49:23.925465Z 1 [ERROR] [MY-010334] [Server] Failed to initialize DD Storage Engine
2022-07-28T06:49:23.926123Z 0 [ERROR] [MY-010020] [Server] Data Dictionary initialization failed.
2022-07-28T06:49:23.926176Z 0 [ERROR] [MY-010119] [Server] Aborting
2022-07-28T06:49:23.927214Z 0 [System] [MY-010910] [Server] /opt/mysql/bin/mysqld: Shutdown complete (mysqld 8.0.30)  MySQL Community Server - GPL.

Restarting doesn't fix anything. I will have to re-initialize the node.

How to repeat:
1. Create InnoDB cluster in Multi-Primary mode
2. Do some noticeable reads and writes
2. Restart a node
3. Get unlucky
[28 Jul 8:17] Aleksei Faians
Even after successful --initialize subsequent server startup failed with the same error. I managed to get server started after removing this parameter from configuration:

innodb_log_file_size = 1G
[28 Jul 8:28] Aleksei Faians
After a few restarts a new empty just initialized instance got the same issue:

2022-07-28T08:25:57.973231Z 0 [System] [MY-010116] [Server] /opt/mysql/bin/mysqld (mysqld 8.0.30) starting as process 1952662
2022-07-28T08:25:57.983186Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-07-28T08:25:58.229729Z 1 [ERROR] [MY-013857] [InnoDB] The redo log file ./#innodb_redo/#ib_redo7 is smaller than 65536 bytes, which indicates it was not generated by InnoDB or become corrupted. Please restore the correct file.
2022-07-28T08:25:58.229761Z 1 [ERROR] [MY-012930] [InnoDB] Plugin initialization aborted with error Generic error.
2022-07-28T08:25:58.692173Z 1 [ERROR] [MY-010334] [Server] Failed to initialize DD Storage Engine
2022-07-28T08:25:58.692811Z 0 [ERROR] [MY-010020] [Server] Data Dictionary initialization failed.
2022-07-28T08:25:58.692864Z 0 [ERROR] [MY-010119] [Server] Aborting
2022-07-28T08:25:58.693731Z 0 [System] [MY-010910] [Server] /opt/mysql/bin/mysqld: Shutdown complete (mysqld 8.0.30)  MySQL Community Server - GPL.

This looks like a critical bug to me now.
[28 Jul 17:34] Jakub Lopuszanski
Hello Aleksei!
Thanks for report. Let's focus on the last scenario, which you've mentioned in "28 Jul 8:28" comment. I'd like to reproduce it, so I need a few more details.
1. What was the exact config and command line arguments which you've used for `--initialize`?
2. What does `ls -lh ./#innodb_redo` show right after initialize?
3. What config and command line argument did you use to start the server after initialization?
4. How did you stop the server (by `kill`? or SHUTDOWN or what?)
5. Could you share `ls -lh ./#innodb_redo` results before and after the iteration in which the server lost ability to boot?
[28 Jul 18:03] Jakub Lopuszanski
So far I've tried the most natural interpretation of the scenario you've provided but I've failed to reproduce the situation:

[jlopusza@compile-el8-01]/export/home/tmp/jlopusza/mysql-release/runtime_output_directory: rm -rf ../data/
[jlopusza@compile-el8-01]/export/home/tmp/jlopusza/mysql-release/runtime_output_directory: ./mysqld --initialize-insecure --innodb-log-file-size=1G
2022-07-28T18:00:07.813644Z 0 [System] [MY-013169] [Server] /export/home/tmp/jlopusza/mysql-release/runtime_output_directory/mysqld (mysqld 8.0.30) initializing of server in progress as process 2522136
2022-07-28T18:00:07.819359Z 0 [Warning] [MY-013907] [InnoDB] Deprecated configuration parameters innodb_log_file_size and/or innodb_log_files_in_group have been used to compute innodb_redo_log_capacity=2147483648. Please use innodb_redo_log_capacity instead.
2022-07-28T18:00:07.820396Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-07-28T18:00:08.134955Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-07-28T18:00:09.017610Z 6 [Warning] [MY-010453] [Server] root@localhost is created with an empty password ! Please consider switching off the --initialize-insecure option.
[jlopusza@compile-el8-01]/export/home/tmp/jlopusza/mysql-release/runtime_output_directory: ls -lh ../data/#innodb_redo/
total 2.0G
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo0'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo10_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo11_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo12_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo13_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo14_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo15_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo16_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo17_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo18_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo19_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo1_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo20_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo21_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo22_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo23_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo24_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo25_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo26_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo27_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo28_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo29_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo2_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo30_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo31_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo3_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo4_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo5_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo6_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo7_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo8_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo9_tmp'
[jlopusza@compile-el8-01]/export/home/tmp/jlopusza/mysql-release/runtime_output_directory: ./mysqld --innodb-log-file-size=1G --port=30303 --socket=here.sock --skip-mysqlx
2022-07-28T18:00:23.426625Z 0 [System] [MY-010116] [Server] /export/home/tmp/jlopusza/mysql-release/runtime_output_directory/mysqld (mysqld 8.0.30) starting as process 2522187
2022-07-28T18:00:23.432153Z 0 [Warning] [MY-013907] [InnoDB] Deprecated configuration parameters innodb_log_file_size and/or innodb_log_files_in_group have been used to compute innodb_redo_log_capacity=2147483648. Please use innodb_redo_log_capacity instead.
2022-07-28T18:00:23.433449Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-07-28T18:00:23.531051Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-07-28T18:00:23.707638Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2022-07-28T18:00:23.707680Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2022-07-28T18:00:23.708648Z 0 [Warning] [MY-011810] [Server] Insecure configuration for --pid-file: Location '/export/home/tmp' in the path is accessible to all OS users. Consider choosing a different directory.
2022-07-28T18:00:23.719997Z 0 [System] [MY-010931] [Server] /export/home/tmp/jlopusza/mysql-release/runtime_output_directory/mysqld: ready for connections. Version: '8.0.30'  socket: 'here.sock'  port: 30303  Source distribution.
Killed
[jlopusza@compile-el8-01]/export/home/tmp/jlopusza/mysql-release/runtime_output_directory: ls -lh ../data/#innodb_redo/
total 2.0G
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo0'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo10_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo11_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo12_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo13_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo14_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo15_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo16_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo17_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo18_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo19_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo1_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo20_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo21_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo22_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo23_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo24_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo25_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo26_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo27_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo28_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo29_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo2_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo30_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo31_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo3_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo4_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo5_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo6_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo7_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo8_tmp'
-rw-r----- 1 jlopusza common 64M Jul 28 20:00 '#ib_redo9_tmp'

As you can see the files should have 64M (=1G*2/32), so it's indeed a bit strange that InnoDB finds much smaller files on your disk.
Are you perhaps running out of space on this disc? Or restoring the files from some backup?
[28 Jul 18:32] MySQL Verification Team
Hi Alexei,

Upload please your full config too.

thanks
[29 Jul 2:40] Aleksei Faians
The issue does not always reproduce. Last time it happened only after I restarted server several times.

Also, I am using this binary package: https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-8.0.30-linux-glibc2.17-x86_64-minimal....

Config:
[mysqld]
user            = mysql-portal
pid-file        = /run/mysqld-portal/mysqld.pid
socket          = /run/mysqld-portal/mysqld.sock
port            = 13306
datadir         = /compressed/lib/mysql-portal
bind-address            = 0.0.0.0
mysqlx-bind-address     = 0.0.0.0
key_buffer_size         = 16M
log_error = /var/log/mysql-portal/error.log
max_binlog_size   = 100M
key_buffer_size                 = 256K
innodb_buffer_pool_size         = 8G
innodb_buffer_pool_instances    = 8
join_buffer_size                = 64M
sort_buffer_size                = 8M
read_rnd_buffer_size            = 1M
performance_schema              = ON
table_open_cache                = 10000
skip_name_resolve               = OFF
max_connections                 = 100
mysqlx_max_connections          = 20
tmpdir                          = /compressed/tmp
secure-file-priv                = /compressed/lib/mysql-portal-files
mysqlx_port                     = 13307
mysqlx_socket                   = /run/mysqld-portal/mysqlx.sock
report_host                     = portal-db.bsrealm.net

Command used to initialize server:
sudo -u mysql-portal /opt/mysql/bin/mysqld --defaults-file=/etc/mysql-portal/my.cnf --initialize

Command used to start/stop/restart server:
systemctl start/stop/restart mysql-portal.service

Systemd unit file:
[Unit]
Description=MySQL Community Server
After=network.target

[Install]
WantedBy=multi-user.target

[Service]
Type=notify
User=mysql-portal
Group=mysql-portal
PIDFile=/run/mysqld-portal/mysqld.pid
ExecStart=/opt/mysql/bin/mysqld --defaults-file=/etc/mysql-portal/my.cnf
TimeoutSec=600
Restart=on-failure
RuntimeDirectory=mysqld-portal
RuntimeDirectoryMode=755
LimitNOFILE=infinity
Environment=MYSQLD_PARENT_PID=1
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

Before:
~# ls -lh /compressed/lib/mysql-portal/#innodb_redo/
total 819K
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo10_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo11_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo12_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo13_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo14_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo15_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo16_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo17_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo18_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo19_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo20_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo21_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo22_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo23_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo24_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo25_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo26_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo27_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo28_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo29_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo30_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo31_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo32_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo33_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo34_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo35_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo36_tmp'
-rw-r----- 1 mysql-portal mysql-portal 2.4M Jul 29 05:37 '#ib_redo5'
-rw-r----- 1 mysql-portal mysql-portal 408K Jul 29 05:37 '#ib_redo6'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo7_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo8_tmp'
-rw-r----- 1 mysql-portal mysql-portal    0 Jul 29 05:37 '#ib_redo9_tmp'

After:
# ls -lh /compressed/lib/mysql-portal/#innodb_redo/
total 152K
-rw-r----- 1 mysql-portal mysql-portal 408K Jul 29 05:37 '#ib_redo6'
[29 Jul 2:41] Aleksei Faians
> Are you perhaps running out of space on this disc? Or restoring the files from some backup?

# df -h /compressed/
Filesystem      Size  Used Avail Use% Mounted on
compressed      900G  284G  616G  32% /compressed
[29 Jul 2:44] Aleksei Faians
Also please note, that this issue only happens with 8.0.30. I had no single issue with 8.0.29 version for months.
[29 Jul 2:52] Aleksei Faians
I tried this archive: https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-8.0.30-linux-glibc2.12-x86_64.tar.xz

The problem doesn't seem to be occurring now.

So the question is why the glibc2.17 based package is broken (again)?
[29 Jul 6:45] MySQL Verification Team
Hi,

So if I understood you correctly the mysql-8.0.30-linux-glibc2.17-x86_64 fails for you "every time" or "most of the times" and mysql-8.0.30-linux-glibc2.12-x86_64 does not fail at all? 

I tried glibc2.17 version 4 times on one server with centos and 2 times on second server with oracle linux and I did not manage to reproduce the corruption.

What filesystem are you using, what OS version, what is your glibc version on the server?

Thanks
[29 Jul 6:58] Aleksei Faians
> So if I understood you correctly the mysql-8.0.30-linux-glibc2.17-x86_64 fails for you "every time" or "most of the times" and mysql-8.0.30-linux-glibc2.12-x86_64 does not fail at all? 

Right.

> I tried glibc2.17 version 4 times on one server with centos and 2 times on second server with oracle linux and I did not manage to reproduce the corruption.
> What filesystem are you using, what OS version, what is your glibc version on the server?

ZFS with compression=on, Ubuntu 20.04. Please also note that I am preloading jemalloc.

$ cat /proc/version
Linux version 5.15.0-41-generic (buildd@lcy02-amd64-105) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #44~20.04.1-Ubuntu SMP Fri Jun 24 13:27:29 UTC 2022

$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

$ apt list libjemalloc2
Listing... Done
libjemalloc2/focal,now 5.2.1-1ubuntu1 amd64 [installed]

$ zpool status
  pool: compressed
 state: ONLINE
  scan: scrub repaired 375K in 0 days 00:16:33 with 0 errors on Sun Jul 10 00:40:35 2022
config:

        NAME                        STATE     READ WRITE CKSUM
        compressed                  ONLINE       0     0     0
          mirror-0                  ONLINE       0     0     0
            wwn-0x5002538f41c27845  ONLINE       0     0     0
            wwn-0x5002538f41c2784c  ONLINE       0     0     0

errors: No known data errors
[29 Jul 8:52] Jakub Lopuszanski
Hello Aleksei,
thank you for providing a lot of details!

AFAIU in your latest config there is no innodb_log_file_size, innodb_log_files_in_group nor innodb_redo_log_capacity, so the expectation should be that the default of innodb_redo_log_capacity=100*1024*1024 bytes, which should be divided into 32 files giving 3276800 bytes per file.

Yet, your ls -l listing after initialize shows that some files are completely empty, and some are much shorter.

This seems to me to indicate some failure during "resizing them up" - they are indeed created initially empty, and then, as they should essentially be filled with zeros (or actually "whatever") up to the expected length, we do not use write() which would be slow, but instead use fallocate().
In particular, this happens here:
  log_create_unused_file(const Log_files_context & ctx, unsigned __int64 file_id, unsigned __int64 size_in_bytes)C++
  log_files_prepare_unused_file(log_t & log, unsigned __int64 file_id, unsigned __int64 start_lsn, unsigned __int64 checkpoint_lsn, bool create_first_data_block, unsigned __int64 & file_size)
  log_files_create_file(log_t & log, unsigned __int64 file_id, unsigned __int64 start_lsn, unsigned __int64 checkpoint_lsn, bool create_first_data_block)
  log_files_create(log_t & log, unsigned __int64 flushed_lsn, unsigned __int64 & checkpoint_lsn)
  log_sys_init(bool expect_no_files, unsigned __int64 flushed_lsn, unsigned __int64 & new_files_lsn)
  srv_start(bool create_new_db)
  innobase_init_files(dict_init_mode_t dict_init_mode, List<Plugin_tablespace const> * tablespaces)

which calls os_file_set_size_fast(file_path.c_str(), file, 0, size_in_bytes, true) which you can find here:
https://github.com/mysql/mysql-server/blob/mysql-8.0.30/storage/innobase/os/os0file.cc#L55...
```
#if !defined(NO_FALLOCATE) && defined(UNIV_LINUX) && \
    defined(HAVE_FALLOC_FL_ZERO_RANGE)
...
 int ret =
      fallocate(pfs_file.m_file, FALLOC_FL_ZERO_RANGE, offset, size - offset);
...
#endif /* !NO_FALLOCATE && UNIV_LINUX && HAVE_FALLOC_FL_ZERO_RANGE */

  return os_file_set_size(name, pfs_file, offset, size, flush);
```
So, if the binary was build on a system which seemed to support `fallocate(...HAVE_FALLOC_FL_ZERO_RANGE..)`, we use it.
We should also print an error message (only once for the whole mysqld liftime) if it fails:
```
  if (print_message) {
    ib::info(ER_IB_MSG_1359) << "fallocate() failed with errno " << errno
                             << " - falling back to writing NULLs.";
    print_message = false;
  }
```
I don't see it mentioned in the fragments of logs you've shared so far, but this could be because:
a) because you're using low verbosity level. Note that by default ib::info's are note printed https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_log_error_verb... so you'd have to change it to level 3
b) the message was printed earlier for some other operation (say, tablespace generation)

Anyway, if fallocate seems unsupported or failed with error, we fallback to os_file_set_size, which does a series of writes.
I think we have not reached this call at all. I base this suspicion on following clues:
- some of your files have 0 length, for example: 0 Jul 29 05:37 '#ib_redo13_tmp'
- some have size not divisible by UNIV_PAGE_SIZE (16KB according to your config), for example:  408K Jul 29 05:37 '#ib_redo6'
- some have non-zero size (for example #ib_redo5 and #ib_redo6), and these files are exactly those which are not temporary, which means they store actual redo log data, so were written to, which suggests that writes with data "in principle" work on this machine, so if we got to call os_file_set_size then it would work (at least "should")

So, what I am now suspecting the most is that fallocate "pretends to work" - does not report an error, but also does not extend the file length.

Aleksei, would it be possible for you to run a simple experiment with small C++ program using `fallocate` on this machine?
$ cat f.cpp

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <cstdlib>
#include <cstdio>
#include <linux/falloc.h>
int main(int argc,const char * args[]) {
  if(argc!=3){
    printf("Please run with:\n  %s path_to_create len\n", args[0]);
    return 2;
  }
  const auto path=args[1];
  const int size=atoi(args[2]);
  printf("Creating file %s\n", path);
  const int fd = open(path,O_RDWR | O_CREAT | O_EXCL,  S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
  if(fd == -1){
    printf("Creation has failed\n");
    return 3;
  }
  printf("fallocate %d bytes\n", size);
  const int ret=fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, size);
  printf("fallocate returned %d\n", ret);
  return(0);
}

$ g++ f.cpp -o f
$ ./f /compressed/lib/mysql-portal/#innodb_redo/example 3276800
```
and share the output?
In my case it was something like this:
```
[jlopusza@compile-el8-01]/export/home/tmp/jlopusza/mysql-release/runtime_output_directory: ./f ./example 3276800
Creating file ./example
fallocate 3276800 bytes
fallocate returned 0
[jlopusza@compile-el8-01]/export/home/tmp/jlopusza/mysql-release/runtime_output_directory: ls -lh example
-rw-r----- 1 jlopusza common 3.2M Jul 29 10:48 example
```
But I wonder what happens on your machine and the specific drive & file system you have in /compressed/lib/mysql-portal/#innodb_redo
[29 Jul 9:05] Jakub Lopuszanski
> ZFS with compression=on, Ubuntu 20.04. Please also note that I am preloading jemalloc.

Then perhaps this thread is relevant to our situation:
https://github.com/openzfs/zfs/issues/326
[29 Jul 9:45] Aleksei Faians
root@home:/compressed# mkdir testing
root@home:/compressed# chown testing:testing testing
root@home:/compressed# ls -ld /compressed/testing/
drwxr-xr-x 2 testing testing 2 Jul 29 12:38 /compressed/testing/
root@home:/compressed# cd testing/
root@home:/compressed/testing# cat > f.cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <cstdlib>
#include <cstdio>
#include <linux/falloc.h>
int main(int argc,const char * args[]) {
  if(argc!=3){
    printf("Please run with:\n  %s path_to_create len\n", args[0]);
    return 2;
  }
  const auto path=args[1];
  const int size=atoi(args[2]);
  printf("Creating file %s\n", path);
  const int fd = open(path,O_RDWR | O_CREAT | O_EXCL,  S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
  if(fd == -1){
    printf("Creation has failed\n");
    return 3;
  }
  printf("fallocate %d bytes\n", size);
  const int ret=fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, size);
  printf("fallocate returned %d\n", ret);
  return(0);
}
root@home:/compressed/testing# sudo -u testing g++ f.cpp -o f
root@home:/compressed/testing# sudo -u testing ./f /compressed/testing/example 3276800
Creating file /compressed/testing/example
fallocate 3276800 bytes
fallocate returned 0
root@home:/compressed/testing# ls -lh
total 7.0K
-rw-rw---- 1 testing testing   0 Jul 29 12:40 example
-rwxrwxr-x 1 testing testing 17K Jul 29 12:39 f
-rw-r--r-- 1 root    root    693 Jul 29 12:39 f.cpp

So, I see that fallocate fails for some reason. But why does this problem only hits me:
1. Only with with glibc 2.17 package
2. Only with 8.0.30 version

Because I never had this issue with 8.0.29 (and older) glibc 2.17 package.
[29 Jul 10:01] Aleksei Faians
I've also managed to nail down problem to FALLOC_FL_ZERO_RANGE flag. Documentation says that not all filesystems support it and error should be returned if it's not supported. For some reason, with ZFS no error is returned.

If I replace FALLOC_FL_ZERO_RANGE with 0 in f.cpp, file is created properly:

root@home:/compressed/testing# ls -l
total 128
-rw-rw---- 1 testing testing 3276800 Jul 29 12:57 example
[29 Jul 10:38] Jakub Lopuszanski
> But why does this problem only hits me:
> 1. Only with with glibc 2.17 package
> 2. Only with 8.0.30 version

In mysql-8.0.29 the os_file_set_size_fast was used in just one place, in
create_log_file() :

https://github.com/mysql/mysql-server/blob/mysql-8.0.29/storage/innobase/srv/srv0start.cc#...

AFAICT, the mysql-8.0.29, also validated size of files on restart:
If you look into open_log_file() function it gets the size of file here:
https://github.com/mysql/mysql-server/blob/mysql-8.0.29/storage/innobase/srv/srv0start.cc#...
and then the caller (srv_start() in the case of create_new_db=false) does some checking:
https://github.com/mysql/mysql-server/blob/mysql-8.0.29/storage/innobase/srv/srv0start.cc#...
1. asserts that the size is not "undefined"
      ut_a(size != (os_offset_t)-1);
2. that the size is multiple of UNIV_PAGE_SIZE:
      if (size & ((1 << UNIV_PAGE_SIZE_SHIFT) - 1)) {
        ib::error(ER_IB_MSG_1137, logfilename, ulonglong{size});
        return (srv_init_abort(DB_ERROR));
      }
3. finally, it just accepts that the size of first file (ib_redo0) it has found on disc is the one to expect from other files:
```
  if (i == 0) {
        srv_log_file_size = size;
...
  } else if (!srv_dedicated_server && size != srv_log_file_size) {
...
    ib::error(ER_IB_MSG_1138, logfilename, ulonglong{size},
                  srv_log_file_size);

    return (srv_init_abort(DB_ERROR));
  }
```

I'd expect that if you use the same implementation of fallocate then mysql-8.0.29 should also fail to start if you stopped it immediatelly after --initialize.
When you say it works fine on mysql-8.0.29, do you mean 
a) specifically that `--initialize` creates files of correct size
or
b) that you have not seen problems in production so far?

If it is b), then plausible explanation is that in 8.0.29 we reused the same two files over and over again in cyclical fashion, so if you let the server run for long enough then it should have write actual redo log data into files, which would make their size reach the full capacity, so after restart they'd appear full. 
This theory doesn't fit entirely, as `--initialize` usually doesn't produce enough redo log data to fill two files up to the brim.
So perhaps you've used 8.0.29 with a different implementation of fallocate?
Or perhaps copied/restored redo log files from a different machine/file system?

At any rate, could you repeat your experiment with mysql-8.0.29 on the same filesystem?
Also, do you happen what link/exact package name did you use to download 8.0.29? Perhaps it was using different version of glibc?
[29 Jul 10:51] Aleksei Faians
> I'd expect that if you use the same implementation of fallocate then mysql-8.0.29 should also fail to start if you stopped it immediatelly after --initialize.
> When you say it works fine on mysql-8.0.29, do you mean 
> a) specifically that `--initialize` creates files of correct size
> or
> b) that you have not seen problems in production so far?

The answer is b.

> So perhaps you've used 8.0.29 with a different implementation of fallocate?
> Or perhaps copied/restored redo log files from a different machine/file system?

At time of initialization version was 8.0.26 or something and it was running on ext4 fs. Later if was moved to ZFS. So if before 8.0.30 fallocate() was only used at initialization time, it explains everything.

> At any rate, could you repeat your experiment with mysql-8.0.29 on the same filesystem?

I don't see a point. We already know that fallocate() fails on ZFS when used with FALLOC_FL_ZERO_RANGE and it doesn't even return the error, so it's a problem in kernel/glibc/driver, not in MySQL. But now since we know it happens, I believe, some kind of check or a workaround should be applied. I would also like to report the problem upstream, but I don't know where exactly this report should go. I'd appreciate it if you could point me.

> what link/exact package name did you use to download 8.0.29

https://downloads.mysql.com/archives/get/p/23/file/mysql-8.0.29-linux-glibc2.17-x86_64-min...
[29 Jul 11:14] Jakub Lopuszanski
I've consulted our Release Engineering team and they say that the glibc2.12 variant of MySQL is build&compiled on EL6, on which FALLOC_FL_ZERO_RANGE is not supported:
```
-bash-4.1$  ack FALLOC_FL_ZERO_RANGE  /usr/include/
-bash-4.1$  ack FALLOC  /usr/include/
/usr/include/linux/falloc.h
1:#ifndef _FALLOC_H_
2:#define _FALLOC_H_
4:#define FALLOC_FL_KEEP_SIZE   0x01 /* default is extend size */
5:#define FALLOC_FL_PUNCH_HOLE  0x02 /* de-allocates range */
8:#endif /* _FALLOC_H_ */

/usr/include/linux/fuse.h
262:    FUSE_FALLOCATE     = 43,
-bash-4.1$
```
so, the generated binary doesn't even try to use fallocate (as it is disabled by #ifdef) and it goes right to the slow path of writing zeros manually.

Assuming that on your combination of machine+glibc+ZFS+OS the fallocate implementation is broken in the sense of returning 0 (success) even though it has not changed the (uncompressed) size of the file (which your experiments with f.cpp seem to suggest), then indeed using glibc2.12 variant of MySQL could have saved you from experiencing this bug, while using glibc2.17 variant (which is built on machine which supports FALLOC_FL_ZERO_RANGE) has exposed you to the problem.

> I don't see a point. We already know that fallocate() fails on ZFS when used with FALLOC_FL_ZERO_RANGE and it doesn't even return the error, so it's a problem in kernel/glibc/driver, not in MySQL. 

So it seems, right.

> But now since we know it happens, I believe, some kind of check or a workaround should be applied. 

Hm, I'm not sure what can I offer here. I assume you really want to keep the redo log on ZFS (probably to take snapshots?). Then what you might try is to either use the glibc2.12 version as you've already figured, or try 
the Ubuntu-specific binary we also offer for download (but this one might also "assume" fallocate works correctly), or recompile the mysqld from source disabling the fragment for HAVE_FALLOC_FL_ZERO_RANGE.

I'm not sure we can really "fix" the "generic linux build" variant, as we really want to use fallocate for performance reasons, and if the compile-time check, and runtime checks are both "lying to us", that the function is working fine, then what exactly can we do about it?

Perhaps we could check if the size after the call is what we expect it to be, and if it is not, then we could emit a warning and fall back to slow path?
I'm not sure of the performance impact of this.
What I am sure of, though, is that it adds to the code complexity.

Is the "use glibc2.12 variant" a viable option for you?

> I would also like to report the problem upstream, but I don't know where exactly this report should go. I'd appreciate it if you could point me.

Thank you! As I've mentioned before I see these people work on it:
https://github.com/openzfs/zfs/issues/326
but I am not sure if the bug is there, or in glibc.
[29 Jul 11:32] Jakub Lopuszanski
Hi Aleksei,
I've found this report:
https://bugs.launchpad.net/ubuntu/+source/mysql-8.0/+bug/1969247
[29 Jul 11:41] Aleksei Faians
> Hm, I'm not sure what can I offer here.

I am not trying to solve my own problem right now. I want you to make sure that no one else will get their instances destroyed. I am fine with glibc 2.12 build for now.

So, one possible workaround is to implement a simple check at startup to make sure that fallocate() actually creates files of desired size, which in case of failure disables further usage of fallocate().

Another possible workaround is to disable usage of fallocate() completely if ZFS is detected. This won't affect performance much because compression makes zero-filling files an almost instant operation:

root@home:~# time dd if=/dev/zero of=/compressed/testing/zeroes.dat bs=1M count=10240 conv=fsync
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 3.00176 s, 3.6 GB/s

real    0m3.005s
user    0m0.029s
sys     0m2.282s

And I really doubt that anyone would run MySQL on ZFS without compression.

> As I've mentioned before I see these people work on it: https://github.com/openzfs/zfs/issues/326

This one is closed, fallocate() support was added to ZFS and it works if FALLOC_FL_ZERO_RANGE is not used. So it looks to me like the bug is in glibc, it doesn't seem to make sure that FALLOC_FL_ZERO_RANGE is supported by fs when issuing fallocate() system call. I'll report it to Launchpad, they'll probably know what to do next.

Thanks for the very professional debugging session! :)
[29 Jul 13:59] Jakub Lopuszanski
Also, if anyone else is experiencing this particular issue, I advice not to remove redo log files! They may contain valueable data, and more importantly, perhaps can be fixed, by simply appending zeros to them, so that the size is at 64K and divisible by innodb_page_size (which usually is 16K).
In theory you should be able to that using `truncate` or `fallocate` command-line utilities, but given that they might use fallocate function and thus fail for same reason as MySQL, you might prefer to use `dd` to literally write zeros.
I haven't tested it myself, but I think that something like:
```
for x in *ib_redo*;
do 
  s=$(stat --format="%s" $x); 
  # this assumes innodb_page_size = 16K
  ((B=1<<14)); 
  # round up to multiple of B
  ((ns=(s+B-1)/B*B )); 
  # number of bytes missing
  ((delta=ns-s));
  if((0<$delta));
  then 
    echo "Missing $delta bytes in $x"; 
    # make a backup, in case I've messed up dd call
    cp $x $x.backup;
    dd if=/dev/zero bs=1 count=$delta >> $x;   
  fi; 
done
```
should round up all files to 16K.
Similar approach could be used to fix those which are smaller than 65536.
[22 Aug 12:01] Reio Remma
Just upgraded to 8.0.30 and also happened on this.

mysql-community-server.x86_64 8.0.30-1.el8 @mysql80-community

Running in AlmaLinux release 8.6 container on Proxmox along with ZFS.

Managed to switch over to my single non-ZFS slave.

Is my only option right now to take a backup from the non-ZFS server and revert all to 8.0.29?
[22 Aug 17:20] Reio Remma
I can confirm that the script above worked nicely to fix redo log an a slave.
[23 Aug 8:39] Jakub Lopuszanski
Thank you Reio Remma for conducting this real world experiment, and sharing results - I hope it might help others in similar situation.

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK