

Can Disk Space Be Saved in MySQL by Adding a Primary Key?
source link: https://www.percona.com/blog/can-disk-space-be-saved-in-mysql-by-adding-a-primary-key/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Can Disk Space Be Saved in MySQL by Adding a Primary Key?
February 9, 2024
Historically, MySQL does not require explicit primary key defined on tables, and it’s like that by default till this day (MySQL version 8.3.0). Such a requirement is imposed through two replication methods, though: Group Replication and Percona XtraDB Cluster (PXC), where using tables without a primary key is not allowed by default. There are many well-known negative performance implications for tables lacking a primary key, where the most painful is terrible replication speed.
Today, I would like to make a quick point about one more reason to have a primary key: data size on disk!
Let’s consider this very simple example table:
mysql > show create table test1\G *************************** 1. row *************************** Table: test1 Create Table: CREATE TABLE `test1` ( `a` int NOT NULL, `b` bigint DEFAULT NULL, KEY `a` (`a`), KEY `b` (`b`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec) |
Populated with 10M test rows, it takes 748M on disk. Now, given that my test table has column “a” with unique values:
mysql > select count(*) from test1; +----------+ | count(*) | +----------+ | 10000000 | +----------+ 1 row in set (1.34 sec) mysql > select count(DISTINCT(a)) from test1; +--------------------+ | count(DISTINCT(a)) | +--------------------+ | 10000000 | +--------------------+ 1 row in set (5.25 sec) |
I will change the (secondary) index type to primary:
mysql > alter table test1 add primary key(a), drop key a; Query OK, 0 rows affected (48.90 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql > show create table test1\G *************************** 1. row *************************** Table: test1 Create Table: CREATE TABLE `test1` ( `a` int NOT NULL, `b` bigint DEFAULT NULL, PRIMARY KEY (`a`), KEY `b` (`b`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec) |
The table was re-created as a result, and its size on disk was reduced to 588M, quite significantly! Why has this happened? We have exactly the same data, and both columns are indexed in both cases! Let’s check more details about the table before and after the change.
Before, without PK, when both columns were indexed via a secondary key, we could see the following:
mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name, CLUST_INDEX_SIZE, OTHER_INDEX_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) JOIN information_schema.INNODB_TABLESTATS ts WHERE t.NAME=ts.NAME AND t.NAME='db1/test1'\G *************************** 1. row *************************** SPACE: 50 INDEX_ID: 232 index_name: a table_name: db1/test1 CLUST_INDEX_SIZE: 24699 OTHER_INDEX_SIZE: 22242 *************************** 2. row *************************** SPACE: 50 INDEX_ID: 231 index_name: b table_name: db1/test1 CLUST_INDEX_SIZE: 24699 OTHER_INDEX_SIZE: 22242 *************************** 3. row *************************** SPACE: 50 INDEX_ID: 230 index_name: GEN_CLUST_INDEX table_name: db1/test1 CLUST_INDEX_SIZE: 24699 OTHER_INDEX_SIZE: 22242 3 rows in set (0.00 sec) |
So, there is a third index! As seen in a more detailed view of each index via the innodb_ruby tool, it’s the biggest one in terms of size (id=230):
$ innodb_space -f msb_8_3_0/data/db1/test1.ibd space-indexes id name root fseg fseg_id used allocated fill_factor 230 4 internal 3 27 27 100.00% 230 4 leaf 4 24634 24672 99.85% 231 5 internal 5 21 21 100.00% 231 5 leaf 6 12627 12640 99.90% 232 6 internal 7 13 13 100.00% 232 6 leaf 8 9545 9568 99.76% |
This is how the InnoDB engine works; if no explicit PK is defined, it will add an internal one named GEN_CLUST_INDEX
. As it contains the whole data row, it’s size overhead is significant.
After replacing the secondary index with the explicit primary key, the hidden one is no longer needed. Therefore, we’re left with two indexes in total:
mysql > select SPACE,INDEX_ID,i.NAME as index_name, t.NAME as table_name, CLUST_INDEX_SIZE,OTHER_INDEX_SIZE from information_schema.INNODB_INDEXES i JOIN information_schema.INNODB_TABLESPACES t USING(space) JOIN information_schema.INNODB_TABLESTATS ts WHERE t.NAME=ts.NAME AND t.NAME='db1/test1'\G *************************** 1. row *************************** SPACE: 54 INDEX_ID: 237 index_name: b table_name: db1/test1 CLUST_INDEX_SIZE: 23733 OTHER_INDEX_SIZE: 13041 *************************** 2. row *************************** SPACE: 54 INDEX_ID: 236 index_name: PRIMARY table_name: db1/test1 CLUST_INDEX_SIZE: 23733 OTHER_INDEX_SIZE: 13041 2 rows in set (0.01 sec) |
$ innodb_space -f msb_8_3_0/data/db1/test1.ibd space-indexes id name root fseg fseg_id used allocated fill_factor 236 4 internal 3 21 21 100.00% 236 4 leaf 4 20704 23712 87.31% 237 5 internal 5 17 17 100.00% 237 5 leaf 6 11394 13024 87.48% |
Hidden (internal) clustered index (GEN_CLUST_INDEX) vs. generated invisible primary key (GIPK)
Every InnoDB table has a clustered key, so not defining one will not save any disk space, and sometimes, on the contrary, as demonstrated above. So, even if none of the existing columns of your problematic table is unique, it is still best to add another unique column to act as the primary key. The internal GEN_CLUST_INDEX
one is not exposed to the upper MySQL layer, and only the InnoDB engine is aware of it, hence it is useless for replication speed. Therefore, an explicit PK is always a better solution.
However, if adding a new PK column is out of the question due to a legacy application problem, you should still enforce the primary key by using an invisible one. This way, you will gain performance benefits, and at the same time, the change is transparent to applications.
Let’s see how it works in practice:
mysql > set sql_require_primary_key=1; Query OK, 0 rows affected (0.00 sec) mysql > create table nopk (a int); ERROR 3750 (HY000): Unable to create or change a table without a primary key, when the system variable 'sql_require_primary_key' is set. Add a primary key to the table or unset this variable to avoid this message. Note that tables without a primary key can cause performance problems in row-based replication, so please consult your DBA before changing this setting. mysql > set sql_generate_invisible_primary_key=1; Query OK, 0 rows affected (0.00 sec) mysql > create table nopk (a int); Query OK, 0 rows affected (0.02 sec) mysql > show create table nopk\G *************************** 1. row *************************** Table: nopk Create Table: CREATE TABLE `nopk` ( `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT /*!80023 INVISIBLE */, `a` int DEFAULT NULL, PRIMARY KEY (`my_row_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec) mysql > select * from nopk; +------+ +------+ | 100 | +------+ 1 row in set (0.00 sec) |
So, our application is not aware of the new column at all. But we can still use it if needed, for example, to easily split table reads or writes into predictable chunks:
mysql > select my_row_id,a from nopk; +-----------+------+ | my_row_id | a | +-----------+------+ | 1 | 100 | +-----------+------+ 1 row in set (0.00 sec) |
Please note that for existing schemas lacking a primary key, before you enforce the sql_require_primary_key
variable, best to first enable the sql_generate_invisible_primary_key
and re-create the data using logical dump and restore. A simple table optimization will not add the invisible PK. In any case, having an invisible PK should be a win-win solution for legacy applications.
To summarize:
• It may be worth checking if changing the index type can save disk space!
• If adding a primary key wasn’t possible due to the application limits, consider using an invisible one!
Percona Distribution for MySQL is a complete, stable, scalable, and secure open source MySQL solution, delivering enterprise-grade database environments for your most critical business applications. Deploy anywhere and implement easily with one-to-one compatibility with MySQL Community Edition.
Try Percona Distribution for MySQL today!
Share This Post!
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK