
MySQL Data Archival With Minimal Disruption

source link: https://www.percona.com/blog/mysql-data-archival-with-minimal-disruption/


We all know that data is important, and some businesses need historical data to be available all the time. The problem is that queries on large tables often perform poorly if they are not properly optimized. In Managed Services, we get many customer requests to purge or archive large tables, and we use pt-archiver to achieve it.

Recently, we received a request to archive a large table, and the customer was worried about the downtime and performance issues during the archival.

We proposed archiving the table using pt-archiver. The idea is to archive old data to another table and keep only the latest data in the current table, with minimal performance impact. All of the data remains available and can be queried anytime.

Percona Toolkit

In this blog, I will not explain how to use pt-archiver; instead, we will discuss a specific use case for it.

If you are unfamiliar with pt-archiver, please refer to “Want to archive tables? Use Percona Toolkit’s pt-archiver”; it explains how pt-archiver works and its various arguments.

For demonstration, we will test data archival that keeps 100 days’ worth of data, using a few simple steps.

Note: This method uses a timestamp column to filter the data.

  1. Create two dummy tables.
  2. Insert records in the source table.
  3. Archive the records from the source to the destination table using a --where condition per business requirements.
  4. Rename the tables.
  5. Add pt-archiver as a cron job.

This diagram better illustrates the process.

[Diagram: the archival process, from copying recent data into the new table to the table rename and the ongoing pt-archiver job]

Remember: the date and --where condition in this example are just a reference. Use an archiving condition in pt-archiver that matches your business requirements.

Let’s create a source table and insert records using mysql_random_data_load:

Shell
Create Table: CREATE TABLE `blogpost` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `created_at` timestamp NULL DEFAULT NULL,
  `updated_at` timestamp NULL DEFAULT NULL,
  `deleted_at` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `blogpost_created_at_index` (`created_at`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
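
For reference, here is a hedged example of how the dummy rows could be loaded with mysql_random_data_load (the host, credentials, and row count are placeholders; flags may differ slightly by tool version):

Shell
# Load 100,000 random rows into test.blogpost (placeholder connection details)
mysql_random_data_load test blogpost 100000 \
  --host=127.0.0.1 --port=3306 --user=root --password=<password>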

Let’s create the destination table:

(Using the following table for demonstration purposes only)

Shell
create table if not exists blogpost_new like blogpost;
Query OK, 0 rows affected (0.02 sec)

Create the following triggers (as generated by pt-online-schema-change).

Why triggers?

Any modifications to data in the original table during the copy will be reflected in the new table, because pt-online-schema-change creates triggers on the original table to update the corresponding rows in the new table.

(For demonstration purposes, I have added triggers created from the pt-online-schema-change test run)

Shell
# Event: DELETE
delimiter //
CREATE TRIGGER `pt_osc_test_blogpost_del` AFTER DELETE ON `test`.`blogpost` FOR EACH ROW BEGIN DECLARE CONTINUE HANDLER FOR 1146 begin end; DELETE IGNORE FROM `test`.`blogpost_new` WHERE `test`.`blogpost_new`.`id` <=> OLD.`id`; END //
delimiter ;
# Event: UPDATE
delimiter //
CREATE TRIGGER `pt_osc_test_blogpost_upd` AFTER UPDATE ON `test`.`blogpost` FOR EACH ROW BEGIN DECLARE CONTINUE HANDLER FOR 1146 begin end; DELETE IGNORE FROM `test`.`blogpost_new` WHERE !(OLD.`id` <=> NEW.`id`) AND `test`.`blogpost_new`.`id` <=> OLD.`id`; REPLACE INTO `test`.`blogpost_new` (`id`, `created_at`, `updated_at`, `deleted_at`) VALUES (NEW.`id`, NEW.`created_at`, NEW.`updated_at`, NEW.`deleted_at`); END //
delimiter ;
# Event: INSERT
delimiter //
CREATE TRIGGER `pt_osc_test_blogpost_ins` AFTER INSERT ON `test`.`blogpost` FOR EACH ROW BEGIN DECLARE CONTINUE HANDLER FOR 1146 begin end; REPLACE INTO `test`.`blogpost_new` (`id`, `created_at`, `updated_at`, `deleted_at`) VALUES (NEW.`id`, NEW.`created_at`, NEW.`updated_at`, NEW.`deleted_at`); END //
delimiter ;

Verify that all triggers have been created.

Shell
SELECT trigger_name FROM INFORMATION_SCHEMA.TRIGGERS WHERE trigger_schema='test' AND event_object_table='blogpost';

The output should be something like:

Shell
+--------------------------+
| trigger_name             |
+--------------------------+
| pt_osc_test_blogpost_ins |
| pt_osc_test_blogpost_upd |
| pt_osc_test_blogpost_del |
+--------------------------+
3 rows in set (0.01 sec)
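
Once the triggers are in place, a quick sanity check (a hedged example with a made-up test row, not part of the original post) confirms that writes to blogpost propagate to blogpost_new:

Shell
-- Insert a test row into the source table...
INSERT INTO test.blogpost (created_at, updated_at) VALUES (NOW(), NOW());
-- ...and confirm the AFTER INSERT trigger replicated it into the new table.
SELECT * FROM test.blogpost_new ORDER BY id DESC LIMIT 1;
-- Optional cleanup: deleting the test row also removes it from blogpost_new
-- via the AFTER DELETE trigger.
DELETE FROM test.blogpost WHERE id = LAST_INSERT_ID();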

Copy the last 100 days of data using pt-archiver. Verify it first with --dry-run:

(A screen session can be used to run pt-archiver if the table is large.)

Shell
pt-archiver --source h=localhost,D=test,t=blogpost,i=blogpost_created_at_index \
--dest h=localhost,D=test,t=blogpost_new \
--where "created_at >= date_sub(curdate(), interval 100 day)" --no-delete --no-check-columns \
--limit=10000 --progress=10000 --no-check-charset --dry-run

The output should be something like this:

Shell
SELECT /*!40001 SQL_NO_CACHE */ `id`,`created_at`,`updated_at`,`deleted_at` FROM `test`.`blogpost` FORCE INDEX(`blogpost_created_at_index`) WHERE (created_at >= date_sub(curdate(), interval 100 day)) ORDER BY `created_at` LIMIT 10000
SELECT /*!40001 SQL_NO_CACHE */ `id`,`created_at`,`updated_at`,`deleted_at` FROM `test`.`blogpost` FORCE INDEX(`blogpost_created_at_index`) WHERE (created_at >= date_sub(curdate(), interval 100 day)) AND (((? IS NULL AND `created_at` IS NOT NULL) OR (`created_at` > ?))) ORDER BY `created_at` LIMIT 10000
INSERT INTO `test`.`blogpost_new`(`id`,`created_at`,`updated_at`,`deleted_at`) VALUES (?,?,?,?)

Let’s execute the pt-archiver:

(The following pt-archiver command will copy 100 days’ worth of data to the _new table, and the triggers will keep the _new table up to date.)

Shell
pt-archiver --source h=localhost,D=test,t=blogpost,i=blogpost_created_at_index \
--dest h=localhost,D=test,t=blogpost_new \
--where "created_at >= date_sub(curdate(), interval 100 day)" --no-delete --no-check-columns \
--limit=10000 --replace --progress=10000 --no-check-charset

The pt-archiver output should be something like this:

Shell
TIME                ELAPSED   COUNT
2022-09-28T23:50:20       0       0
2022-09-28T23:50:44      24   10000
2022-09-28T23:51:08      48   20000
2022-09-28T23:51:25      65   27590

Once pt-archiver finishes, check its exit code:

Shell
echo $?

(should be 0)

The next step is to check that the rows have been inserted into the new table and compare the counts with the original table.

NOTE: The results below are just test examples; use the correct date or --where condition:

Shell
select max(created_at) from test.blogpost_new;
+---------------------+
| max(created_at)     |
+---------------------+
| 2022-09-27 02:11:00 |
+---------------------+
Shell
select count(1) from test.blogpost where created_at <= date_sub('2022-09-27', interval 100 day);
+----------+
| count(1) |
+----------+
|    65366 |
+----------+
1 row in set (0.02 sec)
Shell
select count(1) from test.blogpost_new where created_at >= date_sub('2022-09-26', interval 100 day);
+----------+
| count(1) |
+----------+
|    23758 |
+----------+
1 row in set (0.01 sec)

Why rename tables? 

The idea is to keep 100 days’ worth of data in the new table and rename it to the original table name, while the original table, renamed with the _archive suffix, will have all the data, including the last 100 days.

Shell
RENAME TABLE blogpost TO blogpost_archive, blogpost_new to blogpost;

It should appear something like this:

Shell
+------------------+
| Tables_in_test   |
+------------------+
| blogpost         |
| blogpost_archive |
+------------------+

Drop the triggers created using pt-online-schema-change.

Shell
DROP TRIGGER IF EXISTS test.pt_osc_test_blogpost_ins;
DROP TRIGGER IF EXISTS test.pt_osc_test_blogpost_upd;
DROP TRIGGER IF EXISTS test.pt_osc_test_blogpost_del;

Once the table rename is completed, set up an ongoing job to copy data older than 100 days from the blogpost table to the _archive table.

Let’s add the pt-archiver command to cron to make the process automatic. (It is advisable to wrap the archiver command below in a script and test it before scheduling it; a sketch follows the command.)

Shell
pt-archiver --source h=localhost,D=test,t=blogpost,i=blogpost_created_at_index \
--dest h=localhost,D=test,t=blogpost_archive,b=0,L=yes \
--where "created_at <= date_sub(curdate(), interval 100 day)" --no-check-columns \
--limit=10000 --progress=10000 --replace --bulk-insert --no-check-charset > /path/to/logs 2>&1
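
As a hedged sketch (the script path, schedule, and log location are assumptions, not from the original post), the command can be wrapped in a small script and scheduled with cron:

Shell
#!/bin/bash
# /usr/local/bin/archive_blogpost.sh -- hypothetical wrapper script for the nightly archival
set -euo pipefail

pt-archiver --source h=localhost,D=test,t=blogpost,i=blogpost_created_at_index \
  --dest h=localhost,D=test,t=blogpost_archive,b=0,L=yes \
  --where "created_at <= date_sub(curdate(), interval 100 day)" --no-check-columns \
  --limit=10000 --progress=10000 --replace --bulk-insert --no-check-charset

# Example crontab entry (run nightly at 01:00):
# 0 1 * * * /usr/local/bin/archive_blogpost.sh >> /var/log/archive_blogpost.log 2>&1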

Now it is time to check if the cron was successful.

Check that the archived rows have been copied to the test.blogpost_archive table and deleted from test.blogpost:

Shell
SELECT COUNT(1) FROM test.blogpost_archive WHERE created_at <= date_sub('2022-09-02', interval 100 day);

Verify that the blogpost table holds only the last 100 days of data; the following query should return a count of 0:

Shell
SELECT COUNT(1) FROM test.blogpost WHERE created_at <= date_sub('2022-09-02', interval 100 day);
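
As an additional spot check (a hedged example, not from the original post), the oldest row left in the active table should fall within the 100-day retention window:

Shell
SELECT MIN(created_at) AS oldest_active_row FROM test.blogpost;
SELECT DATE_SUB(CURDATE(), INTERVAL 100 DAY) AS retention_boundary;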

I hope you found this pt-archiver use case helpful when you need to purge or archive large tables!

