Complex Archival with Percona Toolkit's pt-archiver

The Problem

I recently worked on a customer engagement where the customer needed to archive a large number of rows from different tables into another server (in this example, for simplicity, I am just archiving the results into a file).

As explained in this other blog post, “Want to archive tables? Use Percona Toolkit’s pt-archiver“, you can use pt-archiver to purge/archive rows from a table that match any “WHERE” condition, but this case was not that easy, as the archive/delete condition was complex and involved joining many tables.

The archive condition involved four tables, with the table schema and query shown below. In the example there are no foreign keys, but this method can also be used with foreign keys by reordering the sequence in which the tables are archived/purged.

(Figure: schema of the four tables involved in the archive condition)
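
A minimal sketch of what the four table definitions might look like, reconstructed only from the join columns in the delete condition below (column names, data types, and indexes are assumptions, not the customer's real schema):

SQL
-- Assumed minimal schema; no foreign keys, as stated in the example.
CREATE TABLE table2 (
  id         INT NOT NULL PRIMARY KEY
);

CREATE TABLE table4 (
  id         INT NOT NULL PRIMARY KEY,
  cond       VARCHAR(32) NOT NULL
);

CREATE TABLE table1 (
  id         INT NOT NULL PRIMARY KEY,
  table2_id  INT NOT NULL,
  created_at DATETIME NOT NULL,
  KEY idx_created_at (created_at)
);

CREATE TABLE table3 (
  id         INT NOT NULL PRIMARY KEY,
  table1_id  INT NOT NULL,
  table4_id  INT NOT NULL,
  KEY idx_table1_id (table1_id),
  KEY idx_table4_id (table4_id)
);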

And the delete condition is the following:

SQL
DELETE table1, table2, table3, table4
FROM table1
INNER JOIN table3 ON table1.id = table3.table1_id
INNER JOIN table2 ON table1.table2_id = table2.id
INNER JOIN table4 ON (table3.table4_id = table4.id AND table4.cond = 'Value1')
WHERE table1.created_at < '2020-01-01 00:00:00';

It can be seen that whether a row can be archived depends on the existence and state of related rows in other tables. Trying to purge/archive one table at a time is not a viable solution, because once a row has been purged/archived, it is no longer possible to find the other referenced rows that need to be purged/archived together with it.

So, how do we proceed in this case?

The Solution

To tackle the above problem, the best approach is to set up a transient table containing all the tuples of row ids to be purged/archived, e.g.:

MySQL
mysql> select * from tmp_ids_to_remove ; 
+-----------+-----------+-----------+-----------+
| table1_id | table2_id | table3_id | table4_id |
+-----------+-----------+-----------+-----------+
|         1 |         1 |         1 |         1 |
|         1 |         1 |         2 |         1 |
|         1 |         1 |         3 |         1 |
|         3 |         3 |         5 |         3 |
+-----------+-----------+-----------+-----------+

For the above example, the following rows from each table have to be purged:

  • Table1: ids = {1,3}
  • Table2: ids = {1,3}
  • Table3: ids = {1,2,3,5}
  • Table4: ids = {1,3}
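
The definition of tmp_ids_to_remove is not shown in the post; a minimal sketch that would support the lookups pt-archiver performs (the per-column indexes match the recommendation under “Things to consider” below) could be:

SQL
-- Assumed definition of the transient table; one indexed column per source table.
CREATE TABLE percona.tmp_ids_to_remove (
  table1_id INT NOT NULL,
  table2_id INT NOT NULL,
  table3_id INT NOT NULL,
  table4_id INT NOT NULL,
  KEY idx_table1_id (table1_id),
  KEY idx_table2_id (table2_id),
  KEY idx_table3_id (table3_id),
  KEY idx_table4_id (table4_id)
);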

Then pt-archiver from Percona Toolkit can be used to purge/archive one table at a time, checking that the row to be purged exists in “tmp_ids_to_remove”. The pt-archiver --where expression would be similar to:

Shell
--where 'EXISTS(SELECT tableX_id FROM percona.tmp_ids_to_remove purge_t WHERE id=purge_t.tableX_id)'
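
As an illustration, a complete pt-archiver invocation for table1 might look like the following (host, credentials, database name, and batching options are placeholders; adapt them to your environment):

Shell
pt-archiver \
  --source h=localhost,D=mydb,t=table1,u=archiver,p=secret \
  --file '/tmp/%D.%t.out' \
  --where 'EXISTS(SELECT table1_id FROM percona.tmp_ids_to_remove purge_t WHERE id=purge_t.table1_id)' \
  --limit 1000 --commit-each --progress 10000 --statistics

With --file, pt-archiver writes each archived row to the file and deletes it from the source table; the same command is then repeated for table2, table3, and table4 using the corresponding tableX_id column.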

And the query for populating the table should be something similar to INSERT INTO tmp_ids_to_remove ( SELECT <query with the delete condition>), e.g.:

SQL
INSERT INTO percona.tmp_ids_to_remove ( SELECT table1.id, table2.id, table3.id, table4.id
FROM table1
INNER JOIN table3 ON table1.id = table3.table1_id
INNER JOIN table2 ON table1.table2_id = table2.id
INNER JOIN table4 ON (table3.table4_id = table4.id AND table4.cond = 'Value1')
WHERE table1.created_at < '2020-01-01 00:00:00');

Things to consider:

  • Instead of creating one “big” table containing all the rows, multiple smaller tables can be created. For simplicity and an easier view of the data, one big table was used in this example.
  • The above insert might lock a lot of rows, which can impact server performance depending on the transaction size and the current server load. Either run the query outside business hours or, if that is not possible and referential integrity must be kept, use SELECT …. INTO OUTFILE and then load the result into another table; the SELECT part is faster and non-locking.
  • The tmp_ids_to_remove table should have an index on each column, since pt-archiver needs those indexes to quickly check which rows have to be removed.
  • If the amount of data you need to purge/archive is in the range of several GB, you should adjust the “WHERE” condition so that only a few million rows are processed at a time, working in batches (see the sketch after this list). Trying to execute a huge transaction (by either populating a big enough tmp_ids_to_remove or purging/archiving all rows at once) will be detrimental to performance.
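
A minimal sketch of the batched approach, simply narrowing the original delete condition to one month per run (the date boundaries are arbitrary examples):

SQL
-- Populate the transient table in smaller batches instead of one huge transaction,
-- e.g. one month of data per run.
INSERT INTO percona.tmp_ids_to_remove
SELECT table1.id, table2.id, table3.id, table4.id
FROM table1
INNER JOIN table3 ON table1.id = table3.table1_id
INNER JOIN table2 ON table1.table2_id = table2.id
INNER JOIN table4 ON (table3.table4_id = table4.id AND table4.cond = 'Value1')
WHERE table1.created_at >= '2019-12-01 00:00:00'
  AND table1.created_at <  '2020-01-01 00:00:00';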

Note: The above solution aims for data consistency at the cost of performance. If for whatever reason the purge/archive gets stopped halfway through, you will still know which row ids are meant for purging, since they are kept in the tmp_ids_to_remove table.
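
For example, after an interruption, a quick way to check how much work remains for table1 (a sketch, assuming the schema above) is to count the ids in tmp_ids_to_remove that still exist in the source table:

SQL
-- table1 rows that are marked for purging but have not been archived yet
SELECT COUNT(*)
FROM percona.tmp_ids_to_remove purge_t
INNER JOIN table1 ON table1.id = purge_t.table1_id;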

On my GitHub repository, you can find an example scenario file and an example script for doing a test archive. The script is a proof of concept (POC) and should only be executed in a test environment.

Instructions for usage are:

  • Download the scripts:
Shell
curl https://raw.githubusercontent.com/ctutte/blog_complex_archive/master/setup.sql > setup.sql
curl https://github.com/ctutte/blog_complex_archive/blob/master/archiver_script.sh > archiver_script.sh
  • Create the test env:
Shell
mysql -u root -p < setup.sql
  • Configure the script:
Shell
chmod a+x archiver_script.sh
  • On archiver_script.sh, configure the various parameters at the top (USER/PASS/SOURCE_DSN); example values are sketched after this list
  • Finally, execute the script:
Shell
./archiver_script.sh
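
What that configuration might look like (hypothetical values only; the actual variable names and DSN format are defined at the top of archiver_script.sh):

Shell
# Hypothetical example values; adjust to your environment.
USER=root                            # MySQL user allowed to delete from the archived tables
PASS=secret                          # password for that user
SOURCE_DSN="h=localhost,D=percona"   # pt-archiver source DSN (host and database)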

The archived rows are deleted from the DB and written to a /tmp/table_name.out file.

Conclusion

Trying to purge/archive rows under complex conditions, or while trying to keep data consistency, can be hard. The above solution generates an intermediate table and relies on pt-archiver to purge/archive rows in a tidy way; it can be automated to purge/archive millions of rows that would otherwise not be possible to handle manually.

Note: This example comes from a real case scenario, but it was obfuscated and simplified. It might still seem “unnecessarily complex”, but it was kept like that so that the proposed solution makes sense.

Under similar scenarios, a much easier/faster solution might be suitable, but other times due to business logic or other restrictions, a more complex solution must be implemented. 

