MyRocks Use Case: Big Dataset

source link: https://www.percona.com/blog/myrocks-use-case-big-dataset/

One of the questions I am often asked is in what cases I would prefer MyRocks over InnoDB. We have covered MyRocks in our blog previously:

MyRocks Performance – Percona Database Performance Blog

Saving With MyRocks in The Cloud – Percona Database Performance Blog

But it is worth refreshing this material.

This time I want to take an interesting (and real) dataset, which I also covered previously: the Reddit Comments dataset (see Big Dataset: All Reddit Comments – Analyzing with ClickHouse – Percona Database Performance Blog). The dataset is still available for download from http://files.pushshift.io/reddit/submissions/ and it includes all comments up to June 2022.

The size of the dataset is what makes it interesting: for example, the comments for January 2022 alone take 118GB, and the total January-June 2022 dataset is 729GB.

What is also interesting is that the dataset comes in JSON format, and MySQL now provides extensive capabilities for working with JSON documents directly.

For example, we can directly load the dataset as:

mysqlsh [email protected]/rs -- util importJson /storage/vadim/reddit/RS_2022-01 --table=stor --tableColumn=doc

Into a table created as:

CREATE TABLE stor (
doc json DEFAULT NULL,
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id)
)
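
As a quick sanity check after the load, we can pull individual fields out of the stored JSON documents with the inline path operators (this query is illustrative only and was not part of the original benchmark):

-- ->> extracts a path and unquotes the result; -> keeps the JSON quoting
SELECT id, doc->>"$.subreddit" AS subreddit, doc->>"$.author" AS author FROM stor LIMIT 3;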

Now we can compare load time into MyRocks vs InnoDB for the same files.
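
The only difference between the two runs is the engine in the table definition, roughly as follows (a sketch; it assumes a server with the MyRocks plugin enabled, such as Percona Server for MySQL):

-- MyRocks run
CREATE TABLE stor (
doc json DEFAULT NULL,
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id)
) ENGINE=ROCKSDB;

-- InnoDB run (the default engine)
CREATE TABLE stor (
doc json DEFAULT NULL,
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id)
) ENGINE=INNODB;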

It is worth noting that I used fast NVMe storage for the MySQL database, which is beneficial for InnoDB (InnoDB performs better on fast storage; see Saving With MyRocks in The Cloud – Percona Database Performance Blog).

So, loading the six files into the MyRocks storage engine:

Processed 124.26 GB in 32091070 documents in 1 hour 45 min 4.8562 sec (5.09K documents/s)
Processed 116.83 GB in 29843162 documents in 1 hour 40 min 45.0204 sec (4.94K documents/s)
Processed 128.81 GB in 32677372 documents in 1 hour 54 min 16.1496 sec (4.77K documents/s)
Processed 130.34 GB in 33002461 documents in 1 hour 56 min 43.0586 sec (4.71K documents/s)
Processed 142.88 GB in 34838318 documents in 2 hours 1 min 22.7340 sec (4.78K documents/s)
Processed 139.09 GB in 34395243 documents in 2 hours 4 min 49.8066 sec (4.59K documents/s)

And the final size in MyRocks is 238GB. MyRocks uses a mix of LZ4 and Zstandard compression to achieve this size (much smaller than the original dataset).
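
This layout, LZ4 for the upper LSM levels and Zstandard for the bottommost level, is the common MyRocks configuration. A minimal my.cnf sketch (rocksdb_default_cf_options is the standard MyRocks variable; the exact values in your installation may differ):

[mysqld]
# LZ4 on upper levels, Zstandard on the bottommost (largest) level
rocksdb_default_cf_options="compression=kLZ4Compression;bottommost_compression=kZSTD"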

The same for the InnoDB storage engine:

Processed 124.26 GB in 32091070 documents in 2 hours 17 min 33.0404 sec (3.89K documents/s)
Processed 116.83 GB in 29843162 documents in 2 hours 8 min 6.3595 sec (3.88K documents/s)
Processed 128.81 GB in 32677372 documents in 2 hours 28 min 8.5292 sec (3.68K documents/s)
Processed 130.34 GB in 33002461 documents in 2 hours 31 min 25.8357 sec (3.63K documents/s)
Processed 142.88 GB in 34838318 documents in 2 hours 44 min 56.9327 sec (3.52K documents/s)
Processed 139.09 GB in 34395243 documents in 2 hours 39 min 53.3889 sec (3.59K documents/s)

The final size in InnoDB is 991GB.

Note: InnoDB does not use compression in this test. Although InnoDB has compression capabilities, this is not what we typically recommend using, due to various issues.
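
For reference, classic InnoDB table compression would have to be enabled explicitly, for example (shown for illustration only; not used in this test):

-- zlib-based InnoDB table compression
ALTER TABLE stor ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;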

The average MyRocks insert throughput is 4.81K documents/sec and InnoDB's is 3.70K documents/sec, so MyRocks loads the data roughly 30 percent faster than InnoDB ((4.81 - 3.70) / 3.70 ≈ 0.30).

Now let’s compare some aggregation queries, which will also show how we can handle JSON documents in MySQL.

Find the top subreddits (by the number of comments) that are not marked as “Adult only” (I did not risk including “adult” subreddits in the output…):

SELECT doc->"$.subreddit",count(*) cnt FROM stor WHERE JSON_VALUE(doc, "$.over_18" RETURNING UNSIGNED) IS NOT TRUE GROUP BY 1 ORDER BY cnt DESC LIMIT 100;

The first lines of output:

+-----------------------+--------+
| doc->"$.subreddit"    | cnt    |
+-----------------------+--------+
| "AskReddit"           | 329683 |
| "teenagers"           | 131401 |
| "memes"               | 102118 |
| "AutoNewspaper"       |  82080 |
| "relationship_advice" |  66883 |
| "UltraAlgo"           |  64958 |
| "antiwork"            |  54697 |
| "NoStupidQuestions"   |  52531 |
| "NFTsMarketplace"     |  48647 |
| "CryptoCurrency"      |  46817 |

And execution times:

MyRocks: 100 rows in set (6 min 20.31 sec)
InnoDB: 100 rows in set (8 min 10.29 sec)

The not-so-good parts of MyRocks

Full disclosure: I also want to show where MyRocks performs worse than InnoDB. For this, I will use queries that extract the comments of a given author.

For this we need an index on “author”, and because everything is stored in a JSON document, we need to create a VIRTUAL column “author”, as:

ALTER TABLE stor ADD COLUMN author VARCHAR(255) GENERATED ALWAYS AS ( doc->"$.author" );

And for InnoDB this is practically an instant operation:

Query OK, 0 rows affected (1 min 22.61 sec)

While for MyRocks it takes time:

Query OK, 61934232 rows affected (45 min 46.61 sec)

The difference is that MyRocks does not yet support any “INSTANT” DDL operations, while InnoDB does. However, it is exactly these “INSTANT” operations that caused major bugs and incompatibilities in InnoDB versions 8.0.29 and 8.0.30 (see Percona XtraBackup 8.0.29 and INSTANT ADD/DROP Columns – Percona Database Performance Blog), so I would advise extra caution before using “INSTANT” operations in InnoDB.
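
If you want to control which path a DDL statement takes, MySQL lets you pin the algorithm explicitly and fails the statement if the engine cannot honor it; a sketch (the column definition matches the one above):

-- insist on a metadata-only change, or fail
ALTER TABLE stor ADD COLUMN author VARCHAR(255) GENERATED ALWAYS AS ( doc->"$.author" ), ALGORITHM=INSTANT;
-- or explicitly avoid INSTANT by forcing the in-place path
ALTER TABLE stor ADD COLUMN author VARCHAR(255) GENERATED ALWAYS AS ( doc->"$.author" ), ALGORITHM=INPLACE;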

And now we can create an index on a virtual column:

MyRocks:

ALTER TABLE stor ADD KEY (author);

Query OK, 0 rows affected (14 min 37.34 sec)

InnoDB:

ALTER TABLE stor ADD KEY (author);

Query OK, 0 rows affected (40 min 4.39 sec)

MyRocks took much less time to create the new index on the column. To show where MyRocks is worse, I will use the poor man’s benchmark available in MySQL, namely the BENCHMARK() function, such as:

SELECT BENCHMARK(50000000,(with t1 as (with q as (select FLOOR(1 + RAND()*(7685162 -1 + 1)) c1) select * from authors,q where id=q.c1) select count(*) cnt from stor,t1 where stor.author=t1.author limit 100));

Basically, I execute, 50 million times, a query that counts the comments of a randomly chosen author.
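
The authors table referenced above is a lookup table of author names keyed by sequential ids (7,685,162 of them, hence the constant inside RAND()). The original post does not show its definition; a minimal sketch of how such a table could be built:

-- hypothetical reconstruction: one row per distinct author
CREATE TABLE authors (
id int NOT NULL AUTO_INCREMENT,
author varchar(255) DEFAULT NULL,
PRIMARY KEY (id)
);
INSERT INTO authors (author) SELECT DISTINCT author FROM stor;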

For InnoDB execution times (less is better, three different attempts):

1 row in set (2 min 14.66 sec)
1 row in set (1 min 53.10 sec)
1 row in set (1 min 40.18 sec)

For MyRocks:

1 row in set (6 min 38.60 sec)
1 row in set (5 min 42.83 sec)
1 row in set (6 min 4.76 sec)

To put this into perspective, this works out to about 442K queries/sec for InnoDB and 137K queries/sec for MyRocks (50 million lookups divided by the roughly 113-second and 365-second middle runs; compression and the write-optimized data structure play a role here). We need to highlight that these results are WITHOUT network communication. In normal circumstances there will be added network latency, and the difference between the two engines will be a lot smaller.

Conclusion

MyRocks makes a good case for big datasets (in our test we loaded a ¾ TB dataset), providing a good insertion rate and a small compressed size for the final product.

As a drawback of compression and the write-optimized engine, MyRocks lags behind InnoDB on quick “index lookup” queries, which is acceptable in my opinion, but you should evaluate this for your workload.

Appendix:

Hardware specifications and configurations:

2022-MyRocks-reddit/benchmarkspec.md at main · Percona-Lab-results/2022-MyRocks-reddit (github.com)

