18

TEXT and BLOB good practices

 5 years ago
source link: https://www.tuicool.com/articles/hit/3m6vaae
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
eURbm2r.jpg!web According to Steve McQueen, it’s better to avoid BLOB

TEXT and BLOB are MySQL variable-length types for long texts and long binary data. They are often misused. Sometimes they are used because they are an ORM’s default type for strings. Sometimes because the user thinks that VARCHAR has a 255 length limit (which was true in very old MySQL versions, but not today).

In this article, I argue that sometimes these types should be avoided, and I talk a bit about good practices.

Choosing between TEXT and BLOB

TEXT and BLOB are similar datatypes. The difference between them is that TEXT has the notions of character set and collation, which allow a smart string comparison and string sorts. This means that MySQL knows the alphabetical order, whether case matters or not, if an accented A should be considered equal to a plain A, and so on. On the other hand, BLOB is treated as a mere sequence of bytes, which works well – for example – for videos or audio contents, compiled code, and other data in binary format. BLOB msy also work well for texts, as long as sorting and smart comparisons don’t matter, and the applications know how to decode the characters. For example, I would use VARBINARY (the smaller version of BLOB ) to store hashes and checksums, because the only operation we need to do with them is a byte-by-byte comparison.

Choosing a size

In this article, I use generically the terms TEXT and BLOB . But there are actually several datatypes in this family:

  • TINYTEXT / TINYBLOB
  • SMALLTEXT / SMALLBLOB
  • MEDIUMTEXT / MEDIUMBLOB
  • TEXT / BLOB
  • LONGTEXT / LONGBLOB

Theoretically, one should always choose the smallest possible datatype. For example notice that LONGBLOB is far bigger than the necessary in the vast majority of cases, and often even TINYTEXT or TINYBLOB are big enough. But in practice, if you are not sure, you can just be pessimistic and choose the type you feel more comfortable with. You will not allocate more space on disk or memory, and there will hardly be a noticeable performance loss.

Do you need to store BLOBs in the database?

Now that we discussed the characteristics of TEXT and BLOB macro-types and the specific types, we know enough to ask ourselves a more important question: should we actually use BLOB ? In other words, do we really need to store images, audios, videos, binary data, etc, in the database?

Nowadays, most people would say no, and in most cases they are right. Storing media contents takes space on the same disk used to store the rest of the database, and this could be a problem if your data is big-ish and your disk is small-ish. Writing those data makes transactions longer, which could create performance or locking problems, and use more CPU time. Reading those data copies them into the buffer pool, using space that maybe would better be used for other, smaller values. Both writing and reading these data will consume network bandwidth. Replication may also suffer and lag because the writes to replicate are bigger.

In PostgreSQL there are no TEXT and BLOB types but one can use, respectively, big varchar s and bytea instead. Some of the previously mentioned concerns are also valid in Postgres. Another one should also be considered: from a physical point of view, UPDATE statements create a copy of the affected rows. Therefore, storing big data in rows that are frequently UPDATE d will make the table bloat (regardless which columns are modified).

For these reasons, people usually don’t store media data in relational databases . They store media files URLs instead – or not even that, if the application is able to find out the URLs automatically.

On the other side, there is a reason to store media data in relational databases. Imagine you store an image in a file server and its metadata in a DBMS. Transactions guarantee that all metadata will persist or the transaction will fail. The file server may give similar guarantees. But even in that case, you have two different transactions for a single logical data unit , which makes transactions substantially useless. So, depending on how much you value your data, you may still choose to store images in a DBMS.

Should BLOBs be in a separate table?

InnoDB internally stores TEXT , BLOB , VARCHAR and VARBINARY types in a special way, because they can be big. If a value is shorter than 768 bytes, it is stored in the same memory page as the rest of the row. Otherwise, it is stored in one or more separate pages, depending on how big it is. In the row page, a 20 bytes pointer is written for every additional page.

This is true with the DYNAMIC and COMPRESSED row formats. DYNAMIC is the default with MySQL 8.0 and MariaDB 10.2, but COMPACT was the default in older versions. This means that, if you didn’t set the row format to DYNAMIC explicitly and you don’t use compressed tables, long TEXT and BLOB values are stored in-row.

The former method makes the row faster to read if you don’t read long TEXT and BLOB columns, because it can save a lot of IO. The second method makes all the reads on the table slower.

There is however one thing to note: most developers tend to always use SELECT * , which reads all columns. As a consequence, even if InnoDB wisely stores long TEXT and BLOB s in separate pages, in practice many queries unnecessarily read those pages.

Also, as we mentioned, those queries will probably move TEXT and BLOB values into the buffer pool. And returning those values to the clients will increase the network traffic.

The wise solution is to convince developers to list the columns they need in their queries. But in my experience it is a very hard task. Even harder is to ask them to review old queries to fix this problem. Not only because they don’t want to do such a boring (and error-prone) task, but also because they cannot. The management decides their priorities, and those priorities are usually new features. Consolidation of existing code is usually considered as an optional effort that doesn’t create much value, in some places it’s almost regarded as a hobby. See my article about the important ofsetting proper SLOs, it should provide enough information to understand the economical value of code consolidation.

I’m not saying you should give up with query optimisations. Poorly optimised queries put a hard limit to the performance and scalability goals we can achieve. Developers should dedicate a part of their time to write queries properly, and refactor the ones that create issues.

That said, we need to be realistic. Suppose a development team comes up with a new table like this, that is going to be frequently read.

CREATE TABLE product (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    description TEXT,
    photo_front BLOB,
    photo_back BLOB
);

Notice that there are three TEXT or BLOB columns. If we review the new tables before they make it to production, we can reach the developers and ask how often those columns will be accessed. Don’t expect that they can come up with precise information about the usage of a feature that is not yet released, they are not time lords . But their educated guesses are often close enough. Suppose that the answer is: photo_front is accessed by almost all queries on the product table, which is expected to happen several times a second, but description and photo_back are only accessed by the query that returns the details on a single product, which will be called much less often.

Hopefully you can look together at some graphs about URLs usage and do some basic math, which will make the guess more educated.

So, something that we can do is to move description and photo_back to a separate table to avoid the overhead caused by SELECT * (again: no matter how better it is to avoid it, if you know that it’s not going to happen). The product details query will be a JOIN , but we decided that we don’t care because it will happen seldom.

The problem with DEFAULT values

Only starting from MySQL 8.0 and MariaDB 10.2 it is possible to assign a default value to TEXT or BLOB columns. These versions significantly improved the default values handling. Older versions leave us with two options:

  • Always specifying an empty string in our INSERT s, if we don’t want to write
  • Avoiding to declare the column as NOT NULL . In this way, the default value will be NULL .

But NULL isinconsistent in SQL, so I’d prefer to avoid it when possible.

Toodle pip,
Federico


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK