
std::string is not a Container for Raw Data

 5 years ago
source link: https://www.tuicool.com/articles/hit/Ev6re2y

Sometimes we need unformatted data, simple byte sequences. At first glance, std::string might be a fitting data structure for that, but it is not.

Think about data we get from a network, a CAN bus, or another process: serialized binary data that has to be interpreted before it can be used in our business logic. The natural way to manage this kind of data is a sequence container such as std::vector or std::array of std::byte or, lacking C++17 support, unsigned char. Sometimes we also see uint8_t, which on many platforms is unsigned char.
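As a minimal sketch of what this looks like in practice, assuming C++17: the alias Payload and the helper read_u16_be are names made up for this illustration, and the big-endian layout is just an assumed wire format.

```cpp
#include <cstddef>   // std::byte, std::to_integer (C++17)
#include <cstdint>
#include <vector>

// Hypothetical alias: a payload as it might arrive from a socket or a CAN frame.
using Payload = std::vector<std::byte>;

// Decode a big-endian 16-bit value starting at `offset`.
// The caller must ensure that offset + 1 < data.size().
std::uint16_t read_u16_be(const Payload& data, std::size_t offset) {
    const auto hi = std::to_integer<std::uint16_t>(data[offset]);
    const auto lo = std::to_integer<std::uint16_t>(data[offset + 1]);
    return static_cast<std::uint16_t>((hi << 8) | lo);
}
```

Because std::byte offers no arithmetic, every interpretation step has to go through an explicit conversion such as std::to_integer, which keeps "raw" and "interpreted" values visibly apart.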

However, there is another contiguous container for 8-bit values that looks tempting as a means to transport byte sequences: std::string. The only motivation I can think of is that std::string is slightly less to type than std::vector<unsigned char>, which is to say that I cannot see any good reason at all. On the contrary, it is a bad idea for several reasons.

‘\0’ delimiters

Many string operations rely on zero-terminated character sequences: there is exactly one null character, and it sits at the end. Plain byte sequences, on the other hand, can contain any number of null bytes anywhere. While std::string can store sequences with embedded null characters, we have to be very careful not to use functions that take const char*, because those treat the first null character as the end of the data and silently truncate everything after it.
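The truncation is easy to reproduce. The following sketch uses two hypothetical helpers, make_raw and through_c_string, to show how four bytes shrink to one as soon as the data passes through a const char*:

```cpp
#include <string>

// Four raw bytes with embedded zeros, stored (ill-advisedly) in a std::string.
// The explicit length keeps the constructor from stopping at the first '\0'.
std::string make_raw() {
    return std::string("\x01\x00\x02\x00", 4);
}

// Round trip through const char*, as happens whenever the data is handed to
// a function with a const char* parameter: the length is recomputed by
// scanning for the first null character.
std::string through_c_string(const std::string& s) {
    return std::string(s.c_str());
}
```

Here make_raw().size() is 4, while through_c_string(make_raw()).size() is 1: everything after the first null byte is gone, and nothing warns us about it.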

Semantics

The major reason not to use std::string is semantics: When we see that type in our code, we naturally expect a series of readable characters. We expect some text. When it is misused as a series of raw bytes, it is confusing to maintainers of our codebase. It gets even worse if we expose the use of std::string as a raw data container via an API that has to be used by someone else.

Especially in locations where we convert text to serialized raw data or vice versa, it will be very confusing to determine which std::string is text and which is raw data.

Type safety

Apart from confusing the developer, having the same type for two unrelated, nontrivial uses is error-prone, because it bypasses the safety mechanisms that the strong typing of C++ gives us. Imagine, for example, a function that takes some text and some serialized raw data: both parameters would be of type std::string, and the arguments could easily switch places by accident without the compiler noticing.
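To make the problem concrete, here is a small sketch assuming C++17. The functions log_packet_unsafe and log_packet and the alias Bytes are hypothetical names; the static_asserts only document what the compiler accepts and rejects.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

using Bytes = std::vector<std::byte>;

// Text and raw data share one type: swapped arguments compile silently.
std::size_t log_packet_unsafe(const std::string& description,
                              const std::string& payload) {
    return description.size() + payload.size();
}

// Distinct types: the compiler can tell the two arguments apart.
std::size_t log_packet(const std::string& description, const Bytes& payload) {
    return description.size() + payload.size();
}

// Any two strings are accepted, in either order.
static_assert(std::is_invocable_v<decltype(log_packet_unsafe), std::string, std::string>);
// The typed version accepts the intended order...
static_assert(std::is_invocable_v<decltype(log_packet), std::string, Bytes>);
// ...and rejects the swapped one at compile time.
static_assert(!std::is_invocable_v<decltype(log_packet), Bytes, std::string>);
```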

Conclusion

Instead of std::string, use std::vector<std::byte> or std::vector<unsigned char>. While this already nicely says “sequence of bytes”, consider using a typedef. For even stronger typing, use a wrapper structure with a meaningful name.
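Both options could look roughly like this, assuming C++17. ByteBuffer, SerializedFrame, and frame_size are names invented for this sketch, not part of any library.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// A type alias documents intent, but it is still the very same type.
using ByteBuffer = std::vector<std::byte>;
static_assert(std::is_same_v<ByteBuffer, std::vector<std::byte>>);

// A wrapper struct is a distinct type with a meaningful name.
struct SerializedFrame {
    explicit SerializedFrame(ByteBuffer b) : bytes(std::move(b)) {}
    ByteBuffer bytes;
};

// A plain byte buffer does not silently convert into a frame.
static_assert(!std::is_convertible_v<ByteBuffer, SerializedFrame>);

std::size_t frame_size(const SerializedFrame& frame) {
    return frame.bytes.size();
}
```

The alias costs nothing and improves readability; the wrapper additionally lets the compiler reject a buffer that was never meant to be a frame, at the price of an explicit construction at each call site.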

