GitHub - traversc/qs: Fast serialization of R objects
source link: https://github.com/traversc/qs
qs
Quick serialization of R objects
This package provides an interface for quickly writing (serializing) and reading (de-serializing) objects to and from disk. The goal of this package is to provide a lightning-fast and complete replacement for the saveRDS and readRDS functions in R.
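As a minimal sketch of that drop-in relationship, the snippet below writes a small data.frame to disk and reads it back. It assumes the qs package is installed and uses its qsave and qread functions; if qs is not available, it falls back to saveRDS and readRDS, which are called the same way. The data.frame and the temporary file path are made up for illustration.

```r
# Sketch: write an object to disk and read it back.
# Assumes the qs package is installed; otherwise falls back to base R.
df <- data.frame(
  id = 1:5,
  value = c(0.1, 0.2, 0.3, 0.4, 0.5),
  label = letters[1:5],
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".qs")  # illustrative temporary file

if (requireNamespace("qs", quietly = TRUE)) {
  qs::qsave(df, path)          # counterpart of saveRDS(df, path)
  restored <- qs::qread(path)  # counterpart of readRDS(path)
} else {
  saveRDS(df, path)
  restored <- readRDS(path)
}

unlink(path)

# TRUE when the restored object matches the original
roundtrip_ok <- isTRUE(all.equal(df, restored))
```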
Inspired by the fst package, qs uses a similar block-compression approach using the zstd library and direct "in memory" compression, which allows for lightning quick serialization. It differs in that it uses a more general approach for attributes and object references for common data types (numeric data, strings, lists, etc.), meaning any S3 object built on common data types, e.g., tibbles, timestamps, bit64, etc. can be serialized. For less common data types (formulas, environments, functions, etc.), qs relies on built-in R serialization functions via the RApiSerialize package followed by block-compression.
For character vectors, qs also uses the alt-rep (ALTREP) system to quickly read in string data.
Installation
install.packages("qs")
or devtools::install_github("traversc/qs")
(Requires R version 3.5 or higher)
Features
The table below compares the features of different serialization approaches in R.
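| Feature | qs | fst | saveRDS |
|---|---|---|---|
| Not Slow | ✔ | ✔ | X |
| Numeric Vectors | ✔ | ✔ | ✔ |
| Integer Vectors | ✔ | ✔ | ✔ |
| Logical Vectors | ✔ | ✔ | ✔ |
| Character Vectors | ✔ | ✔ | ✔ |
| Character Encoding | ✔ | (vector-wide only) | ✔ |
| Complex Vectors | ✔ | X | ✔ |
| Data.Frames | ✔ | ✔ | ✔ |
| On disk row access | X | ✔ | X |
| Attributes | ✔ | Some | ✔ |
| Lists / Nested Lists | ✔ | X | ✔ |
| Multi-threaded | X (Not Yet) | ✔ | X |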
Summary Benchmarks
The table below lists serialization speed for several different data types.
All speeds are in MB/s.

| Data type | R expression | qs write | qs read | saveRDS write | saveRDS read | fst (1 thread) write | fst (1 thread) read | fst (4 threads) write | fst (4 threads) read |
|---|---|---|---|---|---|---|---|---|---|
| Integer Vector | sample(1e8) | 1015.2 | 889.8 | 27.1 | 135.5 | 686.6 | 442.4 | 699.1 | 567.9 |
| Numeric Vector | runif(1e8) | 861.2 | 954.0 | 24.3 | 131.9 | 744.0 | 638.7 | 754.4 | 848.0 |
| Character Vector | qs::randomStrings(1e7) | 1312.9 | 715.8* | 49.1 | 43.9 | 1440.9 | 59.5 | 1536.3 | 59.3 |
| List | map(1:1e5,sample(100)) | 197.2 | 311.5 | 7.7 | 123.5 | N/A | N/A | N/A | N/A |
| Environment | map(1:1e5,sample(100)); names(x)<-1:1e5; as.environment(x) | 56.0 | 117.5 | 7.7 | 89.6 | N/A | N/A | N/A | N/A |
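The README does not say how these throughput figures were computed. One plausible way to obtain numbers of this kind, sketched below, is to divide the in-memory size of the object by the elapsed write or read time. The helper measure_throughput is hypothetical and not part of qs; it defaults to base R's saveRDS and readRDS so it runs without extra packages, and it uses a much smaller vector than the benchmark so it finishes quickly.

```r
# Hypothetical helper (not part of qs): estimate write/read throughput in MB/s
# as in-memory object size divided by elapsed wall-clock time.
measure_throughput <- function(x, write_fun = saveRDS, read_fun = readRDS) {
  path <- tempfile()
  on.exit(unlink(path))

  size_mb <- as.numeric(object.size(x)) / 1e6

  write_s <- system.time(write_fun(x, path))[["elapsed"]]
  read_s  <- system.time(restored <- read_fun(path))[["elapsed"]]

  # Make sure the round trip actually preserved the data
  stopifnot(identical(x, restored))

  # Floor the timings at 1 ms to avoid dividing by zero on very fast runs
  c(
    write_mb_s = size_mb / max(write_s, 1e-3),
    read_mb_s  = size_mb / max(read_s, 1e-3)
  )
}

# Much smaller than the sample(1e8) used in the table above
res <- measure_throughput(sample(1e6))
print(res)
```

Assuming qs is installed, passing qs::qsave and qs::qread as write_fun and read_fun would measure qs under the same conditions. Results depend heavily on hardware, disk caching and object size, so they are not expected to match the table.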
Additional Benchmarks
Data.Frame benchmark
Benchmarks for serializing and de-serializing large data.frames (5 million rows) composed of a numeric column (rnorm), an integer column (sample(5e6)), and a character vector column (random alphanumeric strings of length 50). See dataframe_bench.png for a comparison using different compression parameters.
This benchmark also includes materialization of alt-rep data, for an apples-to-apples comparison.
Serialization speed with default parameters:
| Method | write time (s) | read time (s) |
|---|---|---|
| qs | 0.49391294 | 8.8818166 |
| fst (1 thread) | 0.37411811 | 8.9309314 |
| fst (4 threads) | 0.3676273 | 8.8565951 |
| saveRDS | 14.377122 | 12.467517 |

Serialization speed with different parameters
The numbers in the figure reflect the compression parameter used. qs uses the zstd compression library, and compression parameters range from -50 to 22 (qs uses a default value of -1). fst defines its own compression range through a combination of zstd and lz4 algorithms, ranging from 0 to 100 (default: 0).
Nested List benchmark
Benchmarks for serialization of random nested lists with random attributes (approximately 50 MB). See the nested list example in the tests folder.
Serialization speed with default parameters:
| Method | write time (s) | read time (s) |
|---|---|---|
| qs | 0.17840716 | 0.19489372 |
| saveRDS | 3.484225 | 0.58762548 |