Super-fast flat file parsing in C# and Java with a perfect hash function
source link: https://blog.vermorel.com/journal/2015/2/16/super-fast-flat-file-parsing-in-c-and-java-with-a-perfect-ha.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Super-fast flat file parsing in C# and Java with a perfect hash function
At Lokad, (almost) all we do is to crunch flat text files. It’s not that we haven’t tried anything else - we did - many times - and it went poorly. Flat files are ubiquitous, well understood, and they yield very good performance both of the write side and the read side when working under tight budgets.
Keep in mind that the files we crunch are frequently generated by our clients, so while ProtoBuf or Cap’n Proto are very cool, asking our clients to deliver such formats would be roughly equivalent asking them to reimplement their in-house Java ERP in Haskell. To preserve the sanity of our clients, we keep it simple and we stick to flat files.
However we have decided to make flat file read fast, really fast. Thus, one of us decided to tackle the challenge dead-on, and came up with a very nice pattern: file parsing starts with a Perfect Hash Function preprocessing. Simply put, the flat file gets tokenized, and then each token gets replaced by an integer uniquely identifying this piece of string. Not only this saves a tremendous amount of string object instantiation, but afterward, all the complex parsing operations, such as parsing a date, can be performed only once, even if the token is encountered hundreds of times in the file. Performance-wise, it works because flat files tend to be very denormalized and very redundant.
We have released a tiny open source package codenamed Lokad.FlatFiles for C#/.NET (and a Java version too) under the MIT license. This library takes care of generating the perfect hashes out of a flat file. Our (unfair) benchmarks indicate that we typically reach about 30MB/second on a single CPU. Then, when the subsequent parsing operations take advantage of the token hashing, the speed-up is so massive that this initial perfect hashing tend to completly dominate the total CPU cost - so we stay at roughly 30MB/second.
Recommend
-
74
Version 3.4.1 released January 22, 2023 This update brings 3 bug fixes, described below. We've also updated the Summernote plugin. ❤️ Support our free and open source miss...
-
19
Dexbox A lightweight dex file parsing library Introduction ( 中文 ) Dexbox is a lightweight dex file pa...
-
13
Why 4:54 is the perfect speedrun - Super Mario Bros. World Record Speedrun Explained16,756 views•Premiered 6 hours ago ...
-
3
optimize-js Optimize a JavaScript file for faster initial execution and parsing, by wrapping all immediately-invoked functions or likely-to-be-invoked functions in parentheses.
-
7
1. Introduction ChoETL is an open source ETL (extract, transform and load) framework...
-
4
Background Mentioned in the previous post, we have
-
1
Renato Martins September 5, 2022 4 minute read...
-
2
Why a super-size MacBook Air might be the perfect laptop [The CultCast]
-
2
Parsing the .DS_Store file format About two years ago I came across a .DS_Store file and wanted to ex...
-
1
P Kaushik Srinivas June 17, 2023 2 minute rea...
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK